Exploring Time Series Data

Table of contents
  1. What is Time Series Data?
    1. Univariate vs Multivariate
    2. Obtaining prepared data
  2. Collecting Time Series Data from Database
  3. Characteristics of Time Series Data
  4. Basic Exploratory Data Analysis on Time Series
    1. Line plot
    2. Difference histogram
    3. Scatter plot of features
    4. Rolling windows

What is Time Series Data?

Time series data is a sequence of observations recorded in time order.

It does not matter what the unit of time is. What matters is:

  • Unique and meaningful ordering
  • Intervals at which the observations were made

Univariate vs Multivariate

  • Univariate: If a single variable is measured against time
  • Multivariate: If multiple variables are measured against time. Useful for analysis of interrelations.
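As a minimal sketch of the distinction, assuming pandas is used (the variable names and the weather values below are made up for illustration), a univariate series can be held in a Series and a multivariate one in a DataFrame that shares the same time index. The last lines check the two properties listed earlier: ordering and intervals.

```python
import pandas as pd

# Hypothetical daily timestamps; any unit of time would work the same way.
index = pd.date_range("2024-01-01", periods=5, freq="D")

# Univariate: a single variable measured against time.
temperature = pd.Series([3.1, 2.8, 4.0, 5.2, 4.7], index=index, name="temperature")

# Multivariate: several variables sharing the same time index.
weather = pd.DataFrame(
    {"temperature": temperature, "humidity": [81, 85, 78, 70, 74]},
    index=index,
)

# Property 1: a unique and meaningful ordering.
is_ordered = index.is_monotonic_increasing and index.is_unique

# Property 2: the intervals at which the observations were made.
intervals = index.to_series().diff().dropna()
```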

Obtaining prepared data

The easiest way to begin an analysis is to obtain prepared data:

  • Competition data (e.g. Kaggle)
  • Repositories of research labs (e.g. UCI Machine Learning Repository)
  • Data from government agencies

Government data is generally not well suited to beginners who want to learn time series analysis. It is complex enough that even experts can devote entire careers to analyzing it. When starting out, it is better to use it only for exploratory analysis or visualization.


Collecting Time Series Data from Database

It is common to collect time series data from sources not originally intended for time series.

The most common example is extracting data for analysis from a database.

This process is often called data wrangling or data munging.

The following are some of the DOs and DON’Ts:

  • Integrating data from various tables to create a single time series brings about interesting analysis topics, but caution is needed due to:
    • Disparate timestamp conventions
    • Different levels of granularity
  • Clarify whether each column of the data is really what you think it is
    • For example, does the column status refer to the current status or the status at the time of the observation?
    • If a column says time, does it refer to the created time or the updated time?
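  • Avoid lookahead: do not let information that only became available after an observation's timestamp leak into that observation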
  • Understand recording conventions
  • Understand whether null values are included in the records
    • If the DB does not record zero values, you may have to fill in the missing values with zeros.
  • Decide if you want to omit the start and end of the records
    • Often, the start and end of the records are anomalous and only add noise to the analysis.
  • Beware of time zone differences
    • Although most databases store timestamps in UTC, it is not always the case.
  • Know whether the data was human-generated or machine-generated
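Two of the points above, time zones and unrecorded zero values, can be sketched with pandas as follows. Everything here is an assumption made for illustration: the column names recorded_at and events, the hourly granularity, and the "Asia/Seoul" local time zone are hypothetical and would have to be checked against the real database.

```python
import pandas as pd

# Hypothetical extract: event counts per hour, stamped in local time,
# where hours with zero events were never written to the table.
raw = pd.DataFrame(
    {
        "recorded_at": ["2024-03-01 09:00", "2024-03-01 10:00", "2024-03-01 13:00"],
        "events": [4, 2, 7],
    }
)

# Make the time zone explicit, then convert to UTC so that tables with
# different timestamp conventions can be combined safely.
# "Asia/Seoul" is an assumption about how this table was recorded.
stamps = pd.to_datetime(raw["recorded_at"]).dt.tz_localize("Asia/Seoul").dt.tz_convert("UTC")
series = pd.Series(raw["events"].to_numpy(), index=pd.DatetimeIndex(stamps), name="events")

# Rebuild the full hourly grid and fill the unrecorded hours with zeros.
full_index = pd.date_range(series.index.min(), series.index.max(), freq=pd.Timedelta(hours=1))
filled = series.reindex(full_index, fill_value=0)
```

Filling with zeros is only appropriate when a missing row really means "nothing happened". If it can also mean "the recorder was down", the gap should stay missing.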

Characteristics of Time Series Data


Basic Exploratory Data Analysis on Time Series

This section covers various exploratory methods that can be applied to time series data.

Line plot

A line plot is the most obvious place to start.

Because the data is ordered in time, simply drawing it as a line plot already reveals a lot.

Stacking multiple time series on top of each other can reveal interesting patterns.
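A minimal sketch of stacked line plots, assuming Matplotlib and pandas are available. The two series are synthetic random walks generated only for illustration, and the names series_a and series_b are hypothetical.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch also runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data: two random walks on a shared daily index.
rng = np.random.default_rng(0)
index = pd.date_range("2024-01-01", periods=200, freq="D")
df = pd.DataFrame(
    {
        "series_a": rng.normal(size=200).cumsum(),
        "series_b": rng.normal(size=200).cumsum(),
    },
    index=index,
)

# One panel per series, stacked vertically with a shared time axis,
# so that features occurring at the same time line up visually.
fig, axes = plt.subplots(nrows=len(df.columns), sharex=True, figsize=(8, 4))
for ax, column in zip(axes, df.columns):
    ax.plot(df.index, df[column])
    ax.set_ylabel(column)
axes[-1].set_xlabel("time")
fig.tight_layout()
```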

Difference histogram

When plotting a value-frequency histogram for a time series, it is often more useful to plot the histogram of the differences between adjacent values, because each value by itself may not be very informative.

It is also important to choose the right bin size.

Otherwise, you will just get a bunch of meaningless spikes that mask the underlying distribution.

The difference histogram can also reveal long-term bias in the data: if the differences are centered above or below zero, the values tend to increase or decrease in the long run.
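The difference histogram can be sketched as follows, assuming pandas and NumPy. The series is a synthetic random walk with an upward drift, generated only for illustration, and the choice of 30 bins is arbitrary and should be tuned for real data.

```python
import numpy as np
import pandas as pd

# Synthetic series: a random walk whose steps have a positive mean (drift).
rng = np.random.default_rng(42)
steps = rng.normal(loc=0.5, scale=1.0, size=1000)
values = pd.Series(
    steps.cumsum(),
    index=pd.date_range("2024-01-01", periods=1000, freq="D"),
)

# Differences between adjacent values; the first one is undefined and dropped.
diffs = values.diff().dropna()

# Histogram of the differences. The number of bins controls the bin size.
counts, bin_edges = np.histogram(diffs, bins=30)

# A mean difference away from zero indicates a long-term bias:
# positive means the values tend to increase, negative means they tend to decrease.
mean_change = diffs.mean()
```

The counts and bin_edges arrays can be passed to any plotting library. Rerunning with a different bins value is a quick way to see how the bin size changes the picture.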

Scatter plot of features

If we have multiple time series, we can match the data points by timestamp and plot them against each other using a scatter plot.

This can reveal correlations between different features or indexes.

However, just as with the histogram, the individual values of a time series are not very informative.

So it is often better to plot using the difference between adjacent values.

Instead of matching the data points by timestamp, you can do another correlation analysis by applying a time lag to one of the time series.

Doing so can reveal whether one feature can be a predictor of another.
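A minimal sketch of the lagged comparison, assuming pandas and NumPy. The data is synthetic: the changes in a hypothetical follower series are constructed to copy the changes in a hypothetical leader series from 3 steps earlier, so the expected answer is known in advance.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 500
index = pd.date_range("2024-01-01", periods=n, freq="D")

# Synthetic pair: follower's steps repeat leader's steps 3 days later, plus noise.
leader_steps = pd.Series(rng.normal(size=n))
follower_steps = leader_steps.shift(3).fillna(0.0) + 0.1 * rng.normal(size=n)
leader = pd.Series(leader_steps.cumsum().to_numpy(), index=index)
follower = pd.Series(follower_steps.cumsum().to_numpy(), index=index)

# Work with differences between adjacent values rather than raw values.
d_leader = leader.diff()
d_follower = follower.diff()

# Correlate follower's change at time t with leader's change at time t - lag.
# Series.corr ignores the NaN pairs introduced by diff() and shift().
lagged_corr = {lag: d_leader.shift(lag).corr(d_follower) for lag in range(6)}
best_lag = max(lagged_corr, key=lambda lag: abs(lagged_corr[lag]))
```

For a scatter plot, d_leader.shift(best_lag) can be plotted against d_follower. A strong correlation at a positive lag suggests, but does not prove, that one series is a useful predictor of the other.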

Rolling windows

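The rolling window method computes a statistic, such as the mean or the standard deviation, over a window that slides along the series. It can be used to identify important characteristics such as stationarity: if the rolling statistics stay roughly constant over time, the series is likely to be stationary.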
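As an informal sketch, assuming pandas and NumPy, the rolling mean and standard deviation of two synthetic series can be compared: white noise, which is stationary, and its cumulative sum, a random walk, which is not. The window length of 50 is an arbitrary choice for illustration. This is a visual or numeric sanity check, not a formal statistical test.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(7)
index = pd.date_range("2024-01-01", periods=400, freq="D")
noise = pd.Series(rng.normal(size=400), index=index)  # stationary by construction
walk = noise.cumsum()                                 # random walk, not stationary

window = 50  # arbitrary window length for this sketch
summary = pd.DataFrame(
    {
        "noise_mean": noise.rolling(window).mean(),
        "noise_std": noise.rolling(window).std(),
        "walk_mean": walk.rolling(window).mean(),
        "walk_std": walk.rolling(window).std(),
    }
).dropna()  # the first window - 1 rows have no complete window

# How far the rolling mean wanders over time. A small spread is consistent
# with stationarity; a large spread indicates a level that drifts over time.
mean_spread = {
    "noise": summary["noise_mean"].max() - summary["noise_mean"].min(),
    "walk": summary["walk_mean"].max() - summary["walk_mean"].min(),
}
```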