Time Series as Supervised Learning

Time series forecasting can be framed as a supervised learning problem. This re-framing of your time series data allows you access to the suite of standard linear and nonlinear machine learning algorithms on your problem.

Sliding Windows

Figure: sliding windows technique for time series machine learning

Time series data can be phrased as supervised learning. Given a sequence of numbers for a time series dataset, we can restructure the data to look like a supervised learning problem. We do this by using previous time steps as input variables and the next time step as the output variable. Consider the following small series:

time, measure
1, 10
2, 20
3, 30
4, 40
5, 50

We can restructure this time series dataset as a supervised learning problem by using the value at the previous time step to predict the value at the next time-step.

X, y
?, 10
10, 20
20, 30
30, 40
40, 50
50, ?
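The restructuring above can be sketched in a few lines of pandas. This is only a minimal illustration on the five toy measurements; the variable names (measure, supervised, train) are made up for the example.

```python
from pandas import DataFrame, Series

# toy series from the example above
measure = Series([10, 20, 30, 40, 50])

# X is the value at the previous time step, y is the value at the current one
supervised = DataFrame({'X': measure.shift(1), 'y': measure})

# the first row has no previous value (NaN), so it cannot be used for training
train = supervised.dropna()
print(train)
```

The last row of the table (50, ?) is not produced here: it is the input you would pass to a fitted model to forecast the next, still unknown, value.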


Univariate Time Series vs. Multivariate Time Series

Univariate Time Series: These are datasets where only a single variable is observed at each time, such as temperature each hour. The example in the previous section is a univariate time series dataset.

Multivariate Time Series: These are datasets where two or more variables are observed at each time.

Most time series analysis methods, and even books on the topic, focus on univariate data. This is because it is the simplest to understand and work with. Multivariate data is often more difficult to work with.


Time series in practice

Let’s take the following data sample: the Minimum Daily Temperatures dataset. This dataset describes the minimum daily temperatures over 10 years (1981-1990) in the city of Melbourne, Australia.

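# create date time features of a dataset
from pandas import read_csv
from pandas import DataFrame

# squeeze('columns') turns the single-column DataFrame into a Series
# (the squeeze=True argument of read_csv was removed in pandas 2.0)
series = read_csv('../Datasets/daily-min-temperatures.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

dataframe = DataFrame()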

Please note the index_col=0 in the read_csv.
This parameter turns our first column, Date, into an index. We will use this for further feature engineering.

1981-01-01 20.7
1981-01-02 17.9
1981-01-03 18.8
1981-01-04 14.6

Feature engineering

We need to do some simple feature engineering to get the day and month from the index itself:

# Get the month out of the index
# Get the day out of the index
# Applied on the whole dataset
dataframe['month'] = [series.index[i].month for i in range(len(series))]
dataframe['day'] = [series.index[i].day for i in range(len(series))]
# iloc selects by position; plain series[i] is ambiguous on a date index
dataframe['temperature'] = [series.iloc[i] for i in range(len(series))]

Once we have the days and months, we can observe that we still don’t have many features that describe our data well. The month and day alone will not give us much information to predict temperature and will most likely result in a poor model.

That is why we need to think about extracting more information from the features we have available at the moment, in this example the “Date”. You may enumerate all the properties of a time-stamp and consider what might be useful for your problem, such as:

  • Minutes elapsed for the day.
  • Hour of day.
  • Business hours or not.
  • Weekend or not.
  • Season of the year.
  • The business quarter of the year.
  • Daylight savings or not.
  • Public holiday or not.
  • Leap year or not.
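A few of the properties listed above can be read directly from a pandas date index. The sketch below is illustrative only: it uses a short hypothetical daily index rather than the temperature file, and the column names are made up for the example.

```python
from pandas import DataFrame, date_range

# hypothetical daily index standing in for the temperature data
index = date_range('1981-01-01', periods=5, freq='D')

features = DataFrame(index=index)
features['month'] = index.month
features['day'] = index.day
# business quarter of the year (1 to 4)
features['quarter'] = index.quarter
# dayofweek counts Monday as 0 and Sunday as 6
features['is_weekend'] = index.dayofweek >= 5
features['is_leap_year'] = index.is_leap_year
print(features)
```

Features such as public holidays or business hours are not available from the index alone and would need an external calendar.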

How to transform a time series problem into a supervised learning problem

Lag features are the classical way that time series forecasting problems are transformed into supervised learning problems. The simplest approach is to predict the value at the next time (t+1) given the value at the current time (t). The supervised learning problem with shifted values looks as follows:

Value(t), Value(t+1)

The Pandas library provides the shift() function to help create these shifted or lag features from a time series dataset. Shifting the dataset by 1 creates the t column, adding a NaN value for the first row. The time series dataset without a shift represents the t+1 column.

from pandas import concat

temps = DataFrame(series.values)
dataframe = concat([temps.shift(1), temps], axis=1)
dataframe.columns = ['t', 't+1']

Printing the first five rows of the data frame now shows the shifted column (t) next to the original column (t+1):

t t+1
0 NaN 20.7
1 20.7 17.9
2 17.9 18.8
3 18.8 14.6
4 14.6 15.8

The first row contains NaN because it was shifted, which is why we have to discard it. Please note that if you shift the data N times, for example to use the N previous time steps as inputs, you create N rows with NaN values that you have to discard after performing the sliding window operation. For example:

dataframe = concat([temps.shift(3), temps.shift(2), temps.shift(1), temps], axis=1)
dataframe.columns = ['t-2', 't-1', 't', 't+1']

t-2 t-1 t t+1
0 NaN NaN NaN 20.7
1 NaN NaN 20.7 17.9
2 NaN 20.7 17.9 18.8
3 20.7 17.9 18.8 14.6
4 17.9 18.8 14.6 15.8

Looking at this example, we can conclude that we can’t expect usable data until the 4th row (index 3).
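The same sliding window can be built for any number of lags and the NaN rows dropped in one go. The helper below, make_lag_frame, is a hypothetical name used only for this sketch; it is run on the first five temperature values shown earlier.

```python
from pandas import DataFrame, Series, concat

def make_lag_frame(values, n_lags):
    """Hypothetical helper: n_lags input columns plus the target column t+1."""
    temps = DataFrame(Series(values).values)
    # shift(n_lags) is the oldest input, shift(1) is the most recent (column 't')
    columns = [temps.shift(i) for i in range(n_lags, 0, -1)] + [temps]
    frame = concat(columns, axis=1)
    frame.columns = ['t-%d' % (i - 1) for i in range(n_lags, 1, -1)] + ['t', 't+1']
    # drop the first n_lags rows, which contain NaN values
    return frame.dropna().reset_index(drop=True)

frame = make_lag_frame([20.7, 17.9, 18.8, 14.6, 15.8], n_lags=3)
print(frame)
```

With three lags and five observations, only two complete rows remain, which matches rows 3 and 4 of the table above.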

Removing noise and improving the signal in time series

Let’s take as an example a very widely used dataset, the Airline Passengers dataset:

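from pandas import read_csv
from matplotlib import pyplot
# squeeze('columns') replaces the squeeze=True argument removed in pandas 2.0
series = read_csv('../Datasets/airline-passengers.csv', header=0, index_col=0, parse_dates=True).squeeze('columns')

# let's do some plotting:
pyplot.figure(1)
# line plot
pyplot.subplot(211)
pyplot.plot(series)
# histogram
pyplot.subplot(212)
pyplot.hist(series)
pyplot.show()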

Figure: airline time series density plot

From the graph above, we can conclude that the data set is not stationary.
What does that mean? It means that the variance and the mean of the observations are changing over time.

This can happen in business problems where there is an increasing trend and/or seasonality.

What does it mean for us? Non-stationary data makes time series problems harder to solve, because it is difficult to fit a proper statistical model that gives reliable forecasts. That is why we need to perform certain transformations on the data.
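A quick, informal way to see a changing mean is to split the series in two and compare the summary statistics of each half. The sketch below uses a synthetic trending series as a stand-in for real data, and split_stats is a hypothetical helper name.

```python
import numpy as np

def split_stats(values):
    """Hypothetical helper: mean and variance of the first and second half."""
    values = np.asarray(values, dtype=float)
    half = len(values) // 2
    first, second = values[:half], values[half:]
    return (first.mean(), first.var()), (second.mean(), second.var())

# synthetic stand-in for a trending series; this is not the airline data
rng = np.random.default_rng(0)
trending = np.arange(100) * 0.5 + rng.normal(size=100)

(mean1, var1), (mean2, var2) = split_stats(trending)
print('mean: %.2f vs %.2f' % (mean1, mean2))
print('variance: %.2f vs %.2f' % (var1, var2))
```

A large gap between the two means, as here, is a hint that the series is not stationary. It is not a formal test.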

Square root transformation

The square root, x to x^(1/2) = sqrt(x), is a transformation with a moderate effect on distribution shape: it is weaker than the logarithm and the cube root. It is also used for reducing right skewness, and also has the advantage that it can be applied to zero values. Note that the square root of an area has the units of a length. It is commonly applied to counted data, especially if the values are mostly rather small.

A time series that has a quadratic growth trend, like the example above, can be made linear by taking the square root.
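To see why this works, here is a small sketch on a synthetic quadratic series (not the airline data). After taking the square root, the step between consecutive values becomes constant, which is what a linear trend looks like.

```python
import numpy as np

# synthetic quadratic trend: 1, 4, 9, 16, ...
time = np.arange(1, 11)
quadratic = time ** 2

transformed = np.sqrt(quadratic)

# differences between consecutive steps: growing before, constant after
print(np.diff(quadratic))
print(np.diff(transformed))
```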

What we need to do is apply the square root transformation to our Airline dataset to change the growth trend from quadratic to linear and to bring the distribution of observations closer to Gaussian.

First, let’s import two more libraries:

from pandas import DataFrame
from numpy import sqrt

Perform the square root transformation:

# Visualize after the transformations
# Let's create a function to keep our plotting code short
def visualize(column):
    pyplot.figure(1)
    # line plot
    pyplot.subplot(211)
    pyplot.plot(column)
    # histogram
    pyplot.subplot(212)
    pyplot.hist(column)
    pyplot.show()

dataframe = DataFrame(series.values)

# It is very important to give a name to our column,
# so later we can call the column by its name
dataframe.columns = ['passengers']

# Perform the transformation
dataframe['passengers'] = sqrt(dataframe['passengers'])
visualize(dataframe['passengers'])
Figure: square root transformation of the airline time series

Looking at the plot above, we can see that the trend was reduced, but not removed. The line plot still shows an increasing variance from cycle to cycle. The histogram still shows a long tail to the right of the distribution, suggesting an exponential or long-tail distribution. That is why we need to look for another type of transformation.


Log Transformation on time series

A class of more extreme trends is exponential. Time series with an exponential trend can be made linear by taking the logarithm of the values.

The log transformation is, arguably, the most popular among the different types of transformations used to transform skewed data to approximately conform to normality. Maybe that is why I applied this transformation first back in university times. If the original data follows a log-normal distribution or approximately so, then the log-transformed data follows a normal or near normal distribution.

To perform a log transformation in our Python script, first, we need to import:

from numpy import log

Then we perform the Log transformations on the time series dataset:

dataframe['passengers'] = log(dataframe['passengers'])
Figure: log transformation of the airline time series

Running the example results in a trend that does look a lot more linear than the square root transform above. The line plot shows a seemingly linear growth and variance. The histogram also shows a more uniform or Gaussian-like distribution of observations.

Log transforms are popular with time series data as they are effective at removing exponential variance. It is important to note that this operation assumes values are positive and non-zero.
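The effect is easy to check on a synthetic exponential series (used here instead of the airline data). After the log transform the steps are constant, and np.exp maps values back to the original scale, which is what you would do with forecasts made on the log scale.

```python
import numpy as np

# synthetic exponential trend; values are strictly positive
time = np.arange(10)
exponential = 100.0 * np.exp(0.3 * time)

logged = np.log(exponential)
# constant steps confirm the trend is now linear
steps = np.diff(logged)
print(steps)

# values on the log scale are mapped back with exp
restored = np.exp(logged)
```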

Box-Cox Transformation on time series

The Box-Cox transformation is a family of power transformations indexed by a parameter lambda. Whenever you use it the parameter needs to be estimated from the data.

Some common values for lambda:

  • lambda = -1.0 is a reciprocal transform.
  • lambda = -0.5 is a reciprocal square root transform.
  • lambda = 0.0 is a log transform.
  • lambda = 0.5 is a square root transform.
  • lambda = 1.0 is no transform.
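The values listed above all come from one formula: log(x) when lambda is 0, and (x^lambda - 1) / lambda otherwise. The sketch below writes that formula by hand (boxcox_manual is a hypothetical name) and checks it against scipy.stats.boxcox on a few made-up positive values.

```python
import numpy as np
from scipy.stats import boxcox

def boxcox_manual(x, lmbda):
    """Box-Cox formula: log(x) if lambda is 0, else (x**lambda - 1) / lambda."""
    x = np.asarray(x, dtype=float)
    if lmbda == 0:
        return np.log(x)
    return (x ** lmbda - 1.0) / lmbda

# made-up positive values, for illustration only
x = np.array([1.0, 4.0, 9.0, 16.0])
for lmbda in (-1.0, -0.5, 0.0, 0.5, 1.0):
    assert np.allclose(boxcox_manual(x, lmbda), boxcox(x, lmbda=lmbda))

print(boxcox_manual(x, 0.5))
```

Note that the result is a shifted and rescaled version of the named transform: with lambda = 0.5 you get 2 * (sqrt(x) - 1), and with lambda = 1.0 you get x - 1, which leaves the shape of the series unchanged.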

Implementing the Box-Cox transformation on a time series in Python:

First, import the boxcox function:

from scipy.stats import boxcox

Perform the Box-Cox transformation on our time series:

dataframe['passengers'] = boxcox(dataframe['passengers'], lmbda=0.0)
Figure: Box-Cox transformation of the airline time series

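We can leave lambda as None (the default) and let the function find the value of lambda that maximizes the log-likelihood, that is, the value that makes the transformed data look as close to normal as possible.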

dataframe['passengers'], lam = boxcox(dataframe['passengers'])
print('Lambda: %f' % lam)
Lambda: 0.148023

Figure: Box-Cox transformation of the airline time series with lambda estimated automatically

Using this approach, the distribution looks more normal because lambda is estimated from the data rather than fixed in advance.
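If you model the transformed series, you will usually want to map the results back to the original scale. One way to do that, sketched here on synthetic positive data rather than the passenger counts, is scipy.special.inv_boxcox, using the lambda that boxcox returned.

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

# synthetic positive data standing in for the passenger counts
rng = np.random.default_rng(1)
values = np.exp(rng.normal(loc=5.0, scale=0.5, size=200))

# with lmbda left unset, boxcox returns the transformed data and the fitted lambda
transformed, lam = boxcox(values)
print('Lambda: %f' % lam)

# inv_boxcox maps values on the transformed scale back to the original scale
restored = inv_boxcox(transformed, lam)
```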

Even after applying the Box-Cox transformation, the series may still not appear to be stationary. In that case, diagnostics from ARIMA modeling can be used to decide whether differencing or seasonal differencing might be useful to remove polynomial trends or seasonal trends, respectively. After that, the result might be an ARMA model that is stationary. If diagnostics confirm the orders p and q for the ARMA model, the AR and MA parameters can then be estimated.

Regarding other possible uses of Box-Cox: in the case of a series of iid random variables that do not appear to be normally distributed, there may be a particular value of lambda that makes the data look approximately normal.


Moving Average Smoothing

Smoothing is a technique applied to time series to remove the fine-grained variation between time steps. The hope of smoothing is to remove noise and better expose the signal of the underlying causal processes. Moving averages are a simple and common type of smoothing used in time series analysis and time series forecasting. Calculating a moving average involves creating a new series where the values are comprised of the average of raw observations in the original time series.

A moving average requires that you specify a window size called the window width. This defines the number of raw observations used to calculate the moving average value. The moving part in the moving average refers to the fact that the window defined by the window width is slid along the time series to calculate the average values in the new series. There are two main types of moving average that are used: the Centered and the Trailing Moving Average.

Calculating a moving average of a time series makes some assumptions about your data. It is assumed that both trend and seasonal components have been removed from your time series. This means that your time series is stationary, or does not show obvious trends (long-term increasing or decreasing movement) or seasonality (consistent periodic structure).

A moving average can be used as a data preparation technique to create a smoothed version of the original dataset. Smoothing is useful as a data preparation technique as it can reduce the random variation in the observations and better expose the structure of the underlying causal process.
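The difference between the two types is easiest to see on a tiny made-up series. In pandas, the rolling() function gives a trailing window by default and a centered one with center=True; the window of 3 is an arbitrary choice for this sketch.

```python
from pandas import Series

# made-up values, for illustration only
values = Series([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0])

# trailing: average of the current and the two previous observations
trailing = values.rolling(window=3).mean()

# centered: average of the previous, current and next observation
centered = values.rolling(window=3, center=True).mean()

print(trailing.tolist())
print(centered.tolist())
```

The centered version needs future observations, so it suits analysis of historical data. The trailing version only looks backwards, which makes it usable for forecasting.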

Calculating a moving average over 10 periods in Python:

# Moving average
# trailing rolling-average transform with a window of 10 periods
rolling = series.rolling(window=10)
rolling_mean = rolling.mean()
# plot original and transformed dataset
series.plot()
rolling_mean.plot(color='red')
pyplot.show()
Figure: moving average smoothing

The moving average is one of the most common sources of new information when modeling a time series forecast. In this case, the moving average is calculated and added as a new input feature used to predict the next time step.
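When the moving average is used as an input feature, the window must only contain observations from before the value being predicted. One way to sketch this, assuming a window of 3 and the first five temperature values shown earlier, is to shift the series by one step before applying the rolling mean.

```python
from pandas import Series, concat

values = Series([20.7, 17.9, 18.8, 14.6, 15.8])

# shift by one step first so the window only contains past observations
lagged = values.shift(1)
# mean of the three observations before the value to predict
window_mean = lagged.rolling(window=3).mean()

frame = concat([window_mean, lagged, values], axis=1)
frame.columns = ['mean(t-2,t-1,t)', 't', 't+1']
print(frame.dropna())
```

The column names are only labels chosen for this example.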

White Noise

If a time series is white noise, it is a sequence of random numbers and cannot be predicted. If the series of forecast errors is not white noise, it suggests improvements could be made to the predictive model.

A time series is white noise if the variables are independent and identically distributed with a mean of zero.

White noise is an important concept in time series analysis and forecasting. It is important for two main reasons:

  • Time series Predictability: If your time series is white noise, then, by definition, it is random. You cannot model and predict random occurrence.
  • Model evaluation: The series of errors from a time series forecast model should ideally be white noise. This means the errors are random.

Your time series is not white noise if:

  • Your series has a non-zero mean.
  • The variance changes over time.
  • Values correlate with their lagged values.

In order to check if your time series is white noise, it is a good idea to do some visualization and calculate some statistics during the data inspection process.

  • Create a line plot and histogram, as we did before. Check for gross features like a changing mean, variance, or obvious relationship between lagged variables.
    Figure: histogram from a white noise time series
  • Calculate summary statistics. Check the mean and variance of the whole series against the mean and variance of meaningful contiguous blocks of values in the series (e.g. days, months, or years).
  • Create an autocorrelation plot. Check for gross correlation between lagged variables.

First import the autocorrelation plotting function:

from pandas.plotting import autocorrelation_plot

Then plot:

# autocorrelation
autocorrelation_plot(series)
pyplot.show()

You can easily spot the difference between the autocorrelation of the Airline passengers dataset on the right and that of a white noise time series on the left.
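The numeric checks described in this section can be sketched on a synthetic Gaussian white noise series. The series length, block size and random seed below are arbitrary choices for illustration.

```python
import numpy as np
from pandas import Series

# synthetic Gaussian white noise: mean 0, standard deviation 1
rng = np.random.default_rng(42)
noise = Series(rng.normal(loc=0.0, scale=1.0, size=1000))

# summary statistics of the whole series
print('mean: %.3f, variance: %.3f' % (noise.mean(), noise.var()))

# mean and variance of contiguous blocks should stay close to the overall values
for start in range(0, 1000, 250):
    block = noise.iloc[start:start + 250]
    print(start, round(block.mean(), 3), round(block.var(), 3))

# lag-1 autocorrelation should be close to zero for white noise
lag1 = noise.autocorr(lag=1)
print('lag-1 autocorrelation: %.3f' % lag1)
```

Running the same checks on your own series and seeing a drifting block mean or a large lag-1 autocorrelation suggests the series is not white noise.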

Random Walk and time series predictability

There is a tool called a random walk that can help you understand the predictability of your time series forecast problem.

A random walk is different from a list of random numbers because the next value in the sequence is a modification of the previous value in the sequence.

This dependence provides some consistency from step-to-step rather than the large jumps that a series of independent, random numbers provide.
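A simple random walk can be sketched as a running sum of random steps. The step values of -1 and +1, the length and the seed are assumptions made for this illustration.

```python
import numpy as np

rng = np.random.default_rng(7)

# each step is -1 or +1, chosen at random
steps = rng.choice([-1, 1], size=1000)

# the walk is the running sum of the steps: y(t) = y(t-1) + step(t)
walk = np.cumsum(steps)

# differencing the walk recovers the random steps
recovered = np.diff(walk)
print(walk[:10])
```

Because consecutive values differ by only one step, the walk moves smoothly, while the steps themselves are unpredictable. This is why the best naive forecast for a random walk is simply the previous value.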

We can check whether our time series behaves like a random walk (non-stationary) rather than like white noise (stationary) with a statistical test called the Augmented Dickey-Fuller (ADF) test.

To the code from our Airline time series, add an import of the adfuller function:

from statsmodels.tsa.stattools import adfuller

Then perform the statistical test:

# statistical test
result = adfuller(series)
print('ADF Statistic: %f' % result[0])
print('p-value: %f' % result[1])
print('Critical Values:')
for key, value in result[4].items():
    print('\t%s: %.3f' % (key, value))

As a result, you will get something like:

ADF Statistic: 0.815369
p-value: 0.991880
Critical Values:
        1%: -3.482
        5%: -2.884
        10%: -2.579

The null hypothesis of the test is that the time series is non-stationary (it has a unit root). Running the test, we can see that the ADF statistic is 0.815369. This is larger than all of the critical values at the 1%, 5%, and 10% significance levels, and the p-value of 0.99 is far above 0.05. Therefore, we fail to reject the null hypothesis, and the time series does appear to be non-stationary.
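The decision rule can be written down as a small helper. This is a sketch only: interpret_adf is a hypothetical function, the 0.05 threshold is a common convention rather than something the test prescribes, and the numbers are the ones printed above.

```python
def interpret_adf(statistic, p_value, critical_values, alpha=0.05):
    """Hypothetical helper: summarize an ADF result.

    Returns (looks_stationary, rejected_at), where rejected_at lists the
    levels whose critical value the statistic falls below.
    """
    # the null hypothesis (non-stationary) is rejected at a level only when
    # the statistic is more negative than that level's critical value
    rejected_at = [level for level, value in critical_values.items()
                   if statistic < value]
    looks_stationary = p_value < alpha
    return looks_stationary, rejected_at

# values reported above for the airline series
critical = {'1%': -3.482, '5%': -2.884, '10%': -2.579}
looks_stationary, rejected_at = interpret_adf(0.815369, 0.991880, critical)
print(looks_stationary, rejected_at)
```

For the airline series nothing is rejected, which agrees with the conclusion that the series is non-stationary.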
