Applying Attention on Lagged Page Views for Time-Series Forecasting
- Attention mechanism
- Sliding window for multiple days forecasting
- New feature — Attention on Compressed Lag page views
- Deep learning
- Keras Model Subclassing
ML Problem Formulation:-
Given a time series of length n, predict the next 64 days of web page views.
We use attention on lagged page views to capture long-term seasonality when predicting the next x days.
● Dataset is taken from the Kaggle Web Traffic Time Series Forecasting competition. We have daily data from 2015-07-01 to 2017-09-10, roughly 800 days of history from which to predict 64 days, and we have about 140k such series, each split by language, access agent, etc.
● You can find the exploratory data analysis over here.
I will post some of my plots below, as well.
The approach here is minimalist, because the LSTM is potent enough to dig up and learn features on its own. Model feature list:
● pageviews (spelled as ‘hits’ in the model code, because of my web-analytics background). Raw values are transformed by log1p() to get a more-or-less normal intra-series value distribution instead of a skewed one.
● agent, country, site — these features are extracted from page URLs and one-hot encoded
● day of the week — to capture weekly seasonality
● year-to-year autocorrelation, quarter-to-quarter autocorrelation — to capture yearly and quarterly seasonality strength.
● page popularity — high-traffic and low-traffic pages have different traffic-change patterns, and this feature (median of pageviews) helps capture traffic scale. The scale information is otherwise lost, because each pageviews series is independently normalized to zero mean and unit variance.
● lagged pageviews — lagged page views of a chosen length, used to capture seasonality
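The per-series preprocessing behind these features can be sketched as follows. This is a minimal illustration, not the competition code: the function name `make_features` and the fixed 365-day lag are my assumptions.

```python
import numpy as np

def make_features(series, lag=365):
    """Illustrative per-series feature engineering (not the original code).
    `series` is a 1-D array of raw daily page views."""
    # log1p tames the heavy right tail of raw page-view counts
    hits = np.log1p(series.astype(np.float64))

    # page popularity: median of the log views, kept as a separate scalar,
    # since the series itself is normalized to zero mean / unit variance below
    popularity = np.median(hits)
    norm = (hits - hits.mean()) / (hits.std() + 1e-8)

    # year-to-year autocorrelation as a scalar seasonality-strength feature
    a, b = norm[:-lag], norm[lag:]
    autocorr = float(np.corrcoef(a, b)[0, 1]) if len(a) > 1 else 0.0
    return norm, popularity, autocorr
```

A quarter-to-quarter autocorrelation feature would be the same computation with `lag=91`.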
The first gist is an infinite data generator for feeding the model; the second gist creates sliding windows (Xi, Yi).
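Since the gists are not inlined here, a minimal sketch of both ideas combined (an infinite generator that yields randomly positioned sliding windows) might look like this; the window lengths and the name `window_generator` are illustrative assumptions:

```python
import numpy as np

def window_generator(series, x_len=283, y_len=64, batch_size=32):
    """Infinite generator of (Xi, Yi) sliding windows.
    `series` is a [n_series, n_days] array; lengths are illustrative."""
    n_series, n_days = series.shape
    max_start = n_days - x_len - y_len
    while True:  # infinite: Keras fit() can draw batches forever
        rows = np.random.randint(0, n_series, size=batch_size)
        starts = np.random.randint(0, max_start + 1, size=batch_size)
        X = np.stack([series[r, s:s + x_len]
                      for r, s in zip(rows, starts)])
        Y = np.stack([series[r, s + x_len:s + x_len + y_len]
                      for r, s in zip(rows, starts)])
        yield X[..., None], Y  # trailing feature axis for the LSTM
```

Each Yi immediately follows its Xi in time, so the model always learns to predict the 64 days after the input window.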
Modeling Seq to Seq:-
1. Encoder:- Our encoder is an LSTM unit with return_sequences=True, exposing the output at every timestep for decoding
2. Fingerprint Model:- This model takes in the lagged page views and creates a denser/more compact representation using 1D convolutions and Dense layers
3. Attention mechanism:- This model supports three types of attention scoring (Luong-style)
a. Dot:- dot product of the two vectors (encoder outputs × decoder hidden state)
b. General:- we pass the decoder hidden state through a Dense layer, then take the dot product (encoder outputs × Dense(decoder hidden state))
c. Concat:- we concatenate the decoder hidden state with each encoder output, pass the result through a tanh Dense layer, and score it with a learned vector
4. One step Decoder:- This model applies attention and produces output one step at a time
5. Decoder:- This model runs the One Step Decoder for N steps, after which we reduce along the time axis using MaxPooling and a 1D convolution to get the desired shape of [Batch, No. of days to predict]
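The dot and general score functions can be sketched framework-free in NumPy for a single example; the names `attend`, `dot_score`, `general_score`, and the matrix `Wa` are illustrative assumptions, not the model code, but each step mirrors what the corresponding Keras attention layer computes per timestep:

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def dot_score(enc_out, dec_h):
    # enc_out: [T, d] encoder outputs, dec_h: [d] decoder state -> [T] scores
    return enc_out @ dec_h

def general_score(enc_out, dec_h, Wa):
    # "general": project the decoder state with a learned matrix first
    return enc_out @ (Wa @ dec_h)

def attend(enc_out, dec_h, score_fn=dot_score, **kw):
    # attention distribution over encoder timesteps
    weights = softmax(score_fn(enc_out, dec_h, **kw))
    # context vector: weighted sum of encoder outputs
    context = weights @ enc_out
    return context, weights
```

The one-step decoder then concatenates this context vector with its input before producing the output for that step.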
Mean Absolute Error loss
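For reference, the MAE loss (equivalent to Keras's built-in 'mae') is just the mean of the absolute errors over the batch and the 64-day horizon; a NumPy sketch:

```python
import numpy as np

def mae(y_true, y_pred):
    # mean absolute error, averaged over both batch and horizon axes
    return np.abs(np.asarray(y_true) - np.asarray(y_pred)).mean()
```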
Attention+LSTM vs LSTM:-
● LSTMs struggle with long-range dependencies and hence fail to capture the seasonality and trend of such a long time series
● The attention mechanism looks at our compressed lag views and attends only to the most important ones, thereby capturing long-term dependencies
1. Here is a loss to epoch training image for LSTM architecture
2. Here is the loss to epoch for our attention-based model
While the LSTM model got stuck at a local minimum of 0.500 MAE loss, our attention-on-lagged-page-views approach surpasses that easily, reaching 0.25 MAE.
Future Work:-
- Train on all time series
- Plot loss curves using Tensorboard
- Use BERT's self-attention