Applying Attention on Lagged Page Views for Time-Series Forecasting

Abhinav Sharma
Jan 5, 2022


Concepts used:-

  1. Attention mechanism
  2. Sliding window for multiple days forecasting
  3. New feature — Attention on Compressed Lag page views
  4. Deep learning
  5. Keras Model Subclassing

ML Problem Formulation:-

Given a time series of length n, predict the next 64 days of page views.

Core Idea:-

Use attention over lagged page views to capture long-term seasonality when predicting the next x days.

Dataset Overview

● The dataset is taken from the Kaggle web traffic prediction competition. The series run from 2015-07-01 to 2017-09-10, so that's around 800 days of data from which to predict 64 days. There are about 140k such series, split by language, access agent, etc.

● You can find the exploratory data analysis over here.

I will post some of my plots below, as well.

[Figures: page views per language vs. days; mean page views; median page views; standard deviation of page views]

Feature Generation:-

I take a minimalist approach here, because an LSTM is powerful enough to discover and learn features on its own. Model feature list:

● pageviews (called 'hits' in the model code, because of my web-analytics background). Raw values are transformed with log1p() to get a more-or-less normal intra-series distribution of values instead of a skewed one.

● agent, country, site: these features are extracted from the page URLs and one-hot encoded

● day of the week — to capture weekly seasonality

● year-to-year autocorrelation, quarter-to-quarter autocorrelation — to capture yearly and quarterly seasonality strength.

● page popularity: high-traffic and low-traffic pages have different traffic change patterns, and this feature (the median of pageviews) helps to capture the traffic scale. This scale information is lost in the pageviews feature because each pageviews series is independently normalized to zero mean and unit variance.

● lagged pageviews: page views shifted back by a chosen number of days, used to capture seasonality
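The numeric features in the list above can be sketched with NumPy as follows. This is only an illustration, not the code behind the post: the function names `lag_autocorr` and `build_features`, the 91-day approximation of a quarter, computing the popularity median on the log1p values, and padding missing history with NaN are all assumptions of mine.

```python
import numpy as np

def lag_autocorr(series, lag):
    """Pearson correlation between a series and itself shifted by `lag` days.

    Returns 0.0 when there is too little history or no variance (an assumed
    convention for this sketch).
    """
    if lag <= 0 or lag >= len(series) - 1:
        return 0.0
    a, b = series[lag:], series[:-lag]
    if a.std() == 0 or b.std() == 0:
        return 0.0
    return float(np.corrcoef(a, b)[0, 1])

def build_features(raw_hits, lags=(90, 180, 270, 365)):
    """Build per-page features from a 1-D array of daily page views."""
    # log1p squashes the heavy right tail of raw page-view counts.
    hits = np.log1p(np.asarray(raw_hits, dtype=np.float64))

    # Page popularity: median taken before normalization, so the traffic
    # scale survives (computed on log1p values here; that is an assumption).
    popularity = float(np.median(hits))

    # Each series is normalized independently to zero mean, unit variance.
    std = hits.std()
    normed = (hits - hits.mean()) / (std if std > 0 else 1.0)

    # Lagged page views: row i holds the value seen lags[i] days earlier,
    # NaN where that much history does not exist.
    lagged = np.full((len(lags), len(hits)), np.nan)
    for i, lag in enumerate(lags):
        if lag < len(hits):
            lagged[i, lag:] = normed[:-lag]

    return {
        "hits": normed,
        "popularity": popularity,
        "year_corr": lag_autocorr(hits, 365),
        "quarter_corr": lag_autocorr(hits, 91),  # quarter ~ 91 days (assumed)
        "lagged": lagged,
    }

# Example on a synthetic series with a yearly cycle.
days = np.arange(800)
features = build_features(100 + 50 * np.sin(2 * np.pi * days / 365))
print(features["popularity"], features["year_corr"], features["lagged"].shape)
```

The categorical features (agent, country, site, day of the week) are left out here, since they only need one-hot encoding.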

Model Pipeline:-

Batch size of 64, 100 days of data per window; features: 5 for the encoder and 906 for the concatenation of the 90-, 180-, 270- and 365-day lags.

The first gist is an infinite data generator that feeds the model; the second gist creates the sliding windows Xi and Yi.
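For readers without the gists at hand, here is a minimal sketch of the two ideas, assuming a single 1-D series, a 100-day input window, and a 64-day target. The names `sliding_windows` and `infinite_batches`, the stride, and the shuffling scheme are illustrative choices, not taken from the original gists.

```python
import numpy as np

def sliding_windows(series, input_len=100, horizon=64, stride=1):
    """Cut one series into (Xi, Yi) pairs.

    Xi is `input_len` consecutive days, Yi is the `horizon` days right after.
    Returns X of shape [n, input_len] and Y of shape [n, horizon].
    """
    series = np.asarray(series)
    X, Y = [], []
    last_start = len(series) - input_len - horizon
    for start in range(0, last_start + 1, stride):
        split = start + input_len
        X.append(series[start:split])
        Y.append(series[split:split + horizon])
    return np.array(X), np.array(Y)

def infinite_batches(X, Y, batch_size=64, seed=0):
    """Yield shuffled (X, Y) batches forever, reshuffling after every pass."""
    if len(X) < batch_size:
        raise ValueError("need at least batch_size windows")
    rng = np.random.default_rng(seed)
    while True:
        order = rng.permutation(len(X))
        # Drop the last partial batch so every batch has the same shape.
        for i in range(0, len(X) - batch_size + 1, batch_size):
            idx = order[i:i + batch_size]
            yield X[idx], Y[idx]

# Example: 800 days of data give 800 - 100 - 64 + 1 = 637 windows.
X, Y = sliding_windows(np.arange(800.0))
x_batch, y_batch = next(infinite_batches(X, Y))
print(X.shape, Y.shape, x_batch.shape, y_batch.shape)
```

Because the generator never stops, a training loop that consumes it has to be told how many batches make up one epoch.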

Modeling Seq2Seq:-

  1. Encoder:- An LSTM layer with return_sequences=True, so that its output at every time step is available to the decoder.

  2. Fingerprint model:- This model takes in the lagged page views and creates a denser, more compact representation using 1D convolutions and Dense layers.

  3. Attention mechanism:- This model implements three types of attention scoring:

     a. Dot:- the dot product of two vectors (encoder output × decoder hidden state)

     b. General:- we pass the decoder hidden state through a Dense layer and then take the dot product (encoder output × Dense(decoder hidden state))

     c. Bahdanau (additive) attention

  4. One-step decoder:- This model applies attention and produces the output one step at a time.

  5. Decoder:- This model runs the one-step decoder for N steps, after which we reduce along the time axis using MaxPooling and a 1D convolution to get the desired shape of [batch, number of days to predict].
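The three scoring functions from step 3 can be written in a few lines of NumPy. This is a sketch of the standard formulations rather than the Keras code used in the post; the weight matrices `W`, `W1`, `W2` and the vector `v` stand in for what would be trainable Dense layers (without bias, which is a simplification), and all function names are mine.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def dot_score(enc_out, dec_hidden):
    """Dot: enc_out [batch, time, units] x dec_hidden [batch, units] -> [batch, time]."""
    return np.einsum("btu,bu->bt", enc_out, dec_hidden)

def general_score(enc_out, dec_hidden, W):
    """General: project the decoder state with W [units, units], then dot."""
    return np.einsum("btu,bu->bt", enc_out, dec_hidden @ W)

def bahdanau_score(enc_out, dec_hidden, W1, W2, v):
    """Bahdanau (additive): v . tanh(enc_out W1 + dec_hidden W2).

    W1, W2: [units, attn_units]; v: [attn_units].
    """
    projected = enc_out @ W1 + (dec_hidden @ W2)[:, None, :]
    return np.tanh(projected) @ v

def attend(enc_out, scores):
    """Turn scores into weights and return the weighted sum of encoder steps."""
    weights = softmax(scores, axis=-1)                   # [batch, time]
    context = np.einsum("bt,btu->bu", weights, enc_out)  # [batch, units]
    return context, weights

# Example with random tensors: batch of 2, 5 time steps, 4 units.
rng = np.random.default_rng(0)
enc_out = rng.normal(size=(2, 5, 4))
dec_hidden = rng.normal(size=(2, 4))
context, weights = attend(enc_out, dot_score(enc_out, dec_hidden))
print(context.shape, weights.shape)
```

In the one-step decoder, the context vector would be combined with the decoder input at each step; the same functions apply if the attended sequence is the compressed lag representation instead of the encoder output.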

Entire Model:-

Loss Function:-

Mean absolute error (MAE) loss
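As a quick reference, MAE is the average absolute difference between the predicted and true values over all series and forecast days. A minimal NumPy version, with `mae` as an illustrative helper name:

```python
import numpy as np

def mae(y_true, y_pred):
    """Mean absolute error over every element of the two arrays."""
    y_true = np.asarray(y_true, dtype=np.float64)
    y_pred = np.asarray(y_pred, dtype=np.float64)
    return float(np.mean(np.abs(y_true - y_pred)))

# Two series, two forecast days each: errors are 0, 1, 1, 0.
print(mae([[1, 2], [3, 4]], [[1, 3], [2, 4]]))  # 0.5
```

The value is measured in whatever scale the targets are in, so it is only comparable between models trained on the same transformed page views.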


Attention+LSTM vs LSTM:-

● LSTMs struggle with long-range dependencies and hence fail to capture the seasonality and trend of these long time series.

● The attention mechanism looks into our compressed lagged page views and attends only to the most important ones, hence capturing long-term dependencies.

[Figure: attention weights vs. days]


1. Here is the loss-vs-epoch training plot for the LSTM architecture:

[Figure: LSTM model, loss vs. epochs]

2. Here is the loss-vs-epoch plot for our attention-based model:

[Figure: attention model, loss vs. epochs]


While the LSTM model got stuck at a local minimum of 0.500 MAE loss, our attention-on-lagged-page-views approach easily improves on that, reaching 0.25 MAE.


Future Work:-

  1. Train on all time series
  2. Plot loss curves using TensorBoard
  3. Use BERT's self-attention