Relevance of fundamentals in the age of GPTs (LLMs)

Rahul Kharat
3 min read · Apr 21, 2023

There have been a lot of papers and blogs about LLMs since last November. I don’t intend to write anything specific about GPT here, but rather about its evolution and how the fundamentals can help us decode these mammoth models.

Disclaimer: Don’t get into too much technical nitty-gritty while reading this. The intention is to understand the evolution in layman’s terms, which is how I personally prefer to make sense of these difficult technologies.

Time series models were among the most widely used models in the pre-ML era, and they remain popular today, be it for forecasting sales or predicting stock prices. Time series has a concept of “lag,” which helps us decide how far into the past, or how many steps backward, we need to look to forecast the future. Lag covers the immediate past, whereas for far-away past events that can affect the forecast, we include “seasonality.” And then there is a concept called “exogeneity,” wherein we add the influence of external factors from outside the time series that were not part of the time series data itself.
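To make these three ideas concrete, here is a minimal sketch of where each one lives in a SARIMAX model. It assumes statsmodels is installed, and the sales and promotion data are synthetic, made up purely for illustration.

```python
# A minimal sketch: lag, seasonality, and exogeneity in SARIMAX.
# Assumes statsmodels; all data below is synthetic.
import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

rng = np.random.default_rng(0)
months = 120                                              # 10 years of monthly data
trend = np.linspace(100, 200, months)
season = 10 * np.sin(2 * np.pi * np.arange(months) / 12)  # yearly cycle
promo = rng.integers(0, 2, months).astype(float)          # an external factor
sales = trend + season + 5 * promo + rng.normal(0, 2, months)

model = SARIMAX(
    sales,
    exog=promo,                    # exogeneity: influence from outside the series
    order=(3, 0, 0),               # lag: an AR term looking 3 steps back
    seasonal_order=(1, 0, 0, 12),  # seasonality: the far-away past, every 12 months
)
fit = model.fit(disp=False)
# Forecast 3 months ahead; the future promo values are assumed known here.
print(fit.forecast(steps=3, exog=np.ones((3, 1))))
```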

Now let me connect the dots on why I am introducing time series while explaining the evolution of LLMs and the transformer architecture on which they are based.

Let’s start with the name. One of the most popular and widely used time series models is SARIMAX, in which “AR” stands for auto-regressive. LLMs, or transformers, are also called auto-regressive models: they try to predict the next token (word) given the earlier tokens (remember lag). But the earlier, more fundamental sequence-to-sequence models had a long-term memory problem, wherein the architecture could not remember the far-away past. Then came the seminal paper “Attention Is All You Need,” which helped solve this long-term memory problem. The paper tells us how to focus on the right, or influencing, events in the past (remember seasonality).
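To see what “focusing on influencing past events” means mechanically, here is a toy NumPy sketch of the scaled dot-product attention from that paper, with a causal mask so each token can only attend to earlier tokens (the auto-regressive part). The shapes and random embeddings are made up for illustration.

```python
import numpy as np

def causal_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)              # relevance of each past token
    future = np.triu(np.ones(scores.shape, dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)   # auto-regressive: no peeking ahead
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over the visible past
    return weights @ V                           # weighted mix of past information

rng = np.random.default_rng(1)
tokens = rng.normal(size=(5, 8))                 # 5 tokens, 8-dim embeddings
out = causal_attention(tokens, tokens, tokens)
print(out.shape)                                 # (5, 8)
```

The softmax weights play the role that lag and seasonality play in a time series model: they decide how much each past event influences the current prediction, except here the model learns that weighting itself.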

Final point. To train LLMs, you cannot manually generate labeled data at such a massive scale. Let’s take a very simple example so that we can understand. I am forecasting sales for a month, and maybe it depends on the previous 3 months’ sales. So my fourth month’s “sales” is the masked token/data point, which I am trying to predict. But hey, for LLMs, we already have (imagine) 100 years of data, i.e., 1200 months of data. So now I have roughly 300 masked points on which I can train, considering every fourth-month prediction depends on the previous 3 months’ data. And the difference between the prediction and the actual value will become my objective function. I know I have tried to oversimplify the training mechanism, but I told you that’s how I remember and understand things. So I was just sharing.
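Here is a hedged sketch of that idea in code: one long series sliced into (context, target) pairs, with squared error as a stand-in objective. The window size and numbers mirror the example above; a real LLM uses overlapping windows, tokens, and a cross-entropy loss instead.

```python
import numpy as np

sales = np.arange(1, 1201, dtype=float)   # imagine 1200 months of sales history
window = 3                                # the previous 3 months predict the 4th

# Non-overlapping windows, as in the example: roughly 300 training pairs.
pairs = [(sales[i:i + window], sales[i + window])
         for i in range(0, len(sales) - window, window + 1)]
print(len(pairs))                         # 300

context, target = pairs[0]
prediction = context.mean()               # a stand-in for the model's guess
loss = (prediction - target) ** 2         # difference between prediction and actual
print(context, target, loss)              # [1. 2. 3.] 4.0 4.0
```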

“Every step we take on life’s journey, whether it leads us to stumble or stride confidently forward, is a teacher in its own right, imparting lessons of resilience, wisdom, and growth.” — Anonymous (Or I don’t know)


Rahul Kharat

Explorer | AI Consultant, CXO Advisory | 18 Patents | 2 International Publications | www.linkedin.com/in/irahulkharat