How transformers expanded my view of Math and ML
Last week, I released a blog post on how Math Academy helped me overcome personal learning barriers and encouraged me to go deeper into my dream field of machine learning.
Taking the Plunge: Researching ML Models
My recent quest has been delving into research papers on the neural network models and systems powering the AI programs that enhance our lives.
Three AI and ML papers I’ve been studying recently are:
- Attention Is All You Need – This paper proposes the Transformer, a model that relies entirely on attention mechanisms, ditching the older recurrent and convolutional neural networks (RNNs and CNNs) for sequence tasks like translation.
- BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding – This paper introduces Bidirectional Encoder Representations from Transformers (BERT), which adapts the Transformer’s encoder for pre-training on massive text corpora. Instead of processing sequences unidirectionally, BERT looks both left and right around every word, building on self-attention to enhance performance and results.
- Improving Language Understanding by Generative Pre-Training – This paper introduces the first Generative Pre-trained Transformer (GPT), which uses the Transformer’s decoder in a left-to-right, autoregressive way for language generation. It contrasts with BERT’s bidirectional approach and complements the original Transformer’s encoder-decoder setup by focusing solely on generation tasks.
Released between 2017 and 2019, these papers, which introduce and build on the Transformer model, have opened my eyes even more to how expansive math is and how it serves as the backbone of so many systems and societal structures.
As a math lover, my excitement was brewing with every paper I read. But I also felt a little intimidated by the dense equations explaining the math behind them. I still have a lot more work to do on Math Academy to understand every intricate detail of how the calculations are made, but by making connections between my world and the world I want to enter, I am starting to piece together the bigger picture.
The Transformer’s Superiority over RNNs and CNNs
To better understand the impact of the Transformer model, it helps to know why it needed to be created in the first place: more specifically, what was lacking in older models such as Recurrent Neural Networks (RNNs) and Convolutional Neural Networks (CNNs).
RNNs are a type of neural network that processes sequential data, meaning data where the order matters, such as sentences, speech, or time-series data (stock prices).
Imagine an RNN as a conveyor belt that moves one word or data point at a time, remembering the data point it saw before. It reads a word, keeps memory of that word, and uses that memory to understand the next word.
Mathematically, think of the RNN process as a recursive sequence, where each hidden state is computed from the previous one: h1 = f(x1), h2 = f(h1, x2), h3 = f(h2, x3), and so on.
Note how you cannot calculate h3 until you calculate h2, and you cannot calculate h2 without having h1.
RNNs are good at remembering recent words, but they are slow because they process word by word, and forgetful because they cannot remember early words well. For example, if we continued our recursive sequence out to, say, h20, we could solve the problem with the right information, but recalling h7 might prove to be a challenge.
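To make the conveyor-belt picture concrete, here is a minimal sketch of that step-by-step dependence (my own toy illustration, not code from any of the papers; the function f, the weights, and the inputs are all made up):

```python
import numpy as np

# Toy "RNN" step: each new hidden state depends on the previous one and the current input.
# W_h, W_x, and the inputs are made-up numbers purely for illustration.
W_h, W_x = 0.5, 1.0

def f(h_prev, x_t):
    # h_t = tanh(W_h * h_{t-1} + W_x * x_t)
    return np.tanh(W_h * h_prev + W_x * x_t)

inputs = [0.2, 0.7, -0.1, 0.4]   # one "word" (number) at a time
h = 0.0                          # initial memory
for t, x_t in enumerate(inputs, start=1):
    h = f(h, x_t)                # h3 cannot exist until h2 exists, and so on
    print(f"h{t} = {h:.4f}")
```

The loop itself is the bottleneck: every step has to wait for the one before it, which is exactly why RNNs cannot be parallelized.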
CNNs are a type of neural network that analyzes visual data such as images and videos. CNNs are popularly used for image/facial recognition and self-driving cars.
Think of a CNN as a scanner that looks for the important patterns in images or text. It uses filters to scan over the data, detects the important features (such as edges in an image or key phrases within the text), and combines those features to understand the bigger picture.
Our mathematical comparison for CNNs is a moving sum: y(i) = x(i) + x(i+1) + x(i+2).
For the sentence “The duck quacks,” a CNN would take 3-word chunks at a time: y1 = “The” + “duck” + “quacks”
Afterwards, it slides to the next set of words. CNNs are faster than RNNs because they process words in chunks, but they can’t see beyond the window they scan, which results in a limited understanding of the full sentence.
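Here is the moving-sum analogy as a toy sketch (again my own illustration; the sentence and window size are arbitrary):

```python
# Toy sliding-window "CNN" over a sentence: each output only sees 3 words at a time.
words = "The duck quacks at the crowded pond".split()
window = 3

for i in range(len(words) - window + 1):
    chunk = words[i:i + window]          # the filter's current field of view
    print(f"y{i + 1} =", " + ".join(chunk))
# Anything outside the current window is invisible to this filter,
# which is the "blind spot" described below.
```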
RNNs are good for sequential data (text, speech, time-series data), but they can forget earlier information in long sequences and they process words one by one, making them slow and incapable of parallel processing.
CNNs are faster than RNNs, but they can miss context outside their window, have limited filter sizes (three to five words), do not scale well to varying-length dependencies, and require many stacked layers to capture long-range dependencies, which complicates training these models.
Transformers surpass both RNNs and CNNs by gathering a global context of the data they are given. They process sequences fast and accurately by using a mechanism called self-attention that allows them to understand the relationships between all the data in a sequence at once. Let’s break this down mathematically by talking about dot products from linear algebra.
Step 1: Compute Dot Products of Q and K
We take the dot product of the query matrix (Q) and the key matrix (K) to measure how much focus (attention) one word should give to another. The attention scores are computed as scores = QKᵀ (the original paper also divides by the square root of the key dimension to keep the values stable).
Each value in this score matrix represents how much one word should pay attention to another, based on their dot product similarity.
Step 2: Apply Softmax to Normalize Scores
We convert these scores into probabilities so they sum to 1 across each row. After softmax, we get an attention matrix that determines how much focus each word gives to the others.
Step 3: Multiply by the Value Matrix (V)
The final step is multiplying this attention matrix by V, which contains the actual word representations, to get the final transformed word embeddings.
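Putting the three steps together, here is a minimal NumPy sketch of scaled dot-product attention. The tiny Q, K, and V matrices are made-up numbers for a three-word example, not values from the paper; the division by the square root of the key dimension follows the original formulation.

```python
import numpy as np

# Made-up 3-word example: each row is one word's query/key/value vector (dimension 2).
Q = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
K = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
V = np.array([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]])

d_k = K.shape[1]

# Step 1: dot products of Q and K give raw attention scores.
scores = Q @ K.T / np.sqrt(d_k)

# Step 2: softmax turns each row of scores into probabilities that sum to 1.
weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)

# Step 3: a weighted sum of the value vectors gives the new word representations.
output = weights @ V

print("attention weights:\n", weights.round(3))
print("transformed embeddings:\n", output.round(3))
```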
All of this is achieved with its two main parts:
- The encoder, which processes the input data
- The decoder, which generates the output data
Both parts consist of stacked layers of self-attention and feedforward networks, allowing the model to capture deep contextual relationships in data.
By processing all words at once in this way, the model can capture long-range dependencies efficiently, with the attention scores acting as a weighting system that determines which words hold the greatest influence on each other. Because of this, Transformers are very scalable, working for both short and long texts without fixed filters.
Finalizing the Math Analogies
- RNNs = Adding numbers one by one → Slow & forgets early inputs
- CNNs = Sliding sum over a small window → Faster but has blind spots
- Transformers = Matrix multiplication (dot products) → Fast & sees everything
Exploring How BERT and GPT Are Impacting the World
The Transformer itself was an innovative leap for neural networks, letting them process more information quickly while maintaining high accuracy. But even innovations can be built upon. BERT and GPT were the next two major advancements built on the foundation the Transformer set as a scalable neural network for language and communication. Seeing how BERT and GPT refined the Transformer’s foundation made me realize just how rapidly AI was progressing. What once seemed like a futuristic dream—computers understanding and generating human language—is now an everyday reality. These models didn’t just improve AI’s ability to process text; they made it feel more natural, almost human-like.
BERT (Bidirectional Encoder Representations from Transformers) is a language understanding model that helps machines interpret text with context, much like humans do. Before BERT, search engines and AI assistants struggled with understanding the meaning of words based on context. For example, asking Siri or Google Assistant "Do jaguars run fast?" might have led to responses about Jaguar cars instead of the animal because keyword-based models didn't consider context. After BERT, the assistant understands that "run fast" applies to a living creature and provides speed-related facts about jaguars as animals.
As a general-purpose language model, BERT is designed to have a deep understanding of context. To achieve this, it uses a two-step process:
Pre-Training (Learning a General Understanding of Language)
Initial training begins with massive amounts of text, processed in a self-supervised manner, meaning that human-labeled data is unnecessary.
During this training, BERT learns language patterns through two key tasks:
- Masked Language Modeling (MLM): BERT randomly masks certain words in sentences and tries to predict them (see the sketch after this list).
- Next Sentence Prediction (NSP): BERT learns how sentences relate to one another by predicting whether one sentence actually follows another.
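To see masked language modeling in action, here is a minimal sketch assuming the Hugging Face transformers library and the publicly available bert-base-uncased checkpoint (my own example sentence, not one from the paper):

```python
from transformers import pipeline

# Load a pre-trained BERT and ask it to fill in a masked word.
unmasker = pipeline("fill-mask", model="bert-base-uncased")

predictions = unmasker("He revealed the diamond [MASK] after getting on one knee.")
for p in predictions[:3]:
    print(f"{p['token_str']!r} (score: {p['score']:.3f})")
```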
Fine-Tuning (Adapting to Specific NLP Tasks)
After initial training is completed, BERT can sharpen its abilities for specific tasks such as sentiment analysis, question answering, and named entity recognition. While this step is task-specific, very few changes are required: adding a small classifier on top and training for a short period are all that fine-tuning needs.
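As a rough sketch of how small that added classifier can be, here is what a sentiment-style head might look like, assuming PyTorch and the Hugging Face transformers library; the two-class setup and the example sentence are hypothetical, and a real fine-tuning run would also need labeled data and an optimizer loop:

```python
import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
encoder = AutoModel.from_pretrained("bert-base-uncased")

# The entire task-specific addition: one linear layer mapping BERT's
# [CLS] vector to, say, 2 sentiment classes.
classifier = nn.Linear(encoder.config.hidden_size, 2)

inputs = tokenizer("I loved this movie!", return_tensors="pt")
with torch.no_grad():
    cls_vector = encoder(**inputs).last_hidden_state[:, 0]  # the [CLS] token's representation

logits = classifier(cls_vector)
print(logits)  # unnormalized class scores (random-ish here, since the head is untrained)
```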
The Transformer’s architecture is made up of the encoder and decoder. With BERT, the encoder is what is taken and built upon. As a result:
- Every input token attends to all other tokens in the sentence simultaneously.
- It processes words bidirectionally (both left and right contexts are considered).
- The multi-head self-attention mechanism helps BERT deeply understand relationships between words.
This is what you call a bidirectional process, and it became a game changer because it meant that search engines could finally move beyond just matching keywords and start actually understanding the intent behind a search. From here, the leap in AI assistants was undeniable—responses became more relevant, and interactions felt less robotic. The impact of this shift is something we often take for granted today, but it's one of the most significant advancements in Natural Language Processing.
BERT achieves this by modifying the Transformer’s attention mechanism to apply multi-head self-attention bidirectionally rather than in a left-to-right sequence like GPT.
By doing this, BERT analyzes words in parallel, assigning different levels of importance to different words just like the original Transformer, but when processing a word, it considers the context of both the left and right at the same time. This allows the model to expand its ability to see the bigger picture.
For example, take the sentence: “He revealed the diamond ring after getting on one knee and opening the box.”
In unidirectional models, “diamond ring” would not have full visibility into “opening” while being processed. BERT, however, sees both directions at once and understands that “diamond ring” and “opening the box” are related, improving the level of contextual understanding.
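One way I picture this (purely my own toy illustration): treat visibility as a matrix of which words can attend to which. BERT’s encoder leaves the whole matrix open, while a left-to-right model only keeps the lower triangle:

```python
import numpy as np

words = ["He", "revealed", "the", "diamond", "ring", "after", "opening", "the", "box"]
n = len(words)

bidirectional_visibility = np.ones((n, n), dtype=int)           # BERT: every word sees every word
left_to_right_visibility = np.tril(np.ones((n, n), dtype=int))  # GPT-style: no peeking ahead

ring, opening = words.index("ring"), words.index("opening")
print("BERT: can 'ring' see 'opening'? ", bool(bidirectional_visibility[ring, opening]))
print("L-to-R: can 'ring' see 'opening'?", bool(left_to_right_visibility[ring, opening]))
```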
Where BERT pulls from the encoder portion of the original Transformer design, GPT (Generative Pre-trained Transformer) pulls from the decoder portion. GPT’s objective as a model is to predict the next word in a given sequence in order to generate text. Because of this, the encoder isn’t required for its functionality.
GPT achieves this by processing words sequentially while using masked self-attention, guaranteeing that predictions are based only on previously processed words. Masking shows up in BERT’s process as well (BERT masks out words during pre-training), but GPT does things differently: it computes self-attention across all words in a sentence and applies a “mask” or “covering” to block attention to future words, forcing GPT to predict the next word from only what came before.
For example, take the sentence: “The dog barked at the mailman.” When GPT is predicting “barked,” it can only see “The dog”; it cannot see “at the mailman,” because that would leak future information. All of this is done with the intent of making GPT more like a human writer, building sentences word by word instead of processing the entire sentence at once.
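Here is a minimal NumPy sketch of that “covering” (my own illustration, with made-up scores): a causal mask blocks attention to future positions before the softmax, so every row of the resulting attention matrix has zeros for the words that come later.

```python
import numpy as np

words = ["The", "dog", "barked", "at", "the", "mailman"]
n = len(words)

# Made-up raw attention scores between every pair of words.
scores = np.random.rand(n, n)

# Causal mask: position i may only attend to positions <= i.
mask = np.triu(np.ones((n, n)), k=1).astype(bool)   # True above the diagonal (the "future")
scores[mask] = -np.inf                               # blocked before softmax

weights = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
print(np.round(weights, 2))  # each row has zeros for every future word
```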
The ability of GPT to generalize across tasks feels like one of the most significant breakthroughs in AI. Instead of training dozens of models for different applications, we now have one base model that can adapt with minimal fine-tuning. This efficiency is what makes AI truly scalable. As someone learning ML, it’s fascinating to see how unsupervised learning is shifting the field—what once required painstakingly labeled datasets can now be achieved with large-scale pattern recognition.
All of this is achieved by training GPT on vast amounts of text while applying unsupervised learning. Instead of being given labeled data, it learns patterns by predicting the next word in massive text datasets. This is extremely powerful because, instead of training a separate model for every language task, GPT can focus on developing a general sense of language understanding first, then fine-tune for specific tasks such as chatbots, summarization, and programming help, among other things. This makes NLP models scalable, reducing the need for massive labeled datasets.
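As a quick illustration of what that general next-word objective buys you, here is a minimal sketch assuming the Hugging Face transformers library and the small, publicly available gpt2 checkpoint (the prompt is my own):

```python
from transformers import pipeline

# A model trained only to predict the next word can still be steered toward many tasks.
generator = pipeline("text-generation", model="gpt2")

result = generator("The dog barked at the", max_new_tokens=10, num_return_sequences=1)
print(result[0]["generated_text"])
```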
Looking ahead, it's clear that BERT and GPT are just the beginning. With each iteration, these models become more capable, bringing AI closer to true human-like comprehension and reasoning. As I dive deeper into ML, I can't help but wonder: how far can this go? To what extent will future models understand nuance, emotion, and even morality? The pace of innovation suggests that we’re only scratching the surface of what’s possible.
Pondering these possibilities excites me and reminds me once again that Math isn’t just an abstract discipline—it is the reason AI, and by extension, modern society, continues to advance. Without math, there is no BERT. Without math, there is no ChatGPT. Without math, our world wouldn’t look anything like it does today.
I’ll be honest, delving into these research papers was a challenging task. I had hit another ceiling, but I broke through it and embraced the discomfort of expanding my world, just like I did with Math Academy’s help. The math (matrices, scaling) and the ML (attention, pre-training) clicked as extensions of what I love. If you’re a math or ML newbie like me, don’t be intimidated—start with the big ideas, ask questions, and lean on communities like those built on platforms like X (I still call it Twitter in my heart) and Discord, where people talk all things math, AI, ML, programming, and other overlapping topics.
We can acknowledge, and probably agree, that AI is not only changing the world and our relationship with it, but also expanding the possibilities of what can be achieved. Math is the key, the foundation, and the backbone of this evolving world.
I’m more optimistic than ever about being the transformer of my own life, gathering new insights, and looking for new opportunities to build a career in the ML field (Hire me!!! :P).
As always, Math continues to be my north star and now my transformer.
What’s yours?