Nuts and Bolts of ChatGPT!

Have you ever wondered how ChatGPT carries on engaging conversations or offers insightful responses so effortlessly? It’s like chatting with a knowledgeable friend who always knows just what to say. But how does it work its magic?
As daily users of ChatGPT, many software developers like myself marvel at its capabilities without fully understanding the underlying mechanisms that power them. The inner workings of ChatGPT, the nuts and bolts that make it tick, have remained somewhat elusive.
ChatGPT is a large language model developed by OpenAI. In this article, we’re going to peel back the layers and take a closer look at what makes ChatGPT truly remarkable.
The Transformer Architecture: A Paradigm Shift
ChatGPT is based on the Transformer architecture, a neural network architecture for processing sequential data such as text.
Limitations of Traditional Sequence-to-Sequence Models
- Traditional sequence-to-sequence models, such as those based on recurrent neural networks (RNNs) and convolutional neural networks (CNNs), have limitations when it comes to processing long sequences.
- In RNN-based encoder-decoder models, the output sequence relies heavily on a single context vector, the encoder's final hidden state, which makes long sentences hard to handle. RNNs also suffer from issues like vanishing gradients and difficulty in capturing long-range dependencies.
Overview of the Transformer Architecture and Its Key Innovations
- The Transformer architecture, introduced in the landmark paper “Attention is All You Need” in 2017, represents a significant departure from traditional models.
- Instead of relying on sequential processing, the Transformer processes the entire input sequence in parallel.
- The architecture consists of an encoder and a decoder, each composed of several layers.
GPT models use only the decoder part of the Transformer architecture. The decoder is autoregressive, meaning it generates output tokens one at a time, each conditioned on the tokens that came before it.
Layers of the Transformer:

- Input Embeddings: The input sequence is first converted into embeddings: numerical vectors that capture the meaning of each word in a high-dimensional space, a representation that is much easier for a computer to work with.
- Positional Encoding: Since the Transformer doesn’t inherently understand the order of words, positional encoding is added to provide information about the position of each word in the sequence.
- Multi-Headed Self-Attention Mechanism: This is the game changer. It allows the model to weigh the importance of different words in the input sequence while processing each word, capturing long-range dependencies and contextual information effectively.
- Feed-Forward Neural Network: After the self-attention mechanism, each layer in the Transformer includes a feed-forward neural network. It further processes the output of the self-attention mechanism and produces the final output for that layer.
- Normalization and Residual Connections: After each sub-layer in both the encoder and decoder, that is, after every multi-head attention block and every feed-forward network, there is a residual connection followed by layer normalization. The residual connection helps avoid the vanishing gradient problem, and layer normalization makes training faster and more stable. A minimal PyTorch sketch of how these pieces fit together follows this list.
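To make these pieces concrete, here is a minimal PyTorch sketch of a single decoder-style block: token embeddings plus sinusoidal positional encodings feed into masked multi-head self-attention and a feed-forward network, each wrapped in a residual connection and layer normalization. The class name MiniDecoderBlock and the tiny dimensions are illustrative choices, not ChatGPT’s actual implementation.

```python
import torch
import torch.nn as nn

def positional_encoding(seq_len, d_model):
    """Sinusoidal positional encodings, as in 'Attention Is All You Need'."""
    pos = torch.arange(seq_len).unsqueeze(1).float()
    i = torch.arange(0, d_model, 2).float()
    angles = pos / (10000 ** (i / d_model))
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angles)   # even dimensions get sine
    pe[:, 1::2] = torch.cos(angles)   # odd dimensions get cosine
    return pe

class MiniDecoderBlock(nn.Module):
    """One decoder-style block: masked self-attention + feed-forward,
    each followed by a residual connection and layer normalization."""

    def __init__(self, d_model=64, n_heads=4, d_ff=256):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.ReLU(), nn.Linear(d_ff, d_model)
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x):
        # Causal mask: each position may only attend to itself and earlier positions.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)
        attn_out, _ = self.attn(x, x, x, attn_mask=mask)
        x = self.norm1(x + attn_out)       # residual connection + layer norm
        x = self.norm2(x + self.ff(x))     # residual connection + layer norm
        return x

# Toy usage: 8 tokens from a 100-word vocabulary, embedded into 64 dimensions.
tokens = torch.randint(0, 100, (1, 8))                      # (batch, seq_len)
x = nn.Embedding(100, 64)(tokens) + positional_encoding(8, 64)
print(MiniDecoderBlock()(x).shape)                          # torch.Size([1, 8, 64])
```

A full GPT-style model simply stacks many such blocks and adds a final projection back to the vocabulary to predict the next token.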
Self-attention
Imagine you’re reading a story, and there’s a word that seems important. It’s like the star of the sentence, grabbing your attention and helping you understand what’s going on. Now, think about how you decide which word is important as you read.
What are Query, Key, and Value?
- Key, Query, and Value: In our story, each word has three identities: a key, a query, and a value. The key is like a hint that tells us what other words are related to it. The query is like a question we ask to find out which words are important for understanding the current word. The value is the actual information that the word holds.
- Paying Attention: When we come across a word, we don’t just look at it; we pay attention to the other words around it too. We ask ourselves, “Which other words are related to this one?” This is where the self-attention mechanism comes in. It helps the model decide how much attention to give to each word in a sentence.
- Calculating Attention Scores: The model calculates attention scores for each word by comparing its query with the keys of all the other words in the sentence. If a word’s query matches well with another word’s key, it means they’re related, and the model gives them a high attention score. If they’re not related, the score is lower.
- Weighted Sum: Once we have the attention scores, we use them to calculate a weighted sum of all the words’ values. This sum represents the importance of each word in understanding the sentence. Words with higher attention scores contribute more to the final understanding, while words with lower scores have less impact.
- Putting it Together: By paying attention to how each word relates to the others in the sentence, the model can understand the context better and generate more accurate responses. It’s like having a conversation where you listen carefully to what the other person is saying and respond based on what you’ve heard. The short NumPy sketch below walks through these steps.
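Here is a small NumPy sketch of those steps: project each word vector into a query, a key, and a value, score queries against keys, turn the scores into weights with a softmax, and take the weighted sum of the values. The matrices and the three-word toy input are random placeholders; real models learn these weights and run many attention heads in parallel.

```python
import numpy as np

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence of word vectors X."""
    Q = X @ W_q                        # queries: what each word is looking for
    K = X @ W_k                        # keys: what each word offers as context
    V = X @ W_v                        # values: the information each word carries
    scores = Q @ K.T / np.sqrt(K.shape[-1])          # query/key match scores
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over each row
    return weights @ V                 # weighted sum of the values

# Toy example: 3 "words", each represented by a 4-dimensional vector.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 4))
W_q, W_k, W_v = (rng.normal(size=(4, 4)) for _ in range(3))
print(self_attention(X, W_q, W_k, W_v).shape)        # (3, 4)
```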
Attention Everywhere!
As you now understand, attention is a revolutionary idea.
If you want to play around with a Transformer yourself, Hugging Face’s Write With Transformer application lets you do just that.
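If you would rather experiment locally, a rough equivalent is a few lines with the Hugging Face transformers library. The sketch below uses GPT-2, a small, openly available decoder-only model from the same family of architectures; it is not ChatGPT, and the prompt is just an example.

```python
# Requires: pip install transformers torch
from transformers import pipeline

# GPT-2 stands in for a decoder-only Transformer; ChatGPT itself is not
# publicly downloadable.
generator = pipeline("text-generation", model="gpt2")

result = generator(
    "The Transformer architecture changed natural language processing because",
    max_new_tokens=40,
    num_return_sequences=1,
)
print(result[0]["generated_text"])
```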
Pre-training: The Foundation of Language Understanding
The pre-training process begins with exposing the model to a vast corpus of text data, from books and articles to websites and databases. This text serves as the raw material from which the model learns the patterns, structures, and nuances of language. For GPT-style models such as ChatGPT, the pre-training objective is next-token prediction, also called causal language modeling: a self-supervised task in which the model learns to predict each token from the tokens that precede it. (Masked Language Modeling, or MLM, in which randomly hidden words are predicted from their surrounding context, is the analogous objective used by encoder models such as BERT.)
Here’s how the objective works (a toy sketch of the loss follows this list):
- The model is shown sequences of text from the corpus, token by token.
- At each position, its task is to predict the next token given all the tokens that came before it.
- By learning to make these predictions accurately across countless sentences, the model develops a deep understanding of language, capturing the intricate relationships between words, their meanings, and their contexts.
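A toy sketch of this objective in PyTorch might look like the following. The "model" here is deliberately simplified to an embedding layer plus a linear head (a real model would be a full decoder-only Transformer), and the vocabulary size, dimensions, and random token IDs are all made up for illustration.

```python
import torch
import torch.nn.functional as F

vocab_size, d_model = 100, 32
embedding = torch.nn.Embedding(vocab_size, d_model)   # stand-in for the model body
lm_head = torch.nn.Linear(d_model, vocab_size)        # maps features to token logits

tokens = torch.randint(0, vocab_size, (1, 9))         # one sequence of 9 token IDs

# Next-token prediction: the model reads tokens[:, :-1] and is trained to
# predict tokens[:, 1:], i.e. each position predicts the token that follows it.
inputs, targets = tokens[:, :-1], tokens[:, 1:]
logits = lm_head(embedding(inputs))                   # (batch, seq_len-1, vocab_size)

loss = F.cross_entropy(logits.reshape(-1, vocab_size), targets.reshape(-1))
print(loss.item())   # pre-training minimizes this loss over enormous corpora
```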
Imagine learning a new language by repeatedly guessing the next word of a sentence from the words you have read so far. As you practice this exercise with countless sentences and contexts, you gradually develop an intuitive understanding of the language’s grammar, vocabulary, and idiomatic expressions.
Pre-training on massive text corpora allows language models like ChatGPT to build robust language representations, much like developing a rich mental lexicon and a feel for the nuances of a language. This pre-training phase is computationally intensive: the models contain billions of parameters and are trained on vast amounts of data to achieve high-quality language understanding.
However, the true power of pre-training lies in its ability to enable transfer learning. Once the model has developed a strong foundation of language understanding through pre-training, it can be fine-tuned for specific tasks, such as open-ended conversation generation, with significantly less data and computational resources required.

This transfer learning approach has revolutionized the field of natural language processing, allowing models like ChatGPT to achieve remarkable performance on a wide range of language tasks while requiring relatively smaller task-specific datasets for fine-tuning.
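As a rough sketch of what that fine-tuning step can look like in code, the snippet below continues training an openly available decoder-only model (GPT-2, standing in for a much larger base model) on a tiny, made-up set of conversational examples with the Hugging Face transformers library. ChatGPT's actual fine-tuning is far more involved; this only illustrates the basic transfer-learning loop.

```python
# Requires: pip install transformers torch
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"                                   # small open stand-in model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# A tiny, hypothetical "task dataset" of conversational snippets.
examples = [
    "User: How do I reverse a list in Python?\nAssistant: Use xs[::-1] or list(reversed(xs)).",
    "User: What is Redis?\nAssistant: Redis is an in-memory key-value data store.",
]

optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
for text in examples:
    batch = tokenizer(text, return_tensors="pt")
    # For causal-LM fine-tuning, labels are the input IDs themselves; the
    # library shifts them internally to compute the next-token loss.
    outputs = model(**batch, labels=batch["input_ids"])
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```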
Multilingual Support: Bridging Language Barriers
One day, I asked ChatGPT about Redis, but there was a twist: I posed the question in Hindi, curious to see how it would handle a different language.
To my amazement, ChatGPT responded seamlessly in perfect Hindi, keeping the context and using correct vocabulary and grammar. It was as if I were conversing with a fluent Hindi speaker who was also a seasoned software developer and a Redis expert.

Language is diverse and rich, with each language bringing its own nuances and intricacies. So how is ChatGPT so smooth at responding in multiple languages?
- Pre-training on Multilingual Text Data: Just as a polyglot learns multiple languages by exposure to various linguistic contexts, ChatGPT is pre-trained on multilingual text data. This means it’s exposed to conversations, articles, and texts in multiple languages during its training process.
- Shared Representations Across Languages: Like finding common ground between different languages, ChatGPT learns to represent words and phrases in a shared semantic space across languages. This allows it to understand and generate text in multiple languages using a single unified model, an idea roughly illustrated in the sketch below.
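One way to get a feel for this shared semantic space is with an openly available multilingual embedding model (not ChatGPT's internal representations). Assuming the sentence-transformers library and its paraphrase-multilingual-MiniLM-L12-v2 model, a sentence and its Hindi translation typically land much closer together than an unrelated sentence does:

```python
# Requires: pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

english = "Redis is an in-memory key-value data store."
hindi = "Redis एक इन-मेमोरी की-वैल्यू डेटा स्टोर है।"
unrelated = "The weather in Mumbai is humid today."

emb = model.encode([english, hindi, unrelated])
print("English vs Hindi translation:", util.cos_sim(emb[0], emb[1]).item())
print("English vs unrelated sentence:", util.cos_sim(emb[0], emb[2]).item())
# The translation pair typically scores far higher than the unrelated pair.
```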
Amazing, isn’t it?
As we conclude our exploration into the inner workings of ChatGPT, it’s clear that its architectural innovations have reshaped the landscape of natural language processing. With the power of the Transformer architecture, its attention mechanisms, and its multilingual prowess, ChatGPT stands as a testament to the advances in AI technology.
The potential applications of ChatGPT are boundless, from language translation and content generation to financial advice and beyond. Its impact on industries and daily interactions is profound, changing the way we interact with technology. ☕

In the end, ChatGPT is more than just a language model; it’s a catalyst for innovation and understanding in an increasingly digital world. 🚀
References:
1. Illustrated Guide to Transformers - Step by Step Explanation: https://towardsdatascience.com/illustrated-guide-to-transformers-step-by-step-explanation-f74876522bc0
2. Attention Is All You Need: https://proceedings.neurips.cc/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf
3. Attention Networks: A simple way to understand Self Attention: https://medium.com/@geetkal67/attention-networks-a-simple-way-to-understand-self-attention-f5fb363c736d
4. How Does ChatGPT Actually Work?: https://www.scalablepath.com/machine-learning/chatgpt-architecture-explained