Generative AI for dummies (1/3) 💡

9 min read · Sep 7, 2023

Welcome to the first part of a 3-part series on Generative AI! In this blog, we’ll dive deep into the inner workings of LLM technology: the underlying architecture, its components & its different flavours.

🚨 Be sure to check out the Glossary at the end 🔚

What exactly is a Large Language Model (LLM)?
How does GPT (Generative Pre-trained Transformer) fit into the LLM landscape?
What’s the backstory of LLMs? How did we get here?
What are the key components that make up the architecture of LLMs?
Are there different flavors of LLMs, and what sets them apart?
What triggered the breakthrough in Generative AI?

Hello beautiful fellas,
Welcome to the Daily product management show! 📺 I’m your host, Bhavya, and I’ll be bringing you fresh insights & ramblings on product management every few days! 👋

What is GenAI?

Generative AI (GenAI) is a subset of machine learning (ML). The machine learning models that underpin GenAI have learned to generate content by finding statistical patterns in massive datasets of content that was originally generated by humans.
Generative AI models are being created for multiple modalities, including images, video, audio, and speech.
By either using these models as they are or by applying fine tuning techniques to adapt them to your specific use case, you can rapidly build customized solutions without the need to train a new model from scratch.

Model’s memory: Relative size of models in terms of their parameters

Use cases of Large Language Models

Use cases of LLMs when applied to different tasks

Translate: LLMs can be used to translate text from one language to another. For example, the image shows how an LLM can translate the French sentence “J’aime l’apprentissage automatique” into German as “Ich liebe maschinelles Lernen.”
Summarize: An LLM can summarize a chat session between a customer and a support agent.
Extract entities: LLMs can be used to extract named entities from text; this is called Named Entity Recognition (NER). For example, the image shows how an LLM can extract the named entities “Dr. Evangeline Starlight”, “Technopolis”, “quantum computing”, and so on from a short text.
Write code: An LLM can also write code in Python or any other programming language, for example to calculate the mean of every column in a dataframe, as shown in the image.
Answer questions: LLMs when integrated with external data can be used to answer questions like “Is flight VA8005 landing on time?”
Write essays: LLMs can be used to write essays. For example, the image shows how an LLM can write a 5-paragraph essay on the history of machine learning.

History of GenAI evolution

RNN

Generative algorithms are not new. Previous generations of language models made use of an architecture called Recurrent Neural Networks (RNNs). RNNs, while powerful for their time, were limited by the amount of compute and memory needed to perform well at generative tasks.

RNNs could only predict the next word from the context of the previous words, and were ill-equipped to handle the complexities of language.

Transformer Architecture

Well, in 2017, after the publication of the paper Attention Is All You Need, from Google and the University of Toronto, everything changed. The transformer architecture had arrived. It can be scaled efficiently to use multi-core GPUs, it can process input data in parallel, making use of much larger training datasets, and crucially, it’s able to learn to pay attention to the meaning of the words it’s processing. And attention is all you need. It’s in the title.

Paper: Attention Is All You Need (arxiv.org)

The power of the transformer architecture lies in its ability to learn the relevance and context of all of the words in a sentence: not just, as you see here, the relationship of each word to its neighbor, but to every other word in the sentence. The model applies attention weights to those relationships so that it learns the relevance of each word to every other word, no matter where they are in the input. For a sentence like “The teacher taught the students with the book”, this gives the algorithm the ability to learn who has the book, who could have the book, and whether it’s even relevant to the wider context of the document.

Overview: Transformer model architecture

Components of Transformers Architecture

Let’s understand the individual components that build the Transformer architecture and the sequence of steps executed, from feeding in the input to completing the output. ➿

Tokenization

Machine-learning models are just big statistical calculators that deal with numbers, not words. So, before feeding them text, we need to turn words into numbers. This is called tokenization. It’s like giving each word a special number that the model understands from a big dictionary of words it knows. There are different ways to do this.
Examples of tokenization methods:

Token IDs matching two complete words.
Using token IDs to represent parts of words.
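
Here is a minimal sketch of both approaches, assuming a tiny hand-made vocabulary. Real tokenizers learn their vocabulary from data; the names WORD_VOCAB, SUBWORD_VOCAB, tokenize_words and tokenize_subwords, and every ID below, are made up purely for illustration.

```python
# Toy vocabularies: every ID here is invented for illustration.
WORD_VOCAB = {"the": 0, "teacher": 1, "teaches": 2, "<unk>": 3}
SUBWORD_VOCAB = {"the": 0, "teach": 1, "er": 2, "es": 3, "<unk>": 4}


def tokenize_words(text):
    """Map each whitespace-separated word to one token ID."""
    return [WORD_VOCAB.get(w, WORD_VOCAB["<unk>"]) for w in text.lower().split()]


def tokenize_subwords(text):
    """Greedy longest-match split of each word into known sub-word pieces."""
    ids = []
    for word in text.lower().split():
        start = 0
        while start < len(word):
            # Try the longest remaining piece first, then shrink it.
            for end in range(len(word), start, -1):
                piece = word[start:end]
                if piece in SUBWORD_VOCAB:
                    ids.append(SUBWORD_VOCAB[piece])
                    start = end
                    break
            else:
                # No known piece starts here: emit <unk>, skip the rest of the word.
                ids.append(SUBWORD_VOCAB["<unk>"])
                break
    return ids


print(tokenize_words("The teacher teaches"))     # [0, 1, 2]
print(tokenize_subwords("The teacher teaches"))  # [0, 1, 2, 1, 3]
```

Notice how the sub-word version reuses the piece “teach” for both “teacher” and “teaches”, so a small vocabulary can cover many more words.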

Embedding Layer

The tokenized text is now passed to the embedding layer, a high-dimensional space where each token is represented as a vector and occupies a unique location within that space. Each token ID in the vocabulary is matched to a multi-dimensional vector, and the intuition is that these vectors learn to encode the meaning and context of individual tokens in the input sequence.

This diagram represents a 3D vector space. In the original Attention Is All You Need paper, the vector size quoted was 512 dimensions.
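
At its simplest, an embedding layer is just a lookup table with one row per token ID. The sketch below assumes a toy vocabulary of 5 tokens and 4-dimensional vectors filled with random numbers; in a real model the values are learned during training, and embedding_table and embed are hypothetical names.

```python
import random

VOCAB_SIZE = 5      # toy vocabulary size (assumption)
EMBEDDING_DIM = 4   # the paper quotes 512; 4 keeps the output readable

random.seed(0)  # fixed seed so the "untrained" vectors are reproducible

# One row per token ID. In a real model these values are learned in training.
embedding_table = [
    [random.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]
    for _ in range(VOCAB_SIZE)
]


def embed(token_ids):
    """Look up the vector for each token ID."""
    return [embedding_table[token_id] for token_id in token_ids]


vectors = embed([0, 1, 2, 1, 3])
print(len(vectors), len(vectors[0]))  # 5 4
```

The same token ID always maps to the same vector, which is why word order has to be injected separately.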

Positional Encoding

To maintain word order relevance, positional encoding is added to the token vectors in both the encoder and decoder. Because the model processes all input tokens in parallel, this is what preserves the information about each word’s position in the sequence.
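
The text above doesn’t say which encoding scheme is used, so the sketch below assumes the sinusoidal one from the Attention Is All You Need paper: sine on even dimensions, cosine on odd dimensions, with a frequency that depends on the dimension. The function names are my own.

```python
import math


def positional_encoding(seq_len, d_model):
    """Sinusoidal encoding: sin on even dimensions, cos on odd dimensions."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency.
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def add_positions(token_vectors):
    """Element-wise sum of token vectors and their positional encodings."""
    d_model = len(token_vectors[0])
    encodings = positional_encoding(len(token_vectors), d_model)
    return [
        [x + p for x, p in zip(vec, enc)]
        for vec, enc in zip(token_vectors, encodings)
    ]


# The same vector at two positions ends up different after encoding.
same_token_twice = [[0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.5]]
encoded = add_positions(same_token_twice)
print(encoded[0])  # position 0 adds [0, 1, 0, 1]
print(encoded[1])
```

Two identical token vectors come out different once their positions are added, so the model can tell the first occurrence of a word from the second.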

Attention Layer

Once you’ve summed the input tokens and the positional encodings, you pass the resulting vectors to the self-attention layer. Here, the model analyzes the relationships between the tokens in your input sequence.
In the transformer architecture, multiple sets of self-attention weights, known as heads, are learned independently in parallel; this is called multi-headed self-attention. The number of attention heads varies between models but typically falls within the range of 12 to 100.

The self-attention weights are learned during training and stored in these layers

Each self-attention head learns distinct language aspects, such as relationships between entities, sentence activities, or rhyming words. It’s crucial to emphasize that we don’t predefine what these heads focus on. Instead, they autonomously discover language nuances.

Attention map: To illustrate the attention weights between each word and every other word

Multi-headed Self-Attention Transformer architecture: 3 Self-attention heads displayed
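
To make the attention map less abstract, here is a rough sketch of a single self-attention head, assuming the scaled dot-product formulation from the paper. One big simplification: queries, keys and values are just the input vectors themselves, whereas a trained head would first multiply the input by three learned weight matrices. The toy input vectors are made up.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(vectors):
    """Single-head scaled dot-product self-attention.

    Simplification: queries, keys and values are the input vectors
    themselves; a trained head would apply learned weight matrices first.
    """
    scale = math.sqrt(len(vectors[0]))
    attention_map, outputs = [], []
    for query in vectors:
        # Score this token against every token, itself included.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
        weights = softmax(scores)
        attention_map.append(weights)
        # Output is the attention-weighted average of the value vectors.
        outputs.append([
            sum(w * value[d] for w, value in zip(weights, vectors))
            for d in range(len(query))
        ])
    return attention_map, outputs


tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
attention_map, outputs = self_attention(tokens)
for row in attention_map:
    print([round(w, 2) for w in row])
```

Each printed row is one line of the attention map: how much one token attends to every token in the sequence. A multi-headed layer runs several of these heads side by side, each with its own learned weights.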

Feed forward Network

This layer produces logits: raw scores, one for every token in the tokenizer dictionary. They are not probabilities yet; a higher logit simply means a more likely token.

Softmax layer

To generate text, the model passes the logits through a softmax layer, which normalizes them into a probability for each word in the vocabulary. This results in thousands of scores, with one token having the highest score, indicating the most likely prediction. Various methods can be applied to select the final output from this probability vector.
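
A small sketch of this last step, assuming a made-up 5-token vocabulary and made-up scores. It shows two common ways to pick the output: greedy selection (always take the top token) and random sampling (pick in proportion to probability).

```python
import math
import random


def softmax(logits):
    """Normalize raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


VOCAB = ["I", "love", "machine", "learning", "<eos>"]  # toy vocabulary
logits = [1.2, 3.4, 0.3, 0.9, -1.0]                    # made-up raw scores

probs = softmax(logits)

# Greedy selection: always take the most likely token.
greedy_token = VOCAB[probs.index(max(probs))]

# Random sampling: pick a token in proportion to its probability.
rng = random.Random(0)
sampled_token = rng.choices(VOCAB, weights=probs, k=1)[0]

print(greedy_token)   # love
print(sampled_token)
```

Greedy selection gives the same answer every time, while sampling adds variety, which is why the same prompt can produce different completions.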

Example

Walkthrough of a Translation Process with a Transformer Model (French to English)

Tokenizer: We start by splitting the French phrase into smaller units called tokens, using a special tool designed for this purpose.
Encoder: These tokens are then sent to the encoder side of the model. The encoder passes them through the embedding layer to turn them into vectors, and learns their relationships.
Multi-Headed Attention: The encoder uses a multi-headed attention mechanism to focus on different parts of the input, capturing the structure and meaning of the French phrase.
Deep Representation: The encoder creates a deep representation of the input’s meaning, which is then passed to the decoder.
Decoder Input: A “start of sequence” token is added to the decoder input, signaling the decoder to start generating the English translation.
Contextual Understanding: The decoder uses the contextual information provided by the encoder to predict the next English token. It’s like having a conversation with context.
Output: The decoder’s predictions go through some more layers, and a final softmax layer to produce the first token of the translation.
Loop: This process repeats, with each generated token going back into the decoder to predict the next one, like building a sentence step by step.
End-of-Sequence: The model continues until it predicts an “end-of-sequence” token, indicating that the translation is complete.
Detokenize: Finally, the sequence of tokens is transformed back into words to get our translated output. In this case, it could be something like “I love machine learning.”
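
The loop in the steps above can be sketched as follows. To keep it runnable without a trained model, the encoder, decoder and softmax are replaced by a hard-coded lookup table, so FAKE_MODEL, predict_next and translate are hypothetical stand-ins and not a real API.

```python
START, END = "<sos>", "<eos>"

# Stand-in for the encoder + decoder + softmax: given the tokens generated so
# far, return the next one. A real model computes this; the table is made up.
FAKE_MODEL = {
    (START,): "I",
    (START, "I"): "love",
    (START, "I", "love"): "machine",
    (START, "I", "love", "machine"): "learning",
    (START, "I", "love", "machine", "learning"): END,
}


def predict_next(encoded_source, generated):
    """Hypothetical single decoding step (ignores encoded_source here)."""
    return FAKE_MODEL.get(tuple(generated), END)


def translate(source_tokens, max_tokens=20):
    encoded_source = list(source_tokens)   # placeholder for the encoder output
    generated = [START]                    # decoder input begins with start token
    while len(generated) <= max_tokens:
        token = predict_next(encoded_source, generated)
        if token == END:                   # stop at end-of-sequence
            break
        generated.append(token)            # feed the new token back in
    return " ".join(generated[1:])         # "detokenize", dropping the start token


print(translate(["J'aime", "l'apprentissage", "automatique"]))
# I love machine learning
```

The max_tokens cap is a safety net I added so the loop always ends, even if the end-of-sequence token never shows up.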

Glossary

Generative AI (GenAI): A subset of machine learning that involves models capable of generating content such as text, images, video, audio, or speech based on statistical patterns learned from large datasets.
Large Language Model (LLM): A machine learning model, trained on large amounts of text, that can understand and generate human language.
Generative Pre-trained Transformer (GPT): A specific type of large language model, known for its ability to generate human-like text based on pre-training on a large corpus of text.
Model’s Memory: Relative size in terms of their parameters.
Prompt: The text input provided to a language model to generate desired content.
Context Window: The space or memory available for the prompt in a language model, typically large enough to hold a few thousand words.
Completion: The output generated by a language model in response to a prompt.
Inference: The process of using a language model to generate text or content based on a given prompt.
Transformer Architecture: A neural network architecture known for its efficiency in processing and understanding the context of words in sentences, widely used in generative AI.
Tokenization: The process of converting human language text into numerical values (tokens) that machine learning models can process.
Embedding Layer: A layer in a neural network that represents tokens as vectors in a high-dimensional space, encoding the meaning and context of individual tokens.
Positional Encoding: Information added to token vectors to maintain the relevance of word order in the input sequence.
Self-Attention Layer: A component of the transformer architecture that analyzes the relationships between tokens in the input sequence, with multiple sets of self-attention weights in parallel.
Multi-Headed Self-Attention: A feature of the transformer architecture where multiple self-attention heads independently learn different language aspects.
Feed Forward Network: A layer in the neural network that produces logits, the raw scores for all tokens in the tokenizer dictionary, which the softmax layer then turns into probabilities.
Softmax Layer: A layer that calculates probabilities for each word in the vocabulary, used to select the final output token in generative AI models.
Encoder and Decoder Model: Components of a transformer-based architecture used in tasks like machine translation, where the encoder understands the input, and the decoder generates the output.
End-of-Sequence Token: A token used to indicate the completion of a sequence or translation.
Detokenize: The process of transforming a sequence of tokens back into human-readable words or content.
Homonyms: Words with the same spelling but different meanings. E.g., “bank”.
Syntactic ambiguity: “The teacher taught the students with the book.” Did the teacher teach using the book, did the students have the book, or was it both? How can an algorithm make sense of human language if sometimes we can’t?

This blog curates my takeaways as a product manager from the courses LLMs by Google & Introduction to LLMs and the generative AI by Coursera.