Class 2 Notes

Fundamentals of Generative AI and Large Language Models

Professor Pavlos - Technical Deep Dive

About the Instructor

Professor Pavlos - Scientific Program Director of the Institute for Applied Computational Science at Harvard SEAS. Expertise spans astronomy, physics, machine learning, and statistics. Leads the graduate program in data science and teaches flagship courses including CS109.

AI Evolution: Accelerating Transformation

  • Agricultural Age: ~1,000 years
  • Steam Revolution: 100 years
  • Electrical Age: ~100 years
  • Digital Age: 40 years
  • AI Era: accelerating rapidly

Key Insight: Acceleration Pattern

Each technological revolution happens faster than the last. We cannot treat AI as a slow-moving frontier—it's already here and moving at unprecedented speed.

Defining AI Through Examples

❌ Not AI

  • Abacus
  • Calculators
  • Programmable calculators

❓ Debatable

  • Deep Blue (chess system)
  • Relies on brute-force search rather than learning
  • Expertise confined to a single domain

✅ Clearly AI

  • AlphaGo (Monte Carlo + RL)
  • ChatGPT
  • Modern LLMs

Three Machine Learning Paradigms

Unsupervised Learning

Discovers hidden patterns without labels

Example: Clustering swan images without knowing they're swans
Pros: Useful for exploration
Cons: Difficult to evaluate

Supervised Learning

Learns from labeled examples (input-output pairs)

Example: Training to distinguish swans from non-swans
Pros: Clear evaluation metrics
Cons: Requires expensive human-labeled datasets

Self-Supervised Learning

Creates supervision signal from data itself

Example: Hide part of sentence and predict missing piece
Pros: No human labels needed, scalable
Cons: Complex architecture required

Self-Supervised Learning: The Game Changer

Self-supervised learning is the foundation of modern language models. By creating supervision signals from the data itself (like masking words and predicting them), it eliminates the need for expensive human labeling while enabling training on massive datasets.
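
As a concrete illustration, here is a minimal Python sketch (a hypothetical function and toy sentence, not code from the lecture) of how a masked-word objective manufactures labeled training pairs from raw text. For clarity it masks each position in turn; real systems instead mask a random fraction (roughly 15%) of subword tokens.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Turn one raw sentence into (masked_input, target_word) training pairs.

    The hidden words themselves are the labels, so no human annotation is needed.
    """
    tokens = sentence.split()
    examples = []
    for i, target in enumerate(tokens):
        masked = tokens.copy()
        masked[i] = mask_token                       # hide this word
        examples.append((" ".join(masked), target))  # the model must predict it
    return examples

for masked_input, target in make_masked_examples("the swan glides across the lake"):
    print(f"{masked_input}  ->  predict '{target}'")
```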

Natural Language Processing Applications

  • Optical Character Recognition (OCR)
  • Predictive Text and Search Suggestions
  • Named Entity Recognition (NER)
  • Machine Translation
  • Conversational AI Systems
  • Sentiment Analysis
  • Text Summarization
  • Question Answering

Language Models: From Simple to Sophisticated

N-gram Models (Traditional Approach)

Unigrams

P(sentence) = ∏ᵢ P(wordᵢ)

Assumes every word is independent of the others

Bigrams

P(wordᵢ | wordᵢ₋₁)

Conditions each word on the previous word only

Problems

  • Sparsity issues
  • Storage explosion
  • Out-of-vocabulary words
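
To make the bigram idea and its sparsity problem concrete, here is a small count-based sketch in Python; the toy corpus and function names are invented for illustration.

```python
from collections import Counter, defaultdict

corpus = "the swan swims . the swan flies . the duck swims .".split()

# Count how often each word follows each context word in the toy corpus.
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

def bigram_prob(word, prev):
    """Estimate P(word | prev) from counts; unseen pairs get probability zero."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("swan", "the"))    # 0.667: "the swan" appears 2 of 3 times after "the"
print(bigram_prob("flies", "duck"))  # 0.0: never observed together -> the sparsity problem
```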

Neural Language Models

Word Embeddings Innovation

  • Convert words to vectors (e.g., 200 dimensions)
  • Semantically similar words have similar embeddings
  • Famous example: King - Man + Woman ≈ Queen
  • Still limited by fixed window approach
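
The analogy arithmetic can be sketched numerically. The four-dimensional vectors below are invented placeholders (real embeddings use hundreds of dimensions, e.g. 200), but they show how "king - man + woman" lands nearest to "queen" under cosine similarity.

```python
import numpy as np

# Toy embeddings with made-up values, chosen only to illustrate the arithmetic.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.3]),
    "queen": np.array([0.9, 0.1, 0.8, 0.3]),
    "man":   np.array([0.1, 0.9, 0.1, 0.2]),
    "woman": np.array([0.1, 0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen
# (input words could be excluded from the search, but queen wins regardless).
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```
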
Key Technical Insights

Self-Supervised Learning Revolution

Creating supervision signals from the data itself eliminates the need for expensive human labels

Attention Mechanism Breakthrough

Transformer attention allows models to focus on relevant parts of input regardless of distance

Multi-Stage Training Pipeline

From pre-training to RLHF, each stage adds capabilities and human alignment

The Transformer Revolution

Motivation for Transformers

  • Need for long-range context (not just 2-3 words)
  • Parallel processing capability for large datasets
  • Contextual embeddings that change based on surrounding words

Multi-Head Attention

Problem

A single attention head cannot capture every type of relationship between words

Solution

Multiple attention "heads" focus on different aspects

Examples of Different Heads:

  • Subject-verb relationships
  • Object relationships
  • Topical relationships
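
A minimal NumPy sketch of multi-head attention follows: each head applies its own query/key/value projections, runs scaled dot-product attention, and the heads' outputs are concatenated. The weights are random placeholders rather than a trained model, and a real layer adds a final output projection.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: weight the values by query-key similarity."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))   # token embeddings for one short sentence

# Each head gets its own trainable projections, so it can specialize
# (e.g., one head tracking subject-verb links, another topical similarity).
heads = []
for _ in range(n_heads):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(x @ Wq, x @ Wk, x @ Wv))

out = np.concatenate(heads, axis=-1)      # concatenate the heads' outputs
print(out.shape)                          # (5, 16)
```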

Complete Transformer Architecture

Multi-Head Attention

Capture different relationship types simultaneously

Query-Key-Value system with trainable weights

Skip Connections

Enable deep network training and gradient flow

Residual connections around attention blocks

Layer Normalization

Stabilize training and improve convergence

Normalize inputs to each layer

Positional Encoding

Handle word order in sequence

Sinusoidal or learned position embeddings

Scale: GPT-4 uses many layers with multiple heads each, creating an incredibly sophisticated attention network.
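
The remaining pieces can be sketched in a few lines of NumPy. The block below wires together sinusoidal positional encoding, a residual (skip) connection, and layer normalization in the original post-norm layout; the "sublayer" here is a stand-in linear map where a real transformer would use multi-head attention and a feed-forward network.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position embeddings: every position gets a distinct wave pattern."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def residual_block(x, sublayer):
    """Skip connection around a sublayer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

seq_len, d_model = 5, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)

# Stand-in sublayer (a fixed linear map) just to show the wiring.
W = rng.normal(size=(d_model, d_model)) * 0.1
out = residual_block(x, lambda h: h @ W)
print(out.shape)  # (5, 16)
```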

From GPT to ChatGPT: The Training Pipeline

Stage 1: Pre-training

Train on massive internet text (~10TB for GPT-3)

Details: 175 billion parameters, ~$10M cost, 1 month on 1000 GPUs
Result: Powerful but 'untamed' base model

Stage 2: Supervised Fine-Tuning (SFT)

Use ~100,000 high-quality human-written responses

Details: Teaches assistant-like behavior, ~$100K cost
Result: GPT-3.5 level capabilities

Stage 3: Reward Modeling

Train model to rank responses like humans do

Details: Uses pairwise comparisons (A better than B)
Result: Automated quality assessment
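
A hedged sketch of the pairwise idea: a Bradley-Terry style loss rewards the reward model for scoring the human-preferred response above the rejected one. The scores below are placeholder numbers, not outputs of a real model.

```python
import math

def pairwise_ranking_loss(score_preferred, score_rejected):
    """Pairwise ranking loss: low when the preferred response scores higher."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Placeholder reward-model scores for two candidate responses to the same prompt.
print(round(pairwise_ranking_loss(2.0, -1.0), 3))  # 0.049: ranking matches the human label
print(round(pairwise_ranking_loss(-1.0, 2.0), 3))  # 3.049: ranking contradicts the label
```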

Stage 4: RLHF

Use reward model to guide further training

Details: Clipping mechanism prevents over-optimization
Result: ChatGPT-level performance

RLHF: The Secret Sauce

Reinforcement Learning from Human Feedback (RLHF) uses the reward model to guide further training. A clipping mechanism prevents over-optimization, balancing helpfulness with natural language patterns to achieve ChatGPT-level performance.
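
In standard RLHF pipelines the clipping referred to above is a PPO-style clipped surrogate objective. The toy function below shows only that clipping step, with made-up probabilities and advantages; a real implementation works on per-token log-probabilities across large batches.

```python
def clipped_objective(advantage, new_prob, old_prob, eps=0.2):
    """Cap how far one update can push the policy away from the model
    that generated the response, preventing over-optimization."""
    ratio = new_prob / old_prob
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# With a positive advantage (the reward model liked the response), the gain
# from raising its probability is capped once the ratio exceeds 1 + eps.
print(clipped_objective(advantage=1.0, new_prob=0.9, old_prob=0.5))   # 1.2, capped (raw ratio 1.8)
print(clipped_objective(advantage=1.0, new_prob=0.55, old_prob=0.5))  # ~1.1, inside the clip range
```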

Why Transformers Work: Critical Success Factors

Technical Advantages

Parallel Processing

Unlike RNNs, can process entire sequences simultaneously

Long-Range Dependencies

Attention connects distant words directly

Flexible Context

Contextual embeddings adapt to different meanings

Scalability

Architecture scales to very large models

Key Innovations

Self-Supervised Learning

Eliminates need for labeled data

Attention Mechanism

Captures complex linguistic relationships

Multi-Stage Training

Aligns models with human preferences

Massive Scale

Both data and parameters reach unprecedented sizes

Limitations and Broader Implications

Current Limitations

Language Subtleties

Struggle with irony, puns, and complex slogans

No Explicit Rules

Have no explicit understanding of grammatical or linguistic rules

Hallucination Risk

Can generate biased or false content

Resource Requirements

Enormous computational resources needed

Societal Implications

Democratization Concern

AI power concentrated in the hands of a few organizations

Job Displacement

Potential impact on various industries

Need for Understanding

Importance of technical literacy

Rapid Evolution

Continuous advancement in capabilities

Session Summary and Key Takeaways

The transformer architecture represents a fundamental breakthrough in AI, enabling the language understanding capabilities we see in modern systems. The progression from simple n-grams to sophisticated attention mechanisms illustrates how technical innovations can dramatically expand AI capabilities.

Technical Foundation

Understanding transformer mechanics helps us use AI tools more effectively while being aware of their limitations.

Human-AI Collaboration

These models are powerful tools that augment rather than replace human intelligence in solving complex problems.