Class 2 Notes

Fundamentals of Generative AI and Large Language Models

Professor Pavlos - Technical Deep Dive

About the Instructor

Professor Pavlos - Scientific Program Director of the Institute for Applied Computational Science at Harvard SEAS. Expertise spans astronomy, physics, machine learning, and statistics. Leads the graduate program in data science and teaches flagship courses including CS109.

AI Evolution: Accelerating Transformation

  • Agricultural Age: ~1,000 years
  • Steam Revolution: 100 years
  • Electrical Age: ~100 years
  • Digital Age: 40 years
  • AI Era: accelerating rapidly

Key Insight: Acceleration Pattern

Each technological revolution happens faster than the last. We cannot treat AI as a slow-moving frontier—it's already here and moving at unprecedented speed.

Defining AI Through Examples

❌ Not AI

  • Abacus
  • Calculators
  • Programmable calculators

❓ Debatable

  • Deep Blue (chess system)
  • Relies on brute-force search rather than learning
  • Expertise confined to a single domain

✅ Clearly AI

  • AlphaGo (Monte Carlo + RL)
  • ChatGPT
  • Modern LLMs

Three Machine Learning Paradigms

Unsupervised Learning

Discovers hidden patterns without labels

Example: Clustering swan images without knowing they're swans
Pros: Useful for exploration
Cons: Difficult to evaluate

Supervised Learning

Learns from labeled examples (input-output pairs)

Example: Training to distinguish swans from non-swans
Pros: Clear evaluation metrics
Cons: Requires expensive human-labeled datasets

Self-Supervised Learning

Creates supervision signal from data itself

Example: Hide part of sentence and predict missing piece
Pros: No human labels needed, scalable
Cons: Complex architecture required

Self-Supervised Learning: The Game Changer

Self-supervised learning is the foundation of modern language models. By creating supervision signals from the data itself (like masking words and predicting them), it eliminates the need for expensive human labeling while enabling training on massive datasets.
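
As a concrete illustration, here is a minimal Python sketch (a hypothetical function and toy sentence, not code from the lecture) of how a masked-word objective manufactures labeled training pairs from raw text. For clarity it masks each position in turn; real systems instead mask a random fraction (roughly 15%) of subword tokens.

```python
def make_masked_examples(sentence, mask_token="[MASK]"):
    """Turn one raw sentence into (masked_input, target_word) training pairs.

    The hidden words themselves are the labels, so no human annotation is needed.
    """
    tokens = sentence.split()
    examples = []
    for i, target in enumerate(tokens):
        masked = tokens.copy()
        masked[i] = mask_token                       # hide this word
        examples.append((" ".join(masked), target))  # the model must predict it
    return examples

for masked_input, target in make_masked_examples("the swan glides across the lake"):
    print(f"{masked_input}  ->  predict '{target}'")
```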

Natural Language Processing Applications

  • Optical Character Recognition (OCR)
  • Predictive Text and Search Suggestions
  • Named Entity Recognition (NER)
  • Machine Translation
  • Conversational AI Systems
  • Sentiment Analysis
  • Text Summarization
  • Question Answering

Language Models: From Simple to Sophisticated

N-gram Models (Traditional Approach)

Unigrams

P(sentence) = ∏ᵢ P(wordᵢ)

Assumes every word is independent of the others

Bigrams

P(wordᵢ | wordᵢ₋₁)

Conditions each word on the previous word only

Problems

  • Sparsity issues
  • Storage explosion
  • Out-of-vocabulary words
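
To make the bigram idea and its sparsity problem concrete, here is a small count-based sketch in Python; the toy corpus and function names are invented for illustration.

```python
from collections import Counter, defaultdict

corpus = "the swan swims . the swan flies . the duck swims .".split()

# Count how often each word follows each context word in the toy corpus.
bigram_counts = defaultdict(Counter)
for prev, curr in zip(corpus, corpus[1:]):
    bigram_counts[prev][curr] += 1

def bigram_prob(word, prev):
    """Estimate P(word | prev) from counts; unseen pairs get probability zero."""
    total = sum(bigram_counts[prev].values())
    return bigram_counts[prev][word] / total if total else 0.0

print(bigram_prob("swan", "the"))    # 0.667: "the swan" appears 2 of 3 times after "the"
print(bigram_prob("flies", "duck"))  # 0.0: never observed together -> the sparsity problem
```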

Neural Language Models

Word Embeddings Innovation

  • Convert words to vectors (e.g., 200 dimensions)
  • Semantically similar words have similar embeddings
  • Famous example: King - Man + Woman ≈ Queen
  • Still limited by fixed window approach
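
The analogy arithmetic can be sketched numerically. The four-dimensional vectors below are invented placeholders (real embeddings use hundreds of dimensions, e.g. 200), but they show how "king - man + woman" lands nearest to "queen" under cosine similarity.

```python
import numpy as np

# Toy embeddings with made-up values, chosen only to illustrate the arithmetic.
emb = {
    "king":  np.array([0.9, 0.8, 0.1, 0.3]),
    "queen": np.array([0.9, 0.1, 0.8, 0.3]),
    "man":   np.array([0.1, 0.9, 0.1, 0.2]),
    "woman": np.array([0.1, 0.1, 0.9, 0.2]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# king - man + woman should land closest to queen
# (input words could be excluded from the search, but queen wins regardless).
target = emb["king"] - emb["man"] + emb["woman"]
best = max(emb, key=lambda w: cosine(emb[w], target))
print(best)  # queen
```
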
Key Technical Insights

Self-Supervised Learning Revolution

Creating supervision signals from the data itself eliminates the need for expensive human labels

Attention Mechanism Breakthrough

Transformer attention allows models to focus on relevant parts of input regardless of distance

Multi-Stage Training Pipeline

From pre-training to RLHF, each stage adds capabilities and human alignment

The Transformer Revolution

Motivation for Transformers

  • Need for long-range context (not just 2-3 words)
  • Parallel processing capability for large datasets
  • Contextual embeddings that change based on surrounding words

Multi-Head Attention

Problem

A single attention head cannot capture every type of relationship between words

Solution

Multiple attention "heads" focus on different aspects

Examples of Different Heads:

  • Subject-verb relationships
  • Object relationships
  • Topical relationships
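
A minimal NumPy sketch of multi-head attention follows: each head applies its own query/key/value projections, runs scaled dot-product attention, and the heads' outputs are concatenated. The weights are random placeholders rather than a trained model, and a real layer adds a final output projection.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: weight the values by query-key similarity."""
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    return softmax(scores) @ V

seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads
x = rng.normal(size=(seq_len, d_model))   # token embeddings for one short sentence

# Each head gets its own trainable projections, so it can specialize
# (e.g., one head tracking subject-verb links, another topical similarity).
heads = []
for _ in range(n_heads):
    Wq, Wk, Wv = (rng.normal(size=(d_model, d_head)) for _ in range(3))
    heads.append(attention(x @ Wq, x @ Wk, x @ Wv))

out = np.concatenate(heads, axis=-1)      # concatenate the heads' outputs
print(out.shape)                          # (5, 16)
```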

Complete Transformer Architecture

Multi-Head Attention

Capture different relationship types simultaneously

Query-Key-Value system with trainable weights

Skip Connections

Enable deep network training and gradient flow

Residual connections around attention blocks

Layer Normalization

Stabilize training and improve convergence

Normalize inputs to each layer

Positional Encoding

Handle word order in sequence

Sinusoidal or learned position embeddings

Scale: GPT-4 uses many layers with multiple heads each, creating an incredibly sophisticated attention network.
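
The remaining pieces can be sketched in a few lines of NumPy. The block below wires together sinusoidal positional encoding, a residual (skip) connection, and layer normalization in the original post-norm layout; the "sublayer" here is a stand-in linear map where a real transformer would use multi-head attention and a feed-forward network.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position embeddings: every position gets a distinct wave pattern."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000.0, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))

def layer_norm(x, eps=1e-5):
    """Normalize each token's features to zero mean and unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def residual_block(x, sublayer):
    """Skip connection around a sublayer, followed by layer normalization."""
    return layer_norm(x + sublayer(x))

seq_len, d_model = 5, 16
rng = np.random.default_rng(0)
x = rng.normal(size=(seq_len, d_model)) + positional_encoding(seq_len, d_model)

# Stand-in sublayer (a fixed linear map) just to show the wiring.
W = rng.normal(size=(d_model, d_model)) * 0.1
out = residual_block(x, lambda h: h @ W)
print(out.shape)  # (5, 16)
```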

From GPT to ChatGPT: The Training Pipeline

Stage 1: Pre-training

Train on massive internet text (~10TB for GPT-3)

Details: 175 billion parameters, ~$10M cost, 1 month on 1000 GPUs
Result: Powerful but 'untamed' base model

Stage 2: Supervised Fine-Tuning (SFT)

Use ~100,000 high-quality human-written responses

Details: Teaches assistant-like behavior, ~$100K cost
Result: GPT-3.5 level capabilities

Stage 3: Reward Modeling

Train model to rank responses like humans do

Details: Uses pairwise comparisons (A better than B)
Result: Automated quality assessment
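
A hedged sketch of the pairwise idea: a Bradley-Terry style loss rewards the reward model for scoring the human-preferred response above the rejected one. The scores below are placeholder numbers, not outputs of a real model.

```python
import math

def pairwise_ranking_loss(score_preferred, score_rejected):
    """Pairwise ranking loss: low when the preferred response scores higher."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# Placeholder reward-model scores for two candidate responses to the same prompt.
print(round(pairwise_ranking_loss(2.0, -1.0), 3))  # 0.049: ranking matches the human label
print(round(pairwise_ranking_loss(-1.0, 2.0), 3))  # 3.049: ranking contradicts the label
```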

Stage 4: RLHF

Use reward model to guide further training

Details: Clipping mechanism prevents over-optimization
Result: ChatGPT-level performance

RLHF: The Secret Sauce

Reinforcement Learning from Human Feedback (RLHF) uses the reward model to guide further training. A clipping mechanism prevents over-optimization, balancing helpfulness with natural language patterns to achieve ChatGPT-level performance.
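
In standard RLHF pipelines the clipping referred to above is a PPO-style clipped surrogate objective. The toy function below shows only that clipping step, with made-up probabilities and advantages; a real implementation works on per-token log-probabilities across large batches.

```python
def clipped_objective(advantage, new_prob, old_prob, eps=0.2):
    """Cap how far one update can push the policy away from the model
    that generated the response, preventing over-optimization."""
    ratio = new_prob / old_prob
    clipped_ratio = max(min(ratio, 1 + eps), 1 - eps)
    return min(ratio * advantage, clipped_ratio * advantage)

# With a positive advantage (the reward model liked the response), the gain
# from raising its probability is capped once the ratio exceeds 1 + eps.
print(clipped_objective(advantage=1.0, new_prob=0.9, old_prob=0.5))   # 1.2, capped (raw ratio 1.8)
print(clipped_objective(advantage=1.0, new_prob=0.55, old_prob=0.5))  # ~1.1, inside the clip range
```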

Why Transformers Work: Critical Success Factors

Technical Advantages

Parallel Processing

Unlike RNNs, can process entire sequences simultaneously

Long-Range Dependencies

Attention connects distant words directly

Flexible Context

Contextual embeddings adapt to different meanings

Scalability

Architecture scales to very large models

Key Innovations

Self-Supervised Learning

Eliminates need for labeled data

Attention Mechanism

Captures complex linguistic relationships

Multi-Stage Training

Aligns models with human preferences

Massive Scale

Both data and parameters reach unprecedented sizes

Limitations and Broader Implications

Current Limitations

Language Subtleties

Struggle with irony, puns, and complex slogans

No Explicit Rules

Have no explicit understanding of grammatical or linguistic rules

Hallucination Risk

Can generate biased or false content

Resource Requirements

Enormous computational resources needed

Societal Implications

Democratization Concern

AI power concentrated in the hands of a few organizations

Job Displacement

Potential impact on various industries

Need for Understanding

Importance of technical literacy

Rapid Evolution

Continuous advancement in capabilities

Session Summary and Key Takeaways

The transformer architecture represents a fundamental breakthrough in AI, enabling the language understanding capabilities we see in modern systems. The progression from simple n-grams to sophisticated attention mechanisms illustrates how technical innovations can dramatically expand AI capabilities.

Technical Foundation

Understanding transformer mechanics helps us use AI tools more effectively while being aware of their limitations.

Human-AI Collaboration

These models are powerful tools that augment rather than replace human intelligence in solving complex problems.