Professor Pavlos - Scientific Program Director of the Institute for Applied Computational Science at Harvard SEAS. Expertise spans astronomy, physics, machine learning, and statistics. Leads the graduate program in data science and teaches flagship courses including CS109.
~1,000 years
100 years
~100 years
40 years
Accelerating rapidly
Each technological revolution happens faster than the last. We cannot treat AI as a slow-moving frontier—it's already here and moving at unprecedented speed.
Discovers hidden patterns without labels
Learns from labeled examples (input-output pairs)
Creates supervision signal from data itself
Self-supervised learning is the foundation of modern language models. By creating supervision signals from the data itself (like masking words and predicting them), it eliminates the need for expensive human labeling while enabling training on massive datasets.
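To make this concrete, here is a minimal sketch of how a supervision signal can be manufactured from raw text alone: hide a word and ask the model to predict it. The sentence, mask token, and function name below are illustrative, not taken from any particular library.

```python
import random

def make_masked_example(sentence, mask_token="[MASK]"):
    """Turn a raw sentence into a (masked input, target) training pair.

    No human labels are needed: the removed word itself is the label.
    """
    words = sentence.split()
    position = random.randrange(len(words))        # pick a word to hide
    target = words[position]                       # the label comes from the data itself
    masked = words[:position] + [mask_token] + words[position + 1:]
    return " ".join(masked), target

# Example: raw text becomes a self-labeled training example.
random.seed(0)
inputs, label = make_masked_example("the cat sat on the mat")
print(inputs)   # e.g. "the cat sat on the [MASK]"
print(label)    # e.g. "mat"
```

Applied over billions of sentences, this simple trick turns the entire internet into training data without a single human annotation.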
Optical Character Recognition (OCR)
Predictive Text and Search Suggestions
Named Entity Recognition (NER)
Machine Translation
Conversational AI Systems
Sentiment Analysis
Text Summarization
Question Answering
P(sentence) = ∏ᵢ P(wordᵢ)
Assume word independence
P(wordᵢ | wordᵢ₋₁)
Consider word pairs
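A small sketch contrasting the two assumptions: a unigram model multiplies standalone word probabilities, while a bigram model conditions each word on the previous one. The tiny corpus and function names are made up purely for illustration.

```python
from collections import Counter

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Unigram model: P(sentence) = ∏ᵢ P(wordᵢ), assuming words are independent.
unigram_counts = Counter(corpus)
total = len(corpus)

def p_unigram(word):
    return unigram_counts[word] / total

# Bigram model: P(sentence) = ∏ᵢ P(wordᵢ | wordᵢ₋₁), conditioning on the previous word.
bigram_counts = Counter(zip(corpus, corpus[1:]))

def p_bigram(word, previous_word):
    return bigram_counts[(previous_word, word)] / unigram_counts[previous_word]

def sentence_probability(words, bigram=False):
    p = p_unigram(words[0])
    for prev, w in zip(words, words[1:]):
        p *= p_bigram(w, prev) if bigram else p_unigram(w)
    return p

print(sentence_probability(["the", "cat", "sat"]))               # independence assumption
print(sentence_probability(["the", "cat", "sat"], bigram=True))  # word-pair assumption
```

The bigram model already assigns higher probability to word orders it has seen before, which is exactly the behavior the independence assumption throws away.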
Creating supervision signals from data itself eliminates need for expensive human labels
Transformer attention allows models to focus on relevant parts of input regardless of distance
From pre-training to RLHF, each stage adds capabilities and human alignment
Single attention can't capture all relationship types
Multiple attention "heads" focus on different aspects
Capture different relationship types simultaneously
Query-Key-Value system with trainable weights
Enable deep network training and gradient flow
Residual connections around attention blocks
Stabilize training and improve convergence
Normalize inputs to each layer
Handle word order in sequence
Sinusoidal or learned position embeddings
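A minimal NumPy sketch of how these pieces fit together in one encoder-style block: trainable Query-Key-Value projections, several attention heads in parallel, a residual connection with layer normalization around the attention sublayer, and sinusoidal position encodings. Dimensions, initialization, and names are illustrative, and the feed-forward sublayer is omitted for brevity.

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal position encodings so the model knows word order."""
    pos = np.arange(seq_len)[:, None]
    i = np.arange(d_model)[None, :]
    angles = pos / np.power(10000, (2 * (i // 2)) / d_model)
    return np.where(i % 2 == 0, np.sin(angles), np.cos(angles))   # (seq_len, d_model)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def layer_norm(x, eps=1e-5):
    """Normalize inputs to the layer to stabilize training."""
    return (x - x.mean(axis=-1, keepdims=True)) / (x.std(axis=-1, keepdims=True) + eps)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Scaled dot-product attention with several heads in parallel.

    Each head works on its own slice of the Q/K/V projections, so different
    heads can capture different relationship types simultaneously.
    """
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv                        # trainable projections
    split = lambda M: M.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)               # (n_heads, seq_len, d_head)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)   # every word attends to every word
    out = softmax(scores) @ Vh
    out = out.transpose(1, 0, 2).reshape(seq_len, d_model)  # re-merge the heads
    return out @ Wo

def encoder_block(x, weights, n_heads=4):
    """Attention sublayer wrapped in a residual connection and layer norm."""
    attended = multi_head_attention(x, *weights, n_heads=n_heads)
    return layer_norm(x + attended)                          # residual, then normalize

# Toy usage: 6 "words", model width 16, random (untrained) weights.
rng = np.random.default_rng(0)
seq_len, d_model = 6, 16
embeddings = rng.normal(size=(seq_len, d_model))
x = embeddings + sinusoidal_positions(seq_len, d_model)      # inject word order
weights = [rng.normal(scale=0.1, size=(d_model, d_model)) for _ in range(4)]
print(encoder_block(x, weights).shape)                       # (6, 16)
```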
Scale: GPT-4 stacks many such layers, each with multiple attention heads, creating an extremely large and expressive attention network.
Train on massive internet text (roughly 300 billion tokens for GPT-3)
Use ~100,000 high-quality human-written responses
Train model to rank responses like humans do
Use reward model to guide further training
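The ranking stage is commonly trained with a pairwise loss: given a human preference between two candidate responses, the reward model should score the preferred one higher. The sketch below shows that loss in isolation; a real reward model would be a full language model with a scalar output head, and the function names here are illustrative.

```python
import math

def pairwise_ranking_loss(reward_chosen, reward_rejected):
    """Loss is small when the human-preferred response outscores the rejected one.

    Standard pairwise form: -log sigmoid(r_chosen - r_rejected).
    """
    return -math.log(1.0 / (1.0 + math.exp(-(reward_chosen - reward_rejected))))

# If the model already ranks responses the way humans did, the loss is low;
# if it prefers the rejected response, the loss (and its gradient) is large.
print(pairwise_ranking_loss(reward_chosen=2.0, reward_rejected=-1.0))  # ~0.05
print(pairwise_ranking_loss(reward_chosen=-1.0, reward_rejected=2.0))  # ~3.05
```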
Reinforcement Learning from Human Feedback (RLHF) uses the reward model to guide further training. A clipping mechanism prevents over-optimization, balancing helpfulness with natural language patterns to achieve ChatGPT-level performance.
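The clipping idea can be sketched in a few lines, in the spirit of PPO's clipped objective: updates are driven by the ratio between new and old token probabilities, but that ratio is kept within a narrow band so no single reward signal can push the model too far from its previous behavior. Epsilon and the toy numbers are illustrative.

```python
def clipped_objective(prob_new, prob_old, advantage, epsilon=0.2):
    """PPO-style clipped objective for a single token/action.

    The probability ratio is confined to [1 - epsilon, 1 + epsilon], which
    prevents over-optimizing against the reward model in one update.
    """
    ratio = prob_new / prob_old
    clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
    # Take the more pessimistic (smaller) of the two estimates.
    return min(ratio * advantage, clipped_ratio * advantage)

# A large jump in probability (ratio 1.8) gets no extra credit beyond the clip.
print(clipped_objective(prob_new=0.09, prob_old=0.05, advantage=1.0))  # capped at 1.2
print(clipped_objective(prob_new=0.05, prob_old=0.05, advantage=1.0))  # ratio 1.0 -> 1.0
```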
Unlike RNNs, can process entire sequences simultaneously
Attention connects distant words directly
Contextual embeddings adapt to different meanings (illustrated in the sketch after this list)
Architecture scales to very large models
Eliminates need for labeled data
Captures complex linguistic relationships
Aligns models with human preferences
Both data and parameters reach unprecedented sizes
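As a concrete illustration of the contextual-embedding point above, the sketch below, assuming the Hugging Face transformers library and a BERT checkpoint are available, shows the same word receiving different vectors in different sentences.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embedding_of(word, sentence):
    """Return the contextual vector the model assigns to `word` inside `sentence`."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]          # (tokens, hidden_dim)
    word_id = tokenizer.convert_tokens_to_ids(word)
    position = (inputs["input_ids"][0] == word_id).nonzero()[0].item()
    return hidden[position]

# The vector for "bank" depends on its context, unlike a static word embedding.
financial = embedding_of("bank", "she deposited money at the bank")
river = embedding_of("bank", "they had a picnic on the river bank")
print(torch.cosine_similarity(financial, river, dim=0).item())  # noticeably below 1.0
```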
Struggle with irony, puns, complex slogans
Have no explicit understanding of grammatical or linguistic rules
Can generate biased or false content
Enormous computational resources needed
AI power concentrated in few hands
Potential impact on various industries
Importance of technical literacy
Continuous advancement in capabilities
The transformer architecture represents a fundamental breakthrough in AI, enabling the language understanding capabilities we see in modern systems. The progression from simple n-grams to sophisticated attention mechanisms illustrates how technical innovations can dramatically expand AI capabilities.
Understanding transformer mechanics helps us use AI tools more effectively while being aware of their limitations.
These models are powerful tools that augment rather than replace human intelligence in solving complex problems.