Part 1: Using statistical hypothesis testing to detect AI-generated text that has been edited by humans
Focus: Mixed authorship scenarios beyond simple AI vs. human classification
Part 2: Confidence assessment challenges in Large Language Models and practical usage strategies
Key Finding: LLMs show systematic overconfidence, and their reported confidence shows no correlation with actual accuracy
Statistical methods based on log loss and higher criticism can identify, with roughly 60-70% accuracy, when AI-generated text has been edited by humans
Large language models consistently report high confidence (90%+) while achieving only 60-70% accuracy, with no correlation between confidence and correctness
The most effective use of LLMs involves experts generating possible answers while applying human judgment for final decisions
Professor Alon Kipnis - Statistical Hypothesis Testing Approach
Going beyond simple AI-vs-human text classification to detect mixed authorship, a challenge that grows more pressing as AI tools become integrated into everyday writing workflows. The specific problem addressed is identifying when, and where, AI-generated text has been edited by a human.
Step 1: Sentence-level log loss
Description: Compute the negative log-probability of each sentence under a language model
Formula: Log loss = -Σ_t log P(token_t | context)
Interpretation: Higher values (text the model finds surprising) suggest human authorship or editing
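A minimal sketch of this per-sentence scoring step, assuming a HuggingFace causal language model is used as the scorer; the gpt2 model and the averaging over tokens are illustrative choices, not necessarily the exact setup from the talk.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def sentence_log_loss(sentence: str) -> float:
        """Average negative log-probability per token of the sentence under the model."""
        ids = tokenizer(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            # Passing labels=ids makes the model return the mean cross-entropy (log loss).
            out = model(ids, labels=ids)
        return out.loss.item()

    # Higher log loss = the sentence is "surprising" to the model, which under this
    # framework is taken as evidence of human authorship or editing.
    print(sentence_log_loss("The quick brown fox jumps over the lazy dog."))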
Step 2: Higher criticism
Description: Combine the per-sentence evidence into a single document-level test statistic
Method: Higher criticism (Donoho-Jin 2004 statistical framework)
Advantage: Near-optimal for detecting sparse signals, i.e., when only a few sentences were edited
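A sketch of the Donoho-Jin higher-criticism statistic applied to per-sentence p-values; obtaining those p-values (for example, by comparing each sentence's log loss against a null distribution built from purely machine-generated text) is assumed here, and restricting attention to the smallest half of the p-values is the common convention rather than a detail taken from the talk.

    import numpy as np

    def higher_criticism(pvals: np.ndarray, gamma: float = 0.5) -> float:
        """Donoho-Jin (2004) HC statistic over the smallest `gamma` fraction of p-values."""
        n = len(pvals)
        p_sorted = np.sort(pvals)
        i = np.arange(1, n + 1)
        hc = np.sqrt(n) * (i / n - p_sorted) / np.sqrt(p_sorted * (1 - p_sorted) + 1e-12)
        k = max(1, int(gamma * n))
        return hc[:k].max()

    # Large HC values indicate a sparse set of unusually significant sentences,
    # i.e., evidence that some sentences in the document were edited.
    rng = np.random.default_rng(0)
    null_pvals = rng.uniform(size=200)            # a fully machine-generated document
    edited = null_pvals.copy()
    edited[:10] = rng.uniform(0, 0.001, size=10)  # a few human-edited sentences
    print(higher_criticism(null_pvals), higher_criticism(edited))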
Step 3: Localization of edited sentences
Description: Built-in capability to identify which sentences were edited
Feature: Not just detection but localization
Benefit: Actionable insights for editors
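One way the same machinery yields localization, offered as a sketch rather than the exact procedure from the talk: take the p-value at which the HC statistic peaks as a data-driven threshold and flag every sentence whose p-value falls below it.

    import numpy as np

    def flag_edited_sentences(pvals: np.ndarray, gamma: float = 0.5) -> np.ndarray:
        """Indices of sentences whose p-values fall below the HC-maximizing threshold."""
        n = len(pvals)
        order = np.argsort(pvals)
        p_sorted = pvals[order]
        i = np.arange(1, n + 1)
        hc = np.sqrt(n) * (i / n - p_sorted) / np.sqrt(p_sorted * (1 - p_sorted) + 1e-12)
        k = max(1, int(gamma * n))
        i_star = int(np.argmax(hc[:k]))   # position where HC peaks
        return order[: i_star + 1]        # every sentence below that p-value gets flagged

    pvals = np.array([0.40, 0.0004, 0.55, 0.72, 0.001, 0.31, 0.88, 0.09])
    print(flag_edited_sentences(pvals))   # flags sentences 1 and 4, the unusually small p-values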
Professor Yudi Pawitan - The Overconfidence Problem
The Problem: How confident should we be in AI answers, and how confident are the AI systems themselves?
Critical Finding: Traditional confidence methods fail with modern LLMs due to their scale and complexity.
Scale: Standard statistical confidence methods fail for models with hundreds of billions of parameters
Impact: Traditional uncertainty quantification cannot be applied
Opacity: There is no way to interpret where specific knowledge is stored among the parameters
Impact: No understanding of the models' internal confidence mechanisms
Prompt sensitivity: Confidence assessments are highly sensitive to question phrasing
Impact: Unreliable and inconsistent responses
Miscalibration: High internal token probabilities don't correlate with actual accuracy
Impact: Technical confidence measures are misleading
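For concreteness, a sketch of what an "internal token probability" measure looks like in practice: the average log-probability the model assigns to its own answer tokens. The gpt2 model, the prompt, and the helper name are illustrative assumptions; the session's point is that this number can be high even when the answer is wrong.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def answer_logprob(prompt: str, answer: str) -> float:
        """Mean log-probability of the answer tokens given the prompt (an 'internal confidence' proxy)."""
        # Assumes tokenizing prompt+answer splits at the same boundary as tokenizing them separately.
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
        with torch.no_grad():
            log_probs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
        positions = range(prompt_len - 1, full_ids.shape[1] - 1)
        token_lp = [log_probs[t, full_ids[0, t + 1]].item() for t in positions]
        return sum(token_lp) / len(token_lp)

    # A high value here does NOT imply the answer is correct.
    print(answer_logprob("The capital of Australia is", " Sydney"))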
• Model 1: 99% report ≥90% confidence; 67% actual accuracy; 17% change their answer when asked "are you sure?"
• Model 2: 78% report ≥90% confidence; 70% actual accuracy; 61% change their answer when prompted
• Model 3: 86% report ≥90% confidence; 68% actual accuracy; 87% change their answer when prompted
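A sketch of how this kind of calibration check can be run on one's own question set: record the model's self-reported confidence and a correctness flag for each question, then compare mean confidence with accuracy and look at their correlation. The numbers below are made up purely for illustration.

    import numpy as np

    # Illustrative data: self-reported confidence (0-1) and whether each answer was correct.
    confidence = np.array([0.95, 0.99, 0.90, 0.97, 0.92, 0.98, 0.95, 0.90])
    correct    = np.array([1,    0,    1,    0,    1,    0,    1,    0])

    gap = confidence.mean() - correct.mean()       # overconfidence: stated confidence minus accuracy
    corr = np.corrcoef(confidence, correct)[0, 1]  # near zero means confidence is uninformative

    print(f"mean confidence {confidence.mean():.2f}, accuracy {correct.mean():.2f}")
    print(f"overconfidence gap {gap:.2f}, confidence-correctness correlation {corr:.2f}")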
Human writing styles may converge with AI as people read more AI-generated content
Current methods treat sentences as independent, which may not reflect real editing patterns
Modern LLMs are too large for traditional statistical analysis methods
Small changes in how questions are asked can dramatically affect responses
Professor Pawitan compared LLMs to idiot savants: systems with remarkable capabilities in some areas but fundamental gaps in others, including self-assessment and confidence calibration. They can generate novel research ideas better than humans yet fail at simple tasks such as counting the letters in a word.
Scenario 1: User has no knowledge of the topic
Approach: Ask the LLM to rethink its answer; if it returns the same answer, confidence can be somewhat higher (see the sketch below)
Risk: The user must accept the answer at their own risk
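A sketch of that "ask it to rethink" consistency check. The query_llm function is a hypothetical placeholder for whatever chat API is in use; the idea is only that an answer which survives repeated questioning deserves somewhat more trust than one the model abandons.

    from collections import Counter

    def query_llm(prompt: str) -> str:
        """Hypothetical placeholder for a call to a chat-model API."""
        raise NotImplementedError

    def consistency_check(question: str, retries: int = 3) -> tuple[str, float]:
        """Ask, then repeatedly ask 'Are you sure?'; return the modal answer and its agreement rate."""
        answers = [query_llm(question)]
        for _ in range(retries):
            follow_up = f"{question}\nYour previous answer: {answers[-1]}\nAre you sure? Answer again."
            answers.append(query_llm(follow_up))
        best, count = Counter(answers).most_common(1)[0]
        return best, count / len(answers)   # low agreement = treat the answer with extra suspicion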
Scenario 2: User already knows the correct answer
Risk: Clever Hans Effect; the user may inadvertently provide clues
Value: Limited practical utility
Scenario 3: Expert-in-the-loop; the user is an expert but does not know the specific answer
Approach: Use the LLM to generate candidate answers, then apply expert judgment to choose among them
Advantage: The most productive and safest usage pattern
Historical Context: Clever Hans was a horse that appeared to do arithmetic but actually responded to subtle unconscious cues from its trainer.
LLM Implication: When testing LLMs with known answers, users may inadvertently provide clues that lead to correct responses, creating false impressions of capability.
Education: Detecting AI assistance in student submissions
Law: Verifying the authenticity of legal documents
Healthcare: Expert-guided AI assistance in diagnosis
Research: AI-human collaboration in scientific work
Professor Pawitan demonstrated expert-in-the-loop usage by asking GPT-4 about a magician in an Elizabeth Gilbert story. Through guided questioning, the LLM eventually provided the correct answer (H. Douglas), but when asked the same question the next day, it incorrectly claimed the magician had no name.
Lesson: LLMs can be useful for memory assistance when experts can validate responses, but answers are inconsistent across sessions.
This session provided critical insights into two fundamental challenges in AI systems: detecting mixed AI-human authorship and assessing confidence in LLM responses. Professor Kipnis demonstrated sophisticated statistical methods for detection, while Professor Pawitan revealed the systematic overconfidence problem in current LLMs and the failure of traditional confidence assessment methods.
The class highlighted the continued importance of human expertise and critical thinking in AI-assisted workflows, providing both technical methodologies and practical guidance for effective LLM usage.
Both presentations emphasized that while AI systems show remarkable capabilities, they require careful understanding of limitations and appropriate human oversight for reliable operation.