Part 1: Using statistical hypothesis testing to detect AI-generated text that has been edited by humans
Focus: Mixed authorship scenarios beyond simple AI vs. human classification
Part 2: Confidence assessment challenges in Large Language Models and practical usage strategies
Key Finding: LLMs show systematic overconfidence, and their reported confidence shows no correlation with actual accuracy
Statistical methods based on log loss and higher criticism can identify, with roughly 60-70% accuracy, when AI-generated text has been edited by humans
Large language models consistently report high confidence (90%+) while achieving only 60-70% accuracy, with no correlation between confidence and correctness
The most effective use of LLMs involves experts generating possible answers while applying human judgment for final decisions
Professor Alon Kipnis - Statistical Hypothesis Testing Approach
Going beyond simple AI-vs-human text classification to detect mixed authorship, a challenge that grows more pressing as AI tools become integrated into everyday writing workflows. The specific problem addressed is identifying when, and where, AI-generated text has been edited by a human.
Step 1: Sentence-level log loss
Description: Compute the negative log-probability of each sentence under a language model
Formula: Log loss = -Σ_t log P(token_t | context)
Interpretation: Higher values (text the model finds surprising) suggest human authorship or editing
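A minimal sketch of this per-sentence scoring step, assuming a HuggingFace causal language model is used as the scorer; the gpt2 model and the averaging over tokens are illustrative choices, not necessarily the exact setup from the talk.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def sentence_log_loss(sentence: str) -> float:
        """Average negative log-probability per token of the sentence under the model."""
        ids = tokenizer(sentence, return_tensors="pt").input_ids
        with torch.no_grad():
            # Passing labels=ids makes the model return the mean cross-entropy (log loss).
            out = model(ids, labels=ids)
        return out.loss.item()

    # Higher log loss = the sentence is "surprising" to the model, which under this
    # framework is taken as evidence of human authorship or editing.
    print(sentence_log_loss("The quick brown fox jumps over the lazy dog."))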
Step 2: Higher criticism
Description: Combine the per-sentence evidence into a single document-level test statistic
Method: Higher criticism (Donoho-Jin 2004 statistical framework)
Advantage: Near-optimal for detecting sparse signals, i.e., when only a few sentences were edited
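A sketch of the Donoho-Jin higher-criticism statistic applied to per-sentence p-values; obtaining those p-values (for example, by comparing each sentence's log loss against a null distribution built from purely machine-generated text) is assumed here, and restricting attention to the smallest half of the p-values is the common convention rather than a detail taken from the talk.

    import numpy as np

    def higher_criticism(pvals: np.ndarray, gamma: float = 0.5) -> float:
        """Donoho-Jin (2004) HC statistic over the smallest `gamma` fraction of p-values."""
        n = len(pvals)
        p_sorted = np.sort(pvals)
        i = np.arange(1, n + 1)
        hc = np.sqrt(n) * (i / n - p_sorted) / np.sqrt(p_sorted * (1 - p_sorted) + 1e-12)
        k = max(1, int(gamma * n))
        return hc[:k].max()

    # Large HC values indicate a sparse set of unusually significant sentences,
    # i.e., evidence that some sentences in the document were edited.
    rng = np.random.default_rng(0)
    null_pvals = rng.uniform(size=200)            # a fully machine-generated document
    edited = null_pvals.copy()
    edited[:10] = rng.uniform(0, 0.001, size=10)  # a few human-edited sentences
    print(higher_criticism(null_pvals), higher_criticism(edited))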
Step 3: Localization of edited sentences
Description: Built-in capability to identify which sentences were edited
Feature: Not just detection but localization
Benefit: Actionable insights for editors
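One way the same machinery yields localization, offered as a sketch rather than the exact procedure from the talk: take the p-value at which the HC statistic peaks as a data-driven threshold and flag every sentence whose p-value falls below it.

    import numpy as np

    def flag_edited_sentences(pvals: np.ndarray, gamma: float = 0.5) -> np.ndarray:
        """Indices of sentences whose p-values fall below the HC-maximizing threshold."""
        n = len(pvals)
        order = np.argsort(pvals)
        p_sorted = pvals[order]
        i = np.arange(1, n + 1)
        hc = np.sqrt(n) * (i / n - p_sorted) / np.sqrt(p_sorted * (1 - p_sorted) + 1e-12)
        k = max(1, int(gamma * n))
        i_star = int(np.argmax(hc[:k]))   # position where HC peaks
        return order[: i_star + 1]        # every sentence below that p-value gets flagged

    pvals = np.array([0.40, 0.0004, 0.55, 0.72, 0.001, 0.31, 0.88, 0.09])
    print(flag_edited_sentences(pvals))   # flags sentences 1 and 4, the unusually small p-values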
Professor Yudi Pawitan - The Overconfidence Problem
The Problem: How confident should we be in AI answers, and how confident are the AI systems themselves?
Critical Finding: Traditional confidence methods fail with modern LLMs due to their scale and complexity.
Scale: Standard statistical confidence methods fail for models with hundreds of billions of parameters
Impact: Traditional uncertainty quantification cannot be applied
Opacity: There is no way to interpret where specific knowledge is stored among the parameters
Impact: No understanding of the models' internal confidence mechanisms
Prompt sensitivity: Confidence assessments are highly sensitive to question phrasing
Impact: Unreliable and inconsistent responses
Miscalibration: High internal token probabilities don't correlate with actual accuracy
Impact: Technical confidence measures are misleading
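For concreteness, a sketch of what an "internal token probability" measure looks like in practice: the average log-probability the model assigns to its own answer tokens. The gpt2 model, the prompt, and the helper name are illustrative assumptions; the session's point is that this number can be high even when the answer is wrong.

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("gpt2")
    model = AutoModelForCausalLM.from_pretrained("gpt2")
    model.eval()

    def answer_logprob(prompt: str, answer: str) -> float:
        """Mean log-probability of the answer tokens given the prompt (an 'internal confidence' proxy)."""
        # Assumes tokenizing prompt+answer splits at the same boundary as tokenizing them separately.
        prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
        full_ids = tokenizer(prompt + answer, return_tensors="pt").input_ids
        with torch.no_grad():
            log_probs = torch.log_softmax(model(full_ids).logits[0, :-1], dim=-1)
        positions = range(prompt_len - 1, full_ids.shape[1] - 1)
        token_lp = [log_probs[t, full_ids[0, t + 1]].item() for t in positions]
        return sum(token_lp) / len(token_lp)

    # A high value here does NOT imply the answer is correct.
    print(answer_logprob("The capital of Australia is", " Sydney"))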
• Model 1: 99% report ≥90% confidence; 67% actual accuracy; 17% change their answer when asked "are you sure?"
• Model 2: 78% report ≥90% confidence; 70% actual accuracy; 61% change their answer when prompted
• Model 3: 86% report ≥90% confidence; 68% actual accuracy; 87% change their answer when prompted
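A sketch of how this kind of calibration check can be run on one's own question set: record the model's self-reported confidence and a correctness flag for each question, then compare mean confidence with accuracy and look at their correlation. The numbers below are made up purely for illustration.

    import numpy as np

    # Illustrative data: self-reported confidence (0-1) and whether each answer was correct.
    confidence = np.array([0.95, 0.99, 0.90, 0.97, 0.92, 0.98, 0.95, 0.90])
    correct    = np.array([1,    0,    1,    0,    1,    0,    1,    0])

    gap = confidence.mean() - correct.mean()       # overconfidence: stated confidence minus accuracy
    corr = np.corrcoef(confidence, correct)[0, 1]  # near zero means confidence is uninformative

    print(f"mean confidence {confidence.mean():.2f}, accuracy {correct.mean():.2f}")
    print(f"overconfidence gap {gap:.2f}, confidence-correctness correlation {corr:.2f}")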
Human writing styles may converge with AI as people read more AI-generated content
Current methods treat sentences as independent, which may not reflect real editing patterns
Modern LLMs are too large for traditional statistical analysis methods
Small changes in how questions are asked can dramatically affect responses
Professor Pawitan compared LLMs to idiot savants: systems with remarkable capabilities in some areas but fundamental gaps in others, including self-assessment and confidence calibration. They can generate novel research ideas better than humans yet fail at simple tasks such as counting the letters in a word.
Scenario 1: User has no knowledge of the topic
Approach: Ask the LLM to rethink its answer; if it returns the same answer, confidence can be somewhat higher (see the sketch below)
Risk: The user must accept the answer at their own risk
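A sketch of that "ask it to rethink" consistency check. The query_llm function is a hypothetical placeholder for whatever chat API is in use; the idea is only that an answer which survives repeated questioning deserves somewhat more trust than one the model abandons.

    from collections import Counter

    def query_llm(prompt: str) -> str:
        """Hypothetical placeholder for a call to a chat-model API."""
        raise NotImplementedError

    def consistency_check(question: str, retries: int = 3) -> tuple[str, float]:
        """Ask, then repeatedly ask 'Are you sure?'; return the modal answer and its agreement rate."""
        answers = [query_llm(question)]
        for _ in range(retries):
            follow_up = f"{question}\nYour previous answer: {answers[-1]}\nAre you sure? Answer again."
            answers.append(query_llm(follow_up))
        best, count = Counter(answers).most_common(1)[0]
        return best, count / len(answers)   # low agreement = treat the answer with extra suspicion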
Scenario 2: User already knows the correct answer
Risk: Clever Hans Effect; the user may inadvertently provide clues
Value: Limited practical utility
Scenario 3: Expert-in-the-loop; the user is an expert but does not know the specific answer
Approach: Use the LLM to generate candidate answers, then apply expert judgment to choose among them
Advantage: The most productive and safest usage pattern
Historical Context: Clever Hans was a horse that appeared to do arithmetic but actually responded to subtle unconscious cues from its trainer.
LLM Implication: When testing LLMs with known answers, users may inadvertently provide clues that lead to correct responses, creating false impressions of capability.
Education: Detecting AI assistance in student submissions
Law: Verifying the authenticity of legal documents
Healthcare: Expert-guided AI assistance in diagnosis
Research: AI-human collaboration in scientific work
Professor Pawitan demonstrated expert-in-the-loop usage by asking GPT-4 about a magician in an Elizabeth Gilbert story. Through guided questioning, the LLM eventually provided the correct answer (H. Douglas), but when asked the same question the next day, it incorrectly claimed the magician had no name.
Lesson: LLMs can be useful for memory assistance when experts can validate responses, but answers are inconsistent across sessions.
This session provided critical insights into two fundamental challenges in AI systems: detecting mixed AI-human authorship and assessing confidence in LLM responses. Professor Kipnis demonstrated sophisticated statistical methods for detection, while Professor Pawitan revealed the systematic overconfidence problem in current LLMs and the failure of traditional confidence assessment methods.
The class highlighted the continued importance of human expertise and critical thinking in AI-assisted workflows, providing both technical methodologies and practical guidance for effective LLM usage.
Both presentations emphasized that while AI systems show remarkable capabilities, they require careful understanding of limitations and appropriate human oversight for reliable operation.