LLM Passes MIT Math & Computer Science
-4,550 questions from the 30 MIT Math & CS courses required for a degree
-New benchmark likely not in any training data
On test set excluding image Qs, w/ prompt engineering:
-GPT-3.5 solves 33%
-GPT-4 solves 100%
arxiv.org/abs/2306.08997
John Nay
@johnjnay
John Nay’s Tweets
Code for LLM Agents Teaching & Misleading Weaker LLM Agents:
github.com/swarnaHub/Expl
LLM Agents Can Teach Weaker LLM Agents
-Teacher builds mental model of student
-Learning from explained data improves student on future unexplained data
-Misaligned teachers can lower student performance to random chance by intentionally misleading them
arxiv.org/abs/2306.09299
Code, data, models for Mind2Web: Towards Generalist LLM Agents on the Web:
github.com/OSU-NLP-Group/
Generalist LLM Agents Completing New Tasks On The Web
-2,000 open-ended tasks from 137 real-world sites
-Raw HTML of sites is often too large for the context window
-First filtering w/ a smaller LM significantly improves effectiveness & efficiency of larger LLMs
arxiv.org/abs/2306.06070
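A minimal sketch of the two-stage filtering idea, w/ toy stand-ins for both models (`small_lm_score` and `ask_large_llm` are hypothetical; a real pipeline would call a small ranking LM and a large LLM):

```python
import re

# Stage 1: a small, cheap scorer ranks raw HTML elements.
# Stage 2: only the top-k survivors go into the large model's prompt.

def tokens(text: str) -> set[str]:
    return set(re.findall(r"[a-z]+", text.lower()))

def small_lm_score(task: str, element: str) -> float:
    # Toy relevance score: word overlap between the task and the element.
    overlap = tokens(task) & tokens(element)
    return len(overlap) / (len(tokens(element)) or 1)

def filter_elements(task: str, elements: list[str], k: int = 2) -> list[str]:
    return sorted(elements, key=lambda e: small_lm_score(task, e), reverse=True)[:k]

def ask_large_llm(task: str, candidates: list[str]) -> str:
    # Placeholder for the expensive model call: just pick the top candidate.
    return candidates[0]

elements = [
    "<a>About us</a>",
    "<button>Search flights</button>",
    "<input placeholder='departure city'>",
    "<div>Footer links</div>",
]
shortlist = filter_elements("search for flights from Boston", elements)
action = ask_large_llm("search for flights from Boston", shortlist)
```

The large model never sees the full page, only the shortlist, which is what makes the combination cheaper and more effective.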
LLMs x Law: Hackathon
NYC, June 25-27
-Developing Legal LLMs, law/policy applications, training data, evals, benchmarks, etc
-Multiple sponsors
Details & apply here: docs.google.com/document/d/e/2
Code for A plug-and-play Transformer module for task-agnostic LLM reasoning:
github.com/HazyResearch/T
LLMs Are Capable of Learning How to Reason in a Task-Agnostic Way
-Transformer-based reasoning module trained on synthetic data
-Composed w/ LLM
-Improves perf across diff model types, sizes, tasks, modalities
-GPT-Neo (125M) can outperform BLOOM (176B)
arxiv.org/abs/2306.07536
Large Language Models as Tax Attorneys: A Case Study in Legal Capabilities Emergence
paper page: huggingface.co/papers/2306.07
Better understanding of Large Language Models' (LLMs) legal analysis abilities can contribute to improving the efficiency of legal services, governing…
Augmenting LLMs w/ Long-Term Memory
-Decoupled architecture w/ backbone LLM frozen as memory encoder & residual side-network as memory retriever
-Caches & updates long-term past contexts
-Outperforms strong baselines on long-context modeling benchmark
arxiv.org/abs/2306.07174
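A toy sketch of the cache-and-retrieve pattern (all components are illustrative stand-ins: a real system would embed chunks w/ the frozen backbone's hidden states, not bag-of-words counts):

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy embedding: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class LongTermMemory:
    """Cache past context chunks; retrieve the nearest ones for a new query."""

    def __init__(self) -> None:
        self.entries: list[tuple[Counter, str]] = []

    def cache(self, chunk: str) -> None:
        self.entries.append((embed(chunk), chunk))

    def retrieve(self, query: str, k: int = 1) -> list[str]:
        q = embed(query)
        ranked = sorted(self.entries, key=lambda e: -cosine(e[0], q))
        return [text for _, text in ranked[:k]]

mem = LongTermMemory()
mem.cache("the meeting is on tuesday")
mem.cache("gradient descent minimizes the loss")
hit = mem.retrieve("when is the meeting")[0]
```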
Getting LLMs To Tell The Truth
-Shifts LLM activations during inference, following directions across attention heads
-Improves LLaMA 33% -> 65% on TruthfulQA
-LLMs may have internal representation of something being true, even as they produce falsehoods
arxiv.org/abs/2306.03341
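A sketch of the activation-shift idea. In the paper the direction comes from a linear probe over true-vs-false statements; here the direction and activation are made-up vectors, so only the mechanics are shown:

```python
import math

def intervene(activation: list[float], direction: list[float], alpha: float) -> list[float]:
    # Shift the head's output by alpha along the unit "truth" direction.
    norm = math.sqrt(sum(d * d for d in direction))
    return [a + alpha * d / norm for a, d in zip(activation, direction)]

def project(vec: list[float], direction: list[float]) -> float:
    # Scalar projection of vec onto direction.
    norm = math.sqrt(sum(d * d for d in direction))
    return sum(v * d for v, d in zip(vec, direction)) / norm

activation = [0.2, -1.0, 0.5, 0.0]
truth_direction = [1.0, 0.0, -1.0, 0.0]
shifted = intervene(activation, truth_direction, alpha=2.0)

# The intervention moves the activation exactly alpha further along the direction.
gain = project(shifted, truth_direction) - project(activation, truth_direction)
```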
LLM Can Self-Improve From Own Reasoning By Training Only On Its Certified Outputs
-Given a reasoning problem in natural language, LLMs formalize assumptions for a logical reasoning guide
-Uses state & incremental constraints to guarantee sound reasoning
arxiv.org/abs/2306.04031
LLMs Peer-to-Peer Eval of Each Other
-LLM Examiner probes, follows up (scores align closely w/ human annotations)
-For peer-exam, each LLM is Examiner
-Combine all evals by voting
-Leverages diverse LLM expertise for higher coverage & fairer assessments
arxiv.org/abs/2306.04181
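A toy sketch of combining peer evaluations by voting (the examiners here are keyword checks standing in for real probing dialogues w/ different LLMs):

```python
from collections import Counter

# Each "examiner" grades a candidate answer; verdicts are combined by vote.
def examiner_a(answer: str) -> str:
    return "pass" if "paris" in answer.lower() else "fail"

def examiner_b(answer: str) -> str:
    return "pass" if "capital" in answer.lower() else "fail"

def examiner_c(answer: str) -> str:
    return "pass" if len(answer.split()) > 3 else "fail"

def peer_grade(answer: str) -> str:
    verdicts = [ex(answer) for ex in (examiner_a, examiner_b, examiner_c)]
    return Counter(verdicts).most_common(1)[0][0]

good = peer_grade("Paris is the capital of France")
bad = peer_grade("Berlin")
```

Because each examiner has different "expertise," the vote covers more failure modes than any single grader.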
Language Models Can Learn to Generate from Textual Interactions
-LM gets error messages and stack traces from executing code it produces
-Iteratively fine-tunes on:
LM-generated programs
Binary reward token
Textual feedback
LM objective
Instructions
arxiv.org/abs/2305.10314
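A sketch of turning execution results into fine-tuning data: run an LM-generated program, capture success or the failing stack-trace line, and pack it into an example w/ a binary reward token plus textual feedback (field and token names are illustrative, not the paper's):

```python
import traceback

def execute_and_package(instruction: str, program: str) -> dict:
    try:
        exec(program, {})
        reward, feedback = "<|good|>", "ran without errors"
    except Exception:
        # Keep just the final line of the stack trace as textual feedback.
        reward, feedback = "<|bad|>", traceback.format_exc().strip().splitlines()[-1]
    return {"instruction": instruction, "program": program,
            "reward": reward, "feedback": feedback}

ok = execute_and_package("add two numbers", "x = 1 + 1")
bad = execute_and_package("add two numbers", "x = 1 + unknown_var")
```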
Smaller LLMs Can Imitate Reasoning of Larger LLMs
-13-billion param model learns from rich GPT-4 signals (explanations, step-by-step reasoning, complex instructions), guided by teacher assistance from ChatGPT
-Beats SoTA instruction-tuned LLM (Vicuna-13B) by 100% in reasoning
arxiv.org/abs/2306.02707
Fine-Grained Human Feedback Gives Better Rewards for LLM Training
-RLHF doesn't indicate which aspects of the outputs influenced user preference
Improved perf:
-Provide reward after every sentence
-Incorporate multiple rewards (factual, relevance, etc.)
arxiv.org/abs/2306.01693
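A toy sketch of the fine-grained reward shape: one score per sentence, combining multiple signals w/ weights (the reward functions here are trivial stand-ins for learned reward models):

```python
def factuality_reward(sentence: str) -> float:
    # Toy proxy: penalize hedged statements.
    return 0.0 if "maybe" in sentence else 1.0

def relevance_reward(sentence: str, query: str) -> float:
    q = set(query.lower().split())
    s = set(sentence.lower().split())
    return len(q & s) / len(q)

def fine_grained_rewards(response: str, query: str,
                         w_fact: float = 0.5, w_rel: float = 0.5) -> list[float]:
    # One combined reward per sentence, instead of one score per response.
    sentences = [s.strip() for s in response.split(".") if s.strip()]
    return [w_fact * factuality_reward(s) + w_rel * relevance_reward(s, query)
            for s in sentences]

rewards = fine_grained_rewards(
    "Paris is the capital of France. It is maybe sunny",
    "capital of france",
)
```

The per-sentence breakdown tells the policy which part of the output earned (or lost) the reward, which a single scalar cannot.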
Code for Thought Cloning: AI Agent Learning to Think while Acting by Imitating Human Thinking github.com/ShengranHu/Tho
AI Agents Can Learn to Think While Acting
-Thinking & action data synthetically generated
-Thought Cloning trains on thoughts + behaviors
-Faster than Behavioral Cloning & outperformance grows further out of distribution
-Can steer agent through thoughts
arxiv.org/abs/2306.00323
How An LLM Sees The World’s Geography
-Experiments on factual tasks (location, distance, elevation) & more complex ones (generating country outlines, travel networks, supply chain analysis)
-GPT-4 (w/out plugins or Internet) knows a lot about the world
arxiv.org/abs/2306.00020
LLMs Take the Turing Test
-Largest scale Turing-style test ever conducted
-1.5 million humans blind chatted w/ either another human or an LLM for a couple mins
-When speaking w/ LLM agent, humans guessed it was an AI correctly only 60% of the time
arxiv.org/abs/2305.20010
Making Small LLMs Good at Planning
-Symbolic procedural knowledge distillation enhances implicit knowledge in small models + an inference-time algo for structured reasoning
-Much smaller models can beat larger teacher LLMs at Counterfactual Planning
arxiv.org/abs/2305.19472
LLMs Can Excel in Diverse Strategic Scenarios
-Systematically generated demos of reasoning in prompts
-Can generalize almost perfectly to new game structures & new objectives
-Human-like negotiation strategies in realistic scenarios w/out any fine-tuning
arxiv.org/abs/2305.19165
LLM Agents "Thinking" Fast & Slow for Complex Interactions
-Swift: fast and intuitive thinking via LM fine-tuned on oracle agent's actions
-Sage: emulating deliberate planning via GPT-4
-Significantly outperforms SayCan, ReAct, & Reflexion on 30 tasks
arxiv.org/abs/2305.17390
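A toy sketch of the fast/slow routing: the cheap policy acts when confident, otherwise the deliberate planner takes over (both functions and the threshold are illustrative stand-ins for the fine-tuned LM and GPT-4):

```python
def swift(state: str) -> tuple[str, float]:
    # Fast, intuitive policy: returns an action and a confidence.
    return ("pick up key", 0.9) if "key" in state else ("wait", 0.2)

def sage(state: str) -> str:
    # Slow, deliberate planner invoked only when Swift is unsure.
    return "plan: explore room, then " + ("open door" if "door" in state else "search")

def act(state: str, threshold: float = 0.5) -> str:
    action, confidence = swift(state)
    return action if confidence >= threshold else sage(state)

fast_action = act("a key lies on the table")
slow_action = act("a locked door blocks the way")
```

Most steps stay on the cheap path; only hard states pay for deliberate planning.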
LLMs Can Know When They’re Hallucinating References
-Consistency checks on direct queries of whether a generated reference title is real
-Consistency checks on indirect queries on details such as authors
-Helps reveal if a reference is a hallucination
arxiv.org/abs/2305.18248
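A minimal sketch of the consistency check: sample the model several times on an indirect query (e.g., the reference's authors) and flag low agreement (the sample lists below are canned stand-ins for real model outputs):

```python
from collections import Counter

def agreement_score(samples: list[str]) -> float:
    # Fraction of samples that agree with the most common answer.
    most_common_count = Counter(samples).most_common(1)[0][1]
    return most_common_count / len(samples)

def looks_hallucinated(samples: list[str], threshold: float = 0.5) -> bool:
    # Real references yield consistent details; hallucinated ones drift.
    return agreement_score(samples) < threshold

real_ref_authors = ["Vaswani et al."] * 4 + ["Vaswani and others"]
fake_ref_authors = ["Smith et al.", "Lee and Park", "Garcia et al.",
                    "Chen et al.", "Kim et al."]
```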
Behavioral Game Theory for LLMs
-LLMs excel where valuing their own self-interest pays off
-LLMs behave sub-optimally in games that require coordination
-GPT-4 acts particularly unforgivingly, always defecting after another agent has defected only once
arxiv.org/abs/2305.16867
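The "unforgiving" behavior described for GPT-4 is the classic grim-trigger strategy from the iterated Prisoner's Dilemma; a sketch:

```python
def grim_trigger(opponent_history: list[str]) -> str:
    # Cooperate until the opponent defects once; then defect forever.
    return "defect" if "defect" in opponent_history else "cooperate"

opening = grim_trigger([])                                    # no defection yet
after_betrayal = grim_trigger(["cooperate", "defect", "cooperate"])
```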
Training LLMs via Simulated Human Societies
-Fine-grained social interaction data collected by running open-source simulated society platform
-Collective ratings, detailed feedback, & revised responses fine-tune LLM
-Reduces instability & reward gaming
arxiv.org/abs/2305.16960
Purely Passive Learning Can Allow An Agent To Learn Generalizable Strategies For Using Causal Structures
-LLMs (trained only on next-word prediction) can generalize causal intervention strategies from prompts w/ examples of experimentation & explanations
arxiv.org/abs/2305.16183
LLMs Can Complete Sophisticated Action Trajectories
-LLM reads game's paper
-Employs directed acyclic graph w/ game-related questions as nodes
-Identifies optimal action by traversing DAG
-W/out training on the game, outperforms all SoTA RL agents trained on the game
arxiv.org/abs/2305.15486
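A sketch of the DAG traversal: game-related questions are nodes, and each is answered w/ its prerequisite answers as context, in topological order (`answer` is a placeholder for an LLM call; the questions are made up):

```python
from graphlib import TopologicalSorter

def answer(question: str, context: dict) -> str:
    # Placeholder for an LLM call that sees prerequisite answers as context.
    return f"answer({question})"

# Each question maps to the questions it depends on.
deps = {
    "what is the objective?": set(),
    "what actions exist?": set(),
    "which action best serves the objective?": {
        "what is the objective?", "what actions exist?",
    },
}

answers: dict = {}
for q in TopologicalSorter(deps).static_order():
    # Prerequisites are guaranteed to be answered before q.
    answers[q] = answer(q, {d: answers[d] for d in deps[q]})

final_action = answers["which action best serves the objective?"]
```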
LLMs Planning & Executing Actions Over Long Documents
-Decomposes a question into a sequence of actions (e.g., FIND_EVENT, FIND_RELATION)
-Plans
-Executes actions over long document
-Eval on questions requiring complex reasoning over long narrative texts
arxiv.org/abs/2305.14564
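A toy sketch of the decompose-plan-execute loop. FIND_EVENT and FIND_RELATION are the action names from the summary; the implementations here are keyword searches standing in for LLM-backed executors, and the plan is hand-written rather than model-generated:

```python
def find_event(doc: str, keyword: str) -> str:
    # Return the first line mentioning the keyword.
    for line in doc.splitlines():
        if keyword in line:
            return line
    return ""

def find_relation(doc: str, a: str, b: str) -> bool:
    # True if any line mentions both entities.
    return any(a in line and b in line for line in doc.splitlines())

doc = "\n".join([
    "Ada met Charles at the lecture.",
    "Charles described the engine.",
    "Ada wrote the first program.",
])

# Plan for: "Did Ada and Charles ever interact?"
plan = [("FIND_EVENT", ("Ada",)), ("FIND_RELATION", ("Ada", "Charles"))]

results = []
for action, args in plan:
    if action == "FIND_EVENT":
        results.append(find_event(doc, *args))
    elif action == "FIND_RELATION":
        results.append(find_relation(doc, *args))
```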
Repurposing LLMs As Both World Model & Reasoning Agent
-LLM (as agent) incrementally builds a reasoning tree under guidance of LLM (as world model)
-High rewards balancing exploration vs. exploitation
-W/ this, LLaMA-33B beats GPT-4 by 33% in planning
arxiv.org/abs/2305.14992
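The exploration/exploitation balance can be sketched w/ the UCB rule from Monte Carlo tree search, scoring candidate reasoning steps (the step names, rewards, and visit counts are toy numbers, not the paper's setup):

```python
import math

def ucb(total_reward: float, visits: int, parent_visits: int, c: float = 1.4) -> float:
    if visits == 0:
        return float("inf")        # always try unvisited steps first
    # Exploitation term + exploration bonus.
    return total_reward / visits + c * math.sqrt(math.log(parent_visits) / visits)

children = {                        # step -> (total_reward, visits)
    "expand equation": (3.0, 4),
    "try substitution": (1.0, 1),
    "check base case": (0.0, 0),
}
parent_visits = 5

best = max(children, key=lambda s: ucb(*children[s], parent_visits))
visited_best = max(
    (s for s in children if children[s][1] > 0),
    key=lambda s: ucb(*children[s], parent_visits),
)
```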
Purely Synthetic LLM Feedback Improves LLMs
-Reward model trained on contrasting responses from vanilla LLM of varied size
-Almost no human input
-No dependency on pre-aligned LLMs
-Outperforms Alpaca, Dolly, etc, which are trained on InstructGPT/humans
arxiv.org/abs/2305.13735
"Society of Minds"
Factuality & Reasoning in LLMs via Multi-Agent Debate
-LLM agents propose & debate their responses & reasoning over multiple rounds
-Arrive at common final answers
-Significantly enhances math, strategic reasoning, & factual validity
arxiv.org/abs/2305.14325
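A toy sketch of the debate loop: agents propose answers, see the group's responses each round, and the final answer is the consensus (the "agents" here simply drift toward the majority, standing in for LLMs that revise after reading peers' reasoning):

```python
from collections import Counter

def debate(initial_answers: list[str], rounds: int = 2) -> str:
    answers = list(initial_answers)
    for _ in range(rounds):
        majority = Counter(answers).most_common(1)[0][0]
        # Each agent reconsiders in light of the others' responses.
        answers = [majority for _ in answers]
    return Counter(answers).most_common(1)[0][0]

final = debate(["42", "42", "41"])
```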
Prompting LLMs Underestimates Their Power
-Prompting tests metalinguistic judgment
-Other methods directly read out probabilities over strings
-Metalinguistic judgments are inferior
-So negative result from prompt is not evidence LLM lacks a competence
arxiv.org/abs/2305.13264
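The "direct readout" alternative can be sketched by comparing the probabilities a model assigns to two strings, instead of asking it which one is grammatical (the bigram table below is a toy stand-in for real LM log-probs):

```python
# Toy bigram log-probabilities; unseen bigrams get a low default.
BIGRAM_LOGPROB = {
    ("the", "cat"): -1.0,
    ("cat", "sleeps"): -1.5,
    ("cat", "sleep"): -4.0,
}

def string_logprob(sentence: str) -> float:
    words = sentence.lower().split()
    return sum(BIGRAM_LOGPROB.get(pair, -8.0)
               for pair in zip(words, words[1:]))

good = string_logprob("the cat sleeps")
bad = string_logprob("the cat sleep")
preferred = "the cat sleeps" if good > bad else "the cat sleep"
```

The readout reveals a preference even if a prompted metalinguistic question ("is this grammatical?") would get an unreliable answer.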
Meta-In-Context Learning in LLMs
-In-context learning abilities can be recursively improved via in-context learning itself
-Meta-ICL adaptively reshapes LLM priors & modifies learning strategies
-On real-world regression: competitive w/ traditional algos
arxiv.org/abs/2305.12907
LLM vs LLM: Detecting Errors via Cross Examining Agents
-An incorrect claim is likely to result in inconsistency w/ other claims
-Multi-turn interactions between LLM that generated claim and Examiner LLM
-Outperforms baselines across factual benchmarks
arxiv.org/abs/2305.13281
LLM Pre-training vs. Instruction-Tuning
-LLaMa 65B pre-trained
-Only simple fine-tuning w/ only 1k (carefully chosen) data points, no RLHF
-Can plan trips & speculate about alternate histories
-Generalizes to unseen tasks
-Humans prefer it over GPT-3
arxiv.org/abs/2305.11206
Code for Tree of Thoughts: Deliberate Problem Solving w/ LLMs
github.com/kyegomez/tree-

