1/ New research agenda: Supervising AIs improving AIs. How can we ensure future prosaic AIs remain safe and controllable while shaping their own development or that of successor AIs? I will go over the agenda below.
2/ AIs can improve through better training algorithms or better training data. We believe data-based improvement is riskier than algorithm-based improvement, as current models mostly derive their behavior from training data.
3/ We envision a future where AIs self-augment by seeking out more and better training data, running experiments in the real world, and creating successor AIs or training themselves. The goal: ensuring 'automated science' processes remain safe and controllable.
4/ Key problems to tackle include: preventing self-training from amplifying undesirable behaviors, preventing semantic drift, ensuring cross-modality actions remain grounded, and preventing value drift during iterated self-retraining.
5/ Current research directions in the agenda focus on scalable methods for tracking behavioral drift in language models and on benchmarks for evaluating a language model's capacity for stable self-modification via self-training.
6/ Algorithmic vs data-driven improvements: Supervision methods for data-driven improvement are high priority. Data influences behaviors more directly, automated data improvement is closer at hand, and algorithmic improvements may be less likely to interfere with alignment techniques.
7/ Risks of data-driven improvement: Bias amplification, positive feedback loops, data poisoning, semantic drift. Addressing these risks is crucial for maintaining safe and controllable AI systems during iterative self-training.
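The feedback-loop risk above can be illustrated with a toy simulation (a hypothetical sketch, not a method from the agenda): treat a "model" as just the probability p of emitting label 1, and let each self-training round fit the next model to a confidence-sharpened version of its own outputs. Any initial bias is then amplified until the model collapses to a single label.

```python
def self_train_step(p: float) -> float:
    """One sharpened self-training round: the next model is fit to its
    own most-confident outputs, p -> p^2 / (p^2 + (1 - p)^2)."""
    return p**2 / (p**2 + (1 - p)**2)


def run(p: float, rounds: int) -> float:
    """Iterate self-training and return the final probability of label 1."""
    for _ in range(rounds):
        p = self_train_step(p)
    return p


if __name__ == "__main__":
    # A mild initial bias toward label 1 is amplified toward certainty,
    # while a perfectly unbiased model (p = 0.5) stays at its fixed point.
    print(run(0.6, 10), run(0.5, 10))
```

The sharpening step stands in for any self-training pipeline that preferentially keeps the model's most confident generations; the qualitative point is that small biases compound across rounds.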
8/ Cross-modal semantic grounding: Future work will extend language as an interface to control AI behaviors in other modalities (e.g., image generation, robotic manipulation). Addressing cross-modal grounding challenges is essential for stable AI systems.
9/ Value drift: To ensure long-term stability and value alignment, researchers need to incorporate feedback mechanisms, explore meta-learning techniques, develop monitoring and evaluation methods, and investigate iterative alignment techniques.
10/ Related work: Continual learning, active learning, semi-supervised learning, self-distillation, and exposure bias are all relevant areas of research that can provide insights for maintaining stability in AI systems during iterative self-training.
11/ Current research directions: Unsupervised behavioral evaluation and benchmarks for stable reflectivity are two projects aiming to address the challenges of maintaining stability and alignment in AI systems undergoing iterative training.
12/ Unsupervised behavioral evaluation: Developing methods to quantify how fine-tuning changes model behavior, allowing researchers to identify unexpected ways in which iterative training influences models and address issues like bias amplification and semantic drift.
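One minimal sketch of such a drift metric (a hypothetical illustration; the thread does not specify the method): compare the model's output distributions before and after fine-tuning on a fixed probe set of prompts, using the mean KL divergence as a scalar drift score.

```python
import numpy as np


def kl_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """KL(p || q) between two discrete distributions, clipped for stability."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))


def behavioral_drift(base_dists, tuned_dists) -> float:
    """Mean KL divergence between a base model's and a fine-tuned model's
    output distributions over a fixed probe set of prompts."""
    return float(np.mean([kl_divergence(p, q)
                          for p, q in zip(base_dists, tuned_dists)]))


# Hypothetical probe set: next-token distributions over a 3-token
# vocabulary for two prompts, before and after fine-tuning.
base  = [np.array([0.7, 0.2, 0.1]), np.array([0.4, 0.4, 0.2])]
tuned = [np.array([0.6, 0.3, 0.1]), np.array([0.1, 0.8, 0.1])]
print(behavioral_drift(base, tuned))
```

In practice the distributions would come from the actual language models on held-out prompts; the point of the metric is that it needs no labels, only the two models and a shared probe set.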
13/ Benchmarks for stable reflectivity: Investigating self-reflective behavior in AI self-improvement, focusing on subtasks related to self-reflectivity, and developing probing datasets to evaluate model competency and track progress in these subtasks.
14/ In conclusion, supervising AIs improving AIs is a research agenda for ensuring AI systems remain aligned with human values during iterative self-training and self-improvement.
Agenda started by and his MATS team.
Full post: