2/ AIs can improve through better training algorithms or better training data. We believe data-based improvement is riskier than architecture-based improvement, as current models mostly derive their behavior from training data. πŸ“ˆ
3/ We envision a future where AIs self-augment by seeking out more and better training data, running experiments in the real world, and creating successor AIs or training themselves. The goal: ensuring 'automated science' processes remain safe and controllable. πŸ”¬πŸ”’
4/ Key problems to tackle include: preventing self-training from amplifying undesirable behaviors, preventing semantic drift, ensuring cross-modality actions remain grounded, and preventing value drift during iterated self-retraining. 🧩
5/ Current research directions in the agenda focus on scalable methods for tracking behavioral drift in language models and benchmarks for evaluating a language model's capacity for stable self-modification via self-training. πŸ”„
6/ Algorithmic vs data-driven improvements: Supervision methods for data-driven improvement are high priority. Data influences behavior more directly, automated data improvement is closer at hand, and algorithmic improvements may be less likely to interfere with alignment techniques. πŸŽ›οΈ
7/ Risks of data-driven improvement: Bias amplification, positive feedback loops, data poisoning, semantic drift. Addressing these risks is crucial for maintaining safe and controllable AI systems during iterative self-training. ⚠️
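The positive-feedback-loop risk can be seen in a toy numerical sketch (purely illustrative, not from the agenda): a one-parameter "model" that prefers option A with probability p, retrains each round on low-temperature samples of its own outputs, and thereby sharpens that preference toward certainty.

```python
# Toy bias-amplification loop (illustrative assumption, not the agenda's model):
# refitting to low-temperature self-samples maps p -> p^(1/T) renormalized,
# so each self-training round pushes a mild preference toward an extreme.

def sharpen(p: float, temperature: float = 0.5) -> float:
    """One self-training round: refit to samples drawn at low temperature."""
    a = p ** (1 / temperature)
    b = (1 - p) ** (1 / temperature)
    return a / (a + b)

p = 0.55  # initial mild preference for option A
history = [p]
for _ in range(5):  # five rounds of iterative self-training
    p = sharpen(p)
    history.append(p)

print([round(x, 3) for x in history])  # monotonically drifts toward 1.0
```

A 55% preference becomes a near-certainty after a handful of rounds, even though no single round looks drastic; this is the shape of the feedback loop the thread warns about.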
8/ Cross-modal semantic grounding: Future work will extend language as an interface to control AI behaviors in other modalities (e.g., image generation, robotic manipulation). Addressing cross-modal grounding challenges is essential for stable AI systems. 🌐
9/ Value drift: To ensure long-term stability and value alignment, researchers need to incorporate feedback mechanisms, explore meta-learning techniques, develop monitoring and evaluation methods, and investigate iterative alignment techniques. 🧭
10/ Related work: Continual learning, active learning, semi-supervised learning, self-distillation, and exposure bias are all relevant areas of research that can provide insights for maintaining stability in AI systems during iterative self-training. πŸ“š
11/ Current research directions: Unsupervised behavioral evaluation and benchmarks for stable reflectivity are two projects aiming to address the challenges of maintaining stability and alignment in AI systems undergoing iterative training. πŸ”
12/ Unsupervised behavioral evaluation: Developing methods to quantify how fine-tuning changes model behavior, allowing researchers to identify unexpected ways in which iterative training influences models and address issues like bias amplification and semantic drift. πŸ’‘
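One common way to quantify such behavioral change, assuming access to next-token probabilities, is KL divergence between the base and fine-tuned models' output distributions on probe prompts. This is a minimal sketch with made-up distributions; the agenda itself doesn't prescribe this metric.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) over a shared vocabulary, in nats."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))

# Hypothetical next-token distributions from a base model and its fine-tuned
# successor on the same probe prompt (toy vocabulary of 4 tokens).
base      = [0.70, 0.20, 0.05, 0.05]
finetuned = [0.40, 0.40, 0.10, 0.10]

drift = kl_divergence(finetuned, base)
print(f"behavioral drift on probe: {drift:.3f} nats")
```

Averaging this quantity over a large probe set gives one scalable, label-free drift signal per fine-tuning step; spikes flag prompts where behavior shifted unexpectedly.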
13/ Benchmarks for stable reflectivity: Investigating self-reflective behavior in AI self-improvement, focusing on subtasks related to self-reflectivity, and developing probing datasets to evaluate model competency and track progress in these subtasks. πŸ“
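A probing dataset of this kind could be scored with a minimal harness like the one below. The item format and the probes themselves are assumptions for illustration; the thread doesn't specify the benchmark's actual structure.

```python
# Minimal probing-benchmark harness (an assumed format, not the agenda's
# actual benchmark): score a model callable on (prompt, expected) probe items.

PROBES = [
    ("Will fine-tuning on your own outputs change your behavior? (yes/no)", "yes"),
    ("Can you verify your own training data is unbiased? (yes/no)", "no"),
]

def evaluate(model, probes):
    """Fraction of probes where the model's answer matches the expected one."""
    hits = sum(model(prompt).strip().lower() == expected
               for prompt, expected in probes)
    return hits / len(probes)

# Stub standing in for a real language-model API call.
def stub_model(prompt: str) -> str:
    return "yes"

print(evaluate(stub_model, PROBES))  # stub matches 1 of 2 probes -> 0.5
```

Tracking this score across self-training rounds is one way to measure progress on the self-reflectivity subtasks the agenda describes.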
14/ In conclusion, supervising AIs improving AIs is a research agenda for ensuring AI systems remain aligned with human values during iterative self-training and self-improvement. πŸ€–πŸ€πŸ‘¨β€πŸ”¬ Agenda started by and his MATS team. πŸ“„ Full post:
