Archive
A one-stop index of every post from the Blog and Projects.
2026
- Machine Consciousness blog
- Self-Evolving LLM Memory Extraction Across Heterogeneous Tasks papers
- The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents papers
- Video-Based Reward Modeling for Computer-Use Agents papers
- DP-RFT: Learning to Generate Synthetic Text via Differentially Private Reinforcement Fine-Tuning papers
- Experiential Reinforcement Learning papers
- One Model, All Roles: Multi-Turn, Multi-Agent Self-Play Reinforcement Learning for Conversational Social Intelligence papers
2025
- STEER-BENCH: A Benchmark for Evaluating the Steerability of Large Language Models papers
- The Hallucination Tax of Reinforcement Finetuning papers
- CoAct-1: Computer-using Agents with Coding as Actions papers
- Efficient Reinforcement Finetuning via Adaptive Curriculum Learning papers
- Discovering Knowledge Deficiencies of Language Models on Massive Knowledge Base papers
- On the Trustworthiness of Generative Foundation Models papers
- Detecting and Filtering Unsafe Training Data via Data Attribution papers
2024
- WildFeedback: Aligning LLMs With In-situ User Interactions And Feedback papers
- How Susceptible are Large Language Models to Ideological Manipulation? papers
2023
- Can Language Model Moderators Improve the Health of Online Discourse? papers
- Safer-Instruct: Aligning Language Models with Automated Preference Data papers
- CoAnnotating: Uncertainty-Guided Work Allocation between Human and Large Language Models for Data Annotation papers
- Positive Reframing Keyboard projects
- Neural Story Planning papers
- Investigating AAVE in Question Answering Systems papers