Jiayi Pan

潘家怡

University of California, Berkeley

Hi 👋

I am a second-year PhD student at Berkeley AI Research, where I work with Alane Suhr in the Berkeley NLP Group.

I enjoy understanding and making things. I try to learn broadly but bet on a single direction at a time. Recently, I have been most excited about developing scalable methods to evaluate and improve language model agents.

Feedback is always welcome :)

📢 Open to internship opportunities for Summer 2025.

My work has also been covered in Scientific American and the State of AI Annual Report.

Publications & Manuscripts

* denotes equal contribution
Autonomous Evaluation and Refinement of Digital Agents

Jiayi Pan, Yichi Zhang, Nickolas Tomlin, Yifei Zhou, Sergey Levine, Alane Suhr. COLM 2024 / ⭐️ MAR Workshop @ CVPR 2024 Best Paper.

We design model-based evaluators to both evaluate and autonomously improve agents' performance. We show that these open-ended evaluators can significantly improve agents' performance, through either fine-tuning or inference-time guidance, without any extra supervision.
OpenHands: An Open Platform for AI Software Developers as Generalist Agents

OpenHands Community. Preprint.

We introduce OpenHands, a platform for developing AI agents that interact with the digital world. OpenHands is a community project with over 180 contributors and 30K+ GitHub stars.
DigiRL: Training In-The-Wild Device-Control Agents with Autonomous Reinforcement Learning

Hao Bai*, Yifei Zhou*, Mert Cemri, Jiayi Pan, Alane Suhr, Sergey Levine, Aviral Kumar. NeurIPS 2024.

We develop reinforcement learning techniques to post-train device-control language agents. Our 2B VLM, when post-trained with an autonomous evaluator, improves its success rate from 17% to 67% on Android device-control tasks.
Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning

Yuexiang Zhai, Hao Bai*, Zipeng Lin*, Jiayi Pan*, Shengbang Tong*, Yifei Zhou*, Alane Suhr, Saining Xie, Yann LeCun, Yi Ma, Sergey Levine. NeurIPS 2024.

We provide infrastructure and environments for training VLMs with RL on decision-making tasks. We show that RL training enables our 7B model to outperform GPT-4V on these tasks. Additionally, we show the intriguing effectiveness of CoT reasoning for improving performance.
ArCHer: Training Language Model Agents via Hierarchical Multi-Turn RL

Yifei Zhou, Andrea Zanette, Jiayi Pan, Sergey Levine, Aviral Kumar. ICML 2024.

We present ArCHer, a new framework of multi-turn RL algorithms for training LM agents. It preserves the flexibility of mainstream single-turn LM RL methods like PPO, while effectively handling multiple turns, long horizons, and delayed rewards.
Inversion-Free Image Editing with Natural Language

Sihan Xu*, Yidong Huang*, Jiayi Pan, Ziqiao Ma, Joyce Chai. CVPR 2024.

We present an inversion-free editing (InfEdit) method which enables consistent natural language guided image editing. InfEdit excels in complex editing tasks and is ~10X faster than prior methods.
Grounding Visual Illusions in Language: Do Vision-Language Models Perceive Illusions Like Humans?

Yichi Zhang, Jiayi Pan, Yuchen Zhou, Rui Pan, Joyce Chai. EMNLP 2023.

Do Vision-Language Models, an emerging human-computer interface, perceive visual illusions similarly to humans, or do they accurately depict reality? We create the GVIL dataset to study this question. Among other findings, we discover that larger models align more closely with human perception.
World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models

Ziqiao Ma*, Jiayi Pan*, Joyce Chai. ⭐️ ACL 2023 Outstanding Paper.

We study grounding and bootstrapping in open-world language learning through Grounded Open Vocabulary Acquisition. Our visually-grounded language model, OctoBERT, excels in learning grounded words quickly and robustly.
SEAGULL: An Embodied Agent for Instruction Following through Situated Dialog

Team SEAGULL at UMich. 🏆 1st Place in the inaugural Alexa Prize SimBot Challenge.

We introduce SEAGULL, an interactive embodied agent which completes complex tasks in the Arena simulation environment through dialog with users. SEAGULL is engineered to be efficient, user-centric, and continuously improving.
Data-Efficient Learning of Natural Language to Linear Temporal Logic Translators for Robot Task Specification

Jiayi Pan, Glen Chou, Dmitry Berenson. ICRA 2023.

We present a learning-based approach to translate from natural language commands to LTL specifications with only a handful of labeled data. It enables few-shot learning of LTL translators while achieving state-of-the-art performance.
DANLI: Deliberative Agent for Following Natural Language Instructions

Yichi Zhang, Jianing Yang, Jiayi Pan, Shane Storks, Nikhil Devraj, Ziqiao Ma, Keunwoo Peter Yu, Yuwei Bao, Joyce Chai. EMNLP 2022, Oral.

We introduce DANLI, a neural-symbolic agent that proactively reasons and plans according to its past experiences. DANLI achieves a 70% improvement on the challenging TEACh benchmark while improving transparency in its behaviors.

Contact

  • Email: jiayipan [AT] berkeley [DOT] edu

Misc

  • I try to develop some habits. Currently, I am learning guitar and climbing.
  • Growing up, I lived in quite a few places: Chongqing, Xinyang, Chengdu, Shanghai, Ann Arbor, and now the Bay Area.
  • These days, I think a lot about how to align my research with a positive, counterfactual impact on the near future in which AGI becomes a reality.
  • Before doing AI research, I was quite into physics and participated in the Physics Olympiad in high school (although I wasn't exceptionally strong). I still occasionally read physics books.