From passive generation to interactive agents that strategically decide what to sense, when to sense, and how to act.
Generative models have mastered passive generation. But real intelligence is active. It observes, plans, acts, and learns from feedback.
Models that strategically choose what to observe—optimal viewpoints, sensor placement, information seeking.
Continuous replanning based on new observations. Perception and action form a tight feedback loop.
Agents that learn through interaction. Active decision-making transforms passive models into interactive systems.
Leading voices in vision, robotics, and embodied AI
Chelsea Finn • Stanford University & Physical Intelligence • 10:45 • "Evaluating and Improving Robotic Foundation Models with World Models"
Location: 🎙️ Mile High 2A (Talks) & 🪧 ExHall A 229–239 (Posters) at the conference center
Time: June 3rd, 8:00 am – 11:50 am
| Session | Title | Duration |
|---|---|---|
| Opening Welcome & Introductions | | 10 min |
| Invited Talk: Nicholas Roy | "World Models and Why We Should Care about Their Structure" | 30 min |
| Oral Session 1 and 2: SAW-Bench and GEM-4D | | 30 min |
| ☕ Coffee Break | | 15 min |
| Invited Talk: Alan Yuille | "World Models: Bayes or Bust?" | 30 min |
| Invited Talk: Yiannis Aloimonos | "Generative Action Systems" | 30 min |
| Invited Talk: Chelsea Finn | "Evaluating and Improving Robotic Foundation Models with World Models" | 30 min |
| Oral Session 3: RoboWM-Bench | | |
| Closing Remarks | | |
Lambda generously sponsors compute credits for outstanding research.
Poster session at ExHall A 229–239 (each board has two faces, a and b) • June 3rd, 10:00 am – 12:00 pm • OpenReview portal
| Poster ID | Paper Title |
|---|---|
| 229a | When Predicted Depth Can Beat the Sensor: Depth-Free Deployment of RGB-D Self-Supervised Encoders |
| 229b | Reconstruction or Semantics? What Makes a Latent Space Useful for Robotic World Models |
| 230a | Adding Thermal Awareness to Visual Systems in Real-Time via Distilled Diffusion Models |
| 230b | GEM-4D: Geometry-Enhanced Video World Models for Robot Manipulation |
| 231a | Streaming3D: Sequential 3D Generation via Evidential Memory |
| 231b | Purposive Sensing: Task-Aligned Observation Selection via Closed-Loop World Model Imagination |
| 232a | Towards World Scene Graph Generation from Monocular Videos: A Structured World Representation for Embodied Agents |
| 232b | ULTRA: Unified Multimodal Control for Autonomous Humanoid Whole-Body Loco-Manipulation |
| 233a | RoboWM-Bench: A Benchmark for Evaluating World Models in Robotic Manipulation |
| 233b | Epistemic Horizons: Uncertainty-Gated Active Sensing for Closed-Loop World Model Planning |
| 234a | Imitation learning through imagination in latent space |
| 234b | When to Look: A Theory of Observation Timing for World-Model-Guided Active Agents |
| 235a | The Information Gap Process: A Unified Theory of Closed-Loop Active Sensing in World Models |
| 235b | Latent Observability in World Models: A Unified Framework for Active Sensing, Belief Convergence, and Closed-Loop Planning Efficiency |
| 236a | EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses |
| 236b | SAW-Bench: Learning Situated Awareness in the Real World |
| 237a | WorldReel: 4D Video Generation with Consistent Geometry and Motion Modeling |
| 237b | Turning Video Models into Generalist Robot Policies |
| 238a | Same Meaning, Different Pictures: Finding Missing Generated Pictures |
| 238b | Addressable Memory for Closed-Loop Video World Models |
Questions about the workshop, submissions, or anything else? Reach out to our contact person.
Jieneng Chen • jchen293@jh.edu