
Meta Unveils V-JEPA 2 to Boost AI’s Physical Reasoning with Video

A robotic arm in a research lab hovers over a light wooden table, positioned near a blue mug, a red cube, and a small brown box. A computer monitor beside it displays a paused video of a person tilting a green cup, with green arrows overlaid to indicate motion prediction. A researcher in a green shirt stands in the background observing the setup. The scene illustrates Meta’s V-JEPA 2 model using video data to teach robots physical reasoning and object interaction in real-world environments.

Image Source: ChatGPT-4o


Meta has released V-JEPA 2, its latest video-trained world model designed to improve how AI agents understand and predict physical interactions. The model represents a step forward in teaching machines to reason about real-world dynamics—skills that are critical for AI agents operating in physical environments.

The company also introduced three new evaluation benchmarks to help researchers test how well models reason about the world using video.

A Step Toward Advanced Machine Intelligence

World models like V-JEPA 2 aim to give AI systems the ability to “think before they act.” These models are trained to observe how objects and people interact in the real world and to use that information to predict what might happen next. Meta describes this as essential to its long-term goal of building advanced machine intelligence, or AMI.

V-JEPA 2 builds on the original V-JEPA model, which Meta released last year and likewise trained on video. With improved capabilities in understanding and prediction, V-JEPA 2 helps AI agents carry out physical tasks more reliably, even in unfamiliar settings.

Mimicking Human Physical Intuition

In daily life, people constantly make predictions about physical outcomes. Whether it’s anticipating where a ball will land or weaving through a crowded street, humans rely on an internal model of the world built from observation and experience.

For example, when walking along a busy sidewalk or through a train station, we automatically adjust our path to avoid bumping into others, often before a collision even seems imminent. This kind of navigation requires continuous, split-second predictions about where people are going, how fast they’re moving, and how the environment might shift. Meta’s V-JEPA 2 is designed to replicate that kind of reasoning for machines.

The model enables three core capabilities:

  • Understanding: Recognizing objects, actions, and environments.

  • Predicting: Anticipating how objects and people will behave.

  • Planning: Using predictions to decide on the next action.

According to Meta, training the model on video helped it learn key physical patterns, such as how objects move and how people interact with them. This gives robots using V-JEPA 2 a stronger foundation to act safely and effectively.
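
To make the understand, predict, plan loop described above concrete, the short Python sketch below shows how a latent world model might choose an action: encode an observation, roll candidate actions forward in latent space, and pick the one whose predicted outcome lands closest to a goal. Every function name, shape, and bit of stand-in math here is a hypothetical illustration, not Meta’s V-JEPA 2 code or API.

```python
# Conceptual sketch of an "understand / predict / plan" loop for a latent
# world model. All names and shapes are hypothetical stand-ins used only
# to illustrate the control flow described in the article.
import numpy as np

rng = np.random.default_rng(0)

def encode(observation: np.ndarray) -> np.ndarray:
    """Understanding: map a raw video frame to a compact latent state.
    A fixed random projection stands in for a learned encoder."""
    projection = rng.standard_normal((observation.size, 16))
    return observation.flatten() @ projection

def predict_next_state(state: np.ndarray, action: np.ndarray) -> np.ndarray:
    """Predicting: roll the latent state forward under a candidate action.
    A real world model learns this transition from video; this is a toy rule."""
    return state + 0.1 * np.resize(action, state.shape)

def plan_action(state, goal, candidate_actions):
    """Planning: pick the candidate whose predicted next state is closest to the goal."""
    scores = [np.linalg.norm(predict_next_state(state, a) - goal)
              for a in candidate_actions]
    return candidate_actions[int(np.argmin(scores))]

# Toy usage: one fake camera frame, a nearby latent state as the goal,
# and three random candidate actions.
frame = rng.random((8, 8, 3))          # stand-in for a video frame
state = encode(frame)
goal = state + 1.0                     # pretend "object moved to the target"
candidates = [rng.standard_normal(3) for _ in range(3)]
print("chosen action:", plan_action(state, goal, candidates))
```

In practice, a planner of this kind would score many candidate action sequences rather than three single actions, but the basic flow of encoding, forecasting, and selecting remains the same.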

Real-World Performance and New Benchmarks

In lab settings, robots powered by V-JEPA 2 have successfully completed tasks like reaching for objects, picking them up, and placing them in new locations. These tasks test the model’s ability to apply its learned understanding to novel situations.

To support broader research, Meta is also releasing three benchmarks focused on video-based physical reasoning. These tools are intended to help the AI community evaluate how well models can interpret and predict real-world dynamics. While the blog post did not specify them by name, details from the accompanying research paper outline the core categories:

  • Video Question-Answering: This benchmark suite includes tests such as PerceptionTest, TempCompass, MVP, TemporalBench, TOMATO, and TVBench. These tasks assess how accurately a model can answer questions based on physical events depicted in video clips, ranging from object interactions to temporal logic.

  • Action Anticipation: Using the Epic-Kitchens-100 dataset, this benchmark challenges models to predict what a person will do next based on the first few seconds of a cooking video. V-JEPA 2 achieved state-of-the-art performance here, reaching 39.7% recall at rank 5—a key metric for anticipating human intent.

  • Motion Understanding: This area includes benchmarks like Something-Something v2, which require models to distinguish fine-grained physical actions such as pushing, pulling, or lifting. V-JEPA 2 scored 77.3% top-1 accuracy, demonstrating strong performance in recognizing dynamic object behavior.

These benchmarks are intended to help researchers measure how well their models learn, interpret, and predict physical interactions from video—skills that are central to advancing embodied AI.
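
For readers unfamiliar with the metrics quoted above, the short sketch below shows what “recall at rank 5” and “top-1 accuracy” actually measure. This is generic evaluation code written for illustration, not the official Epic-Kitchens-100 or Something-Something v2 scoring scripts.

```python
# Minimal illustration of the two metrics quoted above: recall at rank 5
# (is the true label among the model's five highest-scoring guesses?) and
# top-1 accuracy (is the single highest-scoring class the correct one?).
import numpy as np

def recall_at_k(scores: np.ndarray, labels: np.ndarray, k: int = 5) -> float:
    """Fraction of samples whose true label appears in the k highest scores."""
    top_k = np.argsort(scores, axis=1)[:, -k:]   # indices of the k best classes per sample
    hits = [label in row for label, row in zip(labels, top_k)]
    return float(np.mean(hits))

def top1_accuracy(scores: np.ndarray, labels: np.ndarray) -> float:
    """Fraction of samples where the single best-scoring class is correct."""
    return float(np.mean(np.argmax(scores, axis=1) == labels))

# Toy example: 4 video clips scored over 10 candidate action classes.
rng = np.random.default_rng(1)
scores = rng.random((4, 10))
labels = np.array([3, 7, 0, 9])
print(f"recall@5: {recall_at_k(scores, labels, k=5):.2f}")
print(f"top-1 accuracy: {top1_accuracy(scores, labels):.2f}")
```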

“By sharing this work, we aim to give researchers and developers access to the best models and benchmarks to help accelerate research and progress – ultimately leading to better and more capable AI systems that will help enhance people’s lives,” Meta said.

What This Means

As AI continues to move from digital applications into the physical world—through robotics, automation, and smart devices—physical reasoning becomes a critical capability. With V-JEPA 2, Meta is working to equip machines with a more human-like understanding of cause and effect, movement, and interaction.

By training on video and emphasizing predictive planning, models like V-JEPA 2 could help AI systems operate more safely and effectively in real-world settings, bringing long-term goals like advanced machine intelligence closer to reality.

The release also signals Meta’s interest in building shared tools for the research community, supporting broader progress toward AI systems that can reason more like people do.

With V-JEPA 2, Meta isn’t just building smarter models—it’s preparing AI to make decisions in the unpredictable, unscripted environments where people already live and work.

Editor’s Note: This article was created by Alicia Shapiro, CMO of AiNews.com, with writing, image, and idea-generation support from ChatGPT, an AI assistant. However, the final perspective and editorial choices are solely Alicia Shapiro’s. Special thanks to ChatGPT for assistance with research and editorial support in crafting this article.