# State of VLA Research at ICLR 2026

October 2025 • by Moritz Reuss

ICLR's open submission policy gives a rare, real-time view of what the community is actually building. This post distills the Vision-Language-Action (VLA) research slice of ICLR 2026: what 'counts' as a VLA (and why that definition matters), what people are working on in the VLA field (discrete diffusion, embodied reasoning, new tokenizers), how to read benchmark results in VLA research, and the not-so-invisible frontier gap that sim leaderboards hide.
Each autumn, ICLR publicly releases all anonymous submissions a few weeks after the deadline,
providing a unique real-time snapshot of ongoing research around the world without the typical six-month delay of other top ML conferences.
Given my personal research interests, I wanted to analyze Vision-Language-Action (VLA) Models research and share insights from this year's submissions.

## What is a Vision-Language-Action Model?

The definition of VLAs is surprisingly contentious, with no clear consensus in the community. A recent survey paper defines it broadly: "A Vision-Language-Action (VLA) model is a system that takes visual observations and natural language instructions as required inputs and may incorporate additional sensory modalities. It produces robot actions by directly generating control commands." While this is a valid definition, in my personal opinion it misses what I consider the crucial distinguishing feature compared to other multimodal policies: internet-scale pretraining on some type of vision-language data.

My personal definition: A VLA is a model that uses a pretrained backbone, which was trained on large-scale vision-language data, and is subsequently trained to generate control commands. These control commands can be robot joints, end-effector poses, steering angles for a car, latent actions, or mouse and keyboard commands for a virtual agent. This definition also includes Video-Action Policies that use pretrained video-generation models as their backbone. Without internet-scale pretraining, I refer to these models as multimodal policies rather than VLAs.

Where it becomes a bit fuzzy is when a model uses a pretrained text encoder like CLIP-text or T5 and a separately pretrained vision encoder like DINOv2 or SigLIP-Vision. I prefer to group these as multimodal policies and not VLAs, since they lack joint vision-language pretraining. For reference, I also included a flowchart (Figure 1) to help you determine whether your model is a VLA based on my definition.

This distinction matters because internet-scale pretraining is what theoretically gives VLAs their moat: stronger language-instruction following and better generalization across tasks and environments. That's the promise, at least. The reality? Most current VLAs still struggle with zero-shot generalization and complex tasks, making them better described as "less dumb multi-modal policies" rather than truly general robot brains. But the potential is there, and the field has many exciting open problems for researchers to tackle.

Related and complementary: Large Behavior Models (LBMs), a term from Toyota Research Institute (TRI) defined in their recent paper, represent another category. LBMs are robot policies trained on large-scale multitask robotic demonstration data, but they don't require internet-scale vision-language pretraining or a VLM backbone. Think of it this way: all VLAs trained on large-scale robotic data are also LBMs, but not all LBMs are VLAs. Together, these two terms cover all types of robot foundation policies. I also included a flowchart (Figure 2) to help you determine whether your model is an LBM based on my understanding.

What's your personal definition of a VLA? Do you agree with my take on internet-scale pretraining as the key differentiator? Let me know your thoughts!

## The Explosive Growth of VLA Research

The Vision-Language-Action field has experienced remarkable growth over the past two years. Based on an OpenReview keyword search for "Vision-Language-Action" across ICLR submissions, the numbers tell an interesting story:
This exponential growth trajectory shows that Vision-Language-Action models are rapidly gaining popularity, with many newcomers from other domains like vision entering the exciting field of robot learning. Given this trend, I'm both excited and a bit terrified to review the projected 2,100+ VLA submissions at ICLR 2027, though I suspect the growth rate may stabilize as the field matures.

## Practitioner's Guide to Understanding Benchmark Results in VLA Research
I want to provide a quick guide for practitioners on how to interpret benchmark results in VLA papers.
So you are reading a new VLA paper and want to understand if the claimed results are actually good or not?
In order to better understand the VLA results, it is important to have some context on the current state of popular VLA benchmarks and what good performance looks like.

## ICLR 2026 VLA Research Trends

After scanning most ICLR 2026 submissions with the VLA keyword, I identified the following key trends in VLA research. There is significant overlap between these categories, as many papers combine multiple ideas—for example, discrete diffusion with embodied reasoning, or efficient architectures with new tokenizers. Below, I highlight some notable papers in each category along with brief comments. Please note that this is not an exhaustive list, and I may have missed many excellent works.

### 1. Discrete Diffusion VLAs

Given the success of discrete diffusion models in other domains like text (e.g., MDLM) and VLMs (e.g., LLaDA-V), it is no surprise that this trend is also making its way into VLA research. Why discrete diffusion? Compared to autoregressive models, diffusion models can generate sequences in parallel, which is a big advantage for discrete action token generation. Instead of having to run your policy 100 times, you can generate long action sequences in a few forward passes. In addition, you can combine it with ideas from Embodied Chain-of-Thought (see next section) to generate sub-goals and reasoning together with actions in parallel. This tackles one of the biggest limitations of prior ECoT work, which was extremely slow due to the autoregressive nature of the VLM backbones. Most current attempts finetune an autoregressive VLM with a discrete diffusion objective, since the variety of pretrained discrete diffusion VLMs is still very limited; other work uses LLaDA-V as a pretrained backbone with good success. Below are four concurrent papers that all propose different discrete diffusion VLAs with promising results on LIBERO and SIMPLER (a minimal decoding sketch follows the paper list). Notable papers:
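To make the parallel-decoding argument concrete, here is a minimal sketch of masked-diffusion-style action decoding. It is my own illustration, not any submission's implementation: the `model`, mask token id, chunk length, and unmasking schedule are all assumptions for the example.

```python
import torch

MASK_ID = 0        # assumed id of the [MASK] token
CHUNK_LEN = 32     # number of discrete action tokens in one chunk


def decode_action_chunk(model, obs_tokens, num_steps: int = 4):
    """Unmask a whole action chunk in a few parallel forward passes."""
    tokens = torch.full((1, CHUNK_LEN), MASK_ID, dtype=torch.long)
    for step in range(num_steps):
        logits = model(obs_tokens, tokens)       # (1, CHUNK_LEN, vocab_size)
        probs = logits.softmax(dim=-1)
        conf, pred = probs.max(dim=-1)           # best token + confidence per slot
        # Commit the most confident predictions, re-mask the rest,
        # unmasking a larger fraction of the chunk at every step.
        keep = int(CHUNK_LEN * (step + 1) / num_steps)
        threshold = conf.topk(keep, dim=-1).values[..., -1:]
        tokens = torch.where(conf >= threshold, pred, torch.full_like(pred, MASK_ID))
    return tokens  # all CHUNK_LEN tokens after only `num_steps` passes
```

An autoregressive VLA would need `CHUNK_LEN` sequential forward passes for the same chunk; here the whole chunk is produced in `num_steps` passes, which is where the inference-speed advantage comes from.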
### 2. Reasoning VLAs and Embodied Chain-of-Thought (ECoT)

Reasoning holds strong promise for improving the generalization and performance of VLAs, which still struggle with complex tasks and out-of-distribution scenarios. Inspired by the success of Chain-of-Thought prompting in LLMs, there is growing interest in applying similar ideas to VLAs. The core idea is to bridge action generation with intermediate visual and textual reasoning steps that help the VLA better ground and understand the task and environment. These reasoning traces are also more interpretable and can be used for debugging and understanding a VLA's behavior. Since the first ECoT paper (CoRL 2024), interest has grown in combining spatially grounded reasoning with action prediction to improve VLAs. By combining subtasks, bounding-box predictions for task-relevant objects, and 2D motion trajectories, VLMs learn better representations for embodied tasks and show improved performance on generalization benchmarks (a toy example of such a reasoning target follows the paper list). Prior analyses of ECoT training indicate that these objectives help bridge the representation gap between static VLM pretraining and robotics tasks. However, a key limitation of prior ECoT work is the autoregressive nature of VLAs and the increased token count, which results in slower training and inference. Overall, it remains an open question how to best implement grounded reasoning for VLAs. Recent work has explored additional modalities like depth prediction in MolmoAct. A major bottleneck is the limited availability of diverse training data: many ECoT studies still rely on the same BRIDGE and LIBERO labeled datasets. More diverse datasets with more complex tasks and environments are needed to push this direction further; however, labeling large-scale sources like DROID is tough. Notable papers:
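As a toy illustration of what such an embodied reasoning target can look like, the snippet below serializes a subtask, an object bounding box, a coarse 2D motion trace, and the discrete action tokens into one training string. The field names, tags, and values are made up for this example and are not the format of ECoT or any specific submission.

```python
# Hypothetical embodied chain-of-thought training target for one timestep.
ecot_target = {
    "task": "put the red block in the drawer",
    "subtask": "grasp the red block",
    "object_bbox": [212, 148, 255, 190],                # pixel coords of the target object
    "motion_2d": [[230, 170], [231, 158], [228, 131]],  # coarse end-effector trace
    "action_tokens": [103, 47, 211, 9, 84],             # discrete action chunk
}


def to_training_string(sample: dict) -> str:
    """Flatten the reasoning fields into the text the VLA is trained to emit."""
    return (
        f"SUBTASK: {sample['subtask']} "
        f"BBOX: {sample['object_bbox']} "
        f"MOVE: {sample['motion_2d']} "
        f"ACTION: {sample['action_tokens']}"
    )


print(to_training_string(ecot_target))
```

Emitting all of these fields autoregressively is exactly what inflates the token count and slows inference, which is why the discrete diffusion approaches above are attractive for ECoT.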
### 3. New Tokenizers

We command robots with high-frequency, continuous control values (e.g., joint angles, gripper state). Vision-Language Models, however, operate most effectively on discrete tokens. Naively fine-tuning a VLM to regress continuous actions tends to underperform and often induces catastrophic forgetting, because the new objective misaligns with the model's pretrained representations. The core idea of these tokenizers is to convert continuous action sequences into compact discrete tokens that a VLM can predict—retaining accuracy and smoothness while minimizing compute and integration overhead. An ideal action tokenizer is fast, achieves a high compression ratio for long action chunks, produces smooth long-horizon outputs, and drops into existing VLM architectures without modification. Prior work used discrete binning (e.g., RT-1) and VQ-VAE codebooks, but both struggle with either coarse precision or long-sequence efficiency. FAST introduced action-chunk tokenizers tailored for VLA prediction, demonstrating that discrete tokens can replace more complex diffusion/flow experts for integration. Building on this, newer tokenizers submitted to ICLR combine Residual Vector Quantization (RVQ) (e.g., SoundStream) for higher compression, spline-based parameterizations inspired by the BEAST tokenizer for smooth, long trajectories, and DCT-style objectives (as in FAST) to bias toward low-frequency, physically plausible motion (a minimal DCT sketch follows the paper list). I'm excited to test these tokenizers myself when they release. Notable papers:
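Below is a hedged sketch of the transform-and-quantize core behind DCT-style action tokenizers: move an action chunk into the frequency domain, quantize the coefficients, and treat them as discrete tokens. The constants (bin count, clipping range, chunk size) are illustrative choices, not values from FAST or any submitted paper.

```python
import numpy as np
from scipy.fft import dct, idct

NUM_BINS = 128      # assumed quantization resolution per coefficient
COEF_RANGE = 4.0    # assumed clipping range for normalized DCT coefficients


def encode(chunk: np.ndarray) -> np.ndarray:
    """chunk: (T, action_dim) normalized actions -> flat array of token ids."""
    coeffs = dct(chunk, axis=0, norm="ortho")            # low-frequency energy first
    clipped = np.clip(coeffs, -COEF_RANGE, COEF_RANGE)
    tokens = np.round((clipped + COEF_RANGE) / (2 * COEF_RANGE) * (NUM_BINS - 1))
    return tokens.astype(np.int64).ravel()


def decode(tokens: np.ndarray, horizon: int, action_dim: int) -> np.ndarray:
    """Invert the quantization and the DCT to recover a smooth action chunk."""
    coeffs = tokens.reshape(horizon, action_dim).astype(np.float64)
    coeffs = coeffs / (NUM_BINS - 1) * (2 * COEF_RANGE) - COEF_RANGE
    return idct(coeffs, axis=0, norm="ortho")


# Round-trip example on a random 16-step, 7-DoF action chunk.
chunk = np.random.uniform(-1, 1, size=(16, 7))
recovered = decode(encode(chunk), horizon=16, action_dim=7)
print(np.abs(chunk - recovered).max())   # small quantization-only error
```

A real tokenizer like FAST additionally truncates high-frequency coefficients and entropy-codes the result to actually compress the sequence; this sketch shows only the part that biases the representation toward low-frequency, smooth motion.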
### 4. Efficient VLAs

As someone working on this topic myself, I understand the pain of trying to train and run larger VLAs on limited compute setups. Thus, efficient VLAs are always relevant, especially since they give labs with limited compute access the chance to work on VLAs too. There are several interesting papers this year that try to tackle this problem with different approaches. Generally, one can divide them into two categories: papers that make training and the models themselves more efficient (smaller VLAs, better tokenizers, etc.), and papers that make inference more efficient through quantization, distillation, or similar ideas (a small quantization example follows the paper list). Notable papers:
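As one example of the inference-efficiency levers mentioned above, the snippet applies PyTorch's built-in dynamic int8 quantization to the linear layers of a stand-in policy head. The module is purely illustrative; quantizing an actual VLM backbone usually needs more careful, method-specific treatment than this one-liner.

```python
import torch
import torch.nn as nn

# Stand-in for the action head of a VLA; sizes are illustrative.
policy_head = nn.Sequential(
    nn.Linear(1024, 1024),
    nn.GELU(),
    nn.Linear(1024, 7 * 16),   # 7-DoF actions over a 16-step chunk
)

# Post-training dynamic quantization: weights stored in int8,
# activations quantized on the fly at inference time.
quantized_head = torch.ao.quantization.quantize_dynamic(
    policy_head, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 1024)
print(quantized_head(x).shape)   # torch.Size([1, 112])
```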
### 5. RL for VLAs

Finetuning VLAs to go from a 70-80% success rate in the real world to 99% is still an open problem. There is a lot of hope for RL finetuning to close this gap. While many attempts have been made before, no approach has established itself as the go-to method yet. This year, there are several interesting papers that try to tackle this problem with different approaches. Notable papers:
### 6. VLA + Video Prediction

Video generation models learn rich representations of temporal dynamics and physical interactions, which may provide useful priors for robotic control. Following strong results from the GR-1 paper at ICLR 2024, which demonstrated the potential of video-based policies, interest in this subfield has grown. These policies generally fall into two categories: (1) starting with a VLM that has optionally been pretrained on image/video generation, then continuing training with future frame and action prediction; or (2) starting from a Video Foundation Model and modifying it to also generate actions. Since most state-of-the-art video foundation models are diffusion/flow-based, these policies typically struggle with slow inference speed. Overall, results demonstrate that video generation—and the physics understanding and language grounding it requires—provides a valuable prior for robot learning. Compared to VLAs initialized from VLMs, this subfield is far less popular, and I hope to see more research in this direction. What holds back progress are the high computational requirements for finetuning state-of-the-art video models like Wan, which exceed even those of VLM-based VLA finetuning. Notable papers:
### 7. Evaluation and Benchmarking of VLAs

As mentioned above, the current state of VLA benchmarks is quite saturated, and it is hard to judge which model is actually better given the limited number of benchmarks and the fact that most papers only compare against a few other baselines. Luckily, there are several submissions that try to bridge this gap by introducing new benchmarks for VLAs. Other ideas include real2sim world models to test policies in generative environments. While I don't think these ideas are mature enough yet to serve as real alternatives, it's a very exciting research area, and I hope to see more progress here in the future. Notable papers:
### 8. Cross-Action-Space Learning

Most VLAs still avoid pretraining on diverse action spaces, given the difficulties of getting any positive transfer results. Thus, it is a very exciting research area for current VLAs to improve on. In addition, there is growing interest in using human egocentric videos with action labels for pretraining VLAs. Datasets like EgoDex, released earlier this year, now enable more research in this direction. Several interesting papers submitted this year try to tackle this problem with different approaches. They either focus on architectural details of VLAs to better handle heterogeneous action spaces (a small sketch of one such pattern follows the paper list) or use additional abstractions like motion in image space to get better transfer. It's noteworthy that the recent Gemini Robotics 1.5 release from DeepMind indicates that an unreleased technique called motion transfer works for them to get positive zero-shot task transfer between action spaces. So maybe this is just a data and model scale question. Nevertheless, more research is needed to better understand and tackle these issues. Notable papers:
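One common architectural pattern for heterogeneous action spaces is a shared trunk with per-embodiment action heads, sketched below. The embodiment names, dimensions, and feature sizes are illustrative assumptions, not a reconstruction of any submitted method.

```python
import torch
import torch.nn as nn

# Action dimensionalities per embodiment (illustrative numbers).
ACTION_DIMS = {"franka_arm": 7, "bimanual_humanoid": 26, "mobile_base": 3}


class MultiEmbodimentPolicy(nn.Module):
    """Shared feature trunk with one action head per action space."""

    def __init__(self, feature_dim: int = 512):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(feature_dim, 512), nn.GELU())
        self.heads = nn.ModuleDict(
            {name: nn.Linear(512, dim) for name, dim in ACTION_DIMS.items()}
        )

    def forward(self, features: torch.Tensor, embodiment: str) -> torch.Tensor:
        # Route the shared representation to the head matching the robot.
        return self.heads[embodiment](self.trunk(features))


policy = MultiEmbodimentPolicy()
obs_features = torch.randn(1, 512)                    # stand-in for VLM features
print(policy(obs_features, "franka_arm").shape)       # torch.Size([1, 7])
```

The hope with such designs is that positive transfer happens in the shared trunk, while the heads absorb the embodiment-specific differences; whether that actually works at scale is exactly what this research area is probing.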
### 9. Other Interesting Papers

There are several other interesting papers that don't fit neatly into the categories above but are worth mentioning. They explore varied aspects of VLA design, from the choice of VLM backbone to adding memory modules into the policy. I'm especially interested in memory: most VLAs encode only the current image and ignore prior timesteps, which is a major limitation for many tasks. Naively feeding long histories into VLAs often backfires: models overfit to demonstrator-specific trajectories, and during rollouts the agent rarely encounters the same state sequences, leading to large performance drops. By contrast, memory modules that aggregate and compress past context (rather than memorizing it) look promising. Hopefully this makes learned policies more robust to distribution shift while preserving the temporal cues needed for long-horizon control. Another work I wanted to highlight tackles the challenge of composing multiple policies at test time to improve performance. Diffusion and flow-based VLAs are required for this, since their energy-based formulation allows combining multiple models by summing their scores (a minimal sketch follows the paper list). This is a promising direction to improve performance without additional training, and I hope to see more work here. Notable papers:
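To illustrate the score-summing idea, here is a hedged sketch of DDPM-style ancestral sampling in which the noise predictions of two hypothetical diffusion policies are combined with a weighted sum at every denoising step. The noise schedule, interfaces, and weights are assumptions for the example, not any paper's implementation.

```python
import torch


def compose_and_sample(policy_a, policy_b, obs, steps=50, w_a=0.5, w_b=0.5,
                       action_dim=7, horizon=16):
    """DDPM-style sampling driven by the weighted sum of two noise predictors."""
    betas = torch.linspace(1e-4, 0.02, steps)
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    x = torch.randn(1, horizon, action_dim)          # start from pure noise
    for t in reversed(range(steps)):
        # Combine the two policies' predicted noise (proportional to their scores).
        eps = w_a * policy_a(x, obs, t) + w_b * policy_b(x, obs, t)
        # Standard DDPM posterior mean using the combined noise estimate.
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise
    return x   # composed action chunk
```

Because the update only touches the predicted noise, the same trick works for any number of compatible diffusion or flow policies that share an action space and noise schedule, which is what makes test-time composition attractive.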
## The Hidden Gap Between Frontier and Research VLAs
On paper (pun intended), the gap seems small: on simulation setups (LIBERO, CALVIN), open-source VLAs surpass popular frontier baselines like Pi0.5.
In practice, there is still a much larger gap that shows up precisely where current papers rarely evaluate: zero-shot, open-world behavior after pretraining.
Two weeks ago at CoRL, the Gemini-Robotics VLA demo attempted a wide range of novel tasks with arbitrary objects and paraphrased language.
My own VLA FLOWER is state-of-the-art on all CALVIN benchmarks, but nowhere close to that level of zero-shot robustness.
Simulation benchmarks hide this delta, and current simulation setups don't optimize for this objective.

Why the gap exists (based on what's visible from papers, discussions with peers, and personal experience):
Counterarguments against this thesis:
What would help bridge this gap without blowing up the compute and manpower budget:
I want to clearly highlight that I don't think simulation or local finetuning are useless for research; on the contrary, they are very important for many parts of robot learning. They are just a poor proxy for the main selling point of VLAs: robust zero-shot behavior in messy, new environments.

## Summary and Outlook

Overall, I am very positive about the current state of VLA research and the progress being made in this field. The trends above show strong interest and contributions across VLA models—from architecture design to training strategies and evaluation methods. However, aside from the zero-shot performance gap, there are a few disappointing aspects of the current state of VLA research that are worth highlighting.

Two Underrepresented Problems in Current VLA Research:
Despite these gaps, I remain optimistic that the field will continue to grow and evolve rapidly. The explosive growth in submissions and the convergence on promising directions like discrete diffusion and embodied reasoning suggest that VLA research is maturing quickly. As we address these fundamental challenges around data quality and contextual learning, we'll move closer to VLAs that can truly generalize in the messy, unstructured environments where robots need to operate.

## Cite this post

If you'd like to cite this article, use the BibTeX below: