The retained evidence follows the manipulated object and end effectors while native attention stays diffuse.
AIRoA · Robot Learning
See Less, Specify More
Broad reasoning. Focused control.
Robot data is scarce and embodiment-specific, while off-the-shelf VLMs bring broader priors for instruction and scene reasoning. S2 turns that asymmetry into a cleaner executor interface: preserve the goal, specify the local behavior, and restrict control to learned task-sufficient visual evidence.
Overview video
Why the executor needs a cleaner interface.
Three-minute walkthrough of the motivation, S2 interface, learned visual evidence masks, and real-robot results.
Real-robot result
A cleaner executor interface raises real-robot success.
Visual evidence budget
The policy learns what control needs to see.
See Less is not a visualization of native attention or a hand-labeled crop. It imposes an explicit budget on base and wrist visual tokens, then learns soft keep masks from action prediction alone, without annotation. The retained evidence centers on task-sufficient cues: objects, contact regions, end effectors, and destination context.
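As a rough illustration of what a soft keep mask under an explicit budget could look like, here is a minimal NumPy sketch. It is not the authors' implementation: the function names, the sigmoid gate, and the hinge-style penalty are assumptions chosen for clarity, and the gate logits are hard-coded rather than learned from action prediction.

```python
import numpy as np

def soft_keep_mask(gate_logits: np.ndarray) -> np.ndarray:
    """Map per-token gate logits to soft keep weights in (0, 1) (assumed sigmoid gate)."""
    return 1.0 / (1.0 + np.exp(-gate_logits))

def budget_penalty(mask: np.ndarray, rho: float) -> float:
    """Penalize only the part of the mean keep weight that exceeds the budget rho."""
    return float(max(0.0, mask.mean() - rho))

def apply_mask(tokens: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Scale each visual token embedding by its keep weight."""
    return tokens * mask[:, None]

# Hypothetical example: 8 visual tokens with 4-dimensional embeddings.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 4))

# Hand-picked logits standing in for a learned gate: two tokens kept, six suppressed.
gate_logits = np.array([4.0, 4.0, -4.0, -4.0, -4.0, -4.0, -4.0, -4.0])

mask = soft_keep_mask(gate_logits)
masked_tokens = apply_mask(tokens, mask)
penalty = budget_penalty(mask, rho=0.2)
```

In a training loop, a term like `penalty` would be added to the action-prediction loss so that the gate is pushed toward the budget while the policy still has to predict actions from `masked_tokens`.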
The budget shifts across shelf, object, gripper, and destination context as the local task phase changes.
OOD stress rollouts
The target stays grounded as the scene changes.
The ordered clothes-sorting task is to place green socks, then the handkerchief, then yellow socks into the basket. Most tabletop objects here never appear in training; as a person moves objects and basket positions during execution, S2 keeps the current garment and destination grounded.
Method framing
The interface, not just the policy, is the bottleneck.
Coarse language and full images force the executor to resolve avoidable ambiguity. S2 changes that contract: preserve task identity, specify the local execution mode, and budget visual evidence to the task-relevant subset.
Preserve task identity
Keep the original instruction as the stable goal, so local guidance cannot become a different task.
Specify the current phase
Use local language to disambiguate the execution mode: approach, grasp, transfer, place, or recover.
Budget visual evidence
Learn which base and wrist visual tokens are sufficient for action prediction under a fixed budget.
Act from a cleaner state
Predict actions from a smaller language-and-vision problem instead of broad, underspecified context.
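The four steps above can be read as a contract on what the executor receives. The sketch below is one hypothetical way to encode that contract in Python; `Phase`, `ExecutorInput`, `advance`, and the prompt layout are illustrative names and choices, not part of S2's released interface. The point it shows is that the goal stays fixed while only the phase and local instruction change.

```python
from dataclasses import dataclass, replace
from enum import Enum

class Phase(Enum):
    """Local execution modes named in the text."""
    APPROACH = "approach"
    GRASP = "grasp"
    TRANSFER = "transfer"
    PLACE = "place"
    RECOVER = "recover"

@dataclass(frozen=True)
class ExecutorInput:
    """Hypothetical executor-side state: stable goal plus local guidance."""
    goal: str               # original instruction, never rewritten
    phase: Phase            # current local execution mode
    local_instruction: str  # refined subtask-level language

    def language_context(self) -> str:
        """Assemble the language context with the goal always listed first."""
        return (
            f"Goal: {self.goal}\n"
            f"Phase: {self.phase.value}\n"
            f"Local: {self.local_instruction}"
        )

def advance(state: ExecutorInput, phase: Phase, local_instruction: str) -> ExecutorInput:
    """Move to a new phase; the goal field is carried over unchanged."""
    return replace(state, phase=phase, local_instruction=local_instruction)

start = ExecutorInput(
    goal="Place green socks, then the handkerchief, then yellow socks into the basket.",
    phase=Phase.APPROACH,
    local_instruction="Move the gripper toward the green socks.",
)
nxt = advance(start, Phase.GRASP, "Grasp the green socks.")
```

Making the dataclass frozen is a design choice for the sketch: local guidance can only produce a new state through `advance`, which has no way to alter `goal`.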
Training signal
Evidence is grounded by action loss, not annotation.
The gate sees visual tokens and goal-preserving language context, then learns what can be suppressed while action prediction remains accurate.
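One plausible way to write this training signal down is shown below. The exact objective is not given here, so the form is an assumption: $v$ denotes the base and wrist visual tokens, $\ell$ the goal-preserving language context, $m_\phi \in (0,1)^N$ the soft keep mask produced by the gate, $\pi_\theta$ the executor, $a$ the demonstrated action, $\rho$ the visual budget, and $\lambda$ a hypothetical penalty weight.

```latex
\[
\mathcal{L}(\theta, \phi)
  = \mathcal{L}_{\text{action}}\bigl(\pi_\theta(m_\phi \odot v,\ \ell),\ a\bigr)
  + \lambda \,\max\Bigl(0,\ \frac{1}{N}\sum_{i=1}^{N} m_{\phi,i} - \rho\Bigr)
\]
```

Under this reading, the only supervision on the mask is indirect: suppressing a token is free with respect to the second term, and it is discouraged only when it raises the action loss in the first term.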
Ablations
S2 is not just more language.
The ablation separates two failure modes. Replacing the original goal with richer local text can improve object robustness while damaging task identity; visual budgeting helps most when the goal is preserved and the local mode is specified.
Only full S2 stays high on both suites.
Goal measures whether task identity is preserved. Object measures whether the executor remains robust to object perturbations.
ρ = 0.2 is the measured default.
The sweep curve is shallow but consistent: a budget that is too small under-specifies control, while looser budgets reintroduce nuisance visual context.
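To make the budget value concrete, the sketch below converts a budget ρ into a token count and selects the highest-scoring tokens. This is only an illustration of what a fraction like 0.2 means in practice: the hard top-k selection, the function names, and the token counts are assumptions, whereas the text describes learned soft masks.

```python
import math
import numpy as np

def tokens_kept(num_tokens: int, rho: float) -> int:
    """Number of tokens allowed under budget rho (rounded up)."""
    return math.ceil(rho * num_tokens)

def top_k_keep(scores: np.ndarray, rho: float) -> np.ndarray:
    """Boolean keep mask over the highest-scoring tokens within the budget."""
    k = tokens_kept(scores.shape[0], rho)
    keep = np.zeros(scores.shape[0], dtype=bool)
    if k > 0:  # guard: a slice of [-0:] would select every token
        keep[np.argsort(scores)[-k:]] = True
    return keep

# Hypothetical keep scores for 10 visual tokens.
scores = np.array([0.1, 0.9, 0.3, 0.05, 0.8, 0.2, 0.15, 0.4, 0.25, 0.35])
keep = top_k_keep(scores, rho=0.2)

# Compare how many of 256 hypothetical tokens survive at a few budgets.
sweep = {rho: tokens_kept(256, rho) for rho in (0.1, 0.2, 0.4)}
```

With these made-up scores, `keep` retains the two tokens scored 0.9 and 0.8, which is 20% of the ten inputs.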
Real robots
One interface across bimanual and mobile manipulation.
TX-G2 tests arm selection and precise tabletop control. HSR adds navigation, transport, and tight placement geometry. The same executor interface improves over π0.5 on every reported task.
Autonomous HSR rollouts
HSR completes the full instruction, not just a phase.
Four autonomous HSR executions from the evaluation setting, shown at 3× speed. The rollouts carry navigation, grasping, transport, and constrained placement through completion.
Paper abstract
A cleaner interface for the executor.
Generalization remains a central bottleneck for VLA models: under distractors, appearance shifts, and semantically similar tasks, the policy must infer local execution details from coarse instructions while also deciding which parts of the image matter for control.
S2 improves generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while adding refined trajectory- and subtask-level language. See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context.
BibTeX
@misc{wu2026s2,
  title         = {See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs},
  author        = {Wu, Yueh-Hua and Matsushima, Tatsuya and Ota, Kei},
  year          = {2026},
  eprint        = {TBA},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO}
}