Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner conditioning interface.
Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning.
Across our main evaluation settings, S2 improves overall generalization by changing the executor's learning problem. Three findings support this: coarse instructions induce avoidable supervision aliasing; goal-preserving local guidance outperforms instruction replacement in our main ablations; and explicit evidence budgeting reduces dependence on broad visual context, a benefit that goes beyond efficiency.
VLA executors carry too much burden: a coarse instruction underspecifies which local behavior should be executed, and an unconstrained image exposes the policy to clutter that doesn't matter for control. S2 shrinks the executor's learning problem from both sides through one consistent train-test interface.
The original instruction stays as the stable high-level goal — we don't replace it. We add refined trajectory- and subtask-level language that makes the current execution mode explicit. The executor learns under goal-preserving hierarchical conditioning, so task identity persists while ambiguity about how to act now is resolved.
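As a minimal sketch of what goal-preserving hierarchical conditioning could look like at the text level, the snippet below keeps the original instruction and appends the two refined levels, and contrasts that with an instruction-replacement prompt that drops the goal. The class name, field labels, prompt layout, and example strings are all illustrative assumptions, not the format S2 actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HierarchicalInstruction:
    """Original goal plus refined trajectory- and subtask-level language."""

    goal: str        # original coarse instruction, kept verbatim
    trajectory: str  # refined description of how this trajectory solves the goal
    subtask: str     # refined description of the current execution mode

    def to_prompt(self) -> str:
        # Hypothetical layout: the "Goal/Trajectory/Current subtask" labels
        # are made up for this sketch.
        return (
            f"Goal: {self.goal}\n"
            f"Trajectory: {self.trajectory}\n"
            f"Current subtask: {self.subtask}"
        )


def replacement_prompt(instr: HierarchicalInstruction) -> str:
    """Instruction-replacement variant: local text only, task identity dropped."""
    return f"Current subtask: {instr.subtask}"


# Illustrative strings only; they are not taken from the S2 dataset.
example = HierarchicalInstruction(
    goal="sort the clothes into the baskets",
    trajectory="place the green socks first, then the handkerchief, then the yellow socks",
    subtask="grasp the green socks",
)
prompt = example.to_prompt()
print(prompt)
```

The point of the contrast is that `to_prompt` always carries the goal line, so two tasks that share a subtask such as "grasp the green socks" still produce different conditioning text.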
A learned soft mask gates each visual token, with a budget regularizer that controls how much evidence is retained. No region annotations, no external VLM supervision — the executor discovers which evidence the control objective itself rewards keeping. Useful evidence may be a contact surface or clearance cue, not a whole object.
Because the planner communicates in the same language form that the executor is trained to follow, off-the-shelf VLMs can be swapped in without retraining. We demonstrate this by running S2 with both GPT-5.4 nano and the open-source Kimi K2.5 as the planner — performance is comparable across the two.
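One way to picture the planner swap is as a narrow interface: the executor only ever sees the goal and the current local instruction, so any planner that emits text in that form can be plugged in. The sketch below uses a scripted stand-in instead of a real VLM; `Planner`, `ScriptedPlanner`, `run_episode`, and `dummy_executor` are hypothetical names introduced for illustration, and no real model API is called.

```python
from typing import Callable, List, Protocol, Sequence


class Planner(Protocol):
    """Anything that maps the goal and progress so far to local guidance."""

    def next_subtask(self, goal: str, completed: Sequence[str]) -> str:
        ...


class ScriptedPlanner:
    """Hypothetical stand-in for a VLM planner: replays a fixed subtask list."""

    def __init__(self, steps: Sequence[str]) -> None:
        self._steps = list(steps)

    def next_subtask(self, goal: str, completed: Sequence[str]) -> str:
        # Clamp to the last step once the script is exhausted.
        return self._steps[min(len(completed), len(self._steps) - 1)]


def run_episode(
    planner: Planner,
    executor: Callable[[str, str], str],
    goal: str,
    num_steps: int,
) -> List[str]:
    """Query the planner each step and pass (goal, subtask) to the executor."""
    completed: List[str] = []
    actions: List[str] = []
    for _ in range(num_steps):
        subtask = planner.next_subtask(goal, completed)
        actions.append(executor(goal, subtask))
        # Simplification: assume every step finishes its subtask.
        completed.append(subtask)
    return actions


def dummy_executor(goal: str, subtask: str) -> str:
    """Placeholder for the low-level VLA; it just echoes its conditioning."""
    return f"act(goal={goal!r}, subtask={subtask!r})"


planner_a = ScriptedPlanner(["reach the sock", "grasp the sock"])
planner_b = ScriptedPlanner(["approach the sock", "close the gripper"])
actions_a = run_episode(planner_a, dummy_executor, "sort the clothes", 3)
actions_b = run_episode(planner_b, dummy_executor, "sort the clothes", 3)
```

Both planners drive the same `dummy_executor` unchanged, which mirrors the claim that the planner can be replaced without retraining the executor.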
Architecture. For each camera view, the executor predicts a soft mask from visual tokens and goal-preserving language context. The gated representation is trained with a task loss together with a budget regularizer, with no region, box, or mask annotation. A nonzero gate floor, an ungated parallel path, and an annealed temperature schedule prevent collapse to trivial all-keep or all-drop solutions.
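The gate floor and temperature schedule can be illustrated with a small, framework-free sketch. It assumes a sigmoid gate, a linear anneal, and the constants shown; none of these specifics are given in the text above, so treat them as placeholders. The ungated parallel path is not shown here.

```python
import math
from typing import List, Sequence, Tuple


def _sigmoid(x: float) -> float:
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def soft_gate(logit: float, temperature: float, floor: float) -> float:
    """Map a gate logit to a keep value in [floor, 1].

    The nonzero floor keeps every token from being dropped completely.
    """
    return floor + (1.0 - floor) * _sigmoid(logit / temperature)


def annealed_temperature(
    step: int, total_steps: int, t_start: float = 2.0, t_end: float = 0.5
) -> float:
    """Linear anneal from t_start to t_end (schedule shape is an assumption)."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return t_start + frac * (t_end - t_start)


def gate_tokens(
    tokens: Sequence[Sequence[float]],
    logits: Sequence[float],
    temperature: float,
    floor: float = 0.05,
) -> Tuple[List[List[float]], List[float]]:
    """Scale each visual token by its soft keep value."""
    keeps = [soft_gate(l, temperature, floor) for l in logits]
    gated = [[k * x for x in tok] for tok, k in zip(tokens, keeps)]
    return gated, keeps
```

A lower temperature sharpens the gate toward keep-or-drop decisions, while the floor bounds how far any single token can be suppressed.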
The learned visual evidence budget concentrates on behavior-relevant objects, contact regions, and local context. Native VLM attention, by contrast, remains diffuse and often allocates mass to broad or weakly task-related regions. The masks are learned without any region or mask annotation.
Qualitative comparison between S2's learned visual evidence budget (VEB) and the backbone's native attention on representative real-robot observations. The VEB concentrates on what control actually depends on; native attention often does not.
Eight tasks across two robots: bimanual manipulation on TX-G2 (AgiBot G2-compatible) and mobile manipulation on HSR. Every method gets the same 150k fine-tuning budget. S2 improves over π0.5 on every single task.
Platforms. TX-G2 (left) is bimanual, so the policy must also infer which arm should execute the current behavior. HSR (right) adds locomotion and is queried at 2 Hz; TX-G2 is queried at 10 Hz.








Beyond the standard placements, we stress-test S2 on a deliberately cluttered TX-G2 clothes-sorting setup: most tabletop objects are unseen distractors, and a person perturbs the scene during execution. S2 keeps acting on the instructed garment and basket, completing the ordered sequence in spite of these perturbations.
Cluttered rollout. Ordered task: green socks → handkerchief → yellow socks. Flowers, fruit, and bottles are unseen distractors. A person changes object and basket positions online during execution. S2 re-grounds the current target at each step and preserves the required completion order.
Goal-preserving language > instruction replacement.
Simply rewriting the coarse instruction with richer local text doesn't solve the identity problem. Only the goal-preserving hybrid (original instruction plus refined local language) consistently wins on both libero_goal and libero_object.
Mean-only object/goal ablation on LIBERO-PRO. Refined-only supervision remains weak on libero_goal; recovering task identity without the learned visual evidence budget (VEB) still underperforms the full S2 interface.
The visual budget has a stable sweet spot.
Sweeping the shared visual evidence budget shows that a wide range of budgets is competitive, with the best mean at ρ = 0.2. This is the default S2 setting across all experiments.
LIBERO-PRO mean success vs. shared visual evidence budget ρ.
1. Hierarchical relabeling (Specify More). For each demonstration, a VLM rewrites the original task into one trajectory-specific instruction that describes how this trajectory solves the original task, adding visually supported execution detail (approach, contact, ordering, placement) without changing task identity. The trajectory is then decomposed into an ordered sequence of refined subtasks aligned to the action timeline.
2. Token-level evidence gating (See Less). Lightweight gate heads inside the executor predict a soft keep value for each image token from the token itself, a pooled summary of goal-preserving language context, and their product. A nonzero gate floor and annealed temperature stabilize learning.
3. Coupled training objective. The full S2 loss combines an ungated task loss (preserves stability), a gated task loss (forces the retained evidence to remain sufficient for action prediction), and a per-view budget regularizer (constrains average retention). Together they learn a control-grounded bottleneck rather than a generic saliency map.
4. Deployment. An off-the-shelf VLM emits local guidance in the same form as training-time subtasks via in-context learning. The low-level VLA executes the current detailed instruction under the learned visual evidence budget. Planners can be swapped without changing the executor.
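Steps 2 and 3 above can be sketched in a few lines of plain Python. The gate input follows the description (token, pooled language summary, and their product), while the squared-hinge form of the budget penalty, the loss weights `lambda_gated` and `lambda_budget`, and the assumption that tokens and the language summary share one dimension are guesses made for illustration. Only the default ρ = 0.2 comes from the text.

```python
from typing import List, Sequence


def gate_features(token: Sequence[float], lang_summary: Sequence[float]) -> List[float]:
    """Gate head input: token, pooled language summary, and their product.

    Assumes both vectors have the same length (a simplification).
    """
    product = [t * l for t, l in zip(token, lang_summary)]
    return list(token) + list(lang_summary) + product


def budget_penalty(keeps_per_view: Sequence[Sequence[float]], rho: float) -> float:
    """Per-view penalty on mean retention above rho, averaged over views.

    The squared-hinge form is an assumption, not the published regularizer.
    """
    total = 0.0
    for keeps in keeps_per_view:
        mean_keep = sum(keeps) / len(keeps)
        total += max(0.0, mean_keep - rho) ** 2
    return total / len(keeps_per_view)


def s2_loss(
    ungated_task_loss: float,
    gated_task_loss: float,
    keeps_per_view: Sequence[Sequence[float]],
    rho: float = 0.2,
    lambda_gated: float = 1.0,   # hypothetical weight
    lambda_budget: float = 1.0,  # hypothetical weight
) -> float:
    """Combine the ungated task loss, gated task loss, and budget regularizer."""
    return (
        ungated_task_loss
        + lambda_gated * gated_task_loss
        + lambda_budget * budget_penalty(keeps_per_view, rho)
    )
```

In this sketch the gated term is what ties the mask to control: lowering retention only reduces the total loss if the gated prediction stays accurate, which is the sense in which the bottleneck is control-grounded and not a generic saliency map.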
@misc{wu2026s2,
title = {See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs},
author = {Wu, Yueh-Hua and Matsushima, Tatsuya and Ota, Kei},
year = {2026},
eprint = {TBA},
archivePrefix = {arXiv},
primaryClass = {cs.RO}
}