AIRoA · Robot Learning

See Less, Specify More

Broad reasoning. Focused control.

Robot data is scarce and embodiment-specific, while off-the-shelf VLMs bring broader priors for instruction and scene reasoning. S2 turns that asymmetry into a cleaner executor interface: preserve the goal, specify the local behavior, and restrict control to learned task-sufficient visual evidence.

Yueh-Hua (Kris) Wu*, Tatsuya Matsushima, Kei Ota
AIRoA · *Corresponding author

Overview video

Why the executor needs a cleaner interface.

Three-minute walkthrough of the motivation, S2 interface, learned visual evidence masks, and real-robot results.

Real-robot result

A cleaner executor interface raises real-robot success.

54.2% to 79.0% mean subtask success across 8 real-robot tasks
Platforms: TX-G2 + HSR
Tasks: 8 real-robot tasks

Visual evidence budget

The policy learns what control needs to see.

See Less is not a visualization of native attention or a hand-labeled crop. It imposes an explicit budget on base and wrist visual tokens, then learns soft keep masks from action prediction alone, without annotation. The retained evidence centers on task-sufficient cues: objects, contact regions, end effectors, and destination context.
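As an illustration of what an explicit token budget means, the sketch below selects the retained tokens for a base and a wrist camera from already-learned keep scores. It is a minimal sketch under assumptions: the scores are made up, `retained_indices` is a hypothetical helper, and hard top-k selection is used only to show which tokens a budget would retain; the page describes soft keep masks and does not specify this selection rule.

```python
import math


def retained_indices(keep_scores, rho):
    """Return indices of the highest-scoring tokens, at most ceil(rho * N).

    keep_scores: soft keep values in [0, 1], one per visual token.
    rho: fraction of tokens the budget allows (assumed per camera stream).
    """
    n = len(keep_scores)
    # Small epsilon guards against float error such as 0.1 * 30 > 3.
    k = min(n, math.ceil(rho * n - 1e-9))
    ranked = sorted(range(n), key=lambda i: keep_scores[i], reverse=True)
    return sorted(ranked[:k])


# Hypothetical keep scores for two camera streams (not from the paper).
base_scores = [0.05, 0.90, 0.10, 0.80, 0.02, 0.03, 0.07, 0.01, 0.04, 0.06]
wrist_scores = [0.70, 0.10, 0.20, 0.95, 0.05]

base_kept = retained_indices(base_scores, rho=0.2)    # 2 of 10 tokens
wrist_kept = retained_indices(wrist_scores, rho=0.2)  # 1 of 5 tokens
print(base_kept, wrist_kept)
```

With a budget of 0.2, only the two highest-scoring base tokens and the single highest-scoring wrist token survive, which is the sense in which the retained evidence concentrates on a small task-relevant subset.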

TX-G2 Bimanual tabletop manipulation

The retained evidence follows the manipulated object and end effectors while native attention stays diffuse.

HSR Mobile manipulation and placement

The budget shifts across shelf, object, gripper, and destination context as the local task phase changes.

OOD stress rollouts

The target stays grounded as the scene changes.

The ordered clothes-sorting task is to place green socks, then the handkerchief, then yellow socks into the basket. Most tabletop objects here never appear in training; as a person moves objects and basket positions during execution, S2 keeps the current garment and destination grounded.

Mostly OOD objects · Human perturbations · Autonomous, 3x
Autonomous, 3x Clothes stress 01
Autonomous, 3x Clothes stress 02
Autonomous, 3x Clothes stress 03

Method framing

The interface, not just the policy, is the bottleneck.

Coarse language and full images force the executor to resolve ambiguity that the interface could have removed. S2 changes that contract: preserve task identity, specify the local execution mode, and budget visual evidence to the task-relevant subset.

01

Preserve task identity

Keep the original instruction as the stable goal, so local guidance cannot become a different task.

02

Specify the current phase

Use local language to disambiguate the execution mode: approach, grasp, transfer, place, or recover.

03

Budget visual evidence

Learn which base and wrist visual tokens are sufficient for action prediction under a fixed budget.

04

Act from a cleaner state

Predict actions from a smaller language-and-vision problem instead of broad, underspecified context.
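The language half of this contract can be sketched as a small data structure that keeps the goal fixed while the phase and local guidance change. This is a hedged illustration only: `ExecutorLanguage`, the `Goal:/Phase:/Local:` layout, and the guidance strings are hypothetical, and the page does not specify how S2 actually formats its language input. The five phase names come from the list above.

```python
from dataclasses import dataclass

# Execution modes named in the method framing.
PHASES = ("approach", "grasp", "transfer", "place", "recover")


@dataclass(frozen=True)
class ExecutorLanguage:
    goal: str      # original instruction, kept verbatim as the stable goal
    phase: str     # current execution mode, one of PHASES
    guidance: str  # local, phase-specific description (hypothetical text)

    def __post_init__(self):
        if self.phase not in PHASES:
            raise ValueError(f"unknown phase: {self.phase!r}")

    def render(self) -> str:
        """Render goal and local guidance together (assumed layout)."""
        return f"Goal: {self.goal}\nPhase: {self.phase}\nLocal: {self.guidance}"


GOAL = "Place green socks, then the handkerchief, then yellow socks into the basket."

step = ExecutorLanguage(GOAL, "grasp", "close the gripper on the green socks")
next_step = ExecutorLanguage(GOAL, "transfer", "carry the green socks to the basket")

print(step.render())
print(next_step.render())
```

Because the goal line is identical in both renderings, local guidance can change from phase to phase without turning the instruction into a different task.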

Training signal

Evidence is grounded by action loss, not annotation.

The gate sees visual tokens and goal-preserving language context, then learns what can be suppressed while action prediction remains accurate.

Control-grounded visual evidence budgeting architecture
The budget gate suppresses broad visual dependence while preserving action-relevant information.
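One way to picture this training signal is an action loss combined with a penalty that activates only when the gate keeps more than the budget allows. The sketch below is an assumption-laden toy, not the paper's objective: the sigmoid gate, the hinge-style `budget_penalty`, the `weight` term, and all numeric values are hypothetical. The default budget of 0.2 matches the default reported on this page.

```python
import math


def soft_keep_mask(gate_logits):
    """Map per-token gate logits to soft keep values in (0, 1)."""
    return [1.0 / (1.0 + math.exp(-z)) for z in gate_logits]


def gated_tokens(tokens, mask):
    """Scale each visual token (a list of floats) by its keep value."""
    return [[m * x for x in tok] for tok, m in zip(tokens, mask)]


def budget_penalty(mask, rho):
    """Penalize only the amount by which mean keep exceeds the budget rho."""
    mean_keep = sum(mask) / len(mask)
    return max(0.0, mean_keep - rho)


def training_loss(action_loss, mask, rho=0.2, weight=1.0):
    """Action loss plus weighted budget penalty; no mask annotation is used."""
    return action_loss + weight * budget_penalty(mask, rho)


# Toy example: the gate strongly keeps one of five tokens.
logits = [4.0, -4.0, -4.0, -4.0, -4.0]
mask = soft_keep_mask(logits)
tokens = [[1.0, 2.0]] * 5
suppressed = gated_tokens(tokens, mask)
loss = training_loss(action_loss=0.5, mask=mask)
print(round(sum(mask) / len(mask), 3), round(loss, 3))
```

In this formulation the only supervision is the action loss; the penalty merely discourages the gate from keeping everything, so which tokens survive is decided by what action prediction needs.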

Ablations

S2 is not just more language.

The ablation separates two failure modes. Replacing the original goal with richer local text can improve object robustness while damaging task identity; visual budgeting helps most when the goal is preserved and the local mode is specified.

LIBERO object/goal mean

Only full S2 stays high on both suites.

Goal measures whether task identity is preserved. Object measures whether the executor remains robust to object perturbations.

| Interface | Goal | Object | Reading |
| --- | --- | --- | --- |
| Original (coarse instruction) | 57.8 | 53.3 | Preserves the task, but leaves local behavior underspecified. |
| Refined (replacement text) | 37.8 | 70.4 | Object robustness rises, but task identity drops sharply. |
| Original + VEB (vision budget only) | 62.5 | 61.8 | The visual bottleneck helps, but does not specify the phase. |
| Refined + VEB (replacement + budget) | 35.0 | 71.9 | Budgeting cannot recover a lost high-level goal. |
| Hybrid, no VEB (goal + local guidance) | 37.9 | 62.1 | Language alone is still exposed to broad visual context. |
| S2 (goal + local guidance + VEB) | 70.1 | 73.1 | The full interface is balanced across both suites. |
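
The claim that only full S2 stays high on both suites can be checked directly from the reported numbers by looking at each interface's weaker suite. The short script below is just that arithmetic; the scores are copied from the ablation above, while the variable names and the worst-case criterion are illustrative choices rather than a metric from the paper.

```python
# (Goal, Object) scores as reported in the ablation.
ablation = {
    "Original": (57.8, 53.3),
    "Refined": (37.8, 70.4),
    "Original + VEB": (62.5, 61.8),
    "Refined + VEB": (35.0, 71.9),
    "Hybrid, no VEB": (37.9, 62.1),
    "S2": (70.1, 73.1),
}

# Score each interface by its weaker suite, so a collapse on either one shows.
worst_case = {name: min(goal, obj) for name, (goal, obj) in ablation.items()}
best = max(worst_case, key=worst_case.get)

for name, score in sorted(worst_case.items(), key=lambda kv: -kv[1]):
    print(f"{name:16s} worst-suite score: {score}")
print("best worst-case interface:", best)
```

Both replacement-text variants score above 70 on Object but fall below 38 on Goal, so their worst-case score is far lower than that of the vision-budget-only variant, and S2 is the only interface whose weaker suite is still above 70.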

Budget sensitivity

ρ = 0.2 is the measured default.

The sweep is shallow but consistent: too small a budget under-specifies control, while looser budgets reintroduce nuisance visual context.

Figure: LIBERO-PRO mean versus shared visual evidence budget. The LIBERO-PRO mean is 66.16 at ρ = 0.10, 67.16 at 0.20, 66.25 at 0.30, and 65.16 at 0.40. The best value is at 0.20.
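
For readers who want the sweep in numeric form, the snippet below picks the best budget and measures how shallow the sweep is. The four values are the reported LIBERO-PRO means; the variable names are illustrative.

```python
# Reported LIBERO-PRO mean for each shared visual evidence budget rho.
sweep = {0.10: 66.16, 0.20: 67.16, 0.30: 66.25, 0.40: 65.16}

best_rho = max(sweep, key=sweep.get)
spread = max(sweep.values()) - min(sweep.values())

print(f"best budget: {best_rho:.2f}")
print(f"spread across the sweep: {spread:.2f} points")
```

The full spread is about 2 points, which is why the sweep reads as shallow: the default of 0.2 is the measured best, but neighboring budgets stay within roughly a point of it.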

Real robots

One interface across bimanual and mobile manipulation.

TX-G2 tests arm selection and precise tabletop control. HSR adds navigation, transport, and tight placement geometry. The same executor interface improves over π0.5 on every reported task.

TX-G2: 4 tasks, bimanual tabletop manipulation
HSR: 4 tasks, mobile manipulation and transport
Mean gain: +24.8 pts over π0.5 real-robot success
TX-G2 and HSR robot platforms
TX-G2 runs at 10 Hz with three cameras; HSR runs at 2 Hz with mobile manipulation.
Cutlery Transfer
Bowl Stacking
Clothes Sorting
Dish Racking
Coffee bottles
Box relocation
Bottle to box
Mug placement

Autonomous HSR rollouts

HSR completes the full instruction, not just a phase.

Four autonomous HSR executions from the evaluation setting, shown at 3x speed. The rollouts carry navigation, grasping, transport, and constrained placement through completion.

Autonomous, 3x Coffee Bottle to Box
Autonomous, 3x Two Coffee Bottles to Table
Autonomous, 3x Box Relocation
Autonomous, 3x Mug Rectangle

Paper abstract

A cleaner interface for the executor.

Generalization remains a central bottleneck for VLA models: under distractors, appearance shifts, and semantically similar tasks, the policy must infer local execution details from coarse instructions while also deciding which parts of the image matter for control.

S2 improves generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while adding refined trajectory- and subtask-level language. See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context.

BibTeX

@misc{wu2026s2,
  title         = {See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs},
  author        = {Wu, Yueh-Hua and Matsushima, Tatsuya and Ota, Kei},
  year          = {2026},
  eprint        = {TBA},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO}
}