AIRoA · Robot Learning

See Less, Specify More

Broad reasoning. Focused control.

Robot data is scarce and embodiment-specific, while off-the-shelf VLMs bring broader priors for instruction and scene reasoning. S2 turns that asymmetry into a cleaner executor interface: preserve the goal, specify the local behavior, and restrict control to learned task-sufficient visual evidence.

Yueh-Hua (Kris) Wu*, Tatsuya Matsushima, Kei Ota · AIRoA · *Corresponding author


Overview video

Why the executor needs a cleaner interface.

Three-minute walkthrough of the motivation, S2 interface, learned visual evidence masks, and real-robot results.

Real-robot result

A cleaner executor interface raises real-robot success.

Mean subtask success rises from 54.2% (π_0.5) to 79.0% (S2) across 8 real-robot tasks.

Platforms: TX-G2 + HSR

Tasks: 8 real-robot tasks

Visual evidence budget

The policy learns what control needs to see.

See Less is not a visualization of native attention or a hand-labeled crop. It imposes an explicit budget on base and wrist visual tokens, then learns soft keep masks from action prediction alone, without annotation. The retained evidence centers on task-sufficient cues: objects, contact regions, end effectors, and destination context.
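The idea of a soft keep mask under an explicit budget can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the sigmoid gate, the `temperature` parameter, the function names, and the toy scores are all assumptions. Only the notion of soft keep masks over visual tokens and the budget ρ come from the page.

```python
import math
from typing import List


def soft_keep_mask(scores: List[float], temperature: float = 1.0) -> List[float]:
    """Map per-token gate scores to soft keep weights in (0, 1) with a sigmoid.

    The sigmoid form is an assumption made for illustration.
    """
    return [1.0 / (1.0 + math.exp(-s / temperature)) for s in scores]


def budget_usage(mask: List[float]) -> float:
    """Fraction of the visual evidence budget in use: the mean keep weight."""
    return sum(mask) / len(mask)


def apply_mask(tokens: List[List[float]], mask: List[float]) -> List[List[float]]:
    """Scale each token embedding by its keep weight (soft suppression)."""
    return [[m * x for x in tok] for tok, m in zip(tokens, mask)]


# Hypothetical example: five visual tokens with 2-dimensional embeddings.
tokens = [[1.0, 2.0], [0.5, 0.5], [3.0, 1.0], [0.2, 0.1], [1.5, 1.5]]
scores = [4.0, -6.0, -6.0, -6.0, -6.0]  # the gate strongly favors token 0

mask = soft_keep_mask(scores)
gated_tokens = apply_mask(tokens, mask)
within_budget = budget_usage(mask) <= 0.2  # rho = 0.2, the page's default
```

In this toy case one token out of five is kept almost fully and the rest are nearly suppressed, so the mean keep weight lands just under the 0.2 budget.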

TX-G2 · Bimanual tabletop manipulation

TX-G2 rollout 01 (1 of 24)

The retained evidence follows the manipulated object and end effectors while native attention stays diffuse.

HSR · Mobile manipulation and placement

HSR rollout 01 (1 of 13)

The budget shifts across shelf, object, gripper, and destination context as the local task phase changes.

OOD stress rollouts

The target stays grounded as the scene changes.

The ordered clothes-sorting task is to place green socks, then the handkerchief, then yellow socks into the basket. Most tabletop objects here never appear in training; as a person moves objects and basket positions during execution, S2 keeps the current garment and destination grounded.

Mostly OOD objects · Human perturbations · Autonomous, 3x

Autonomous, 3x Clothes stress 01

Autonomous, 3x Clothes stress 02

Autonomous, 3x Clothes stress 03

Method framing

The interface, not just the policy, is the bottleneck.

Coarse language and full images make the executor solve avoidable ambiguity. S2 changes that contract: preserve task identity, specify the local execution mode, and budget visual evidence to the task-relevant subset.

Preserve task identity

Keep the original instruction as the stable goal, so local guidance cannot become a different task.

Specify the current phase

Use local language to disambiguate the execution mode: approach, grasp, transfer, place, or recover.

Budget visual evidence

Learn which base and wrist visual tokens are sufficient for action prediction under a fixed budget.

Act from a cleaner state

Predict actions from a smaller language-and-vision problem instead of broad, underspecified context.
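One way to picture the language side of this contract is as a small immutable record in which the goal is fixed and only the local guidance changes between phases. The sketch below is hypothetical: `ExecutorInput`, `Phase`, `to_prompt`, and `next_step` are illustrative names, and the prompt layout is an assumption rather than the format used in the paper. The phase names and the clothes-sorting goal are taken from this page.

```python
from dataclasses import dataclass
from enum import Enum


class Phase(Enum):
    """Local execution modes named on this page."""
    APPROACH = "approach"
    GRASP = "grasp"
    TRANSFER = "transfer"
    PLACE = "place"
    RECOVER = "recover"


@dataclass(frozen=True)
class ExecutorInput:
    goal: str            # original instruction, preserved across all phases
    phase: Phase         # current local execution mode
    local_guidance: str  # refined subtask-level text for this phase

    def to_prompt(self) -> str:
        """Render goal and local guidance together (layout is an assumption)."""
        return (
            f"Goal: {self.goal}\n"
            f"Phase: {self.phase.value}\n"
            f"Now: {self.local_guidance}"
        )

    def next_step(self, phase: Phase, local_guidance: str) -> "ExecutorInput":
        """Advance the local guidance while keeping the goal unchanged."""
        return ExecutorInput(self.goal, phase, local_guidance)


goal = "Place green socks, then the handkerchief, then yellow socks into the basket."
step1 = ExecutorInput(goal, Phase.GRASP, "Grasp the green socks.")
step2 = step1.next_step(Phase.TRANSFER, "Carry the green socks to the basket.")
prompt = step2.to_prompt()
```

Because the record is frozen and `next_step` copies the goal, local guidance can be refined freely without turning the episode into a different task, which is the failure mode the replacement-text ablation exposes.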

Training signal

Evidence is grounded by action loss, not annotation.

The gate sees visual tokens and goal-preserving language context, then learns what can be suppressed while action prediction remains accurate.

Control-grounded visual evidence budgeting architecture — The budget gate suppresses broad visual dependence while preserving action-relevant information.
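A training objective of this kind could be written, under stated assumptions, as an action loss on gated tokens plus a penalty for exceeding the budget. The page does not give the exact loss, so the hinge form and the symbols below are illustrative only: v denotes the N base and wrist visual tokens, ℓ the goal-preserving language context, a the demonstrated action, m_φ the soft keep mask produced by the gate, π_θ the executor, ρ the budget, and λ a hypothetical penalty weight.

```latex
\mathcal{L}(\theta, \phi)
  = \mathcal{L}_{\text{action}}\big(\pi_\theta(m_\phi \odot v,\ \ell),\ a\big)
  + \lambda \, \max\!\Big(0,\ \tfrac{1}{N} \sum_{i=1}^{N} m_{\phi,i} - \rho\Big),
\qquad m_\phi \in [0, 1]^N
```

In this form the only supervision reaching the gate is the action loss, so a token is kept only if suppressing it would hurt action prediction; no region or mask annotation enters the objective.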

Ablations

S2 is not just more language.

The ablation separates two failure modes. Replacing the original goal with richer local text can improve object robustness while damaging task identity; visual budgeting helps most when the goal is preserved and the local mode is specified.

LIBERO object/goal mean

Only full S2 stays high on both suites.

Goal measures whether task identity is preserved. Object measures whether the executor remains robust to object perturbations.

| Interface | Goal | Object | Reading |
| --- | --- | --- | --- |
| Original (coarse instruction) | 57.8 | 53.3 | Preserves the task, but leaves local behavior underspecified. |
| Refined (replacement text) | 37.8 | 70.4 | Object robustness rises, but task identity drops sharply. |
| Original + VEB (vision budget only) | 62.5 | 61.8 | The visual bottleneck helps, but does not specify the phase. |
| Refined + VEB (replacement + budget) | 35.0 | 71.9 | Budgeting cannot recover a lost high-level goal. |
| Hybrid, no VEB (goal + local guidance) | 37.9 | 62.1 | Language alone is still exposed to broad visual context. |
| S2 (goal + local guidance + VEB) | 70.1 | 73.1 | The full interface is balanced across both suites. |
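The "balanced across both suites" reading can be checked directly from the reported numbers. The short script below uses only the Goal and Object scores listed above; the summary statistics (per-interface mean and worst-case score) are an illustrative choice of this sketch, not metrics reported by the authors.

```python
# (Goal, Object) scores per interface, copied from the ablation results above.
results = {
    "Original": (57.8, 53.3),
    "Refined": (37.8, 70.4),
    "Original + VEB": (62.5, 61.8),
    "Refined + VEB": (35.0, 71.9),
    "Hybrid, no VEB": (37.9, 62.1),
    "S2": (70.1, 73.1),
}


def summarize(scores):
    """Return the mean and the worst-case (minimum) score for each interface."""
    return {
        name: {"mean": (goal + obj) / 2, "worst": min(goal, obj)}
        for name, (goal, obj) in scores.items()
    }


summary = summarize(results)
best_mean = max(summary, key=lambda name: summary[name]["mean"])
best_worst = max(summary, key=lambda name: summary[name]["worst"])

for name, stats in summary.items():
    print(f"{name:16s} mean={stats['mean']:.2f} worst={stats['worst']:.1f}")
```

Ranking by the worst-case score makes the trade-off visible: the replacement-text variants score high on Object but fall below 40 on Goal, while S2 is the only interface whose lower score stays above 70.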

Budget sensitivity

ρ = 0.2 is the measured default.

The sweep is shallow but consistent: too small a budget under-specifies control, while looser budgets reintroduce nuisance visual context.

Real robots

One interface across bimanual and mobile manipulation.

TX-G2 tests arm selection and precise tabletop control. HSR adds navigation, transport, and tight placement geometry. The same executor interface improves over π_0.5 on every reported task.

TX-G2: 4 tasks, bimanual tabletop manipulation

HSR: 4 tasks, mobile manipulation and transport

Mean gain: +24.8 pts over π_0.5 real-robot success
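The mean gain follows from the two success rates reported on this page. The snippet below reproduces the absolute gain in percentage points and also derives a relative improvement; the relative figure is computed here for context and is not a number stated by the authors.

```python
baseline_success = 54.2  # pi_0.5 mean subtask success (%), as reported
s2_success = 79.0        # S2 mean subtask success (%), as reported

# Absolute gain in percentage points, rounded to the reported precision.
absolute_gain_pts = round(s2_success - baseline_success, 1)

# Relative improvement over the baseline (derived here, not reported).
relative_gain = (s2_success - baseline_success) / baseline_success

print(f"Absolute gain: +{absolute_gain_pts} pts")
print(f"Relative gain: {relative_gain:.1%}")
```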

TX-G2 and HSR robot platforms — TX-G2 runs at 10 Hz with three cameras; HSR runs at 2 Hz with mobile manipulation.

Autonomous HSR rollouts

HSR completes the full instruction, not just a phase.

Four autonomous HSR executions from the evaluation setting, shown at 3x speed. The rollouts carry navigation, grasping, transport, and constrained placement through completion.

Autonomous, 3x Coffee Bottle to Box

Autonomous, 3x Two Coffee Bottles to Table

Autonomous, 3x Box Relocation

Autonomous, 3x Mug Rectangle

Paper abstract

A cleaner interface for the executor.

Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem: coarse instructions induce avoidable supervision aliasing, goal-preserving local guidance outperforms instruction replacement in our main ablations, and explicit evidence budgeting reduces dependence on broad visual context beyond efficiency considerations. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% to 79.0% over π_0.5. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.

BibTeX

@misc{wu2026s2,
  title         = {See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs},
  author        = {Wu, Yueh-Hua and Matsushima, Tatsuya and Ota, Kei},
  year          = {2026},
  eprint        = {2606.02735},
  archivePrefix = {arXiv},
  primaryClass  = {cs.RO}
}