Ego-Tutor

Multimodal Reasoning for Dexterous Mobile Robots

Mohamed Malek Abid, Julie Terrassier, Konstantin Lucny, Dragos Chileban
Supervisors: Isar Meijer, Jeffrey Delmerico, Oier Mees (Microsoft Research)
ETH Zürich

Abstract

Understanding how humans perceive and describe tasks can help improve embodied reasoning models for robotic manipulation. Existing policies such as Embodied Chain-of-Thought (E-CoT) are mostly trained on robot-centric datasets and lack exposure to real egocentric human demonstrations. We present Ego-Tutor, a system that enhances Vision-Language-Action models by leveraging rich multimodal signals from Meta Aria glasses to provide human-aligned reasoning for robot learning.

Our approach enables: (1) Recording natural human demonstrations with Aria glasses capturing synchronized RGB, eye gaze, hand tracking, and speech; (2) Using these multimodal signals to generate better annotations that classify objects as PRIMARY (gaze-attended) or AUXILIARY (contextual), incorporate spoken task descriptions, and leverage hand pose for action understanding; (3) Improving upon the original E-CoT work with enhanced reasoning chains generated by Gemini 2.5 Flash that produce more human-aligned subtask decomposition and movement explanations.

Additionally, we contribute a Microsoft HoloLens application that lets users inspect, edit, and correct model reasoning and bounding boxes in mixed reality, enabling on-the-fly data augmentation and interactive policy improvement. Our experiments show that models fine-tuned on our multimodal egocentric data converge faster and produce more interpretable reasoning chains than those trained on standard E-CoT data alone.


What Ego-Tutor Does

1. Record Human Demos

Capture natural task demonstrations using Meta Aria glasses with synchronized RGB video, 3D eye gaze tracking, hand pose estimation, and speech-to-text transcription.

👁️ Gaze ✋ Hands 🎤 Speech

2. Better Annotations

Use multimodal signals to generate improved annotations: gaze determines PRIMARY vs AUXILIARY objects, speech provides natural task descriptions, hand tracking informs action phases.

PRIMARY = what the human looks at
AUXILIARY = contextual objects

3. Enhanced Reasoning

Generate improved E-CoT reasoning chains using Gemini 2.5 Flash, producing more detailed subtask decomposition, gaze-grounded movement explanations, and human-aligned planning.

Built on E-CoT + OpenVLA

Ego-Tutor Pipeline

Our pipeline transforms raw multimodal Aria recordings into structured RLDS episodes with enhanced reasoning chains. We extract gaze coordinates, run zero-shot object detection with Grounding DINO, use a hierarchical classification algorithm to determine object importance, and generate improved reasoning with Gemini 2.5 Flash.

📹 Aria VRS (RGB + Sensors) → 👁️ Gaze (3D → 2D Projection) + ✋ Hands (Pose Tracking) + 🎤 Speech (Task Descriptions)
↓
🔍 Grounding DINO (Object Detection) → 🎯 Classification (PRIMARY / AUXILIARY) → ✨ Gemini 2.5 Flash (Enhanced Reasoning) → 🤖 VLA Fine-tune (LoRA Adapters)

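As a concrete sketch of the detection stage, the open-set detector can be prompted directly with the spoken task description. The snippet below uses the Hugging Face transformers port of Grounding DINO; the checkpoint name, frame path, and prompt are illustrative, and post-processing argument names can differ slightly across transformers versions.

# Hedged sketch: zero-shot detection with Grounding DINO, prompted by the
# speech transcript (checkpoint, paths, and prompt are illustrative).
import torch
from PIL import Image
from transformers import AutoProcessor, AutoModelForZeroShotObjectDetection

model_id = "IDEA-Research/grounding-dino-tiny"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForZeroShotObjectDetection.from_pretrained(model_id)

image = Image.open("frame_0001.png")             # undistorted RGB frame
prompt = "an orange. a white plate. a hand."     # lowercase phrases ending in "."

inputs = processor(images=image, text=prompt, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw outputs into labelled boxes in pixel coordinates.
result = processor.post_process_grounded_object_detection(
    outputs, inputs.input_ids, target_sizes=[image.size[::-1]]
)[0]
detections = list(zip(result["labels"], result["boxes"].tolist(),
                      result["scores"].tolist()))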

Egocentric Data Collection

We collected rich multimodal demonstrations using Meta Aria glasses, recording synchronized RGB video, gaze direction, hand pose, and spoken task descriptions. Our pipeline transforms these raw streams into structured RLDS episodes suitable for training.

📷 Stereo RGB

Undistortion and spatial alignment of stereo RGB streams, scaled to 256×256 resolution for VLA input.

👁️ 3D Gaze Fusion

Extraction of 3D gaze rays from Aria MPS eye-tracking, projected onto the image plane for object attention.

✋ Hand Tracking

Integration of hand pose and skeleton data for wrist position supervision and action phase detection.

🎤 Speech-to-Text

Automatic extraction of natural task descriptions spoken during demonstrations for grounding.
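
The gaze-fusion step reduces to a pinhole projection once the 3D gaze point is expressed in the RGB camera frame. A minimal sketch, assuming placeholder intrinsics and source resolution rather than the actual Aria device calibration (which the pipeline reads per recording):

# Hedged sketch: pinhole projection of a 3D gaze point (already in the
# undistorted RGB camera frame) onto the 256x256 VLA input image.
# Intrinsics and source resolution are placeholders, not the real calibration.
import numpy as np

def project_gaze(gaze_cam, fx, fy, cx, cy, src_size, dst_size=256):
    x, y, z = gaze_cam                   # metres, camera frame, z > 0
    u = fx * x / z + cx                  # pixel column in the source image
    v = fy * y / z + cy                  # pixel row in the source image
    s = dst_size / src_size              # rescale to the 256x256 VLA input
    return int(round(u * s)), int(round(v * s))

# Placeholder example: a gaze point roughly 0.6 m in front of the camera.
gaze_x, gaze_y = project_gaze(np.array([0.05, 0.02, 0.6]),
                              fx=610.0, fy=610.0, cx=704.0, cy=704.0,
                              src_size=1408)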


Multimodal Attention Classification

Our hierarchical classification algorithm determines object importance using multiple modalities:

  1. Extract gaze point from Aria MPS eye-tracking data (projected to 256×256)
  2. Run Grounding DINO with the spoken task description to detect relevant objects
  3. Find the bounding box closest to the gaze point
  4. Among overlapping boxes, select the highest-confidence detection as PRIMARY
  5. Filter out "hand" detections from PRIMARY selection
  6. Remaining objects become AUXILIARY context
  7. Use hand tracking to inform gripper state and action phase

This approach ensures the model learns to focus on what humans naturally attend to during manipulation, while maintaining awareness of the surrounding context.
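
A minimal sketch of one literal reading of this hierarchy, assuming detections are already available as (label, box, score) tuples in 256×256 coordinates (e.g. from the Grounding DINO stage) and using bounding-box centers plus an assumed IoU threshold:

def classify_attention(detections, gaze_xy, overlap_iou=0.5):
    """Hedged sketch of the PRIMARY/AUXILIARY selection described above."""
    def center(box):
        return ((box[0] + box[2]) / 2.0, (box[1] + box[3]) / 2.0)

    def dist_sq_to_gaze(det):
        cx, cy = center(det[1])
        return (cx - gaze_xy[0]) ** 2 + (cy - gaze_xy[1]) ** 2

    def iou(a, b):
        ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
        ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
        area_a = (a[2] - a[0]) * (a[3] - a[1])
        area_b = (b[2] - b[0]) * (b[3] - b[1])
        return inter / (area_a + area_b - inter + 1e-9)

    candidates = [d for d in detections if d[0] != "hand"]   # step 5
    if not candidates:
        return None, []

    closest = min(candidates, key=dist_sq_to_gaze)           # step 3
    overlapping = [d for d in candidates
                   if iou(d[1], closest[1]) > overlap_iou]   # step 4
    primary = max(overlapping, key=lambda d: d[2])
    auxiliary = [d for d in candidates if d is not primary]  # step 6
    return primary, auxiliary

# Example (boxes from Example 2 below; confidence scores are illustrative).
primary, auxiliary = classify_attention(
    [("orange", (145, 148, 156, 160), 0.62),
     ("table", (65, 124, 169, 201), 0.41)],
    gaze_xy=(150, 154))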

Classification Example

Task: "put the orange on the white plate"

👁️ Gaze: [123, 142]
orange [145, 148, 156, 160] | PRIMARY
table [65, 124, 169, 201] | AUXILIARY

Orange is PRIMARY (gaze-attended). Table provides spatial context.


Reasoning Chain Comparison

Task: "taking the yellow sponge and putting it on the green napkin"

Baseline E-CoT
TASK:
Take the yellow sponge and put it on the
green napkin.

PLAN:
Grasp the yellow sponge. Move the yellow
sponge to the green napkin.

SUBTASK REASONING:
The yellow sponge is the first object that
needs to be grasped in order to complete
the task.

SUBTASK:
Grasp the yellow sponge.

MOVE REASONING:
The yellow sponge is below the robotic arm.

MOVE: Move down.

GRIPPER POSITION: [24, 173]

VISIBLE OBJECTS:
yellow sponge [168, 153, 187, 171]
green napkin [171, 173, 222, 227]
metal sink [85, 142, 177, 225]

// generic reasoning, no attention signal
// gripper position far from target object
// no object prioritization
Ego-Tutor (Ours)
TASK:
The yellow sponge needs to be grasped and
then placed onto the green napkin.

PLAN:
1. Grasp the yellow sponge.
2. Move sponge to the green napkin.
3. Release the yellow sponge.

GAZE POINT: [137, 150]

SUBTASK REASONING:
The spoken instruction "putting it on the green
napkin" sets the goal. Gaze at [137, 150] is on
the yellow sponge, confirming it as the immediate
target for grasping.

SUBTASK:
Grasping the yellow sponge.

MOVE REASONING:
Stopping to maintain stable pre-grasp position
directly above the yellow sponge. Gaze on PRIMARY
confirms alignment for precise grasp initiation.

MOVE: Stop. Preparing to grasp.

GRIPPER POSITION: [127, 152]

VISIBLE OBJECTS:
yellow sponge [132, 142, 142, 157] | PRIMARY
green napkin [171, 173, 222, 227] | AUXILIARY
table [77, 124, 203, 220] | AUXILIARY

Our multimodal reasoning provides explicit gaze attention signals, spoken instruction grounding, PRIMARY/AUXILIARY object classification, and detailed move explanations that help the model understand which objects are currently relevant and why.
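
For reference, the enhanced chains above can be viewed as flat records serialized into a fixed field layout. A minimal sketch mirroring that layout (the exact serialization used to build training targets is an assumption):

# Hedged sketch: serializing one annotated frame into the field layout shown
# above; the precise training-target format is an assumption.
def format_reasoning(task, plan, gaze_xy, subtask_reasoning, subtask,
                     move_reasoning, move, gripper_xy, objects):
    lines = [
        f"TASK: {task}",
        f"PLAN: {plan}",
        f"GAZE POINT: [{gaze_xy[0]}, {gaze_xy[1]}]",
        f"SUBTASK REASONING: {subtask_reasoning}",
        f"SUBTASK: {subtask}",
        f"MOVE REASONING: {move_reasoning}",
        f"MOVE: {move}",
        f"GRIPPER POSITION: [{gripper_xy[0]}, {gripper_xy[1]}]",
        "VISIBLE OBJECTS:",
    ]
    # objects: iterable of (name, (x1, y1, x2, y2), "PRIMARY" or "AUXILIARY")
    lines += [f"{name} {list(box)} | {role}" for name, box, role in objects]
    return "\n".join(lines)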

Visual Comparison

Base E-CoT vs Ego-Tutor with gaze-guided attention

[Figures: Base E-CoT reasoning | Ego-Tutor gaze-aware reasoning (with gaze)]

Example 2: "put the orange on the white plate"

Baseline E-CoT
TASK:
Pick up the orange and place it on the plate.

PLAN:
Move to orange, grasp, move to plate, release.

SUBTASK REASONING:
The orange is in front of the robot, so the
robot needs to reach for it.

SUBTASK: Reach for the orange.

MOVE REASONING:
The robot is already in the correct position.

MOVE: Stop.

GRIPPER POSITION: [182, 68]

VISIBLE OBJECTS:
orange [145, 148, 156, 160]
table [65, 124, 169, 201]
hand [157, 183, 170, 198]
Ego-Tutor (Ours)
TASK:
The orange needs to be lifted, transported to
the white plate, and placed on it.

PLAN:
1. Pick up the orange. 2. Move to plate.
3. Place the orange on the plate.

GAZE POINT: [150, 154]

SUBTASK REASONING:
The spoken instruction "picking up the orange"
confirms current objective. Gaze at [150, 154]
is on the orange, the first object to grasp.

SUBTASK: Picking up the orange.

MOVE REASONING:
Gripper closed around orange, holding stationary
to confirm grasp before transport to plate.

MOVE: Holding orange stationary after grasp.

GRIPPER POSITION: [182, 68]

VISIBLE OBJECTS:
orange [145, 148, 156, 160] | PRIMARY
table [65, 124, 169, 201] | AUXILIARY

HoloLens Application for Interactive Feedback

To support interactive inspection and correction of model reasoning, we developed an immersive mixed-reality application on Microsoft HoloLens. This enables users to see E-CoT predictions directly in their environment and correct them in real time, creating a human-in-the-loop feedback pipeline for on-the-fly data augmentation and policy improvement.

Visualize Reasoning

Display predicted bounding boxes, PRIMARY/AUXILIARY classifications, and reasoning chains overlaid on the real environment.

Voice Corrections

Use voice commands to correct reasoning sequences, adjust bounding boxes, and modify object classifications in real time.

Live Feedback Loop

Corrections feed back into the training pipeline, enabling iterative policy improvement through human-in-the-loop data augmentation.

This tool provides an intuitive way to understand and refine embodied reasoning outputs, making it easier to diagnose failures and inject human knowledge into the learning process.
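
As an illustration of what flows back from the headset, each correction can be captured as a small record keyed to the episode and frame it amends. The field names below are hypothetical, not the app's actual message schema:

# Hedged sketch: a hypothetical correction record emitted by the HoloLens app
# and merged back into the dataset. Field names are illustrative only.
import json
from dataclasses import dataclass, asdict

@dataclass
class ReasoningCorrection:
    episode_id: str
    frame_index: int
    corrected_subtask: str        # e.g. dictated via a voice command
    corrected_primary: str        # relabelled PRIMARY object
    corrected_box: tuple          # adjusted bounding box (x1, y1, x2, y2)

correction = ReasoningCorrection(
    episode_id="demo_014", frame_index=37,
    corrected_subtask="Grasping the yellow sponge.",
    corrected_primary="yellow sponge",
    corrected_box=(132, 142, 142, 157))
print(json.dumps(asdict(correction)))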


Results

Key Finding: Improved Training Dynamics

Models fine-tuned on our gaze-aware egocentric data show improved training dynamics compared to standard E-CoT. The gaze hierarchical classification provides stronger supervision for task-relevant features, enabling the model to learn which objects to attend to during manipulation.

Training Curves

Brown = Base E-CoT (with Gemini 2.5 Flash update) | Green = Gaze-Aware E-CoT (Ours)

[Figures: L1 loss comparison | Action accuracy comparison]

Qualitative Improvements

Egocentric cues such as gaze, hand pose, and spoken narration shift reasoning steps toward more human-aligned attention. Our enhanced reasoning chains show:

👁️ → 🎯

Gaze-Guided Selection

Objects are classified as PRIMARY based on human gaze, not just task instruction parsing.

🎀 β†’ πŸ“

Speech-Grounded Subtasks

Spoken instructions like "picking up the orange" directly inform subtask reasoning.

✋ → 🤖

Hand-Informed Actions

Hand tracking provides gripper state context and action phase detection.


Technical Details

Model Architecture

  • Base Model: ecot-openvla-7b-bridge
  • Fine-tuning: LoRA adapters (rank 32)
  • Vision Encoder: SigLIP + DINOv2
  • LLM Backbone: Llama 2 7B

Training Configuration

  • Learning Rate: 0.001
  • Batch Size: 8
  • Max Steps: 100
  • Hardware: 80GB A100 GPU
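
A minimal sketch of the adapter setup implied by the configuration above, using the peft library; the rank and training hyperparameters are taken from the lists, while lora_alpha, dropout, and the target-module choice are assumptions:

# Hedged sketch: LoRA adapters at the reported rank of 32. lora_alpha,
# dropout, and target_modules are assumptions, not reported values.
from peft import LoraConfig

lora_config = LoraConfig(
    r=32,
    lora_alpha=16,
    lora_dropout=0.0,
    target_modules="all-linear",
    init_lora_weights="gaussian",
)

# Reported fine-tuning hyperparameters from the list above.
train_config = {
    "learning_rate": 1e-3,
    "batch_size": 8,
    "max_steps": 100,
}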

Data Pipeline

  • Multimodal Extraction: Aria MPS → Gaze + Hands + Speech
  • Object Detection: Grounding DINO
  • Reasoning Generation: Gemini 2.5 Flash
  • Dataset Format: RLDS (TFRecord)
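
A minimal sketch of the reasoning-generation step, assuming the google-generativeai client; the prompt wording and the annotation values packed into it are illustrative, not the project's actual prompt:

# Hedged sketch: asking Gemini 2.5 Flash for an enhanced reasoning chain.
# Prompt wording and annotation values are illustrative only.
import google.generativeai as genai
from PIL import Image

genai.configure(api_key="...")                 # read from an env var in practice
model = genai.GenerativeModel("gemini-2.5-flash")

frame = Image.open("frame_0001.png")
prompt = (
    "Spoken task: 'put the orange on the white plate'. "
    "Gaze point: [150, 154]. "
    "PRIMARY: orange [145, 148, 156, 160]. AUXILIARY: table [65, 124, 169, 201]. "
    "Write an E-CoT style chain with TASK, PLAN, GAZE POINT, SUBTASK REASONING, "
    "SUBTASK, MOVE REASONING, MOVE, GRIPPER POSITION and VISIBLE OBJECTS."
)
response = model.generate_content([frame, prompt])
print(response.text)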

Aria Data Processing

  • Input: VRS files with eye tracking + hands
  • Resolution: Scaled to 256×256
  • Gaze Output: (timestamp, gaze_x, gaze_y)
  • Sync: Frame-aligned multimodal extraction
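
The frame-aligned extraction reduces to a nearest-timestamp match between the gaze stream and the RGB frames. A sketch, assuming both streams expose sorted device timestamps in nanoseconds:

# Hedged sketch: nearest-timestamp alignment of MPS gaze samples to RGB
# frames, assuming sorted nanosecond device timestamps for both streams.
import numpy as np

def align_gaze_to_frames(frame_ts_ns, gaze_ts_ns, gaze_xy):
    """Return one (gaze_x, gaze_y) row per RGB frame."""
    frame_ts_ns = np.asarray(frame_ts_ns)
    gaze_ts_ns = np.asarray(gaze_ts_ns)
    gaze_xy = np.asarray(gaze_xy)
    idx = np.clip(np.searchsorted(gaze_ts_ns, frame_ts_ns), 1, len(gaze_ts_ns) - 1)
    left, right = gaze_ts_ns[idx - 1], gaze_ts_ns[idx]
    nearest = np.where(frame_ts_ns - left < right - frame_ts_ns, idx - 1, idx)
    return gaze_xy[nearest]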

References

  1. Zawalski, M., Chen, W., Pertsch, K., Mees, O., Finn, C. and Levine, S., 2024. Robotic control via embodied chain-of-thought reasoning. arXiv preprint arXiv:2407.08693.
  2. Kim, M.J., Pertsch, K., Karamcheti, S., et al., 2024. OpenVLA: An open-source vision-language-action model. arXiv preprint arXiv:2406.09246.
  3. Liu, S., Zeng, Z., Ren, T., et al., 2023. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv preprint arXiv:2303.05499.
  4. Meta Platforms, Inc. Project Aria: Research glasses for egocentric AI. projectaria.com