Understanding how humans perceive and describe tasks can help improve embodied reasoning models for robotic manipulation.
Existing approaches such as Embodied Chain-of-Thought (E-CoT) reasoning are trained almost exclusively on robot-centric datasets and therefore lack exposure to real egocentric human demonstrations.
We present Ego-Tutor, a system that enhances Vision-Language-Action (VLA) models with rich multimodal signals from Meta Aria glasses, providing human-aligned reasoning supervision for robot learning.
Our approach enables: (1) Recording natural human demonstrations with Aria glasses capturing synchronized RGB, eye gaze, hand tracking, and speech;
(2) Using these multimodal signals to generate richer annotations that classify objects as PRIMARY (gaze-attended) or AUXILIARY (contextual), incorporate spoken task descriptions, and leverage hand pose for action understanding (see the gaze-classification sketch after this list);
(3) Extending the original E-CoT reasoning chains with annotations generated by Gemini 2.5 Flash, yielding more human-aligned subtask decompositions and movement explanations.
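To make the PRIMARY/AUXILIARY annotation scheme concrete, the following is a minimal sketch of how a gaze-dwell split could be computed. The `GazeSample` and `ObjectBox` containers, the field names, and the 0.15 dwell threshold are illustrative assumptions, not the exact pipeline used by Ego-Tutor.

```python
from dataclasses import dataclass
from typing import List, Dict

@dataclass
class GazeSample:
    t: float   # timestamp in seconds
    x: float   # gaze point in image coordinates (pixels)
    y: float

@dataclass
class ObjectBox:
    name: str
    x_min: float
    y_min: float
    x_max: float
    y_max: float

def classify_objects(gaze: List[GazeSample],
                     boxes: List[ObjectBox],
                     dwell_ratio: float = 0.15) -> Dict[str, str]:
    """Label an object PRIMARY if the gaze point dwells inside its bounding
    box for at least `dwell_ratio` of the clip, else AUXILIARY.
    The threshold is an illustrative assumption, not a tuned value."""
    labels = {}
    for box in boxes:
        hits = sum(
            1 for g in gaze
            if box.x_min <= g.x <= box.x_max and box.y_min <= g.y <= box.y_max
        )
        ratio = hits / max(len(gaze), 1)
        labels[box.name] = "PRIMARY" if ratio >= dwell_ratio else "AUXILIARY"
    return labels

# Example: gaze hovering on the mug makes it PRIMARY; the plate stays AUXILIARY.
gaze = [GazeSample(t=i * 0.1, x=320.0, y=240.0) for i in range(30)]
boxes = [ObjectBox("mug", 300, 220, 360, 280), ObjectBox("plate", 50, 60, 150, 160)]
print(classify_objects(gaze, boxes))  # {'mug': 'PRIMARY', 'plate': 'AUXILIARY'}
```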
Additionally, we contribute a Microsoft HoloLens application that lets users inspect, edit, and correct model reasoning and bounding boxes in mixed reality, enabling on-the-fly data augmentation and interactive policy improvement.
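As a sketch of how such mixed-reality corrections could flow back into the training set, a correction record might be serialized as below. The schema and field names are assumptions for illustration; the application's actual format is not specified here.

```python
import json
from dataclasses import dataclass, asdict
from typing import List

@dataclass
class BoxEdit:
    object_name: str
    # Corrected bounding box in image coordinates: [x_min, y_min, x_max, y_max].
    box: List[float]

@dataclass
class ReasoningCorrection:
    """One user correction captured in mixed reality (illustrative schema)."""
    episode_id: str
    frame_index: int
    original_reasoning: str
    corrected_reasoning: str
    box_edits: List[BoxEdit]

def to_training_example(c: ReasoningCorrection) -> str:
    """Serialize a correction so it can be appended to the fine-tuning data."""
    return json.dumps(asdict(c))

example = ReasoningCorrection(
    episode_id="demo_0042",
    frame_index=117,
    original_reasoning="Reach for the plate.",
    corrected_reasoning="Reach for the mug; the plate is only context.",
    box_edits=[BoxEdit("mug", [298.0, 218.0, 362.0, 284.0])],
)
print(to_training_example(example))
```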
Our experiments show that models fine-tuned on our multimodal egocentric data converge faster and produce more interpretable reasoning chains than those trained on standard E-CoT data alone.