Enabling robots to autonomously perform complex, long-horizon tasks remains challenging due to the need for hierarchical reasoning and dynamic adaptability. Humans overcome this by interacting with the environment and learning from their own experience, a capability that existing robots lack without human supervision. To enable similar capabilities in robotic agents, we introduce MOSEL, a modular self-reflective learning framework for robotic decision making. MOSEL combines hierarchical planning with multimodal foundation models, including large vision-language models (LVLMs), video diffusion models, and inverse dynamics models. These components work together to break down complex tasks, generate executable visual plans, and perform actions. MOSEL further incorporates a self-reflective learning mechanism that autonomously identifies failures and iteratively refines policies with minimal human intervention. Evaluations on the LIBERO-LONG and RoboTwin benchmarks demonstrate that MOSEL outperforms existing methods, achieving average performance improvements of over 33% and 46%, respectively. Our results underscore the effectiveness of autonomous self-improvement and accurate failure identification in advancing robust robotic manipulation.
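To make the plan-act-reflect structure described above concrete, the following is a minimal illustrative sketch in Python. All class names, interfaces, and the retry logic are assumptions for exposition only; they are not MOSEL's actual modules or API, which pair an LVLM-based task decomposer, a video diffusion visual planner, and an inverse dynamics action model inside a self-reflective retry loop.

# Illustrative sketch only: the module names, interfaces, and reflection
# logic below are hypothetical stand-ins, not MOSEL's actual implementation.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Episode:
    subtask: str
    success: bool
    frames: List[str] = field(default_factory=list)

class TaskPlanner:      # stand-in for the LVLM-based hierarchical planner
    def decompose(self, task: str) -> List[str]:
        return [f"{task}: step {i}" for i in range(3)]   # placeholder subtasks

class VisualPlanner:    # stand-in for the video diffusion model
    def generate_plan(self, subtask: str) -> List[str]:
        return [f"frame_{i}" for i in range(4)]          # placeholder frames

class ActionModel:      # stand-in for the inverse dynamics model
    def execute(self, frames: List[str]) -> bool:
        return len(frames) > 0                           # placeholder success check

def self_reflective_loop(task: str, max_rounds: int = 3) -> List[Episode]:
    """Plan, act, and retry failed subtasks: a toy version of self-reflection."""
    planner, visual, actor = TaskPlanner(), VisualPlanner(), ActionModel()
    history: List[Episode] = []
    for subtask in planner.decompose(task):
        for _ in range(max_rounds):
            frames = visual.generate_plan(subtask)       # visual plan for this subtask
            success = actor.execute(frames)              # convert plan frames to actions
            history.append(Episode(subtask, success, frames))
            if success:
                break   # move to the next subtask; otherwise re-plan and retry
    return history

if __name__ == "__main__":
    for ep in self_reflective_loop("put the bowl in the drawer"):
        print(ep.subtask, "->", "success" if ep.success else "retry")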