WorkBenchMark

A LEGO-Based Assembly Benchmark with an Assembly-by-Disassembly Baseline for the Smart Manufacturing League

1 Chair of Machine Learning and Reasoning (i6), RWTH Aachen University
2 MASCOR Institute, FH Aachen University of Applied Sciences

A benchmark for robotic assembly — manipulation meets task-level reasoning

Every task: build the target product from a pile of bricks.

Abstract

We introduce WorkBenchMark, a LEGO Duplo-based robotic assembly benchmark motivated by the RoboCup Smart Manufacturing League. Robotic assembly couples low-level manipulation with task-level symbolic reasoning under physical constraints, a combination that current end-to-end learning methods do not yet solve reliably. The benchmark provides 400 tasks across four complexity tiers. As a baseline, we provide a pipeline that combines open-vocabulary perception with Assembly-by-Disassembly planning. This planning-based pipeline outperforms a modern vision-language-action approach across all tiers. The benchmark, simulation environment, and baseline implementation will be released openly to support the broader robotic assembly community.

WorkBenchMark draws its task design from the Workbench Track of the RoboCup Smart Manufacturing League (SML), where robots assemble products from supplied parts. You do not need to take part in RoboCup to use it — it is a standalone, open benchmark for anyone working on robotic assembly, task-and-motion planning, or manipulation.

Leaderboard

Every method, every metric, every tier. Each cell shows the four per-tier scores (T1 T2 T3 T4) with the Overall average below. Want to appear here? See How to Submit.

What the Metrics Mean

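In each cell, the four small values (e.g. 96.67 90.00 70.00 66.67) are the scores for Tiers 1–4, and the bold value below them (e.g. 80.84) is the Overall average. Each metric is marked as either higher-is-better or lower-is-better. Blue marks the best Overall score; bold marks the best score in a tier.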
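
As a small worked example of how the numbers in a cell relate, the sketch below recomputes the Overall value from the four tier scores. The unweighted mean and the half-up rounding to two decimals are assumptions inferred from the example figures, and `overall_score` is a hypothetical helper, not part of any released code.

```python
from decimal import Decimal, ROUND_HALF_UP


def overall_score(tier_scores):
    """Unweighted mean of per-tier scores, rounded half-up to two decimals (assumed)."""
    values = [Decimal(s) for s in tier_scores]
    mean = sum(values) / Decimal(len(values))
    return mean.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


# Tier 1-4 scores from the legend example, passed as strings to stay exact.
example = ["96.67", "90.00", "70.00", "66.67"]
print(overall_score(example))  # 80.84
```

Decimal arithmetic is used here only so that the midpoint value 80.835 rounds predictably; binary floats can round it either way.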

Planning and manipulation times depend on hardware and implementation — treat cross-submission timing comparisons with caution.

Real-Robot Track (UR5)

The structured pipeline run on a physical UR5 with a Robotiq 2F-style gripper and an Intel RealSense camera (10 trials per tier). Tier 4 is left for future real-robot validation.

Tier   | Physical task                              | Steps | Success       | Runtime (s)
Tier 1 | Single-brick relocation to a target pose   | 1     | 9 / 10 (90%)  | 55.77 ± 4.29
Tier 2 | Multi-brick vertical stacking              | 2–4   | 9 / 10 (90%)  | 167.98 ± 8.96
Tier 3 | Shape assembly (e.g. bridge structure)     | 4–6   | 7 / 10 (70%)  | 183.33 ± 7.38
Tier 4 | Complex shape assembly                     | not attempted

Runtime is full-pipeline wall-clock time (perception, planning, execution, communication), so it is not comparable to the simulation timings above.

How to Submit

We welcome new submissions to the leaderboard. To add your method:

  1. Evaluate on all four tiers using the standardized metrics above, reported as the average per tier.
  2. Prepare a short description of your method with a link to the paper or code.
  3. Report the hardware you used, so timing results can be interpreted fairly.
  4. Ideally, provide the running code in a reproducible form — for example an Apptainer/Singularity or Docker image — so your numbers can be reproduced with little overhead.
  5. Email your per-tier results table to workbenchmark@ml.rwth-aachen.de, or open a pull request on the GitHub repository.

Submissions are reviewed before being added to keep the leaderboard fair and reproducible.

The Task

Each task starts from a defined initial state that lists every available LEGO Duplo brick and its placement in the pick area. The robot must autonomously move the correct bricks and assemble them into a target product — a specific arrangement of brick positions, rotations, types, and colors — in the assembly area. The same tasks are defined for both simulation and real-world execution.

Initial state — bricks laid out in the pick area
Target product — assembled in the assembly area

Example of a Tier 4 task: the initial pick-area configuration (left) and its corresponding target assembly (right).

The Workspace

A task takes place in a workspace of two disjoint regions: a pick area A_pick and an assembly area A_asm. The regions may be placed freely within a 1 × 1 × 2 m bounding volume, so the benchmark accommodates a range of robot embodiments.

Initially the pick area holds a set of bricks placed flat and mutually non-contacting, with a minimum spacing of 3 cm. It supplies at least one brick of every shape and color required by the target (counting multiplicity) and may include additional distractor bricks.
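
The supply condition above (every required shape and color is available, counting multiplicity, with distractors allowed) can be sketched as a multiset comparison. This is a minimal illustration, assuming each brick is described by the `type` and `color` fields of the task files; `missing_bricks` is a hypothetical helper, and the 3 cm spacing constraint is not checked here.

```python
from collections import Counter


def missing_bricks(pick_area, target):
    """Return the (type, color) pairs the pick area lacks, with the missing count."""
    available = Counter((b["type"], b["color"]) for b in pick_area)
    required = Counter((b["type"], b["color"]) for b in target)
    # Counter subtraction keeps only positive counts, so surplus
    # (distractor) bricks in the pick area are ignored.
    return dict(required - available)


pick_area = [
    {"type": "brick_2x2", "color": "blue"},
    {"type": "brick_2x2", "color": "yellow"},
    {"type": "brick_4x2", "color": "red"},  # distractor, not needed by the target
]
target = [
    {"type": "brick_2x2", "color": "blue"},
    {"type": "brick_2x2", "color": "yellow"},
]
print(missing_bricks(pick_area, target))  # {}
```

An empty result means the pick area can supply the target; any entry names a brick kind and how many more of it would be needed.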

Experimental workspace: a Franka Emika Panda, a tabletop pick and assembly area, a camera, and LEGO Duplo bricks.

Target Structure as a Graph

A target assembly is encoded as a directed graph G = (V, E). Each vertex is a brick labelled with its shape and color; a directed edge (u, v) means brick v is stacked on u, annotated with the relative pose T_{u,v} ∈ SE(3) of v's center in u's frame. Absolute target poses for every brick follow by traversing the graph. Both the initial and target arrangements are stored as YAML files (see Dataset & Usage).
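
To make the graph traversal concrete, the sketch below composes relative poses along the edges to obtain absolute poses. It is only an illustration under stated assumptions: poses are 4 × 4 homogeneous matrices stored as nested lists, the structure has a single root brick whose absolute pose is known, and all names (`absolute_poses`, `translation`, the brick labels) are hypothetical rather than taken from the released code.

```python
from collections import deque


def matmul4(a, b):
    """Multiply two 4x4 homogeneous transforms given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def translation(x, y, z):
    """Pure-translation transform (identity rotation)."""
    return [[1, 0, 0, x], [0, 1, 0, y], [0, 0, 1, z], [0, 0, 0, 1]]


def absolute_poses(root, root_pose, edges):
    """Breadth-first traversal: pose(v) = pose(u) * T_uv for every edge (u, v)."""
    children = {}
    for (u, v), t_uv in edges.items():
        children.setdefault(u, []).append((v, t_uv))
    poses = {root: root_pose}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v, t_uv in children.get(u, []):
            if v not in poses:  # a brick resting on several bricks is placed once
                poses[v] = matmul4(poses[u], t_uv)
                queue.append(v)
    return poses


# Three stacked bricks; each edge stores the upper brick's pose in the lower brick's frame.
edges = {
    ("base", "middle"): translation(0.0, 0.0, 0.02),
    ("middle", "top"): translation(0.016, 0.0, 0.02),
}
poses = absolute_poses("base", translation(-0.096, 0.084, 0.0), edges)
for name, pose in poses.items():
    print(name, [row[3] for row in pose[:3]])  # absolute x, y, z of each brick
```

Rotations are handled by the same matrix product; the example uses translations only to keep the numbers easy to follow.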

Rules

  • The initial-state file is only for setting up the task — it must not be used as a substitute for perception.
  • The system must sense the scene, estimate brick poses, plan, and execute autonomously, with no human intervention or environment resets.
  • A task is successful when the product is assembled in the assembly area and is no longer in contact with the robot.

The Benchmark

400 simulation tasks · 4 complexity tiers · 40 real-world tasks · 1 × 1 × 2 m bounding volume

The 400 tasks (100 per tier, 10 of each reserved for real-robot experiments) are categorized into four complexity tiers that independently scale the planning challenge (order dependencies, interlocking sub-assemblies) and the manipulation challenge (tight tolerances, multi-layer insertions).

Tier 1 · Two-Brick Vertical Stacking (2 bricks)
Evaluates basic pick-and-place with a vertical two-brick stack.

Tier 2 · Multi-Brick Vertical Stacking (3–5 bricks)
Introduces sequential error accumulation and repetitive precision requirements.

Tier 3 · Shape Assembly (3–12 bricks)
Expands into 3D spatial layouts, requiring horizontal and vertical positioning for multi-column structures.

Tier 4 · Complex Shape Assembly (3–12 bricks)
Adds interlocking elements, overhangs, and stability dependencies that reduce the number of feasible assembly sequences.

Simulation Environment

The release includes a reproducible MuJoCo simulation environment built with LIBERO tools and task conventions, so it plugs into the familiar LIBERO-style workflow and task interface — while the LEGO Duplo assets, baseplates, contact settings, and tasks are defined specifically for WorkBenchMark. The scene models a Franka Emika Panda with a parallel-jaw gripper, a tabletop pick and assembly area, and LEGO-compatible plates, observed by a fixed external and a wrist-mounted RGB-D camera. The same tasks transfer to a real UR5 with a Robotiq 2F-style gripper.

Faithful brick contacts. Bricks load from STL meshes scaled to metric units; bodies use box colliders and studs use cylindrical primitives, with friction and contact parameters tuned for stable stud-based insertion.

Two camera views. A global external view for scene-level detection and pose estimation, plus a wrist view for robust close-range manipulation under occlusion.

Grid-aligned targets. Baseplates carry cylindrical studs aligned to the LEGO Duplo grid, giving a fixed geometric reference for assembled structures.

Reproducible protocol. Every method receives identical initial configurations and target specifications per tier. Reference runs used a single NVIDIA RTX 4000 Ada GPU.

Baseline

We provide an integrated baseline that combines constraint-driven Assembly-by-Disassembly planning — enforcing grasp reachability and structural stability — with open-vocabulary perception (language-guided detection, segmentation, and 6D pose estimation) and collision-aware motion execution. It produces physically feasible assembly sequences and outperforms a modern vision-language-action baseline across all tiers (see the leaderboard). Full method details are in the paper.
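
The general idea behind Assembly-by-Disassembly can be illustrated with a toy planner: take the finished product apart brick by brick, then reverse the removal order. The sketch below is not the baseline's implementation. It assumes the only constraint is that a brick can be removed once nothing rests on it, and it omits the grasp-reachability and structural-stability checks described above; the function name and the `supports` encoding are hypothetical.

```python
def assembly_by_disassembly(bricks, supports):
    """Derive an assembly order by disassembling the target and reversing.

    bricks:   iterable of brick names.
    supports: set of (lower, upper) pairs meaning `upper` rests on `lower`.
    """
    remaining = set(bricks)
    removal_order = []
    while remaining:
        # A brick is covered while some remaining brick still rests on it.
        covered = {lower for lower, upper in supports if upper in remaining}
        removable = sorted(b for b in remaining if b not in covered)
        if not removable:
            raise ValueError("cyclic support relation; no brick can be removed")
        removal_order.extend(removable)
        remaining.difference_update(removable)
    # Reversing the disassembly sequence yields a feasible assembly sequence.
    return removal_order[::-1]


# A small bridge: the deck rests on two pillars.
bridge = ["left", "right", "deck"]
rests_on = {("left", "deck"), ("right", "deck")}
print(assembly_by_disassembly(bridge, rests_on))  # ['right', 'left', 'deck']
```

Planning backwards from the target is convenient because every intermediate state is a sub-assembly of the finished product, so feasibility checks can be run on structures that are known to be valid.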

Pipeline overview: from RGB-D input and a product specification to a physically feasible, collision-aware assembly sequence.

Dataset & Usage

Grab the 400 tasks and the simulation environment, run the baseline, and put your method on the leaderboard.

Released under an open license — free for research and competition use.

Task Specification Format

Each task is a structured YAML file describing the target product as a set of bricks. Every brick has a name (a key, not bound to a specific physical brick), a type (e.g. brick_2x2, brick_4x2), a color, a pos (translation, in metres) and a rotation (Euler angles, in degrees), all relative to a common frame.

blocks:
  - name: "2x2_brick_1"
    type: "brick_2x2"
    color: "blue"
    pos: [-0.096, 0.084, 0.0]
    rotation: [0, 0, 0]

  - name: "2x2_brick_2"
    type: "brick_2x2"
    color: "yellow"
    pos: [-0.096, 0.084, 0.02]
    rotation: [0, 0, 90]

Loading a Task in Python

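import yaml  # PyYAML

# Parse the task file; safe_load avoids constructing arbitrary Python objects.
with open("tier1.yaml", encoding="utf-8") as f:
    task = yaml.safe_load(f)

# Each entry under "blocks" describes one brick of the target product.
for brick in task["blocks"]:
    print(brick["name"], brick["type"], brick["color"],
          brick["pos"], brick["rotation"])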
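
Once a task is loaded, it can help to validate the entries before planning. The sketch below converts the dictionary returned by `yaml.safe_load` into typed records and checks the fields listed in the task specification. The `Brick` class, `parse_task`, and the uniqueness check on names are illustrative assumptions, not part of the released loaders; the task is written inline as a dictionary so the example runs without a file.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = {"name", "type", "color", "pos", "rotation"}


@dataclass(frozen=True)
class Brick:
    name: str
    type: str
    color: str
    pos: tuple       # (x, y, z) translation in metres
    rotation: tuple  # Euler angles in degrees


def parse_task(task):
    """Convert a loaded task dictionary into validated Brick records."""
    bricks = []
    for entry in task["blocks"]:
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            raise ValueError(f"brick entry lacks fields: {sorted(missing)}")
        if len(entry["pos"]) != 3 or len(entry["rotation"]) != 3:
            raise ValueError(f"{entry['name']}: pos and rotation need three values")
        bricks.append(Brick(entry["name"], entry["type"], entry["color"],
                            tuple(float(v) for v in entry["pos"]),
                            tuple(float(v) for v in entry["rotation"])))
    names = [b.name for b in bricks]
    if len(names) != len(set(names)):  # assumption: names act as unique keys
        raise ValueError("brick names must be unique")
    return bricks


# Same content as the YAML example, as yaml.safe_load would return it.
task = {"blocks": [
    {"name": "2x2_brick_1", "type": "brick_2x2", "color": "blue",
     "pos": [-0.096, 0.084, 0.0], "rotation": [0, 0, 0]},
    {"name": "2x2_brick_2", "type": "brick_2x2", "color": "yellow",
     "pos": [-0.096, 0.084, 0.02], "rotation": [0, 0, 90]},
]}
for brick in parse_task(task):
    print(brick.name, brick.pos, brick.rotation)
```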

Running the Simulation

Tasks run in the provided MuJoCo + LIBERO environment. Grab the code below and follow the quickstart to roll out a task or evaluate a method.

# Setup and launch instructions — TBD
# (released together with the simulation code)

License

Everything will be released openly upon publication. The dataset and the code will be licensed separately (licenses to be confirmed).

FAQ

How do I load a task?

Each task is a YAML file; use the Python snippet above (yaml.safe_load) or the loaders in the released code.

Which split should I report?

Report all four tiers, averaged per tier, exactly as in the leaderboard. The 40 real-robot tasks (10 per tier) are a separate track.

Can I use the initial-state file as input?

No — it is only for setting up the task. Methods must perceive the scene; see the task rules.

How do I get on the leaderboard?

Follow How to Submit. Providing a reproducible (e.g. Apptainer/Docker) run is encouraged.

Citation

@inproceedings{ma2026workbenchmark,
  title     = {WorkBenchMark: A LEGO-Based Assembly Benchmark with an
               Assembly-by-Disassembly Baseline for the Smart Manufacturing League},
  author    = {Ma, Wenbo and Swoboda, Daniel and Tschesche, Matteo and Hofmann, Till},
  booktitle = {RoboCup 2026: Robot World Cup},
  year      = {2026},
  note      = {To appear}
}