WorkBenchMark

A LEGO-Based Assembly Benchmark with an Assembly-by-Disassembly Baseline for the Smart Manufacturing League

Oral Presentation · RoboCup Symposium 2026 · Incheon, South Korea

Wenbo Ma¹, Daniel Swoboda¹, Matteo Tschesche², Till Hofmann¹

¹Chair of Machine Learning and Reasoning (i6), RWTH Aachen University
²MASCOR Institute, FH Aachen University of Applied Sciences

A benchmark for robotic assembly — manipulation meets task-level reasoning

Every task: build the target product from a pile of bricks.

Abstract

We introduce WorkBenchMark, a LEGO Duplo-based robotic assembly benchmark motivated by the RoboCup Smart Manufacturing League. Robotic assembly couples low-level manipulation with task-level symbolic reasoning under physical constraints — a combination that current end-to-end learning methods do not yet solve reliably. The benchmark provides 400 tasks across four complexity tiers, together with a simulation environment and an open-vocabulary Assembly-by-Disassembly baseline that combines language-guided perception, constraint-based planning, and collision-aware execution. The pipeline is evaluated against zero-shot and fine-tuned vision-language-action baselines on 100 tasks per tier, and achieves higher success, execution accuracy, and stability across all tiers — especially on long-horizon assemblies. The benchmark, simulation environment, and baseline implementation will be released openly to support the broader robotic assembly community.

400 assembly tasks across four difficulty tiers (100 each); 40 for real robots, 10 per tier.
Every task: build a target LEGO Duplo product from scattered bricks — coupling manipulation with task-level reasoning under physical constraints.
A reproducible MuJoCo + LIBERO simulation environment with structured YAML task specifications.
Standardized metrics (success, accuracy, planning/wall time, stability) and a public leaderboard with simulation and real-robot tracks.
An open-vocabulary Assembly-by-Disassembly baseline that outperforms zero-shot and fine-tuned vision-language-action baselines across all tiers.

WorkBenchMark draws its task design from the Workbench Track of the RoboCup Smart Manufacturing League (SML), where robots assemble products from supplied parts. You do not need to take part in RoboCup to use it — it is a standalone, open benchmark for anyone working on robotic assembly, task-and-motion planning, or manipulation.

The Task

Each task starts from a defined initial state that lists every available LEGO Duplo brick and its placement in the pick area. The robot must autonomously move the correct bricks and assemble them into a target product — a specific arrangement of brick positions, rotations, types, and colors — in the assembly area. The same tasks are defined for both simulation and real-world execution.

Initial state: bricks laid out in the pick area.

Target product: assembled in the assembly area.

Example of a Tier 4 task: the initial pick-area configuration (left) and its corresponding target assembly (right).

The Workspace

A task takes place in a workspace of two disjoint regions — a pick area A_pick and an assembly area A_asm. The regions may be placed freely within a 1 × 1 × 2 m bounding volume, so the benchmark accommodates a range of robot embodiments.

Initially, the pick area holds a set of bricks placed flat and not touching one another, with a minimum spacing of 3 cm. It supplies every brick the target requires, matched by shape and color and counted with multiplicity, and may include additional distractor bricks.
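The supply condition on the pick area amounts to a multiset comparison. The sketch below is illustrative only: the function name `missing_bricks` and the dictionary layout (with `type` and `color` keys, mirroring the task files) are assumptions, and the 3 cm spacing rule is not checked here.

```python
from collections import Counter


def missing_bricks(pick_area, target):
    """Return the bricks the target needs but the pick area lacks.

    Both arguments are lists of dicts with "type" and "color" keys
    (hypothetical layout). Distractor bricks in the pick area are ignored.
    """
    need = Counter((b["type"], b["color"]) for b in target)
    have = Counter((b["type"], b["color"]) for b in pick_area)
    # Counter subtraction keeps only positive counts, i.e. the shortfall.
    return need - have


target = [
    {"type": "brick_2x2", "color": "blue"},
    {"type": "brick_2x2", "color": "blue"},
    {"type": "brick_4x2", "color": "yellow"},
]
pick_area = [
    {"type": "brick_2x2", "color": "blue"},
    {"type": "brick_4x2", "color": "yellow"},
    {"type": "brick_2x2", "color": "red"},  # distractor
]

shortfall = missing_bricks(pick_area, target)
print(dict(shortfall))  # one blue brick_2x2 is missing
```

An empty result means the pick area covers the target, including repeated bricks of the same shape and color.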

Experimental workspace with pick area, assembly area, and camera: a Franka Emika Panda, a tabletop pick and assembly area, and LEGO Duplo bricks.

Target Structure as a Graph

A target assembly is encoded as a directed graph G = (V, E). Each vertex is a brick labelled with its shape and color. A directed edge (u, v) means that brick v is stacked on brick u, and it is annotated with the relative pose T_{u,v} ∈ SE(3) of v's center in u's frame. The absolute target pose of every brick follows from traversing the graph and composing these relative poses. Both the initial and target arrangements are stored as YAML files (see Dataset & Usage).
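A minimal sketch of that traversal, assuming poses are stored as 4×4 homogeneous matrices and that rotations are yaw-only for brevity. The helper names (`pose`, `absolute_poses`) and the numeric offsets are illustrative and not taken from the released code.

```python
import math
from collections import deque


def pose(x, y, z, yaw_deg=0.0):
    """Build a 4x4 homogeneous transform: yaw about z, then translation."""
    c = math.cos(math.radians(yaw_deg))
    s = math.sin(math.radians(yaw_deg))
    return [[c, -s, 0.0, x],
            [s, c, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def absolute_poses(root, root_pose, edges):
    """Breadth-first traversal composing T_world,v = T_world,u * T_u,v.

    edges maps (u, v) to the relative pose T_u,v. If a brick has several
    supporting bricks, the first edge reached determines its pose.
    """
    children = {}
    for (u, v), t_uv in edges.items():
        children.setdefault(u, []).append((v, t_uv))
    poses = {root: root_pose}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v, t_uv in children.get(u, []):
            if v not in poses:
                poses[v] = matmul4(poses[u], t_uv)
                queue.append(v)
    return poses


# Illustrative three-brick chain; the offsets are made-up example values.
edges = {
    ("base", "top"): pose(0.0, 0.0, 0.02, yaw_deg=90),
    ("top", "cap"): pose(0.016, 0.0, 0.02),
}
poses = absolute_poses("base", pose(-0.096, 0.084, 0.0), edges)
cap_xyz = [poses["cap"][i][3] for i in range(3)]
print([round(v, 3) for v in cap_xyz])
```

Because "top" is rotated by 90 degrees, the x-offset of "cap" in top's frame shows up as a y-offset in the world frame, which is why the relative poses must be composed rather than simply added.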

Rules

The initial-state file is only for setting up the task — it must not be used as a substitute for perception.
The system must sense the scene, estimate brick poses, plan, and execute autonomously, with no human intervention or environment resets.
A task is successful when the product is assembled in the assembly area and is no longer in contact with the robot.

Real-Robot Demonstrations

A UR cobot with a parallel gripper running the benchmark end-to-end — perceiving the scene, planning grasps, and assembling each target autonomously. Five tasks of increasing complexity, sped up and shown without sound.

Task 1 of 5 · Tier 1 · 1 brick

Single brick: perceive one brick in the pick area, plan a grasp, and place it on the assembly baseplate.

The Benchmark

400 simulation tasks
4 complexity tiers
40 real-world tasks
1 × 1 × 2 m bounding volume

The 400 tasks (100 per tier, 10 of each reserved for real-robot experiments) are categorized into four complexity tiers that independently scale the planning challenge (order dependencies, interlocking sub-assemblies) and the manipulation challenge (tight tolerances, multi-layer insertions).

Tier 1 · Two-Brick Vertical Stacking

2 bricks

Evaluates basic pick-and-place with a vertical two-brick stack.

Tier 2 · Multi-Brick Vertical Stacking

3–5 bricks

Introduces sequential error accumulation and repetitive precision requirements.

Tier 3 · 3D Shape Assembly

3–12 bricks

Expands into 3D spatial layouts, requiring horizontal and vertical positioning for multi-column structures.

Tier 4 · Complex Shape Assembly

3–12 bricks

Interlocking elements, overhangs, and stability dependencies that reduce the number of feasible assembly sequences.
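The brick counts per tier can serve as a quick sanity check when generating or loading tasks. The sketch below only encodes the ranges listed above; the names `TIER_BRICK_RANGE` and `brick_count_ok` are hypothetical and not part of the release.

```python
# Inclusive (min, max) brick counts per tier, as listed in the tier overview.
TIER_BRICK_RANGE = {
    1: (2, 2),   # two-brick vertical stacking
    2: (3, 5),   # multi-brick vertical stacking
    3: (3, 12),  # 3D shape assembly
    4: (3, 12),  # complex shape assembly
}


def brick_count_ok(tier, n_bricks):
    """Return True if a task with n_bricks fits the stated range of its tier."""
    low, high = TIER_BRICK_RANGE[tier]
    return low <= n_bricks <= high


print(brick_count_ok(2, 4), brick_count_ok(1, 3))
```

Tiers 3 and 4 share the same count range, so the count alone cannot tell them apart; they differ in structure (interlocking elements, overhangs, stability dependencies).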

Simulation Environment

The release includes a reproducible MuJoCo simulation environment built with LIBERO tools and task conventions, so it plugs into the familiar LIBERO-style workflow and task interface — while the LEGO Duplo assets, baseplates, contact settings, and tasks are defined specifically for WorkBenchMark. The scene models a Franka Emika Panda with a parallel-jaw gripper, a tabletop pick and assembly area, and LEGO-compatible plates, observed by a fixed external and a wrist-mounted RGB-D camera.

Faithful brick contacts. Bricks are loaded from STL meshes scaled to metric units; brick bodies use box colliders and studs use cylindrical primitives, with friction and contact parameters tuned for stable stud-based insertion.

Two camera views. A global external view for scene-level detection and pose estimation, plus a wrist view for robust close-range manipulation under occlusion.

Grid-aligned targets. Baseplates carry cylindrical studs aligned to the LEGO Duplo grid, giving a fixed geometric reference for assembled structures.

Reproducible protocol. Every method receives identical initial configurations and target specifications per tier. Reference runs used a single NVIDIA RTX 4000 Ada GPU.

Leaderboard

Methods evaluated on WorkBenchMark, in simulation and on a real robot. Each cell shows the four per-tier scores (T1 T2 T3 T4) with the Overall average below. Want to appear here? See How to Submit.

Simulation Track

Legend: in each cell, the four small values (e.g. 94.00 87.00 67.00 62.00) are Tiers 1–4 and the bold value (e.g. 77.50) is the Overall average. ↑ higher is better · ↓ lower is better · blue = best Overall, bold = best in that tier.
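As a worked example of how a cell is put together, the Overall value is the unweighted mean of the four per-tier scores, which is valid here because every tier has the same number of tasks. The function name `overall_score` is illustrative.

```python
def overall_score(tier_scores):
    """Unweighted mean of the four per-tier scores (T1..T4)."""
    if len(tier_scores) != 4:
        raise ValueError("expected exactly four per-tier scores")
    return sum(tier_scores) / len(tier_scores)


# The per-tier values used in the legend example.
print(f"{overall_score([94.0, 87.0, 67.0, 62.0]):.2f}")  # prints 77.50
```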

Planning and wall-clock times depend on hardware and implementation — treat cross-submission timing comparisons with caution.

Real-Robot Track

Results on a physical robot (e.g. a UR5-class arm with a parallel-jaw gripper and an RGB-D camera), 10 trials per tier. Tier 4 is reserved for future real-robot validation.

Success counts complete, free-standing assemblies over 10 trials per tier. Runtime is full-pipeline wall-clock time (perception, planning, execution, communication), so it is not comparable to the simulation timings above.

What the Metrics Mean

How to Submit

We welcome new submissions to the leaderboard. To add your method:

Evaluate on all four tiers using the standardized metrics above, reported as the average per tier.
Prepare a short description of your method with a link to the paper or code.
Report the hardware you used, so timing results can be interpreted fairly.
Ideally, provide the running code in a reproducible form — for example an Apptainer/Singularity or Docker image — so your numbers can be reproduced with little overhead.
Email your per-tier results table to workbenchmark@ml.rwth-aachen.de, or open a pull request on the GitHub repository.

Submissions are reviewed before being added to keep the leaderboard fair and reproducible.

Baseline

We provide an integrated baseline that combines constraint-driven Assembly-by-Disassembly planning — enforcing grasp reachability and structural stability — with open-vocabulary perception (language-guided detection, segmentation, and 6D pose estimation) and collision-aware motion execution. It produces physically feasible assembly sequences and reaches 77.5% overall success — ahead of both a fine-tuned (42.5%) and a zero-shot (38.3%) vision-language-action baseline across all tiers (see the leaderboard). Full method details are in the paper.

Input: product.yaml

Step 1 · Planning: Assembly-by-Disassembly with a voxel feasibility check
Step 2 · Perception: GroundingDINO + SAM + FoundationPose
Step 3 · Execution: 8-state machine with MoveIt 2

Pipeline overview: from RGB-D input and a product specification to a physically feasible, collision-aware assembly sequence.
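The general idea behind Assembly-by-Disassembly can be sketched in a few lines: starting from the finished product, repeatedly remove a brick that can be taken off, then reverse the removal order to obtain an assembly sequence. The code below is a simplified illustration and not the baseline's planner. It uses only the stacking edges of the target graph, picks greedily without backtracking, and stands in for the voxel feasibility, grasp reachability, and stability checks with an optional, hypothetical `is_feasible` callback.

```python
def assembly_order(bricks, edges, is_feasible=None):
    """Derive an assembly sequence by reversing a disassembly sequence.

    bricks: iterable of brick names.
    edges: iterable of (u, v) pairs meaning v is stacked on u.
    is_feasible: optional callback (brick, remaining) -> bool, a placeholder
        for reachability and stability checks (assumption, not the real API).
    """
    on_top = {b: set() for b in bricks}
    for u, v in edges:
        on_top[u].add(v)

    remaining = set(on_top)
    removed = []
    while remaining:
        # A brick is free if nothing still in the structure rests on it.
        free = sorted(
            b for b in remaining
            if not (on_top[b] & remaining)
            and (is_feasible is None or is_feasible(b, remaining))
        )
        if not free:
            raise ValueError("no brick can be removed; no sequence found")
        brick = free[0]  # greedy choice, deterministic via sorting
        remaining.remove(brick)
        removed.append(brick)
    return removed[::-1]  # build in the reverse order of disassembly


# A small bridge: two pillars on a base, joined by a brick on top.
bricks = ["base", "left", "right", "bridge"]
edges = [("base", "left"), ("base", "right"),
         ("left", "bridge"), ("right", "bridge")]
order = assembly_order(bricks, edges)
print(order)  # ['base', 'right', 'left', 'bridge']
```

Working backwards from the target means every intermediate state is a sub-structure of the final product, which is what makes it natural to test each removal for feasibility before committing to it.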

Simulation · Tier 1, clip 1 of 2 · Task 001 (2 bricks)

2-brick vertical stack: two 4×2 bricks stacked vertically, with a green base and a yellow top.

Dataset & Usage

Grab the 400 tasks and the simulation environment, run the baseline, and put your method on the leaderboard.

Released under an open license — free for research and competition use.

Task Specification Format

Each task is a structured YAML file describing the target product as a list of bricks under the top-level key blocks. Every brick has a name (a key, not bound to a specific physical brick), a type (e.g. brick_2x2, brick_4x2), a color, a pos (translation, in meters) and a rotation (Euler angles, in degrees), all relative to a common frame.

blocks:
  - name: "2x2_brick_1"
    type: "brick_2x2"
    color: "blue"
    pos: [-0.096, 0.084, 0.0]
    rotation: [0, 0, 0]

  - name: "2x2_brick_2"
    type: "brick_2x2"
    color: "yellow"
    pos: [-0.096, 0.084, 0.02]
    rotation: [0, 0, 90]

Loading a Task in Python

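import yaml  # provided by the PyYAML package

# Load one task specification (the file name is an example).
with open("tier1.yaml", encoding="utf-8") as f:
    task = yaml.safe_load(f)

# Each entry under "blocks" describes one brick of the target product.
for brick in task["blocks"]:
    print(brick["name"], brick["type"], brick["color"],
          brick["pos"], brick["rotation"])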
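Once a task is loaded, it can help to validate the entries and convert them into typed records before planning. The sketch below works on an already parsed dictionary, so it runs without a task file. The `Brick` dataclass and `parse_bricks` function are hypothetical helpers, and treating brick names as unique is an assumption based on their description as keys.

```python
from dataclasses import dataclass

REQUIRED_FIELDS = ("name", "type", "color", "pos", "rotation")


@dataclass(frozen=True)
class Brick:
    name: str
    type: str
    color: str
    pos: tuple       # (x, y, z) translation in meters
    rotation: tuple  # Euler angles in degrees


def parse_bricks(task):
    """Validate the "blocks" list of a parsed task and return Brick records."""
    bricks, seen = [], set()
    for i, entry in enumerate(task["blocks"]):
        missing = [k for k in REQUIRED_FIELDS if k not in entry]
        if missing:
            raise ValueError(f"brick #{i} is missing fields: {missing}")
        if len(entry["pos"]) != 3 or len(entry["rotation"]) != 3:
            raise ValueError(f"brick #{i}: pos and rotation need 3 values")
        if entry["name"] in seen:  # uniqueness is an assumption
            raise ValueError(f"duplicate brick name: {entry['name']}")
        seen.add(entry["name"])
        bricks.append(Brick(
            name=entry["name"],
            type=entry["type"],
            color=entry["color"],
            pos=tuple(float(v) for v in entry["pos"]),
            rotation=tuple(float(v) for v in entry["rotation"]),
        ))
    return bricks


# Same content as the example specification above, as a parsed dict.
task = {"blocks": [
    {"name": "2x2_brick_1", "type": "brick_2x2", "color": "blue",
     "pos": [-0.096, 0.084, 0.0], "rotation": [0, 0, 0]},
    {"name": "2x2_brick_2", "type": "brick_2x2", "color": "yellow",
     "pos": [-0.096, 0.084, 0.02], "rotation": [0, 0, 90]},
]}
bricks = parse_bricks(task)
print(len(bricks), bricks[1].color, bricks[1].pos)
```

The result of `yaml.safe_load` can be passed to `parse_bricks` directly, since it yields the same dictionary structure.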

Running the Simulation

Tasks run in the provided MuJoCo + LIBERO environment. Grab the code below and follow the quickstart to roll out a task or evaluate a method.

Simulation environment — link TBD

# Setup and launch instructions — TBD
# (released together with the simulation code)

License

Everything is released openly upon publication, with the dataset and the code licensed separately (to be confirmed):

Dataset (tasks & assets) — planned CC BY 4.0.
Code (simulation environment & baseline) — planned MIT.

FAQ

How do I load a task?

Each task is a YAML file; use the Python snippet above (yaml.safe_load) or the loaders in the released code.

Which split should I report?

Report all four tiers, averaged per tier, exactly as in the leaderboard. The 40 real-robot tasks (10 per tier) are a separate track.

Can I use the initial-state file as input?

No — it is only for setting up the task. Methods must perceive the scene; see the task rules.

How do I get on the leaderboard?

Follow How to Submit. Providing a reproducible (e.g. Apptainer/Docker) run is encouraged.

Citation

@inproceedings{ma2026workbenchmark,
  title     = {WorkBenchMark: A LEGO-Based Assembly Benchmark with an
               Assembly-by-Disassembly Baseline for the Smart Manufacturing League},
  author    = {Ma, Wenbo and Swoboda, Daniel and Tschesche, Matteo and Hofmann, Till},
  booktitle = {RoboCup 2026: Robot World Cup XXIX},
  series    = {Lecture Notes in Computer Science},
  publisher = {Springer},
  address   = {Incheon, South Korea},
  year      = {2026},
  eprint        = {2606.19358},
  archivePrefix = {arXiv},
  note      = {Oral presentation, RoboCup Symposium 2026}
}