Mapping Motion: Architecting an End-to-End MLOps Pipeline for Legged Robots
1. Introduction: The Sim-to-Real Challenge
In traditional Machine Learning, a bad prediction usually results in a lower click-through rate or a flagged email. In robotics, a bad prediction results in physical damage: a robot losing its balance, crashing into a wall or stripping a gear.
For an MLOps engineer, this shift in stakes changes everything. The challenge is no longer just about serving a model via an API; it is about managing a lifecycle in which the code must interact with the immutable laws of physics.
This article details the architecture of the Locomotion MLOps Platform, an example of what a production-grade platform designed to manage the lifecycle of Reinforcement Learning (RL) policies for the Anymal-D wheeled-legged robot could look like. While training a robot to walk in a perfect simulation is now a solved problem, reliably deploying that capability to an edge device like an NVIDIA Jetson, without breaking the hardware, remains a complex operations challenge.
Personal Motivation: Closing the Gap
I do not come from a robotics background, but I am passionate about robots, artificial intelligence and data. My experience working with Generative AI prepared me well for this. Much like managing the unpredictability of LLMs, robotics requires architecting systems that can handle the inherent noise and ‘hallucinations’ of the real world. I embarked on this project with a specific goal: to close the gap between my expertise in software infrastructure and the complex, noisy reality of robotics.
I wanted to answer a fundamental question: How do we apply the rigour of CI/CD and the safety of modern MLOps to a system that moves?
I realized that while there are thousands of papers on new RL algorithms, there are very few resources on the infrastructure required to ship them. This project is my attempt to build that bridge, moving from a “works on my machine” research script to a pipeline that can handle Hyperparameter Optimization (HPO), distillation, automated simulation gates, security scanning and safe canary deployments.
The Platform Architecture
Before diving into the details, here is the high-level workflow I designed to move code from a laptop to the robot:
The Problem: The “Happy Path” vs. Reality
The central friction in robotic learning is the Sim-to-Real gap. In the Isaac Lab simulator, we have privileged information. We know the exact friction of the floor, the precise velocity of every joint, and the perfect orientation of the robot. In the real world, sensors are noisy, friction varies and actuators lag.

A naive pipeline simply trains a massive model in this perfect world and hopes it works on the physical robot. This rarely succeeds. To build a truly production-ready pipeline, we cannot just train a policy. Among the many challenges involved, I focused on addressing these three:
- Optimize efficiently: Stop wasting compute on models that have plateaued (using HPO and Early Stopping).
- Compress (Teacher-Student Distillation): Bridge the gap by first training a “Teacher” model with access to perfect, privileged simulation data, then distilling this knowledge into a lightweight “Student” model that learns to mimic the Teacher using only the noisy sensor data available in the real world, effectively teaching the robot to navigate without the “cheat codes” of the simulator.
- Validate: Ensure that what works in Docker also works on metal, using Anomaly Detection to catch behavior that drifts too far from the training distribution.
This platform is not just about making a robot walk; it is about building the factory that builds the walking robot.
2. Phase I: Optimization & Efficiency (The Lab)
Before we can deploy a robot, we need a policy that actually works. In Reinforcement Learning (RL), finding a configuration that converges is often a process of trial and error. To structure this, I implemented a distributed Hyperparameter Optimization (HPO) sweep using Ray Tune.
Hunting for the Best Brain (HPO)
The search space for RL is vast. A slightly wrong learning rate or an incorrect entropy coefficient can prevent the robot from ever learning to stand. Instead of manually tuning these values, I automated the process using the ASHA (Asynchronous Successive Halving Algorithm) scheduler.
Why ASHA? (The Resource Constraint)
I chose ASHA specifically because I was training on a single consumer GPU (an NVIDIA RTX 3090 with 24GB of VRAM). By carefully tuning VRAM usage, I managed to run two trials in parallel. However, even with this optimization, I could not afford to let a bad configuration run for hours. ASHA is aggressive: it pauses poor-performing trials early, freeing up the GPU for more promising candidates. This effectively allowed me to simulate a “massively parallel” sweep on a limited hardware budget.
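To make this concrete, here is a minimal sketch of what such a sweep could look like with Ray Tune's ASHAScheduler. The train_teacher trainable is a stand-in for the real Isaac Lab PPO loop, and the search-space ranges and resource split are illustrative assumptions rather than the project's exact values:

```python
from ray import train, tune
from ray.tune.schedulers import ASHAScheduler


def train_teacher(config):
    """Stand-in for the Isaac Lab PPO loop: report a reward back to Tune each iteration."""
    reward = 0.0
    for _ in range(1500):
        reward += 0.02  # placeholder learning dynamics
        train.report({"episode_reward_mean": reward})


search_space = {
    "learning_rate": tune.loguniform(1e-5, 1e-3),
    "entropy_coef": tune.loguniform(1e-4, 1e-2),
    "hidden_sizes": tune.choice([[256, 128, 64], [512, 256, 128]]),
}

scheduler = ASHAScheduler(
    metric="episode_reward_mean",
    mode="max",
    max_t=1500,          # full training budget in iterations
    grace_period=100,    # every trial gets at least 100 iterations before pruning
    reduction_factor=3,  # keep roughly the best third at each rung
)

tuner = tune.Tuner(
    tune.with_resources(train_teacher, {"gpu": 0.5}),  # two trials share the single GPU
    param_space=search_space,
    tune_config=tune.TuneConfig(scheduler=scheduler, num_samples=20),
)
results = tuner.fit()
print(results.get_best_result(metric="episode_reward_mean", mode="max").config)
```

Requesting half a GPU per trial is what lets two trials share the RTX 3090; ASHA then prunes the weakest candidates at the 100-iteration rung.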
The “Gold Standard” Alternative: Population Based Training (PBT)
If I had access to a dedicated compute cluster, I would have chosen Population Based Training (PBT).
- How it works: PBT doesn’t just find the best static hyperparameters; it evolves them. It runs a population of agents in parallel, where underperforming agents copy the weights and hyperparameters of the best agents, mutating them slightly.
- When to use it: PBT is the industry standard for high-end robotics (e.g., DeepMind) because it yields better results by adapting learning rates dynamically. However, it requires running multiple agents simultaneously (e.g., 16+ GPUs). For a single-GPU setup, ASHA provides the best balance of exploration speed and resource efficiency.
The efficiency of ASHA comes from its ruthlessness: it aggressively terminates trials that are underperforming compared to their peers.

- Total Sweep Time: 3 hours 32 minutes.
- Total Trials: 20 configurations.
- Efficiency: ASHA early-stopped 12 of the 20 trials at just 100 iterations. Only the 5 most promising trials were allowed to run the full 1,500 iterations.
The winning configuration (Reward: 25.50) utilized a large network architecture ([512, 256, 128] units) and a very specific learning rate (2.92e-05). This became our “Teacher” policy.

The “Early Stopping” Trap
Once the best hyperparameters were found, I noticed a critical inefficiency in the training pipeline.
The data showed that the Teacher policy often hit a reward plateau of ~25 by iteration 200. However, the configuration was set to run for 1,500 iterations. This meant that for stable runs, over 70% of the GPU compute was wasted on a policy that had already finished learning.
But in RL, overtraining is not just wasteful; it is also dangerous. Continued PPO training on a converged policy can lead to Entropy Collapse. The policy becomes too confident (deterministic), losing the randomness required to recover from unexpected perturbations. It begins to overfit to the specific artifacts of the physics engine, increasing the Sim-to-Real gap.
To fix this, I implemented a strict Early Stopping mechanism:
- Patience: 200 iterations.
- Min Delta: 0.5 reward units.
The Result: The system now automatically detects convergence. In a benchmark run, training stopped at iteration 657 (Reward 25.88), saving ~843 iterations and approximately 28 minutes of GPU time per run. This ensures we get a high-performing policy without the instability of overfitting.
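The convergence check itself only takes a few lines of framework-agnostic code. Here is a minimal sketch using the patience and min-delta values above, not the project's exact implementation:

```python
class EarlyStopping:
    """Stop training once the reward has not improved by min_delta for `patience` iterations."""

    def __init__(self, patience: int = 200, min_delta: float = 0.5):
        self.patience = patience
        self.min_delta = min_delta
        self.best_reward = float("-inf")
        self.stale_iterations = 0

    def should_stop(self, reward: float) -> bool:
        if reward > self.best_reward + self.min_delta:
            self.best_reward = reward
            self.stale_iterations = 0
        else:
            self.stale_iterations += 1
        return self.stale_iterations >= self.patience


# Inside the training loop (pseudocode for the surrounding PPO iteration):
# stopper = EarlyStopping()
# for iteration in range(max_iterations):
#     mean_reward = run_ppo_iteration()
#     if stopper.should_stop(mean_reward):
#         break  # convergence detected: checkpoint and exit early
```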
3. Phase II: The Bridge to the Edge (Teacher-Student Distillation)
Even with an optimized “Teacher” policy, we face a deployment problem. The Teacher succeeds because it cheats. It has access to privileged simulation state: exact terrain height, true friction coefficients, and perfect joint velocities. The physical robot has none of this: it is effectively blind, relying only on noisy proprioception (internal joint sensors) and an IMU (gyroscope/accelerometer).
To bridge this gap, I implemented a Teacher-Student Distillation pipeline.
“God Mode” vs. Reality
The goal is to transfer the “God Mode” capabilities of the Teacher into a “Student” policy that can survive the real world.

- The Teacher: A large MLP ([512, 256, 128]) that takes privileged state as input. It outputs the optimal action.
- The Student: A compact MLP ([128, 128, 128]) that takes only the noisy onboard sensor readings as input.
The distillation process treats the Teacher as a ground-truth oracle. We freeze the Teacher’s weights and force the Student to predict the Teacher’s actions given only the partial, noisy information available to the sensors.
Why Distillation Works (The TCN vs. MLP Choice)
This process uses Supervised Learning (minimizing MSE Loss between Student and Teacher actions) rather than Reinforcement Learning. This is much more stable than trying to train a blind agent from scratch using rewards.
For the student architecture, I intentionally chose the simplest possible approach: a vanilla MLP ([128, 128, 128]) receiving only the current observation.
While sophisticated architectures like Temporal Convolutional Networks (TCNs) or History Stacks (sliding windows) are often used to give the robot “memory” of past states to handle partial observability, they add complexity to the deployment pipeline (e.g., managing synchronized buffers or custom CUDA kernels).
Since the primary goal of this project was to architect an initial MLOps lifecycle rather than to achieve State-of-the-Art locomotion performance, I opted for the standard MLP. The resulting policy is a pure function mapping current sensor state → action, which is trivial to export to ONNX.
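A minimal sketch of one distillation step is shown below. The layer sizes mirror the architectures described above, but the observation and action dimensions, as well as the rollout batches privileged_obs and sensor_obs, are illustrative placeholders rather than the project's actual values:

```python
import torch
import torch.nn as nn

PRIV_DIM, SENSOR_DIM, ACT_DIM = 235, 48, 12  # illustrative dimensions, not the project's

teacher = nn.Sequential(  # privileged-state policy; in practice loaded from the HPO checkpoint
    nn.Linear(PRIV_DIM, 512), nn.ELU(), nn.Linear(512, 256), nn.ELU(),
    nn.Linear(256, 128), nn.ELU(), nn.Linear(128, ACT_DIM),
)
student = nn.Sequential(  # sensor-only policy that becomes the production artifact
    nn.Linear(SENSOR_DIM, 128), nn.ELU(), nn.Linear(128, 128), nn.ELU(),
    nn.Linear(128, 128), nn.ELU(), nn.Linear(128, ACT_DIM),
)

teacher.requires_grad_(False)  # the Teacher is a frozen oracle
optimizer = torch.optim.Adam(student.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()


def distill_step(privileged_obs: torch.Tensor, sensor_obs: torch.Tensor) -> float:
    """One supervised update: the Student imitates the Teacher on the same timestep."""
    with torch.no_grad():
        target_actions = teacher(privileged_obs)  # oracle actions from privileged state
    predicted_actions = student(sensor_obs)       # actions from noisy sensors only
    loss = loss_fn(predicted_actions, target_actions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```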

We do not deploy the Teacher; the Student is our production artifact.
4. Phase III: The CI/CD Pipeline (The Factory)
In software engineering, if you break the build, you fix the code. In robotics, if you break the build, you might break an expensive robot. Therefore, the Continuous Integration/Continuous Deployment (CI/CD) pipeline acts as the final line of defense before software touches hardware.
Designing for the Gap: Demo vs. Reality
This project was my “learning laboratory”, a way to map my MLOps experience onto robotic constraints without access to a physical lab or a fleet of $50,000 robots. While I implemented a robust software-level CI/CD, I want to be transparent about where this demo ends and where a rigid industrial pipeline begins.
The Implemented Pipeline (Software Verification)
I designed a tiered testing strategy that validates the robot’s “brain” at four distinct levels of abstraction, focusing on what could be automated in a standard cloud environment.
The Robotics Testing Pyramid
| Level | Environment | What it Tests |
|---|---|---|
| 1. Unit Tests | Python (Dev Machine) | Catches logic errors, shape mismatches, and configuration bugs (using pytest). |
| 2. Sim Eval | PyTorch (GPU Server) | Validates policy quality. Does the robot actually walk? Does it achieve the expected reward in Isaac Sim? |
| 3. Edge Sim | ONNX Runtime (x86 Docker) | Validates the export artifact. Does the ONNX file load? Do the inputs/outputs match the C++ runtime expectations? |
| 4. HIL (Hardware-in-the-Loop) | Jetson (Real/Remote) | (Mocked in this project) Validates performance on real hardware. Is the inference latency < 10ms? |
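For example, a Level 1 check can be as small as a shape test. This sketch uses a placeholder policy constructor and illustrative dimensions rather than the project's actual fixtures:

```python
import torch

OBS_DIM, ACT_DIM = 48, 12  # illustrative dimensions


def build_student_policy() -> torch.nn.Module:
    """Placeholder for the project's actual student-policy constructor."""
    return torch.nn.Sequential(
        torch.nn.Linear(OBS_DIM, 128), torch.nn.ELU(),
        torch.nn.Linear(128, 128), torch.nn.ELU(),
        torch.nn.Linear(128, ACT_DIM),
    )


def test_student_output_shape():
    policy = build_student_policy()
    actions = policy(torch.zeros(1, OBS_DIM))
    assert actions.shape == (1, ACT_DIM)


def test_actions_are_finite():
    policy = build_student_policy()
    actions = policy(torch.randn(16, OBS_DIM))
    assert torch.isfinite(actions).all()
```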
Deployment Gates
The pipeline enforces a strict sequence of gates. A failure at any stage stops the release:
- Code Quality: First, standard tools (ruff, mypy) ensure the code is clean and typed.
- Simulation Smoke Test: The CI spins up a headless Isaac Sim instance to train a Teacher for a short burst (100 iterations) and runs a distillation pass. This ensures that a code change didn’t break the physics interaction or the reward calculation.
- Export Verification: The system automatically exports the student policy to ONNX. It then boots a lightweight Docker container (simulating the edge environment) to load that ONNX file and run inference on dummy data. This catches the common “works in PyTorch, fails in ONNX” errors caused by unsupported operators.
- Performance Benchmarking: Finally, we measure inference latency. I set a strict gate of p99 latency < 10ms to allow headroom for communication overhead (a combined sketch of this gate and the export check follows this list).
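Here is a minimal sketch of gates 3 and 4, assuming the exported artifact is named student_policy.onnx and the observation dimension is 48 (both illustrative):

```python
import time
import numpy as np
import onnxruntime as ort

OBS_DIM = 48  # illustrative; must match the exported student policy

session = ort.InferenceSession("student_policy.onnx", providers=["CPUExecutionProvider"])
input_name = session.get_inputs()[0].name

# Gate 3: the artifact loads and produces a well-formed action vector on dummy data.
dummy_obs = np.zeros((1, OBS_DIM), dtype=np.float32)
actions = session.run(None, {input_name: dummy_obs})[0]
assert np.isfinite(actions).all(), "ONNX model produced non-finite actions"

# Gate 4: p99 latency must stay under the 10 ms budget.
latencies = []
for _ in range(1000):
    start = time.perf_counter()
    session.run(None, {input_name: dummy_obs})
    latencies.append((time.perf_counter() - start) * 1000.0)  # milliseconds

p99 = float(np.percentile(latencies, 99))
print(f"p99 latency: {p99:.2f} ms")
assert p99 < 10.0, f"Latency gate failed: p99 = {p99:.2f} ms"
```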
The Production Reality: What I Would Add
While my pipeline ensures the code is healthy and the policy converges in simulation, a real robotics company requires strict hardware verification. If I had access to the lab and hardware of a robotic company, I would extend this pipeline with four critical steps:
- TensorRT Compilation (Not just ONNX): I used ONNX Runtime for simplicity and portability. However, in production, every millisecond counts. The “correct” approach is to compile the ONNX artifact into a TensorRT engine. This performs layer fusion, kernel auto-tuning, and precision calibration (FP16/INT8) specific to a target Jetson Orin.
- True Hardware-in-the-Loop (HIL): My project mocks the edge environment in Docker. A real pipeline uses a “HIL Farm”: actual Jetson boards in a rack. The CI pipeline would deploy the TensorRT engine to these boards to catch issues that Docker misses: thermal throttling, memory fragmentation or specific driver incompatibilities.
- The Physical “Canary”: Simulation is a proxy, not reality. The final gate before a fleet update shouldn’t be a software check, but a physical one. The pipeline would trigger a specific test routine (e.g., “stand up and sit down”) on a dedicated test robot in the office. Only if the physical sensors report success does the code go to production.
- ROS 2 Integration: Finally, the policy doesn’t run in a vacuum. In production, this inference engine runs inside a high-priority ROS 2 Control Node. This node handles the strict control frequency (e.g., 50Hz), subscribes to sensor topics, and publishes motor commands. Integrating the MLOps pipeline with the ROS 2 build system (colcon) is the final bridge between the model and the robot.
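To make the last point concrete, here is a rough rclpy sketch of what such a control node could look like. The topic names, message types, observation construction, and the use of ONNX Runtime instead of a compiled TensorRT engine are all simplifying assumptions:

```python
import numpy as np
import onnxruntime as ort
import rclpy
from rclpy.node import Node
from sensor_msgs.msg import JointState
from std_msgs.msg import Float64MultiArray


class PolicyControlNode(Node):
    def __init__(self):
        super().__init__("locomotion_policy")
        self.session = ort.InferenceSession("student_policy.onnx")  # placeholder artifact
        self.input_name = self.session.get_inputs()[0].name
        self.latest_obs = None
        self.create_subscription(JointState, "/joint_states", self.on_joint_state, 10)
        self.cmd_pub = self.create_publisher(Float64MultiArray, "/joint_commands", 10)
        self.create_timer(0.02, self.control_step)  # 50 Hz control loop

    def on_joint_state(self, msg: JointState):
        # Placeholder observation: joint positions + velocities; a real node would
        # also fold in IMU readings and the command history.
        self.latest_obs = np.array(list(msg.position) + list(msg.velocity), dtype=np.float32)

    def control_step(self):
        if self.latest_obs is None:
            return
        obs = self.latest_obs.reshape(1, -1)
        actions = self.session.run(None, {self.input_name: obs})[0].flatten()
        self.cmd_pub.publish(Float64MultiArray(data=actions.tolist()))


def main():
    rclpy.init()
    rclpy.spin(PolicyControlNode())


if __name__ == "__main__":
    main()
```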
5. Phase IV: Safety & Monitoring (The Safety Net)
Once the model is deployed, we enter the most critical phase. In a simulator, if a model sees a state it wasn’t trained on (e.g., a sudden change in friction or a loose motor), it might fail gracefully or just reset. In the real world, “undefined behavior” can be catastrophic.
To handle the “unknown unknowns” of the real world, I used a safety layer that sits alongside the policy.
The “Sixth Sense”: Anomaly Detection
We cannot train a policy to recognize every possible failure mode, because we don’t know what they all look like. Instead of trying to classify errors (Supervised Learning), I implemented an Unsupervised Anomaly Detection system using an Isolation Forest.
- Training: During the “Student Distillation” phase, we collect the latent vectors and observation patterns of successful runs. The Isolation Forest learns the shape of “normality”, the distribution of joint velocities and body orientations that correspond to a healthy walking gait.
- Inference: On the robot, this lightweight model runs in parallel with the control policy. It assigns an “anomaly score” to every observation.
- The Safety Trigger: If the anomaly score breaches a defined threshold (indicating the robot is entering a state significantly different from its training data), the system can trigger a fallback protocol (e.g., “Safe Stop” or switch to a robust, static controller) before the robot falls.
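A minimal sketch of this detector with scikit-learn follows; the log file, contamination value, and threshold are illustrative assumptions, not the platform's tuned settings:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Observations logged from healthy, successful rollouts during distillation
# (placeholder file; shape [n_samples, obs_dim]).
healthy_observations = np.load("healthy_rollouts.npy")

detector = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
detector.fit(healthy_observations)

ANOMALY_THRESHOLD = -0.2  # illustrative value, tuned on held-out healthy data


def should_trigger_fallback(obs: np.ndarray) -> bool:
    """Return True if the current observation looks too far from the training distribution."""
    score = detector.score_samples(obs.reshape(1, -1))[0]  # lower = more anomalous
    return score < ANOMALY_THRESHOLD
```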
Safe Rollouts: Shadow Mode & Progressive Propagation
Even after a policy passes the physical “office robot” test (Phase III), deploying it to a fleet of hundreds of robots operating in different environments (warehouses, sidewalks, rainy streets) carries residual risk. A slight regression might not cause an immediate crash in the lab but could overheat motors over 20 minutes or drain batteries 15% faster in the field.
To mitigate this, I designed a Progressive Rollout strategy that assumes failure is possible.
(Note: Since I do not have access to a physical fleet, this pipeline stage is currently a mock implementation. My code validates the deployment logic, checking thresholds and timeouts, but forces a “pass” result to demonstrate the workflow without actual hardware).
Instead of a “Big Bang” update, the rollout follows a strict gradient:

- Shadow Mode: The new model is deployed to the robot but does not control the motors. It runs in the background, consuming live sensor data and predicting actions. The system compares these predictions against those of the currently running safe model. If the divergence is too high (indicating the new model would behave drastically differently), the deployment is silently aborted without the robot ever stumbling; a sketch of this divergence check follows the list.
- Sample Group (1%): If Shadow Mode passes, the model is activated on a small, representative subset of the fleet.
- Fleet Expansion: Only if the sample group survives a defined “Soak Time” (e.g., 30 minutes) with healthy metrics does the update propagate to the rest of the fleet.
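The Shadow Mode divergence check can be sketched in a few lines. The threshold value and the choice of comparing raw action vectors with a mean L2 distance are my simplifying assumptions:

```python
import numpy as np

DIVERGENCE_THRESHOLD = 0.15  # illustrative: max tolerated mean action distance


def shadow_divergence(active_actions: np.ndarray, shadow_actions: np.ndarray) -> float:
    """Mean L2 distance between the live policy's actions and the shadow candidate's,
    computed over the same stream of observations (shape [timesteps, action_dim])."""
    return float(np.mean(np.linalg.norm(active_actions - shadow_actions, axis=-1)))


def should_abort_rollout(active_log: np.ndarray, shadow_log: np.ndarray) -> bool:
    return shadow_divergence(active_log, shadow_log) > DIVERGENCE_THRESHOLD
```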
Engineering Reflection: The Cost of Shadow Mode
In an interview or production design meeting, it is crucial to acknowledge the hardware cost of this strategy. Running “Shadow Mode” means executing two neural networks simultaneously (the active control policy + the shadow candidate). On a self-driving car with a trunk full of GPUs, this is negligible. On a $500 Jetson module controlling a dynamic robot, doubling the compute load could push the inference latency beyond the 20ms safety limit (50Hz), causing the robot to fall.
For this project, I calculated the trade-off: my “Student” model is a tiny MLP ([128, 128, 128]) with an inference time of <1ms. Running two instances consumes ~2ms, leaving ample headroom within the control budget. If I were using a heavier architecture (e.g., a Vision Transformer), I would have opted for Offline Log Replay instead: logging sensor data to the cloud and running the shadow evaluation asynchronously in the CI pipeline.
The Kill Switch (Automated Rollback)
The MLOps platform monitors real-time metrics via Prometheus. The system is configured to automatically trigger a rollback to the previous version if any of the following holds (a sketch of such a check follows the list):
- Intervention Rate > 5%: Humans have to take over control too often.
- Success Rate < 95%: The robot fails to complete its navigation tasks.
- Latency Spikes: Inference time exceeds the safety margin.
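As an illustration, such a rollback check could query Prometheus' HTTP API directly. The metric names and the in-cluster URL below are placeholders, not the platform's actual series:

```python
import requests

PROMETHEUS_URL = "http://prometheus:9090"  # placeholder in-cluster address


def query_metric(promql: str) -> float:
    """Run an instant PromQL query and return the first scalar result."""
    resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query", params={"query": promql})
    resp.raise_for_status()
    return float(resp.json()["data"]["result"][0]["value"][1])


def should_rollback() -> bool:
    # Metric names are illustrative; the real ones depend on the exporters running on the fleet.
    intervention_rate = query_metric(
        "rate(operator_interventions_total[10m]) / rate(missions_total[10m])"
    )
    success_rate = query_metric(
        "rate(missions_succeeded_total[10m]) / rate(missions_total[10m])"
    )
    p99_latency_ms = query_metric(
        "histogram_quantile(0.99, rate(inference_latency_ms_bucket[10m]))"
    )
    return intervention_rate > 0.05 or success_rate < 0.95 or p99_latency_ms > 10.0
```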
6. Phase V: Evolution (The Feedback Loop)
The current pipeline is linear: Code → Sim → Robot. However, a truly mature robotics platform is circular. The robot’s experience in the real world must flow back to improve the simulator. This is the Data Engine, and although I cannot implement it without hardware, I have designed the architecture to support it.
The Missing Link: Real-World Data & DVC
In a production environment, the most valuable asset is not the code, but the logs from the fleet.
To handle this, the platform architecture includes DVC (Data Version Control) integrated with MinIO (S3-compatible storage).
- The Workflow: Every time the robot completes a mission, it uploads a compressed log of sensors, actions, and interventions to the cloud.
- The Versioning: DVC tracks these datasets just like Git tracks code. This allows us to say, “This model was trained on dataset v3.0, which includes the ‘rainy day’ scenarios added last week.”
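As an illustration, a training job could pin the exact dataset snapshot it consumes through DVC's Python API; the repository URL, dataset path, and tag below are placeholders:

```python
import dvc.api

REPO = "https://github.com/example/locomotion-mlops"  # placeholder repository
DATASET = "datasets/fleet_logs.parquet"               # placeholder DVC-tracked path

# Resolve where MinIO stores the snapshot tagged `dataset-v3.0`...
url = dvc.api.get_url(path=DATASET, repo=REPO, rev="dataset-v3.0")
print(f"Training on: {url}")

# ...or stream it straight into the training job without a full checkout.
with dvc.api.open(DATASET, repo=REPO, rev="dataset-v3.0", mode="rb") as f:
    raw_bytes = f.read()
```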
Closing the Loop: System Identification
Why is this data critical? System Identification.
If the robot consistently slips on a specific surface where the simulator said it should be stable, we have a “Sim-to-Real mismatch.” By feeding the real-world sensor logs back into the pipeline, we can automatically tune the simulator’s physics parameters (e.g., friction coefficients, motor damping) to match reality.
This creates a virtuous cycle: The robot fails → We capture the data → We update the Simulator → We retrain the Teacher → The robot no longer fails.
7. Conclusion
Building an MLOps platform for robotics is about more than just training a neural network. It is about architecting a system that can handle the complexity of the physical world, from the ruthlessness of resource-constrained HPO to the safety-critical nature of fleet deployment.
In this project, I moved beyond the “happy path” of simulation to build a robust factory for robotic behavior. I leveraged Isaac Lab for massive parallelism, Ray Tune for efficient optimization, Distillation for edge deployment, and Docker/ONNX for rigorous verification.
While I may not have the hardware to test the final mile, this architecture demonstrates a readiness to tackle the real challenges of the robotics industry: reliability, safety, and the endless pursuit of closing the Sim-to-Real gap.
