We built an encapsulated, self-hosted AI agent that sits between our GPU training jobs and a chat channel. This is how it works, and what it taught us about agents you can leave running near production.
We train vision-based AI models that run on factory floors, doing everything from quality control to guiding robots. To keep each model sharp, we collect fresh examples from the floor and retrain on them, then tune hyperparameters to pull the most out of every model and dataset.
The counts climb fast. One site can run up to 40 models at once, each on a different part or view. Multiply that across customers, then add every retraining and tuning run behind each model, and the population of models and experiments grows past what anyone can track by hand.
That growth pushed us toward agents. We had more training runs than a person could watch. So we built one to watch them for us.
Live: talking with the MLOps agent in plain language — ask what's running, and it answers.
The layout
One reasoning core, a handful of skills
You talk to it over chat. The training jobs report in as they run. One reasoning core in the middle decides what to say, what to do, and what to hold back until a human says yes.
[Diagram: the reasoning core and its connections, labeled "report events + metrics", "experiment history", and "read-only access".]
It reaches the outside world through skills. Each system it touches sits behind a skill — a self-contained capability the core invokes by name. AiQu, the metrics store, and the dataset store each have one. Adding a new system means writing a new skill, not rewiring the agent.
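A minimal sketch of the "invoke by name" idea, assuming a plain in-process registry. The names `SKILLS`, `skill`, `invoke`, and the `metrics_store` stub are hypothetical and are not taken from the agent itself.

```python
from typing import Callable, Dict

# Hypothetical registry: skill name -> the callable that implements it.
SKILLS: Dict[str, Callable[..., object]] = {}


def skill(name: str):
    """Register a function as a skill the core can invoke by name."""
    def decorator(fn):
        if name in SKILLS:
            raise ValueError(f"skill {name!r} already registered")
        SKILLS[name] = fn
        return fn
    return decorator


@skill("metrics_store")
def query_metrics(run_id: str) -> dict:
    # Stand-in for a real read-only query; returns canned data here.
    return {"run_id": run_id, "map50": 0.874}


def invoke(name: str, **kwargs):
    """Entry point the core would call; unknown names fail loudly."""
    if name not in SKILLS:
        raise KeyError(f"unknown skill: {name}")
    return SKILLS[name](**kwargs)
```

Under this assumption, supporting a new system means adding one decorated function. The dispatch code in `invoke` does not change.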
Design
Dumb plumbing, smart prompt
The most important decision in the system looks like the most boring one. The part that receives a training event does almost nothing. It hands the event to the reasoning core.
No rule like “if the event is an error, notify” lives in the code. Whether to notify, how to phrase it, when to stay quiet — all of that lives in the agent's written instructions.
Design principle
Push your judgment into the instructions and keep the code mechanical. The code moves bytes. The model makes the decisions.
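As an illustration of how little the receiving code needs to do, here is a sketch under assumptions: `send_to_core` stands in for whatever call hands text to the reasoning core, and the event is taken to be a JSON-serializable dict. Neither detail comes from the original system.

```python
import json
from typing import Callable


def make_event_handler(send_to_core: Callable[[str], str]) -> Callable[[dict], str]:
    """Build a handler that forwards every training event to the core."""
    def handle_event(event: dict) -> str:
        # No branching on event type: errors, epoch updates and completions
        # all take the same path. Deciding what matters is left to the
        # agent's written instructions, not to this function.
        message = "Training event:\n" + json.dumps(event, sort_keys=True)
        return send_to_core(message)
    return handle_event
```

The handler has no `if event["type"] == "error"` branch. Changing when the agent speaks up means editing the instructions, not this code.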
Notifications
Let the agent choose silence
Most notification systems fail by talking too much. A bot that pings you every epoch is a bot you mute. So we wrote silence into its instructions as an explicit rule: when an epoch update isn't worth sending, say nothing at all.
The agent earns the right to interrupt you. It always speaks on errors, on finished jobs, and on real milestones — and says nothing the rest of the time.
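One way to give an agent a way to say nothing is a sentinel reply that the delivery code drops. The sketch below assumes a sentinel string `NO_REPLY` and a `post` callable for the chat channel. Both are hypothetical, and the actual mechanism may differ.

```python
from typing import Callable

NO_REPLY = "NO_REPLY"  # hypothetical sentinel the instructions tell the agent to emit


def deliver(reply: str, post: Callable[[str], None]) -> bool:
    """Post the agent's reply unless it chose silence. Returns True if posted."""
    text = reply.strip()
    if not text or text == NO_REPLY:
        # The agent decided this update is not worth an interruption.
        return False
    post(text)
    return True
```

The code still makes no judgment of its own here. It only honors a choice the agent has already made.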
customer-a/weld-defect-v4
Epoch 1/60 · 2× DGX H100 · batch 128
Dataset: 14 820 samples
customer-a/weld-defect-v4 — epoch 15/60
mAP@0.5 0.874 ↑ new best
lr 3.2e-4
eta ~2 h 40 min
customer-b/pcb-inspect-v2 — epoch 12
RuntimeError: CUDA OOM (batch 64 on 80 GB)
Likely cause: image_size mismatch in config (1280 vs 640). Recommend fixing train_cfg.yaml before requeue.
Safety
Put a human at every write
The agent holds read and write access to our training repos. It can edit configs, commit, and push. That access earns its keep — but it's the kind of capability that should make you nervous.
What keeps it safe is a hard split between reading and changing. The agent runs read-only checks on its own. The confirmation rule covers changes only.
Write protocol — enforced per turn
- Plan the change.
- Describe it to the user — what file, what change, why.
- End the turn without making the change.
- Wait for explicit confirmation: “yes / go ahead / kör på”.
- Only then perform the change.
The subtle part: the proposal and the action cannot share a turn. The gate is structural, not a polite request in the instructions.
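A sketch of what a structural gate could look like, assuming write tools are only exposed in a turn that opens with a confirmation of an earlier proposal. The `WriteGate` class and its method names are hypothetical. Dropping a pending proposal on any other message is a design choice of this sketch, not something the protocol states.

```python
from dataclasses import dataclass
from typing import Callable, Optional

CONFIRMATIONS = {"yes", "go ahead", "kör på"}  # phrases from the protocol above


@dataclass
class WriteGate:
    pending: Optional[str] = None  # proposal recorded in an earlier turn
    writes_enabled: bool = False   # whether write tools are exposed this turn

    def begin_turn(self, user_message: str) -> None:
        """Decide at the start of a turn whether writes are available."""
        confirmed = user_message.strip().lower() in CONFIRMATIONS
        self.writes_enabled = self.pending is not None and confirmed
        if not self.writes_enabled:
            # Assumption: any non-confirming message discards a stale proposal.
            self.pending = None

    def propose(self, description: str) -> None:
        """Record a proposal; it can only be acted on in a later turn."""
        self.pending = description
        self.writes_enabled = False

    def perform(self, apply_change: Callable[[], None]) -> None:
        if not self.writes_enabled:
            raise PermissionError("write attempted without prior confirmation")
        apply_change()
        self.pending = None
        self.writes_enabled = False
```

Because `propose` always switches writes off, a proposal and its action cannot land in the same turn, whatever the model outputs.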
configs/.../train_cfg.yaml and cross-checked with MLflow run r-4f92a1.
Found: image_size: 1280 (model head was compiled for 640). Caused OOM at epoch 12.
I will make one edit and commit:
- image_size: 1280
+ image_size: 640
Waiting for your go-ahead. Type kör på to confirm.
Committed
fix: image_size 1280→640 pcb-inspect-v2
Job requeued · run ID r-7c03b8 · starting in ~4 min.
Will notify on first milestone or error.
Autonomy
Let it run the loop while you sleep
The write gate covers the code. The experiment loop is different work, and the agent owns most of it. While you sleep, it runs a hyperparameter sweep, evaluates each model that falls out of it, starts the next training run, and early-stops a run once the metric flattens.
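"Once the metric flattens" can be made concrete in several ways. The sketch below uses a simple patience rule and assumes a higher-is-better metric such as mAP. The function name, `patience`, and `min_delta` are illustrative choices, not the agent's actual criterion.

```python
from typing import Sequence


def has_plateaued(history: Sequence[float], patience: int = 5,
                  min_delta: float = 1e-3) -> bool:
    """True when the last `patience` epochs did not beat the earlier best
    by at least `min_delta`. Assumes higher metric values are better."""
    if len(history) <= patience:
        return False  # not enough epochs to judge
    best_before = max(history[:-patience])
    best_recent = max(history[-patience:])
    return best_recent < best_before + min_delta
```

Stopping a run is cheap to reverse, since the job can be resubmitted, so it sits on the autonomous side of the line.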
The boundary runs along reversibility
Submitting a tuning job spends a few GPU hours and leaves the repo untouched. Editing a config and pushing it changes what every future run does. Give your agent the actions you can undo and keep the rest behind a human.
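The reversibility rule can be written down as a small fail-closed policy table. The action names below are hypothetical labels for the activities described above, and anything not listed falls through to the human gate.

```python
from enum import Enum


class Policy(Enum):
    AUTONOMOUS = "autonomous"    # agent may act on its own
    NEEDS_HUMAN = "needs_human"  # agent must propose and wait


# Hypothetical action names: things that cost GPU hours but leave the repo alone.
REVERSIBLE = frozenset({"submit_tuning_job", "evaluate_model", "early_stop_run"})


def policy_for(action: str) -> Policy:
    """Unknown or irreversible actions default to the gated path."""
    return Policy.AUTONOMOUS if action in REVERSIBLE else Policy.NEEDS_HUMAN
```

Defaulting to `NEEDS_HUMAN` means a newly added action starts out gated until someone deliberately marks it reversible.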
Security
Stack boring boundaries on the edges
The agent answers only a known list of senders and ignores everyone else. A stranger who finds the bot is turned away before the agent spends a single token. The channel the training jobs report on carries a shared secret. The agent pushes with one deploy key whose scope is controlled outside the agent.
None of these are clever — which is the point
Auth at the door, a secret on the wire, least privilege on the credentials. Stacked plain boundaries let you sleep while an agent runs.
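The first two boundaries fit in a few lines. This sketch assumes the shared secret is a token compared on arrival and that senders are identified by a string ID. The sender IDs and function names are hypothetical. The deploy key's scope is set outside the agent, so it has no counterpart here.

```python
import hmac

ALLOWED_SENDERS = frozenset({"alice", "bob"})  # hypothetical sender IDs


def sender_allowed(sender_id: str) -> bool:
    """Auth at the door: checked before any message reaches the model."""
    return sender_id in ALLOWED_SENDERS


def secret_matches(presented: str, expected: str) -> bool:
    """A secret on the wire: constant-time comparison of the shared secret."""
    # hmac.compare_digest avoids leaking how many leading bytes matched.
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

Both checks run in plain code ahead of the reasoning core, so a rejected request costs no tokens.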
In summary
What it adds up to
Stack these choices and you get an agent with a clear character. It runs on hardware we own, so it can hold sensitive data. Its behaviour lives in prose, so anyone on the team can tune it. It picks silence, so the things it says repay the read. It reads on its own and never writes without a human, so it helps without scaring us. Its model is a swappable part, so it ages well.
Building your first agent to leave running next to systems that matter? Reach for these dials first. Run it locally. Push judgment into the instructions and keep the code dumb. Give it a way to say nothing. Put a human at every write. Treat the model as the most replaceable part you have, because it is.
The dashboard still exists. We stopped watching it.
The agent wraps opencode, runs as one self-contained service, and takes its orders from a written brief. The boundaries above come from that brief and the shape of the container. No magic — only lines drawn on purpose.


