The agent that watches our trainings

We built an encapsulated, self-hosted AI agent that sits between our GPU training jobs and a chat channel. This is how it works, and what it taught us about agents you can leave running near production.

We train vision-based AI models that run on factory floors, doing everything from quality control to guiding robots. To keep each model sharp, we collect fresh examples from the floor and retrain on them, then tune hyperparameters to pull the most out of every model and dataset.

The counts climb fast. One site can run up to 40 models at once, each on a different part or view. Multiply that across customers, then add every retraining and tuning run behind each model, and the population of models and experiments grows past what anyone can track by hand.

That growth pushed us toward agents. We had more training runs than a person could watch. So we built one to watch them for us.

Live: talking with the MLOps agent in plain language — ask what's running, and it answers.

The layout

One reasoning core, a handful of skills

You talk to it over chat. The training jobs report in as they run. One reasoning core in the middle decides what to say, what to do, and what to hold back until a human says yes.

Figure: mlops-agent system architecture.

User · Chat: chat in, notifications out.
MLOps Agent: chat bridge, event intake, reasoning core (self-hosted model), skill router.
AiQu (GPUs): the agent drives it; training jobs report events and metrics.
MLflow: the agent reads runs, metrics, and experiment history.
Dataset store: the agent checks labels and versions, with read-only access.

It reaches the outside world through skills. Each system it touches sits behind a skill — a self-contained capability the core invokes by name. AiQu, the metrics store, and the dataset store each have one. Adding a new system means writing a new skill, not rewiring the agent.

Design

Dumb plumbing, smart prompt

The most important decision in the system looks like the most boring one. The part that receives a training event does almost nothing. It hands the event to the reasoning core.

No rule like “if the event is an error, notify” lives in the code. Whether to notify, how to phrase it, when to stay quiet — all of that lives in the agent's written instructions.

Design principle

Push your judgment into the instructions and keep the code mechanical. The code moves bytes. The model makes the decisions.
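As an illustration of how little the intake code needs to do, here is a sketch under assumptions: events arrive as JSON bytes, and `send_to_core` is a hypothetical callable that hands text to the reasoning core. Neither detail comes from the article.

```python
import json
from typing import Callable, List


def make_intake(send_to_core: Callable[[str], None]) -> Callable[[bytes], None]:
    """Build an event handler that forwards every event to the core unchanged."""

    def handle_event(raw: bytes) -> None:
        # Parse only to confirm the payload is valid JSON.
        event = json.loads(raw)
        # Deliberately no branching on event contents: whether this is worth
        # a notification is decided by the core, guided by its instructions.
        send_to_core("Training event:\n" + json.dumps(event, indent=2, sort_keys=True))

    return handle_event


# Usage with a list standing in for the reasoning core.
received: List[str] = []
handle = make_intake(received.append)
handle(b'{"type": "error", "run": "customer-b/pcb-inspect-v2", "epoch": 12}')
```

The handler treats an error event and a routine epoch update identically, which is the point: changing what counts as noteworthy means editing the instructions, not this function.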

Notifications

Let the agent choose silence

Most notification systems fail by talking too much. A bot that pings you every epoch is a bot you mute. So we wrote silence into the agent's instructions as an explicit rule: when an epoch update isn't worth sending, say nothing at all.

The agent earns the right to interrupt you. It always speaks on errors, on finished jobs, and on real milestones — and says nothing the rest of the time.
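One possible way to give a model a way to say nothing is a reserved reply that the chat bridge drops instead of posting. The article does not say how the agent signals silence, so the `NO_REPLY` sentinel and the `deliver` helper below are assumptions for illustration.

```python
from typing import Callable

# Hypothetical sentinel the instructions would tell the model to emit
# when an update is not worth sending.
SILENCE = "NO_REPLY"


def deliver(reply: str, post: Callable[[str], None]) -> bool:
    """Post the model's reply unless it chose silence. Returns True if posted."""
    text = reply.strip()
    if not text or text == SILENCE:
        return False  # the model decided to stay quiet; the code just obeys
    post(text)
    return True
```

The code still makes no judgment of its own here. It only honours a choice the model already made.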

Example chat with the MLOps Agent (today):

08:14 · MLOps Agent
Started
customer-a/weld-defect-v4
Epoch 1/60 · 2× DGX H100 · batch 128
Dataset: 14 820 samples

10:47 · MLOps Agent
25% milestone
customer-a/weld-defect-v4, epoch 15/60
val_loss  0.0842 ↓ from 0.1104
mAP@0.5  0.874 ↑ new best
lr  3.2e-4
eta  ~2 h 40 min

11:23 · MLOps Agent
Crashed
customer-b/pcb-inspect-v2, epoch 12
RuntimeError: CUDA OOM (batch 64 on 80 GB)
Likely cause: image_size mismatch in config (1280 vs 640). Recommend fixing train_cfg.yaml before requeue.

11:26 · User
fix the image size and requeue

Safety

Put a human at every write

The agent holds read and write access to our training repos. It can edit configs, commit, and push. That access earns its keep — but it's the kind of capability that should make you nervous.

What keeps it safe is a hard split between reading and changing. The agent runs read-only checks on its own. Every change waits for a human to confirm it.

Write protocol (enforced per turn)

1. Plan the change.
2. Describe it to the user: what file, what change, why.
3. End the turn without making the change.
4. Wait for explicit confirmation: “yes / go ahead / kör på” (kör på is Swedish for “go ahead”).
5. Only then perform the change.

The subtle part: the proposal and the action cannot share a turn. The gate is structural, not a polite request in the instructions.
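A sketch of what a structural gate could look like, assuming a hypothetical `WriteGate` that counts turns and refuses any write unless a proposal from an earlier turn was confirmed. The confirmation phrases come from the protocol above. The class, its method names, and the rule that any other message cancels a pending proposal are assumptions, not the agent's actual mechanism.

```python
from dataclasses import dataclass
from typing import Callable, Optional

CONFIRMATIONS = {"yes", "go ahead", "kör på"}


class WriteNotAllowed(Exception):
    """Raised when a write is attempted without a confirmed earlier proposal."""


@dataclass
class Proposal:
    description: str
    turn: int  # the turn in which the change was proposed


class WriteGate:
    def __init__(self) -> None:
        self.turn = 0
        self.pending: Optional[Proposal] = None
        self.approved = False

    def begin_turn(self, user_message: str) -> None:
        """Call once per incoming user message, before the agent acts."""
        self.turn += 1
        confirmed = user_message.strip().lower() in CONFIRMATIONS
        self.approved = self.pending is not None and confirmed
        if not self.approved:
            # Assumption: any non-confirming message cancels the proposal.
            self.pending = None

    def propose(self, description: str) -> None:
        """Record a planned change; it cannot be executed in this turn."""
        self.pending = Proposal(description, self.turn)
        self.approved = False

    def execute(self, apply: Callable[[], None]) -> None:
        """Run the write only if a proposal from an earlier turn was confirmed."""
        if (
            self.pending is None
            or not self.approved
            or self.turn <= self.pending.turn
        ):
            raise WriteNotAllowed("no confirmed proposal from an earlier turn")
        apply()
        self.pending = None
        self.approved = False
```

Because `execute` compares turn numbers, a model that proposes and then tries to act in the same turn is refused by the code, whatever it was told in its instructions.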

Example chat with the MLOps Agent (today):

11:26 · User
fix the image size and requeue customer-b/pcb-inspect-v2

11:26 · MLOps Agent
Read configs/.../train_cfg.yaml and cross-checked with MLflow run r-4f92a1.
Found: image_size: 1280, but the model head was compiled for 640. Caused OOM at epoch 12.

11:26 · MLOps Agent
Proposed change
I will make one edit and commit:

configs/customer-b/pcb-inspect-v2/train_cfg.yaml
- image_size: 1280
+ image_size: 640

Then requeue the job on 2× DGX H100.
Waiting for your go-ahead. Type kör på to confirm.

11:28 · User
kör på

11:28 · MLOps Agent
Done
Committed fix: image_size 1280→640 pcb-inspect-v2
Job requeued: run ID r-7c03b8 · starting in ~4 min.
Will notify on first milestone or error.

Autonomy

Let it run the loop while you sleep

The write gate covers the code. The experiment loop is different work, and the agent owns most of it. While you sleep, it runs a hyperparameter sweep, evaluates each model that falls out of it, starts the next training run, and early-stops a run once the metric flattens.
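The article does not say how the agent detects that a metric has flattened. A common patience-style check is one option, sketched here for a metric where lower is better, such as validation loss. The function name, `patience`, and `min_delta` are illustrative assumptions.

```python
from typing import Sequence


def has_plateaued(
    val_losses: Sequence[float],
    patience: int = 5,
    min_delta: float = 1e-3,
) -> bool:
    """True when the last `patience` epochs did not beat the earlier best
    validation loss by at least `min_delta`."""
    if len(val_losses) <= patience:
        return False  # not enough history to judge
    best_before = min(val_losses[:-patience])
    best_recent = min(val_losses[-patience:])
    return best_recent > best_before - min_delta
```

A check like this could sit behind a skill as a mechanical signal, leaving the decision to stop the run with the agent.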

The boundary runs along reversibility

Submitting a tuning job spends a few GPU hours and leaves the repo untouched. Editing a config and pushing it changes what every future run does. Give your agent the actions you can undo and keep the rest behind a human.
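The boundary can be written down as a small default-deny table. This is a sketch: the action names are hypothetical labels for the activities the article mentions, and a real deployment would define its own.

```python
# Actions the agent may take on its own: they read state or spend
# GPU hours, and leave the repo untouched.
AUTONOMOUS_ACTIONS = frozenset({
    "read_metrics",
    "submit_tuning_job",
    "evaluate_model",
    "early_stop_run",
})


def needs_confirmation(action: str) -> bool:
    """Default-deny: anything not explicitly listed waits for a human."""
    return action not in AUTONOMOUS_ACTIONS
```

Listing the autonomous actions rather than the gated ones means a newly added capability starts behind the human gate until someone decides otherwise.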

Security

Stack boring boundaries on the edges

The agent answers only a known list of senders and ignores everyone else, so a stranger who finds the bot is turned away before the agent spends a single token. The channel the training jobs report on carries a shared secret. The agent pushes with one deploy key whose scope is controlled outside the agent.

None of these are clever — which is the point

Auth at the door, a secret on the wire, least privilege on the credentials. Stacked plain boundaries let you sleep while an agent runs.
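The first two boundaries fit in a few lines. This sketch assumes senders are identified by a string ID and that the shared secret is presented as a token alongside each event; the article specifies neither, and the function names are illustrative.

```python
import hmac
from typing import FrozenSet


def is_allowed_sender(sender_id: str, allowlist: FrozenSet[str]) -> bool:
    """Auth at the door: check the sender before any model call is made."""
    return sender_id in allowlist


def has_valid_secret(presented: str, expected: str) -> bool:
    """A secret on the wire: compare in constant time to avoid timing leaks."""
    return hmac.compare_digest(presented.encode("utf-8"), expected.encode("utf-8"))
```

`hmac.compare_digest` is the standard-library way to compare secrets without leaking, through timing, how many leading characters matched.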

In summary

What it adds up to

Stack these choices and you get an agent with a clear character. It runs on hardware we own, so it can hold sensitive data. Its behaviour lives in prose, so anyone on the team can tune it. It picks silence, so the things it says repay the read. It reads on its own and never writes without a human, so it helps without scaring us. Its model is a swappable part, so it ages well.

Building your first agent to leave running next to systems that matter? Reach for these dials first. Run it locally. Push judgment into the instructions and keep the code dumb. Give it a way to say nothing. Put a human at every write. Treat the model as the most replaceable part you have, because it is.

The dashboard still exists. We stopped watching it.

The agent wraps opencode, runs as one self-contained service, and takes its orders from a written brief. The boundaries above come from that brief and the shape of the container. No magic — only lines drawn on purpose.
