The agent that watches our trainings

We built an encapsulated, self-hosted AI agent that sits between our GPU training jobs and a chat channel. This is how it works, and what it taught us about agents you can leave running near production.

We train vision-based AI models that run on factory floors, doing everything from quality control to guiding robots. To keep each model sharp, we collect fresh examples from the floor and retrain on them, then tune hyperparameters to pull the most out of every model and dataset.

The counts climb fast. One site can run up to 40 models at once, each on a different part or view. Multiply that across customers, then add every retraining and tuning run behind each model, and the population of models and experiments grows past what anyone can track by hand.

That growth pushed us toward agents. We had more training runs than a person could watch. So we built one to watch them for us.

Live: talking with the MLOps agent in plain language — ask what's running, and it answers.


The layout

One reasoning core, a handful of skills

You talk to it over chat. The training jobs report in as they run. One reasoning core in the middle decides what to say, what to do, and what to hold back until a human says yes.

Figure: mlops-agent system architecture.
  User (chat): chat in, notifications out.
  MLOps Agent: chat bridge, reasoning core (self-hosted model), event intake, skill router.
  Drives AiQu (GPUs): training jobs, which report events and metrics back to the agent.
  Reads MLflow: runs, metrics, experiment history.
  Checks the dataset store: labels and versions, read-only access.

The agent reaches the outside world through skills. Each system it touches sits behind a skill: a self-contained capability the core invokes by name. AiQu, the metrics store (MLflow), and the dataset store each have one. Adding a new system means writing a new skill, not rewiring the agent.
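A minimal sketch of what "a capability the core invokes by name" could look like. The `Skill` and `SkillRouter` names and the stand-in `list_runs` function are hypothetical, not the agent's actual code; the point is only that adding a system means registering one more entry.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Skill:
    """A self-contained capability the core can invoke by name."""
    name: str
    description: str
    run: Callable[[dict], dict]


class SkillRouter:
    """Maps skill names to implementations; the core only ever sees names."""

    def __init__(self) -> None:
        self._skills: Dict[str, Skill] = {}

    def register(self, skill: Skill) -> None:
        if skill.name in self._skills:
            raise ValueError(f"skill already registered: {skill.name}")
        self._skills[skill.name] = skill

    def names(self) -> List[str]:
        return sorted(self._skills)

    def invoke(self, name: str, args: dict) -> dict:
        if name not in self._skills:
            raise KeyError(f"unknown skill: {name}")
        return self._skills[name].run(args)


# Hypothetical stand-in for a real integration (no network calls here).
def list_runs(args: dict) -> dict:
    return {"experiment": args.get("experiment"), "runs": []}


router = SkillRouter()
router.register(Skill("mlflow", "read runs and metrics", list_runs))
result = router.invoke("mlflow", {"experiment": "weld-defect-v4"})
```

Under this shape, a new system is one more `register` call, and the router itself never changes.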


Design

Dumb plumbing, smart prompt

The most important decision in the system looks like the most boring one. The part that receives a training event does almost nothing. It hands the event to the reasoning core.

No rule like “if the event is an error, notify” lives in the code. Whether to notify, how to phrase it, when to stay quiet — all of that lives in the agent's written instructions.

Design principle

Push your judgment into the instructions and keep the code mechanical. The code moves bytes. The model makes the decisions.
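As an illustration of how little the intake code needs to do, here is a sketch under assumptions: `handle_training_event` and `fake_core` are hypothetical names, and the event fields are invented for the example. The handler serializes the event and passes it on, with no branch on its type or contents.

```python
import json
from typing import Callable, List, Optional


def handle_training_event(
    event: dict, core: Callable[[str], Optional[str]]
) -> Optional[str]:
    """Hand a raw training event to the reasoning core and return its reply.

    Deliberately no branching on the event type or contents here:
    whether to notify, and how, is decided by the core's instructions.
    """
    payload = json.dumps(event, sort_keys=True, ensure_ascii=False)
    return core("Training event received:\n" + payload)


# Usage with a stand-in core that only records what it was given.
seen: List[str] = []


def fake_core(prompt: str) -> Optional[str]:
    seen.append(prompt)
    return None  # a real core would return chat text, or nothing


handle_training_event({"type": "epoch_end", "epoch": 3, "val_loss": 0.21}, fake_core)
```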


Notifications

Let the agent choose silence

Most notification systems fail by talking too much. A bot that pings you every epoch is a bot you mute. So we wrote silence into the agent's instructions as an explicit rule: when an epoch update isn't worth sending, say nothing at all.

The agent earns the right to interrupt you. It always speaks on errors, on finished jobs, and on real milestones — and says nothing the rest of the time.
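One way the plumbing can honour that choice, sketched under assumptions: the `NO_REPLY` sentinel and the `deliver` function are hypothetical, and the real agent may signal silence differently. The code does not judge whether a message is worth sending; it only respects the core's decision to send nothing.

```python
from typing import Callable, List, Optional

NO_REPLY = "NO_REPLY"  # hypothetical sentinel, defined in the agent's instructions


def deliver(reply: Optional[str], send: Callable[[str], None]) -> bool:
    """Send the core's reply to chat unless the core chose silence.

    Returns True if a message was sent.
    """
    text = (reply or "").strip()
    if not text or text == NO_REPLY:
        return False  # silence is a valid outcome, not an error
    send(text)
    return True


outbox: List[str] = []
deliver("NO_REPLY", outbox.append)  # routine epoch update: nothing goes out
deliver("Crashed: customer-b/pcb-inspect-v2, epoch 12", outbox.append)
```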

Chat excerpt: MLOps Agent, today

08:14  MLOps Agent: Started
       customer-a/weld-defect-v4
       Epoch 1/60 · 2× DGX H100 · batch 128
       Dataset: 14 820 samples

10:47  MLOps Agent: 25% milestone
       customer-a/weld-defect-v4, epoch 15/60
       val_loss  0.0842 ↓ from 0.1104
       mAP@0.5  0.874 ↑ new best
       lr  3.2e-4
       eta  ~2 h 40 min

11:23  MLOps Agent: Crashed
       customer-b/pcb-inspect-v2, epoch 12
       RuntimeError: CUDA OOM (batch 64 on 80 GB)
       Likely cause: image_size mismatch in config (1280 vs 640). Recommend fixing train_cfg.yaml before requeue.

11:26  User: fix the image size and requeue

Safety

Put a human at every write

The agent holds read and write access to our training repos. It can edit configs, commit, and push. That access earns its keep — but it's the kind of capability that should make you nervous.

What keeps it safe is a hard split between reading and changing. The agent runs read-only checks on its own. The confirmation rule covers changes only.

Write protocol — enforced per turn

  1. Plan the change.
  2. Describe it to the user — what file, what change, why.
  3. End the turn without making the change.
  4. Wait for explicit confirmation: “yes / go ahead / kör på” (“kör på” is Swedish for “go ahead”).
  5. Only then perform the change.

The subtle part: the proposal and the action cannot share a turn. The gate is structural, not a polite request in the instructions.
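A sketch of how such a per-turn gate could be enforced in code rather than in prose. The `WriteGate` class and its method names are hypothetical; the article does not show the real mechanism. The idea is that a write is possible only in a turn whose user message confirms a proposal recorded in an earlier turn.

```python
from typing import Callable, Optional

CONFIRMATIONS = frozenset({"yes", "go ahead", "kör på"})


class WriteGate:
    """Allows a write only in a turn that confirms an earlier proposal."""

    def __init__(self) -> None:
        self._pending: Optional[str] = None
        self._writes_enabled = False

    def start_turn(self, user_message: str) -> Optional[str]:
        """Call once per user message. Returns the approved proposal, if any."""
        confirmed = (
            self._pending is not None
            and user_message.strip().lower() in CONFIRMATIONS
        )
        approved = self._pending if confirmed else None
        self._pending = None  # any other message cancels the proposal
        self._writes_enabled = confirmed
        return approved

    def propose(self, description: str) -> None:
        """Record a proposed change; writes stay off for the rest of this turn."""
        self._pending = description
        self._writes_enabled = False

    def write(self, apply: Callable[[], str]) -> str:
        """Run the approved change, or refuse if this turn is not a confirmation."""
        if not self._writes_enabled:
            raise PermissionError("no confirmed proposal for this turn")
        self._writes_enabled = False  # one confirmation covers one approved change
        return apply()


gate = WriteGate()
gate.start_turn("fix the image size and requeue")
gate.propose("image_size 1280 -> 640")   # turn ends here, nothing written
approved = gate.start_turn("kör på")      # next turn: explicit confirmation
outcome = gate.write(lambda: "committed")
```

Because `propose` switches writes off and only `start_turn` can switch them on, a proposal and its write cannot land in the same turn, whatever the model decides to attempt.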

Chat excerpt: MLOps Agent, today

11:26  User: fix the image size and requeue customer-b/pcb-inspect-v2

11:26  MLOps Agent: Read configs/.../train_cfg.yaml and cross-checked with MLflow run r-4f92a1.
       Found: image_size: 1280; model head was compiled for 640. Caused OOM at epoch 12.

11:26  MLOps Agent: Proposed change
       I will make one edit and commit:
       configs/customer-b/pcb-inspect-v2/train_cfg.yaml
       - image_size: 1280
       + image_size: 640
       Then requeue the job on 2× DGX H100.
       Waiting for your go-ahead. Type kör på to confirm.

11:28  User: kör på

11:28  MLOps Agent: Done
       Committed fix: image_size 1280→640 pcb-inspect-v2
       Job requeued, run ID r-7c03b8 · starting in ~4 min.
       Will notify on first milestone or error.

Autonomy

Let it run the loop while you sleep

The write gate covers the code. The experiment loop is different work, and the agent owns most of it. While you sleep, it runs a hyperparameter sweep, evaluates each model that falls out of it, starts the next training run, and early-stops a run once the metric flattens.
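The article does not say how "the metric flattens" is detected. One common approach, sketched here as an assumption, is a patience window: stop when the last few evaluations fail to improve on the earlier best by a minimum margin. The function name and the default values are illustrative only.

```python
from typing import Sequence


def has_flattened(
    losses: Sequence[float], patience: int = 5, min_delta: float = 1e-3
) -> bool:
    """Return True when a lower-is-better metric has stopped improving.

    Compares the best value in the last `patience` evaluations with the
    best value seen before that window.
    """
    if len(losses) <= patience:
        return False  # not enough history to judge
    best_before = min(losses[:-patience])
    best_recent = min(losses[-patience:])
    return best_before - best_recent < min_delta


still_improving = has_flattened([0.5, 0.4, 0.3, 0.2, 0.1], patience=2)
plateaued = has_flattened([0.5, 0.3, 0.3, 0.3, 0.3], patience=3)
```

A larger `patience` tolerates noisy validation curves at the cost of a few extra GPU epochs before the run is stopped.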

The boundary runs along reversibility

Submitting a tuning job spends a few GPU hours and leaves the repo untouched. Editing a config and pushing it changes what every future run does. Give your agent the actions you can undo and keep the rest behind a human.
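A sketch of that boundary as a lookup table, under stated assumptions: the action names and the three-way `Effect` split are hypothetical, chosen to mirror the examples above. Anything not listed is treated as persistent, so a newly added action stays behind a human until someone classifies it.

```python
from enum import Enum


class Effect(Enum):
    READ = "read"              # observes state only
    REVERSIBLE = "reversible"  # costs GPU time, leaves the repo untouched
    PERSISTENT = "persistent"  # changes what every future run does


# Hypothetical action names, classified by how hard they are to undo.
ACTION_EFFECTS = {
    "read_metrics": Effect.READ,
    "submit_tuning_job": Effect.REVERSIBLE,
    "early_stop_run": Effect.REVERSIBLE,
    "edit_config": Effect.PERSISTENT,
    "git_push": Effect.PERSISTENT,
}


def needs_human(action: str) -> bool:
    """Unknown actions default to PERSISTENT, so they require confirmation."""
    effect = ACTION_EFFECTS.get(action, Effect.PERSISTENT)
    return effect is Effect.PERSISTENT


autonomous = [name for name in ACTION_EFFECTS if not needs_human(name)]
```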


Security

Stack boring boundaries on the edges

The agent answers a known list of senders and ignores everyone else: a stranger who finds the bot is turned away before the agent spends a single token. The channel the training jobs report on carries a shared secret. The agent pushes with one deploy key whose scope is controlled outside the agent.
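The first two boundaries can be sketched with the Python standard library alone. This is an illustration under assumptions: the sender ids are invented, and signing the event body with HMAC-SHA256 is one possible way to use a shared secret, not necessarily how the agent does it.

```python
import hashlib
import hmac

ALLOWED_SENDERS = frozenset({"alice", "bob"})  # hypothetical chat user ids


def sender_allowed(sender_id: str) -> bool:
    """Checked before any model call, so strangers cost zero tokens."""
    return sender_id in ALLOWED_SENDERS


def sign(body: bytes, secret: bytes) -> str:
    """Signature a training job would attach to each event it reports."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def event_authentic(body: bytes, signature: str, secret: bytes) -> bool:
    """Constant-time comparison avoids leaking the signature through timing."""
    expected = sign(body, secret)
    return hmac.compare_digest(expected.encode(), signature.encode())


secret = b"example-shared-secret"  # placeholder; load from a secret store in practice
body = b'{"type": "epoch_end", "epoch": 3}'
signature = sign(body, secret)
accepted = event_authentic(body, signature, secret)
tampered = event_authentic(b'{"type": "error"}', signature, secret)
```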

None of these are clever — which is the point

Auth at the door, a secret on the wire, least privilege on the credentials. Stacked plain boundaries let you sleep while an agent runs.


In summary

What it adds up to

Stack these choices and you get an agent with a clear character. It runs on hardware we own, so it can hold sensitive data. Its behaviour lives in prose, so anyone on the team can tune it. It picks silence, so the things it says repay the read. It reads on its own and never writes without a human, so it helps without scaring us. Its model is a swappable part, so it ages well.

Building your first agent to leave running next to systems that matter? Reach for these dials first. Run it locally. Push judgment into the instructions and keep the code dumb. Give it a way to say nothing. Put a human at every write. Treat the model as the most replaceable part you have, because it is.

The dashboard still exists. We stopped watching it.


The agent wraps opencode, runs as one self-contained service, and takes its orders from a written brief. The boundaries above come from that brief and the shape of the container. No magic — only lines drawn on purpose.
