Interview with ML engineer: “We got back one working day a week – per person”

When scalable AI is discussed, it is almost always about enterprise companies with dedicated platform teams. For smaller growth companies, the reality is different. Here, there is rarely anyone who serves the researchers with ready-made infrastructure – ML engineers have to hold together everything from data pipelines to GPU allocation themselves, often alongside their actual job.

We spoke to an ML engineer at a Swedish SaaS company (anonymized at the client’s request) about how they restructured their MLOps platform with AiQu, what actually worked – and what didn’t.

1. The starting point: three engineers, too many hats

“We are three ML engineers. No platform role, no DevOps. We handled everything ourselves, which meant a lot of time was spent on things other than models.”

We did not have a unified environment. Training was done on an on-prem server with two A100s, and on spot instances in the cloud when we needed more capacity. Different people set up different Docker images. The result:

  • Reproducibility issues. Not “works on my machine” in the classic sense, but more subtly – small differences in CUDA versions and torch builds that produced divergent results between runs.
  • Low GPU utilization. Our on-prem server was often idle at night, while someone in the team started a cloud instance because they didn’t want to wait. The cloud cost became unpredictable.
  • Manual deploy work. Moving a model from experimentation to production required the author to configure the endpoint, write the Dockerfile and set up monitoring. Some models were stuck in “soon in production” mode for months.

2. The road to AiQu – and what didn’t go as planned

“We considered building something on Kubernetes ourselves using Kubeflow or similar. We realized pretty quickly that it would take one of us working full-time for at least a quarter to get it stable, and we didn’t have that time.”

We landed on AiQu, but there were things to learn along the way:

  • The first attempt at GPU sharing did not work well. We tried to split an A100 between several smaller jobs, but for some models with irregular VRAM peaks, the jobs were knocked out. We had to go back and define clearer resource classes – less flexible, but predictable.
  • The migration took longer than we thought. Not the platform itself, but repackaging our existing scripts and environments in a standardized format. There were a lot of undocumented dependencies that we had to dig out.
  • We kept some local flows. For quick experiments and prototypes, part of the team still runs locally. It’s not worth the friction of forcing everything into the pipeline from day one.
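The resource-class idea from the first point can be sketched roughly as follows. This is an illustration only, not the team's or AiQu's actual configuration: the class names, the VRAM sizes (a 40 GB A100 is assumed) and the 20% headroom factor are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceClass:
    name: str
    vram_gb: int  # fixed VRAM budget reserved for a job in this class


# Hypothetical classes for a 40 GB A100; names and sizes are illustrative only.
RESOURCE_CLASSES = [
    ResourceClass("small", 10),
    ResourceClass("medium", 20),
    ResourceClass("full", 40),
]


def pick_resource_class(peak_vram_gb: float, headroom: float = 1.2) -> ResourceClass:
    """Return the smallest class that fits the observed VRAM peak plus headroom."""
    needed = peak_vram_gb * headroom
    for rc in RESOURCE_CLASSES:  # ordered from smallest to largest
        if rc.vram_gb >= needed:
            return rc
    raise ValueError(f"a peak of {peak_vram_gb} GB does not fit on a single GPU")
```

Sizing on the observed peak rather than the average is what trades flexibility for predictability: a job with irregular VRAM spikes gets a larger fixed slice instead of competing with its neighbours.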

3. Technical choices

“We tried to keep it pragmatic and avoid locking ourselves in too tightly.”

Standardized base images

We now have four or five tightly controlled images (different combinations of PyTorch version and CUDA) that the whole team works from. This simple decision solved the majority of our reproducibility problems – it’s not rocket science, but it required someone to own it.
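One way to enforce such a convention is to have each job compare its runtime against a pinned manifest before training starts. The sketch below assumes a hand-maintained manifest; the image tags and version numbers are made up for the example and do not come from the team's setup.

```python
# Hypothetical manifest of approved base images; tags and versions are illustrative.
APPROVED_IMAGES = {
    "train-pt2.3-cu121": {"torch": "2.3.1", "cuda": "12.1"},
    "train-pt2.1-cu118": {"torch": "2.1.2", "cuda": "11.8"},
}


def check_environment(image_tag, observed):
    """Return a list of mismatches between observed and pinned versions."""
    if image_tag not in APPROVED_IMAGES:
        return [f"unknown image: {image_tag}"]
    pinned = APPROVED_IMAGES[image_tag]
    return [
        f"{key}: expected {want}, found {observed.get(key)}"
        for key, want in pinned.items()
        if observed.get(key) != want
    ]


# An empty list means the environment matches the pinned image.
print(check_environment("train-pt2.3-cu121", {"torch": "2.3.1", "cuda": "11.8"}))
```

In a real job the observed values could be read from `torch.__version__` (after stripping any local build suffix such as `+cu121`) and `torch.version.cuda`.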

Queue-based GPU allocation

Instead of everyone trying to book resources ad hoc, we put jobs in a queue with priorities. Anyone who needs something urgently can raise the priority, but the default is that jobs run when capacity is available. This made the biggest difference to night and weekend utilization.
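The queueing behaviour described here can be sketched with a small priority queue. This is a toy model, not AiQu's scheduler: the class name, the numeric priority scale and the default value are assumptions made for the example.

```python
import heapq
import itertools


class GpuJobQueue:
    """Toy priority queue: lower number runs first, ties run in submit order."""

    DEFAULT_PRIORITY = 10  # hypothetical default for non-urgent jobs

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves FIFO order

    def submit(self, name, priority=DEFAULT_PRIORITY):
        heapq.heappush(self._heap, (priority, next(self._counter), name))

    def next_job(self):
        """Pop the next job to run when a GPU becomes free, or None if idle."""
        if not self._heap:
            return None
        return heapq.heappop(self._heap)[2]


queue = GpuJobQueue()
queue.submit("nightly-retrain")
queue.submit("hyperparam-sweep")
queue.submit("hotfix-eval", priority=0)  # someone raised the priority
print(queue.next_job())  # hotfix-eval
```

Because jobs wait in the queue instead of being started by hand, whatever is left runs as soon as capacity frees up, including at night and on weekends.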

CI trigger for training, manual trigger for deploy

We automated training runs via commits, but we chose not to auto-deploy to production. A human always approves – for us it was the right trade-off between speed and risk.
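The split between automatic training and manual deployment amounts to an approval gate. The following is a minimal sketch of that gate; the function names and the `ModelRelease` type are hypothetical, and the training and deployment steps are stubs rather than calls to any real CI system or to AiQu.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelRelease:
    model_name: str
    commit: str
    approved_by: Optional[str] = None  # stays None until a human signs off


def on_commit(model_name: str, commit: str) -> ModelRelease:
    """CI hook (stub): a commit yields a trained candidate, never a deployment."""
    return ModelRelease(model_name, commit)


def approve(release: ModelRelease, reviewer: str) -> None:
    """Record the human approval that unlocks deployment."""
    release.approved_by = reviewer


def deploy(release: ModelRelease) -> str:
    """Deployment (stub) refuses to run without a recorded approval."""
    if release.approved_by is None:
        raise PermissionError(f"{release.model_name}@{release.commit} is not approved")
    return f"deployed {release.model_name}@{release.commit} (approved by {release.approved_by})"
```

Keeping the check inside `deploy` rather than in the CI configuration means no pipeline change can bypass the human step by accident.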

4. Results – with reasonable reservations

“Here is what we actually saw in the first year. The figures are our own and based on logs plus a fairly inexact self-estimate of time spent before we started measuring.”

KPI | Before | With AiQu | Commentary
Time to set up a new training environment | 1-3 hours (with troubleshooting) | 10-20 minutes | Standardized images make more of a difference than the platform itself
Time from finished model to production | 1-2 weeks on average | 1-3 days | Manual approval step remains
GPU utilization (on-prem) | ~25% | ~60% | More can certainly be gained, but we have not prioritized it
Time freed up per engineer | - | ~6-8 hours/week | Varies between people

Where does the time come from?

Mainly from things not visible in a calendar – context switches when something crashes, waiting for environments, debugging reproducibility issues, manual deploy steps. In total for the team, this is equivalent to about half a full-time position that can be spent on modeling instead.

It’s not 2,000 hours and not a full-time position. But it is the difference between delivering four models a year and eight.
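The "about half a full-time position" figure follows from the per-engineer estimate and the team size. A quick check, assuming a 40-hour working week (the week length is an assumption, not stated in the interview):

```python
ENGINEERS = 3                        # team size from the interview
HOURS_SAVED_PER_ENGINEER = (6, 8)    # hours/week, the range given in the results
FULL_TIME_WEEK = 40                  # assumption: 40-hour working week


def team_fte_saved(engineers=ENGINEERS,
                   hours=HOURS_SAVED_PER_ENGINEER,
                   full_time_week=FULL_TIME_WEEK):
    """Return the (low, high) share of a full-time position freed up for the team."""
    low, high = hours
    return (engineers * low / full_time_week, engineers * high / full_time_week)


low, high = team_fte_saved()
print(f"{low:.2f} to {high:.2f} of a full-time position")
```

That is 18 to 24 hours per week across the team, or roughly 0.45 to 0.6 of one position.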

5. Lessons for smaller teams

“Three things we think are worth saying out loud:”

  1. MLOps is not a platform – it is a set of habits. The tool helps, but the discipline around versioning data and models must be there regardless. We could get halfway there with just standardized images and a shared convention.
  2. Don’t build it yourself unless you have to. It’s tempting to write your own scripts for queues and resources. It works for three months and then becomes technical debt that no one wants to touch.
  3. Do not expect hardware to solve anything by itself. An extra GPU won’t help if the bottleneck is that data takes two days to prepare or the deploy process is manual.

Similar situation at your company?

This is one of several ways to solve it. If you want to brainstorm your own setup without a sales pitch – get in touch. We’d be happy to show you AiQu in action and talk through where it fits and where it doesn’t.