Techblog
May 21, 2026

Interview with ML engineer: “We got back one working day a week – per person”

When scalable AI is discussed, it is almost always about enterprise companies with dedicated platform teams. For smaller growth companies, the reality is different. Here, there is rarely anyone who serves the researchers with ready-made infrastructure – ML engineers have to hold together everything from data pipelines to GPU allocation themselves, often alongside their actual job.

We spoke to an ML Engineer at a Swedish SaaS company (anonymized at the client’s request) about how they restructured their MLOps platform with AiQu, what actually worked – and what didn’t.

1. The starting point: three engineers, too many hats

“We are three ML engineers. No platform role, no DevOps. We handled everything ourselves, which meant a lot of time was spent on things other than models.”

We did not have a unified environment. Training was done on an on-prem server with two A100s, and on spot instances in the cloud when we needed more capacity. Different people set up different Docker images. The result:

Reproducibility issues. Not “works on my machine” in the classic sense, but more subtly – small differences in CUDA versions and torch builds that produced divergent results between runs.
Low GPU utilization. Our on-prem server often sat idle at night, while someone on the team spun up a cloud instance during the day because they didn’t want to wait. Cloud costs became unpredictable.
Manual deploy work. Moving a model from experimentation to production required the author to configure the endpoint, write the Dockerfile and set up monitoring. Some models were stuck in “soon in production” mode for months.

2. The road to AiQu – and what didn’t go as planned

“We considered building something on Kubernetes ourselves using Kubeflow or similar. We realized pretty quickly that it would take one of us working full-time for at least a quarter to get it stable, and we didn’t have that time.”

We landed on AiQu, but there were things to learn along the way:

The first attempt at GPU sharing did not work well. We tried to split an A100 between several smaller jobs, but models with irregular VRAM peaks caused jobs to crash. We had to go back and define clearer resource classes – less flexible, but predictable.
The migration took longer than we thought. Not the platform itself, but repackaging our existing scripts and environments into a standardized format. There were a lot of undocumented dependencies that we had to dig out.
We kept some local flows. For quick experiments and prototypes, part of the team still runs locally. It’s not worth the friction of forcing everything into the pipeline from day one.
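
The fixed resource classes from the first point above could look roughly like the sketch below. It is an illustration only, not AiQu's configuration format: the class names, the VRAM sizes, the helpers `pick_class` and `fits_on_gpu`, and the 80 GB card size are all assumptions (the interview does not say which A100 variant the team runs).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceClass:
    name: str
    vram_gb: int  # VRAM reserved for the job, sized for its peak, not its average


# Hypothetical classes; the sizes are illustrative only.
CLASSES = [
    ResourceClass("small", 10),
    ResourceClass("medium", 20),
    ResourceClass("large", 40),
    ResourceClass("full", 80),
]


def pick_class(peak_vram_gb: float, classes=CLASSES) -> ResourceClass:
    """Return the smallest class whose reservation covers the job's peak VRAM."""
    for rc in sorted(classes, key=lambda c: c.vram_gb):
        if peak_vram_gb <= rc.vram_gb:
            return rc
    raise ValueError(f"no class fits a peak of {peak_vram_gb} GB")


def fits_on_gpu(jobs_peak_gb, gpu_vram_gb: int = 80) -> bool:
    """True if the reservations for all jobs fit on one GPU at the same time."""
    reserved = sum(pick_class(p).vram_gb for p in jobs_peak_gb)
    return reserved <= gpu_vram_gb
```

Reserving for the peak wastes some memory on jobs that rarely reach it, which is the "less flexible, but predictable" trade-off described above.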

3. Technical choices

“We tried to keep it pragmatic and avoid locking ourselves in too tightly.”

Standardized base images

We now have four or five tightly controlled images (different combinations of PyTorch version and CUDA) that the whole team works from. This simple decision solved the majority of our reproducibility problems – it’s not rocket science, but it required someone to own it.
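
One way to enforce such a convention is a check at the start of every training script that refuses to run outside an approved combination. The sketch below is an assumption about how that could be done, not the team's actual code: the allow-list contents and the names `APPROVED`, `normalize` and `check_environment` are hypothetical.

```python
# Hypothetical allow-list: (PyTorch version, CUDA version) pairs, one per base image.
APPROVED = {
    ("2.1.2", "11.8"),
    ("2.1.2", "12.1"),
    ("2.3.1", "12.1"),
}


def normalize(torch_version: str) -> str:
    """Drop a local build suffix such as '+cu121' from a PyTorch version string."""
    return torch_version.split("+", 1)[0]


def check_environment(torch_version: str, cuda_version: str) -> None:
    """Raise if the (PyTorch, CUDA) pair is not one of the approved combinations."""
    pair = (normalize(torch_version), cuda_version)
    if pair not in APPROVED:
        raise RuntimeError(f"unapproved environment {pair}; use a standard base image")
```

In a real script the arguments would typically come from `torch.__version__` and `torch.version.cuda`; the function takes plain strings here so the example runs without PyTorch installed.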

Queue-based GPU allocation

Instead of everyone trying to book resources ad hoc, we put jobs in a queue with priorities. Anyone who needs something urgently can raise the priority, but the default is that jobs run when capacity is available. This made the biggest difference to night and weekend utilization.
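
The queueing idea can be sketched with a small priority queue. This is a minimal illustration under our own assumptions, not how AiQu schedules jobs: the `JobQueue` class, the default priority of 10 and the job names are hypothetical.

```python
import heapq
import itertools


class JobQueue:
    """Priority queue for GPU jobs: lower number runs first, ties run in submit order."""

    DEFAULT_PRIORITY = 10

    def __init__(self):
        self._heap = []
        self._counter = itertools.count()  # tie-breaker that preserves submit order

    def submit(self, job_id: str, priority: int = DEFAULT_PRIORITY) -> None:
        """Add a job; urgent work passes a lower priority number."""
        heapq.heappush(self._heap, (priority, next(self._counter), job_id))

    def next_job(self):
        """Pop the job that should run next, or None if the queue is empty."""
        if not self._heap:
            return None
        _, _, job_id = heapq.heappop(self._heap)
        return job_id


queue = JobQueue()
queue.submit("nightly-retrain")
queue.submit("hyperparameter-sweep")
queue.submit("urgent-fix", priority=1)
print(queue.next_job())  # urgent-fix
print(queue.next_job())  # nightly-retrain
```

A worker that calls `next_job()` whenever a GPU frees up keeps the hardware busy overnight without anyone having to book it.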

CI trigger for training, manual trigger for deploy

We automated training runs via commits, but we chose not to auto-deploy to production. A human always approves – for us it was the right trade-off between speed and risk.
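
The split between automatic training and manual deploy can be pictured as a small gate. The sketch below is hypothetical throughout (the `ModelRelease` type and the functions `on_commit`, `approve` and `deploy` are invented for illustration, and training is stubbed out); it only shows where the human approval sits.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ModelRelease:
    commit: str
    trained: bool = False
    approved_by: Optional[str] = None
    deployed: bool = False


def on_commit(commit: str) -> ModelRelease:
    """Automatic step: every commit triggers a training run (stubbed out here)."""
    release = ModelRelease(commit=commit)
    release.trained = True  # stand-in for launching and finishing the training job
    return release


def approve(release: ModelRelease, reviewer: str) -> None:
    """Manual step: a named person signs off on the trained model."""
    if not release.trained:
        raise RuntimeError("cannot approve a model that has not been trained")
    release.approved_by = reviewer


def deploy(release: ModelRelease) -> None:
    """Deploy only if a human approved; never triggered by the commit itself."""
    if release.approved_by is None:
        raise PermissionError("deploy blocked: no human approval recorded")
    release.deployed = True
```

Calling `deploy` straight after `on_commit` raises `PermissionError`; it succeeds only once `approve` has recorded a reviewer.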

4. Results – with reasonable reservations

“Here is what we actually saw in the first year. The figures are our own and based on logs plus a fairly inexact self-estimate of time spent before we started measuring.”

KPI	Before	With AiQu	Comment
Time to set up new training environment	1-3 hours (with troubleshooting)	10-20 minutes	Standardized images make more difference than the platform itself
Time from finished model to production	1-2 weeks on average	1-3 days	Manual approval step remains
GPU utilization (on-prem)	~25%	~60%	More can certainly be gained, but we have not prioritized it
Time released per engineer	–	~6-8 hours/week	Varies from person to person

Where does time come from?

Mainly from things not visible in a calendar – context switches when something crashes, waiting for environments, debugging reproducibility issues, manual deploy steps. In total for the team, this is equivalent to about half a full-time position that can be spent on modeling instead.
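
The "half a full-time position" figure follows from the numbers already given: three engineers, each freeing up roughly 6-8 hours a week. The only assumption in the quick check below is a 40-hour full-time week, which the interview does not state.

```python
ENGINEERS = 3            # from the interview
HOURS_FREED = (6, 8)     # low and high estimate per engineer per week
FULL_TIME_WEEK = 40      # assumption: hours in a full-time week

team_hours = tuple(ENGINEERS * h for h in HOURS_FREED)
fte = tuple(h / FULL_TIME_WEEK for h in team_hours)

print(team_hours)  # (18, 24)
print(fte)         # (0.45, 0.6)
```

So the team as a whole gets back somewhere between 0.45 and 0.6 of a full-time position, under that assumption.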

It’s not 2,000 hours and not a full-time position. But it is the difference between delivering four models a year and eight.

5. Lessons for smaller teams

“Three things we think are worth saying out loud:”

MLOps is not a platform – it is a set of habits. The tool helps, but the discipline around versioning data and models must be there regardless. We could get halfway there with just standardized images and a shared convention.
Don’t build it yourself unless you have to. It’s tempting to write your own scripts for queues and resources. It works for three months and then becomes technical debt that no one wants to touch.
Do not expect hardware to solve anything by itself. An extra GPU won’t help if the bottleneck is that data takes two days to prepare or the deploy process is manual.

Similar situation on your team?

This is one of several ways to solve it. If you want to brainstorm your own setup without a sales pitch – get in touch. We’d be happy to show you AiQu in action and talk through where it fits and where it doesn’t.
