Category: AI infrastructure / Experiences
This is not a post about the cloud being bad. The cloud is a great tool for many purposes, and we work with cloud providers in many of our customers’ environments. But having worked with AI infrastructure in a range of organizations – from mid-sized manufacturing companies to regional governments – we see the same mistakes recurring. Not once, but consistently.
And in most cases, they are entirely avoidable. It deserves an honest review.
Mistake 1: Not calculating the cost of scaling
Getting started with AI in the cloud is cheap. It’s designed to be. A pilot with limited data and sporadic runs costs little – and creates a false picture of what it will actually cost to operate the system in production.
We’ve spoken to customers whose cloud bill for AI tripled within six months of going live – without a corresponding increase in traffic volume. The reason was a combination of continuous inference, egress costs for data, and storage prices that scaled unexpectedly.
The solution is not to avoid the cloud. The solution is to calculate the total cost of ownership (TCO) properly before committing to an architecture – and in many cases, an on-prem or hybrid solution is more competitive than it appears during a pilot phase.
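To make that concrete, here is a minimal sketch of such a comparison. All figures and growth rates are illustrative assumptions, not quotes from any provider – the point is that storage and egress compound while a pilot’s numbers stay flat.

```python
# Hypothetical TCO sketch: cloud inference costs with compounding storage,
# versus an amortized on-prem deployment. Every number is an assumption
# for illustration only.

def cloud_tco(months, inference_per_month, egress_per_month,
              storage_start, storage_growth):
    """Sum monthly cloud costs, letting the storage bill grow each month."""
    total, storage = 0.0, storage_start
    for _ in range(months):
        total += inference_per_month + egress_per_month + storage
        storage *= 1 + storage_growth  # data accumulates, so storage compounds
    return total

def onprem_tco(months, hardware_capex, ops_per_month):
    """Hardware up front plus a flat operations cost."""
    return hardware_capex + months * ops_per_month

# A six-month pilot looks cheap ...
pilot = cloud_tco(6, 2_000, 500, 300, 0.05)
# ... 36 months of production with compounding storage tells another story.
production = cloud_tco(36, 20_000, 5_000, 3_000, 0.05)
print(pilot, production, onprem_tco(36, 250_000, 8_000))
```

The exact break-even point depends entirely on your workload; the exercise is worth doing with real numbers before the architecture is locked in.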
Mistake 2: Building yourself into a supplier’s ecosystem
It is easy to build dependencies on a specific cloud provider’s AI services – their proprietary APIs, their specific model deployment format, their way of handling vector databases. In the early stages this feels unproblematic: it’s smooth and it works.
The problem arises when you want to switch, or when the provider changes pricing, discontinues a service, or introduces terms that do not suit your business. Migrating from a deeply integrated cloud ecosystem is technically complex and organizationally painful.
We see this particularly clearly in the public sector and health, where a sudden change of ownership of a cloud service or a change in terms and conditions can create serious problems for mission-critical systems.
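One way to limit the blast radius is to keep provider-specific code behind your own interface, so a migration touches one adapter instead of the whole codebase. A minimal sketch, with illustrative class and method names (a real adapter would wrap the provider’s SDK):

```python
# Sketch: application code depends on our own EmbeddingBackend interface,
# never on a vendor SDK directly. Names here are hypothetical.

from typing import Protocol

class EmbeddingBackend(Protocol):
    def embed(self, texts: list[str]) -> list[list[float]]: ...

class FakeLocalBackend:
    """Stand-in backend; a real adapter would call a provider's API here."""
    def embed(self, texts: list[str]) -> list[list[float]]:
        return [[float(len(t))] for t in texts]

def index_documents(backend: EmbeddingBackend, docs: list[str]) -> list[list[float]]:
    # Only this seam knows which backend is in use; swapping providers
    # means writing a new adapter, not rewriting the pipeline.
    return backend.embed(docs)

vectors = index_documents(FakeLocalBackend(), ["hello", "world!"])
print(vectors)
```

The abstraction costs a little up front; the alternative is paying for it later, on the provider’s schedule rather than your own.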
Mistake 3: Data sovereignty is treated as a checkbox
‘We store data in a European data center’ is an answer that sounds good in a tender but does not mean what most people think it means. Where data is physically stored and who legally controls it are two completely different issues.
The CLOUD Act, which allows US authorities to request access to data handled by US companies regardless of where it is physically stored, is a well-known example. But there are subtler variants: model ownership (who owns a model trained on your data at an external provider?), logging (what does the provider log and for how long?), and training (is your data used to improve the provider’s own models?).
These questions need to be asked and answered in writing, not assumed.
Mistake 4: Security is added afterwards
AI systems are often built iteratively and quickly, which is good for innovation but bad for security architecture. What starts as an internal tool for one team gradually expands – more users, more sensitive data, broader integration with other systems – without the security model being updated in tandem.
Document-level access control in RAG systems is a concrete example we already mentioned. But there are more: API keys shared without a rotation policy, models running with excessive rights, logging that does not meet audit requirements, and integrations with external services that are not properly audited.
Retrofitting security into an existing system is always more expensive than building it right from the start. It’s a cliché, but it’s true.
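The document-level access control mentioned above can be sketched in a few lines: filter retrieved chunks against the caller’s permissions before they ever reach the model, rather than trusting the prompt to withhold anything. The data model and group names below are assumptions for illustration.

```python
# Sketch of document-level access control in a RAG pipeline: authorization
# happens between retrieval and generation. Groups and content are
# hypothetical examples.

from dataclasses import dataclass

@dataclass
class Chunk:
    text: str
    allowed_groups: frozenset[str]

def authorize(chunks: list[Chunk], user_groups: set[str]) -> list[Chunk]:
    """Keep only the chunks the calling user is entitled to see."""
    return [c for c in chunks if c.allowed_groups & user_groups]

retrieved = [
    Chunk("public handbook excerpt", frozenset({"all-staff"})),
    Chunk("salary review notes", frozenset({"hr"})),
]

# An engineer outside HR should never get the salary notes in context.
visible = authorize(retrieved, {"all-staff", "engineering"})
print([c.text for c in visible])
```

The same principle applies to the other examples in the list: permissions, key rotation, and audit logging belong in the architecture, not in the prompt or in a post-launch backlog.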
Mistake 5: Underestimating what ‘in production’ actually means
There is a big difference between an AI system that works and an AI system that is operable. Working means it produces reasonable outputs. Operable means it is monitored, updatable, debuggable, documented, and can be managed by more than the person who built it.
AI systems in production require an MLOps process: versioning of models and data, automated testing, monitoring of model performance over time, and clear procedures for when and how the model is updated. Without it, you’re left with a system that works until it doesn’t, and no one really knows why.
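One building block from that list – monitoring model performance over time – can be sketched as a rolling window with an alert threshold. The window size and threshold below are arbitrary assumptions; in practice they come from your SLOs.

```python
# Illustrative sketch of performance monitoring in production: track
# recent prediction outcomes and flag when rolling accuracy degrades.
# Window and threshold values are placeholder assumptions.

from collections import deque

class PerformanceMonitor:
    def __init__(self, window: int = 100, min_accuracy: float = 0.9):
        self.results: deque = deque(maxlen=window)  # keeps only recent outcomes
        self.min_accuracy = min_accuracy

    def record(self, prediction_correct: bool) -> None:
        self.results.append(prediction_correct)

    def degraded(self) -> bool:
        """True when rolling accuracy drops below the alert threshold."""
        if not self.results:
            return False
        return sum(self.results) / len(self.results) < self.min_accuracy

monitor = PerformanceMonitor(window=10, min_accuracy=0.8)
for ok in [True] * 9 + [False] * 3:  # quality drifts downward over time
    monitor.record(ok)
print(monitor.degraded())
```

A check like this answers “is the model still good?” continuously, instead of waiting for a user to notice – which is exactly the difference between a system that works and one that is operable.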
What it costs to redo
The common theme in these five mistakes is that they are expensive to correct after the fact but relatively easy to avoid with the right architecture decisions early on.
It requires asking difficult questions at a stage when it still feels like a pilot project. What happens when it scales? Who really owns the data? What do we need to be able to show in an audit? How will the system be updated in six months’ time?
These are the questions we ask when we help organizations design their AI infrastructure. Not to slow down, but to avoid rebuilding.
Contact us at Aixia if you want an honest conversation about your current setup, or explore AiQu at aiqu.ai.