Kubernetes: everyone wants to do it, regardless of their scale and business objectives.1

Common justifications include better scalability, cost savings, standardization and being super modern and stuff. It’s the future!

In my personal experience, Kubernetes is far from the magical uptime machine that a lot of people think it is, and migrating to it comes with a lot of hidden costs and potential downtime.

I’m not a Kubernetes expert, but I’ve been involved in a few Kubernetes migration projects and I have opinions. Here are the lessons and observations from what I’ve personally witnessed.

Migrations are complex

For most companies, using Kubernetes will mean planning and executing a migration project.

Assumptions will be made, estimates communicated and then the work begins. 90% of the migration will likely go relatively smoothly, but the last 10% will result in the migration project blowing past any initial estimates that you had.

There will always be those teams and services that require more time due to conflicting priorities or unexpected technical nuances popping up during testing.

It’s the long tail that gets you.

Kubernetes is complex

Kubernetes is so complex that most people point you towards managed Kubernetes clusters provided by the big cloud providers as a starting point.2 To me, this is the best indication that we’ve lost the plot.

Kubernetes is an abstraction layer on an already complicated stack, and abstractions tend to leak at the most inconvenient time.

There is nothing wrong with running plain old virtual machines as container hosts and scaling them vertically. Load balancers and containers are stable, reliable technology by now, and individual servers have made a big leap in performance over the past decade.
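
To make that concrete, here’s a hedged sketch of the kind of boring setup I mean, assuming Docker Compose on a single VM; the image name and the nginx config are made up:

```yaml
# docker-compose.yml -- the boring setup: one VM, nginx in front, app containers behind it.
# Image names and nginx.conf are placeholders.
services:
  proxy:
    image: nginx:1.27
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/nginx.conf:ro   # proxy_pass / upstream pointing at app:8080
    depends_on:
      - app
  app:
    image: registry.example.com/myapp:1.2.3     # hypothetical image
    deploy:
      replicas: 2                               # a couple of app containers behind the proxy
```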

You’re going to have to know about the fundamentals of where your service is running either way, so you might as well keep the stack simple, understandable and easily debuggable, avoiding all the extra complexity.

There is no shame in choosing boring technology.

Shout-out to all the madlads who run Kubernetes at home for fun. I respect the hustle.

Kubernetes will only start making sense at scale

If your company doesn’t have a fully staffed platform team (6-8 full-time employees), then you probably don’t need Kubernetes.

If you do, then you can start considering it, but know that it won’t magically solve every issue that you have in your tech organization. Your time might be better spent on tackling those issues first.

Kubernetes is great if you want to standardize how your workloads run, and with additional tooling and setup you can end up with a pretty neat system where developers can set up new services on their own and easily monitor them using your observability stack (Grafana, Prometheus etc.).

This requires a lot of effort though, from both your platform team and developers. This effort will be unreasonably high for small startups and organizations, and my guesstimate is that using Kubernetes will start making sense if you have 100+ developers in your tech organization.

If you’re a small team that has a setup that works for you, then continue using it. You’re doing great!

If you only need a few Kubernetes features, such as autoscaling, health checks and rolling deployments, then you can probably find a simple solution that works on your existing stack.
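
For example, if your stack is already containerized, Docker Swarm mode gives you health checks and rolling deployments without a Kubernetes cluster in sight. A rough sketch, assuming Swarm mode and deployed with `docker stack deploy -c stack.yml myapp`; the image name, endpoint and numbers are made up:

```yaml
# stack.yml -- health checks and rolling deployments without Kubernetes
services:
  app:
    image: registry.example.com/myapp:1.2.4   # hypothetical image
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]   # assumes curl exists in the image
      interval: 10s
      timeout: 3s
      retries: 3
    deploy:
      replicas: 3
      update_config:
        parallelism: 1            # replace one container at a time
        delay: 10s
        order: start-first        # start the new container before stopping the old one
        failure_action: rollback
```

This covers health checks and rolling deployments; autoscaling you’d have to script yourself, which may or may not be worth the trouble.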

If you’re just starting up, then don’t use Kubernetes. My recommendation is to start with a stupid simple stack that you know really well and scale it up vertically for as long as possible. Once that setup does not work for you, you will probably have enough money and people to do the Kubernetes migration. It’s a good problem to have!

Let your developers learn Kubernetes before migrating to it

If you skip this part, then expect a lot of questions, blocking issues, missed deadlines, hasty debugging, lost productivity and multiple production outages.

When I first dealt with Kubernetes, I had no idea what I was doing. I barely had time to look up the difference between pods and nodes, how to package applications into containers and what the hell an ingress was3. There was no formal training and no opportunity to take a few days to play around with Kubernetes before rolling it out to production. I had to make sure my other work commitments got done in parallel.

It sucked.
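
For the record, the ingress part turned out to be the least scary of the bunch: it’s essentially a routing rule for a reverse proxy. A minimal sketch, with made-up names, assuming an nginx ingress controller is installed:

```yaml
# ingress.yaml -- route HTTP traffic for one hostname to one Service,
# much like a server/location block in nginx (all names are made up)
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  ingressClassName: nginx        # assumes an nginx ingress controller is running
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp      # the Service sitting in front of your pods
                port:
                  number: 8080
```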

I eventually got better at working with Kubernetes, mainly as a result of learning from production outages. This is also training, but much more expensive compared to simply giving developers the opportunity to learn and experiment in a sandbox.

After doing most of “Kubernetes The Much Harder Way”, I have a vague understanding of what I’m doing, but if the Kubernetes cluster were to completely fall over in an unexpected way, I would still have no idea how to even approach fixing it, or how to make sure that the cluster is properly secured.

A one-day Kubernetes workshop organized by the company can go a long way in helping everyone get up to speed. Just organize it before the migration.

Your overworked developers won’t like it

During my 8+ years as a software developer, there has been a push for developers to be full stack and to embrace the operational side. I don’t have anything against that: developers should be responsible for what they deploy and should diligently observe the behaviour of their services.

However, I’ve also learned that a good chunk of developers don’t want to mess with the full stack and want to focus on their area of responsibility, which may involve more product-focused work. Throw in some inefficient meetings, absurdly high expectations from the business side, and the time and tolerance for handling anything else goes way down.

The cognitive capacity of the average developer is limited. If your developers are already at that limit, and you decide that you need some Kubernetes and that developers are now responsible for their own deployments, then expect some resistance and contempt towards you.

Even before you get to the Kubernetes part, you may also have to make sure that developers know the fundamentals about where their service runs and what amount of resource consumption is appropriate for their service. Turns out that this is not a given, especially in a fast-growth environment where teams and ownerships change often.

Oh, and developers might get very angry with you as every Kubernetes-related frustration will be attributed to your platform team, even if it’s an issue they themselves caused. It’s not fair, but it’s how it may play out.

Your application code has made assumptions about the platform it’s running on

If you don’t have expert knowledge about the service that you’re about to migrate to Kubernetes, then you’ll likely miss any assumptions that have been made in the application code itself.

The most common one is the assumption that only one instance of the service exists at any time. You can lift-and-shift it to Kubernetes as-is, but then you won’t be taking advantage of any scalability benefits that Kubernetes offers, so what’s the point?
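
If you lift-and-shift such a service anyway, at least make the single-instance assumption explicit so a deployment can’t briefly run two copies side by side. A minimal sketch of what I mean; the names and image are made up:

```yaml
# One replica, and the old pod is stopped before the new one starts,
# so the "I am the only instance" assumption keeps holding during deployments
apiVersion: apps/v1
kind: Deployment
metadata:
  name: legacy-worker
spec:
  replicas: 1
  strategy:
    type: Recreate              # the default RollingUpdate would briefly run two pods
  selector:
    matchLabels:
      app: legacy-worker
  template:
    metadata:
      labels:
        app: legacy-worker
    spec:
      containers:
        - name: legacy-worker
          image: registry.example.com/legacy-worker:1.0.0   # hypothetical image
```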

There was also a case where a service was relying on local storage for temporarily storing tasks that had to be picked up later. This made perfect sense on a virtual machine, but on Kubernetes the storage on pods is ephemeral, and pods have a habit of restarting for all sorts of reasons. The issue went unnoticed for quite a while and only became known after someone familiar with the service asked about it. Adding a persistent volume fixed it.
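
The shape of that kind of fix is roughly the following: claim persistent storage and mount it where the service expects its local directory to be. All names and sizes below are made up.

```yaml
# A PersistentVolumeClaim mounted into the pod so queued tasks survive restarts
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: task-data
spec:
  accessModes: ["ReadWriteOnce"]
  resources:
    requests:
      storage: 10Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: task-worker
spec:
  replicas: 1
  selector:
    matchLabels:
      app: task-worker
  template:
    metadata:
      labels:
        app: task-worker
    spec:
      containers:
        - name: task-worker
          image: registry.example.com/task-worker:2.3.4   # hypothetical image
          volumeMounts:
            - name: task-data
              mountPath: /var/lib/task-worker             # where the code writes its temporary tasks
      volumes:
        - name: task-data
          persistentVolumeClaim:
            claimName: task-data
```

Note that a ReadWriteOnce volume can only be attached to one node at a time, so this fixes the data loss but is not a path to running multiple replicas.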

Some libraries and solutions can also make assumptions about the number of instances that your service has, or the internal IP addresses that point to your service being static and predictable. Kubernetes breaks all of those assumptions.

You still need a platform team

I’ve seen claims that using Kubernetes will mean that you’ll need fewer people on your platform team, especially if you use a managed Kubernetes offering.

The reality is that you still need someone to make sure that even a managed Kubernetes instance stays up and running. This involves mundane work, such as making sure that updates are applied correctly without breaking every workload, or making sure that additional tooling bolted on to your Kubernetes cluster doesn’t wreck the services that are running on it.

Before the migration, your platform team answered questions and requests from developers, and wrangled whatever infrastructure you had running.

During the migration, your platform team will be answering questions and requests for both setups while also setting up Kubernetes and related infrastructure-as-code solutions, and unless you brought in more people before the migration, they’ll be overworked.

After the migration, your platform team will still be answering questions and requests, maintaining whatever infrastructure-as-code solutions you put in place, and making sure that Kubernetes stays running, which seems to take about the same number of people as before, if not more.

If you managed to avoid burning out any engineers during the migration, then that’s great!

If you managed to reduce headcount after a Kubernetes migration and it did not bite you in the ass years after the fact, then please do let me know.

Kubernetes won’t fix your legacy monolith

Kubernetes works really well with small services that start up within seconds and use relatively few resources.

The start-up time of your monolith is probably measured in minutes, and it likes to use all the CPU cores and RAM that you give it.

It can still run on Kubernetes, but certain aspects won’t work well. Scaling up fast in response to a spike in load fails either because of the long start-up time, or because the existing nodes can’t fit another copy of your monolith and new nodes are slow to come up. By the time more instances of your service are running, that temporary increase in load might have already passed. Your performance still sucks and your resource usage graphs look like a poorly maintained saw.

Your platform team will also be unhappy with these types of services, as big resource-hungry monoliths tend to require bigger nodes and might even end up impacting neighboring pods if configured improperly.
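
If you do run a monolith on Kubernetes, at least tell the scheduler how big it actually is, so it lands on a node that fits it and can’t starve its neighbours. A rough sketch with made-up numbers; the same resources block goes into your Deployment’s pod template:

```yaml
# Explicit requests and limits for a resource-hungry monolith
# (measure your real usage first; these numbers are made up)
apiVersion: v1
kind: Pod
metadata:
  name: monolith
spec:
  containers:
    - name: monolith
      image: registry.example.com/monolith:42.0.0   # hypothetical image
      resources:
        requests:
          cpu: "8"          # reserve whole cores up front so scheduling is honest
          memory: 24Gi
        limits:
          memory: 24Gi      # get OOM-killed rather than eat the whole node
```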

If you have set up tooling to ship service logs from your pods to a centralized location, then you might also find that your high-traffic monolith is logging so much that the tooling can’t keep up, resulting in logs going missing. The root cause can be something as basic as a default configuration value not working out for your thicc monolith, but by the time you get to that discovery, you’ll have wasted a good number of hours or days of productive work time.

Kubernetes won’t magically fix your performance issues

Autoscaling is one of the features that a lot of Kubernetes users like.

You’re having lunch and your service got really popular all of a sudden? No problem, your properly configured HorizontalPodAutoscaler can take care of it!

Autoscaling can save your butt, but it can also introduce additional issues.

For example, deploying a new version of your service can fail because you have too many instances of the service running. Databases such as PostgreSQL have a limited number of connections available, and each instance of your service uses up N of them. If you don’t account for deployments or autoscaling scenarios, then the new instances will fail to start because they cannot establish new database connections. It’s a good idea to set aside a few instances’ worth of database connections as a buffer.
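
As a back-of-the-envelope example (all numbers made up): with PostgreSQL at max_connections = 100 and a pool of 10 connections per pod, 7 pods use 70 connections at steady state and 90 during a rolling deploy that surges 2 extra pods, leaving roughly 10 for migrations, admin sessions and the superuser reservation. Capping the autoscaler accordingly might look something like this:

```yaml
# Cap the HorizontalPodAutoscaler so pods can't exhaust the database connection budget
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: myapp
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  minReplicas: 2
  maxReplicas: 7              # 7 pods * 10 connections + deploy surge stays under 100
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```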

Unless you’re actually limited by physical constraints, such as CPU time, memory and network bandwidth, then Kubernetes is unlikely to fix any performance issues. You’re better off profiling your application, network and database performance first and making sure that your observability stack gives you enough information to troubleshoot performance issues.

Kubernetes is not a magical uptime machine

It really isn’t.

At some point, you will have downtime because a Kubernetes configuration issue took down your whole service.

Sometimes you’ll involve additional tooling to make working with Kubernetes easier. That can also horrifically backfire due to circumstances not under your control.

You’ll probably have system-wide latency spikes because a critical service got its pods restarted one by one, and the new pods need to warm up their caches again. This is especially true for JVM-based services.
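
You can soften that particular failure mode by keeping cold pods out of rotation until they have actually warmed up, and by giving slow starters time to boot. A hedged sketch with made-up paths and numbers; the probes go into your Deployment’s pod template:

```yaml
# Don't send traffic to a pod until it reports itself warm, and give a slow-starting
# JVM service a generous startup window before other probes kick in
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
    - name: myapp
      image: registry.example.com/myapp:1.2.5   # hypothetical image
      startupProbe:
        httpGet:
          path: /healthz
          port: 8080
        periodSeconds: 10
        failureThreshold: 30                    # up to 5 minutes to start before giving up
      readinessProbe:
        httpGet:
          path: /ready                          # only return 200 once the caches are warm
          port: 8080
        periodSeconds: 5
```

A PodDisruptionBudget on top of that keeps voluntary evictions, such as node drains, from taking out too many pods at once.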

Misconfigured tooling can wreak havoc on your Kubernetes cluster. It’s not fun to troubleshoot why all your pods suddenly disappeared, only to find out later that Karpenter went on a pod massacre.

Vertical scaling can go a long way

If your current stack has a load balancer and a few containers, and you’re not doing anything too inefficient, then you can probably scale up vertically for a very long time.

Servers have made a big leap in performance and capability, resulting in machines with 128+ CPU cores, terabytes of fast memory and lots of room for adding ridiculously fast SSD-based storage.

You can already take advantage of this with your favourite cloud provider by picking a higher-tier VM. You’ll still be paying the cloud tax, but it’s going to be cheaper than a Kubernetes cluster, and your stack will remain simple, fast and portable.

If you want to go even further, you can buy 2+ physical servers, find a suitable location to host them, and take full advantage of modern hardware. At a certain scale, this will be much cheaper than the cloud, even if you need to hire somebody to manage, maintain and replace them. Physical servers aren’t scary, and you’ll need knowledgeable platform people working for you either way, so why not cut out the complexity and expense of the cloud?4

Conclusion

Kubernetes is a perfectly good option to go with, but only at the right level of organizational size and maturity. Unless you’re at that level, you really don’t need to worry about using it.


  1. right after they’re done implementing “AI” and LLMs on a completely unsuitable use case. ↩︎

  2. the only thing worse than managed Kubernetes is a poorly managed self-hosted one. ↩︎

  3. turns out that it’s a fancy name for a reverse proxy. You know, like nginx. ↩︎

  4. there are benefits to using the cloud, but just like Kubernetes, cloud services have a narrow set of circumstances where their use is appropriate. ↩︎