Sugarkube blog

Your Kubernetes clusters should be ephemeral too

by boosh on 16 October 2019 | Permalink

DevOps engineers and developers using Kubernetes are used to the idea that containers should be ephemeral. Not having to babysit containers is what allows us to scale: containers can be created and deleted automatically in response to traffic on the cluster. Some applications may use persistent volumes to store state, but things are generally simpler if state is stored outside the cluster, e.g. in hosted databases or storage systems like S3.

The thing is, why stop there? In this post, we’ll explain some of the benefits of making your entire Kubernetes clusters ephemeral as well – able to be created and deleted on-demand with total automation.

Long live the cluster!

Until now, creating clusters on demand has been problematic. Sure, there are tools like Kops and Minikube, and hosted services like Amazon’s EKS, Google’s GKE and Azure’s AKS. The trouble is that just creating a pristine cluster isn’t enough. For it to be useful, you need to install the applications you care about onto it, and since these often have interdependencies, you need a way of installing them in the correct order. An extra complication is that various applications (or the cluster itself) may require cloud infrastructure to be created, such as DNS zones, load balancers, databases or other cloud services.
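To make the ordering problem concrete, here’s a minimal sketch – the manifest and application names are hypothetical, not taken from any real tool – that uses Python’s standard-library graphlib to compute a valid install order from declared dependencies:

```python
from graphlib import TopologicalSorter

# Hypothetical manifest: each application maps to the set of
# applications that must be installed before it.
dependencies = {
    "cert-manager": set(),
    "ingress-nginx": set(),
    "mysql": set(),
    "wordpress": {"ingress-nginx", "cert-manager", "mysql"},
}

# static_order() yields each application only after all of its
# dependencies, and raises CycleError if the graph has a cycle.
install_order = list(TopologicalSorter(dependencies).static_order())
print(install_order)
```

A real installer would then walk this order, running the equivalent of `helm install` or `kubectl apply` for each application in turn.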

Given all this complexity, it’s quite common for organisations to opt for “long-lived” clusters – ones that someone sets up, possibly at the start of a project, customises, and then babysits. These long-lived clusters will hopefully have updates applied to them, and the DevOps engineers responsible can only hope those updates go smoothly – they haven’t got much choice.

There are several major problems with using long-lived Kubernetes clusters:

  • Lack of full automation means manual changes can creep in – this complicates disaster recovery and auditing, and makes it harder to adopt full automation later.
  • Upgrades are difficult to test before applying – and if one fails there’s a mad rush to fix the broken cluster, which may be a production cluster. This can be particularly painful when core parts of the system need upgrading, e.g. etcd.
  • Replicas for development, testing or new business cases are difficult to create.
  • They may have been configured manually, and the person who set them up may leave the team or organisation.

The problems above are serious. They can sap a team’s energy by requiring members to spend more time firefighting issues instead of developing new features. Upgrades and releases can be stressful and make the team’s velocity grind to a halt. It’s all wasted effort, time and money for absolutely no benefit.

But if we think about it, these are really the same types of problems we freed ourselves from when we started using containers in the first place. Immutability provides stronger guarantees about a system, which, with proper testing, should lead to improved robustness. If we could create fully-functioning Kubernetes clusters whenever we wanted, we wouldn’t have any of the above problems. Many teams already automate the creation of cloud infrastructure with tools like Terraform, or by using the APIs of the various cloud providers, for exactly the reasons described above. So it’s time to adopt automation for the entire cluster.

Ephemeral Kubernetes Clusters

Let’s imagine we could create Kubernetes clusters whenever we wanted, sized according to their purpose, and with the applications we want to work on or serve installed automatically. Here’s what we’d gain:

  • Developers could work on their own isolated clusters, and wouldn’t accidentally interfere with each others’ work
  • Faster for developers to create development environments, because all of an application’s dependencies would be installed automatically (and only the app’s dependencies – not the other 90% of services they don’t need)
  • Everything would need to be 100% automated, so there’d be no risk of uncommitted changes sneaking into clusters
  • Simpler auditing because of the above
  • Easier to launch clusters in multiple regions or to restore a cluster in the event of a disaster
  • Easier production cluster upgrades – an entirely new upgraded cluster could be created and traffic gradually migrated to it. If there’s a problem traffic could be redirected back to the previous production cluster.
  • Easier to make deep architectural changes on a new cluster and test them
  • Encourage more automated testing
  • Use of cloud resources would be more isolated, restricted to individual clusters
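The gradual traffic migration mentioned above is typically done with weighted DNS records. As an illustration only – the record names, targets and weights below are hypothetical – here’s a sketch that builds a Route 53-style change batch splitting traffic between the old and new clusters’ load balancers:

```python
def weighted_change_batch(name: str, old_lb: str, new_lb: str,
                          new_weight: int, ttl: int = 60) -> dict:
    """Build a Route 53-style change batch that splits traffic between
    an old and a new cluster by DNS record weight."""
    def record(set_id: str, target: str, weight: int) -> dict:
        return {
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": name,
                "Type": "CNAME",
                "SetIdentifier": set_id,   # distinguishes the weighted records
                "Weight": weight,
                "TTL": ttl,
                "ResourceRecords": [{"Value": target}],
            },
        }

    return {"Changes": [
        record("old-cluster", old_lb, 100 - new_weight),
        record("new-cluster", new_lb, new_weight),
    ]}

# Shift 25% of traffic to the new cluster; rolling back is just
# calling this again with new_weight=0.
batch = weighted_change_batch("app.example.com",
                              "old-lb.example.com",
                              "new-lb.example.com",
                              new_weight=25)
```

In practice you’d submit this batch via the Route 53 ChangeResourceRecordSets API (e.g. boto3’s `change_resource_record_sets`); the weights here are percentages for readability, though Route 53 accepts any relative weights from 0 to 255.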

Again, there are parallels with using containers in the first place. Deployments become easier, it’d be simpler to roll back upgrades, and portability would increase – in this case the entire cluster would be more portable across regions or instance sizes.

This is what Sugarkube allows you to do with only one or two commands. In fact, Sugarkube goes a step further: it lets you treat the specific location of a cluster (e.g. remote EKS, local Minikube) as just a detail. This makes it simple to work locally (where iteration is fastest) before moving to the cloud only when necessary. It can even simplify going multi-cloud.

Ephemeral clusters aren’t always the answer, though. You may legitimately need to store large amounts of data in a cluster, e.g. for monitoring. In that case, migrating the data to a new cluster may be slow and costly. Even so, ephemeral clusters can still help by letting you easily create a replica of the cluster in which to stage upgrades before applying them to the main cluster.

Sugarkube is free and open source. To learn more, check out our documentation and tutorials. For any other questions or to find out about our consultancy services, please contact us.