Think your cloud provider will help when your Kubernetes cluster breaks? Are you sure?
Cloud-managed Kubernetes services like Amazon EKS, Azure AKS, and Google GKE are marketed as a way to escape the pain of managing Kubernetes infrastructure. They offer a neatly packaged Kubernetes-as-a-Service (KaaS) experience, where the cloud provider provisions your control plane, wires up etcd, secures access to the Kubernetes API, installs DNS and networking plugins, and handles upgrades with the click of a button. They also provide tooling that ties cluster operations into the rest of their platform (e.g., authentication and authorization).
For many teams, this seems like a perfect trade-off. Kubernetes is hard to manage well, and handing over responsibility for the control plane to a cloud provider feels like a smart way to reduce operational risk, simplify platform management, and accelerate adoption.
But here’s the question most teams don’t ask until something goes wrong: what actually happens when your cluster breaks? Will your cloud provider step in to help? Will they troubleshoot a failed upgrade? Will they fix a CSI driver that has stopped provisioning volumes? Will they investigate strange DNS behavior or inconsistent etcd state?
The answer is almost always no, unless you are paying for the right level of support (which, in some cases, costs quite a lot).
This post explains exactly what is and isn’t included in a managed Kubernetes service. It breaks down what the provider is actually responsible for, what you still need to maintain, and why relying on the default level of support is often a risky and expensive mistake.
What is actually included in KaaS?
Each cloud provider offers a slightly different version of managed Kubernetes, but at a high level, here’s what is typically integrated and managed as part of the service:
- A fully managed Kubernetes control plane (API server, scheduler, controller manager)
- A secured etcd datastore, operated by the provider
- A load-balanced API endpoint with integrated authentication
- CoreDNS for service discovery, deployed and maintained by the platform
- A curated Kubernetes distribution, patched and validated by the provider
- A tested and supported CSI driver for block or file storage integration
- A default CNI plugin designed to integrate cleanly with the provider’s networking model
- Automated upgrade tooling for the control plane and node pools
This stack provides strong infrastructure scaffolding and is often a vast improvement over managing the control plane yourself. But it does not guarantee resilience, and it is not a support contract. While the provider installs and maintains the components they ship, they do not offer active remediation when something breaks unless you have purchased the right tier of support.
What’s managed vs what’s your responsibility?
Component | AWS EKS | Azure AKS | Google GKE | Customer Responsibility? |
---|---|---|---|---|
Kubernetes API Server | Fully managed | Fully managed | Fully managed | No |
etcd | Managed (no user access) | Managed (30-min backups) | Managed | No |
Control Plane Upgrades | User-initiated, managed execution | User-initiated, managed execution | Auto or user-initiated | Yes (initiate and validate) |
CoreDNS | Managed via EKS add-on | Managed via system pod | Fully managed | No |
Kubernetes Distribution | AWS-validated version | Microsoft-validated version | Google-patched version | No |
CNI Plugin (default) | AWS VPC CNI (addon-managed) | Azure CNI or kubenet (builtin) | GKE CNI (Calico-based) | No |
Custom CNI Plugin | Installable but unsupported | Installable but unsupported | Not installable | Yes |
CSI Driver (cloud storage) | Must install and upgrade manually | Must enable and manage version | Auto-managed | Yes (AWS and Azure only) |
Custom CSI Driver | Installable but unsupported | Installable but unsupported | Installable but unsupported | Yes |
Node Pool Upgrades | Optional auto-upgrade | Optional auto-upgrade | Optional auto-upgrade | Yes |
Kubernetes Add-ons (custom) | User responsibility | User responsibility | User responsibility | Yes |
Application Compatibility | User responsibility | User responsibility | User responsibility | Yes |
Workload Troubleshooting | User responsibility | User responsibility | User responsibility | Yes |
Upgrades: Not as “managed” as you might expect
All three cloud providers offer version upgrade tooling for the control plane and node pools, either via the web console or CLI. In most cases, these upgrades are not automatic by default. You are required to initiate them and ensure they complete successfully.
More importantly, you are fully responsible for validating that your applications, Helm charts, CRDs, admission plugins, and networking stack are compatible with the new Kubernetes version. If your cluster becomes unstable or stops working after an upgrade, the cloud provider will not assist unless you are on a paid support tier that includes SLA-backed engineering response.
In AWS and Azure, the CSI driver for persistent storage (which provisions volumes for your workloads) must be manually upgraded to stay in sync with the control plane version. If it is not upgraded, your workloads may silently fail to mount volumes, or the driver may enter crash loops. This is not something the cloud provider monitors for you.
Only Google GKE provides automatic management of the default CSI driver, and only for their supported storage backends.
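Because nobody else is watching for a crash-looping driver, it is worth having even a crude check of your own. The following is a minimal sketch that reads the JSON produced by `kubectl get pods -n kube-system -o json` and flags containers that are in `CrashLoopBackOff` or have restarted many times. The function name and the default threshold of 5 restarts are assumptions chosen for illustration, and a real setup would more likely rely on a monitoring stack with alerting.

```python
import json


def find_unhealthy_containers(pod_list_json, restart_threshold=5):
    """Return (pod, container, reason) tuples for containers that look unhealthy.

    pod_list_json is the text output of `kubectl get pods -o json`.
    restart_threshold is an arbitrary illustrative default.
    """
    pod_list = json.loads(pod_list_json)
    unhealthy = []
    for pod in pod_list.get("items", []):
        pod_name = pod["metadata"]["name"]
        for status in pod.get("status", {}).get("containerStatuses", []):
            # "waiting" is only present while the container is not running.
            waiting = status.get("state", {}).get("waiting") or {}
            reason = waiting.get("reason", "")
            restarts = status.get("restartCount", 0)
            if reason == "CrashLoopBackOff" or restarts >= restart_threshold:
                unhealthy.append(
                    (pod_name, status["name"], reason or f"{restarts} restarts")
                )
    return unhealthy
```

Piping the `kubectl` output into a small wrapper that calls this function after every upgrade is a cheap way to catch a CSI driver or CoreDNS pod that started failing quietly.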
When something breaks, what will your provider actually do?
Let’s consider some real-world examples:
- A control plane upgrade gets stuck mid-way and never finishes
- CoreDNS starts failing due to a resource constraint or version mismatch
- The default CSI driver stops creating volumes after a minor version bump
- etcd begins returning stale reads or refusing writes under load
- A networking plugin introduces latency or pod connectivity issues
Unless you have a paid support plan that includes a response SLA, the provider will not investigate these issues on your behalf. You will be directed to public documentation or user forums, even if the failure is in a managed component.
With basic support, there is no guaranteed assistance. With developer support, you may open tickets, but critical incident response is not guaranteed. To get real-time help with a cluster issue, you must be on a support tier that includes 24/7 coverage and production-down response targets.
What does real support actually cost?
If you want the provider to step in during outages, diagnose infrastructure failures, and engage engineering teams, you need to pay for it. Below is a summary of the costs required to unlock support tiers that include a 1-hour SLA for critical issues:
Environment Size | AWS (Business Support) | Azure (Standard Support + SLA Tier) | GCP (Enhanced Support) |
---|---|---|---|
1 Cluster | $172/month | $172/month | $572/month |
3 Clusters | $316/month | $316/month | $716/month |
30 Clusters | $2,260+/month | $2,260+/month | $2,860+/month |
100 Clusters | $7,300+/month | $7,300+/month | $7,800+/month |
To access 15-minute SLAs, dedicated technical account managers, and higher priority escalation paths, you will need to purchase an enterprise support tier, typically starting at $15,000 per month.
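To estimate costs for a cluster count that is not in the table, note that the AWS and Azure columns fit a simple linear model: a fixed monthly support minimum plus a flat per-cluster charge. The sketch below encodes that model. The $100 minimum and the $72 per-cluster charge are inferred by fitting the table's figures, not quoted from any provider's price list, so treat them as assumptions. A $500 minimum reproduces the GCP figures for 1 and 3 clusters, but the GCP figures for 30 and 100 clusters sit above what this model predicts.

```python
# Assumption: values inferred from the table's figures, not from a price list.
PER_CLUSTER_MONTHLY = 72      # flat charge per cluster, USD
AWS_AZURE_SUPPORT_FLOOR = 100  # monthly support minimum, USD
GCP_SUPPORT_FLOOR = 500        # fits only the 1- and 3-cluster rows


def estimate_monthly_cost(clusters, support_floor,
                          per_cluster=PER_CLUSTER_MONTHLY):
    """Linear estimate: support minimum plus a flat charge per cluster."""
    if clusters < 1:
        raise ValueError("clusters must be at least 1")
    return support_floor + clusters * per_cluster


# Estimate a size the table does not list, e.g. 10 clusters on AWS or Azure.
ten_cluster_estimate = estimate_monthly_cost(10, AWS_AZURE_SUPPORT_FLOOR)
```

The "+" on the larger rows in the table suggests those figures are lower bounds, so an estimate from this model is best read as a floor rather than a quote.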
So, what's the reality?
Cloud-managed Kubernetes services reduce the operational overhead of provisioning and managing the control plane, but they do not eliminate risk. They provide a framework for automation, not a safety net. They install infrastructure components, but they do not monitor your workloads for breakage or validate your deployments for compatibility.
You still carry the operational responsibility for everything running in your cluster. If your Kubernetes environment underpins production systems or customer-facing applications, it is vital that you budget for support accordingly, treat upgrades with caution, and build the internal expertise required to bridge the gap between infrastructure automation and operational resilience.
Because when something fails, it is not the SLA that restores service. It is your preparedness, your support contract, and your ability to respond that determine the outcome.
Of course, you can also outsource this support to a trusted third party, which is something Portainer offers its customers. With Portainer you get a management control plane that is simple to understand and operate and, optionally, a level 3/4 escalation channel, so you have an environment you can rely on.
