“If a tree falls in the woods and no one is around to hear it, does it make a sound?” If a node fails and your application restarts a few seconds later, is that a failure?
In most enterprise IT systems, the honest answer is no. A restart window measured in seconds is acceptable. Retries handle transient errors. Users refresh dashboards. Systems self-heal. Kubernetes was built precisely for this kind of environment.
On a manufacturing line, that same interruption can scrap product, damage tooling, or trigger safety interlocks. The tolerance model is different. In some environments, the goal is not rapid recovery. It is uninterrupted continuity.
That distinction changes how we think about Kubernetes at the industrial edge.
How Kubernetes High Availability Actually Works
Kubernetes is a distributed control system built around reconciliation. You declare desired state. The control plane continuously works to ensure that running state converges toward that declaration.
- If a container crashes, it is restarted.
- If a node fails, workloads are rescheduled.
- If a replica disappears, it is recreated.
This behavior is made possible through consensus. The control plane maintains authority via quorum, typically backed by etcd: a majority of members must survive for the cluster to remain authoritative. This prevents split-brain conditions and ensures consistency across the distributed system.
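The quorum rule is plain majority arithmetic: a group of n members needs n/2 + 1 to agree, so it survives the loss of the minority. A minimal sketch:

```python
# Quorum arithmetic for an etcd-style consensus group.

def quorum(members: int) -> int:
    # Smallest majority of the group.
    return members // 2 + 1

def fault_tolerance(members: int) -> int:
    # Failures the group can absorb while keeping a majority.
    return members - quorum(members)

for n in (1, 3, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {fault_tolerance(n)} failure(s)")
```

This is why etcd clusters come in odd sizes: a 3-member group tolerates one failure, a 5-member group tolerates two, and adding a fourth member adds no tolerance at all.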
Within those design assumptions, Kubernetes is exceptionally robust.
But it assumes that restart is acceptable and that shared authority is tolerable. Industrial systems do not always share those assumptions.
Replicas, Service Discovery, and Shared Fate
A technically literate reader will reasonably ask whether the problem is already solved by scaling replicas. If three instances of an application are running, and one fails, the remaining two continue serving traffic. From a workload perspective, that provides redundancy. For many enterprise IT services, that model is both effective and sufficient.
The limitation, however, does not sit at the process layer. It sits at the authority and infrastructure layer.
Those replicas exist inside a shared control domain. They depend on the same etcd quorum, the same API server, the same scheduler, and the same cluster networking fabric. Even at the service discovery layer, they rely on shared components such as CoreDNS and dynamic internal routing. In a stable data center environment, these abstractions simplify operations and provide powerful flexibility. They are part of what makes Kubernetes attractive.
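To make the coupling concrete, consider a minimal Deployment sketch (names and image are illustrative). All three replicas below are scheduled, routed, and reconciled by the same control plane, etcd quorum, and cluster DNS:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: line-controller            # illustrative name
spec:
  replicas: 3                      # redundancy at the process layer only
  selector:
    matchLabels:
      app: line-controller
  template:
    metadata:
      labels:
        app: line-controller
    spec:
      containers:
        - name: app
          image: registry.example.com/line-controller:1.0   # placeholder image
```

The `replicas: 3` line protects against a crashing process or a failed node, but every one of those pods still depends on the same cluster-wide control and discovery machinery.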
At the industrial edge, the environmental assumptions change. Cabinets may sit on different power circuits. Network links may traverse industrial switches subject to segmentation or interference. Hardware may operate in temperature and vibration ranges rarely encountered in cloud environments. Under these conditions, shared infrastructure subsystems represent correlated failure domains.
If the control plane becomes unavailable, the authority to reconcile state is affected for all replicas simultaneously. If etcd experiences issues, the entire cluster’s ability to maintain consistency is impaired. If cluster-wide DNS or networking degrades, service-to-service communication across all replicas is impacted at once. None of this implies fragility in Kubernetes itself. It reflects the reality that replicas inside a single cluster remain coupled through shared authority and shared infrastructure layers.
In enterprise IT, this coupling is generally acceptable because the cluster is treated as a reliable abstraction boundary. In industrial systems, where eliminating correlated failure is often a primary design objective, concentrating authority and service discovery into a single distributed cluster may be viewed as an unnecessary aggregation of risk. The architectural conversation therefore shifts from how many replicas are running to where the boundary of authority and failure containment should actually be drawn.
The A+B+C Redundancy Model for Industrial Edge Computing
Industrial and aerospace engineering provide a useful contrast. Commercial aircraft do not depend on a distributed cluster maintaining quorum in order to remain airborne. Instead, they implement triple modular redundancy. Three independent control systems operate simultaneously, each capable of full function. Their outputs are compared through arbitration logic, and divergence is resolved through voting mechanisms. The defining characteristic of this model is independence rather than internal scaling.
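The voting logic at the heart of triple modular redundancy is conceptually simple. A toy 2-of-3 majority voter, purely to illustrate the principle (real avionics arbitration is far more involved):

```python
from collections import Counter
from typing import Optional

def vote(a, b, c) -> Optional[object]:
    """Return the value at least two of three channels agree on, else None."""
    value, count = Counter([a, b, c]).most_common(1)[0]
    return value if count >= 2 else None

print(vote(42, 42, 17))  # one divergent channel is outvoted
print(vote(1, 2, 3))     # total divergence: no authoritative output
```

Note what the voter does not need: shared state between channels. Each channel computes independently, and agreement is established only at the arbitration boundary.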
This philosophy is increasingly applied to industrial edge computing.
Rather than constructing a single multi-node Kubernetes cluster and scaling replicas within it, the system is designed as three discrete operating units, commonly described as A, B, and C. Each unit is fully capable of running the complete application stack required by the plant. Each unit has its own compute boundary, its own networking boundary, and its own authority domain.
Availability is achieved through architectural redundancy across independent systems rather than through rescheduling within a shared cluster.
In practical terms, this can be implemented in two common ways.
One approach uses three standalone Docker (or Podman) hosts. Each host runs the identical containerized application stack, but there is no shared cluster quorum, no cross-node scheduler, and no shared control-plane state. The three hosts are treated as a logical deployment group, ensuring that the same application version is deployed consistently to Units A, B, and C. An external load balancer or supervisory controller sits in front of these hosts, performing continuous health checks and directing inbound traffic from the plant to whichever units are healthy. If Unit A fails completely due to a hardware, power, or software fault, traffic is withdrawn from it automatically while Units B and C continue operating without interruption.
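Deployed identically to each host, such a stack might look like this minimal Compose sketch (the image tag and health endpoint are placeholders):

```yaml
# docker-compose.yml -- identical on Units A, B, and C; no shared state
services:
  app:
    image: registry.example.com/plant-app:1.4.2   # pinned version for consistency
    restart: unless-stopped
    ports:
      - "8080:8080"
    healthcheck:
      # Local liveness signal; the external load balancer probes independently.
      test: ["CMD", "curl", "-f", "http://localhost:8080/healthz"]
      interval: 10s
      timeout: 3s
      retries: 3
```

Pinning the image tag matters here: with no shared scheduler enforcing a single desired state, version alignment across A, B, and C is the deployment group's responsibility.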
A second approach uses three single-node Kubernetes clusters rather than plain Docker hosts. Each unit runs its own independent Kubernetes instance, providing declarative deployment, pod lifecycle management, namespaces, and RBAC locally. Crucially, there is no shared etcd across units and no cross-node quorum to maintain. Each cluster is sovereign. The same manifests are applied independently to each cluster, typically via automation tooling that treats A, B, and C as coordinated but separate deployment targets.
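One simple way to treat A, B, and C as coordinated but separate targets is to drive `kubectl apply` once per kubeconfig. A minimal sketch (paths and manifest directory are illustrative; GitOps tooling achieves the same effect declaratively):

```python
import subprocess

# Illustrative kubeconfig paths: one sovereign cluster per operating unit.
UNITS = {
    "unit-a": "/etc/edge/kubeconfig-a",
    "unit-b": "/etc/edge/kubeconfig-b",
    "unit-c": "/etc/edge/kubeconfig-c",
}

def apply_cmd(kubeconfig: str, manifest_dir: str = "manifests/") -> list[str]:
    # Same manifests, applied independently; no shared etcd between units.
    return ["kubectl", "--kubeconfig", kubeconfig, "apply", "-f", manifest_dir]

def roll_out(dry_run: bool = True) -> None:
    for unit, cfg in UNITS.items():
        cmd = apply_cmd(cfg)
        print(f"{unit}: {' '.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, check=True)  # a failure stops one unit, not the fleet

roll_out()  # dry run: print the per-unit commands without executing them
```

Because each apply targets a separate cluster, a failed rollout on one unit leaves the other two authority domains untouched.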
Inbound access from the plant is again mediated by an external arbitration layer, typically a load balancer with active health checks. This component continuously evaluates the health of each operating unit and routes traffic accordingly. As long as at least one unit remains healthy, the application remains available to the plant. Failures are isolated to individual authority domains rather than propagating through a shared cluster fabric.
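As one concrete (and deliberately simplified) option, an HAProxy frontend with active HTTP health checks against each unit; addresses and the `/healthz` endpoint are assumptions:

```
# haproxy.cfg (sketch) -- routes plant traffic to whichever units pass checks
frontend plant_in
    bind *:80
    default_backend units

backend units
    option httpchk GET /healthz
    server unit-a 10.10.0.11:8080 check inter 2s fall 3 rise 2
    server unit-b 10.10.0.12:8080 check inter 2s fall 3 rise 2
    server unit-c 10.10.0.13:8080 check inter 2s fall 3 rise 2
```

The `fall`/`rise` thresholds control how quickly a unit is withdrawn from and readmitted to rotation; tuning them is a trade-off between reaction speed and flapping on a noisy plant network.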
The distinction is subtle but significant. It reframes availability from “how quickly can we recover?” to “how do we prevent correlated interruption altogether?”
Replica scaling within a cluster provides redundancy at the process layer. Independent operating units provide redundancy at the authority layer. In environments where correlated infrastructure failure and shared control-plane dependencies are the primary concern, isolating authority domains can reduce systemic risk in ways that additional replicas inside a single distributed cluster cannot.
At the industrial edge, the design question is not simply how many pods to scale. It is where the boundary of shared authority should sit, and whether that boundary aligns with the physical and operational realities of the plant floor.
{{article-cta}}
From Redundancy to Fleet Management
Designing the system as three independent operating units solves the correlated failure problem, but it introduces a new one. Once you move from a single cluster to discrete authority domains, you now have a fleet to manage.
Deploying the same application consistently to Units A, B, and C is straightforward for one work cell. The complexity emerges when that pattern is repeated across dozens or hundreds of cells, plants, or remote sites. You need a way to ensure that versions remain aligned, configuration drift is controlled, and updates can be rolled out predictably, without reintroducing a shared runtime dependency.
This is where fleet management becomes critical.
Using Portainer’s Edge compute capabilities, each standalone Docker host or single-node Kubernetes cluster can be registered as an independently managed endpoint. These endpoints can be organized into logical deployment groups that reflect the A+B+C redundancy sets. Application definitions can then be targeted to these groups, ensuring consistent deployment across each discrete unit while preserving their operational independence.
The important distinction is that management is centralized at the control level, not at the data or runtime level. Each operating unit remains sovereign. If connectivity to the management plane is lost, workloads continue running locally. Redundancy is preserved because execution authority does not depend on a shared cluster quorum.
In large industrial estates, this separation between runtime independence and centralized governance allows the A+B+C model to scale without reintroducing the very shared failure domains it was designed to eliminate.