Recovering a Kubernetes Crunchy Data PostgreSQL (PGO) cluster with no primary node
We ran into an interesting issue recently with our PostgreSQL cluster running atop Kubernetes and managed with pgo, where no primary node was elected after a node eviction.
We’d recently performed an upgrade of our Kubernetes cluster from 1.18.X to 1.19.X, and the containers were evicted from the older nodes onto the new ones. In the process, the pgo cluster failed to elect a new leader.
$ pgo test zercurity
cluster : zercurity
Services
primary (10.103.189.241:5432): UP
pgbouncer (10.109.233.235:5432): UP
replica (10.109.210.110:5432): UP
Instances
replica (zercurity-758dd49969-nzltl): UP
replica (zercurity-erwj-5d5876dc64-gb676): UP
replica (zercurity-vcxh-574fd7cf8b-jdgxc): UP
Not to worry (or so we thought). We can force the election of a new primary node: the pgo failover command will force a new node to become the primary within the cluster. You can use the --query flag to see which pods are available for election, like so:
$ pgo failover zercurity --query
$ pgo failover zercurity --target zercurity --force
However, what we hadn’t foreseen was that we’d also recently updated pgo itself, and could no longer run management commands against clusters that had been provisioned by an older version of pgo.
WARNING: Are you sure? (yes/no): yes
Error: zercurity has not yet been upgraded. Please upgrade the cluster before running this Postgres Operator command.
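The usual fix for that error is to run the operator’s own upgrade command against the cluster, which in pgo 4.x looks something like this (shown for context only; as explained below, it wasn’t an option for us):
$ pgo upgrade zercurity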
Nor could we upgrade the cluster, as pgo currently doesn’t support upgrades between major revisions of PostgreSQL. We could, of course, rebuild a new cluster from a recent backup; however, this also became an issue due to how pgo handles its upgrades.
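For completeness, the normal route for standing up a fresh cluster from an existing pgBackRest backup is pgo’s restore-on-create option, roughly as sketched below. The --restore-from flag is only present in more recent 4.x releases and the new cluster name here is purely illustrative, so treat this as a sketch rather than the exact command:
$ pgo create cluster zercurity-restore --restore-from=zercurity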
Oh dear.
Recovering our cluster manually
Fortunately, you’re able to fail over a cluster manually using the patronictl failover command from inside the PostgreSQL container.
Firstly, grab a list of the available pods:
$ kubectl -n pgo get po
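The output will list the database pods alongside the operator’s supporting pods. The pod names below are the ones from our cluster; the READY, RESTARTS and AGE values are illustrative:
NAME                              READY   STATUS    RESTARTS   AGE
zercurity-758dd49969-nzltl        1/1     Running   0          2d
zercurity-erwj-5d5876dc64-gb676   1/1     Running   0          2d
zercurity-vcxh-574fd7cf8b-jdgxc   1/1     Running   0          2d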
From here, identify one of the replica pods within our cluster. Then, using the kubectl exec command, create an interactive shell on one of the running containers.
$ kubectl -n pgo exec -i -t zercurity-758dd49969-nzltl -- /bin/bash
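If the pod runs more than one container, kubectl will drop you into the default one; in that case you may need to target the PostgreSQL container explicitly with -c. In Crunchy’s 4.x images that container is, to the best of our knowledge, named database:
$ kubectl -n pgo exec -i -t zercurity-758dd49969-nzltl -c database -- /bin/bash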
We can then simply use the patronictl command to elect a new primary member. You’ll be presented with a list of candidates to select from; pick one, then acknowledge the failover task.
$ patronictl failover
Candidate ['zercurity-nzltl', 'zercurity-erwj-wnlql', 'zercurity-vcxh-xmr64'] []: zercurity-nzltl
Current cluster topology
+ Cluster: zercurity --+------------+---------+---------+---+-----+
| Member               | Host       | Role    | State   | T | Lag |
+----------------------+------------+---------+---------+---+-----+
| zercurity-nzltl      | 192.X.X.13 | Replica | running | 6 |   0 |
| zercurity-erwj-wnlql | 192.X.X.30 | Replica | running | 6 |   0 |
| zercurity-vcxh-xmr64 | 192.X.X.12 | Replica | running | 6 |   0 |
+----------------------+------------+---------+---------+---+-----+
Are you sure you want to failover cluster zercurity? [y/N]: y
Once completed you can check that a leader has been successfully elected.
$ patronictl list
+ Cluster: zercurity --+------------+---------+----------+---+-----+
| Member               | Host       | Role    | State    | T | Lag |
+----------------------+------------+---------+----------+---+-----+
| zercurity-nzltl      | 192.X.X.13 | Leader  | running  | 6 |     |
| zercurity-erwj-wnlql | 192.X.X.30 | Replica | running  |   |     |
| zercurity-vcxh-xmr64 | 192.X.X.12 | Replica | starting |   |     |
+----------------------+------------+---------+----------+---+-----+
You can then use the pgo test command to check that your cluster is back in a healthy state.
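As before:
$ pgo test zercurity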
It’s all over!
We hope you found this helpful. Please feel free to get in touch if you have any questions.