Recovering a Kubernetes Crunchy Data PostgreSQL (PGO) cluster with no primary node

Zercurity
Apr 13, 2021


We recently ran into an interesting issue with our PostgreSQL cluster, which runs atop Kubernetes and is managed with pgo: no primary node was elected after a node eviction.

We’d recently upgraded our Kubernetes cluster from 1.18.X to 1.19.X, and the containers were evicted from the older nodes and rescheduled onto the new ones. In the process, the pgo clusters failed to elect a new leader.

$ pgo test zercurity
cluster : zercurity
Services
primary (10.103.189.241:5432): UP
pgbouncer (10.109.233.235:5432): UP
replica (10.109.210.110:5432): UP
Instances
replica (zercurity-758dd49969-nzltl): UP
replica (zercurity-erwj-5d5876dc64-gb676): UP
replica (zercurity-vcxh-574fd7cf8b-jdgxc): UP
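
The telltale sign in this output is that every entry under Instances is a replica, even though the primary service itself reports UP. As a rough sketch, that check could be scripted. Note that has_primary is a hypothetical helper name, and it assumes the pgo test output keeps the layout shown above: an Instances heading followed by one "role (pod): STATE" line per pod.

```shell
# Hypothetical helper: succeeds only if a pod (not a service) reports the primary role.
# Reads `pgo test` output on stdin; assumes the layout shown above.
has_primary() {
  awk '
    $1 == "Instances" { in_instances = 1; next }    # start of the per-pod section
    in_instances && $1 == "primary" { found = 1 }   # a pod holding the primary role
    END { exit found ? 0 : 1 }
  '
}

# Example: pgo test zercurity | has_primary || echo "no primary elected"
```

The primary line under Services is ignored on purpose, since a service can be UP while no pod behind it holds the primary role.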

Not to worry (or so we thought): we can force the election of a new primary node. The pgo failover command promotes a chosen node to primary within the cluster. You can use the --query flag to see which pods are available for election, like so:

$ pgo failover zercurity --query
$ pgo failover zercurity --target zercurity --force

However, what we hadn’t foreseen was that we’d also recently updated pgo itself, and could no longer run management commands against clusters that had been provisioned by an older version of pgo.

WARNING: Are you sure? (yes/no): yes
Error: zercurity has not yet been upgraded. Please upgrade the cluster before running this Postgres Operator command.

Nor could we upgrade the cluster, as pgo currently doesn’t support upgrades between major revisions of PostgreSQL. We could, of course, build a new cluster from a recent backup. However, this was also a problem because of how pgo handles its upgrades.

Oh dear.

Recovering our cluster manually

Fortunately, you can fail over a cluster manually using the patronictl failover command from inside the PostgreSQL container.

Firstly, grab a list of the available pods:

$ kubectl -n pgo get po
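
If the namespace holds pods for several clusters, a label selector can narrow the list. The sketch below is an assumption rather than something we ran at the time: it presumes the pods carry a pg-cluster label with the cluster name and a role label with the current role, so check your own pods with kubectl get po --show-labels first. cluster_pods is a hypothetical helper name.

```shell
# Sketch: list only the pods that belong to one pgo cluster, with their role label.
# ASSUMPTION: pods are labelled pg-cluster=<cluster name> and role=<current role>.
cluster_pods() {
  ns=$1 cluster=$2
  # -l filters by label; -L adds the value of the "role" label as an extra column.
  kubectl -n "$ns" get po -l "pg-cluster=$cluster" -L role
}

# Example: cluster_pods pgo zercurity
```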

From here, identify one of the replica pods within the cluster. Then use the kubectl exec command to open an interactive shell on one of the running containers.

$ kubectl -n pgo exec -i -t zercurity-758dd49969-nzltl -- /bin/bash

We can then simply use the patronictl command to elect a new primary member. You’ll be presented with a list of candidates to select from.

Pick a candidate, then acknowledge the failover task when prompted.

$ patronictl failover
Candidate ['zercurity-nzltl', 'zercurity-erwj-wnlql', 'zercurity-vcxh-xmr64'] []: zercurity-nzltl
Current cluster topology
+ Cluster: zercurity --+------------+---------+---------+---+-----+
| Member               | Host       | Role    | State   | T | Lag |
+----------------------+------------+---------+---------+---+-----+
| zercurity-nzltl      | 192.X.X.13 | Replica | running | 6 |   0 |
| zercurity-erwj-wnlql | 192.X.X.30 | Replica | running | 6 |   0 |
| zercurity-vcxh-xmr64 | 192.X.X.12 | Replica | running | 6 |   0 |
+----------------------+------------+---------+---------+---+-----+
Are you sure you want to failover cluster zercurity? [y/N]: y
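
The same failover can, in principle, be driven from outside the pod without the interactive prompts. This is a sketch under assumptions, not what we ran: it relies on the --candidate and --force options of patronictl failover, and it assumes patronictl can find its configuration when started directly through kubectl exec, just as it did in the interactive shell. failover_to and DRY_RUN are hypothetical names used only for illustration.

```shell
# Sketch: run the failover from outside the pod, skipping the interactive prompts.
# failover_to and DRY_RUN are hypothetical; namespace, pod and candidate come from the example above.
failover_to() {
  ns=$1 pod=$2 candidate=$3
  # --candidate names the member to promote; --force skips the confirmation prompt.
  # With DRY_RUN set, the command is printed instead of executed.
  ${DRY_RUN:+echo} kubectl -n "$ns" exec "$pod" -- \
    patronictl failover --candidate "$candidate" --force
}

# Print the command for review; call it again without DRY_RUN to run it for real.
DRY_RUN=1 failover_to pgo zercurity-758dd49969-nzltl zercurity-nzltl
```

Printing the command first is a cheap safeguard, since --force removes the "Are you sure" prompt that would otherwise catch a mistyped member name.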

Once completed you can check that a leader has been successfully elected.

$ patronictl list
+ Cluster: zercurity --+------------+---------+----------+---+-----+
| Member               | Host       | Role    | State    | T | Lag |
+----------------------+------------+---------+----------+---+-----+
| zercurity-nzltl      | 192.X.X.13 | Leader  | running  | 6 |     |
| zercurity-erwj-wnlql | 192.X.X.30 | Replica | running  |   |     |
| zercurity-vcxh-xmr64 | 192.X.X.12 | Replica | starting |   |     |
+----------------------+------------+---------+----------+---+-----+
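
If you want to check for a leader from a script rather than by eye, the table can be filtered with a few lines of awk. This is a minimal sketch that assumes the pipe-separated layout shown above, with Role as the third column; leader_of is a hypothetical helper name.

```shell
# Hypothetical helper: print the member whose Role column reads "Leader".
# Reads the `patronictl list` table on stdin; prints nothing if there is no leader.
leader_of() {
  awk -F'|' '
    $4 ~ /Leader/ {
      gsub(/^ +| +$/, "", $2)   # trim the padding around the member name
      print $2
    }
  '
}

# Example: patronictl list | leader_of
```

An empty result means the cluster is still without a leader, which makes the helper easy to use in a retry loop while the election settles.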

You can then use the pgo test command to check that your cluster is back in a healthy state.

It’s all over!

We hope you found this helpful. Please feel free to get in touch if you have any questions.
