Post mortem: Ceph - the worst disaster since I run my own infra (part 1)

A new category on my blog, “post mortem”. I sure hope it will not have many posts, but here is its first one.

Backstory

For some years now, I have been running a Kubernetes cluster for learning and tinkering purposes. At some point I started using it kinda productively, lacking the (financial) resources to have a testing environment that is separate from the production environment (but at least I know that would be good).

This cluster ran on plain (up until some days ago only virtual) servers. I always had 3 controller nodes (running the k8s control plane, including etcd, apiserver, scheduler, controller-manager) and a changing number of worker nodes - but all controller nodes are worker nodes, too (again because of lack of financial resources). For storage, I’m using Rook, which automatically creates and manages a Ceph cluster inside k8s.

Last state before this story began:

3 controller nodes
- 2 of which were nodes from back when I started this cluster
4 nodes in total
- one with SSD storage
- the other 3 with HDD storage

A new node

Someone in my bubble had a server to give away, as they were not using it themselves anymore. Since I had another server (Rahja) at the same hoster, costing the same amount of money with half the RAM and HDDs instead of SSDs, I took this server (from now on called Ifirn) over into my account to replace Rahja (farewell Rahja, may the next one not overload you and then, on top of that, run a Tor node on you). Rahja was my KVM host, running something like 6 VMs at that time (and at some point years before around 20).

Originally I just wanted to migrate the VMs over - do a clean, fresh Debian install, install and configure libvirt and VM networking, migrate the VMs one by one. Easy thing, I had already done this before when I migrated from Hesinde to Rahja (btw, kudos to whoever recognizes these names without looking them up :D). But then I thought about my cluster, how Rahja was not fully used with those VMs (anymore) and remembered there was this project to have VMs inside Kubernetes …

So I added Ifirn as a new node into my k8s cluster, installed kubevirt and started migrating VMs. I started with the smaller, less important VMs, fiddling around with the networking (those VMs have their own public IP) but at some point had those working. Then some days passed and I suddenly was in a hurry to get the last two VMs migrated, Praios (Gitlab) and Mokoscha (VPN, dn42).

A small disaster

Migrating Praios was a long job. It had 2 disks with 500GiB each, placed on different physical disks. First I had to clean this up, since once migrated to my cluster, each disk is already replicated - I don’t want RAID1 on Ceph RBD with size=3 (meaning each object is stored three times). In this process, I was able to shrink Praios down to a single disk with 300GiB. I left the migration running overnight, still trying to have a more normal day/night cycle.

At some point in that night (or before I was fully asleep), I tweeted the first sign of doom - but didn’t realize it was doom, yet.. after all, under high load it sometimes behaved funny, so it wasn’t entirely out of the normal.

A short intro to ceph and my Ceph config

Ceph is a cluster storage system. In my configuration it consists of three monitors (scheduled on controller nodes) and one object storage daemon (OSD) per physical disk (so one per virtual server node and two for Ifirn). The monitors are responsible for the cluster state - which servers are part of it, auth, when something was last changed and what was changed (not data, but stuff like “which OSDs are currently handling this data”).

You store objects in Ceph. This can be something like an Amazon S3 service (which I run on it) or something like disk images (RBD), which are then made of a lot of same-size objects (like sectors of a hard disk but a lot larger, 4 MiB in my case). I use RBD for persistent volumes in my Kubernetes cluster and also use the object storage via Rados Gateways (RGW), which gives an S3-like API over Ceph.
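To make the "disk image made of same-size objects" idea a bit more concrete, here is a minimal sketch of the arithmetic. It only assumes the 4 MiB object size mentioned above; the function names are made up for illustration, and it ignores RBD's real object naming, optional striping and the fact that objects are only created once something is written to them.

```python
OBJECT_SIZE = 4 * 1024 * 1024  # 4 MiB, the object size mentioned above
GIB = 1024 ** 3


def locate(offset, object_size=OBJECT_SIZE):
    """Return (object index, offset inside that object) for a byte offset in the image."""
    if offset < 0:
        raise ValueError("offset must be non-negative")
    return divmod(offset, object_size)


def object_count(image_size, object_size=OBJECT_SIZE):
    """Number of objects needed to back a fully written image (rounded up)."""
    return -(-image_size // object_size)


# A fully written 300 GiB image, like the shrunken Praios disk
print(object_count(300 * GIB))
# Which object holds the byte at offset 10 MiB, and where inside it
print(locate(10 * 1024 * 1024))
```

So a single large disk image turns into tens of thousands of objects, which is what lets Ceph spread it over many OSDs.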

Ceph handles so called Pools, which have a configured fault resistance. My RBD pool is configured to be replicated with size=3, meaning that every object in that pool is stored three times. You can even configure how this shall be distributed - on 3 different OSDs? Hosts? Racks? Continents? It can do that. Most of my pools are configured like this, allowing up to two OSDs to fail without any data loss.

There are two pools with a different config, those storing the actual data in my S3-like storage. They are configured to be erasure coded, meaning an object is split into K chunks with M chunks of parity and you can afford to lose up to M disks without any data loss.
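As a rough illustration of "K chunks plus M chunks of parity", here is a toy version for K=2 and M=1. This is not what Ceph does internally: its erasure code plugins are more general, and plain XOR parity is swapped in here only because it is the simplest code that works for M=1. The function names are hypothetical.

```python
from typing import List, Optional


def encode(obj: bytes, k: int = 2) -> List[bytes]:
    """Split obj into k equally sized data chunks and append one XOR parity chunk (m=1)."""
    chunk_len = -(-len(obj) // k)                # round up
    padded = obj.ljust(chunk_len * k, b"\0")     # pad so it splits evenly
    data = [padded[i * chunk_len:(i + 1) * chunk_len] for i in range(k)]
    parity = bytes(chunk_len)                    # all zero bytes to start with
    for chunk in data:
        parity = bytes(a ^ b for a, b in zip(parity, chunk))
    return data + [parity]


def recover(chunks: List[Optional[bytes]], length: int) -> bytes:
    """Rebuild the object from its chunks; at most one entry may be None (lost)."""
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise ValueError("more than m=1 chunks lost, object is unrecoverable")
    chunks = list(chunks)
    if missing:
        present = [c for c in chunks if c is not None]
        rebuilt = bytes(len(present[0]))
        for chunk in present:
            # XOR of all surviving chunks yields the missing one
            rebuilt = bytes(a ^ b for a, b in zip(rebuilt, chunk))
        chunks[missing[0]] = rebuilt
    return b"".join(chunks[:-1])[:length]        # drop parity and padding


obj = b"hello ceph"
chunks = encode(obj)
chunks[0] = None                                 # one OSD is gone
print(recover(chunks, len(obj)))                 # b'hello ceph'
```

Setting a second entry to `None` makes `recover` raise, which is the "losing more than M disks" case.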

Back to the small disaster.

The pools storing the actual data for my S3-like service had the minimum erasure coding config, since I started using this when I still only had 3 nodes in my cluster and couldn’t do anything else - and never looked at this config again: K=2 and M=1. This means losing one disk is fine and recoverable, losing more is bad. Due to the high load OSDs flapped and were considered out, starting a rebalance and recovery of the data stored on those - meaning other OSDs were now in charge and there was even more load - on top of writing the disk image into the ceph cluster.

This then made some OSDs suddenly need a lot more memory, resulting in two controller nodes (the old ones) being completely overwhelmed and the control plane going down. This in turn made kubelets not know whether they should restart an OOM-killed pod, so they didn’t - resulting in more load on the surviving OSDs.

The next morning (or whenever I actually got up) I woke up to 3 of the 5 nodes being NotReady, the OSDs on the two Ready nodes being in an OOM-kill loop and everything being down - me included.
I first worked on getting the control-plane back online, by stopping kubelet and crio on control-plane nodes, killing processes using a lot of memory and then restarting the control-plane services, verifying they are running correctly.
Next on the todo list were the two faulty OSDs.. this is where I made a mistake. On both of these nodes, the OSD was assigned so many PGs (Placement Groups - a group of objects to be placed on the same OSDs. A Pool is made of a given number of PGs) that it was OOM-killed before it had even fully booted - even when nothing else on the node was using any considerable amount of memory (this cluster is so old, back then, kubelet refused to run on a node with swap). Knowing Ceph heals very well, I just nuked the two OSDs and let the cluster sync them back online.

This was the mistake: nuking two OSDs. After that I had all OSDs back up, but 13 PGs were unavailable: the other nodes only knew that these PGs existed and that one of the nuked OSDs had held them before. Luckily, 10 of those PGs were empty, so I could just mark them as complete on one OSD. I marked the other 3 PGs as lost once I understood what I had done. If you look at the reply below that tweet, you can see I didn’t fully understand yet what had happened - erasure coded pools don’t code K objects against each other, resulting in M coding objects - they split every single object into K chunks and create M coding chunks for them - so no objects from these PGs survived. Luckily this only affected objects in my external-s3 service, which is what I mostly use for gitlab (artifacts, job logs, docker registry, ..) - annoying to deal with, but not a really bad data loss.
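The difference between the pool types can be put into a few lines of arithmetic. The sketch below is a simplification with made-up names: it looks at the worst case where every lost OSD held a piece of the same PG, and it ignores Ceph details such as min_size. It compares the replicated size=3 pools with the old (K=2, M=1) and new (K=3, M=2) erasure coding settings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pool:
    name: str
    shards: int  # OSDs that hold a copy or chunk of every object in a PG
    needed: int  # copies or chunks required to read an object back

    def survives(self, osds_lost: int) -> bool:
        """True if objects stay readable after losing this many OSDs of one PG."""
        return self.shards - osds_lost >= self.needed

    def raw_overhead(self) -> float:
        """Raw bytes stored per byte of user data."""
        return self.shards / self.needed


def replicated(size: int) -> Pool:
    return Pool(f"replicated size={size}", shards=size, needed=1)


def erasure(k: int, m: int) -> Pool:
    return Pool(f"erasure k={k} m={m}", shards=k + m, needed=k)


for pool in (replicated(3), erasure(2, 1), erasure(3, 2)):
    print(pool.name,
          "| survives 1 lost:", pool.survives(1),
          "| survives 2 lost:", pool.survives(2),
          "| overhead:", round(pool.raw_overhead(), 2))
```

With K=2 and M=1, losing two OSDs of a PG leaves one chunk per object, and one chunk alone is never enough. With size=3 a single surviving copy is a complete object.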

After this I instructed Ceph to do a deep scrub and changed the erasure coding parameters to K=3 and M=2 for more fault tolerance in the future. I’m pretty sure I made this change via the Rook objects for the two object stores, otherwise I surely would have stumbled upon the error Ceph gives when issuing a ceph osd erasure-code-profile set $existing_profile $new_parameters ...

After some time this change propagated to all OSDs and when they wanted to start one of the erasure coded PGs, they crashed on an assert: /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos8/DIST/centos8/MACHINE_SIZE/gigantic/release/16.2.5/rpm/el8/BUILD/ceph-16.2.5/src/osd/ECUtil.h: 34: FAILED ceph_assert(stripe_width % stripe_size == 0)
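One plausible reading of that assert, sketched as arithmetic. This is an interpretation and not a confirmed diagnosis: it assumes that stripe_size in the assert is the number of data chunks (K) of the current profile, that stripe_width was fixed at pool creation as K times a stripe unit, and that the stripe unit was 4096 bytes. The names below are made up for illustration.

```python
STRIPE_UNIT = 4096  # assumption: stripe unit in bytes when the pools were created


def stripe_width_at_creation(k, stripe_unit=STRIPE_UNIT):
    """Assumed: the pool's stripe_width is stored once, as k * stripe_unit."""
    return k * stripe_unit


def passes_assert(stripe_width, k_now):
    """Mirrors the failed check: stripe_width % stripe_size == 0."""
    return stripe_width % k_now == 0


width = stripe_width_at_creation(k=2)      # pools were created with K=2
print(width)                               # 8192
print(passes_assert(width, k_now=2))       # True: the old profile matches
print(passes_assert(width, k_now=3))       # False: 8192 is not divisible by 3
```

Under these assumptions, a pool created with K=2 can never satisfy the check once its profile claims K=3, which would fit every OSD crashing as soon as it touches one of these PGs.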

This is where the real fun began, because suddenly all OSDs went down and storage was completely halted.

To be continued, hopefully in the not too distant future.

Update 2023-12-18: by now, two people have emailed me about this problem, asking for help to restore their Ceph cluster and recover their data. In both cases we have been successful. I hope I will actually manage to write a sensible part two, but until then, feel free to email me when you run into this problem - at this point it seems likely I’ll be able to help you :)

Fox' blog
