2020-05-18 08:15 UTC

= (4/6) WAL device expansion =

As this step will require a rolling restart of all OSD processes, please complete the previous "Ceph Upgrade" and "OSD Memory Target" steps first, so that those 2 changes, which also require OSD restarts, are applied at the same time.

The WAL devices are currently only 100MB and are heavily relied upon on the write path for small (e.g. 4k) writes. Since they are on LVM, we can expand each device to 1GB during a quick stop of the OSDs.

The failure domain for Ceph is configured per-host, and 2 of the 3 replicas of the data are required to be online at any time in order for writes to succeed. For this reason we need to ensure we only stop/start the OSDs on a single host at a time (since, with this failure domain, the same host should not hold more than 1 copy of the same bit of data, even on different disks). We must then watch the cluster health (with: juju ssh ceph-mon/0 sudo watch ceph -s) and wait for the OSDs to return to the cluster and for recovery to finish back to HEALTH_OK before proceeding. You must always ensure that the cluster is HEALTH_OK before starting on a new host - that may seem redundant, but if there is a delay between finishing the last host and starting the next one, you should check the cluster health again first.

Note that the "bluefs-bdev-expand" command operates on the raw Ceph block devices and does not interact with the ceph-osd process. For this reason we must ensure that the OSD process is stopped and confirmed stopped (with: systemctl status ceph-osd@N) before executing the command.

(1) Set cluster noout (to prevent hosts being marked out while temporarily down):
juju ssh ceph-mon/0 sudo ceph osd set noout

For each host in the cluster:
(2a) Ensure the cluster is HEALTH_OK before proceeding: "juju ssh ceph-mon/0 sudo watch ceph -s"
(2b) Stop all of the ceph-osd processes on the host: systemctl stop ceph-osd@*
(2c) Confirm that all ceph-osd processes have actually stopped: "ps auxww|grep ceph-osd", or "systemctl status ceph-osd@*"
(2d) Expand the WAL logical volumes: for i in /dev/ceph-wal-*/osd-wal-*; do lvextend -L 1G $i; done
(2e) Expand the bluefs into the new space for each OSD: for i in /var/lib/ceph/osd/ceph-*; do ceph-bluestore-tool bluefs-bdev-expand --path $i; done
(2f) Start all of the ceph-osd processes: systemctl start ceph-osd@*
(2g) Confirm that all of the OSDs have started and returned to up/in status in the OSD tree
(2h) Wait for cluster HEALTH_OK to return: "juju ssh ceph-mon/0 sudo watch ceph -s"
(2i) Move onto the next host, going back to step 2a

(A rough script consolidating steps 2b-2f for a single host is sketched below.)

Once all hosts are complete, you can then remove the noout flag. You can also remove this flag if you intend to stop working on hosts for an extended period of time (e.g. overnight).

(3) Unset noout:
juju ssh ceph-mon/0 sudo ceph osd unset noout

Although I am not aware of any issues with this tool or process, and I could not see any fixes listed in either 12.2.12 or 12.2.13 (which is not yet available in Ubuntu) that would affect this tool working correctly, I suggest that we first trial this process on a single node, wait at least a small amount of time (maybe an hour), review the /var/log/ceph/ceph-osd.*.log files for signs of any warnings/errors or other problems, and generally verify that the process worked correctly before we move onto the remaining hosts.
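For convenience, here is a rough sketch of steps 2b-2f as a single script to be run as root on each OSD host in turn. This is untested and assumes the WAL LVs follow the /dev/ceph-wal-*/osd-wal-* naming used above; the HEALTH_OK checks (2a/2h) are deliberately left as manual steps from the ceph-mon unit. Treat it as a starting point rather than a replacement for the manual steps, particularly on the first node.

#!/bin/bash
# Rough sketch of steps 2b-2f for one host. Run as root on the OSD host.
set -euo pipefail

# Step 2b: stop all OSDs on this host
echo "Stopping all ceph-osd units on $(hostname)"
systemctl stop 'ceph-osd@*'

# Step 2c: refuse to continue if any ceph-osd process is still running
if pgrep -x ceph-osd >/dev/null; then
    echo "ceph-osd processes still running, aborting" >&2
    exit 1
fi

# Step 2d: expand each WAL logical volume to 1G
for lv in /dev/ceph-wal-*/osd-wal-*; do
    lvextend -L 1G "$lv"
done

# Step 2e: let bluefs grow into the new space on each OSD
for osd in /var/lib/ceph/osd/ceph-*; do
    ceph-bluestore-tool bluefs-bdev-expand --path "$osd"
done

# Step 2f: start the OSDs again. If the glob no longer matches once the
# units have been stopped, start them individually (systemctl start
# ceph-osd@N) or via the target unit (systemctl start ceph-osd.target).
systemctl start 'ceph-osd@*'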
At this time we suspect that some of the WAL data which was previously "spilling over" onto the bcache device may remain on the bcache device. Unfortunately there is no way to fix that under the Ceph Luminous version of this tool; however, increasing the available size should still help, and it seems that at least some of the data will move over some of the time.
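If we want a rough indication of whether bluefs data is still sitting on the slow (bcache) device for a given OSD after the expansion, the bluefs perf counters can be queried over the admin socket on the OSD host. This is only a suggestion and the exact counter names may differ slightly between Luminous point releases, so treat it as a sketch; replace N with the OSD id:

# Non-zero slow_used_bytes suggests bluefs data still lives on the slow/bcache device;
# wal_used_bytes shows how much of the (now larger) WAL device is in use.
ceph daemon osd.N perf dump | grep -E '"(wal|db|slow)_used_bytes"'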