Introduction
Since Proxmox supports Ceph out of the box, I decided to try my hand at it. In the process, however, I managed to completely break both corosync and Ceph. After a night of troubleshooting I finally managed to fix it, but not thanks to Proxmox.
How I ruined it
The core of the problem was my fault. I spun up a new Proxmox host but, and this is the real issue, didn't connect it to my existing cluster. With this node active, I installed Ceph on it via the GUI and added the first OSDs. I then decided to add the node to the cluster, and that's where it all fell apart. Proxmox maintains ceph.conf, and I think various other parts of the config, via its own sync system on /etc/pve, so everything broke. The original Ceph install on my new node was nowhere to be found, even though some of the files were still present. I also couldn't remove the old monitor files, because hey, no Ceph cluster was active!
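To see how a node's Ceph config is tied to /etc/pve, a quick check like the following can help. This is only a sketch: the function name describe_ceph_conf is made up for illustration, and the default path /etc/ceph/ceph.conf assumes the usual Proxmox layout, where that file is a symlink to /etc/pve/ceph.conf.

```shell
#!/bin/sh
# Report whether a ceph.conf path is a symlink, a regular file, or missing.
# The default path is an assumption about a standard Proxmox install.
describe_ceph_conf() {
    conf="${1:-/etc/ceph/ceph.conf}"
    if [ -L "$conf" ]; then
        echo "$conf is a symlink to $(readlink "$conf")"
    elif [ -f "$conf" ]; then
        echo "$conf is a regular file"
    else
        echo "$conf does not exist"
    fi
}
```

Calling describe_ceph_conf with no arguments on a node, before and after joining it to a cluster, should show whether the config still points at the synced copy.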
How I solved it
The core of the solution was to really start from scratch. But how do you get back to a base state when you've just ruined most of the install? Fortunately, Linux is all files, so most things can be fixed with a nice text editor. The first step was to remove Ceph entirely from Proxmox, so I could mess with the files in peace. Proxmox has a nice command for this: pveceph purge, which I ran on all nodes.
It sometimes spat out errors, but the core task, namely removing ceph.conf, was done.
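Running the purge on every node can be scripted. The sketch below is not what I ran at the time; the node names pve1, pve2 and pve3 and the DRY_RUN switch are hypothetical, and it assumes root SSH access between the nodes. By default it only prints the commands.

```shell
#!/bin/sh
# Hypothetical node names; replace them with the hosts in your own cluster.
NODES="pve1 pve2 pve3"

# Print (DRY_RUN=1, the default) or run `pveceph purge` on every node over SSH.
purge_all_nodes() {
    for node in $NODES; do
        if [ "${DRY_RUN:-1}" = "1" ]; then
            echo "ssh root@$node pveceph purge"
        else
            # Keep going if one node reports errors; the purge may complain
            # on a broken install and still remove ceph.conf.
            ssh "root@$node" pveceph purge \
                || echo "purge reported errors on $node" >&2
        fi
    done
}
```

Call purge_all_nodes once to review the printed commands, then set DRY_RUN=0 and call it again to actually run them.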
At that point I wanted to uninstall the Ceph packages, but this was not that easy. Because the install was so broken, apt refused to remove the Ceph packages for the OSD, MDS, MGR and base, with the following error:
Failed to stop ceph.service: Unit ceph.service not loaded.
invoke-rc.d: initscript ceph, action "stop" failed.
However, because the install was in this state, there was no Ceph to stop. The solution was to edit all the *.prerm files of those packages, which dpkg keeps in the folder /var/lib/dpkg/info/. Stopping dpkg from trying to interact with Ceph meant I could just remove the packages.
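I edited the scripts by hand, but the same idea can be sketched as a small function. Everything here is an assumption for illustration: the function name neutralise_ceph_prerm, the default backup folder /root/ceph-prerm-backup, and the choice to replace each script with a plain no-op rather than cutting out only the lines that stop the service.

```shell
#!/bin/sh
# Replace every ceph*.prerm script in a dpkg info directory with a no-op,
# after copying the original into a separate backup directory.
# $1: dpkg info directory (default /var/lib/dpkg/info)
# $2: backup directory (hypothetical default /root/ceph-prerm-backup)
neutralise_ceph_prerm() {
    info_dir="${1:-/var/lib/dpkg/info}"
    backup_dir="${2:-/root/ceph-prerm-backup}"
    mkdir -p "$backup_dir" || return 1
    for script in "$info_dir"/ceph*.prerm; do
        [ -f "$script" ] || continue          # the glob matched nothing
        cp -p "$script" "$backup_dir/" || return 1
        # A prerm that exits 0 lets dpkg continue without touching Ceph.
        printf '#!/bin/sh\nexit 0\n' > "$script"
        chmod 755 "$script"
        echo "neutralised $script"
    done
}
```

With the scripts neutralised, removing the packages through apt should no longer trip over the failed service stop. The backups are kept outside the dpkg folder so they can be inspected later without leaving stray files next to dpkg's own.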
With all packages removed, the next step was to remove all the old Ceph files for the monitors. These are in the folder /var/lib/ceph/.
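The monitor cleanup can be sketched in the same way. The function name clean_ceph_mon_dirs and its list/delete switch are made up for this example, and it assumes the default Ceph layout, where each monitor keeps its data in its own directory under /var/lib/ceph/mon/. It only lists what it would remove unless told to delete.

```shell
#!/bin/sh
# List (default) or delete the monitor data directories under a Ceph data root.
# $1: Ceph data root (default /var/lib/ceph)
# $2: "list" (default) or "delete"
clean_ceph_mon_dirs() {
    ceph_root="${1:-/var/lib/ceph}"
    mode="${2:-list}"
    for dir in "$ceph_root"/mon/*; do
        [ -d "$dir" ] || continue             # nothing to clean
        if [ "$mode" = "delete" ]; then
            rm -rf -- "$dir"
            echo "removed $dir"
        else
            echo "would remove $dir"
        fi
    done
}
```

Running clean_ceph_mon_dirs with no arguments shows the candidates; clean_ceph_mon_dirs /var/lib/ceph delete removes them. Only do this once the cluster is purged, since these directories are the monitor's entire state.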
At this point I had a clean slate and a working cluster.
References
https://linoxide.com/linux-how-to/hwto-configure-single-node-ceph-cluster/
https://forum.proxmox.com/threads/ceph-config-broken.54122/page-2
https://forum.proxmox.com/threads/ceph-cluster-can-not-start-monitors.34689/
https://forum.proxmox.com/threads/reinstall-ceph-on-proxmox-6.57691/