Ceph Proxmox

Overview

There are two main use cases:

  • Virtual Machines
  • Bulk Storage

They both provide High Availability, but VMs need speed whereas bulk storage should be economical. What makes Ceph awesome is that you can do both - all with the same set of disks.

Preparation

Computers

Put 3 or 4 PCs on a LAN, each with at least a couple of HDDs in addition to a boot drive. 1 GB of RAM per TB of disk is recommended1, and Ceph will use it. You can get by with less RAM; things will just be slower.
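
As a quick sanity check before you start, these standard Linux commands (run on each node) will show what you actually have to work with:

# Memory on this node
free -h

# Disks and partitions - the drives destined for Ceph should be empty
lsblk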

Network

Ceph’s effective speed is roughly a third to half of your network speed. With a 1 Gb NIC you’ll average around 60 MB/sec for file operations, since multiple copies are being saved behind the scenes. This sounds terrible, but in reality it’s fine for a cluster that serves up small files and/or media streams. It will, however, take a long time to get data onto the system.

If you need more, install a secondary NIC for Ceph traffic. Do that before you configure the system, if you can. Doing it after is hard. You can use a mesh config via the PVE docs2 or purchase a switch. Reasonable 2.5 Gb switches and NICs can now be had.
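
If you want to see what your network actually delivers between two nodes, a quick iperf3 test works well (you may need to apt install iperf3 first; the IP below is just an example):

# On the first node, start a listener
iperf3 -s

# On a second node, point at the first node's IP
iperf3 -c 192.168.1.11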

Installation

Proxmox

Install PVE and cluster the servers.
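
You can confirm the cluster formed correctly from any node’s shell:

# Shows cluster name, quorum state and member nodes
pvecm status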

Ceph

Double check the Ceph repo is current by comparing what you have enabled:

grep ceph  /etc/apt/sources.list.d/*

Against what PVE has available.

curl -s https://enterprise.proxmox.com/debian/ | grep ceph

If you don’t have the latest, take a look at Install PVE to update the Ceph repo.
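
For reference, a no-subscription Ceph repo entry looks something like the line below - the release and distribution names are examples, so match them to what PVE offers:

# /etc/apt/sources.list.d/ceph.list
deb http://download.proxmox.com/debian/ceph-quincy bookworm no-subscription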

After that, log into the PVE web GUI, click on each server, and in that server’s submenu click on Ceph. It will prompt for permission to install. When the setup window appears, select the newest version and the No-Subscription repository. You can also refer to the official notes.

The wizard will ask for some additional configuration options, for which you can take the defaults and finish. If you have dedicated Ceph network hardware, set it up with a separate IP range and choose the appropriate interfaces for both the public network and the cluster network.
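
If you did set up a dedicated Ceph network, the result in /etc/pve/ceph.conf should end up looking something like this (the subnets here are just examples):

[global]
    public_network = 192.168.1.0/24
    cluster_network = 10.10.10.0/24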

Configuration

Ceph uses multiple daemons on each node.

  • Monitors, to keep track of what’s up and down.
  • Managers, to gather performance data.
  • OSDs, a service per disk to store and retrieve data.
  • Metadata Servers, to handle file permissions and such.

To configure them, you will:

  • Add a monitor and manager to each node
  • Add each node’s disks as OSDs (Object Storage Devices)
  • Add metadata servers to each node
  • Create Pools, where you group OSDs and choose the level of resiliency
  • Create a Filesystem

Monitors and Managers

Ceph recommends at least three monitors3 and manager processes4. To install, click on a server’s Ceph menu and select the Monitor menu entry. Add a monitor and manager to the first three nodes.
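
If you prefer the shell, the same thing can be done per node with pveceph:

# Run on each of the first three nodes
pveceph mon create
pveceph mgr create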

OSDs

The next step is to add disks - aka Object Storage Devices. Select a server from the left-hand menu, and from that server’s Ceph menu, select OSD. Then click Create: OSD, select the disks, and click Create. If any disks are missing from the list, enter the server’s shell and issue the command wipefs -a /dev/sdX on the disks in question to clear their existing signatures.

If you have a lot of disks, you can also do this at the shell:

# Assuming you have 5 disks, starting at 'b'
# (echo prints the commands for review - remove it to actually create the OSDs)
for X in {b..f}; do echo pveceph osd create /dev/sd$X; done
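
Once created, you can verify the OSDs and how they map to hosts:

# Lists each host, its OSDs, and their status and weight
ceph osd tree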

Metadata Servers

To add MDSs, click on the CephFS submenu for each server and click Create under the Metadata Servers section. Don’t create a CephFS yet, however.
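
This can also be done from each node’s shell:

# Run on every node that should host a metadata server
pveceph mds create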

Pools and Resiliency

We are going to create two pools: a replicated pool for VMs and an erasure coded pool for bulk storage.

If you want to mix SSDs and HDDs see the storage tiering page before creating the pools. You’ll want to set up classes of storage and create pools based on that page.
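
For reference, device classes look something like this at the shell - the rule name below is just an example, and the storage tiering page has the full treatment:

# Show the classes (hdd/ssd/nvme) Ceph assigned to your OSDs
ceph osd crush class ls
ceph osd tree

# Example: a replicated CRUSH rule restricted to SSDs
ceph osd crush rule create-replicated replicated_ssd default host ssd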

Replicated Pool

On this pool we’ll use the default replication level, which gives us three copies of everything, guaranteed at a per-host level. Lose any one or two hosts, no problem. But lose individual disks from all three hosts at the same time and you’re out of luck.

This is the GUI default so creation is easy. Navigate to a host and select Ceph –> Pools –> Create. The defaults are fine and all you need do is give it a name, like “VMs”. You may notice there is already a .mgr pool. That’s created by the manager service and safe to ignore.
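
If you’d rather use the shell, something like this should be equivalent (the pool name is just an example):

# Creates a replicated pool (3 copies by default) and adds it as PVE storage
pveceph pool create VMs --add_storages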

If you only need storage for VMs and containers, you can actually stop here. You can create VMs directly on your new pool, and containers on local storage then migrate them (Server –> Container –> Resources –> Root Disk –> Volume Action –> Target Storage).

Erasure Coded Pool

Erasure coding requires that you determine how many data and parity bits to use, and issue the create command in the terminal. For the first question it’s pretty simple - if you have three servers you’ll need 2 data and 1 parity. The more systems you have, the more efficient you’ll be, though when you get to 6 you should probably increase your parity. Unlike the replicated pool, you can only lose one host with this level of protection.

Here’s the command5 for a 3 node system that can withstand one node loss (2,1). For a 4 node system you’d use (3,1), and so on. Increase the first number as your node count goes up, and the second as you desire more resilience. Ceph doesn’t require a ‘balanced’ cluster in terms of drives, but you’ll lose some capacity if you don’t have roughly the same amount of space on each node.

# k is for data and m is for parity
pveceph pool create POOLNAME --erasure-coding k=2,m=1 --pg_autoscale_mode on --application cephfs

Note that we specified the application in the command. If you don’t, you won’t be able to use the pool for a filesystem later on. We also set PGs (placement groups) to auto-scale. This is how Ceph chunks data as it gets sent to storage. If you know how much data you have to load, you can specify the starting number of PGs with the --pg_num parameter, which will make things a little faster for an initial copy. Red Hat suggests6 (number of OSDs × 100) / (K + M). You’ll get a warning7 from Ceph if it’s not a power of 2 (2, 4, 8, 16, 32, 64, 128, 256, 512), so use the closest one, such as --pg_num 512.
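
As a hypothetical worked example: 3 nodes with 5 OSDs each is 15 OSDs, so 15 * 100 / (2 + 1) = 500, and the nearest power of 2 is 512.

# Same create command as above, with a starting PG count for a known-large initial load
pveceph pool create POOLNAME --erasure-coding k=2,m=1 --pg_num 512 --application cephfs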

If you don’t know how much data you’re going to bring in, but expect it to be a lot, you can turn on the bulk flag rather than specifying pg_num.

# Make sure to add '-data' at the end
ceph osd pool set POOLNAME-data bulk true

When you look at the pools in the GUI you’ll see it actually created two pools, one for data and one for metadata; metadata isn’t compatible with EC pools yet, so the metadata pool is replicated. You’ll also notice that you can put VMs and containers on this pool just like the replicated pool. It will just be slower.
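
You can see both pools and their settings from the shell:

# The data pool shows 'erasure' with its profile; the metadata pool is replicated
ceph osd pool ls detail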

Filesystem

The GUI won’t allow you to choose the erasure coded pool you just created so you’ll use the command line again. The name you pick for your Filesystem will be how it’s mounted.

ceph fs new FILE-SYSTEM-NAME POOLNAME-metadata POOLNAME-data --force

To mount it cluster-wide, go back to the Web GUI and select Datacenter at the top left, then Storage. Click Add and select CephFS as the type. In the subsequent dialog box, put the name you’d like it mounted as in the ID field, such as “bulk” and leave the rest at their defaults.

You can now find it mounted at /mnt/pve/IDNAME, and you can bind mount it into your containers or set up NFS for your VMs.
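
As an example of a bind mount (the container ID and mount point here are hypothetical):

# Bind mount the cluster filesystem into container 101 at /bulk
pct set 101 -mp0 /mnt/pve/bulk,mp=/bulk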

Operation

Failure is Always an Option

Ceph defaults to a failure domain of ‘host’. That means you can lose a whole host with all its disks and continue operating. You can also lose individual disks from different hosts, but operation is NOT guaranteed; if you still have at least two copies of everything, you can continue operating as well. After a short time, Ceph will re-establish resilience as disks fail or hosts remain off-line, and should they come back it will re-adjust, though in both cases this can take some time.
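
You can confirm the failure domain by dumping the CRUSH rule - the default replicated rule is usually named replicated_rule:

# Look for the 'chooseleaf' step with type 'host'
ceph osd crush rule dump replicated_rule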

Rebooting a host

When a host goes down, Ceph immediately panics and starts re-establishing resilience elsewhere. When the host comes back up, it moves everything back again. This is OK, but Red Hat suggests avoiding the churn with a few steps.

On the node you want to reboot:

sudo ceph osd set noout
sudo ceph osd set norebalance
sudo reboot

# Log back in and check that the pgmap reports all pgs as normal (active+clean). 
sudo ceph -s

# Continue on to the next node
sudo reboot
sudo ceph -s

# When done
sudo ceph osd unset noout
sudo ceph osd unset norebalance

# Perform a final status check to make sure the cluster reports HEALTH_OK:
sudo ceph status

Troubleshooting

Pool Creation Error

If you created a pool but left off the --application flag, it will be set to RBD by default. You’d have to change it from RBD to CephFS like so, for both the data and metadata pools:

ceph osd pool application enable srv-data cephfs --yes-i-really-mean-it
ceph osd pool application disable srv-data rbd --yes-i-really-mean-it
ceph osd pool application enable srv-metadata cephfs --yes-i-really-mean-it
ceph osd pool application disable srv-metadata rbd --yes-i-really-mean-it
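
You can verify what each pool ended up with:

# Shows the enabled application(s) for a pool
ceph osd pool application get srv-data
ceph osd pool application get srv-metadata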

Cluster IP Address Change

If you want to change your IP addresses, you may be able to just change the public network in /etc/pve/ceph.conf and then destroy and recreate the monitors it lists. This worked for me, though I don’t know if it’s the recommended way, and the OSD cluster network probably needs to be changed as well.

Based on https://www.reddit.com/r/Proxmox/comments/p7s8ne/change_ceph_network/
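
As a rough sketch of that procedure (the node name is a placeholder, and this assumes pveceph handles the monitor destroy/create):

# Update public_network (and cluster_network) in /etc/pve/ceph.conf, then
# recreate the monitors one at a time so they pick up the new addresses
pveceph mon destroy NODENAME
pveceph mon create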

