Ceph Proxmox
Overview
There are two main use-cases:
- Virtual Machines
- Bulk Storage
Both provide High Availability, but VMs need speed whereas bulk storage should be economical. What makes Ceph awesome is that you can do both, all with the same set of disks.
Preparation
Computers
Put 3 or 4 PCs on a LAN, each with at least a couple of HDDs in addition to a boot drive. 1G of RAM per TB of disk is recommended1 and Ceph will use it. You can get by with less RAM; it will just be slower.
Network
The speed of Ceph is essentially a third to a half of your network speed. With a 1 Gig NIC you’ll average around 60 MB/sec for file operations (multiple copies are being saved behind the scenes). This sounds terrible, but in reality it’s fine for a cluster that serves up small files and/or media streams. It will, however, take a long time to get data onto the system.
If you need more, install a secondary NIC for Ceph traffic. Do that before you configure the system, if you can; doing it after is hard. You can use a mesh config via the PVE docs2 or purchase a switch. Reasonably priced 2.5 Gb switches and NICs can now be had.
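Before blaming Ceph for slow transfers, it’s worth measuring the raw link speed between two nodes. A quick check with iperf3 (not installed by default; the package name on PVE’s Debian base is assumed to be iperf3):
# Install the tool, then start a listener on one node
apt install iperf3
iperf3 -s
# On another node, measure throughput to it (substitute the first node's IP)
iperf3 -c 192.168.1.10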
Installation
Proxmox
Install PVE and cluster the servers.
Ceph
Double check the Ceph repo is current by comparing what you have enabled:
grep ceph /etc/apt/sources.list.d/*
Against what PVE has available.
curl -s https://enterprise.proxmox.com/debian/ | grep ceph
If you don’t have the latest, take a look at Install PVE to update the Ceph repo.
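If you do need to move to a newer release, the repo file is a one-liner. A sketch for the no-subscription repo (the Ceph release name and Debian codename below are assumptions; match them to your PVE version):
# Hypothetical example: point apt at the ceph-squid no-subscription repo on Debian bookworm
echo "deb http://download.proxmox.com/debian/ceph-squid bookworm no-subscription" > /etc/apt/sources.list.d/ceph.list
apt update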
After that, log into the PVE web GUI, click on each server and, in that server’s submenu, click on Ceph. It will prompt for permission to install. When the setup window appears, select the newest version and the No-Subscription repository. You can also refer to the official notes.
The wizard will ask for some additional configuration options, for which you can take the defaults and finish. If you have dedicated Ceph network hardware, set it up with a separate IP range and choose that interface for both the public network and the cluster network.
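If you prefer the shell, pveceph init accepts the network settings directly; the subnet below is just a placeholder for your Ceph network:
# Initialize Ceph with both public and cluster traffic on the dedicated network
pveceph init --network 10.10.10.0/24 --cluster-network 10.10.10.0/24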
Configuration
Ceph uses multiple daemons on each node.
- Monitors, to keep track of what’s up and down.
- Managers, to gather performance data.
- OSDs, a service per disk to store and retrieve data.
- Metadata Servers, to handle file permissions and such.
To configure them, you will:
- Add a monitor and manager to each node
- Add each node’s disks as OSDs (Object Storage Devices)
- Add metadata servers to each node
- Create Pools, where you group OSDs and choose the level of resiliency
- Create a Filesystem
Monitors and Managers
Ceph recommends at least three monitors3 and manager processes4. To install, click on a server’s Ceph menu and select the Monitor menu entry. Add a monitor and manager to the first three nodes.
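The shell equivalent, should you prefer it, is a pair of pveceph commands run on each of those nodes:
# Run on each of the first three nodes
pveceph mon create
pveceph mgr create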
OSDs
The next step is to add disks - aka Object Storage Devices. Select a server from the left-hand menu, and from that server’s Ceph menu, select OSD. Then click Create: OSD, select the disks and click create. If any are missing (often because they still carry old partition or filesystem signatures), enter the server’s shell and run wipefs -a /dev/sdX on the disks in question.
If you have a lot of disks, you can also do this at the shell:
# Assuming you have 5 disks, starting at 'b'
# (echo prints the commands for review; remove it to actually create the OSDs)
for X in {b..f}; do echo pveceph osd create /dev/sd$X; done
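Once created, a quick sanity check from any node’s shell shows the OSDs grouped by host along with their usage:
# List OSDs by host, with size and usage
ceph osd df tree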
Metadata Servers
To add MDSs, click on the CephFS submenu for each server and click Create under the Metadata Servers section. Don’t create a CephFS yet, however.
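As with monitors, there’s a shell equivalent if you’d rather script it:
# Run on each node that should host a metadata server
pveceph mds create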
Pools and Resiliency
We are going to create two pools: a replicated pool for VMs and an erasure coded pool for bulk storage.
If you want to mix SSDs and HDDs see the storage tiering page before creating the pools. You’ll want to set up classes of storage and create pools based on that page.
Replicated Pool
On this pool we’ll use the default replication level that gives us three copies of everything. These are guaranteed at a per-host level. Lose any one or two hosts, no problem. But lose individual disks from all three hosts at the same time and you’re out of luck.
This is the GUI default so creation is easy. Navigate to a host and select Ceph –> Pools –> Create. The defaults are fine and all you need to do is give it a name, like “VMs”. You may notice there is already a .mgr pool. That’s created by the manager service and is safe to ignore.
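The shell equivalent is a single pveceph command; the --add_storages flag (assumed to match your pveceph version) also registers the pool as PVE storage, like the GUI’s checkbox:
# Create a replicated pool (3 copies by default) and add it as PVE storage
pveceph pool create VMs --add_storages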
If you only need storage for VMs and containers, you can actually stop here. You can create VMs directly on your new pool, and containers on local storage then migrate (Server –> Container –> Resources –> Root Disk –> Volume Action –> Target Storage).
Erasure Coded Pool
Erasure coding requires that you determine how many data and parity bits to use, and issue the create command in the terminal. For the first question, it’s pretty simple - if you have three servers you’ll need 2 data and 1 parity. The more systems you have, the more efficient you’ll be, though when you get to 6 you should probably increase your parity. Unlike the replicated pool, you can only lose one host with this level of protection.
Here’s the command5 for a 3 node system that can withstand one node loss (2,1). For a 4 node system you’d use (3,1) and so on. Increase the first number as your node count goes up, and the second as you desire more resilience. Ceph doesn’t require a ‘balanced’ cluster in terms of drives, but you’ll lose some capacity if you don’t have roughly the same amount of space on each node.
# k is for data and m is for parity
pveceph pool create POOLNAME --erasure-coding k=2,m=1 --pg_autoscale_mode on --application cephfs
Note that we specified the application in the command. If you don’t, you won’t be able to use the pool for a filesystem later on. We also set PGs (placement groups) to auto-scale; this is how Ceph chunks data as it gets sent to storage. If you know how much data you have to load, you can specify the starting number of PGs with the --pg_num parameter, which will make things a little faster for an initial copy. Redhat suggests6 (OSDs × 100) / (K + M). You’ll get a warning7 from Ceph if it’s not a power of 2 (2, 4, 8, 16, 32, 64, 128, 256, 512), so use the closest number, such as --pg_num 512.
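As a worked example of that formula (assuming a hypothetical cluster of 3 nodes with 4 OSDs each, using the k=2,m=1 profile above; passing --pg_num alongside the erasure-coding flags is assumed to behave as in the earlier command):
# (OSDs * 100) / (k + m) = (12 * 100) / (2 + 1) = 400 -> nearest power of two is 512
pveceph pool create POOLNAME --erasure-coding k=2,m=1 --pg_num 512 --application cephfs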
If you don’t know how much data you’re going to bring in, but expect it to be a lot, you can turn on the bulk flag rather than specifying pg_num.
# Make sure to add '-data' at the end
ceph osd pool set POOLNAME-data bulk true
When you look at the pools in the GUI you’ll see the command actually created two pools, one for data and one for metadata; the metadata pool is replicated, since metadata isn’t compatible with EC pools yet. You’ll also notice that you can put VMs and containers on this pool just like the replicated pool. It will just be slower.
Filesystem
The GUI won’t allow you to choose the erasure coded pool you just created so you’ll use the command line again. The name you pick for your Filesystem will be how it’s mounted.
ceph fs new FILE-SYSTEM-NAME POOLNAME-metadata POOLNAME-data --force
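You can confirm the filesystem and the pools backing it from any node:
# Show the new filesystem and which pools back it
ceph fs ls
ceph fs status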
To mount it cluster-wide, go back to the Web GUI and select Datacenter at the top left, then Storage. Click Add and select CephFS as the type. In the subsequent dialog box, put the name you’d like it mounted as in the ID field, such as “bulk” and leave the rest at their defaults.
You can now find it mounted at /mnt/pve/IDNAME, and you can bind mount it into your containers or set up NFS for your VMs.
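A sketch of the container bind mount (the container ID 101 and mount path are placeholders, and it assumes the CephFS storage was added with the ID “bulk”):
# Bind mount the cluster filesystem into container 101 at /mnt/bulk
pct set 101 -mp0 /mnt/pve/bulk,mp=/mnt/bulk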
Operation
Failure is Always an Option
Ceph defaults to a failure domain of ‘host’. That means you can lose a whole host, with all its disks, and continue operating. You may also survive losing individual disks from different hosts, as long as enough copies remain, but that is NOT guaranteed. After a short time, Ceph will re-establish redundancy as disks fail or hosts remain off-line, and it will re-adjust should they come back, though in both cases this can take some time.
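To see the hierarchy Ceph uses for that host failure domain, list the hosts and the OSDs under each:
# Show the CRUSH hierarchy: hosts and the OSDs that live on them
ceph osd tree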
Rebooting a host
When a host goes down, Ceph immediately panics and starts re-establishing resilience; when the host comes back up, it undoes that work. This is OK, but Redhat suggests avoiding the churn with a few steps.
On the node you want to reboot:
# Stop Ceph from marking OSDs out or rebalancing while the node is down
sudo ceph osd set noout
sudo ceph osd set norebalance
sudo reboot
# Log back in and check that the pgmap reports all pgs as normal (active+clean).
sudo ceph -s
# Continue on to the next node
sudo reboot
sudo ceph -s
# When done
sudo ceph osd unset noout
sudo ceph osd unset norebalance
# Perform a final status check to make sure the cluster reports HEALTH_OK:
sudo ceph status
Troubleshooting
Pool Creation Error
If you created a pool but left off the --application flag, it will be set to RBD by default. You’d have to change it from RBD to CephFS like so, for both the data and metadata pools:
ceph osd pool application enable srv-data cephfs --yes-i-really-mean-it
ceph osd pool application disable srv-data rbd --yes-i-really-mean-it
ceph osd pool application enable srv-metadata cephfs --yes-i-really-mean-it
ceph osd pool application disable srv-metadata rbd --yes-i-really-mean-it
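You can check what each pool is now tagged with:
# Show the enabled application(s) on each pool
ceph osd pool application get srv-data
ceph osd pool application get srv-metadata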
Cluster IP Address Change
If you want to change your IP addresses, you may be able to just change the public network in /etc/pve/ceph.conf and then destroy and recreate the daemons it lists. This worked for me, though I don’t know if it’s the right way; the OSD cluster network probably needs changing as well.
Based on https://www.reddit.com/r/Proxmox/comments/p7s8ne/change_ceph_network/
- https://docs.ceph.com/en/mimic/start/hardware-recommendations/#hard-disk-drives
- https://pve.proxmox.com/wiki/Full_Mesh_Network_for_Ceph_Server
- https://docs.ceph.com/en/reef/rados/configuration/mon-config-ref/#initial-members
- https://docs.ceph.com/en/reef/mgr/administrator/#high-availability
- https://docs.redhat.com/en/documentation/red_hat_ceph_storage/4/html/storage_strategies_guide/placement_groups_pgs#calculating_pg_count
- https://docs.ceph.com/en/reef/rados/operations/health-checks/#pool-pg-num-not-power-of-two