
Ceph

Ceph is a distributed object storage system that supports both replication and erasure coding of data. So you can design it to be fast or efficient. Or even both.

It’s complex compared to other solutions, but it’s also the main focus of Red Hat and other development. So it may eclipse other technologies just through adoption.

It also comes ‘baked-in’ with Proxmox. If you’re already using PVE it’s worth deploying over others, as long as you’re willing to devote the time to learning it.

1 - Ceph Proxmox

Overview

There are two main use cases:

  • Virtual Machines
  • Bulk Storage

They both provide High Availability. But VMs need speed, whereas bulk storage should be economical. What makes Ceph awesome is that you can do both - all with the same set of disks.

Preparation

Computers

Put 3 or 4 PCs on a LAN, each with at least a couple of HDDs in addition to a boot drive. 1 GB of RAM per TB of disk is recommended1 and Ceph will use it. You can get by with less RAM; things will just be slower.

Network

The speed of Ceph is essentially a third to a half of your network speed. With a 1 Gig NIC you’ll average around 60 MB/sec for file operations (multiple copies are being saved behind the scenes). This sounds terrible, but in reality it’s fine for a cluster that serves up small files and/or media streams. It will, however, take a long time to get data onto the system.

If you need more, install a secondary NIC for Ceph traffic. Do that before you configure the system if you can; doing it afterwards is hard. You can use a mesh config via the PVE docs2 or purchase a switch. Reasonable 2.5 Gb switches and NICs can now be had.
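
If you do add a dedicated NIC, a minimal sketch of the /etc/network/interfaces stanza on one node might look like this. The interface name enp2s0 and the 10.10.10.0/24 range are placeholders - use whatever matches your hardware and addressing plan.

# Hypothetical dedicated Ceph interface - adjust the name and address per node
auto enp2s0
iface enp2s0 inet static
        address 10.10.10.11/24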

Installation

Proxmox

Install PVE and cluster the servers.

Ceph

Double check the Ceph repo is current by comparing what you have enabled:

grep ceph  /etc/apt/sources.list.d/*

Against what PVE has available.

curl -s https://enterprise.proxmox.com/debian/ | grep ceph

If you don’t have the latest, take a look at Install PVE to update the Ceph repo.

After that, log into the PVE web GUI, click on each server and, in that server’s submenu, click on Ceph. It will prompt for permission to install. When the setup window appears, select the newest version and the No-Subscription repository. You can also refer to the official notes.

The wizard will ask some additional configuration options for which you can take the defaults and finish. If you have additional Ceph-specific network hardware, set it up with a separate IP range and choose that interface for both the public network and the cluster network.
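
For reference, the choices you make in the wizard end up in /etc/pve/ceph.conf. With a dedicated Ceph network the global section will contain lines roughly like these (both subnets are just examples):

# Excerpt from /etc/pve/ceph.conf - example subnets only
public_network = 192.168.1.0/24
cluster_network = 10.10.10.0/24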

Configuration

Ceph uses multiple daemons on each node.

  • Monitors, to keep track of what’s up and down.
  • Managers, to gather performance data.
  • OSDs, a service per disk to store and retrieve data.
  • Metadata Servers, to handle file permissions and such.

To configure them, you will:

  • Add a monitor and manager to each node
  • Add each node’s disks as OSDs (Object Storage Devices)
  • Add metadata servers to each node
  • Create Pools, where you group OSDs and choose the level of resiliency
  • Create a Filesystem

Monitors and Managers

Ceph recommends at least three monitors3 and manager processes4. To install, click on a server’s Ceph menu and select the Monitor menu entry. Add a monitor and manager to the first three nodes.
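
If you’d rather work from the shell, the pveceph tool can create the same services. A sketch, run on each of the first three nodes:

# Create a monitor and a manager on this node
pveceph mon create
pveceph mgr create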

OSDs

The next step is to add disks - aka Object Storage Devices. Select a server from the left-hand menu, and from that server’s Ceph menu, select OSD. Then click Create: OSD, select the disks and click create. If any disks are missing from the list, enter the server’s shell and issue the command wipefs -a /dev/sdX on the disks in question (this wipes them).

If you have a lot of disks, you can also do this at the shell

# Assuming you have 5 disks, starting at 'b'
# (the 'echo' only prints the commands as a dry run - remove it to actually create the OSDs)
for X in {b..f}; do echo pveceph osd create /dev/sd$X; done

Metadata Servers

To add MDSs, click on the CephFS submenu for each server and click Create under the Metadata Servers section. Don’t create a CephFS yet, however.
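
The shell equivalent, if you prefer it, is one command per node:

# Create a metadata server on this node
pveceph mds create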

Pools and Resiliency

We are going to create two pools: a replicated pool for VMs and an erasure coded pool for bulk storage.

If you want to mix SSDs and HDDs see the storage tiering page before creating the pools. You’ll want to set up classes of storage and create pools based on that page.

Replicated Pool

On this pool we’ll use the default replication level that gives us three copies of everything. These are guaranteed at a per-host level. Lose any one or two hosts, no problem. But lose individual disks from all three hosts at the same time and you’re out of luck.

This is the GUI default so creation is easy. Navigate to a host and select Ceph –> Pools –> Create. The defaults are fine and all you need do is give it a name, like “VMs”. You may notice there is already a .mgr pool. That’s created by the manager service and safe to ignore.

If you only need storage for VMs and containers, you can actually stop here. You can create VMs directly on your new pool, and containers on local storage and then migrate them (Server –> Container –> Resources –> Root Disk –> Volume Action –> Target Storage).
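
The same move can be scripted with pct; a sketch, assuming a container with ID 101 and the pool named “VMs” from above:

# Move container 101's root disk onto the Ceph pool 'VMs'
# (older PVE releases spell the subcommand 'move_volume')
pct move-volume 101 rootfs VMs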

Erasure Coded Pool

Erasure coding requires that you determine how many data and parity chunks to use, and issue the create command in the terminal. For the first question it’s pretty simple - if you have three servers you’ll need 2 data and 1 parity. The more systems you have, the more efficient you’ll be: with k=2,m=1 a third of your raw space goes to parity, with k=4,m=1 only a fifth does. Though when you get to around 6 nodes you should probably increase your parity. Unlike the replicated pool, you can only lose one host with this level of protection.

Here’s the command5 for a 3 node system that can withstand one node loss (2,1). For a 4 node system you’d use (3,1) and so on. Increase the first number as your node count goes up, and the second as you desire more resilience. Ceph doesn’t require a ‘balanced’ cluster in terms of drives, but you’ll lose some capacity if you don’t have roughly the same amount of space on each node.

# k is for data and m is for parity
pveceph pool create POOLNAME --erasure-coding k=2,m=1 --pg_autoscale_mode on --application cephfs

Note that we specified the application in the command. If you don’t, you won’t be able to use it for a filesystem later on. We also specified PGs (placement groups) as auto-scaling. This is how Ceph chunks data as it gets sent to storage. If you know how much data you have to load, you can specify the starting number of PGs with the --pg_num parameter. This will make things a little faster for an initial copy. Red Hat suggests6 (OSDs × 100) / (K + M); for example, with 15 OSDs and k=2,m=1 that’s 1500 / 3 = 500. You’ll get a warning7 from Ceph if the number isn’t a power of 2 (2, 4, 8, 16, 32, 64, 128, 256, 512), so use the closest one, such as --pg_num 512.

If you don’t know how much data you’re going to bring in, but expect it to be a lot, you can turn on the bulk flag rather than specifying pg_num.

# Make sure to add '-data' at the end
ceph osd pool set POOLNAME-data bulk true

When you look at the pools in the GUI you’ll see it actually created two pools, one for data and one for metadata, since metadata isn’t compatible with EC pools yet. You’ll also notice that you can put VMs and containers on this pool just like the replicated pool. It will just be slower.

Filesystem

The GUI won’t allow you to choose the erasure coded pool you just created so you’ll use the command line again. The name you pick for your Filesystem will be how it’s mounted.

ceph fs new FILE-SYSTEM-NAME POOLNAME-metadata POOLNAME-data --force

To mount it cluster-wide, go back to the Web GUI and select Datacenter at the top left, then Storage. Click Add and select CephFS as the type. In the subsequent dialog box, put the name you’d like it mounted as in the ID field, such as “bulk” and leave the rest at their defaults.

You can now find it mounted at /mnt/pve/IDNAME and you can bind mount it to your Containers or setup NFS for your VMs.
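
As a sketch of the bind mount approach, assuming a container with ID 101 and the “bulk” ID from above:

# Make the cluster-wide CephFS visible inside container 101 at /srv/bulk
pct set 101 -mp0 /mnt/pve/bulk,mp=/srv/bulk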

Operation

Failure is Always an Option

Ceph defaults to a failure domain of ‘host’. That means you can lose a whole host with all its disks and continue operating. You can also lose individual disks from different hosts and, as long as copies of the data remain, keep operating - but that is NOT guaranteed. After a short time, Ceph will re-establish resilience as disks fail or hosts remain off-line. Should they come back, it will re-adjust. Though in both cases this can take some time.
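
You can see the host and disk hierarchy these decisions are based on, and what’s currently up or down, with:

# Show the CRUSH tree of hosts and OSDs with their status
ceph osd tree

# Overall cluster health
ceph -s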

Rebooting a host

When a host goes down, Ceph immediately panics and starts re-establishing resilience. When the host comes back up, it starts moving everything back again. This is OK, but Red Hat suggests avoiding the churn with a few steps.

On the node you want to reboot:

sudo ceph osd set noout
sudo ceph osd set norebalance
sudo reboot

# Log back in and check that the pgmap reports all pgs as normal (active+clean). 
sudo ceph -s

# Continue on to the next node
sudo reboot
sudo ceph -s

# When done
sudo ceph osd unset noout
sudo ceph osd unset norebalance

# Perform a final status check to make sure the cluster reports HEALTH_OK:
sudo ceph status

Troubleshooting

Pool Creation Error

If you created a pool but left off the --application flag it will be set to RBD by default. You’d have to change it from RBD to CephFS like so, for both the data and metadata pools:

ceph osd pool application enable srv-data cephfs --yes-i-really-mean-it
ceph osd pool application disable srv-data rbd --yes-i-really-mean-it
ceph osd pool application enable srv-metadata cephfs --yes-i-really-mean-it
ceph osd pool application disable srv-metadata rbd --yes-i-really-mean-it

Cluster IP Address Change

If you want to change your IP addresses, you may be able to just change the public network in /etc/pve/ceph.conf and then destroy and recreate the monitors it lists. This worked. I don’t know if it’s good. I think the OSD cluster network needed to be changed as well.

Based on https://www.reddit.com/r/Proxmox/comments/p7s8ne/change_ceph_network/

2 - Ceph Tiering

Overview

If you have a mix of workloads you should create a mix of pools. Cache tiering is out1, so use a mix of NVMe, SSD, and HDD devices with rules to control which pool uses which class of device.

In this example, we’ll create a replicated SSD pool for our VMs, and an erasure coded HDD pool for our content and media files.

Initial Rules

When an OSD is added, its device class is automatically assigned. The typical ones are ssd or hdd. Either way, the default config will use them all as soon as you add them. Let’s change that by creating some additional rules.
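
You can check what classes were auto-detected with:

# List the device classes Ceph knows about
ceph osd crush class ls

# The CLASS column shows what each OSD was assigned
ceph osd tree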

Replicated Data

For replicated data, it’s as easy as creating a couple new rules and then migrating data, if you have any.

New System

If you haven’t yet created any pools, great! We can create the rules so they are available when creating pools. Add all your disks as OSDs (visit the PVE Ceph menu for each server in the cluster). Then add these rules at the command line of any PVE server to update the global config.

# Add rules for both types
#
# The format is
#    ceph osd crush rule create-replicated RULENAME default host CLASSTYPE
#
ceph osd crush rule create-replicated replicated_hdd_rule default host hdd
ceph osd crush rule create-replicated replicated_ssd_rule default host ssd

And you’re done! When you create a pool in the PVE GUI, click the advanced button and choose the appropriate CRUSH rule from the drop-down. Or you can create one now while you’re at the command line.

# Create a pool for your VMs on replicated SSD. Default replication is used (so 3 copies)
#  pveceph pool create POOLNAME --crush_rule RULENAME --pg_autoscale_mode on
pveceph pool create VMs --crush_rule  replicated_ssd_rule --pg_autoscale_mode on

Existing System

With an existing system you must migrate your data. If you haven’t added your SSDs yet, do so now. Ceph will start moving data onto them using the default rule, but we’ll apply a new rule that will take over.

# Add rules for both types
ceph osd crush rule create-replicated replicated_hdd_rule default host hdd
ceph osd crush rule create-replicated replicated_ssd_rule default host ssd

# If you've just added SSDs, apply the new rule right away to minimize the time spent waiting for data moves.
# Use the SSD or HDD rule as you prefer. In this example we're moving the 'VMs' pool to SSDs
ceph osd pool set VMs crush_rule replicated_ssd_rule

Erasure Coded Data

On A New System

EC data is a little different. You need a profile to describe the resilience and class, and Ceph manages the CRUSH rule directly. But you can have pveceph do this for you.

# Create a pool named 'Content' with 2 data and 1 parity chunk. Add --application cephfs as we're using this for file storage. The --crush_rule affects the metadata pool so it's on fast storage.
pveceph pool create Content --erasure-coding k=2,m=1,device-class=hdd --crush_rule  replicated_ssd_rule --pg_autoscale_mode on --application cephfs

You’ll notice separate pools for data and metadata were automatically created as the latter doesn’t support EC pools yet.

On An Existing System

Normally, you set device class as part of creating a profile and you cannot change the profile after creating the pool2. However, you can change the CRUSH rule and that’s all we need for changing the class.

# Create a new profile to base a CRUSH rule on. This one uses HDD
#  ceph osd erasure-code-profile set PROFILENAME crush-device-class=CLASS k=2 m=1
ceph osd erasure-code-profile set ec_hdd_2_1_profile crush-device-class=hdd k=2 m=1

# ceph osd crush rule create-erasure RULENAME PROFILENAME (from above)
ceph osd crush rule create-erasure erasure_hdd_rule ec_hdd_2_1_profile

# ceph osd pool set POOLNAME crush_rule RULENAME
ceph osd pool set Content-data crush_rule erasure_hdd_rule

Don’t forget about the metadata pool; it’s also a good time to turn on the bulk setting if you’re going to store a lot of data.

# Put the metadata pool on SSD for speed
ceph osd pool set Content-metadata crush_rule replicated_ssd_rule
ceph osd pool set Content-data bulk true

Other Notes

NVME

There are some reports that NVMe drives aren’t given a separate class from SSDs. You may need to create that class manually and turn off auto detection, though this is quite old information.
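
A sketch of doing that by hand - osd.0 is just an example ID:

# Optionally stop OSDs from re-detecting their class at startup
ceph config set osd osd_class_update_on_start false

# Clear the auto-assigned class and set 'nvme' manually
ceph osd crush rm-device-class osd.0
ceph osd crush set-device-class nvme osd.0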

Investigation

When investigating a system, you may want to drill down with these commands.

ceph osd lspools
ceph osd pool get VMs crush_rule
ceph osd crush rule dump replicated_ssd_rule
# or view rules with
ceph osd getcrushmap | crushtool -d -

Data Loading

The fastest way is to run a Ceph client at the source of the data, or at least separate the interfaces.

With a 1 Gb NIC, one of the Ceph storage servers mounting an external NFS share and copying data into CephFS:

  • 12 MB/sec

Same, but reversed, with the NFS server itself running the Ceph client and pushing the data:

  • 103 MB/sec

Creating and Destroying

https://dannyda.com/2021/04/10/how-to-completely-remove-delete-or-reinstall-ceph-and-its-configuration-from-proxmox-ve-pve/

# Adjust letters as needed
for X in {b..h}; do pveceph osd create /dev/sd${X};done
mkdir -p /var/lib/ceph/mon/
mkdir  /var/lib/ceph/osd

# Adjust numbers as needed
for X in {16..23};do systemctl stop ceph-osd@${X}.service;done
for X in {0..7}; do umount /var/lib/ceph/osd/ceph-$X;done
for X in {a..h}; do ceph-volume lvm zap /dev/sd$X --destroy;done

3 - Ceph Client

This assumes you already have a working cluster and a ceph file system.

Install

You need the Ceph software. You can use the cephadm tool, or add the repos and packages manually. You also need to pick a version by its release name: ‘Octopus’, ‘Nautilus’, etc.

sudo apt install software-properties-common gnupg2
wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -

# Discover the current release. PVE is a good place to check when at the command line
curl -s https://enterprise.proxmox.com/debian/ | grep ceph

# Note the release name after debian, 'debian-squid' in this example.
# Substitute your client's own Debian codename for 'bullseye' below.
sudo apt-add-repository 'deb https://download.ceph.com/debian-squid/ bullseye main'

sudo apt update; sudo apt install ceph-common -y

#
# Alternatively 
#

curl --silent --remote-name --location https://github.com/ceph/ceph/raw/squid/src/cephadm/cephadm
chmod +x cephadm
./cephadm add-repo --release squid
./cephadm install ceph-common

Configure

On a cluster member, generate a basic conf and keyring for the client

# for a client named 'minecraft'

ceph config generate-minimal-conf > /etc/ceph/minimal.ceph.conf

ceph-authtool --create-keyring /etc/ceph/ceph.client.minecraft.keyring --gen-key -n client.minecraft

You must add file system permissions by adding lines to the bottom of the keyring, then import it to the cluster.

nano  /etc/ceph/ceph.client.minecraft.keyring

# Allowing the client to read the root and write to the subdirectory '/srv/minecraft'
caps mds = "allow rwps path=/srv/minecraft"
caps mon = "allow r"
caps osd = "allow *"
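
After editing, the keyring should look roughly like this - the caps lines sit under the client section, and the key will be whatever ceph-authtool generated:

[client.minecraft]
        key = AQD...placeholder...==
        caps mds = "allow rwps path=/srv/minecraft"
        caps mon = "allow r"
        caps osd = "allow *"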

Import the keyring to the cluster and copy it to the client

ceph auth import -i /etc/ceph/ceph.client.minecraft.keyring
scp minimal.ceph.conf ceph.client.minecraft.keyring [email protected]:

On the client, copy the keyring and rename and move the basic config file.

ssh [email protected]

sudo cp ceph.client.minecraft.keyring /etc/ceph
sudo cp minimal.ceph.conf /etc/ceph/ceph.conf

Now, you may mount the filesystem

# the format is "User ID" @ "Cluster ID" . "Filesystem Name" = "/some/folder/on/the/server" "/some/place/on/the/client"
# You can get the cluster ID from your server's ceph.conf file and the filesystem name 
# with a ceph fs ls, if you don't already know it. It will be the part after name, as in "name: XXXX, ..."

sudo mount.ceph [email protected]=/srv/minecraft /mnt

You can add an entry to your fstab like so:

[email protected]=/srv/minecraft /mnt ceph noatime,_netdev    0       2

Troubleshooting

source mount path was not specified
unable to parse mount source: -22

You might have accidentally installed the distro’s older version of Ceph. The mount notation above is based on “quincy”, aka version 17.

ceph --version

  ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)

Try an apt remove --purge ceph-common and then apt update before trying an apt install ceph-common again.

unable to get monitor info from DNS SRV with service name: ceph-mon

Check your client’s ceph.conf. You may not have the right file in place

mount error: no mds server is up or the cluster is laggy

This is likely a problem with your client configuration file.

Sources

https://docs.ceph.com/en/quincy/install/get-packages/
https://knowledgebase.45drives.com/kb/creating-client-keyrings-for-cephfs/
https://docs.ceph.com/en/nautilus/cephfs/fstab/