Ceph Tiering
Overview
If you have a mix of workloads, you should create a mix of pools. Cache tiering is out, so use a mix of NVMes, SSDs, and HDDs, with CRUSH rules to control which pool uses which class of device.
In this example, we’ll create a replicated SSD pool for our VMs, and an erasure-coded HDD pool for our content and media files.
Initial Rules
When an OSD is added, its device class is automatically assigned. The typical classes are ssd and hdd. Either way, the default config will use them all as soon as you add them. Let’s change that by creating some additional rules.
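If you want to see what classes were detected before adding any rules, a quick check (assuming your OSDs are already in):
# List the device classes Ceph has detected
ceph osd crush class ls
# Show each OSD with its class in the CLASS column
ceph osd tree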
Replicated Data
For replicated data, it’s as easy as creating a couple new rules and then migrating data, if you have any.
New System
If you haven’t yet created any pools, great! We can create the rules so they’re available when creating pools. Add all your disks as OSDs (visit the PVE Ceph menu for each server in the cluster). Then add these rules at the command line of any PVE server to update the global config.
# Add rules for both types
#
# The format is
# ceph osd crush rule create-replicated RULENAME default host CLASSTYPE
#
ceph osd crush rule create-replicated replicated_hdd_rule default host hdd
ceph osd crush rule create-replicated replicated_ssd_rule default host ssd
And you’re done! When you create a pool in the PVE GUI, click the advanced button and choose the appropriate CRUSH rule from the drop-down. Or you can create one now while you’re at the command line.
# Create a pool for your VMs on replicated SSD. Default replication is used (so 3 copies)
# pveceph pool create POOLNAME --crush_rule RULENAME --pg_autoscale_mode on
pveceph pool create VMs --crush_rule replicated_ssd_rule --pg_autoscale_mode on
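If you want to confirm the new pool picked up the SSD rule, a quick check:
# Show which CRUSH rule the pool is using
ceph osd pool get VMs crush_rule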
Existing System
With an existing system you must migrate your data. If you haven’t added your SSDs yet, do so now. It will start moving data using the default rule, but we’ll apply a new rule that will take over.
# Add rules for both types
ceph osd crush rule create-replicated replicated_hdd_rule default host hdd
ceph osd crush rule create-replicated replicated_ssd_rule default host ssd
# If you've just added SSDs, apply the new rule right away to minimize the time spent waiting for data moves.
# Use the SSD or HDD rule as you prefer. In this example we're moving the VMs pool to SSDs
ceph osd pool set VMs crush_rule replicated_ssd_rule
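Changing the rule kicks off data movement. You can keep an eye on the rebalance with the usual status commands:
# Overall cluster health plus recovery/backfill progress
ceph -s
# Per-pool client and recovery I/O rates
ceph osd pool stats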
Erasure Coded Data
On A New System
EC data is a little different. You need a profile to describe the resilience and device class, and Ceph manages the CRUSH rule directly. But you can have pveceph do this for you.
# Create a pool named 'Content' with 2 data and 1 parity chunk. Add --application cephfs as we're
# using this for file storage. The --crush_rule affects the metadata pool, so it's on fast storage.
pveceph pool create Content --erasure-coding k=2,m=1,device-class=hdd --crush_rule replicated_ssd_rule --pg_autoscale_mode on --application cephfs
You’ll notice separate data and metadata pools were automatically created, as the metadata pool doesn’t support EC yet.
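To see the pair of pools pveceph created, along with the CRUSH rule and EC profile each one uses:
# Lists every pool with its crush_rule, erasure-code profile, and application
ceph osd pool ls detail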
On An Existing System
Normally, you set the device class as part of creating a profile, and you cannot change the profile after creating the pool. However, you can change the CRUSH rule, and that’s all we need for changing the class.
# Create a new profile to base a CRUSH rule on. This one uses HDD
# ceph osd erasure-code-profile set PROFILENAME crush-device-class=CLASS k=2 m=1
ceph osd erasure-code-profile set ec_hdd_2_1_profile crush-device-class=hdd k=2 m=1
# ceph osd crush rule create-erasure RULENAME PROFILENAME (from above)
ceph osd crush rule create-erasure erasure_hdd_rule ec_hdd_2_1_profile
# ceph osd pool set POOLNAME crush_rule RULENAME
ceph osd pool set Content-data crush_rule erasure_hdd_rule
Don’t forget about the metadata pool, and it’s a good time to turn on the bulk setting if you’re going to store a lot of data.
# Put the metadata pool on SSD for speed
ceph osd pool set Content-metadata crush_rule replicated_ssd_rule
ceph osd pool set Content-data bulk true
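If you want to double-check, these should show the new rule on the metadata pool and the bulk flag on the data pool:
# Verify the rule change and the bulk flag
ceph osd pool get Content-metadata crush_rule
ceph osd pool get Content-data bulk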
Other Notes
NVME
There are some reports that NVMe drives aren’t separated from SSDs and just end up in the ssd class. You may need to set the nvme class yourself and turn off auto-detection, though this is quite old information.
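If you do need to do it by hand, the general approach is to stop OSDs from re-applying their detected class at startup, then clear and reassign the class. A sketch (the OSD ID is an example):
# Keep OSDs from overwriting a manually set class when they start
ceph config set osd osd_class_update_on_start false
# Clear the auto-assigned class and set nvme on a given OSD
ceph osd crush rm-device-class osd.0
ceph osd crush set-device-class nvme osd.0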
Investigation
When investigating a system, you may want to drill down with these commands.
ceph osd lspools
ceph osd pool get VMs crush_rule
ceph osd crush rule dump replicated_ssd_rule
# or view rules with
ceph osd getcrushmap | crushtool -d -
Data Loading
The fastest way is to use a Ceph Client at the source of the data, or at least separate the interfaces.
With a 1Gb NIC, one of the Ceph storage servers mounted an external NFS export and copied the data into CephFS:
- 12 MB/sec
The same transfer reversed, with the NFS server itself running the Ceph client and pushing the data:
- 103 MB/sec
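As a rough sketch of the faster approach, the source server mounts CephFS with the kernel client and pushes the data itself. The monitor address, credentials, and paths below are placeholders:
# On the source (NFS) server: mount CephFS, then copy with rsync
mount -t ceph 192.168.1.10:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret
rsync -a --info=progress2 /srv/media/ /mnt/cephfs/media/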
Creating and Destroying
# Create OSDs in bulk. Adjust the drive letters as needed
for X in {b..h}; do pveceph osd create /dev/sd${X}; done
# Make sure these Ceph directories exist
mkdir -p /var/lib/ceph/mon/
mkdir -p /var/lib/ceph/osd
# Destroy OSDs: stop the services, unmount, and wipe the disks. Adjust numbers and letters as needed
for X in {16..23}; do systemctl stop ceph-osd@${X}.service; done
for X in {0..7}; do umount /var/lib/ceph/osd/ceph-$X; done
for X in {a..h}; do ceph-volume lvm zap /dev/sd$X --destroy; done