1 - Ceph Proxmox
Overview
There are two main use-cases:
- Virtual Machines
- Bulk Storage
Both provide High Availability, but VMs need speed whereas bulk storage should be economical. What makes Ceph awesome is that you can do both, all with the same set of disks.
Preparation
Computers
Put 3 or 4 PCs on a LAN, each with at least a couple of HDDs in addition to a boot drive. 1 GB of RAM per TB of disk is recommended, and Ceph will use it. You can get by with less RAM; it will just be slower.
Network
The speed of Ceph is essentially a third to half of your network speed. With a 1 Gb NIC you’ll average around 60 MB/s for file operations (multiple copies are being saved behind the scenes). This sounds terrible, but in reality it’s fine for a cluster that serves up small files and/or media streams. It will, however, take a long time to get data onto the system.
If you need more, install a secondary NIC for Ceph traffic. Do that before you configure the system, if you can. Doing it after is hard. You can use a mesh config via the PVE docs or purchase a switch. Reasonable 2.5 Gb switches and NICs can now be had.
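If you do add a dedicated Ceph NIC, it just needs a static address in its own subnet on each node. A minimal sketch for /etc/network/interfaces, where the interface name and address range are only examples:
# Dedicated Ceph interface (example name and range); repeat on each node with a unique address
auto enp2s0
iface enp2s0 inet static
        address 10.10.10.11/24
You’d then hand this range to the Ceph wizard as the cluster network (and optionally the public network).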
Installation
Proxmox
Install PVE and cluster the servers.
Ceph
Double check the Ceph repo is current by comparing what you have enabled:
grep ceph /etc/apt/sources.list.d/*
Against what PVE has available.
curl -s https://enterprise.proxmox.com/debian/ | grep ceph
If you don’t have the latest, take a look at Install PVE to update the Ceph repo.
After that, log into the PVE web GUI, click on each server, and in that server’s submenu, click on Ceph. It will prompt for permission to install. When the setup window appears, select the newest version and the No-Subscription repository. You can also refer to the official notes.
The wizard will ask some additional configuration questions, for which you can take the defaults and finish. If you have additional Ceph-specific network hardware, give it a separate IP range and choose that interface for both the public network and the cluster network.
Configuration
Ceph uses multiple daemons on each node.
- Monitors, to keep track of what’s up and down.
- Managers, to gather performance data.
- OSDs, a service per disk to store and retrieve data.
- Metadata Servers, to handle file permissions and such.
To configure them, you will:
- Add a monitor and manager to each node
- Add each node’s disks as OSDs (Object Storage Devices)
- Add metadata servers to each node
- Create Pools, where you group OSDs and choose the level of resiliency
- Create a Filesystem
Monitors and Managers
Ceph recommends at least three monitors and manager processes. To install, click on a server’s Ceph menu and select the Monitor menu entry. Add a monitor and manager to the first three nodes.
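If you prefer the shell, the same can be done with pveceph. A sketch, run on each of the first three nodes (the wizard may have already created the first monitor for you):
# Create a monitor and manager on this node
pveceph mon create
pveceph mgr create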
OSDs
The next step is to add disks, aka Object Storage Devices. Select a server from the left-hand menu, and from that server’s Ceph menu, select OSD. Then click Create: OSD, select the disks and click Create. If any are missing, enter the server’s shell and issue wipefs -a /dev/sdX against the disks in question.
If you have a lot of disks, you can also do this at the shell
# Assuming you have 5 disks, starting at 'b'
for X in {b..f}; do pveceph osd create /dev/sd$X; done
To add MDSs, click on the CephFS submenu for each server and click Create under the Metadata Servers section. Don’t create a CephFS yet, however.
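If you’d rather do this at the shell, a sketch of the equivalent:
# Run on each node that should host a metadata server
pveceph mds create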
Pools and Resiliency
We are going to create two pools: a replicated pool for VMs and an erasure coded pool for bulk storage.
If you want to mix SSDs and HDDs see the storage tiering page before creating the pools. You’ll want to set up classes of storage and create pools based on that page.
Replicated Pool
On this pool we’ll use the default replication level, which gives us three copies of everything, guaranteed at a per-host level. Lose any one or two hosts: no problem. But lose individual disks from all three hosts at the same time and you’re out of luck.
This is the GUI default, so creation is easy. Navigate to a host and select Ceph -> Pools -> Create. The defaults are fine and all you need to do is give it a name, like “VMs”. You may notice there is already a .mgr pool. That’s created by the manager service and is safe to ignore.
If you only need storage for VMs and containers, you can actually stop here. You can create VMs directly on your new pool, and create containers on local storage and then migrate them (Server -> Container -> Resources -> Root Disk -> Volume Action -> Target Storage).
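If you’d rather stay at the command line, the pool can also be created with pveceph. A sketch using the example name from above:
# Create a replicated pool named 'VMs' with the default 3 copies and auto-scaling PGs
pveceph pool create VMs --pg_autoscale_mode on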
Erasure Coded Pool
Erasure coding requires that you determine how many data and parity bits to use, and issue the create command in the terminal. The first part is pretty simple: if you have three servers you’ll need 2 data and 1 parity. The more systems you have, the more efficient you’ll be, though once you get to 6 you should probably increase your parity. Unlike the replicated pool, you can only lose one host with this level of protection.
Here’s the command for a 3-node system that can withstand one node loss (2,1). For a 4-node system you’d use (3,1), and so on. Increase the first number as your node count goes up, and the second as you desire more resilience. Ceph doesn’t require a ‘balanced’ cluster in terms of drives, but you’ll lose some capacity if you don’t have roughly the same amount of space on each node.
# k is for data and m is for parity
pveceph pool create POOLNAME --erasure-coding k=2,m=1 --pg_autoscale_mode on --application cephfs
Note that we specified the application in the command. If you don’t, you won’t be able to use it for a filesystem later on. We also set PGs (placement groups) to auto-scale. This is how Ceph chunks data as it gets sent to storage. If you know how much data you have to load, you can specify the starting number of PGs with the --pg_num parameter. This will make things a little faster for an initial copy. Red Hat suggests OSDs × 100 / (K+M). You’ll get a warning from Ceph if it’s not a power of 2 (2, 4, 8, 16, 32, 64, 128, 256, 512), so use the closest number, such as --pg_num 512.
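As a hypothetical worked example of that rule of thumb, say you have 12 OSDs with k=2 and m=1:
# 12 * 100 / (2 + 1) = 400, and the nearest power of 2 is 512
pveceph pool create POOLNAME --erasure-coding k=2,m=1 --pg_num 512 --application cephfs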
If you don’t know how much data you’re going to bring in, but expect it to be a lot, you can turn on the bulk flag rather than specifying pg_num.
# Make sure to add '-data' at the end
ceph osd pool set POOLNAME-data bulk true
When you look at the pools in the GUI you’ll see it actually created two: one for data and one for metadata, since metadata isn’t compatible with EC pools yet. You’ll also notice that you can put VMs and containers on this pool just like the replicated pool; it will just be slower.
Filesystem
The GUI won’t allow you to choose the erasure coded pool you just created so you’ll use the command line again. The name you pick for your Filesystem will be how it’s mounted.
ceph fs new FILE-SYSTEM-NAME POOLNAME-metadata POOLNAME-data --force
To mount it cluster-wide, go back to the Web GUI and select Datacenter at the top left, then Storage. Click Add and select CephFS as the type. In the subsequent dialog box, put the name you’d like it mounted as in the ID field, such as “bulk” and leave the rest at their defaults.
You can now find it mounted at /mnt/pve/IDNAME, and you can bind mount it to your containers or set up NFS for your VMs.
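For containers, a bind mount is one line with pct; the container ID and paths here are only placeholders:
# Bind-mount a CephFS directory into container 101 at /srv/media
pct set 101 -mp0 /mnt/pve/bulk/media,mp=/srv/media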
Operation
Failure is Always an Option
Ceph defaults to a failure domain of ‘host’. That means you can lose a whole host, with all its disks, and continue operating. You can also lose individual disks from different hosts, but continued operation is NOT guaranteed; if at least two copies of the data survive, you can continue operating as well. After a short time, Ceph will re-establish resilience as disks fail or hosts remain off-line, and it will re-adjust should they come back, though in both cases this can take some time.
Rebooting a host
When a host goes down, Ceph immediately panics and starts re-establishing resilience. When the host comes back up, Ceph starts moving data back again. This is OK, but Red Hat suggests avoiding the churn with a few steps.
On the node you want to reboot:
sudo ceph osd set noout
sudo ceph osd set norebalance
sudo reboot
# Log back in and check that the pgmap reports all pgs as normal (active+clean).
sudo ceph -s
# Continue on to the next node
sudo reboot
sudo ceph -s
# When done
sudo ceph osd unset noout
sudo ceph osd unset norebalance
# Perform a final status check to make sure the cluster reports HEALTH_OK:
sudo ceph status
Troubleshooting
Pool Creation Error
If you created a pool but left off the --application flag, it will be set to RBD by default. You’d have to change it from RBD to CephFS like so, for both the data and metadata pools:
ceph osd pool application enable srv-data cephfs --yes-i-really-mean-it
ceph osd pool application disable srv-data rbd --yes-i-really-mean-it
ceph osd pool application enable srv-metadata cephfs --yes-i-really-mean-it
ceph osd pool application disable srv-metadata rbd --yes-i-really-mean-it
Cluster IP Address Change
If you want to change your cluster’s IP addresses, you may be able to just change the public network in /etc/pve/ceph.conf and then destroy and recreate the monitors listed there. This worked for me, though I don’t know if it’s the recommended approach; the OSD cluster network likely needs to be changed as well.
Based on https://www.reddit.com/r/Proxmox/comments/p7s8ne/change_ceph_network/
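A rough sketch of what that involves; the address ranges are examples and NODENAME is a placeholder:
# /etc/pve/ceph.conf (excerpt) - update the ranges
public_network = 10.0.0.0/24
cluster_network = 10.10.10.0/24
# Monitors keep their old addresses, so destroy and recreate them one node at a time
pveceph mon destroy NODENAME
pveceph mon create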
2 - Ceph Tiering
Overview
If you have a mix of workloads, you should create a mix of pools. Cache tiering is deprecated, so instead use a mix of NVMes, SSDs, and HDDs with rules to control which pool uses which class of device.
In this example, we’ll create a replicated SSD pool for our VMs, and an erasure-coded HDD pool for our content and media files.
Initial Rules
When an OSD is added, its device class is automatically assigned; the typical ones are ssd and hdd. Either way, the default config will use them all as soon as you add them. Let’s change that by creating some additional rules.
Replicated Data
For replicated data, it’s as easy as creating a couple new rules and then migrating data, if you have any.
New System
If you haven’t yet created any pools, great! We can create the rules so they are available when creating pools. Add all your disks as OSDs (visit the Ceph menu for each server in the cluster), then add these rules at the command line of any PVE server to update the global config.
# Add rules for both types
#
# The format is
# ceph osd crush rule create-replicated RULENAME default host CLASSTYPE
#
ceph osd crush rule create-replicated replicated_hdd_rule default host hdd
ceph osd crush rule create-replicated replicated_ssd_rule default host ssd
And you’re done! When you create a pool in the PVE GUI, click the advanced button and choose the appropriate CRUSH rule from the drop-down. Or you can create one now while you’re at the command line.
# Create a pool for your VMs on replicated SSD. Default replication is used (so 3 copies)
# pveceph pool create POOLNAME --crush_rule RULENAME --pg_autoscale_mode on
pveceph pool create VMs --crush_rule replicated_ssd_rule --pg_autoscale_mode on
Existing System
With an existing system you must migrate your data. If you haven’t added your SSDs yet, do so now. Ceph will start moving data using the default rule, but we’ll apply a new rule that will take over.
# Add rules for both types
ceph osd crush rule create-replicated replicated_hdd_rule default host hdd
ceph osd crush rule create-replicated replicated_ssd_rule default host ssd
# If you've just added SSDs, apply the new rule right away to minimize the time spent waiting for data moves.
# Use the SSD or HDD rule as you prefer. In this example we're moving the 'VMs' pool to SSDs
ceph osd pool set VMs crush_rule replicated_ssd_rule
Erasure Coded Data
On A New System
EC data is a little different. You need a profile to describe the resilience and device class, and Ceph manages the CRUSH rule directly. But you can have pveceph do this for you.
# Create a pool named 'Content' with 2 data and 1 parity. Add --application cephfs as we're using this for file storage. The --crush_rule affects the metadata pool so it's on fast storage.
pveceph pool create Content --erasure-coding k=2,m=1,device-class=hdd --crush_rule replicated_ssd_rule --pg_autoscale_mode on --application cephfs
You’ll notice separate pools for data and metadata were automatically created, as metadata pools don’t support erasure coding yet.
On An Existing System
Normally, you set device class as part of creating a profile and you cannot change the profile after creating the pool. However, you can change the CRUSH rule and that’s all we need for changing the class.
# Create a new profile to base a CRUSH rule on. This one uses HDD
# ceph osd erasure-code-profile set PROFILENAME crush-device-class=CLASS k=2 m=1
ceph osd erasure-code-profile set ec_hdd_2_1_profile crush-device-class=hdd k=2 m=1
# ceph osd crush rule create-erasure RULENAME PROFILENAME (from above)
ceph osd crush rule create-erasure erasure_hdd_rule ec_hdd_2_1_profile
# ceph osd pool set POOLNAME crush_rule RULENAME
ceph osd pool set Content-data crush_rule erasure_hdd_rule
Don’t forget about the metadata pool. It’s also a good time to turn on the bulk setting if you’re going to store a lot of data.
# Put the metadata pool on SSD for speed
ceph osd pool set Content-metadata crush_rule replicated_ssd_rule
ceph osd pool set Content-data bulk true
Other Notes
NVME
There are some reports that NVMe drives aren’t separated from SSDs automatically. You may need to create that class yourself and turn off auto-detection, though this information is quite old.
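If you do need to fix a class by hand, it’s a matter of clearing the detected class and setting a new one; osd.12 is just an example ID:
# Clear the auto-detected class and assign 'nvme' manually
ceph osd crush rm-device-class osd.12
ceph osd crush set-device-class nvme osd.12
# To keep the class from being re-detected at startup, set 'osd_class_update_on_start = false' in ceph.conf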
Investigation
When investigating a system, you may want to drill down with these commands.
ceph osd lspools
ceph osd pool get VMs crush_rule
ceph osd crush rule dump replicated_ssd_rule
# or view rules with
ceph osd getcrushmap | crushtool -d -
Data Loading
The fastest way is to use a Ceph Client at the source of the data, or at least separate the interfaces.
- With a 1 Gb NIC, one of the Ceph storage servers mounted an external NFS share and copied data into CephFS.
- The same, but reversed: the NFS server itself ran the Ceph client and pushed the data.
Creating and Destroying
https://dannyda.com/2021/04/10/how-to-completely-remove-delete-or-reinstall-ceph-and-its-configuration-from-proxmox-ve-pve/
# Create OSDs (adjust letters as needed)
for X in {b..h}; do pveceph osd create /dev/sd${X}; done
# Recreate the service directories if they've been removed
mkdir -p /var/lib/ceph/mon/
mkdir -p /var/lib/ceph/osd
# Tear down: stop the OSD services (adjust numbers as needed), unmount them, and wipe the disks
for X in {16..23}; do systemctl stop ceph-osd@${X}.service; done
for X in {0..7}; do umount /var/lib/ceph/osd/ceph-$X; done
for X in {a..h}; do ceph-volume lvm zap /dev/sd$X --destroy; done
3 - Ceph Client
This assumes you already have a working cluster and a ceph file system.
Install
You need the Ceph client software. You can use the cephadm tool, or add the repos and packages manually. You also need to pick a version by its release name: Octopus, Nautilus, etc.
sudo apt install software-properties-common gnupg2
wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -
# Discover the current release. PVE is a good place to check when at the command line
curl -s https://enterprise.proxmox.com/debian/ | grep ceph
# Note the release name after debian, 'debian-squid' in this example.
sudo apt-add-repository 'deb https://download.ceph.com/debian-squid/ bullseye main'
sudo apt update; sudo apt install ceph-common -y
#
# Alternatively
#
curl --silent --remote-name --location https://github.com/ceph/ceph/raw/squid/src/cephadm/cephadm
chmod +x cephadm
./cephadm add-repo --release squid
./cephadm install ceph-common
On a cluster member, generate a basic conf and keyring for the client
# for a client named 'minecraft'
ceph config generate-minimal-conf > /etc/ceph/minimal.ceph.conf
ceph-authtool --create-keyring /etc/ceph/ceph.client.minecraft.keyring --gen-key -n client.minecraft
You must add file system permissions by adding lines to the bottom of the keyring, then import it to the cluster.
nano /etc/ceph/ceph.client.minecraft.keyring
# Allowing the client to read the root and write to the subdirectory '/srv/minecraft'
caps mds = "allow rwps path=/srv/minecraft"
caps mon = "allow r"
caps osd = "allow *"
Import the keyring to the cluster and copy it to the client
ceph auth import -i /etc/ceph/ceph.client.minecraft.keyring
scp minimal.ceph.conf ceph.client.minecraft.keyring [email protected]:
On the client, copy the keyring and rename and move the basic config file.
ssh [email protected]
sudo cp ceph.client.minecraft.keyring /etc/ceph
sudo cp minimal.ceph.conf /etc/ceph/ceph.conf
Now, you may mount the filesystem
# the format is "User ID" @ "Cluster ID" . "Filesystem Name" = "/some/folder/on/the/server" "/some/place/on/the/client"
# You can get the cluster ID from your server's ceph.conf file and the filesystem name
# with a ceph fs ls, if you don't already know it. It will be the part after name, as in "name: XXXX, ..."
sudo mount.ceph [email protected]=/srv/minecraft /mnt
You can add an entry to your fstab like so:
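A sketch using the same notation as the mount command above; substitute your own cluster ID and filesystem name, and adjust the options to taste:
# USER@CLUSTERID.FILESYSTEM-NAME=/path/on/server  /mount/point  ceph  options  dump  pass
minecraft@CLUSTERID.FILESYSTEM-NAME=/srv/minecraft /mnt ceph noatime,_netdev 0 0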
Troubleshooting
source mount path was not specified
unable to parse mount source: -22
You might have accidentally installed the distro’s older version of Ceph. The mount notation above is based on “quincy”, aka version 17.
ceph --version
ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)
Try an apt remove --purge ceph-common, then apt update, before trying apt install ceph-common again.
unable to get monitor info from DNS SRV with service name: ceph-mon
Check your client’s ceph.conf; you may not have the right file in place.
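The minimal config generated earlier should contain at least the cluster fsid and the monitor addresses, roughly like this (values are examples):
# /etc/ceph/ceph.conf - minimal client config, example values
[global]
fsid = 2d335d91-0000-0000-0000-000000000000
mon_host = 192.168.1.11 192.168.1.12 192.168.1.13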
mount error: no mds server is up or the cluster is laggy
This is likely a problem with your client file.
Sources
https://docs.ceph.com/en/quincy/install/get-packages/
https://knowledgebase.45drives.com/kb/creating-client-keyrings-for-cephfs/
https://docs.ceph.com/en/nautilus/cephfs/fstab/