Gluster
Gluster is a distributed file system that supports both replicated and dispersed data.
Supporting dispersed data is a differentiating feature. Only a few systems can distribute data in an erasure-coded or RAID-like fashion, making efficient use of space while still providing redundancy. Have 5 cluster members? Add one 'parity bit' for just a 20% overhead and you can lose a host. Add more parity if you like, at a similar incremental cost. Other systems require you to duplicate your data entirely, a 50% hit.
It's also generally perceived as less complex than competitors like Ceph, as it has fewer moving parts and is focused on file storage. And since it uses native filesystems, you can always access your data directly. Red Hat has ceased its corporate sponsorship, but the project is still quite active.
So if you just need file storage and you have a lot of data, use Gluster.
1 - Gluster on XCP-NG
Let's set up a distributed and dispersed example cluster. We'll use XCP-NG for this. This is similar to an erasure-coded Ceph pool.
Preparation
We use three hosts, each connected to a common network. With three we can disperse data enough to take one host at a time out of service. We use 4 disks on each host in this example, but any number will work as long as every host has the same number.
Network
Hostname Resolution
Gluster requires the hosts be resolvable by hostname. Verify all the hosts can ping each other by name. You may want to create a hosts file and copy it to all three to help.
If you have free ports on each server, consider using the second interface for storage, or a mesh network for better performance.
# Normal management and or guest network
192.168.1.1 xcp-ng-01.lan
192.168.1.2 xcp-ng-02.lan
192.168.1.3 xcp-ng-03.lan
# Storage network in a different subnet (if you have a second interface)
192.168.10.1 xcp-ng-01.storage.lan
192.168.10.2 xcp-ng-02.storage.lan
192.168.10.3 xcp-ng-03.storage.lan
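A quick loop to confirm name resolution before going further; this just assumes the hostnames from the example file above:
# Run from each host - every name should resolve and answer
for h in xcp-ng-0{1..3}.storage.lan; do
  ping -c 1 "$h" >/dev/null && echo "$h ok" || echo "$h FAILED"
done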
Firewall Rules
Gluster requires a few rules: one for the daemon itself and one per 'brick' (drive) on the server. Alternatively, you can simply give the cluster members carte blanche access. We'll show both approaches here. Add these to all cluster members.
vi /etc/sysconfig/iptables
# Note that the last line in the existing file is a REJECT. Make sure to insert these new rules BEFORE that line.
-A RH-Firewall-1-INPUT -p tcp -s xcp-ng-01.storage.lan -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -s xcp-ng-02.storage.lan -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -s xcp-ng-03.storage.lan -j ACCEPT
# Possibly for clients
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24008 -s client-01.storage.lan -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 49152:49156 -s client-01.storage.lan -j ACCEPT
OR
vi /etc/sysconfig/iptables
# The gluster daemon needs ports 24007 and 24008
# Individual bricks need ports starting at 49152. Add an additional port per brick.
# Here we have 49152-49155 open for 4 bricks.
# TODO - test this command
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp -s xcp-ng-01.storage.lan --dport 24007:24008 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp -s xcp-ng-01.storage.lan --dport 49152:49155 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp -s xcp-ng-02.storage.lan --dport 24007:24008 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp -s xcp-ng-02.storage.lan --dport 49152:49155 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp -s xcp-ng-03.storage.lan --dport 24007:24008 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp -s xcp-ng-03.storage.lan --dport 49152:49155 -j ACCEPT
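Whichever set of rules you pick, reload the firewall afterwards. On a stock XCP-NG dom0 the iptables service reads /etc/sysconfig/iptables, so something like this should work; verify the chain name matches your install:
# Reload the ruleset and check that the new entries are in place
systemctl restart iptables
iptables -L RH-Firewall-1-INPUT -n --line-numbers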
Disk
Gluster works with filesystems. This is convenient because if all else fails, you still have files on disks you can access. XFS is well regarded among Gluster admins, so we'll use that.
# Install the xfs programs
yum install -y xfsprogs
# Wipe the disks before using, then format the whole disk. Repeat for each disk
wipefs -a /dev/sda
mkfs.xfs /dev/sda
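Since the same two commands run on every disk, a small loop saves typing. This assumes the four data disks really are sda through sdd, as in the mount commands below; double-check with lsblk first so you don't wipe the wrong drive:
lsblk                       # confirm which disks are the data disks
for d in sda sdb sdc sdd; do
  wipefs -a /dev/$d
  mkfs.xfs -f /dev/$d       # -f overwrites any existing filesystem signature
done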
Let's mount those disks. The convention is to put them in /data, organized by volume. We'll use 'volume01' later in the config, so let's use that here as well.
On each server
# For 4 disks - Note, gluster likes to call them 'bricks'
mkdir -p /data/glusterfs/volume01/brick0{1,2,3,4}
mount /dev/sda /data/glusterfs/volume01/brick01
mount /dev/sdb /data/glusterfs/volume01/brick02
mount /dev/sdc /data/glusterfs/volume01/brick03
mount /dev/sdd /data/glusterfs/volume01/brick04
Add the appropriate config to your /etc/fstab so they mount at boot.
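A minimal sketch of those fstab entries, assuming the same device-to-brick mapping as above. Entries keyed on UUID= (taken from blkid) are more robust than /dev/sdX names, which can change between boots:
# /etc/fstab
/dev/sda  /data/glusterfs/volume01/brick01  xfs  defaults,noatime,nofail  0 0
/dev/sdb  /data/glusterfs/volume01/brick02  xfs  defaults,noatime,nofail  0 0
/dev/sdc  /data/glusterfs/volume01/brick03  xfs  defaults,noatime,nofail  0 0
/dev/sdd  /data/glusterfs/volume01/brick04  xfs  defaults,noatime,nofail  0 0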
Installation
A Note About Versions
XCP-NG is CentOS 7 based and provides GlusterFS v8 in its repo. That version went EOL in 2021. You can add the CentOS Storage Special Interest Group repo to get v9, but no currently supported version is installable.
# Not recommended
yum install centos-release-gluster --enablerepo=epel,base,updates,extras
# On each host
yum install -y glusterfs-server
systemctl enable --now glusterd
# On the first host
gluster peer probe xcp-ng-02.storage.lan
gluster peer probe xcp-ng-03.storage.lan
gluster pool list
UUID Hostname State
a103d6a5-367b-4807-be93-497b06cf1614 xcp-ng-02.storage.lan Connected
10bc7918-364d-4e4d-aa16-85c1c879963a xcp-ng-03.storage.lan Connected
d00ea7e3-ed94-49ed-b56d-e9ca4327cb82 localhost Connected
# Note - localhost will always show up for the host you're running the command on
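If any peer shows up as something other than Connected, gluster peer status gives a little more detail per peer:
# Run on any host; each peer should report "State: Peer in Cluster (Connected)"
gluster peer status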
Configuration
Gluster talks about data as being distributed and dispersed.
Distributed
# Distribute data amongst 3 servers, each with a single brick
gluster volume create MyVolume server1:/brick1 server2:/brick1 server3:/brick1
Any time you have more than one brick, the data is distributed. That can be across different disks on the same host, or across different hosts. There is no redundancy, however, and any loss of a disk is a loss of data.
Disperse
# Disperse data amongst 3 bricks, each on a different server
gluster volume create MyVolume disperse server1:/brick1 server2:/brick1 server3:/brick1
Dispersed is how you build redundancy across servers. Any one of these servers or bricks can fail and the data is safe.
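With three bricks gluster will pick a redundancy of 1 on its own, but you can spell it out. A sketch of the same volume with the redundancy count made explicit:
# 1 of the 3 bricks holds parity, so any single brick (or server) can be lost
gluster volume create MyVolume disperse 3 redundancy 1 \
  server1:/brick1 server2:/brick1 server3:/brick1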
# Disperse data amongst six bricks, but with some on the same server. Problem!
gluster volume create MyVolume disperse \
server1:/brick1 server2:/brick1 server3:/brick1 \
server1:/brick2 server2:/brick2 server3:/brick2
If you try to disperse your data across multiple bricks on the same server, you'll run into the problem of sub-optimal parity and see the error message:
Multiple bricks of a disperse volume are present on the same server. This setup is not optimal. Bricks should be on different nodes to have best fault tolerant configuration
Distributed-Disperse
# Disperse data into 3-brick subvolumes before distributing
gluster volume create MyVolume disperse 3 \
server1:/brick1 server2:/brick1 server3:/brick1 \
server1:/brick2 server2:/brick2 server3:/brick2
By specifying disperse COUNT you tell gluster to create a subvolume every COUNT bricks. In the above example, it's every three bricks, so two subvolumes get created from the six bricks. This ensures the parity is optimally handled as the data is distributed.
You can also take advantage of bash shell expansion, as below. Each line expands to one three-brick subvolume (one brick per host), repeated for each of the 4 bricks it will be distributed across.
gluster volume create volume01 disperse 3 \
xcp-ng-0{1..3}.storage.lan:/data/glusterfs/volume01/brick01/brick \
xcp-ng-0{1..3}.storage.lan:/data/glusterfs/volume01/brick02/brick \
xcp-ng-0{1..3}.storage.lan:/data/glusterfs/volume01/brick03/brick \
xcp-ng-0{1..3}.storage.lan:/data/glusterfs/volume01/brick04/brick
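The volume then has to be started before anything can mount it, and it's worth checking how the bricks were grouped into subvolumes:
gluster volume start volume01
gluster volume info volume01     # should show a 4 x (2 + 1) = 12 distributed-disperse layout
gluster volume status volume01   # confirms every brick process is online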
Operation
Mounting and Optimizing Volumes
# Mount the volume from any of the peers
mount -t glusterfs xcp-ng-01.storage.lan:/volume01 /mnt
# Cache file metadata on the client to cut down on round trips
gluster volume set volume01 group metadata-cache
# Speed up directory listings
gluster volume set volume01 performance.readdir-ahead on
gluster volume set volume01 performance.parallel-readdir on
# Cache lookups of files that don't exist (helps create-heavy workloads)
gluster volume set volume01 group nl-cache
gluster volume set volume01 nl-cache-positive-entry on
Adding to XCP-NG
# Mount the volume and create a subdirectory for XCP-NG to use as the SR
mount -t glusterfs xcp-ng-01.lan:/volume01/media.2 /root/mnt2/
mkdir mnt2/xcp-ng
# Create the shared SR, listing the other two peers as backup servers
xe sr-create content-type=user type=glusterfs name-label=GlusterSharedStorage shared=true \
device-config:server=xcp-ng-01.lan:/volume01/xcp-ng \
device-config:backupservers=xcp-ng-02.lan:xcp-ng-03.lan
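A quick check that the SR was actually created and is shared, using the name-label from above:
xe sr-list name-label=GlusterSharedStorage params=uuid,name-label,type,shared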
Scrub and Bitrot
Scrub is off by default. When you enable it, the scrub daemon begins "signing" files (calculating checksums); the filesystem's own parity isn't used. So if you enable it and immediately start a scrub, you will see many "Skipped files", as their checksums haven't been calculated yet.
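A sketch of enabling bitrot detection and running a scrub on the volume created earlier:
# Start the bitrot daemon; it signs files (stores checksums) in the background
gluster volume bitrot volume01 enable
# Kick off a scrub immediately instead of waiting for the scheduled run
gluster volume bitrot volume01 scrub ondemand
# Watch progress - recently written files show as skipped until they've been signed
gluster volume bitrot volume01 scrub status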
Client Installation
The FUSE client is recommended. The docs cover a .deb based install, but you can also install from the repo. On Debian:
sudo apt install lsb-release gnupg
OS=$(lsb_release --codename --short)
# Assuming the current version of gluster is 11
wget -O - https://download.gluster.org/pub/gluster/glusterfs/11/rsa.pub | sudo apt-key add -
echo deb [arch=amd64] https://download.gluster.org/pub/gluster/glusterfs/11/LATEST/Debian/${OS}/amd64/apt ${OS} main | sudo tee /etc/apt/sources.list.d/gluster.list
sudo apt update; sudo apt install glusterfs-client
You need quite a few options in the fstab for this to mount reliably at boot:
192.168.3.235:/volume01 /mnt glusterfs nofail,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10 0 0
How to reboot a node
You may find that your filesystem pauses while a node reboots. Take a look at the network ping timeout and see if setting it lower helps.
https://unix.stackexchange.com/questions/452939/glusterfs-graceful-reboot-of-brick
gluster volume set volume01 network.ping-timeout 5
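Roughly, a graceful reboot of one node then looks like this; the stop script ships with glusterfs-server, though the exact path can vary by package:
# On the node going down
systemctl stop glusterd
/usr/share/glusterfs/scripts/stop-all-gluster-processes.sh   # stops brick and self-heal processes
reboot
# Once it's back, confirm the bricks are online and let self-heal catch up
gluster volume status volume01
gluster volume heal volume01 info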
Using notes from https://www.youtube.com/watch?v=TByeZBT4hfQ