
Infrastructure

1 - Data Management

Data has unique challenges around quality, compliance, availability, and scalability that require specialized skills, tools, and practices distinct from core IT Infrastructure.

1.1 - Backup

A modern, non-enterprise backup solution for an individual client should be:

  • Non-generational (i.e. not have to rely on full and incremental chains)
  • De-Duplicated
  • Support Pruning (of old backups)
  • Support Cloud Storage (and encryption)
  • Open Source (Ideally)

For built-in solutions, Apple has Time Machine, Windows has File History (and Windows Backup), and Linux has…well, a lot of things.

Rsync is a perennial favorite, and a good (short) historical post on the evolution of rsync-style backups is over at Reddit. Though I hesitate to think of it as backup, because it meets none of the above features.

https://www.reddit.com/r/linux/comments/42feqz/i_asked_here_for_the_optimal_backup_solution_and/czbeuby

Duplicity is a well-established traditional solution that supports Amazon Cloud Drive, but it relies on generational methods, meaning a regular full backup is required. That’s resource intensive with large data sets.

Restic is interesting, but doesn’t work with many cloud providers, specifically Amazon Cloud Drive.

https://github.com/restic/restic/issues/212

Obnam and Borg are also interesting, but similarly fail with Amazon Cloud Drive.

restic panic: rename
https://www.bountysource.com/issues/23684796-storage-backend-amazon-cloud-drive

Duplicati supports ACD as long as you’re willing to add mono, though it’s still in beta as of this writing.

sudo /usr/lib/duplicati/Duplicati.Server.exe --webservice-port=8200 --webservice-interface=any

And some other background.

<http://silverskysoft.com/open-stack-xwrpr/2015/08/the-search-for-the-ideal-backup-tool-part-1-of-2/>
<http://changelog.complete.org/archives/9353-roundup-of-remote-encrypted-deduplicated-backups-in-linux>
<http://www.acronis.com/en-us/resource/tips-tricks/2004/generational-backup.html>

1.2 - Database

1.3 - Directory Services

1.3.1 - Apple and AD

Here’s the troubleshooting process.

Verify the DNS records according to Apple’s document.

DOMAIN=gattis.org
dns-sd -q _ldap._tcp.$DOMAIN SRV
dns-sd -q _kerberos._tcp.$DOMAIN SRV
dns-sd -q _kpasswd._tcp.$DOMAIN SRV
dns-sd -q _gc._tcp.$DOMAIN SRV

Ping the results. Then test for ports according to Microsoft’s document.

HOST=dc01.gattis.org
nc -z -v -u $HOST 88
nc -z -v -u $HOST 135
nc -z -v $HOST 135
nc -z -v -u $HOST 389
nc -z -v -u $HOST 445
nc -z -v $HOST 445
nc -z -v -u $HOST 464
nc -z -v $HOST 464
nc -z -v $HOST 3268
nc -z -v $HOST 3269
nc -z -v $HOST 53
nc -z -v -u $HOST 53
nc -z -v -u $HOST 123

A useful script looks like this:

#!/bin/bash

HOST=dc01.gattis.local
#HOST=dc02.gattis.local


## declare an array of the commands to run
declare -a COMMANDS=(\
"nc -z -u $HOST 88" 
"nc -z -u $HOST 135" 
"nc -z    $HOST 135" 
"nc -z -u $HOST 389" 
"nc -z -u $HOST 445" 
"nc -z    $HOST 445" 
"nc -z -u $HOST 464" 
"nc -z    $HOST 464" 
"nc -z    $HOST 3268" 
"nc -z    $HOST 3269" 
"nc -z    $HOST 53" 
"nc -z -u $HOST 53" 
"nc -z -u $HOST 123")

PIDS=""
for i in "${COMMANDS[@]}";do
    $i &
    PIDS+="$! "
done

1.3.2 - LDAP

sudo apt-get install libnss-ldap ldap-utils

# To get the attribute 'memberOf'

# Simple Bind with TLS
ldapsearch -v -x -Z -D "binduser@domain.local" -W -H ldap://ad.domain.local -b 'OU=People,DC=domain,DC=local' '(sAMAccountName=someuser)' memberOf

# older style
ldapsearch -v -D "binduser@domain.local" -w Passw0rd -H ldap://ad1.domain.local -b 'OU=People,DC=domain,DC=local' '(sAMAccountName=someuser)' memberOf

# Get all user accounts from AD created since 2010-07.
ldapsearch -v -x -Z -D "binduser@domain.local" -W -H ldap://ad1.domain.local -b 'DC=domain,DC=local' -E pr=1000/noprompt '(&(objectClass=user)(whenCreated>=20100701000000.0Z))' sAMAccountName description whenCreated > all

1.4 - File Systems

1.4.1 - BTRFS

1.4.1.1 - Kernel Updates for BTRFS

You may want to use newer BTRFS features on an older OS.

With Debian your choices are:

  • Install from backports
  • Install from release candidates
  • Install from generic
  • Build from source

Install from Backports

It’s often recommended to install from backports. These are newer versions of packages that have been explicitly taken out of testing and packaged for the stable release, i.e. if you’re running ‘buster’ you would install from buster-backports.

You could instead use the identifiers ‘stable’ and ’testing’, which peg you to whatever is current rather than to a specific release.

echo deb http://deb.debian.org/debian buster-backports main | sudo tee /etc/apt/sources.list.d/buster-backports.list
sudo apt update

# search for the most recent amd64 image.
sudo apt search -t buster-backports linux-image-5
sudo apt install -t buster-backports linux-image-5.2.0-0.bpo.3-amd64-unsigned
sudo apt install -t buster-backports btrfs-progs

Install from Release Candidate

If there’s no backport, you can install from the release candidate. These are simply upcoming versions of Debian that haven’t been released yet.

To install the kernel from the experimental version of Debian, add the repo to your sources and explicitly install the kernel (this is safe to add to your repos because experimental packages aren’t installed by default).

sudo su -c "echo deb http://deb.debian.org/debian experimental main > /etc/apt/sources.list.d/experimental.list"
sudo apt update
sudo apt -t experimental search linux-image
sudo apt -t experimental install linux-image-XXX
sudo apt -t experimental install btrfs-progs

Install from Generic

You can also download the packages and manually install.

Navigate to

http://kernel.ubuntu.com/~kernel-ppa/mainline/

And download, similar to this (from a very long time ago :-)

wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.0/linux-headers-5.0.0-050000_5.0.0-050000.201903032031_all.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.0/linux-headers-5.0.0-050000-generic_5.0.0-050000.201903032031_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.0/linux-image-unsigned-5.0.0-050000-generic_5.0.0-050000.201903032031_amd64.deb
wget https://kernel.ubuntu.com/~kernel-ppa/mainline/v5.0/linux-modules-5.0.0-050000-generic_5.0.0-050000.201903032031_amd64.deb

Troubleshooting

The following signatures couldn’t be verified because the public key is not available:

sudo apt-key adv --keyserver keyserver.ubuntu.com --recv-keys 648ACFD622F3D138

Sources

https://unix.stackexchange.com/questions/432406/install-the-latest-rc-kernel-on-debian
https://wiki.debian.org/HowToUpgradeKernel
https://www.tecmint.com/upgrade-kernel-in-ubuntu/
https://raspberrypi.stackexchange.com/questions/12258/where-is-the-archive-key-for-backports-debian-org/60051#60051

1.4.2 - Ceph

Ceph is a distributed object storage system that supports both replicated and erasure coding of data. So you can design it to be fast or efficient. Or even both.

It’s complex compared to other solutions, but it’s also the main focus of Red Hat and other development. So it may eclipse other technologies just through adoption.

It also comes ‘baked-in’ with Proxmox. If you’re already using PVE it’s worth deploying over others, as long as you’re willing to devote the time to learning it.

1.4.2.1 - Ceph Proxmox

Overview

There are two main use-cases;

  • Virtual Machines
  • Bulk Storage

They both provide High Availability. But VMs need speed whereas Bulk should be economical. What makes Ceph awesome is you can do both - all with the same set of disks.

Preparation

Computers

Put 3 or 4 PCs on a LAN. Have at least a couple HDDs in addition to a boot drive. 1G of RAM per TB of disk is recommended1 and it will use it. You can have less RAM, it will just be slower.

Network

The speed of Ceph is essentially a third or half your network speed. With a 1 Gig NIC you’ll average around 60 MB/sec for file operations (multiple copies are being saved behind the scenes). This sounds terrible, but in reality it’s fine for a cluster that serves up small files and/or media streams. It will, however, take a long time to get data onto the system.

If you need more, install a secondary NIC for Ceph traffic. Do that before you configure the system, if you can. Doing it after is hard. You can use a mesh config via the PVE docs2 or purchase a switch. Reasonable 2.5 Gb switches and NICs can now be had.

Installation

Proxmox

Install PVE and cluster the servers.

Ceph

Double check the Ceph repo is current by comparing what you have enabled:

grep ceph  /etc/apt/sources.list.d/*

Against what PVE has available.

curl -s https://enterprise.proxmox.com/debian/ | grep ceph

If you don’t have the latest, take a look at Install PVE to update the Ceph repo.

After that, log into the PVE web GUI, click on each server, and in that server’s submenu, click on Ceph. It will prompt for permission to install. When the setup window appears, select the newest version and the No-Subscription repository. You can also refer to the official notes.

The wizard will ask some additional configuration options for which you can take the defaults and finish. If you have additional Ceph-specific network hardware, set it up with a separate IP range and choose that interface for both the public network and cluster network.

Configuration

Ceph uses multiple daemons on each node.

  • Monitors, to keep track of what’s up and down.
  • Managers, to gather performance data.
  • OSDs, a service per disk to store and retrieve data.
  • Metadata Servers, to handle file permissions and such.

To configure them, you will:

  • Add a monitor and manager to each node
  • Add each node’s disks as OSDs (Object Storage Devices)
  • Add metadata servers to each node
  • Create Pools, where you group OSDs and choose the level of resiliency
  • Create a Filesystem

Monitors and Managers

Ceph recommends at least three monitors3 and manager processes4. To install, click on a server’s Ceph menu and select the Monitor menu entry. Add a monitor and manager to the first three nodes.
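
If you’d rather use the shell than the GUI, pveceph can do the same thing; a quick sketch:

# Run on each of the first three nodes
pveceph mon create
pveceph mgr create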

OSDs

The next step is to add disks - aka Object Storage Devices. Select a server from the left-hand menu, and from that server’s Ceph menu, select OSD. Then click Create: OSD, select the disks and click create. If they are missing, enter the server’s shell and issue the command wipefs -a /dev/sdX on the disks in question.

If you have a lot of disks, you can also do this at the shell

# Assuming you have 5 disks, starting at 'b'
# (remove the 'echo' to actually run the commands rather than just print them)
for X in {b..f}; do echo pveceph osd create /dev/sd$X; done

Metadata Servers

To add MDSs, click on the CephFS submenu for each server and click Create under the Metadata Servers section. Don’t create a CephFS yet, however.
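
The shell equivalent, if you prefer it, is roughly:

# Run on each node that should host a metadata server
pveceph mds create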

Pools and Resiliency

We are going to create two pools; a replicated pool for VMs and an erasure coded pool for bulk storage.

If you want to mix SSDs and HDDs see the storage tiering page before creating the pools. You’ll want to set up classes of storage and create pools based on that page.

Replicated Pool

On this pool we’ll use the default replication level that gives us three copies of everything. These are guaranteed at a per-host level. Lose any one or two hosts, no problem. But lose individual disks from all three hosts at the same time and you’re out of luck.

This is the GUI default so creation is easy. Navigate to a host and select Ceph –> Pools –> Create. The defaults are fine and all you need do is give it a name, like “VMs”. You may notice there is already a .mgr pool. That’s created by the manager service and safe to ignore.
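
If you prefer the command line, something like this sketch should give the same result (the pool name is just an example):

# Replicated pool with the default 3 copies
pveceph pool create VMs --pg_autoscale_mode on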

If you only need storage for VMs and containers, you can actually stop here. You can create VMs directly on your new pool, and containers on local storage then migrate (Server –> Container –> Resources –> Root Disk -> Volume Action –> Target Storage)

Erasure Coded Pool

Erasure coding requires that you determine how many data and parity bits to use, and issue the create command in the terminal. For the first question it’s pretty simple - if you have three servers you’ll need 2 data and 1 parity. The more systems you have, the more efficient you’ll be, though when you get to 6 you should probably increase your parity. Unlike the replicated pool, you can only lose one host with this level of protection.

Here’s the command5 for a 3 node system that can withstand one node loss (2,1). For a 4 node system you’d use (3,1) and so on. Increase the first number as your node count goes up, and the second as you desire more resilience. Ceph doesn’t require a ‘balanced’ cluster in terms of drives, but you’ll lose some capacity if you don’t have roughly the same amount of space on each node.

# k is for data and m is for parity
pveceph pool create POOLNAME --erasure-coding k=2,m=1 --pg_autoscale_mode on --application cephfs

Note that we specified application in the command. If you don’t, you won’t be able to use it for a filesystem later on. We also specified PGs (placement groups) as auto-scaling. This is how Ceph chunks data as it gets sent to storage. If you know how much data you have to load, you can specify the starting number of PGs with the --pg_num parameter. This will make things a little faster for an initial copy. Red Hat suggests6 (number of OSDs * 100) / (K+M). You’ll get a warning7 from Ceph if it’s not a power of 2 (2, 4, 8, 16, 32, 64, 128, 256, 512) so use the closest such number, such as --pg_num 512.
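
As a worked example (the numbers are hypothetical): with 12 OSDs and k=2, m=1 that’s 12*100/(2+1) = 400, and the nearest power of 2 is 512, so:

# 12 OSDs, k=2, m=1  ->  12*100/(2+1) = 400  ->  round to 512
pveceph pool create POOLNAME --erasure-coding k=2,m=1 --pg_num 512 --application cephfs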

If you don’t know how much data you’re going to bring in, but expect it to be a lot, you can turn on the bulk flag rather than specifying pg_num.

# Make sure to add '-data' at the end
ceph osd pool set POOLNAME-data bulk true

When you look at the pools in the GUI you’ll see it actually created two pools, one for data and one for metadata (the metadata pool isn’t compatible with EC yet). You’ll also notice that you can put VMs and containers on this pool just like the replicated pool. It will just be slower.

Filesystem

The GUI won’t allow you to choose the erasure coded pool you just created so you’ll use the command line again. The name you pick for your Filesystem will be how it’s mounted.

ceph osd pool set POOLNAME-data bulk true
ceph fs new FILE-SYSTEM-NAME POOLNAME-metadata POOLNAME-data --force

To mount it cluster-wide, go back to the Web GUI and select Datacenter at the top left, then Storage. Click Add and select CephFS as the type. In the subsequent dialog box, put the name you’d like it mounted as in the ID field, such as “bulk” and leave the rest at their defaults.

You can now find it mounted at /mnt/pve/IDNAME and you can bind mount it to your Containers or setup NFS for your VMs.
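
As an example of the bind mount approach, from the PVE shell (the container ID and mount point are examples):

# Bind the cluster-wide CephFS mount into container 101 at /mnt/bulk
pct set 101 -mp0 /mnt/pve/bulk,mp=/mnt/bulk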

Operation

Failure is Always an Option

Ceph defaults to a failure domain of ‘host’. That means you can lose a whole host with all its disks and continue operating. You can also lose individual disks from different hosts, but then operation is NOT guaranteed - if you still have two copies you may well continue operating. After a short time, Ceph will re-establish resilience as disks fail or hosts remain off-line. Should they come back, it will re-adjust. In both cases this can take some time.

Rebooting a host

When a host goes down, Ceph immediately panics and starts re-establishing resilience. When the host comes back up, it starts moving data back again. This is OK, but Red Hat suggests avoiding the churn with a few steps.

On the node you want to reboot:

sudo ceph osd set noout
sudo ceph osd set norebalance
sudo reboot

# Log back in and check that the pgmap reports all pgs as normal (active+clean). 
sudo ceph -s

# Continue on to the next node
sudo reboot
sudo ceph -s

# When done
sudo ceph osd unset noout
sudo ceph osd unset norebalance

# Perform a final status check to make sure the cluster reports HEALTH_OK:
sudo ceph status

Troubleshooting

Pool Creation Error

If you created a pool but left off the --application flag it will be set to RBD by default. You’d have to change it from RBD to CephFS like so, for both the data and metadata:

ceph osd pool application enable srv-data cephfs --yes-i-really-mean-it
ceph osd pool application disable srv-data rbd --yes-i-really-mean-it
ceph osd pool application enable srv-metadata cephfs --yes-i-really-mean-it
ceph osd pool application disable srv-metadata rbd --yes-i-really-mean-it

Cluster IP Address Change

If you want to change your IP addresses, you may be able to just change the public network in /etc/pve/ceph.conf and then destroy and recreate what it says. This worked for me, though I don’t know if it’s the right way. I think the OSD cluster network needed to be changed also.

Based on https://www.reddit.com/r/Proxmox/comments/p7s8ne/change_ceph_network/

1.4.2.2 - Ceph Tiering

Overview

If you have a mix of workloads you should create a mix of pools. Cache tiering is out1. So use a mix of NVMEs, SSDs, and HDDs with rules to control what pool uses what class of device.

In this example, we’ll create a replicated SSD pool for our VMs, and an erasure coded HDD pool for our content and media files.

Initial Rules

When an OSD is added, its device class is automatically assigned. The typical ones are ssd or hdd. Either way, the default config will use them all as soon as you add them. Let’s change that by creating some additional rules.

Replicated Data

For replicated data, it’s as easy as creating a couple new rules and then migrating data, if you have any.

New System

If you haven’t yet created any pools, great! We can create the rules so they are available when creating pools. Add all your disks as OSDs (visit the PVE Ceph menu for each server in the cluster). Then add these rules at the command line of any PVE server to update the global config.

# Add rules for both types
#
# The format is
#    ceph osd crush rule create-replicated RULENAME default host CLASSTYPE
#
ceph osd crush rule create-replicated replicated_hdd_rule default host hdd
ceph osd crush rule create-replicated replicated_ssd_rule default host ssd

And you’re done! When you create a pool in the PVE GUI, click the advanced button and choose the appropriate CRUSH rule from the drop-down. Or you can create one now while you’re at the command line.

# Create a pool for your VMs on replicated SSD. Default replication is used (so 3 copies)
#  pveceph pool create POOLNAME --crush_rule RULENAME --pg_autoscale_mode on
pveceph pool create VMs --crush_rule  replicated_ssd_rule --pg_autoscale_mode on

Existing System

With an existing system you must migrate your data. If you haven’t added your SSDs yet, do so now. It will start moving data using the default rule, but we’ll apply a new rule that will take over.

# Add rules for both types
ceph osd crush rule create-replicated replicated_hdd_rule default host hdd
ceph osd crush rule create-replicated replicated_ssd_rule default host ssd

# If you've just added SSDs, apply the new rule right away to minimize the time spent waiting for data moves.
# Use the SSD or HDD rule as you prefer. In this example we're moving POOLNAME to SSDs
ceph osd pool set VMs crush_rule replicated_ssd_rule

Erasure Coded Data

On A New System

EC data is a little different. You need a profile to describe the resilience and class, and Ceph manages the CRUSH rule directly. But you can have pveceph do this for you.

# Create a pool named 'Content' with 2 data and 1 parity. Add --application cephfs as we're using this for file storage. The --crush_rule affects the metadata pool so it's on fast storage.
pveceph pool create Content --erasure-coding k=2,m=1,device-class=hdd --crush_rule  replicated_ssd_rule --pg_autoscale_mode on --application cephfs

You’ll notice separate pools for data and metadata were automatically created as the latter doesn’t support EC pools yet.

On An Existing System

Normally, you set device class as part of creating a profile and you cannot change the profile after creating the pool2. However, you can change the CRUSH rule and that’s all we need for changing the class.

# Create a new profile to base a CRUSH rule on. This one uses HDD
#  ceph osd erasure-code-profile set PROFILENAME crush-device-class=CLASS k=2 m=1
ceph osd erasure-code-profile set ec_hdd_2_1_profile crush-device-class=hdd k=2 m=1

# ceph osd crush rule create-erasure RULENAME PROFILENAME (from above)
ceph osd crush rule create-erasure erasure_hdd_rule ec_hdd_2_1_profile

# ceph osd pool set POOLNAME crush_rule RULENAME
ceph osd pool set Content-data crush_rule erasure_hdd_rule

Don’t forget about the metadata pool, and it’s a good time to turn on the bulk setting if you’re going to store a lot of data.

# Put the metadata pool on SSD for speed
ceph osd pool set Content-metadata crush_rule replicated_ssd_rule
ceph osd pool set Content-data bulk true

Other Notes

NVME

There are reports that NVMe drives aren’t separated from SSDs automatically. You may need to create that class and turn off auto detection, though this is quite old information.
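
If you run into that, the general approach is to clear the detected class and set it yourself - a sketch, with example OSD IDs:

# Stop OSDs from re-assigning their class at startup
ceph config set osd osd_class_update_on_start false

# Clear the detected class and assign 'nvme' by hand
ceph osd crush rm-device-class osd.0 osd.1
ceph osd crush set-device-class nvme osd.0 osd.1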

Investigation

When investigating a system, you may want to drill down with these commands.

ceph osd lspools
ceph osd pool get VMs crush_rule
ceph osd crush rule dump replicated_ssd_rule
# or view rules with
ceph osd getcrushmap | crushtool -d -

Data Loading

The fastest way is to use a Ceph Client at the source of the data, or at least separate the interfaces.

With a 1Gb NIC, one of the Ceph storage servers connected to an external NFS server and copying data to CephFS:

  • 12 MB/sec

Same, but reversed, with the NFS server itself running the Ceph client and pushing the data:

  • 103 MB/sec

Creating and Destroying

https://dannyda.com/2021/04/10/how-to-completely-remove-delete-or-reinstall-ceph-and-its-configuration-from-proxmox-ve-pve/

# Adjust letters as needed
for X in {b..h}; do pveceph osd create /dev/sd${X};done
mkdir -p /var/lib/ceph/mon/
mkdir  /var/lib/ceph/osd

# Adjust numbers as needed
for X in {16..23};do systemctl stop ceph-osd@${X}.service;done
for X in {0..7}; do umount /var/lib/ceph/osd/ceph-$X;done
for X in {a..h}; do ceph-volume lvm zap /dev/sd$X --destroy;done

1.4.2.3 - Ceph Client

This assumes you already have a working cluster and a ceph file system.

Install

You need the Ceph software. You can use the cephadm tool, or add the repos and packages manually. You also need to pick what version by its release name: ‘Octopus’, ‘Nautilus’, etc.

sudo apt install software-properties-common gnupg2
wget -q -O- 'https://download.ceph.com/keys/release.asc' | sudo apt-key add -

# Discover the current release. PVE is a good place to check when at the command line
curl -s https://enterprise.proxmox.com/debian/ | grep ceph

# Note the release name after debian, 'debian-squid' in this example.
sudo apt-add-repository 'deb https://download.ceph.com/debian-squid/ bullseye main'

sudo apt update; sudo apt install ceph-common -y

#
# Alternatively 
#

curl --silent --remote-name --location https://github.com/ceph/ceph/raw/squid/src/cephadm/cephadm
chmod +x cephadm
./cephadm add-repo --release squid
./cephadm install ceph-common

Configure

On a cluster member, generate a basic conf and keyring for the client

# for a client named 'minecraft'

ceph config generate-minimal-conf > /etc/ceph/minimal.ceph.conf

ceph-authtool --create-keyring /etc/ceph/ceph.client.minecraft.keyring --gen-key -n client.minecraft

You must add file system permissions by adding lines to the bottom of the keyring, then import it to the cluster.

nano  /etc/ceph/ceph.client.minecraft.keyring

# Allowing the client to read the root and write to the subdirectory '/srv/minecraft'
caps mds = "allow rwps path=/srv/minecraft"
caps mon = "allow r"
caps osd = "allow *"

Import the keyring to the cluster and copy it to the client

ceph auth import -i /etc/ceph/ceph.client.minecraft.keyring
scp minimal.ceph.conf ceph.client.minecraft.keyring someuser@minecraft.some.lan:

On the client, copy the keyring and rename and move the basic config file.

ssh someuser@minecraft.some.lan

sudo cp ceph.client.minecraft.keyring /etc/ceph
sudo cp minimal.ceph.conf /etc/ceph/ceph.conf

Now, you may mount the filesystem

# the format is "User ID" @ "Cluster ID" . "Filesystem Name" = "/some/folder/on/the/server" "/some/place/on/the/client"
# You can get the cluster ID from your server's ceph.conf file and the filesystem name 
# with a ceph fs ls, if you don't already know it. It will be the part after name, as in "name: XXXX, ..."

sudo mount.ceph minecraft@CLUSTER_ID.FILESYSTEM_NAME=/srv/minecraft /mnt

You can add an entry to your fstab like so

minecraft@CLUSTER_ID.FILESYSTEM_NAME=/srv/minecraft /mnt ceph noatime,_netdev    0       2

Troubleshooting

source mount path was not specified
unable to parse mount source: -22

You might have accidentally installed the distro’s older version of ceph. The mount notation above is based on “quincy” aka ver 17

ceph --version

  ceph version 17.2.3 (dff484dfc9e19a9819f375586300b3b79d80034d) quincy (stable)

Try an apt remove --purge ceph-common and then apt update before trying a apt install ceph-common again.

unable to get monitor info from DNS SRV with service name: ceph-mon

Check your client’s ceph.conf. You may not have the right file in place

mount error: no mds server is up or the cluster is laggy

This is likely a problem with your client file.

Sources

https://docs.ceph.com/en/quincy/install/get-packages/
https://knowledgebase.45drives.com/kb/creating-client-keyrings-for-cephfs/
https://docs.ceph.com/en/nautilus/cephfs/fstab/

1.4.3 - Command Line Full Text

For occasional searches you’d use grep. Something like:

grep --ignore-case --files-with-matches --recursive foo /some/file/path

This can take quite a while. There are some tweaks to grep you can add, but for source code, ack is traditional. Even faster is ag, a.k.a. The Silver Searcher (get it? Ag is silver in the periodic table… and possibly a play on words with The Silver Surfer).

apt install  silversearcher-ag

# Almost a drop in for grep
ag --ignore-case --files-with-matches --recurse foo /some/file/path

You’d think an index would be great - but then you realize that for unstructured text (i.e. full text searching) you have to build an index of every word in every file, and the index ends up larger than the contents.

Though Lucene/Elasticsearch and Sphinx come up in conversation.

https://github.com/ggreer/the_silver_searcher

1.4.4 - Gluster

Gluster is a distributed file system that supports both replicated and dispersed data.

Supporting dispersed data is a differentiating feature. Only a few systems can distribute the data in an erasure-coded or RAID-like fashion, making efficient use of space while providing redundancy. Have 5 cluster members? Add one ‘parity bit’ for just a 20% overhead and you can lose a host. Add more parity if you like, at incremental cost. Other systems require you to duplicate your data for a 50% hit.

It’s also generally perceived as less complex than competitors like Ceph, as it has fewer moving parts and is focused on file storage. And since it uses native filesystems, you can always access your data directly. Red Hat has ceased its corporate sponsorship, but the project is still quite active.

So if you just need file storage and you have a lot of data, use Gluster.

1.4.4.1 - Gluster on XCP-NG

Let’s set up a distributed and dispersed example cluster. We’ll use XCP-NG for this. This is similar to an erasure-coded Ceph pool.

Preparation

We use three hosts, each connected to a common network. With three we can disperse data enough to take one host at a time out of service. We use 4 disks on each host in this example but any number will work as long as they are all the same.

Network

Hostname Resolution

Gluster requires1 the hosts be resolvable by hostname. Verify all the hosts can ping each other by name. You may want to create a hosts file and copy to all three to help.

If you have free ports on each server, consider using the second interface for storage, or a mesh network for better performance.

# Normal management and or guest network
192.168.1.1 xcp-ng-01.lan
192.168.1.2 xcp-ng-02.lan
192.168.1.3 xcp-ng-03.lan

# Storage network in a different subnet (if you have a second interface)
192.168.10.1 xcp-ng-01.storage.lan
192.168.10.2 xcp-ng-02.storage.lan
192.168.10.3 xcp-ng-03.storage.lan

Firewall Rules

Gluster requires a few rules: one for the daemon itself and one per ‘brick’ (drive) on the server. You can also just give the cluster members carte blanche access. We’ll do both examples here. Add these to all cluster members.

vi /etc/sysconfig/iptables
# Note that the last line in the existing file is a REJECT. Make sure to insert these new rules BEFORE that line.
-A RH-Firewall-1-INPUT -p tcp -s xcp-ng-01.storage.lan -j ACCEPT 
-A RH-Firewall-1-INPUT -p tcp -s xcp-ng-02.storage.lan -j ACCEPT 
-A RH-Firewall-1-INPUT -p tcp -s xcp-ng-03.storage.lan -j ACCEPT 

# Possibly for clients
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 24007:24008 -s client-01.storage.lan -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport 49152:49156 -s client-01.storage.lan -j ACCEPT
service iptables restart

OR

vi /etc/sysconfig/iptables

# The gluster daemon needs ports 24007 and 24008
# Individual bricks need ports starting at 49152. Add an additional port per brick.
# Here we have 49152-49155 open for 4 bricks. 

# TODO - test this command
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp -s xcp-ng-01.storage.lan --dport 24007:24008 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp -s xcp-ng-01.storage.lan --dport 49152:49155 -j ACCEPT

-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp -s xcp-ng-02.storage.lan --dport 24007:24008 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp -s xcp-ng-02.storage.lan --dport 49152:49155 -j ACCEPT

-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp -s xcp-ng-03.storage.lan --dport 24007:24008 -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp -s xcp-ng-03.storage.lan --dport 49152:49155 -j ACCEPT

Disk

Gluster works with filesystems. This is convenient because if all else fails, you still have files on disks you can access. XFS is well regarded with gluster admins, so we’ll use that.

# Install the xfs programs
yum install -y xfsprogs

# Wipe the disks before using, then format the whole disk. Repeat for each disk
wipefs -a /dev/sda
mkfs.xfs /dev/sda

Let’s mount those disks. The convention2 is to put them in /data organized by volume. We’ll use ‘volume01’ later in the config, so let’s use that here as well.

On each server

# For 4 disks - Note, gluster likes to call them 'bricks'
mkdir -p /data/glusterfs/volume01/brick0{1,2,3,4}
mount /dev/sda /data/glusterfs/volume01/brick01
mount /dev/sdb /data/glusterfs/volume01/brick02
mount /dev/sdc /data/glusterfs/volume01/brick03
mount /dev/sdd /data/glusterfs/volume01/brick04

Add the appropriate config to your /etc/fstab so they mount at boot
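
A minimal sketch of those fstab entries (using UUIDs from blkid is more robust than /dev names):

/dev/sda /data/glusterfs/volume01/brick01 xfs defaults 0 0
/dev/sdb /data/glusterfs/volume01/brick02 xfs defaults 0 0
/dev/sdc /data/glusterfs/volume01/brick03 xfs defaults 0 0
/dev/sdd /data/glusterfs/volume01/brick04 xfs defaults 0 0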

Installation

A Note About Versions

XCP-NG is CentOS 7 based and provides GlusterFS v8 in its repo. This version went EOL in 2021. You can add the CentOS Storage Special Interest Group repo to get to v9, but no current version can be installed.

# Not recommended
yum install centos-release-gluster  --enablerepo=epel,base,updates,extras
# On each host

yum install -y glusterfs-server
systemctl enable --now glusterd


# On the first host

gluster peer probe xcp-ng-02.storage.lan
gluster peer probe xcp-ng-03.storage.lan

gluster pool list

UUID                                  Hostname              State
a103d6a5-367b-4807-be93-497b06cf1614  xcp-ng-02.storage.lan Connected 
10bc7918-364d-4e4d-aa16-85c1c879963a  xcp-ng-03.storage.lan Connected 
d00ea7e3-ed94-49ed-b56d-e9ca4327cb82  localhost             Connected

# Note - localhost will always show up for the host you're running the command on

Configuration

Gluster talks about data as being distributed and dispersed.

Distributed

# Distribute data amongst 3 servers, each with a single brick
gluster volume create MyVolume server1:/brick1 server2:/brick1 server3:/brick1

Any time you have more than one drive, it’s distributed. That can be across different disks on the same host, or across different hosts. There is no redundancy, however, and any loss of a disk is loss of data.

Disperse

# Disperse data amongst 3 bricks, each on a different server
gluster volume create MyVolume disperse server1:/brick1 server2:/brick1 server3:/brick1

Dispersed is how you build redundancy across servers. Any one of these servers or bricks can fail and the data is safe.

# Disperse data amongst six bricks, but some on the same server. Problem!
gluster volume create MyVolume disperse \
  server1:/brick1 server2:/brick1 server3:/brick1 \
  server1:/brick2 server2:/brick2 server3:/brick2

If you try and disperse your data across multiple bricks on the same server, you’ll run into the problem of sub-optimal parity. You’ll see the error message:

Multiple bricks of a disperse volume are present on the same server. This setup is not optimal. Bricks should be on different nodes to have best fault tolerant configuration

Distributed-Disperse

# Disperse data into 3 brick subvolumes before distributing
gluster volume create MyVolume disperse 3 \
  server1:/brick1 server2:/brick1 server3:/brick1 \
  server1:/brick2 server2:/brick2 server3:/brick2

By specifying disperse COUNT you tell Gluster that you want to create a subvolume every COUNT bricks. In the above example, it’s every three bricks, so two subvolumes get created from the six bricks. This ensures the parity is optimally handled as it’s distributed.

You can also take advantage of bash shell expansion like below. Each subvolume is one line, repeated for each of the 4 bricks it will be distributed across.

gluster volume create volume01 disperse 3 \
  xcp-ng-0{1..3}.storage.lan:/data/glusterfs/volume01/brick01/brick \
  xcp-ng-0{1..3}.storage.lan:/data/glusterfs/volume01/brick02/brick \
  xcp-ng-0{1..3}.storage.lan:/data/glusterfs/volume01/brick03/brick \
  xcp-ng-0{1..3}.storage.lan:/data/glusterfs/volume01/brick04/brick 

Operation

Mounting and Optimizing Volumes

mount -t glusterfs xcp-ng-01.storage.lan:/volume01 /mnt

gluster volume set volume01 group metadata-cache

gluster volume set volume01 performance.readdir-ahead on 
gluster volume set volume01 performance.parallel-readdir on

gluster volume set volume01 group nl-cache 
gluster volume set volume01 nl-cache-positive-entry on

Adding to XCP-NG

mount -t glusterfs xcp-ng-01.lan:/volume01/media.2 /root/mnt2/
mkdir mnt2/xcp-ng

xe sr-create content-type=user type=glusterfs name-label=GlusterSharedStorage shared=true \
  device-config:server=xcp-ng-01.lan:/volume01/xcp-ng \
  device-config:backupservers=xcp-ng-02.lan:xcp-ng-03.lan

Scrub and Bitrot

Scrub is off by default. You can enable scrub, at which point the scrub daemon will begin “signing” files3 (by calculating checksums). The file-system parity isn’t used. So if you enable it and immediately begin a scrub, you will see many “Skipped files” as their checksums haven’t been calculated yet.
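
The commands look roughly like this, using the volume from earlier:

# Enable bitrot detection (starts the signer/scrubber for the volume)
gluster volume bitrot volume01 enable

# Kick off a scrub now and check on it
gluster volume bitrot volume01 scrub ondemand
gluster volume bitrot volume01 scrub status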

Client Installation

The FUSE client is recommended4. The docs cover a .deb based install, but you can also install from the repo. On Debian:

sudo apt install lsb-release gnupg

OS=$(lsb_release --codename --short)

# Assuming the current version of gluster is 11 
wget -O - https://download.gluster.org/pub/gluster/glusterfs/11/rsa.pub | sudo apt-key add -
echo deb [arch=amd64] https://download.gluster.org/pub/gluster/glusterfs/11/LATEST/Debian/${OS}/amd64/apt ${OS} main | sudo tee /etc/apt/sources.list.d/gluster.list
sudo apt update; sudo apt install glusterfs-client

You need quite a few options to use this successfully at boot in the fstab

192.168.3.235:/volume01 /mnt glusterfs nofail,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10 0 0

How to reboot a node

You may find that your filesystem has paused during a reboot. Take a look at your network timeout and see if setting it lower helps.

https://unix.stackexchange.com/questions/452939/glusterfs-graceful-reboot-of-brick

gluster volume set volume01 network.ping-timeout 5

Using notes from https://www.youtube.com/watch?v=TByeZBT4hfQ

1.4.5 - NFS

NFS is the fastest way to move files around a small network. It beats both Samba and AFP in throughput (circa 2014) in my testing, and with a little extra config works well between Apple and Linux.

1.4.5.1 - General Use

The NFS server supports multiple protocol versions, but we’ll focus on the current 4.X version of the protocol. It’s been out since 2010 and simplifies security.

Installation

Linux Server

This will install the server and a few requisites.

sudo  apt-get install nfs-kernel-server 

Configuration

Set NFSv4 only

In order to streamline the ports needed (in case one uses firewalls) and reduce required services, we will limit the server to v41 only.

Edit nfs-common

sudo vi /etc/default/nfs-common

NEED_STATD="no"
NEED_IDMAPD="yes"

And the defaults

sudo vi /etc/default/nfs-kernel-server

RPCNFSDOPTS="-N 2 -N 3"
RPCMOUNTDOPTS="--manage-gids -N 2 -N 3"

Disable rpcbind

sudo systemctl mask rpcbind.service
sudo systemctl mask rpcbind.socket

Create Exports

In NFS parlance, you ’export’ a folder when you share it. We’ll use the same location for our exports as suggested in the Debian example.

sudo vim /etc/exports

/srv/nfs4 192.168.1.0/24(rw,async,fsid=0,crossmnt,no_subtree_check,all_squash,anonuid=1000,anongid=1000,insecure)

         /srv/nfs4 # This is the actual folder on the server's file system you're sharing
    192.168.1.0/24 # This is the network you're sharing with
                rw # Read-Write mode
             async # Allow cached writes
            fsid=0 # This signifies this is the 'root' of the exported file system and that
                   # clients will mount this share as '/'
          crossmnt # Allow subfolders that are separate filesystems to be accessed also
  no_subtree_check # Disable checking for access rights outside the exported file system
        all_squash # all user IDs will be translated to anonymous
      anonuid=1000 # all anonymous connections will be mapped to this user account in /etc/passwd
      anongid=1000 # all anonymous connections will be mapped to this group account in /etc/passwd
          insecure # Allows Macs to mount using non-root source ports

If you can’t put all your content under this folder, it’s recommended you create a pseudo file system for security reasons. See the notes for more info on that, but keep things simple if you can.
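
The usual pattern is to bind mount the real locations under the export root; a sketch with example paths:

# Bind the real data under the export root
sudo mkdir -p /srv/nfs4/media
sudo mount --bind /data/media /srv/nfs4/media

# And in /etc/fstab so it survives a reboot
/data/media /srv/nfs4/media none bind 0 0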

Configure Host-Based Firewall

If you have a system with ufw you can get this working fairly easily. NFS is already defined as a well-known service.

ufw allow from 192.168.1.0/24 to any port nfs

Restart the Service

You don’t actually need to restart. You put your changes into effect by issuing the exportfs command. This is best practice so you don’t disrupt currently connected clients.

exportfs -rav

Client Configuration

Apple OS X

Modern Macs support NFSv4 with a couple tweaks

# In a terminal, issue the command
sudo mount -t nfs -o nolocks,resvport,locallocks 192.168.1.2:/srv ./mnt

You can also mount in finder with a version 4 flag. That may help but is somewhat awkward

nfs://vers=4,192.168.1.5/srv/nfs4

You can also edit the mac’s config file. This will allow you to use the finder to mount NFS 4 exports.

sudo vim /etc/nfs.conf

#
# nfs.conf: the NFS configuration file
#
#nfs.client.mount.options = nolock
#
nfs.client.mount.options = vers=4.1,nolocks,resvport,locallocks

You can now hit command-k and enter the string below to connect

nfs://my.server.or.ip/

Some sources suggest editing the autofs.conf file to add ’nolocks,locallocks’ to the automount options. This may or may not have an effect.

sudo vim  /etc/autofs.conf
AUTOMOUNTD_MNTOPTS=nosuid,nodev,nolocks,locallocks

Troubleshooting

Must use v3

If you must use v3, you can set static ports. Search online for the details.

lockd: cannot monitor

You may want to check your Mac’s NFS options and set ’nolock’ or possibly ‘vers=4’ as above. Don’t set them both at once, as described in the next issue.

mount_nfs: can’t mount / from home onto /Volumes/mnt: Invalid argument

You can’t combine -o vers=4 with options like ’nolocks’, presumably because it’s not implemented fully. This may have changed by now.

https://developer.apple.com/library/mac/documentation/Darwin/Reference/Manpages/man8/mount_nfs.8.html

No Such File or Directory

mount.nfs: mounting some.ip:/srv failed, reason given by server: No such file or directory

Version 4 maps directories and starts with ‘/’. Try mounting just the root path as opposed to /srv/nfs4.

mount  -o nfsvers=4.1 some.ip:/ /srv

There was a problem …

Check that you have ‘insecure’ in your nfs export options on the server

/srv  192.168.1.0/24(rw,async,fsid=0,insecure,crossmnt,no_subtree_check)

Can’t create or see files

Don’t forget that file permissions apply as the user you specified above. Set chown and chmod accordingly

Can Create Files But Not Modify or Delete

Check the parent directory permissions

NFS doesn’t mount at boot

Try adding some mount options.

some.ip:/ /srv  nfs nofail,x-systemd.automount,x-systemd.requires=network-online.target,x-systemd.device-timeout=10,vers=4.1 0 0

mount.nfs: requested NFS version or transport protocol is not supported

Try specifying the nfs version

mount  -o nfsvers=4.1 some.ip:/ /srv

1.4.5.2 - Armbian NFS Server

Choosing NFS over SMB is usually a question of overhead. NFS has less CPU overhead and faster speeds (circa 2023), and anecdotal testing showed fewer issues with common clients like VLC, Infuse and Kodi. However, there’s no advertisement1 like SMB has, so you have to pre-configure all clients.

This is the basic config for an anonymous, read-only share.

apt install nfs-kernel-server

echo "/mnt/pool *(fsid=0,ro,all_squash,no_subtree_check)" >> /etc/exports

exportfs -rav

  1. mDNS SRV records has some quasi supports, but not with common clients ↩︎

1.4.5.3 - NFS Container

This is problematic. NFS requires kernel privileges, so the usual answer is “don’t”. Clients too. So from a security and config standpoint, it’s better to have PVE act as the NFS client and use bind mounts for the containers. But this can blur the line between services and infrastructure.
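
For the bind-mount approach, the container config ends up with a line like this (the container ID and paths are examples):

# /etc/pve/lxc/101.conf - PVE bind mounts the host path into the container
mp0: /mnt/pve/bulk,mp=/srv/media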

Either way, here are my notes from setting up an Alpine NFS server.

Create privileged container and enable nesting

https://forum.proxmox.com/threads/is-it-possible-to-run-a-nfs-server-within-a-lxc.24403/page-2

Create a privileged container by unchecking “Unprivileged” during creation. It may be possible to convert an existing container from unprivileged to privileged by backing up and restoring. In the container Options -> Features, enable Nesting. (The NFS feature doesn’t seem necessary for running an NFS server. It may be required for an NFS client - I haven’t checked.)

For Alpine, CAP_SETPCAP is also needed

vi /etc/pve/lxc/100.conf

# clear cap.drop
lxc.cap.drop:

# copy drop list from /usr/share/lxc/config/common.conf
lxc.cap.drop = mac_admin mac_override sys_time sys_module sys_rawio

# copy drop list from /usr/share/lxc/config/alpine.common.conf with setpcap commented

lxc.cap.drop = audit_write
lxc.cap.drop = ipc_owner
lxc.cap.drop = mknod
# lxc.cap.drop = setpcap
lxc.cap.drop = sys_nice
lxc.cap.drop = sys_pacct
lxc.cap.drop = sys_ptrace
lxc.cap.drop = sys_rawio
lxc.cap.drop = sys_resource
lxc.cap.drop = sys_tty_config
lxc.cap.drop = syslog
lxc.cap.drop = wake_alarm

Then proceed with https://wiki.alpinelinux.org/wiki/Setting_up_a_nfs-server.

1.4.6 - Replication

Not backup - it’s simply copying data between multiple locations. More like mirroring.

1.4.6.1 - rsync

This is used enough that it deserves several pages.

1.4.6.1.1 - Basic Rsync

If you regularly copy lots of files it’s best to use rsync. It’s efficient, as it only copies what you need, and secure, being able to use SSH. Many other tools such as BackupPC, Duplicity etc. use rsync under the hood, and when you are doing cross-platform data replication it may be the only tool that works, so you’re best to learn it.

Local Copies

Generally, it’s 10% slower than just using cp -a. Sometimes it’s best to start with that and finish up with this.

rsync \
--archive \
--delete \
--dry-run \
--human-readable \
--inplace \
--itemize-changes \
--progress \
--verbose \
/some/source/Directory \
/some/destination/

The explanations of the more interesting options are:

--archive: Preserves all the metadata, as you'd expect
--delete : Removes extraneous files at the destination that no longer exist at the source (i.e. _not_ a merge)
--dry-run: Makes no changes. This is important for testing. Remove for the actual run
--inplace: This overwrites the file directly, rather than the default behavior that is to build a copy on the other end before moving it into place. This is slightly faster and better when space is limited (I've read)

If you don’t trust the timestamps at your destination, you can add the --checksum option, though when you’re local this may be slower than just recopying the whole thing.

A note about trailing slashes: in the source above, there is no trailing slash. But we could have added one, or even a /*. Here’s what happens when you do that (see the short example after the list).

  • No trailing slash - This will sync the directory as you’d expect.
  • Trailing slash - It will sync the contents of the directory to the location, rather than the directory itself.
  • Trailing /* - Try not to do this. It will sync each of the items in the source directory as if you had typed them individually, but not delete destination files that no longer exist on the source, and so everything will be a merge regardless of whether you issued the --delete parameter.
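
A short illustration of the first two cases, using the paths from above:

# Ends up with /some/destination/Directory/...
rsync --archive /some/source/Directory  /some/destination/

# Ends up with the contents of Directory directly in /some/destination/
rsync --archive /some/source/Directory/ /some/destination/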

Across the Network

This uses SSH for encryption and authentication.

rsync \
--archive \
--delete \
--dry-run \
--human-readable \
--inplace \
--itemize-changes \
--progress \
--verbose \
/srv/Source_Directory/* \
user@some.server.org:/srv/Destination_Directory

Windows to Linux

One easy way to do this is to grab a bundled version of rsync and ssh for Windows from the cwRsync folks.

<https://www.itefix.net/content/cwrsync-free-edition>

Extract the standalone client to a folder and edit the .cmd file to add this at the end (the ^ is the Windows cmd line-continuation escape).

rsync ^
--archive ^
--delete ^
--dry-run ^
--human-readable ^
--inplace ^
--itemize-changes ^
--no-group ^
--no-owner ^
--progress ^
--verbose ^
--stats ^
user@some.server.org:/srv/media/video/movies/* /cygdrive/D/Media/Video/Movies/

pause

Mac OS X to Linux

The version that comes with recent versions of OS X is a 2.6.9 (or so) variant. You can use that, or obtain the more recent 3.0.9 that has some slight speed improvements and features. To get the newest (you have to build it yourself) install brew, then issue the commands:

brew install https://raw.github.com/Homebrew/homebrew-dupes/master/rsync.rb
brew install rsync

One of the issues with syncing between OS X and Linux is the handling of Mac resource forks (file metadata). Lets assume that you are only interested in data files (such as mp4) and are leaving out the extended attributes that apple uses to store icons and other assorted data (replacing the old resource fork).

Since we are going between file systems, rather than use the ‘a’ option that preserves file attributes, we specify only ‘recursive’ and ’times’. We also use some excludes to keep Mac-specific files from tagging along.

/usr/local/bin/rsync \
    --exclude .DS* \
    --exclude ._* \
    --human-readable \
    --inplace \
    --progress \
    --recursive \
    --times \
    --verbose \
    --itemize-changes \
    --dry-run \
    "/Volumes/3TB/source/" \
    user@some.server.org:"/Volumes/3TB/"

Importantly, we are ‘itemizing’ and doing a ‘dry-run’. When you do, you will see a report like:

skipping non-regular file "Photos/Summer.2004"
skipping non-regular file "Photos/Summer.2005"
.d..t....... Documents/
.d..t....... Documents/Work/
cd++++++++++ ISOs/
<f++++++++++ ISOs/Office.ISO

The line with cd+++ indicates a directory will be created and <f+++ indicates a file is going to be copied. When it says ‘skipping’ a non-regular file, that’s (in this case, at least) a symlink. You can include them, but if your paths don’t match up on both systems, these links will fail.

Spaces in File Names

Generally you quote and escape.

rsync ^
  --archive ^
  --itemize-changes ^
  --progress ^
  user@some.server.org:"/srv/media/audio/Music/Basil\ Poledouris" ^
  /cygdrive/c/Users/Allen/Music

Though it’s rumored that you can single quote without escaping if you use the --protect-args option

--protect-args ^
user@some.server.org:'/srv/media/audio/Music/Basil Poledouris' ^

List of Files

You may want to combine find and rsync to get files matching specific criteria. Use the --files-from parameter.

ssh server.gattis.org find /srv/media/video -type f -mtime -360 > list

rsync --progress --files-from=list server.gattis.org:/ /mnt/media/video/

Seeding an Initial Copy

If you have no data on the destination to begin with, rsync will be somewhat slower than a straight copy. On a local system simply use ‘cp -a’ (to preserve file times). On a remote system, you can use tar to minimize the file overhead.

tar -c /path/to/dir | ssh remote_server 'tar -xvf - -C /absolute/path/to/remotedir'

It is also possible to use rsync with the option --whole-file; this skips the delta algorithm that slows rsync down, though I have not tested its speed.

Time versus size

Rsync uses time and size to determine if a file should be updated. If you have already copied files and you are trying to do a sync, you may find your modification times are off. Add the --size-only or the --modify-window=NUM option. Even better, correct your times. (On OS X this requires coreutils to get the GNU ls command, working with the idea here.)
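
For example (a sketch using the options just mentioned):

# Compare by size only, or allow a small timestamp skew instead
rsync --archive --size-only /some/source/Directory /some/destination/
rsync --archive --modify-window=2 /some/source/Directory /some/destination/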

http://notemagnet.blogspot.com/2009/10/getting-started-with-rsync-for-paranoid.html
http://www.chrissearle.org/blog/technical/mac_homebrew_and_homebrew_alt
http://ubuntuforums.org/showthread.php?t=1806213

1.4.6.1.2 - Scheduled Rsync

Running rsync via cron has been around a long time. Ideally, you use public keys and limit the account. You do it something like this.

  • On the source
    • Configure SSHD to handle user keys
    • Create a control script to restrict users to rsync
    • Add an account specific to backups
    • Generate user keys and limit to the control script
  • On the destination
    • Copy the private key
    • Create a script and cronjob

Source

# Add a central location for keys and have sshd look there. Notice the
# '%u'. It's substituted with user ID at login to match the correct filename
sudo mkdir /etc/ssh/authorized_keys
echo "AuthorizedKeysFile /etc/ssh/authorized_keys/%u.pub" > /etc/ssh/sshd_config.d/authorized_users.conf
systemctl restart ssh.service

# Create the script logic that makes sure it's an rsync command. You can modify this to allow other cmds as needed.
sudo tee /etc/ssh/authorized_keys/checkssh.sh << "EOF"
#!/bin/bash
if [ -n "$SSH_ORIGINAL_COMMAND" ]; then
    if [[ "$SSH_ORIGINAL_COMMAND" =~ ^rsync\  ]]; then
        echo $SSH_ORIGINAL_COMMAND | systemd-cat -t rsync
        exec $SSH_ORIGINAL_COMMAND
    else
        echo DENIED $SSH_ORIGINAL_COMMAND | systemd-cat -t rsync
    fi
fi
EOF

chmod +x /etc/ssh/authorized_keys/checkssh.sh

# Add the user account and create keys for them
THE_USER="backup-account-1"
sudo adduser --no-create-home --home /nonexistent --disabled-password --gecos "" ${THE_USER}
ssh-keygen -f /etc/ssh/authorized_keys/${THE_USER} -q -N "" -C "${THE_USER}"

# Add the key stipulations that invoke the script and limit ssh options.
# command="/etc/ssh/authorized_keys/checkssh.sh\",no-port-forwarding,no-X11-forwarding,no-agent-forwarding,no-pty 
THE_COMMAND="\
command=\
\"/etc/ssh/authorized_keys/checkssh.sh\",\
no-port-forwarding,\
no-X11-forwarding,\
no-agent-forwarding,\
no-pty "

# Insert the command in front of the user's key - the whole file remains a single line
sed -i "1s|^|$THE_COMMAND|" /etc/ssh/authorized_keys/${THE_USER}.pub

# Finally, copy the account's private key to the remote location
scp /etc/ssh/authorized_keys/${THE_USER} user@destination.host:

Destination

It’s usually best to create a script that uses rsync and call that from cron. Preferably one that doesn’t step on itself for long running syncs. Like this:

vi ~/schedule-rsync
#!/bin/bash

THE_USER="backup-account-1"
THE_KEY="$HOME/backup-account-1" # If you move the key, make sure to adjust this (a quoted ~ won't expand)

SCRIPT_NAME=$(basename "$0")
PIDOF=$(pidof -x $SCRIPT_NAME)

for PID in $PIDOF; do
    if [ $PID != $$ ]; then
        echo "[$(date)] : $SCRIPT_NAME : Process is already running with PID $PID"
        exit 1
    fi
done



rsync \
--archive \
--bwlimit=5m \
--delete \
--delete-excluded \
--exclude .DS* \
--exclude ._* \
--human-readable \
--inplace \
--itemize-changes \
--no-group \
--no-owner \
--no-perms \
--progress \
--recursive \
--rsh "ssh -i ${THE_KEY}" \
--verbose \
--stats \
${THE_USER}@some.server.org\
:/mnt/pool01/folder.1 \
:/mnt/pool01/folder.2 \
:/mnt/pool01/folder.2 \
/mnt/pool02/

Then, call it from a file in the cron drop folder.

echo "0 1 * * * /home/$USER/schedule-rsync  >> /home/$USER/rsync-video.log 2>&1" > /etc/cron.d/schedule-rsync

Notes

Why not use rrsync?

The rrsync script is similar to the script we use, but is distributed and maintained as part of the rsync package. It’s arguably a better choice. I like the checkssh.sh approach as it’s more flexible, allows for things other than rsync, and doesn’t force relative paths. But if you’re only doing rsync, consider using rrsync like this;

THE_COMMAND="\
command=\
\"rrsync -ro /mnt/pool01\",\
no-port-forwarding,\
no-X11-forwarding,\
no-agent-forwarding,\
no-pty "

In your client’s rsync command, make the paths relative to the path rrsync expects above.

rsync \
...
...
${THE_USER}@some.server.org\
:folder.1 \
:folder.2 \
:folder.2 \
/mnt/pool02/

If you see the client-side error message:

rrsync error: option -L has been disabled on this server

You discovered that following symlinks has been disabled by default in rrsync. You can enable with an edit to the script.

sudo sed -i 's/KLk//' /usr/bin/rrsync

# This changes
#    short_disabled_subdir = 'KLk'
# to
#    short_disabled_subdir = ''

Troubleshooting

Sources

https://peterbabic.dev/blog/transfer-files-between-servers-using-rrsync/
http://gergap.de/restrict-ssh-to-rsync.html
https://superuser.com/questions/641275/make-linux-server-allow-rsync-scp-sftp-but-not-a-terminal-login

1.4.6.1.3 - Rsync Daemon

Some low-power devices, such as the Raspberry Pi, struggle with the encryption overhead of rsync’s default network transport, SSH.

If you don’t need encryption or authentication, you can significantly speed things up by using rsync in daemon mode.

Push Config

In this example, we’ll push data from our server to the low-power client.

Create a Config File

Create a config file on the server that we’ll send over to the client later.

nano client-rsyncd.conf
log file = /var/log/rsync.log
pid file = /var/run/rsyncd.pid
lock file = /var/run/rsync.lock
# Listen on an unprivileged port (matches the rsync:// URL used below)
port = 8730

# This is the name you refer to in rsync. The path is where that maps to.
[media]
        path = /var/media
        comment = Media
        read only = false
        timeout = 300
        uid = you
        gid = you

Start and Push On-Demand

The port used here is high-numbered and doesn’t require root privileges.

# Send the daemon config over to the home dir
scp client-rsyncd.conf user@client.some.lan:

# Launch rsync in daemon mode
ssh user@client.some.lan rsync --daemon --config ./client-rsyncd.conf

# Send the data over
rsync \
--archive \
--delete \
--human-readable \
--inplace \
--itemize-changes \
--no-group \
--no-owner \
--no-perms \
--omit-dir-times \
--progress \
--recursive \
--verbose \
--stats \
/mnt/pool01/media/movies rsync://client.some.lan:8730/media

# Terminate the remote instance
ssh user@client.some.lan killall rsync

1.4.6.1.4 - Tunneled Rsync

One common task is to rsync through a bastion host to an internal system. Do it with the rsync shell options

rsync \
--archive \
--delete \
--delete-excluded \
--exclude "lost+found" \
--human-readable \
--inplace \
--progress \
--rsh='ssh -o "ProxyCommand ssh user@bastion.host -W %h:%p"' \
--verbose \
user@internal.host:/srv/plex/* \
/data/

There is also a -J (ProxyJump) option on newer versions of SSH.
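
A sketch of the same copy using -J (hostnames are the same placeholders as above):

rsync \
--archive \
--rsh='ssh -J user@bastion.host' \
user@internal.host:/srv/plex/* \
/data/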

https://superuser.com/questions/964244/rsyncing-directories-through-ssh-tunnel
https://unix.stackexchange.com/questions/183951/what-do-the-h-and-p-do-in-this-command
https://superuser.com/questions/1115715/rsync-files-via-intermediate-host

1.4.6.2 - Tar Pipe

AKA - The Fastest Way to Copy Files.

When you don’t want to copy a whole file system, many admins suggest the fastest way is with a ’tar pipe'.

Locally

From one disk to another on the same system. This uses pv to buffer.

(cd /src; tar cpf - .) | pv -trab -B 500M | (cd /dst; tar xpf -)

Across the network

NetCat

You can add netcat to the mix (as long as you don’t need encryption) to get it across the network.

On the receiver:

(change to the directory you want to receive the files or directories in)

nc -l -p 8989 | tar -xpzf -

On the sender:

(change to the directory that has the file or directory - like ‘pics’ - in it)

tar -czf - pics | nc some.server 8989

mbuffer

This takes the place of pv and nc and is somewhat faster.

On the receiving side

    mbuffer -4 -I 9090 | tar -xf -

On the sending side

    sudo tar -c plexmediaserver | mbuffer -m 1G -O SOME.IP:9090

SSH

You can use ssh when netcat isn’t appropriate, or when you want to automate with an SSH key and limited interaction with the other side. This example ‘pulls’ from a remote server.

 (ssh [email protected] tar -czf - /srv/http/someSite) | (tar -xzf -)

NFS

If you already have an NFS server on one of the systems, it’s basically just as fast. At least in informal testing, it behaves more steadily, as opposed to a tar pipe’s higher peaks and lower troughs. A simple cp -a will suffice, though for lots of little files a tar pipe may still be faster.
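A minimal sketch of that approach, assuming the source already exports /srv/plex over NFS (the hostname, export, and mount point here are hypothetical):

# Mount the export read-only on the destination, then copy preserving attributes
sudo mkdir -p /mnt/nfs
sudo mount -t nfs -o ro some.server:/srv/plex /mnt/nfs
cp -a /mnt/nfs/. /data/
sudo umount /mnt/nfs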

rsync

rsync is generally best if you can or expect the transfer to be interrupted. In my testing, rsync achieved about 15% less throughput with about 10% more processor overhead.

http://serverfault.com/questions/43014/copying-a-large-directory-tree-locally-cp-or-rsync
http://unix.stackexchange.com/questions/66647/faster-alternative-to-cp-a
http://serverfault.com/questions/18125/how-to-copy-a-large-number-of-files-quickly-between-two-servers

1.4.6.3 - Unison

Unison offers several features that make it more useful than rsync;

  • Multi-Way File Sync
  • Detect Renames and Copies
  • Delta copies

Multi-Way File Sync

Rsync is good at one-way synchronization, i.e. one to many. But when you need to sync multiple authoritative systems, i.e. many to many, you want to use unison. It allows you to merge changes.
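In its simplest form, you hand unison two roots and it reconciles changes in both directions. A quick sketch, using the same placeholder paths as the config example further down:

# Sync a local folder with the same folder on a remote host over ssh
unison /mnt/someFolder ssh://[email protected]//mnt/someFolder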

Detect Renames and Copies (xferbycopying)

Another problem with rsync is that when you rename a file, it re-sends it. This is because a renamed file appears ‘new’ to the sync utility. Unison, however, maintains a hash of every file you’ve synced, and if there is already a local copy (i.e. the file before you renamed it), it will use that and do a ‘local copy’ rather than sending it. So a rename effectively becomes a local copy and a delete. Not perfect, but better than sending the file across the wire.

Delta Copies

Unison uses its own implementation of the rsync delta-copy algorithm. However, for large files the authors recommend an option that wraps rsync itself, as it can be tuned for large files.

Unison can use config files in your ~/.unison folder. If you type ‘unison’ without any arguments, it will use the ‘default.prf’ file. Here is a sample

# Unison preferences file

# Here are the two server 'roots' i.e., the start of where we will pick out things to sync.
# The first root is local, and the other remote over ssh
root = /mnt/someFolder
root = ssh://[email protected]//mnt/someFolder

# The 'path' is simply the name of a folder or file you want to sync. Notice the spaces are preserved. Do not escape them.
path = A Folder Inside someFolder

# We're 'forcing' the first root to win all conflicts. This sort of negates the multi-way
# sync feature but it's just an example 
force = /mnt/someFolder

# This instructs unison to copy the contents of sym links, rather than the link itself
follow = Regex .*


# You can also ignore files and paths explicitly or pattern. See the 'Path specification' 
ignore = Name .AppleDouble
ignore = Name .DS_Store
ignore = Name .Parent
ignore = Name ._*

# Here we are invoking an external engine (rsync) when a file is over 10M, and passing it some arguments 
copythreshold = 10000
copyprog      = rsync --inplace
copyprogrest  = rsync --partial --inplace

Hostname is important. Unison builds a hash of all the files to determine what’s changed (similar to md5sum with rsync, but faster). If you get repeated messages about ‘…first time being run…’ you may have an error in your path.

http://www.cis.upenn.edu/~bcpierce/unison/download/releases/stable/unison-manual.html

1.4.7 - sshfs

You can mount a remote system via sshfs. It’s slow, but better than nothing.

# Mount a host dir you have ssh access to
sshfs [email protected]:/var/www/html /var/www/html

# Mount a remote system over a proxy jump host
sshfs [email protected]:/var/www/html /var/www/html -o ssh_command="ssh -J [email protected]",allow_other,default_permissions

It’s often handy to add a shortcut to your ssh config so you don’t have to type as much.

cat .ssh/config 
Host some.host
    ProxyJump some.in.between.host
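With that shortcut in place, the proxied sshfs command above no longer needs the ssh_command option (same host and paths as the earlier example):

sshfs [email protected]:/var/www/html /var/www/html -o allow_other,default_permissions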

1.4.8 - ZFS

Overview

ZFS: the last word in file systems - at least according to Sun Microsystems back in 2004. But it’s pretty much true for traditional file servers. You add disks to a pool where you decide how much redundancy you want, then create file systems that sit on top. ZFS directly manages it all. No labeling, partitioning or formatting required.

There is error detection and correction, zero-space clones and snapshots, compression and deduplication, and delta-based replication for backup. It’s also highly resistant to corruption from power loss and crashes because it uses Copy-On-Write.

This last feature means that as files are changed, only the changed-bits are written out, and then the metadata updated at the end as a separate and final step to include these changed bits. The original file stays the same until the very end. An interruption in the middle of a write (such as from a crash) leaves the file undamaged.

Lastly, everything is checksummed and automatically repaired should you ever suffer from silent corruption.

1.4.8.1 - Basics

The Basics

Let’s create a pool and mount a file system.

Create a Pool

A ‘pool’ is a group of disks, and that’s where RAID levels are established. You can choose to mirror drives or use distributed parity. A common choice is RAIDZ1, so that you can sustain the loss of one drive. You can increase the redundancy to RAIDZ2 and RAIDZ3 as desired.

zpool create pool01 raidz1 /dev/sdc /dev/sdd /dev/sde /dev/sdf

zpool list
NAME               SIZE  ALLOC   FREE  
pool01              40T     0T    40T 

Create a Filesystem

A default root file system is created and mounted automatically when the pool is created. Using this is fine and you can easily change where it’s mounted.

zfs list

NAME     USED  AVAIL     REFER  MOUNTPOINT
pool01     0T    40T        0T  /pool01

zfs set mountpoint=/mnt pool01

But often you need more than one file system - say, one for holding working files and another for long-term storage. This allows you to easily back up different things differently.

# Get rid of the initial fs and pool
zpool destroy pool01
# Create it again but leave the root filesystem unmounted with the `-m` option. You can also use drive short-names
zpool create -m none pool01 raidz1 sdc sdd sde sdf

# Add a couple filesystems and mount under the /srv directory
zfs create pool01/working -o  mountpoint=/srv/working
zfs create pool01/archive -o  mountpoint=/srv/archive             
              ^     ^
             /       \
pool name --          filesystem name

Now you can do things like snapshot the archive folder regularly while skipping the working folder. The only downside is that they are separate filesystems. Moving things between them doesn’t happen instantly.

Compression

Compression is on by default and this will save space for things that can benefit from it. It also makes things faster, as moving compressed data takes less time. CPU use for the default algorithm, lz4, is negligible, and it quickly detects files that aren’t compressible and gives up so that CPU time isn’t wasted.

zfs get compression pool01

NAME    PROPERTY     VALUE           SOURCE
pool01  compression  lz4             local
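If you ever need to set it explicitly, or want a different setting on one filesystem, it’s just a property (pool and filesystem names from the examples above):

# Set compression on the whole pool, or override it per filesystem
zfs set compression=lz4 pool01
zfs set compression=lz4 pool01/archive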

Next Step

Now that you’ve got a filesystem, take a look at creating and working with snapshots.

1.4.8.2 - Snapshot

Create a Snapshot

About to do something dangerous? Let’s create a ‘save’ point so you can reload your game, so to speak. They don’t take any space (to start with) and are nearly instant.

# Create a snapshot named `save-1`
zfs snapshot pool01/archive@save-1

The snapshot is a read-only copy of the filesystem at that time. It’s mounted by default in a hidden directory and you can examine and even copy things out of it, if you desire.

ls /srv/archive/.zfs/snapshot/save-1

Delete a Snapshot

While a snapshot doesn’t take up any space to start with, it begins to as you make changes. Anything you delete stays around on the snapshot. Things you edit consume space as well for the changed bits. So when you’re done with a snapshot, it’s easy to remove.

zfs destroy pool01/archive@save-1

Rollback to a Snapshot

Mess things up in your archive folder after all? Not a problem, just roll back to the same state as your snapshot.

zfs rollback pool01/archive@save-1

Importantly, this is a one-way trip back. You can’t branch and jump around like it was a filesystem multiverse of alternate possibilities. ZFS will warn you about this and if you have more than one snapshot in between you and where you’re going, it will let you know they are about to be deleted.
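A quick sketch of that warning in action, using the snapshot from above - the -r flag is what destroys the intervening snapshots:

# Fails with a warning if snapshots newer than save-1 exist
zfs rollback pool01/archive@save-1

# Roll back anyway, destroying any snapshots taken after save-1
zfs rollback -r pool01/archive@save-1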

Auto Snapshot

One of the most useful tools is the zfs-auto-snapshot utility. This will create periodic snapshots of your filesystem and keep them pruned for efficiency. By default, it creates a snapshot every 15 min, and then prunes them down so you have one:

  • Every 15 min for an hour
  • Every hour for a day
  • Every day for a week
  • Every week for a month
  • Every month

Install with the command:

sudo apt install zfs-auto-snapshot

That’s it. You’ll see new snapshot folders, named by time, appear under the hidden .zfs/snapshot directory at the root of each filesystem. Each filesystem gets its own. Anytime you need to look for a file you’ve deleted, you’ll find it there.

# Look at snapshots
ls /srv/archive/.zfs/snapshot/

Excluding datasets from auto snapshot

# Disable
zfs set com.sun:auto-snapshot=false rpool/export

Excluding frequent or other auto-snapshot

There are sub-properties you can set under the basic auto-snapshot value

zfs set com.sun:auto-snapshot=true pool02/someDataSet
zfs set com.sun:auto-snapshot:frequent=false  pool02/someDataSet

zfs get com.sun:auto-snapshot pool02/someDataSet
zfs get com.sun:auto-snapshot:frequent pool02/someDataSet

# Possibly also the number to keep if other than the default is desired
zfs set com.sun:auto-snapshot:weekly=true,keep=52

# Take only weekly
zfs set com.sun:auto-snapshot:weekly=true rpool/export

Deleting Lots of auto-snapshot files

You can’t use globbing or similar to mass-delete snapshots, but you can string together a couple commands.

# Disable auto-snap as needed
zfs set com.sun:auto-snapshot=false pool04
zfs list -H -o name -t snapshot | grep auto | xargs -n1 zfs destroy

Missing Auto Snapshots

On some CentOS-based systems, like XCP-NG, you will only see frequent snapshots. This is because only the frequent cron job uses the correct path. You must add a PATH statement to the other cron jobs.
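A minimal sketch of that fix, assuming the package installed the usual cron.hourly/daily/weekly/monthly scripts with a shebang on line 1 (paths may differ on your system):

for f in /etc/cron.hourly/zfs-auto-snapshot /etc/cron.daily/zfs-auto-snapshot \
         /etc/cron.weekly/zfs-auto-snapshot /etc/cron.monthly/zfs-auto-snapshot; do
    # Insert a PATH line after the shebang so the script can find the zfs binary
    sudo sed -i '2i PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin' "$f"
done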

https://forum.level1techs.com/t/setting-up-zfs-auto-snapshot-on-centos-7/129574/12

Next Step

Now that you have snapshots, let’s send them somewhere for backup with replication.

References

https://www.reddit.com/r/zfs/comments/829v5a/zfs_ubuntu_1604_delete_snapshots_with_wildcard/

1.4.8.3 - Replication

Replication is how you backup and copy ZFS. It turns a snapshot into a bit-stream that you can pipe to something else. Usually, you pipe it over the network to another system where you connect it to zfs receive.

It is also the only way. The snapshot is what allows point-in-time handling and the receive ensures consistency. And a snapshot is a filesystem, but it’s more than just the files. Two identically named filesystems with the same files you put in place by rsync are not the same filesystem and you can’t jump-start a sync this way.

Basic Examples

# On the receiving side, create a pool to hold the filesystem
zpool create -m none pool02 raidz1 sdc sdd sde sdf

# On the sending side, pipe over SSH. The -F forces the filesystem on the receiving side to be replaced
zfs snapshot pool01@snap1
zfs send pool01@snap1 | ssh some.other.server zfs receive -F pool02

Importantly, this replaces the root filesystem on the receiving side. The filesystem you just copied over is accessible when the replication is finished - assuming it’s mounted and you’re only using the default root. If you’re using multiple filesystems, you’ll want to recursively send things so you can pick up children like the archive filesystem.

# The -r and -R trigger recursive operations
zfs snapshot -r pool01@snap1
zfs send -R pool01@snap1 | ssh some.other.server zfs receive -F pool02

You can also pick a specific filesystem to send. You can name it whatever you like on the other side, or replace something already named.

# Sending just the archive filesystem
zfs snapshot pool01/archive@snap1
zfs send pool01/archive@snap1 | ssh some.other.server zfs receive -F pool02/archive

And of course, you may have two pools on the same system. One line in a terminal is all you need.

zfs send -R pool01@snap1 | zfs receive -F pool02

Using Mbuffer or Netcat

These are much faster than ssh if you don’t care about someone capturing the traffic. But it does require you to start both ends separately.

# On the receiving side
ssh some.other.system
mbuffer -4 -s 128k -m 1G -I 8990 | zfs receive -F pool02/archive

# On the sending side
zfs send pool01/archive@snap1 | mbuffer -s 128k -m 1G -O some.other.system:8990

You can also use good-ole netcat. It’s a little slower but still faster than SSH. Combine it with pv for some visuals.

# On the receiving end
nc -l 8989 | pv -trab -B 500M | zfs recv -F pool02/archive

# On the sending side
zfs send pool01/archive@snap1 | nc some.other.system 8989

Estimating The Size

You may want to know how big the transmit is to estimate time or capacity. You can do this with a dry-run.

zfs send -nv pool01/archive@snap1

Use a Resumable Token

Any interruptions and you have to start all over again - or do you? If you’re sending a long-running transfer, add a token on the receiving side and you can restart from where it broke, turning a tragedy into just an annoyance.

# On the receiving side, add -s
ssh some.other.system
mbuffer -4 -s 128k -m 1G -I 8990 | zfs receive -s -F pool01/archive

# Send the stream normally
zfs send pool01/archive@snap1 | mbuffer -s 128k -m 1G -O some.other.system:8990

# If you get interrupted, on the receiving side, look up the token
zfs get -H -o value receive_resume_token pool01

# Then use that on the sending side to resume where you left off
zfs send -t 'some_long_key' | mbuffer -s 128k -m 1G -O some.other.system:8990

If you decide you don’t want to resume, clean up with the -A command to release the space consumed by the pending transfer.

# On the receiving side
zfs recv -A pool01/archive

Sending an Incremental Snapshot

After you’ve sent the initial snapshot, subsequent ones are much smaller. Even very large backups can be kept current if you ‘pre-seed’ before taking the other pool remote.

# at some point in past
zfs snapshot pool01/archive@snap1
zfs send pool01/archive@snap1 ....

# now we'll snap again and send just the changes between the two using  -i 
zfs snapshot pool01/archive@snap2
zfs send -i pool01/archive@snap1 pool01/archive@snap2 ...

Sending Intervening Snapshots

If you are jumping more than one snapshot ahead, the intervening ones are skipped. If you want to include them for some reason, use the -I option.

# This will send all the snaps between 1 and 9
zfs send -I pool01/archive@snap1 pool01/archive@snap9 ...

Changes Are Always Detected

You’ll often need to use -F to force changes. Even though you haven’t touched the remote system, it may think you have if the filesystem is mounted and atimes are on.
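One way to avoid that, sketched here with the dataset names from above, is to keep the receiving side read-only and turn atime off:

# On the receiving side
zfs set readonly=on pool02/archive
zfs set atime=off pool02/archive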

You must have a snapshot in common

You need at least 1 snapshot in common at both locations. This must have been sent from one to the other, not just named the same. Say you create snap1 and send it. Later, you create snap2 and, thinking you don’t need it anymore, delete snap1. Even though you have snap1 on the destination you cannot send snap2 as a delta. You need snap1 in both locations to generate the delta. You are out of luck and must send snap2 as a full snapshot.

You can use a zfs feature called a bookmark as an alternative, but that is something you set up in advance and won’t save you from the above scenario.
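For reference, a sketch of how that advance setup looks with the dataset names from above. The bookmark takes essentially no space and can stand in for the deleted snapshot on the sending side (the destination still needs snap1):

# Before deleting snap1 on the source, bookmark it
zfs bookmark pool01/archive@snap1 pool01/archive#snap1-bm

# Later, send the delta from the bookmark instead of the (now deleted) snapshot
zfs send -i pool01/archive#snap1-bm pool01/archive@snap2 | ssh some.other.server zfs receive -F pool02/archive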

A Full Example of an Incremental Send

Take a snapshot, estimate the size with a dry-run, then use a resumable token and force changes

zfs snapshot pool01/archive@12th
zfs send -nv -i pool01/archive@11th pool01/archive@12th
zfs send -i pool01/archive@11th pool01/archive@12th | pv -trab -B 500M | ssh some.other.server zfs recv -F -s pool01/archive

Here’s an example of a recursive snapshot to a file. The snapshot takes a -r for recursive, and the send a -R.

# Take the snapshot
zfs snapshot -r pool01/archive@21

# Estimate the size
zfs send -nv -R -i pool01/archive@20 pool01/archive@21

# Mount a portable drive to /mnt - a traditional ext4 or whatever
zfs send -Ri pool01/archive@20 pool01/archive@21 > /mnt/backupfile

# When you get to the other location, receive from the file
zfs receive -s -F pool02/archive < /mnt/backupfile

Sending The Whole Pool

This is a common thought when you’re starting, but it ends up being deceptive because pools aren’t things that can be sent - only filesets. So what you’re actually doing is a recursive send of the root fileset with an implicit snapshot created on the fly. This is fine, but you won’t be able to refer to it later for updates, so you’re better off not doing it.

# Unmount -a (all available filesystems) on the given pool
zfs unmount -a pool01

# Send the unmounted filesystem with an implicit snapshot
zfs send -R pool01 ...

Auto Replication

You don’t want to do this by hand all the time. One way is with a simple script. If you’ve already installed zfs-auto-snapshot you may have something that looks like this:

# use '-o name' to get just the snapshot name without all the details 
# use '-s creation' to sort by creation time
zfs list -t snapshot -o name -s creation pool01/archive

pool01/archive@auto-2024-10-13_00-00
pool01/archive@auto-2024-10-20_00-00
pool01/archive@auto-2024-10-27_00-00
pool01/archive@auto-2024-11-01_00-00
pool01/archive@auto-2024-11-02_00-00

You can get the last two like this, then use send and receive. Adjust the grep to get just the daily as needed.

CURRENT=$(zfs list -t snapshot -o name -s creation pool01/archive | grep auto | tail -1 )
LAST=$(zfs list -t snapshot -o name -s creation pool01/archive | grep auto | tail -2 | head -1)

zfs send -i $LAST $CURRENT | pv -trab -B 500M | ssh some.other.server zfs recv -F -s pool01/archive

This is pretty basic and can fall out of sync. You can bring it up a notch by asking the other side to list its snapshots with a zfs list over ssh, then comparing against yours to find the most recent match. And add a resume token. But by that point you may consider just using a tool like:

This looks to replace your auto snapshots as well, and that’s probably fine. I haven’t used it myself as I scripted things back in the day. I will probably start, though.

Next Step

Sometimes, good disks go bad. Learn how to catch them before they do, and replace them when needed.

Scrub and Replacement

1.4.8.4 - Disk Replacement

Your disks will fail, but you’ll usually get some warnings because ZFS proactively checks every occupied bit to guard against silent corruption. This is normally done every month, but you can launch one manually if you’re suspicious. They take a long time, but operate at low priority.

# Start a scrub
zpool scrub pool01

# Stop a scrub that's in progress
zpool scrub -s pool01

You can check the status of your pool at anytime with the command zpool status. When there’s a problem, you’ll see this:

zpool status

NAME            STATE     READ WRITE CKSUM
pool01          DEGRADED     0     0     0
  raidz1        DEGRADED     0     0     0
    /dev/sda    ONLINE       0     0     0
    /dev/sdb    ONLINE       0     0     0
    /dev/sdc    ONLINE       0     0     0
    /dev/sdd    FAULTED     53     0     0  too many errors    

Time to replace that last drive before it goes all the way bad.

# You don't need to manually offline the drive if it's faulted, but it's good practice, as there are other states it can be in
zpool offline pool01 /dev/sdd

# Physically replace that drive. If you're shutting down to do this, the replacement usually has the same device path
zpool replace pool01 /dev/sdd

There are a lot of strange things that can happen with drives, and depending on your version of ZFS it might be using UUIDs or other drive identification strings. Check the link below for some of those conditions.

1.4.8.5 - Large Pools

You’ve guarded against disk failure by adding redundancy, but was it enough? There’s a very mathy calculator at https://jro.io/r2c2/ that will allow you to chart different parity configs. But a reasonable rule of thumb is to devote 20%, or 1 in 5 drives, to parity.

  • RAIDZ1 - up to 5 Drives
  • RAIDZ2 - up to 10 Drives
  • RAIDZ3 - up to 15 Drives

Oracle however, recommends Virtual Devices when you go past 9 disks.

Pools and Virtual Devices

When you get past 15 drives, you can’t increase parity. You can, however, create virtual devices. Best practice from Oracle says to do this even earlier, as a VDev should be less than 9 disks1. So given 24 disks, you should have 3 VDevs of 8 each. Here’s an example with 2 parity. Slightly better than 1 in 5 and suitable for older disks.

### Build a 3-Wide RAIDZ2  across 24 disks
zpool create \
  pool01 \
  -m none \
  -f \
  raidz2 sdb sdc sdd sde sdf sdg sdh sdi \
  raidz2 sdj sdk sdl sdm sdn sdo sdp sdq \
  raidz2 sdr sds sdt sdu sdv sdw sdx sdy 

Using Disk IDs

Drive letters can be hard to trace back to a physical drive. A better2 way is the /dev/disk/by-id identifiers.

ls /dev/disk/by-id | grep ata | grep -v part

zpool create -m none -o ashift=12 -O compression=lz4 \
  pool04 \
    raidz2 \
      ata-ST4000NM0035-1V4107_ZC11AHH9 \
      ata-ST4000NM0035-1V4107_ZC116F11 \
      ata-ST4000NM0035-1V4107_ZC1195V5 \
      ata-ST4000NM0035-1V4107_ZC11CDMB \
      ata-ST4000NM0035-1V4107_ZC1195PR \
      ata-ST4000NM0024-1HT178_Z4F164WG \
      ata-ST4000NM0024-1HT178_Z4F17SJK \
      ata-ST4000NM0024-1HT178_Z4F17M6B \
      ata-ST4000NM0024-1HT178_Z4F18FZE \
      ata-ST4000NM0024-1HT178_Z4F18G35 

Hot and Distributed Spares

Spares vs Parity

You may not be able to reach a location quickly when a disk fails. In such a case, is it better to have a Z3 filesystem run in degraded-performance mode (i.e. calculating parity the whole time) or a Z2 system that replaces the failed disk automatically?

It’s better to have more parity until you go past the guideline of 15 drives in a Z3 config. If you have 16 bays, add a hot spare.

Distributed vs Dedicated

A distributed spare is a newer feature that allows you to reserve space on all of your disks, rather than just one. That allows resilvering to go much faster as you’re no longer limited by the speed of one disk. Here’s an example of such a pool with 16 total devices.

# This pool has 3 parity, 12 data, 16 total count, with 1 spare
zpool create -f pool02 \
  draid3:12d:16c:1s \
    ata-ST4000NM000A-2HZ100_WJG04M27 \
    ata-ST4000NM000A-2HZ100_WJG09BH7 \
    ata-ST4000NM000A-2HZ100_WJG0QJ7X \
    ata-ST4000NM000A-2HZ100_WS20ECCD \
    ata-ST4000NM000A-2HZ100_WS20ECFH \
    ata-ST4000NM000A-2HZ100_WS20JXTA \
    ata-ST4000NM0024-1HT178_Z4F14K76 \
    ata-ST4000NM0024-1HT178_Z4F17SJK \
    ata-ST4000NM0024-1HT178_Z4F17YBP \
    ata-ST4000NM0024-1HT178_Z4F1BJR1 \
    ata-ST4000NM002A-2HZ101_WJG0GBXB \
    ata-ST4000NM002A-2HZ101_WJG11NGC \
    ata-ST4000NM0035-1V4107_ZC1168N3 \
    ata-ST4000NM0035-1V4107_ZC116F11 \
    ata-ST4000NM0035-1V4107_ZC116MSW \
    ata-ST4000NM0035-1V4107_ZC116NZM

References

1.4.8.6 - Pool Testing

Best Practices

Best practice from Oracle says a VDev should be less than 9 disks1. So given 24 disks, you should have 3 VDevs. They further recommend the following amount of parity vs data:

  • single-parity starting at 3 disks (2+1)
  • double-parity starting at 6 disks (4+2)
  • triple-parity starting at 9 disks (6+3)

It is not recommended to create a zpool with a single large vdev, say 20 disks, because write IOPS performance will be that of a single disk, which also means that resilver time will be very long (possibly weeks with future large drives).

Reasons For These Practices

I interpret this as meaning that when a single IO write operation is given to the VDev, it won’t write anything else until it’s done. But if you have multiple VDevs, you can hand out writes to other VDevs while you’re waiting on the first. Reading is probably unaffected, but writes will be faster with more VDevs.

Also, when resilvering the array, you have to read from each of the drives in the VDev to calculate the parity bit. If you have 24 drives in a VDev, then you have to read a block of data from all 24 drives to produce the parity bit. If you have only 8, then you have only 1/3 as much data to read. Meanwhile, the rest of the VDevs are available for real work.

Rebuilding the array also introduces stress which can cause other disks to fail, so it’s best to limit that to a smaller set of drives. I’ve heard many times of resilvering causing sister drives that were already on the edge, to go over and fail the array.

Calculating Failure Rates

You can calculate the failure rates of different configurations with an on-line tool2. The chart scales the X axis by 50, so the differences in failure rates are not as large as they seem - but without that scaling you wouldn’t be able to see the lines. In most cases there’s not a large difference, say, between a 4x9 and a 3x12.

When To Use a Hot Spare

Given 9 disks where one fails, is it better to drop from 3 parity to 2 and run in degraded mode, or to have 2 parity that drops to 1 plus a spare that recovers without intervention? The math2 says it’s better to have parity. But what about speed? When you lose a disk, 1 out of every 9 IOPS requires that you reconstruct it from parity. Anecdotally, the observed performance penalty is minor. So the only times to use a hot spare are:

  • When you have unused capacity in RAIDZ3 (i.e. almost never)
  • When IOPS require a mirror pool

Say you have 16 bays of 4TB Drives. A 2x8 Z2 config gives you 48TB but you only want 32TB. Change that to a 2x8 Z3 and get 40TB. Still only need 32 TB? Change that to a 2x7 Z3 with 2 hot spares. Now you have 32TB with the maximum protection and the insurance of an automatic replacement.

Or maybe you have a 37 bay system. You do something that equals 36 plus a spare.

The other case is when your IOPS demands push past what RAIDZ can do and you must use a mirror pool. A failure there loses all redundancy and a hot spare is your only option.
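For reference, adding a dedicated hot spare to an existing pool is a one-liner (the device name here is hypothetical); the ZFS event daemon typically kicks the spare in when a drive faults:

# Add a dedicated hot spare to the pool
zpool add pool01 spare /dev/sdz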

When To Use a Distributed Spare

A distributed spare recovers in half the time3 from a disk loss, and is always better than a dedicated spare - though you should almost never use a spare anyway. The only time to use a normal hot spare is when you have a single global spare.

Testing Speed

The speed difference isn’t charted. So let’s test that some.

Given 24 disks, and deciding to live dangerously, should you have a single 24-disk VDev with three parity disks, or three VDevs with a single parity disk each? The argument for the first case is better resiliency, and for the latter, better write speed and recovery from disk failures.

Build a 3-Wide RAIDZ1

Create the pool across 24 disks

zpool create \
-f -m /srv srv \
raidz sdb sdc sdd sde sdf sdg sdh sdi \
raidz sdj sdk sdl sdm sdn sdo sdp sdq \
raidz sdr sds sdt sdu sdv sdw sdx sdy 

Now copy a lot of random data to it

#!/bin/bash

no_of_files=1000
counter=0
while [[ $counter -le $no_of_files ]]
 do echo Creating file no $counter
   touch random-file.$counter
   shred -n 1 -s 1G random-file.$counter
   let "counter += 1"
 done

Now yank (literally) one of the physical disks and replace it

allen@server:~$ sudo zpool status  
                                                                             
  pool: srv                                                                                              
 state: DEGRADED                                                                                         
status: One or more devices could not be used because the label is missing or                            
        invalid.  Sufficient replicas exist for the pool to continue                                     
        functioning in a degraded state.                                                                 
action: Replace the device using 'zpool replace'.                                                        
   see: http://zfsonlinux.org/msg/ZFS-8000-4J                                                            
  scan: none requested                                                                                   
config:                                                                                                  
                                                                                                         
        NAME                     STATE     READ WRITE CKSUM                                              
        srv                      DEGRADED     0     0     0                                              
          raidz1-0               DEGRADED     0     0     0                                              
            sdb                  ONLINE       0     0     0                                              
            6847353731192779603  UNAVAIL      0     0     0  was /dev/sdc1                               
            sdd                  ONLINE       0     0     0                                              
            sde                  ONLINE       0     0     0                                              
            sdf                  ONLINE       0     0     0                                              
            sdg                  ONLINE       0     0     0                                              
            sdh                  ONLINE       0     0     0                                              
            sdi                  ONLINE       0     0     0                                              
          raidz1-1               ONLINE       0     0     0                                              
            sdr                  ONLINE       0     0     0                                              
            sds                  ONLINE       0     0     0                                              
            sdt                  ONLINE       0     0     0                                              
            sdu                  ONLINE       0     0     0                                              
            sdj                  ONLINE       0     0     0                                              
            sdk                  ONLINE       0     0     0                                              
            sdl                  ONLINE       0     0     0                                              
            sdm                  ONLINE       0     0     0  
          raidz1-2               ONLINE       0     0     0  
            sdv                  ONLINE       0     0     0  
            sdw                  ONLINE       0     0     0  
            sdx                  ONLINE       0     0     0  
            sdy                  ONLINE       0     0     0  
            sdn                  ONLINE       0     0     0  
            sdo                  ONLINE       0     0     0  
            sdp                  ONLINE       0     0     0  
            sdq                  ONLINE       0     0     0             
                                                                           
errors: No known data errors

allen@server:~$ lsblk                                                                                    
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT                                                              
sda      8:0    0 465.8G  0 disk                                                                         
├─sda1   8:1    0 449.9G  0 part /                                                                       
├─sda2   8:2    0     1K  0 part                                                                         
└─sda5   8:5    0  15.9G  0 part [SWAP]                                                                  
sdb      8:16   1 931.5G  0 disk                                                                         
├─sdb1   8:17   1 931.5G  0 part                                                                         
└─sdb9   8:25   1     8M  0 part                                                                         
sdc      8:32   1 931.5G  0 disk                                                                         
sdd      8:48   1 931.5G  0 disk                                                                         
├─sdd1   8:49   1 931.5G  0 part                                                                         
└─sdd9   8:57   1     8M  0 part    
...


sudo zpool replace srv 6847353731192779603 /dev/sdc -f

allen@server:~$ sudo zpool status
  pool: srv
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Mar 22 15:50:21 2019
    131G scanned out of 13.5T at 941M/s, 4h7m to go
    5.40G resilvered, 0.95% done
config:

        NAME                       STATE     READ WRITE CKSUM
        srv                        DEGRADED     0     0     0
          raidz1-0                 DEGRADED     0     0     0
            sdb                    ONLINE       0     0     0
            replacing-1            OFFLINE      0     0     0
              6847353731192779603  OFFLINE      0     0     0  was /dev/sdc1/old
              sdc                  ONLINE       0     0     0  (resilvering)
            sdd                    ONLINE       0     0     0
            sde                    ONLINE       0     0     0
            sdf                    ONLINE       0     0     0
...

A few hours later…

$ sudo zpool status


  pool: srv
 state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 571G in 5h16m with 2946 errors on Fri Mar 22 21:06:48 2019
config:

    NAME                       STATE     READ WRITE CKSUM
    srv                        DEGRADED   208     0 2.67K
        raidz1-0               DEGRADED   208     0 5.16K
        sdb                    ONLINE       0     0     0
        replacing-1            OFFLINE      0     0     0
          6847353731192779603  OFFLINE      0     0     0  was /dev/sdc1/old
        sdc                    ONLINE       0     0     0
        sdd                    ONLINE     208     0     1
        sde                    ONLINE       0     0     0
        sdf                    ONLINE       0     0     0
        sdg                    ONLINE       0     0     0
        sdh                    ONLINE       0     0     0
        sdi                    ONLINE       0     0     0
      raidz1-1                 ONLINE       0     0     0
        sdr                    ONLINE       0     0     0
        sds                    ONLINE       0     0     0
        sdt                    ONLINE       0     0     0
        sdu                    ONLINE       0     0     0
        sdj                    ONLINE       0     0     1
        sdk                    ONLINE       0     0     1
        sdl                    ONLINE       0     0     0
        sdm                    ONLINE       0     0     0
      raidz1-2                 ONLINE       0     0     0
        sdv                    ONLINE       0     0     0
        sdw                    ONLINE       0     0     0
        sdx                    ONLINE       0     0     0
        sdy                    ONLINE       0     0     0
        sdn                    ONLINE       0     0     0
        sdo                    ONLINE       0     0     0
        sdp                    ONLINE       0     0     0
        sdq                    ONLINE       0     0     0

The time was 5h16m. But notice the error - during resilvering drive sdd had 208 read errors and data was lost. This is the classic RAID situation where resilvering stresses the drives, another goes bad and you can’t restore.

It’s somewhat questionable whether this is a valid test, as the effect of the error on resilvering duration is unknown. But on with the test.

Let’s wipe that away and create a raidz3

sudo zpool destroy srv


zpool create \
-f -m /srv srv \
raidz3 \
 sdb sdc sdd sde sdf sdg sdh sdi \
 sdj sdk sdl sdm sdn sdo sdp sdq \
 sdr sds sdt sdu sdv sdw sdx sdy



zdb
zpool offline srv 15700807100581040709
sudo zpool replace srv 15700807100581040709 sdc




allen@server:~$ sudo zpool status
  pool: srv
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Mar 24 10:07:18 2019
    27.9G scanned out of 9.14T at 362M/s, 7h19m to go
    1.21G resilvered, 0.30% done
config:

    NAME             STATE     READ WRITE CKSUM
    srv              DEGRADED     0     0     0
      raidz3-0       DEGRADED     0     0     0
        sdb          ONLINE       0     0     0
        replacing-1  OFFLINE      0     0     0
          sdd        OFFLINE      0     0     0
          sdc        ONLINE       0     0     0  (resilvering)
        sde          ONLINE       0     0     0
        sdf          ONLINE       0     0     0
        ...


allen@server:~$ sudo zpool status

  pool: srv
 state: ONLINE
  scan: resilvered 405G in 6h58m with 0 errors on Sun Mar 24 17:05:50 2019
config:

    NAME        STATE     READ WRITE CKSUM
    srv         ONLINE       0     0     0
      raidz3-0  ONLINE       0     0     0
        sdb     ONLINE       0     0     0
        sdc     ONLINE       0     0     0
        sde     ONLINE       0     0     0
        sdf     ONLINE       0     0     0
        sdg     ONLINE       0     0     0
        ...

The time? 6h58m. Longer, but safer.

1.4.8.7 - ZFS Cache

Metadata Cache

There is a lot out there about ZFS cache config. I’ve found the most significant feature to be putting your metadata on dedicated NVMe devices. This is noted as a ‘Special’ VDev. Here’s an example of a draid with such a device at the end.

Note: A 2x18 is bad practice - just more fun than a 3x12 with no spares.

zpool create -f pool02 \
    draid3:14d:18c:1s \
        ata-ST4000NM000A-2HZ100_WJG04M27 \
        ata-ST4000NM000A-2HZ100_WJG09BH7 \
        ata-ST4000NM000A-2HZ100_WJG0QJ7X \
        ata-ST4000NM000A-2HZ100_WS20ECCD \
        ata-ST4000NM000A-2HZ100_WS20ECFH \
        ata-ST4000NM000A-2HZ100_WS20JXTA \
        ata-ST4000NM0024-1HT178_Z4F14K76 \
        ata-ST4000NM0024-1HT178_Z4F17SJK \
        ata-ST4000NM0024-1HT178_Z4F17YBP \
        ata-ST4000NM0024-1HT178_Z4F1BJR1 \
        ata-ST4000NM002A-2HZ101_WJG0GBXB \
        ata-ST4000NM002A-2HZ101_WJG11NGC \
        ata-ST4000NM0035-1V4107_ZC1168N3 \
        ata-ST4000NM0035-1V4107_ZC116F11 \
        ata-ST4000NM0035-1V4107_ZC116MSW \
        ata-ST4000NM0035-1V4107_ZC116NZM \
        ata-ST4000NM0035-1V4107_ZC118WV5 \
        ata-ST4000NM0035-1V4107_ZC118WW0 \
    draid3:14d:18c:1s \
        ata-ST4000NM0035-1V4107_ZC118X74 \
        ata-ST4000NM0035-1V4107_ZC118X90 \
        ata-ST4000NM0035-1V4107_ZC118XBS \
        ata-ST4000NM0035-1V4107_ZC118Z23 \
        ata-ST4000NM0035-1V4107_ZC11907W \
        ata-ST4000NM0035-1V4107_ZC1192GG \
        ata-ST4000NM0035-1V4107_ZC1195PR \
        ata-ST4000NM0035-1V4107_ZC1195V5 \
        ata-ST4000NM0035-1V4107_ZC1195ZJ \
        ata-ST4000NM0035-1V4107_ZC11AHH9 \
        ata-ST4000NM0035-1V4107_ZC11CDD0 \
        ata-ST4000NM0035-1V4107_ZC11CE77 \
        ata-ST4000NM0035-1V4107_ZC11CV5E \
        ata-ST4000NM0035-1V4107_ZC11D2AQ \
        ata-ST4000NM0035-1V4107_ZC11HRGR \
        ata-ST4000NM0035-1V4107_ZC1B200R \
        ata-ST4000NM0035-1V4107_ZC1CBXEH \
        ata-ST4000NM0035-1V4107_ZC1DC98B \
    special mirror \
        ata-MICRON_M510DC_MTFDDAK960MBP_164614A1DBC4 \
        ata-MICRON_M510DC_MTFDDAK960MBP_170615BD4A74

zfs set special_small_blocks=64K pool02

Metadata is stored automatically on the special device, but there’s a benefit in also directing the pool to use the special vdev for small files as well.

Sources

https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954

1.4.8.8 - ZFS Encryption

You might want to store data such that it’s encrypted at rest, or replicate data to such a system. ZFS offers this as a per-dataset option.

Create an Encrypted Fileset

Let’s assume that you’re at a remote site and want to create an encrypted fileset to receive your replications.

zfs create -o encryption=on -o keylocation=prompt -o keyformat=passphrase pool02/encrypted

Replicating to an Encrypted Fileset

This example uses mbuffer and assumes a secure VPN. Replace with SSH as needed.

# On the receiving side
sudo zfs load-key -r pool02/encrypted
mbuffer -4 -s 128k -m 1G -I 8990 | sudo zfs receive -s -F pool02/encrypted

# On the sending side
zfs send -i pool01/archive@snap1 pool01/archive@snap2 | mbuffer -s 128k -m 1G -O some.server:8990

1.4.8.9 - VDev Sizing

Best practice from Oracle says a VDev should be less than 9 disks1. So given 24 disks you should have 3 VDevs. However, when using RAIDZ, the math shows they should be as large as possible with multiple parity disks2. I.e. with 24 disks you should have a single, 24 disk VDev.

The reason for the best practice seems to be about the speed of writing and recovering from disk failures.

It is not recommended to create a zpool with a single large vdev, say 20 disks, because write IOPS performance will be that of a single disk, which also means that resilver time will be very long (possibly weeks with future large drives).

With a single VDev, you break up the data to send a chunk to each drive, then wait for them all to finish writing before you send the next. With several VDevs, you can move on to the next while you wait for the others to finish.

Build a 3-Wide RAIDZ1

Create the pool across 24 disks

#
# -O is the pool's root dataset. Lowercase letter -o is for pool properties
# sudo zfs get compression to check. lz4 is now prefered
#
zpool create \
-m /srv srv \
-O compression=lz4 \
raidz sdb sdc sdd sde sdf sdg sdh sdi \
raidz sdj sdk sdl sdm sdn sdo sdp sdq \
raidz sdr sds sdt sdu sdv sdw sdx sdy -f

Copy a lot of random data to it.

#!/bin/bash

no_of_files=1000
counter=0
while [[ $counter -le $no_of_files ]]
 do echo Creating file no $counter
   touch random-file.$counter
   shred -n 1 -s 1G random-file.$counter
   let "counter += 1"
 done

Yank out (literally) one of the physical disks and replace it.

sudo zpool status                                                               [433/433]

  pool: srv                                                                                              
 state: DEGRADED                                                                                         
status: One or more devices could not be used because the label is missing or                            
        invalid.  Sufficient replicas exist for the pool to continue                                     
        functioning in a degraded state.                                                                 
action: Replace the device using 'zpool replace'.                                                        
   see: http://zfsonlinux.org/msg/ZFS-8000-4J                                                            
  scan: none requested                                                                                   
config:                                                                                                  
                                                                                                         
        NAME                     STATE     READ WRITE CKSUM                                              
        srv                      DEGRADED     0     0     0                                              
          raidz1-0               DEGRADED     0     0     0                                              
            sdb                  ONLINE       0     0     0                                              
            6847353731192779603  UNAVAIL      0     0     0  was /dev/sdc1                               
            sdd                  ONLINE       0     0     0                                              
            sde                  ONLINE       0     0     0                                              
            sdf                  ONLINE       0     0     0                                              
            sdg                  ONLINE       0     0     0                                              
            sdh                  ONLINE       0     0     0                                              
            sdi                  ONLINE       0     0     0                                              
          raidz1-1               ONLINE       0     0     0                                              
            sdr                  ONLINE       0     0     0                                              
            sds                  ONLINE       0     0     0                                              
            sdt                  ONLINE       0     0     0                                              
            sdu                  ONLINE       0     0     0                                              
            sdj                  ONLINE       0     0     0                                              
            sdk                  ONLINE       0     0     0                                              
            sdl                  ONLINE       0     0     0                                              
            sdm                  ONLINE       0     0     0  
          raidz1-2               ONLINE       0     0     0  
            sdv                  ONLINE       0     0     0  
            sdw                  ONLINE       0     0     0  
            sdx                  ONLINE       0     0     0  
            sdy                  ONLINE       0     0     0  
            sdn                  ONLINE       0     0     0  
            sdo                  ONLINE       0     0     0  
            sdp                  ONLINE       0     0     0  
            sdq                  ONLINE       0     0     0             
                                                                           
errors: No known data errors

Insert a new disk and replace the missing one.

allen@server:~$ lsblk                                                                                    
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT                                                              
sda      8:0    0 465.8G  0 disk                                                                         
├─sda1   8:1    0 449.9G  0 part /                                                                       
├─sda2   8:2    0     1K  0 part                                                                         
└─sda5   8:5    0  15.9G  0 part [SWAP]                                                                  
sdb      8:16   1 931.5G  0 disk                                                                         
├─sdb1   8:17   1 931.5G  0 part                                                                         
└─sdb9   8:25   1     8M  0 part                                                                         
sdc      8:32   1 931.5G  0 disk                       # <-- new disk showed up here
sdd      8:48   1 931.5G  0 disk                                                                         
├─sdd1   8:49   1 931.5G  0 part                                                                         
└─sdd9   8:57   1     8M  0 part    
...


sudo zpool replace srv 6847353731192779603 /dev/sdc -f

sudo zpool status

  pool: srv
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Mar 22 15:50:21 2019
    131G scanned out of 13.5T at 941M/s, 4h7m to go
    5.40G resilvered, 0.95% done
config:

        NAME                       STATE     READ WRITE CKSUM
        srv                        DEGRADED     0     0     0
          raidz1-0                 DEGRADED     0     0     0
            sdb                    ONLINE       0     0     0
            replacing-1            OFFLINE      0     0     0
              6847353731192779603  OFFLINE      0     0     0  was /dev/sdc1/old
              sdc                  ONLINE       0     0     0  (resilvering)
            sdd                    ONLINE       0     0     0
            sde                    ONLINE       0     0     0
            sdf                    ONLINE       0     0     0
...

We can see it’s running at 941M/s. Not too bad.

Build a 1-Wide RAIDZ3

sudo zpool destroy srv

zpool create \
-m /srv srv \
-O compression=lz4 \
raidz3 \
 sdb sdc sdd sde sdf sdg sdh sdi \
 sdj sdk sdl sdm sdn sdo sdp sdq \
 sdr sds sdt sdu sdv sdw sdx sdy -f

Copy a lot of random data to it again (as above)

Replace a disk (as above)

allen@server:~$ sudo zpool status
  pool: srv
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
 continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Mar 24 10:07:18 2019
    27.9G scanned out of 9.14T at 362M/s, 7h19m to go
    1.21G resilvered, 0.30% done
config:

 NAME             STATE     READ WRITE CKSUM
 srv              DEGRADED     0     0     0
   raidz3-0       DEGRADED     0     0     0
     sdb          ONLINE       0     0     0
     replacing-1  OFFLINE      0     0     0
       sdd        OFFLINE      0     0     0
       sdc        ONLINE       0     0     0  (resilvering)
     sde          ONLINE       0     0     0
     sdf          ONLINE       0     0     0

So that’s running quite a bit slower. Not exactly 1/3, but closer to it than not.

Surprise Ending

That was all about speed. What about reliability?

Our first resilver was going a lot faster, but it ended badly. Other errors popped up, some on the same VDev as was being resilvered, and so it failed.

sudo zpool status

  pool: srv
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
 corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
 entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 571G in 5h16m with 2946 errors on Fri Mar 22 21:06:48 2019
config:

 NAME                       STATE     READ WRITE CKSUM
 srv                        DEGRADED   208     0 2.67K
   raidz1-0                 DEGRADED   208     0 5.16K
     sdb                    ONLINE       0     0     0
     replacing-1            OFFLINE      0     0     0
       6847353731192779603  OFFLINE      0     0     0  was /dev/sdc1/old
       sdc                  ONLINE       0     0     0
     sdd                    ONLINE     208     0     1
     sde                    ONLINE       0     0     0
     sdf                    ONLINE       0     0     0
     sdg                    ONLINE       0     0     0
     sdh                    ONLINE       0     0     0
     sdi                    ONLINE       0     0     0
   raidz1-1                 ONLINE       0     0     0
     sdr                    ONLINE       0     0     0
     sds                    ONLINE       0     0     0
     sdt                    ONLINE       0     0     0
     sdu                    ONLINE       0     0     0
     sdj                    ONLINE       0     0     1
     sdk                    ONLINE       0     0     1
     sdl                    ONLINE       0     0     0
     sdm                    ONLINE       0     0     0
   raidz1-2                 ONLINE       0     0     0
     sdv                    ONLINE       0     0     0
     sdw                    ONLINE       0     0     0
     sdx                    ONLINE       0     0     0
     sdy                    ONLINE       0     0     0
     sdn                    ONLINE       0     0     0
     sdo                    ONLINE       0     0     0
     sdp                    ONLINE       0     0     0
     sdq                    ONLINE       0     0     0

errors: 2946 data errors, use '-v' for a list

Our second resilver was going very slowly, but did slow and sure win the race? It did, but very, very slowly.

allen@server:~$ sudo zpool status
[sudo] password for allen: 
  pool: srv
 state: ONLINE
  scan: resilvered 405G in 6h58m with 0 errors on Sun Mar 24 17:05:50 2019
config:

 NAME        STATE     READ WRITE CKSUM
 srv         ONLINE       0     0     0
   raidz3-0  ONLINE       0     0     0
     sdb     ONLINE       0     0     0
     sdc     ONLINE       0     0     0
     sde     ONLINE       0     0     0
     sdf     ONLINE       0     0     0
     sdg     ONLINE       0     0     0
      ...
      ...

It slowed down even further, as 400G in 7 hours is something like 16M/s. I didn’t see any checksum errors this time, but that time is abysmal.

Though, to paraphrase Livy, better late than never.

1.4.8.10 - ZFS Replication Script

#!/usr/bin/env bash
#
# zfs-pull.sh
#
# Pulls incremental ZFS snapshots from a remote (source) server to the local (destination) server.
# Uses snapshots made by zfs-auto-snapshot. Locates the latest snapshot common to both sides
# to perform an incremental replication; if none is found, it does a full send.
#
# Usage: replicate-zfs-pull.sh <SOURCE_HOST> <SOURCE_DATASET> <DEST_DATASET>
#
# Example:
#   ./replicate-zfs-pull.sh mysourcehost tank/mydata tank/backup/mydata
#
# Assumptions/Notes:
#   - The local server is the destination. The remote server is the source.
#   - We're using "zfs recv -F" locally, which can forcibly roll back the destination 
#     dataset if it has diverging snapshots. Remove or change -F as desired.
#   - This script is minimal and doesn't handle advanced errors or timeouts gracefully.
#   - Key-based SSH authentication should be set up so that `ssh <SOURCE_HOST>` doesn't require a password prompt.
#

set -euo pipefail

##############################################################################
# 1. Parse command-line arguments
##############################################################################
if [[ $# -ne 3 ]]; then
  echo "Usage: $0 <SOURCE_HOST> <SOURCE_DATASET> <DEST_DATASET>"
  exit 1
fi

SOURCE_HOST="$1"
SOURCE_DATASET="$2"
DEST_DATASET="$3"

##############################################################################
# 2. Gather snapshot lists
#
#    The command zfs list -H -t snapshot -o name -S creation -d 1
#          -H           : Output without headers for script-friendliness
#          -t snapshot  : Only list snapshots
#          -o name      : Only list the name
#          -d 1         : Only descend one level - i.e. don't tree out child datasets
##############################################################################
# - Remote (source) snapshots: via SSH to the remote host
# - Local (destination) snapshots: from the local ZFS

echo "Collecting snapshots from remote source: ${SOURCE_HOST}:${SOURCE_DATASET}..."
REMOTE_SNAPSHOTS=$(ssh "${SOURCE_HOST}" zfs list -H -t snapshot -o name -S creation -d 1 "${SOURCE_DATASET}" 2>/dev/null \
  | grep "${SOURCE_DATASET}@" \
  | awk -F'@' '{print $2}' || true)

echo "Collecting snapshots from local destination: ${DEST_DATASET}..."
LOCAL_SNAPSHOTS=$(zfs list -H -t snapshot -o name -S creation -d 1 "${DEST_DATASET}" 2>/dev/null \
  | grep "${DEST_DATASET}@" \
  | awk -F'@' '{print $2}' || true)

##############################################################################
# 3. Find the latest common snapshot
#
#   The snapshots names have prefixes like "zfs-auto-snap_daily" and "zfs-auto-snap_hourly"
#   that confuse sorting for the linux comm program, so we strip the prefix with sed before 
#   using 'comm -12' to find common elements of input 1 and 2, and tail to get the last one.
#  
COMMON_SNAPSHOT=$(comm -12 <(echo "$REMOTE_SNAPSHOTS" | sed 's/zfs-auto-snap_\w*-//' | sort) <(echo "$LOCAL_SNAPSHOTS" | sed 's/zfs-auto-snap_\w*-//' | sort) | tail -n 1)

if [[ -n "$COMMON_SNAPSHOT" ]]; then
  # We need the full name back for the transfer, so grep it out of the local list.
  # Make sure to quote the variables or you'll lose the newlines.
  COMMON_SNAPSHOT=$(echo "$LOCAL_SNAPSHOTS" | grep -F "$COMMON_SNAPSHOT")
  echo "Found common snapshot: $COMMON_SNAPSHOT"
else
  echo "No common snapshot found—will perform a full send."
fi


##############################################################################
# 4. Identify the most recent snapshot on the remote source
#
#   This works because we zfs list'ed the snapshots with -S creation (newest
#   first), so we can just take the first line with 'head -n 1'
##############################################################################
LATEST_REMOTE_SNAPSHOT=$(echo "$REMOTE_SNAPSHOTS" | head -n 1)

if [[ -z "$LATEST_REMOTE_SNAPSHOT" ]]; then
  echo "No snapshots found on the remote source. Check if zfs-auto-snapshot is enabled there."
  exit 1
fi

##############################################################################
# 5. Perform replication
##############################################################################
echo "Starting pull-based replication from ${SOURCE_HOST}:${SOURCE_DATASET} to local ${DEST_DATASET}..."

if [[ -n "$COMMON_SNAPSHOT" ]]; then
  echo "Performing incremental replication from @$COMMON_SNAPSHOT up to @$LATEST_REMOTE_SNAPSHOT."
  ssh "${SOURCE_HOST}" zfs send -I "${SOURCE_DATASET}@${COMMON_SNAPSHOT}" "${SOURCE_DATASET}@${LATEST_REMOTE_SNAPSHOT}" \
    | zfs recv -F "${DEST_DATASET}"
else
  echo "Performing full replication of @$LATEST_REMOTE_SNAPSHOT."
  ssh "${SOURCE_HOST}" zfs send "${SOURCE_DATASET}@${LATEST_REMOTE_SNAPSHOT}" \
    | zfs recv -F "${DEST_DATASET}"
fi

echo "Replication completed successfully!"

1.5 - TrueNAS

TrueNAS is a storage “appliance” in the sense that it’s a well put together linux system with a web administration layer. If you find yourself digging into the linux internals, you’re doing it wrong.

1.5.1 - Disk Replacement

You may get an alert in the GUI along the lines of

Device: /dev/sda [SAT], 8 Currently unreadable (pending) sectors.

This is one of the predictive failures that Backblaze mentions. You should replace the drive. You may also get an outright failure such as:

Pool pool01 state is DEGRADED: One or more devices has experienced an unrecoverable error.

That’s a drive that has already failed and likewise you must replace it.

Using The GUI

Get The Device Number

In the GUI, observe the device number from the alert. It will be something like sdXX, as in:

Device: /dev/sda [SAT], 24 Currently unreadable (pending) sectors.

Off-line the disk

Navigate to the device and mark it off-line.

Storage -> (pool name) -> Manage Devices -> (Select the disk) -> ZFS Info -> Offline Button

Physically Replace the Disk

You probably know what bay it is, but a mistake can take the pool down if you only have Z1. If you have a larger server it may have bay lights you can use. Check the command line section below on how to light the indicator.

Logically Replace The Drive

Navigate to the pool, find the drive and in the Disk Info menu click “replace”. The window that appears should allow you to pick the ’new’ drive. Only unused drives should be listed. If nothing is listed, you may need to run wipefs -a /dev/sdX on it first.

Observe The Resilvering Process

The system should automatically start rebuilding the array.

At The Command Line

Using the CLI to replace the disk is ‘strongly advised against’. The GUI takes several steps to prepare the disk and adds a partition to the pool, not the whole disk.

Identify and Off-Line The Disk

Use the gptid from zpool to get the device number, then off-line the disk.

sudo zpool status

 raidz3-2                                 ONLINE       0     0     0
    976e8f03-931d-4c9f-873e-048eeef08680  ONLINE       0     0     0
    f9384b4f-d94a-43b6-99c4-b8af6702ca42  ONLINE       0     0     0
    c5e4f2e5-62f2-41cc-a8de-836ff9683332  ONLINE       0     0 35.4K

sudo zpool offline pool01 c5e4f2e5-62f2-41cc-a8de-836ff9683332

find /dev/disk -name c5e4f2e5-62f2-41cc-a8de-836ff9683332 -exec ls -lah {} \;

lrwxrwxrwx 1 root root 11 Jan 10 08:41 /dev/disk/by-partuuid/c5e4f2e5-62f2-41cc-a8de-836ff9683332 -> ../../sdae1

sudo smartctl -a /dev/sdae1 | grep Serial

Serial Number:    WJG1LNP7

sas3ircu list

sas3ircu 0 display

sas3ircu 0 display | grep -B 10 WJG1LNP7

(Inspect the output for Enclosure and Slot to use in that order below)

sudo sas3ircu 0 locate 3:5 ON

Physically Replace The Drive

This is a physical swap - the indicator will be blinking red. Turn it off when you’re done

sudo sas3ircu 0 locate 3:5 OFF

Logically Replace The Removed Drive

(It’s probably the same device identifier, but you can tail the message log to make sure)

sudo dmesg | tail

sudo zpool replace pool01 c5e4f2e5-62f2-41cc-a8de-836ff9683332 sdae -f

sudo zpool status

 raidz3-2                                  DEGRADED     0     0     0
   976e8f03-931d-4c9f-873e-048eeef08680    ONLINE       0     0     0
   f9384b4f-d94a-43b6-99c4-b8af6702ca42    ONLINE       0     0     0
   replacing-2                             DEGRADED     0     0     0
     c5e4f2e5-62f2-41cc-a8de-836ff9683332  REMOVED      0     0     0
     sdae                                  ONLINE       0     0     0

Hot Spare

If you’re using a hot spare, you may need to detach it after the resilver is finished so that its status returns to spare. Check the spare’s ID at the bottom of the status output and then detach it.

zpool status

zpool detach pool01 9d794dfd-2ef6-432d-8252-0c93e79509dc

Troubleshooting

When working at the command line, you may need to download the sas3ircu utility from Broadcom.

wget https://docs.broadcom.com/docs-and-downloads/host-bus-adapters/host-bus-adapters-common-files/sas_sata_12g_p16_point_release/SAS3IRCU_P16.zip

If you forgot what light you turned on, you can turn off all slot lights.

for X in {0..23};do sas3ircu 0 locate 2:$X OFF;done

for X in {0..11};do sas3ircu 0 locate 3:$X OFF;done

To recreate the GUI process at the command line, as adapted from https://www.truenas.com/community/resources/creating-a-degraded-pool.100/, use the commands below. Note that gpart and glabel are not present on TrueNAS Scale, so you would have to adapt this to another tool (see the sgdisk sketch below).

gpart create -s gpt /dev/da18
gpart add -i 1 -b 128 -t freebsd-swap -s 2g /dev/da18
gpart add -i 2 -t freebsd-zfs /dev/da18

zpool replace pool01 65f61699-e2fc-4a36-86dd-b0fa6a77479
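
On TrueNAS Scale (Linux), a rough equivalent of that layout can be sketched with sgdisk instead. Treat this as a sketch only: /dev/sdX and the 2G swap size are assumptions mirroring the FreeBSD example above, so compare against an existing member disk before running anything.

sudo sgdisk --zap-all /dev/sdX
sudo sgdisk -n1:0:+2G -t1:8200 /dev/sdX     # partition 1: 2G swap
sudo sgdisk -n2:0:0   -t2:BF01 /dev/sdX     # partition 2: rest of the disk as ZFS

The zpool replace would then reference the new ZFS partition rather than the whole disk.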

2 - Operating Systems

2.1 - NetBoot

Most computers come with ‘firmware’. This is a built-in mini OS, embedded in the chips, that’s just smart enough to start things up and hand-off to something more capable.

That more-capable thing is usually an Operating System on a disk, but it can also be something over the network. This lets you:

  • Run an OS installer, such as when you don’t have one installed yet.
  • Run the whole OS remotely without having a local disk at all.

PXE

The original way was Intel’s PXE (Preboot eXecution Environment) Option ROM on their network cards. The IBM PC firmware (BIOS) would turn over execution to it, and PXE would use basic network drivers to get on the network.

HTTP Boot

Modern machines have newer firmware (UEFI) that includes logic for using HTTP/S without the need for add-ons. This simplifies things and also helps prevent man-in-the-middle attacks. Both methods are still generally called PXE booting, though.

Building a NetBoot Environment

Start by setting up a HTTP Boot system, then add PXE Booting and netboot.xyz to it. This gets you an installation system. Then proceed to diskless stations.

2.1.1 - HTTP Boot

We’ll set up a PXE Proxy server that runs DHCP and HTTP. This server can be used alongside your existing DHCP/DNS servers. We use Debian in this example but anything that runs dnsmasq should work.

Installation

sudo apt install dnsmasq lighttpd

Configuration

Server

Static IPs are best practice, though we’ll use a hostname in this config, so the main thing is that the server name netboot resolves correctly.
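
A quick sanity check from a client (or the server itself) that the name resolves; netboot is just the example hostname used throughout:

getent hosts netboot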

HTTP

Lighttpd serves up from /var/www/html, so just drop an ISO there. For example, take a look at the current debian ISO (the numbering changes) at https://www.debian.org/CD/netinst and copy the link in like so:

sudo wget https://cdimage.debian.org/debian-cd/current/amd64/iso-cd/debian-12.6.0-amd64-netinst.iso -O /var/www/html/debian.iso
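
To confirm the web server is actually handing the ISO out, a header request is enough (this assumes the name netboot resolves to this host):

curl -I http://netboot/debian.iso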

DHCP

When configured in proxy dhcp mode: “…dnsmasq simply provides the information given in --pxe-prompt and --pxe-service to allow netbooting”. So only certain settings are available. This is a bit vague, but testing reveals that you must set the boot file name with the dhcp-boot directive, rather than setting it with the more general DHCP option ID 67, for example.

sudo vi /etc/dnsmasq.d/netboot.conf 
# Disable DNS
port=0

# Set for DHCP PXE Proxy mode
dhcp-range=192.168.0.0,proxy

# Respond to clients that use 'HTTPClient' to identify themselves.
dhcp-pxe-vendor=HTTPClient

# Set the boot file name to the web server URL
dhcp-boot="http://netboot/debian.iso"

# PXE-service isn't actually used, but dnsmasq seems to need at least one entry to send the boot file name when in proxy mode.
pxe-service=x86-64_EFI,"Network Boot"
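
After saving the file, restart dnsmasq and confirm it's listening for DHCP. These are just the standard systemd and iproute2 commands:

sudo systemctl restart dnsmasq.service
sudo ss -ulnp | grep dnsmasq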

Client

Simply booting the client and selecting UEFI HTTP should be enough. The debian boot loader is signed and works with secure boot.

In addition to ISOs, you can also specify .efi binaries like grubx64.efi. Some distributions support this, though Debian itself may have issues.
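
As a sketch, serving an EFI binary instead of an ISO only changes the dhcp-boot line; dropping a grubx64.efi into /var/www/html is an assumption here, not something covered above:

dhcp-boot="http://netboot/grubx64.efi"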

Next Steps

You may want to support older clients by adding PXE Boot support.

Troubleshooting

dnsmasq

A good way to see what’s going on is to enable dnsmasq logging.

# Add these to the dnsmasq config file
log-queries
log-dhcp

# Restart and follow to see what's happening
sudo systemctl restart dnsmasq.service
sudo journalctl -u dnsmasq -f

If you’ve enabled logging in dnsmasq and it’s not seeing any requests, you may need to look at your networking. Some virtual environments suppress DHCP broadcasts when they are managing the IP range.

lighttpd

You can also see what’s being requested from the web server if you enable access logs.

cd /etc/lighttpd/conf-enabled
sudo ln -s ../conf-available/10-accesslog.conf
sudo systemctl restart lighttpd.service
sudo cat /var/log/lighttpd/access.log

2.1.2 - PXE Boot

Many older systems can’t HTTP Boot so let’s add PXE support with some dnsmasq options.

Installation

Dnsmasq

Install as in the httpboot page.

The Debian Installer

Older clients don’t handle ISOs well, so grab and extract the Debian netboot files.

sudo wget http://ftp.debian.org/debian/dists/bookworm/main/installer-amd64/current/images/netboot/netboot.tar.gz -O - | sudo tar -xzvf - -C /var/www/html

Grub is famous for ignoring proxy DHCP settings, so let’s start off the boot with something else: iPXE. It can do a lot, but isn’t signed, so you must disable secure boot on your clients.

sudo wget https://boot.ipxe.org/ipxe.efi -P /var/www/html

Configuration

iPXE

Debian is ready to go, but you’ll want to create an auto-execute file for iPXE so you don’t have to type in the commands manually.

sudo vi /var/www/html/autoexec.ipxe
#!ipxe

set base http://netboot/debian-installer/amd64

dhcp
kernel ${base}/linux
initrd ${base}/initrd.gz
boot

Dnsmasq

HTTP and PXE clients need different information to boot. We handle this by adding a filename to the PXE service option. This will override the dhcp-boot directive for PXE clients.

sudo vi /etc/dnsmasq.d/netboot.conf 
# Disable DNS
port=0 
 
# Use in DHCP PXE Proxy mode
dhcp-range=192.168.0.0,proxy 
 
# Respond to both PXE and HTTP clients
dhcp-pxe-vendor=PXEClient,HTTPClient 
 
# Send the BOOTP information for the clients using HTTP
dhcp-boot="http://netboot/debian.iso" 

# Specify a boot menu option for PXE clients. If there is only one, it's booted immediately.
pxe-service=x86-64_EFI,"iPXE (UEFI)", "ipxe.efi"

# We also need to enable TFTP for the PXE clients
enable-tftp 
tftp-root=/var/www/html

Client

Both types of client should now work. The debian installer will pull the rest of what it needs from the web.

Next Steps

You can create a boot-menu by adding multiple pxe-service entries in dnsmasq, or by customizing the iPXE autoexec.ipxe files. Take a look at that in the menu page.

Troubleshooting

Text Flashes by, disappears, and client reboots

This is most often a symptom of secure boot still being enabled.

Legacy Clients

These configs are aimed at UEFI clients. If you have old BIOS clients, you can try the pxe-service tag for those.

pxe-service=x86-64_EFI,"iPXE (UEFI)", "ipxe.efi"
pxe-service=x86PC,"iPXE (BIOS)", "ipxe.kpxe"

This may not work, and there are a few client flavors, so enable the dnsmasq logs to see how they identify themselves. You can also try booting pxelinux as in the Debian docs.

DHCP Options

Dnsmasq also has a whole tag system that you can set and use similar to this:

dhcp-match=set:PXE-BOOT,option:client-arch,7
dhcp-option=tag:PXE-BOOT,option:bootfile-name,"netboot.xyz.efi"

However, dnsmasq in proxy mode limits what you can send to the clients, so we’ve avoided DHCP options and focused on PXE service directives.

Debian Error

*ERROR* CPU pipe B FIFO underrun

You probably need to use the non-free firmware

No Boot option

Try entering the computer’s BIOS setup and adding a UEFI boot option for the OS you just installed. You may need to browse for the file \EFI\debian\grubx64.efi

Sources

https://documentation.suse.com/sles/15-SP2/html/SLES-all/cha-deployment-prep-uefi-httpboot.html
https://github.com/ipxe/ipxe/discussions/569
https://linuxhint.com/pxe_boot_ubuntu_server/#8

It’s possible to use secure boot if you’re willing to implement a chain of trust. Here’s an example used by FOG to boot devices.

https://forums.fogproject.org/topic/13832/secureboot-issues/3

2.1.3 - menu

It would be useful to have some choices when you netboot. You can use the pxe-service built into dnsmasq but a more flexible option is the menu system provided by the iPXE project.

Installation

Set up a http/pxe net-boot server if you haven’t already.

Configuration

dnsmasq

Configure dnsmasq to serve up the ipxe.efi binary for both types of clients.

# Disable DNS
port=0 
 
# Use in DHCP PXE Proxy mode
dhcp-range=192.168.0.0,proxy 
 
# Tell dnsmasq to provide proxy PXE service to both PXE and HTTP clients
dhcp-pxe-vendor=PXEClient,HTTPClient 
 
# Send the BOOTP information for the clients using HTTP
dhcp-boot="http://netboot/ipxe.efi" 

# Specify a boot menu option for PXE clients. If there is only one, it's booted immediately.
pxe-service=x86-64_EFI,"iPXE (UEFI)", "ipxe.efi"
  
# We also need to enable TFTP for the PXE clients  
enable-tftp 
tftp-root=/var/www/html

Custom Menu

Change the autoexec.ipxe to display a menu.

sudo vi /var/www/html/autoexec.ipxe
#!ipxe

echo ${cls}

:MAIN
menu Local Netboot Menu
item --gap Local Network Installation
item WINDOWS ${space} Windows 11 LTSC Installation
item DEBIAN ${space} Debian Installation
choose selection && goto ${selection} || goto ERROR

:WINDOWS
echo Some windows things here
sleep 3
goto MAIN

:DEBIAN
dhcp
imgfree
set base http://netboot/debian-installer/amd64
kernel ${base}/linux 
initrd ${base}/initrd.gz
boot || goto ERROR


:ERROR
echo There was a problem with the selection. Exiting...
sleep 3
exit

Operation

You’ll doubtless find additional options to add. You may want to add the netboot.xyz project to your local menu too.

2.1.4 - netboot.xyz

You can add netboot.xyz to your iPXE menu to run Live CDs, OS installers and utilities they provide. This can save a lot of time and their list is always improving.

Installation

You’re going to connect to the web for this, so there’s nothing to install. You can download their efi bootloader manually if you’d like to keep things HTTPS, but they update it regularly so you may fall behind.
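
If you do decide to host a local copy, it can be fetched like the other boot files. The URL below is the project's published download location at the time of writing, so double-check it against the netboot.xyz docs:

sudo wget https://boot.netboot.xyz/ipxe/netboot.xyz.efi -P /var/www/html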

Configuration

Autoexec.ipxe

Add a menu item to your autoexec.ipxe. When you select it, iPXE will chainload (in their parlance) the netboot.xyz bootloader.

#!ipxe

echo ${cls}

:MAIN
menu Local Netboot Menu
item --gap Local Network Installation
item WINDOWS ${space} Windows 11 LTSC Installation
item DEBIAN ${space} Debian Installation
item --gap Connect to Internet Sources
item NETBOOT ${space} Netboot.xyz
choose selection && goto ${selection} || goto ERROR

:WINDOWS
echo Some windows things here
sleep 3
goto MAIN

:DEBIAN
dhcp
imgfree
set base http://netboot/debian-installer/amd64
kernel ${base}/linux 
initrd ${base}/initrd.gz
boot || goto ERROR

:NETBOOT
dhcp
chain --autofree http://boot.netboot.xyz || goto ERROR

:ERROR
echo There was a problem with the selection. Exiting...
sleep 3
exit

Local-vars

Netboot.xyz detects that it’s working with a Proxy PXE server and behaves a little differently. For example, you can’t insert your own local menu.ipxe. One helpful addition is a local settings file to speed up boot.

sudo vi /var/www/html/local-vars.ipxe
#!ipxe
set use_proxydhcp_settings true

Operation

You can choose the new menu item and load netboot.xyz. It will take you out to the web for more selections. Not everything will load on every client, of course. But it gives you a lot of options.

Next Steps

We glossed over how to install Windows. That’s a useful item.

Troubleshooting

Wrong TFTP Server

tftp://192.168.0.1/local-vars.ipxe....Connection timed out
Local vars file not found... attempting TFTP boot...
DHCP proxy detected, press p to boot from 192.168.0.2...

If your boot client is attempting to connect to the main DHCP server, that server is probably sending the value next server: 192.168.0.1 in its packets. This isn’t a DHCP option per se, but it affects netboot. Dnsmasq does this, though Kea doesn’t.

sudo journalctl -u dnsmasq -f

...
...
next server: 192.168.0.1
...
...

The boot still works, it’s just annoying. You can usually ignore the message and don’t have to hit ‘p’.

Exec Format Error

Could not boot: Exec format error (https://ipxe.org/2e008081)

You may see this flash by. Check your menus and local variables file to make sure you’ve included the #!ipxe shebang.

No Internet

You can also host your own local instance.

2.1.5 - windows

To install windows, have iPXE load wimboot then WinPE. From there you can connect to a samba share and start the Windows installer. Just like back in the good old administrative installation point days.

Getting a copy of WinPE the official way is a bit of a hurdle, but definitely less work than setting up a full Windows imaging solution.

Installation

Samba and Wimboot

On the netboot server, install wimboot and Samba.

sudo wget https://github.com/ipxe/wimboot/releases/latest/download/wimboot -P /var/www/html
sudo apt install samba

Windows ADK

On a Windows workstation, download the ADK and PE Add-on and install as per Microsoft’s ADK Install Doc.

Configuration

Samba

Prepare the netboot server to receive the Windows files.

sudo vi /etc/samba/smb.conf
[global]
  map to guest = bad user
  log file = /var/log/samba/%m.log

[install]
  path = /var/www/html
  browseable = yes
  read only = no
  guest ok = yes
  guest only = yes
sudo mkdir /var/www/html/winpe
sudo mkdir /var/www/html/win11
sudo chmod o+w /var/www/html/win*
sudo systemctl restart smbd.service

Windows ADK Config

On the Windows workstation, start the deployment environment as an admin and create the working files as below. More info is in Microsoft’s Create Working Files document.

  • Start -> All Apps -> Windows Kits -> Deployment and Imaging Tools Environment (Right Click, More, Run As Admin)
copype amd64 c:\winpe\amd64

Add the required additions for Windows 11 with the commands below. These are the optional components WinPE-WMI and WinPE-SecureStartup and more info is in Microsoft’s Customization Section.

mkdir c:\winpe\offline

dism /mount-Image /Imagefile:c:\winpe\amd64\media\sources\boot.wim /index:1 /mountdir:c:\winpe\offline

dism /image:c:\winpe\offline /add-package /packagepath:"..\Windows Preinstallation Environment\amd64\WinPE_OCs\WinPE-WMI.cab" /packagepath:"..\Windows Preinstallation Environment\amd64\WinPE_OCs\WinPE-SecureStartup.cab"

dism /unmount-image /mountdir:c:\winpe\offline /commit

Make the ISO in case you want to HTTP Boot from it later and keep the shell open for later.

MakeWinPEMedia /ISO C:\winpe\amd64 C:\winpe\winpe_amd64.iso

WinPE

Now that you’ve got a copy of WinPE, copy it to the netboot server.

net use q: \\netboot\install
xcopy /s c:\winpe\* q:\winpe

Also create some auto-start files for setup. The first is part of the WinPE system and tells it (generically) what to do after it starts up.

notepad q:\winpe\amd64\winpeshl.ini
[LaunchApps]
"install.bat"

The second is more specific and is associated with the thing you are installing. We’ll mix and match these in the PXE menu later so we can install different things.

notepad q:\win11\install.bat
wpeinit
net use \\netboot
\\netboot\install\win11\setup.exe
pause

Win 11

You also need to obtain the latest ISO and extract the contents.
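
One way to do that from the netboot server itself is a read-only loop mount and copy; the ISO filename here is an assumption, use whatever you downloaded:

sudo mount -o loop,ro Win11.iso /mnt
sudo cp -r /mnt/* /var/www/html/win11/
sudo umount /mnt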

Wimboot

Back on the netboot server, customize the WINDOWS section of your autoexec.ipxe like this.

:WINDOWS
dhcp
imgfree
set winpe http://netboot/winpe/amd64
set source http://netboot/win11
kernel wimboot
initrd ${winpe}/media/sources/boot.wim boot.wim
initrd ${winpe}/media/Boot/BCD         BCD
initrd ${winpe}/media/Boot/boot.sdi    boot.sdi
initrd ${winpe}/winpeshl.ini           winpeshl.ini
initrd ${source}/install.bat           install.bat
boot || goto MAIN

You can add other installs by copying this block and changing the :WINDOWS header and source variable.

Next Steps

Add some more installation sources and take a look at the Windows zero touch install.

Troubleshooting

System error 53 has occurred. The network path was not found

A given client may be unable to connect to the SMB share, or it may fail once, but then connect on a retry a moment later. I suspect it’s because the client doesn’t have an IP yet, though I’ve not looked at it closely. You can usually just retry.

You can also comment out the winpeshl.ini line and you’ll boot to a command prompt that will let you troubleshoot. Sometimes you just don’t have an IP yet from the DHCP server and you can edit the install.bat file to add a sleep or other things. See the [zero touch deployment] page for some more ideas.

Access is denied

This may be related to the executable bit. If you’ve copied from the ISO they should be set. But if after that you’ve changed anything you could have lost the x bit from setup.exe. It’s hard to know what’s supposed to be set once it’s gone, so you may want to recopy the files.

2.2 - Windows

2.2.1 - Server Core

Installation Notes

If you’re deploying Windows servers, Server Core is best practice1. Install from USB and it will offer that as a choice - it’s fairly painless. But these instances are designed to be remote-managed so you’ll need to perform a few post-install tasks to help with that.

Server Post-Installation Tasks

Set a Manual IP Address

The IP is DHCP by default and that’s fine if you create a reservation at the DHCP server or just use DNS. If you require a manual address, however:

# Access the PowerShell interface (you can use the server console if desired)

# Identify the desired interface's index number. You'll see multiple per adapter for IP4 and 6 but the interface index will repeat.
Get-NetIPInterface

# Set a manual address, netmask and gateway using that index (12 in this example)
New-NetIPaddress -InterfaceIndex 12 -IPAddress 192.168.0.2 -PrefixLength 24 -DefaultGateway 192.168.0.1

# Set DNS
Set-DNSClientServerAddress -InterfaceIndex 12 -ServerAddresses 192.168.0.1

Allow Pings

This is normally a useful feature, though it depends on your security needs.

Set-NetFirewallRule -Name FPS-ICMP4-ERQ-In -Enabled True

Allow Computer Management

Server core allows ‘Remote Management’ by default2. That is specifically the Server Manager application that ships with Windows Server versions and is included with the Remote Server Admin Tools on Windows 10 professional3 or better. For more detailed work you’ll need to use the Computer Management feature as well. If you’re all part of AD, this is reported to Just Work(TM). If not, you’ll need to allow several ports for SMB and RPC.

# Port 445
Set-NetFirewallRule -Name FPS-SMB-In-TCP -Enabled True

# Port 135
Set-NetFirewallRule -Name WMI-RPCSS-In-TCP -Enabled True


You may also need the FPS-NB_Name-In-UDP and NETDIS-LLMNR-In-UDP rules, depending on your environment.

Configuration

Remote Management Client

If you’re using Windows 10/11, install it on a workstation by going to System -> Optional features -> View features and entering Server Manager in the search box to select and install it.

With AD

When you’re all in the same Domain then everything just works (TM). Or so I’ve read.

Without AD

If you’re not using Active Directory, you’ll have to do a few extra steps before using the app.

Trust The Server

Tell your workstation you trust the remote server you are about to manage4 (yes, seems backwards). Use either the hostname or IP address depending on how you’re planning to connect - i.e. if you didn’t set up DNS, use IPs. Start an admin PowerShell and enter:

Set-Item wsman:\localhost\Client\TrustedHosts 192.168.5.1 -Concatenate -Force

Add The Server

Start up Server Manager and select Manage -> Add Servers -> DNS and search for the IP or DNS name. Pay attention to the server’s name that it detects. If DNS happens to resolve the IP address you put in, as server-1.local for example, you’ll need to repeat the above TrustedHosts command with that specific name.

Manage As…

You may notice that after adding the server, the app tries to connect and fails. You’ll need to right-click it and select Manage As… and enter credentials in the form of server-1\Administrator and select Remember me to have this persist. Here you’ll need to use the actual server name and not the IP. If unsure, you can get this on the server with the hostname command.

Starting Performance Counters

The server you added should now say that its performance counters are not started. Right-click it and select the option to start them. The server should then show up as Online and you can perform some basic tasks.


Server Manager is the default management tool and newer servers allow remote management by default. The client needs a few things, however.

  • Set DNS so you can resolve by names
  • Configure Trusted Hosts

On the system where you start the Server Manager app - usually where you are sitting - ensure you can resolve the remote host via DNS. You may want to edit your hosts file if not.

notepad c:\Windows\System32\drivers\etc\hosts

You can now add the remote server.

Manage -> Add Servers -> DNS -> Search Box (enter the other servers hostname) -> Magnifying Glass -> Select the server -> Right Arrow Icon -> OK

(You may need to select Manage As… on it.)

Allow Computer Management

You can right-click on a remote server and select Computer Management after doing this

MISC

Set-NetFirewallProfile -Profile Domain, Public, Private -Enabled False

2.2.2 - Windows Zero Touch Install

The simplest way to zero-touch install Windows is with a web-generated answer file. Go to a site like schneegans and just create it. This removes the need for the complexity of MDT, WDS, SCCM, etc. for normal deployments.

Create An Answer File

Visit schneegans. Start with some basic settings, leaving most at the default, and increase complexity with successive iterations. A problematic setting will just dump you out of the installer and it can be hard to determine what went wrong.

Download the file and use it in one of the following ways:

USB

After creating the USB installer, copy the file (autounattend.xml) to the root of the USB drive (or one of these locations) and setup will automatically detect it.

Netboot

For a netboot install, copy the file to the sources folder of the Windows files.

scp autounattend.xml netboot:/var/www/html/win11/sources

Additionally, some scripting elements of the install don’t support UNC paths so we must map a drive. Back in the Windows netboot page, we created an install.bat to start the installation. Let’s modify that like so

vi /var/www/html/win11/install.bat
wpeinit

SET SERVER=netboot

:NET
net use q: \\%SERVER%\install

REM If there was a problem with the net use command, 
REM ping, pause and loop back to try again

IF %ERRORLEVEL% NEQ 0 (
  ping %SERVER%
  pause
  GOTO NET
) ELSE (
  q:
  cd win11
  setup.exe
)

Add Packages

The installer can also add 3rd party software packages by adding commands in the Run custom scripts section to run at initial log-in. We’ll use HTTP to get the files as some versions of windows block anonymous SMB.

Add Package Sources

On the netboot server, create an apps folder for your files and download packages there.

mkdir /var/www/html/apps; cd /var/www/html/apps
wget https://get.videolan.org/vlc/3.0.9.2/win64/vlc-3.0.9.2-win64.msi 
wget https://statics.teams.cdn.office.net/production-windows-x64/enterprise/webview2/lkg/MSTeams-x64.msix

Add to Autounattend.xml

It’s easiest to add this in the web form rather than try and edit the XML file. Go to this section and add a line like this one to the third block of custom scripts. It must run at initial user login as the network isn’t available before that.

Navigate to the block that says:

Scripts to run when the first user logs on after Windows has been installed

For MSI Files

These are handled as .cmd files, as in field 1.

msiexec /package http://netboot/apps/GoogleChromeStandaloneEnterprise64.msi /quiet
msiexec /package http://netboot/apps/vlc-3.0.9.2-win64.msi /quiet

For MSIX Files

These are handled as .ps1 files as in field 2.

Add-AppPackage -path http://netboot/apps/MSTeams-x64.msix

For EXE files

These are also handled in the .ps1 files in field 2. They require more work, however, as you must download, run, then remove them.

(New-Object System.Net.WebClient).DownloadFile("http://netboot/apps/WindowsSensor.MaverickGyr.exe","$env:temp\crowd.exe")
Start-Process $env:temp\crowd.exe -ArgumentList "/install /quiet CID=239023847023984098098" -wait
Remove-Item "$env:temp\crowd.exe"

Troubleshooting

Select Image Screen

Specifying the KMS product key won’t always allow you to skip the “Select Image” screen. This may be due to an ISO being pre-licensed, or it may have something to do with Windows releases. To fix this, add an InstallFrom stanza to the OSImage block of your autounattend.xml file.


                        <ImageInstall> 
                                <OSImage> 
                                        <InstallTo> 
                                                <DiskID>0</DiskID> 
                                                <PartitionID>3</PartitionID> 
                                        </InstallTo> 
                                        <InstallFrom> 
                                                <MetaData wcm:action="add"> 
                                                        <Key>/Image/Description</Key> 
                                                        <Value>Windows 11 Enterprise</Value> 
                                                </MetaData> 
                                        </InstallFrom> 
                                </OSImage> 
                        </ImageInstall>

https://www.tenforums.com/installation-upgrade/180022-autounattend-no-product-key.html

Notes

Windows Product Keys https://gist.github.com/rvrsh3ll/0810c6ed60e44cf7932e4fbae25880df

3 - Virtualization

In the beginning, users time-shared CPUs and virtualization was without form and void. And IBM said “Let there be System/370”. This was in the 70’s and involved men with crew-cuts, horn-rimmed glasses and pocket protectors. And ties.

Today, you can still do full virtualization. Everything is emulated down to the hardware and every system has its own kernel and device drivers. Most of the public cloud started out this way at the dawn of the new millennium. It was the way. VMWare was the early player in this area and popularized it on x86 hardware, where everyone was using 5% of their pizza-box servers.

The newer way is containerization. There is just one kernel and it keeps groups of processes separate from each other. This is possible because Linux implemented kernel namespaces around 2008 - mostly work by IBM, suitably enough. The program used to work with this is named LXC and you’d use commands like sudo lxc-create --template download --name u1 --dist ubuntu --release jammy --arch amd64. Other systems such as LXD and Docker (originally) are layered on top to provide more management.

Twenty some years later, what used to be a hot market is now a commodity that’s essentially given away for free. VMWare was acquired by Broadcom, which is focused on the value-extraction phase of its lifecycle, and the cloud seems decidedly headed toward containers because of their better efficiency and agility.

3.1 - Incus

Incus is a container manager, forked from Canonical’s LXD manager. It combines all the virtues of upstream LXD (containers + VMs) with the advantages of community-driven additions. You have access to OCI (Open Container Initiative) images as well as being able to create VMs. It is used at the command line and includes a web interface.

Installation

Simply install a base OS on your server and add a few commands. You can install from your distro’s repo, but zabbly (the sponsor) is a bit newer.

As per https://github.com/zabbly/incus

sudo mkdir -p /etc/apt/keyrings/
sudo wget -O /etc/apt/keyrings/zabbly.asc https://pkgs.zabbly.com/key.asc

sudo sh -c 'cat <<EOF > /etc/apt/sources.list.d/zabbly-incus-stable.sources
Enabled: yes
Types: deb
URIs: https://pkgs.zabbly.com/incus/stable
Suites: $(. /etc/os-release && echo ${VERSION_CODENAME})
Components: main
Architectures: $(dpkg --print-architecture)
Signed-By: /etc/apt/keyrings/zabbly.asc

EOF'

sudo apt update
sudo apt install -y incus incus-ui-canonical

Configuration

sudo adduser YOUR-USERNAME incus-admin
incus admin init

You’re fine to accept the defaults, though if you’re planning on a cluster consult

https://linuxcontainers.org/incus/docs/main/howto/cluster_form/#cluster-form

Managing Networks

Incus uses managed networks. It creates a private bridged network by default with DHCP, DNS and NAT services, and it adds those services to other networks you create similarly. You don’t plug instances in directly; rather, you create a new profile with the desired network (or none) and configure the instance with that profile.

If you’re testing DHCP though, such as when working with netboot, you must create a network without those services. That must be done at the command line, with the IP spaces set to none. You can then use that network in a profile:

incus network create test ipv4.address=none ipv6.address=none
incus profile copy default isolated

You can proceed to the GUI for the rest.
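
If you'd rather finish at the command line, something like this should point the copied profile's NIC at the new network. The device name eth0 is an assumption; check incus profile show isolated to see what yours is called:

incus profile device remove isolated eth0
incus profile device add isolated eth0 nic network=test name=eth0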

Operation

Windows 11 VM Creation

This requires access to the TPM module and an example at the command line is extracted from https://discussion.scottibyte.com/t/windows-11-incus-virtual-machine/362.

After repacking the installation ISO you can also create the VM through the GUI and then add:

incus config device add win11vm vtpm tpm path=/dev/tpm0

Agent

sudo apt install lxd-agent

Notes

LXD is widely admired, but Canonical’s decision to move it to in-house-only led the lead developer and elements of the community to fork.

3.2 - Proxmox PVE

Proxmox PVE is a distro from the company Proxmox that makes it easy to manage containers and virtual machines. It’s built on top of Debian and allows a lot of customization. This can be good and bad compared to VMWare or XCP-NG, which keep you on the straight and narrow. But it puts the choice in your hands.

Installation

Initial Install

Download the ISO and make a USB installer. It’s a hybrid image so you can write it directly to a USB drive.

sudo dd if=Downloads/proxmox*.iso of=/dev/sdX bs=1M conv=fdatasync

UEFI boot works fine. The installer will set a static IP address during the install, so be prepared for that.

If your system has an integrated NIC, check out the troubleshooting section after installing for potential issues with RealTek hardware.

System Update

After installation has finished and the system rebooted, update it. If you skip this step you may have problems with containers.

# Remove the pop-up warning if desired
sed -Ezi.bak "s/(function\(orig_cmd\) \{)/\1\n\torig_cmd\(\);\n\treturn;/g" /usr/share/javascript/proxmox-widget-toolkit/proxmoxlib.js && systemctl restart pveproxy.service

# Remove the enterprise subscription repos
rm /etc/apt/sources.list.d/pve-enterprise.list
rm /etc/apt/sources.list.d/ceph.list

# Add the non-subscription PVE repo
. /etc/os-release 
echo "deb http://download.proxmox.com/debian/pve $VERSION_CODENAME pve-no-subscription" > /etc/apt/sources.list.d/pve-no-subscription.list

# Add the non-subscription Ceph repo - this will change so consult 
# Check in your browser --> https://pve.proxmox.com/wiki/Package_Repositories
echo "deb http://download.proxmox.com/debian/ceph-reef bookworm no-subscription" > /etc/apt/sources.list.d/ceph-no-subscription.list

# Alternately, here's a terrible way to get the latest ceph release
LATEST=$(curl https://enterprise.proxmox.com/debian/ | grep ceph | sed 's/.*>\(ceph-.*\)\/<.*\(..-...-....\) .*/\1,\2/' | sort -t- -k3,3n -k2,2M -k1,1n | tail -1 | cut -f 1 -d ",")

echo "deb http://download.proxmox.com/debian/$LATEST $VERSION_CODENAME no-subscription" > /etc/apt/sources.list.d/ceph-no-subscription.list

# Update, upgrade and reboot
apt update
apt upgrade -y
reboot

Container Template Update

The template list is updated on a schedule, but you can get a jump on it while you’re logged in. More information at:

https://pve.proxmox.com/wiki/Linux_Container#pct_container_images

pveam update
pveam available
pveam download local (something from the list)

Configuration

Network

The default config is fine for most use-cases and you can skip the Network section. If you’re in a larger environment, you may want to employ an overlap network, or use VLANs.

The Default Config

PVE creates a bridge interface named vmbr0 and assigns a management IP there. As containers and VMs come up, their virtual interfaces will be connected to this bridge so they can have their own MAC addresses.

You can see this in the GUI or in the traditional Debian interfaces file.

cat /etc/network/interfaces
auto lo
iface lo inet loopback

iface enp1s0 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.11/24
    gateway 192.168.100.1
    bridge-ports enp1s0
    bridge-stp off
    bridge-fd 0

Create an Overlap Network

You may want to add some additional LANs for your guests, or to separate your management from the rest of the network. You can do this by simply adding some additional LAN addresses.

After changing IPs, take a look further down at how to restrict access.

Mixing DHCP and Static Addresses

To add additional DHCP IPs, say because you get a DHCP address from the wall but don’t want PVE management on that, use the up directive. In this example the 192 address is LAN only and the gateway comes from DHCP.

auto lo
iface lo inet loopback

iface enp1s0 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.11/24
    up dhclient vmbr0
    bridge-ports enp1s0
    bridge-stp off
    bridge-fd 0

Adding Additional Static Addresses

You should1 use the modern debian method.

auto lo
iface lo inet loopback

iface enp1s0 inet manual

auto vmbr0
iface vmbr0 inet static
    address 192.168.1.11/24
    bridge-ports enp1s0
    bridge-stp off
    bridge-fd 0    

iface vmbr0 inet static
    address 192.168.64.11/24
    gateway 192.168.64.1

Adding VLANs

You can add VLANs in the /etc/network/interfaces file as well.

auto lo
iface lo inet loopback

iface enp1s0 inet manual

auto vmbr0
iface vmbr0 inet manual
    bridge-ports enp1s0
    bridge-stp off
    bridge-fd 0
    bridge-vlan-aware yes
    bridge-vids 2-4094

auto vmbr0.1337
iface vmbr0.1337 inet static
    address 10.133.7.251/24
    gateway 10.133.7.1

auto vmbr0.1020
iface vmbr0.1020 inet static
    address 10.20.146.14/16

Restricting Access

You can use the PVE firewall or the pveproxy settings. There’s an anti-lockout rule on the firewall, however, that requires an explicit block, so you may prefer to set controls on the proxy.

PVE Proxy

The management web interface listens on all addresses by default. You can change that here. Other services, such as ssh, remain the same.

vi /etc/default/pveproxy 
LISTEN_IP="192.168.32.11"

You can also combine or substitute a control list. The port will still accept connections, but the application will reset them.

ALLOW_FROM="192.168.32.0/24"
DENY_FROM="all"
POLICY="allow"
pveproxy restart

Accessing Data

Container Bind Mounts for NFS

VMs and Containers work best when they’re light-weight and that means saving data somewhere else, like a NAS. Containers are the lightest of all but using NFS in a container causes a security issue.

Instead, mount on the host and bind-mount to the container with mp.

vi /etc/pve/lxc/100.conf

# Add this line.
#  mount point ID: existing location on the server, location to mount inside the guest
mp0: /mnt/media,mp=/mnt/media,shared=1
#mp1: and so on as you need more.

User ID Mapping

The next thing you’ll notice is that users inside the containers don’t match users outside. That’s because they’re shifted for security. To get them to line up you need a map.

# In the host, edit these files to allow root, starting at 1000, to map the next 11 UIDs and GIDs (in addition to what's there already)

# cat /etc/subuid
root:1000:11
root:100000:65536

# cat /etc/subgid
root:1000:11
root:100000:65536
# Also on the host, edit the container's config 
vi /etc/pve/lxc/100.conf

# At the bottom add these

# By default, the container users are shifted up by 100,000. Keep that in place for the first 1000 with this section 

## Starting with uid 0 in the container, map it to 100000 in the host and continue mapping for 1000 entries. (users 0-999)
lxc.idmap = u 0 100000 1000
lxc.idmap = g 0 100000 1000

# Map the next 10 values down low so they match the host (10 is just an arbitrary number. Map as many or as few as you need)

## Starting in the container at uid 1000, jump to 1000 in the host and map 10 values. (users 1000-1009)
lxc.idmap = u 1000 1000 10
lxc.idmap = g 1000 1000 10

# Then go back to mapping the rest up high
## Starting in the container at uid 1010, map it to 101010 and continue for the next 64526 entries (65536 - 1010)
lxc.idmap = u 1010 101010 64526
lxc.idmap = g 1010 101010 64526

Fixing User ID Mapping

If you want to add mapping to an existing container, user IDs are probably already in place and you’ll have to adjust them. Attempts to do so in the container will result in a permission denied, even as root. Mount them in the PVE host and change them there.

pct mount 119
# For a user ID number of 1000
find /var/lib/lxc/119/rootfs -user 101000 -exec chown -h 1000 {} \;
find /var/lib/lxc/119/rootfs -group 101000 -exec chgrp -h 1000 {} \; 
pct unmount 119

Retro-fitting a Service

Sometimes, you have to change a service to match between different containers. Log into your container and do the following.

# find the service's user account, make note of it and stop the service
ps -ef 
service someService stop
 
# get the existing uid and gid, change them and change the files
id someService

 > uid=112(someService) gid=117(someService) groups=117(someService),117(someService)

usermod -u 1001 someService
groupmod -g 1001 someService

# Change file ownership from the old IDs (112/117 in this example). -xdev so it won't traverse remote volumes
find / -xdev -group 117 -exec chgrp -h someService {} \;
find / -xdev -user 112 -exec chown -h someService {} \;

Clustering

Edit the /etc/hosts file to ensure that the IP address reflects any changes you’ve made (such as the addition of a specific management address). Ideally, add the hostname and IP of all of the impending cluster members and ensure they can all ping each other by that name.

The simple way to create and add members is done at the command line.

# On the first cluster member
pvecm create CLUSTERNAME

# On the other members
pvecm add FIRST-NODE-HOSTNAME

You can also refer to the notes at:

https://pve.proxmox.com/wiki/Cluster_Manager

Operation

Web Gui

You can access the Web GUI at:

https://192.168.32.10:8006

Logging Into Containers

https://forum.proxmox.com/threads/cannot-log-into-containers.39064/

pct enter 100

Troubleshooting

When The Container Doesn’t start

You may want to start it in foreground mode to see the error up close

lxc-start -n ID -F -l DEBUG -o /tmp/lxc-ID.log

Repairing Container Disks

Containers use LVM by default. If it fails to start and you suspect a disk error, you can fsck it. You can access the content as well. There are also direct ways2 if these fail.

pct list
pct fsck 108
pct mount 108

Network Drops

If your PVE server periodically drops the network with an error message about the realtek firmware, consider updating the driver.

# Add the non free and firmware to the apt source main line
sed -i '/bookworm main contrib/s/$/ non-free non-free-firmware/' /etc/apt/sources.list
apt update

# Install the kernel headers and the dkms driver.
apt -y install linux-headers-$(uname -r)
apt install r8168-dkms

Combining DHCP and Static The Normal Way Fails

You can’t do this the normal Debian way, it seems. In testing, the bridge doesn’t accept mixing types directly; you must use the ip command.
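
A sketch of that approach, reusing the 192.168.64.11/24 address from the earlier static example; it could equally be hung off an up directive in /etc/network/interfaces:

ip addr add 192.168.64.11/24 dev vmbr0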

Cluster Addition Failure

PVE local node address: cannot use IP not found on local node! 500 Can’t connect to XXXX:8006 (hostname verification failed)

Make sure the hosts files on all the nodes match and they can ping each other by hostname. Use hostnames to add cluster members, not IPs.

Sources

https://forum.proxmox.com/threads/proxmox-host-is-getting-unavailable.125416/
https://www.reddit.com/r/Proxmox/comments/10o58uq/how_to_install_r8168dkms_package_on_proxmox_ve_73/
https://wiki.archlinux.org/title/Dynamic_Kernel_Module_Support
https://pve.proxmox.com/wiki/Unprivileged_LXC_containers
https://pve.proxmox.com/wiki/Unprivileged_LXC_containers#Using_local_directory_bind_mount_points
https://www.reddit.com/r/homelab/comments/6p3xdw/proxmoxlxc_mount_host_folder_in_an_unprivileged/