ZFS
Overview
ZFS: the last word in file systems - at least according to Sun Microsystems back in 2004. But it’s pretty much true for traditional file servers. You add disks to a pool where you decide how much redundancy you want, then create file systems that sit on top. ZFS directly manages it all. No labeling, partitioning or formatting required.
There is error detection and correction, zero-space clones and snapshots, compression and deduplication, and delta-based replication for backup. It’s also highly resistant to corruption from power loss and crashes because it uses Copy-On-Write.
This last feature means that as files are changed, the changed blocks are written to new locations, and the metadata is updated as a separate, final step to point at them. The original data stays intact until that final step, so an interruption in the middle of a write (such as from a crash) leaves the file undamaged.
Lastly, everything is checksummed and automatically repaired should you ever suffer from silent corruption.
1 - Basics
The Basics
Let’s create a pool and mount a file system.
Create a Pool
A ‘pool’ is a group of disks, and that’s where RAID levels are established. You can choose to mirror drives or use distributed parity. A common choice is RAIDZ1, so that you can sustain the loss of one drive. You can increase the redundancy to RAIDZ2 and RAIDZ3 as desired.
zpool create pool01 raidz1 /dev/sdc /dev/sdd /dev/sde /dev/sdf
zpool list
NAME SIZE ALLOC FREE
pool01 40T 0T 40T
Create a Filesystem
A default root file system is created and mounted automatically when the pool is created. Using this is fine and you can easily change where it’s mounted.
zfs list
NAME USED AVAIL REFER MOUNTPOINT
pool01 0T 40T 0T /pool01
zfs set mountpoint=/mnt pool01
But often, you need more than one file system - say, one for holding working files and another for long-term storage. This lets you back up different things differently.
# Get rid of the initial fs and pool
zpool destroy pool01
# Create it again but leave the root filesystem unmounted with the `-m` option. You can also use drive short-names
zpool create -m none pool01 raidz1 sdc sdd sde sdf
# Add a couple filesystems and mount under the /srv directory
zfs create pool01/working -o mountpoint=/srv/working
zfs create pool01/archive -o mountpoint=/srv/archive
#    pool01  /  archive
#      ^           ^
#  pool name   filesystem name
Now you can do things like snapshot the archive folder regularly while skipping the working folder. The only downside is that they are separate filesystems, so moving things between them is a real copy rather than an instant rename.
Compression
Compression is on by default, and this saves space for anything that can benefit from it. It also makes things faster, since moving compressed data takes less time. CPU use for the default algorithm, lz4, is negligible, and it quickly detects files that aren’t compressible and gives up so that CPU time isn’t wasted.
zfs get compression pool01
NAME PROPERTY VALUE SOURCE
pool01 compression lz4 local
Next Step
Now that you’ve got a filesystem, take a look at creating and working with snapshots.
2 - Snapshot
Create a Snapshot
About to do something dangerous? Let’s create a ‘save’ point so you can reload your game, so to speak. They don’t take any space (to start with) and are nearly instant.
# Create a snapshot named `save-1`
zfs snapshot pool01/archive@save-1
The snapshot is a read-only copy of the filesystem at that time. It’s mounted by default in a hidden directory and you can examine and even copy things out of it, if you desire.
ls /srv/archive/.zfs/snapshot/save-1
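If you just need one file back, you can copy it straight out of the snapshot (the file name here is only a placeholder):
# Copy a (hypothetical) file from the snapshot back into the live filesystem
cp /srv/archive/.zfs/snapshot/save-1/some-file.txt /srv/archive/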
Delete a Snapshot
While a snapshot doesn’t take up any space to start with, it begins to as you make changes. Anything you delete stays around on the snapshot. Things you edit consume space as well for the changed bits. So when you’re done with a snapshot, it’s easy to remove.
zfs destroy pool01/archive@save-1
Rollback to a Snapshot
Mess things up in your archive folder after all? Not a problem, just roll back to the same state as your snapshot.
zfs rollback pool01/archive@save-1
Importantly, this is a one-way trip back. You can’t branch and jump around like it was a filesystem multiverse of alternate possibilities. ZFS will warn you about this, and if you have more than one snapshot between you and where you’re going, it will let you know they are about to be deleted.
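If there are newer snapshots in the way and you’ve decided you don’t need them, the -r flag acknowledges that and destroys them as part of the rollback. A quick sketch:
# Roll back past (and destroy) any snapshots newer than save-1
zfs rollback -r pool01/archive@save-1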
Auto Snapshot
One of the most useful tools is the zfs-auto-snapshot utility. This will create periodic snapshots of your filesystem and keep them pruned for efficiency. By default, it creates a snapshot every 15 min, and then prunes them down so you have one:
- Every 15 min for an hour
- Every hour for a day
- Every day for a week
- Every week for a month
- Every month
Install with the command:
sudo apt install zfs-auto-snapshot
That’s it. You’ll see new, time-named snapshot folders appear under the hidden .zfs/snapshot directory at the root of your filesystems. Each filesystem gets its own. Anytime you need to look for a file you’ve deleted, you’ll find it there.
# Look at snapshots
ls /srv/archive/.zfs/snapshot/
Excluding datasets from auto snapshot
# Disable
zfs set com.sun:auto-snapshot=false rpool/export
Excluding frequent or other auto-snapshot
There are sub-properties you can set under the basic auto-snapshot value
zfs set com.sun:auto-snapshot=true pool02/someDataSet
zfs set com.sun:auto-snapshot:frequent=false pool02/someDataSet
zfs get com.sun:auto-snapshot pool02/someDataSet
zfs get com.sun:auto-snapshot:frequent pool02/someDataSet
# Possibly also the number to keep, if other than the default is desired
zfs set com.sun:auto-snapshot:weekly=true,keep=52 pool02/someDataSet
# Take only weekly
zfs set com.sun:auto-snapshot:weekly=true rpool/export
Deleting Lots of auto-snapshot files
You can’t use globbing or similar to mass-delete snapshots, but you can string together a couple commands.
# Disable auto-snap as needed
zfs set com.sun:auto-snapshot=false pool04
zfs list -H -o name -t snapshot | grep auto | xargs -n1 zfs destroy
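If you only want to clear the auto snapshots under a single pool or dataset, a slight variation scopes the list with -r (using the pool04 name from above):
# List only the snapshots under pool04, keep the 'auto' ones, and destroy them one at a time
zfs list -H -o name -t snapshot -r pool04 | grep auto | xargs -n1 zfs destroy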
Missing Auto Snapshots
On some CentOS-based systems, like XCP-ng, you will only see frequent snapshots. This is because only the frequent cron job uses the correct path; you must add a PATH statement to the other cron jobs.
https://forum.level1techs.com/t/setting-up-zfs-auto-snapshot-on-centos-7/129574/12
Next Step
Now that you have snapshots, let’s send them somewhere for backup with replication.
References
https://www.reddit.com/r/zfs/comments/829v5a/zfs_ubuntu_1604_delete_snapshots_with_wildcard/
3 - Replication
Replication is how you back up and copy ZFS. It turns a snapshot into a bit-stream that you can pipe to something else. Usually, you pipe it over the network to another system where you connect it to zfs receive.
It is also the only way. The snapshot is what allows point-in-time handling, and the receive ensures consistency. A snapshot is a filesystem, but it’s more than just the files: two identically named filesystems with the same files you put in place by rsync are not the same filesystem, and you can’t jump-start a sync this way.
Basic Examples
# On the receiving side, create a pool to hold the filesystem
zpool create -m none pool02 raidz1 sdc sdd sde sdf
# On the sending side, pipe over SSH. The -F forces the filesystem on the receiving side to be replaced
zfs snapshot pool01@snap1
zfs send pool01@snap1 | ssh some.other.server zfs receive -F pool02
Importantly, this replaces the root filesystem on the receiving side. The filesystem you just copied over is accessible when the replication is finished - assuming it’s mounted and you’re only using the default root. If you’re using multiple filesystems, you’ll want to send recursively so you pick up children like the archive filesystem.
# The -r and -R trigger recursive operations
zfs snapshot -r pool01@snap1
zfs send -R pool01@snap1 | ssh some.other.server zfs receive -F pool02
You can also pick a specific filesystem to send. You can name it whatever you like on the other side, or replace something already named.
# Sending just the archive filesystem
zfs snapshot pool01/archive@snap1
zfs send pool01/archive@snap1 | ssh some.other.server zfs receive -F pool02/archive
And of course, you may have two pools on the same system. One line in a terminal is all you need.
zfs send -R pool01@snap1 | zfs receive -F pool02
Using Mbuffer or Netcat
These are much faster than ssh if you don’t care about someone capturing the traffic. But it does require you to start both ends separately.
# On the receiving side
ssh some.other.system
mbuffer -4 -s 128k -m 1G -I 8990 | zfs receive -F pool02/archive
# On the sending side
zfs send pool01/archive@snap1 | mbuffer -s 128k -m 1G -O some.other.system:8990
You can also use good-ole netcat. It’s a little slower but still faster than SSH. Combine it with pv for some visuals.
# On the receiving end
nc -l 8989 | pv -trab -B 500M | zfs recv -F pool02/archive
# On the sending side
zfs send pool01/archive@snap1 | nc some.other.system 8989
Estimating The Size
You may want to know how big the transmit is to estimate time or capacity. You can do this with a dry-run.
zfs send -nv pool01/archive@snap1
Use a Resumable Token
Any interruptions and you have to start all over again - or do you? If you’re sending a long-running transfer, add a token on the receiving side and you can restart from where it broke, turning a tragedy into just an annoyance.
# On the receiving side, add -s
ssh some.other.system
mbuffer -4 -s 128k -m 1G -I 8990 | zfs receive -s -F pool01/archive
# Send the stream normally
zfs send pool01/archive@snap1 | mbuffer -s 128k -m 1G -O some.other.system:8990
# If you get interrupted, on the receiving side, look up the token
zfs get -H -o value receive_resume_token pool01
# Then use that on the sending side to resume where you left off
zfs send -t 'some_long_key' | mbuffer -s 128k -m 1G -O some.other.system:8990
If you decide you don’t want to resume, clean up with the -A command to release the space consumed by the pending transfer.
# On the receiving side
zfs recv -A pool01/archive
Sending an Incremental Snapshot
After you’ve sent the initial snapshot, subsequent ones are much smaller. Even very large backups can be kept current if you ‘pre-seed’ before taking the other pool remote.
# at some point in past
zfs snapshot pool01/archive@snap1
zfs send pool01/archive@snap1 ....
# now we'll snap again and send just the changes between the two using -i
zfs snapshot pool01/archive@snap2
zfs send -i pool01/archive@snap1 pool01/archive@snap2 ...
Sending Intervening Snapshots
If you are jumping more than one snapshot ahead, the intervening ones are skipped. If you want to include them for some reason, use the -I option.
# This will send all the snaps between 1 and 9
zfs send -I pool01/archive@snap1 pool01/archive@snap9 ...
Changes Are Always Detected
You’ll often need to use -F to force changes because, even though you haven’t used the remote system, ZFS may think you have if the filesystem is mounted and atimes are on.
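One way to sidestep this - a sketch, assuming the pool02/archive destination from earlier - is to keep the replica read-only and skip atime updates so nothing dirties it between receives:
# On the receiving side
zfs set readonly=on pool02/archive
zfs set atime=off pool02/archive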
You must have a snapshot in common
You need at least 1 snapshot in common at both locations. This must have been sent from one to the other, not just named the same. Say you create snap1 and send it. Later, you create snap2 and, thinking you don’t need it anymore, delete snap1. Even though you have snap1 on the destination you cannot send snap2 as a delta. You need snap1 in both locations to generate the delta. You are out of luck and must send snap2 as a full snapshot.
You can use a ZFS feature called a bookmark as an alternative, but that is something you set up in advance and won’t save you from the above scenario.
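Here’s a minimal sketch of how a bookmark works, with hypothetical snapshot names. You bookmark the snapshot before deleting it, and the bookmark can later stand in as the incremental source:
# Bookmark snap1 so it can still serve as an incremental source, then delete the snapshot
zfs bookmark pool01/archive@snap1 pool01/archive#snap1
zfs destroy pool01/archive@snap1
# Later, send the delta from the bookmark to a newer snapshot
zfs send -i pool01/archive#snap1 pool01/archive@snap2 | ssh some.other.server zfs receive pool02/archive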
A Full Example of an Incremental Send
Take a snapshot, estimate the size with a dry-run, then use a resumable token and force changes
zfs snapshot pool01/archive@12th
zfs send -nv -i pool01/archive@11th pool01/archive@12th
zfs send -i pool01/archive@11th pool01/archive@12th | pv -trab -B 500M | ssh some.other.server zfs recv -F -s pool01/archive
Here’s an example of a recursive snapshot to a file. The snapshot takes a -r for recursive, and the send a -R.
# Take the snapshot
zfs snapshot -r pool01/archive@21
# Estimate the size
zfs send -nv -R -i pool01/archive@20 pool01/archive@21
# Mount a portable drive to /mnt - a traditional ext4 or whatever
zfs send -Ri pool01/archive@20 pool01/archive@21 > /mnt/backupfile
# When you get to the other location, receive from the file
zfs receive -s -F pool02/archive < /mnt/backupfile
Sending The Whole Pool
This is a common thought when you’re starting, but it ends up being deceptive because pools aren’t things that can be sent - only filesystems. So what you’re actually doing is a recursive send of the root filesystem with an implicit snapshot created on the fly. This is fine, but you won’t be able to refer to it later for updates, so you’re better off not doing it.
# Unmount -a (all available filesystems) on the given pool
zfs unmount -a pool01
# Send the unmounted filesystem with an implicit snapshot
zfs send -R pool01 ...
Auto Replication
You don’t want to do this by hand all the time. One way is with a simple script. If you’ve already installed zfs-auto-snapshot
you may have something that looks like this:
# use '-o name' to get just the snapshot name without all the details
# use '-s creation' to sort by creation time
zfs list -t snapshot -o name -s creation pool01/archive
pool01/archive@auto-2024-10-13_00-00
pool01/archive@auto-2024-10-20_00-00
pool01/archive@auto-2024-10-27_00-00
pool01/archive@auto-2024-11-01_00-00
pool01/archive@auto-2024-11-02_00-00
You can get the last two like this, then use send and receive. Adjust the grep to get just the daily as needed.
CURRENT=$(zfs list -t snapshot -o name -s creation pool01/archive | grep auto | tail -1 )
LAST=$(zfs list -t snapshot -o name -s creation pool01/archive | grep auto | tail -2 | head -1)
zfs send -i $LAST $CURRENT | pv -trab -B 500M | ssh some.other.server zfs recv -F -s pool01/archive
This is pretty basic and can fall out of sync. You can bring it up a notch by asking the other side to list its snapshots with a zfs list over ssh, comparing against yours to find the most recent match, and adding a resume token. But by that point you may consider just using a purpose-built replication tool.
Such a tool will likely replace your auto snapshots as well, and that’s probably fine. I haven’t used one myself as I scripted this back in the day. I will probably start, though.
Next Step
Sometimes, good disks go bad. Learn how to catch them before they do, and replace them when needed.
Scrub and Replacement
4 - Disk Replacement
Your disks will fail, but you’ll usually get some warnings because ZFS proactively checks every occupied bit to guard against silent corruption. This is normally done every month, but you can launch one manually if you’re suspicious. They take a long time, but operate at low priority.
# Start a scrub
zpool scrub pool01
# Stop a scrub if you need to
zpool scrub -s pool01
You can check the status of your pool at any time with the command zpool status. When there’s a problem, you’ll see something like this:
zpool status
NAME STATE READ WRITE CKSUM
pool01 DEGRADED 0 0 0
raidz1 DEGRADED 0 0 0
/dev/sda ONLINE 0 0 0
/dev/sdb ONLINE 0 0 0
/dev/sdc ONLINE 0 0 0
/dev/sdd FAULTED 53 0 0 too many errors
Time to replace that last drive before it goes all the way bad.
# You don't need to manually offline the drive if it's faulted, but it's good practice since there are other states it can be in
zpool offline pool01 /dev/sdd
# Physically replace that drive. If you're shutting down to do this, the replacement usually has the same device path
zpool replace pool01 /dev/sdd
There are a lot of strange things that can happen with drives, and depending on your version of ZFS it might be using UUIDs or other drive-identification strings rather than device letters.
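If a pool was built with bare /dev/sdX names and they’ve shifted around, one common fix (a sketch, not tied to any particular failure above) is to re-import the pool using the stable by-id paths:
# Export the pool, then import it using the persistent /dev/disk/by-id names
zpool export pool01
zpool import -d /dev/disk/by-id pool01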
5 - Large Pools
You’ve guarded against disk failure by adding redundancy, but was it enough? There’s a very mathy calculator at https://jro.io/r2c2/ that will let you chart different parity configs. But a reasonable rule of thumb is to devote 20%, or 1 in 5 drives, to parity.
- RAIDZ1 - up to 5 Drives
- RAIDZ2 - up to 10 Drives
- RAIDZ3 - up to 15 Drives
Oracle, however, recommends Virtual Devices when you go past 9 disks.
Pools and Virtual Devices
When you get past 15 drives, you can’t increase parity. You can, however, create virtual devices. Best practice from Oracle says to do this even earlier, as a VDev should be less than 9 disks. So given 24 disks, you should have 3 VDevs of 8 each. Here’s an example with double parity - slightly better than 1 in 5 and suitable for older disks.
### Build a 3-Wide RAIDZ2 across 24 disks
zpool create \
pool01 \
-m none \
-f \
raidz2 sdb sdc sdd sde sdf sdg sdh sdi \
raidz2 sdj sdk sdl sdm sdn sdo sdp sdq \
raidz2 sdr sds sdt sdu sdv sdw sdx sdy
Using Disk IDs
Drive letters can be hard to trace back to a physical drive. A better way is the /dev/disk/by-id identifiers.
ls /dev/disk/by-id | grep ata | grep -v part
zpool create -m none -o ashift=12 -O compression=lz4 \
pool04 \
raidz2 \
ata-ST4000NM0035-1V4107_ZC11AHH9 \
ata-ST4000NM0035-1V4107_ZC116F11 \
ata-ST4000NM0035-1V4107_ZC1195V5 \
ata-ST4000NM0035-1V4107_ZC11CDMB \
ata-ST4000NM0035-1V4107_ZC1195PR \
ata-ST4000NM0024-1HT178_Z4F164WG \
ata-ST4000NM0024-1HT178_Z4F17SJK \
ata-ST4000NM0024-1HT178_Z4F17M6B \
ata-ST4000NM0024-1HT178_Z4F18FZE \
ata-ST4000NM0024-1HT178_Z4F18G35
Hot and Distributed Spares
Spares vs Parity
You may not be able to reach a location quickly when a disk fails. In such a case, is it better to have a Z3 pool running in degraded mode (i.e. calculating parity the whole time) or a Z2 pool with a spare that replaces the failed disk automatically?
It’s better to have more parity until you go past the guideline of 15 drives in a Z3 config. If you have 16 bays, add a hot spare.
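Adding a spare to an existing pool is a one-liner. This sketch assumes the empty 16th bay shows up as /dev/sdq:
# Add a hot spare (device name is hypothetical)
zpool add pool01 spare /dev/sdq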
Distributed vs Dedicated
A distributed spare is a newer feature that lets you reserve space on all of your disks, rather than just one. That allows resilvering to go much faster, as you’re no longer limited by the speed of one disk. Here’s an example of such a pool that has 16 total devices.
# This pool has 3 parity, 12 data, 16 total count, with 1 spare
zpool create -f pool02 \
draid3:12d:16c:1s \
ata-ST4000NM000A-2HZ100_WJG04M27 \
ata-ST4000NM000A-2HZ100_WJG09BH7 \
ata-ST4000NM000A-2HZ100_WJG0QJ7X \
ata-ST4000NM000A-2HZ100_WS20ECCD \
ata-ST4000NM000A-2HZ100_WS20ECFH \
ata-ST4000NM000A-2HZ100_WS20JXTA \
ata-ST4000NM0024-1HT178_Z4F14K76 \
ata-ST4000NM0024-1HT178_Z4F17SJK \
ata-ST4000NM0024-1HT178_Z4F17YBP \
ata-ST4000NM0024-1HT178_Z4F1BJR1 \
ata-ST4000NM002A-2HZ101_WJG0GBXB \
ata-ST4000NM002A-2HZ101_WJG11NGC \
ata-ST4000NM0035-1V4107_ZC1168N3 \
ata-ST4000NM0035-1V4107_ZC116F11 \
ata-ST4000NM0035-1V4107_ZC116MSW \
ata-ST4000NM0035-1V4107_ZC116NZM
6 - Pool Testing
Best Practices
Best practice from Oracle says a VDev should be less than 9 disks. So given 24 disks, you should have 3 VDevs. They further recommend the following amount of parity vs data:
- single-parity starting at 3 disks (2+1)
- double-parity starting at 6 disks (4+2)
- triple-parity starting at 9 disks (6+3)
It is not recommended to create a zpool with a single large vdev, say 20 disks, because write IOPS performance will be that of a single disk, which also means that resilver time will be very long (possibly weeks with future large drives).
Reasons For These Practices
I interpret this as meaning that when a single IO write operation is given to the VDev, it won’t write anything else until it’s done. But if you have multiple VDevs, you can hand out writes to the other VDevs while you’re waiting on the first. Reading is probably unaffected, but writes will be faster with more VDevs.
Also, when resilvering the array, you have to read from each of the drives in the VDev to calculate the parity bit. If you have 24 drives in a VDev, then you have to read a block of data from all 24 drives to produce the parity bit. If you have only 8, then you have only 1/3 as much data to read. Meanwhile, the rest of the VDevs are available for real work.
Rebuilding the array also introduces stress which can cause other disks to fail, so it’s best to limit that to a smaller set of drives. I’ve heard many times of resilvering pushing sister drives that were already on the edge over it, failing the array.
Calculating Failure Rates
You can calculate the failure rates of different configurations with an on-line tool. The chart scales the X axis by 50, so the differences in failure rates are not as large as they seem - but if it didn’t, you wouldn’t be able to see the lines. In most cases there’s not a large difference, say, between a 4x9 and a 3x12.
When To Use a Hot Spare
Given 9 disks where one fails, is it better to drop from 3 parity to 2 and run in degraded mode, or to have 2 parity that drops to 1 plus a spare that recovers without intervention? The math says it’s better to have parity. But what about speed? When you lose a disk, 1 out of every 9 IOPS requires reconstruction from parity. Anecdotally, observed performance penalties are minor. So the only times to use a hot spare are:
- When you have unused capacity in RAIDZ3 (i.e. almost never)
- When IOPS require a mirror pool
Say you have 16 bays of 4TB drives. A 2x8 Z2 config gives you 48TB but you only want 32TB. Change that to a 2x8 Z3 and get 40TB. Still only need 32TB? Change that to a 2x7 Z3 with 2 hot spares. Now you have 32TB with maximum protection and the insurance of an automatic replacement, as sketched below.
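As a sketch (device names are placeholders), that 2x7 Z3 with two hot spares would look something like:
# Two 7-disk RAIDZ3 VDevs plus two hot spares - 16 bays total
zpool create pool01 \
  raidz3 sdb sdc sdd sde sdf sdg sdh \
  raidz3 sdi sdj sdk sdl sdm sdn sdo \
  spare sdp sdq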
Or maybe you have a 37 bay system. You do something that equals 36 plus a spare.
The other case is when your IOPS demands push past what RAIDZ can do and you must use a mirror pool. A failure there loses all redundancy, and a hot spare is your only option.
When To Use a Distributed Spare
A distributed spare recovers in half the time from a disk loss, and is always better than a dedicated spare - though you should almost never use a spare anyway. The only time to use a normal hot spare is when you have a single global spare.
Testing Speed
The speed difference isn’t charted. So let’s test that some.
Given 24 disks, and deciding to live dangerously, should you have a single 24-disk VDev with three parity disks, or three VDevs with a single parity disk each? The case for the first is better resiliency; the case for the latter is better write speed and faster recovery from disk failures.
Build a 3-Wide RAIDZ1
Create the pool across 24 disks
zpool create \
-f -m /srv srv \
raidz sdb sdc sdd sde sdf sdg sdh sdi \
raidz sdj sdk sdl sdm sdn sdo sdp sdq \
raidz sdr sds sdt sdu sdv sdw sdx sdy
Now copy a lot of random data to it
#!/bin/bash
no_of_files=1000
counter=0
while [[ $counter -le $no_of_files ]]
do echo Creating file no $counter
touch random-file.$counter
shred -n 1 -s 1G random-file.$counter
let "counter += 1"
done
Now yank (literally) one of the physical disks and replace it
allen@server:~$ sudo zpool status
pool: srv
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: none requested
config:
NAME STATE READ WRITE CKSUM
srv DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
sdb ONLINE 0 0 0
6847353731192779603 UNAVAIL 0 0 0 was /dev/sdc1
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
sdh ONLINE 0 0 0
sdi ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
sdr ONLINE 0 0 0
sds ONLINE 0 0 0
sdt ONLINE 0 0 0
sdu ONLINE 0 0 0
sdj ONLINE 0 0 0
sdk ONLINE 0 0 0
sdl ONLINE 0 0 0
sdm ONLINE 0 0 0
raidz1-2 ONLINE 0 0 0
sdv ONLINE 0 0 0
sdw ONLINE 0 0 0
sdx ONLINE 0 0 0
sdy ONLINE 0 0 0
sdn ONLINE 0 0 0
sdo ONLINE 0 0 0
sdp ONLINE 0 0 0
sdq ONLINE 0 0 0
errors: No known data errors
allen@server:~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 465.8G 0 disk
├─sda1 8:1 0 449.9G 0 part /
├─sda2 8:2 0 1K 0 part
└─sda5 8:5 0 15.9G 0 part [SWAP]
sdb 8:16 1 931.5G 0 disk
├─sdb1 8:17 1 931.5G 0 part
└─sdb9 8:25 1 8M 0 part
sdc 8:32 1 931.5G 0 disk
sdd 8:48 1 931.5G 0 disk
├─sdd1 8:49 1 931.5G 0 part
└─sdd9 8:57 1 8M 0 part
...
sudo zpool replace srv 6847353731192779603 /dev/sdc -f
allen@server:~$ sudo zpool status
pool: srv
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Fri Mar 22 15:50:21 2019
131G scanned out of 13.5T at 941M/s, 4h7m to go
5.40G resilvered, 0.95% done
config:
NAME STATE READ WRITE CKSUM
srv DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
sdb ONLINE 0 0 0
replacing-1 OFFLINE 0 0 0
6847353731192779603 OFFLINE 0 0 0 was /dev/sdc1/old
sdc ONLINE 0 0 0 (resilvering)
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
...
A few hours later…
$ sudo zpool status
pool: srv
state: DEGRADED
status: One or more devices has experienced an error resulting in data corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: resilvered 571G in 5h16m with 2946 errors on Fri Mar 22 21:06:48 2019
config:
NAME STATE READ WRITE CKSUM
srv DEGRADED 208 0 2.67K
raidz1-0 DEGRADED 208 0 5.16K
sdb ONLINE 0 0 0
replacing-1 OFFLINE 0 0 0
6847353731192779603 OFFLINE 0 0 0 was /dev/sdc1/old
sdc ONLINE 0 0 0
sdd ONLINE 208 0 1
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
sdh ONLINE 0 0 0
sdi ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
sdr ONLINE 0 0 0
sds ONLINE 0 0 0
sdt ONLINE 0 0 0
sdu ONLINE 0 0 0
sdj ONLINE 0 0 1
sdk ONLINE 0 0 1
sdl ONLINE 0 0 0
sdm ONLINE 0 0 0
raidz1-2 ONLINE 0 0 0
sdv ONLINE 0 0 0
sdw ONLINE 0 0 0
sdx ONLINE 0 0 0
sdy ONLINE 0 0 0
sdn ONLINE 0 0 0
sdo ONLINE 0 0 0
sdp ONLINE 0 0 0
sdq ONLINE 0 0 0
The time was 5h16m. But notice the error - during resilvering, drive sdd had 208 read errors and data was lost. This is the classic RAID situation where resilvering stresses the drives, another goes bad, and you can’t restore.
It’s somewhat questionable whether this is a valid test, as the effect of the error on resilvering duration is unknown. But on with the test.
Let’s wipe that away and create a raidz3
sudo zpool destroy srv
zpool create \
-f -m /srv srv \
raidz3 \
sdb sdc sdd sde sdf sdg sdh sdi \
sdj sdk sdl sdm sdn sdo sdp sdq \
sdr sds sdt sdu sdv sdw sdx sdy
# Use zdb to look up the GUID of the disk that was pulled, then offline and replace it
zdb
zpool offline srv 15700807100581040709
sudo zpool replace srv 15700807100581040709 sdc
allen@server:~$ sudo zpool status
pool: srv
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Mar 24 10:07:18 2019
27.9G scanned out of 9.14T at 362M/s, 7h19m to go
1.21G resilvered, 0.30% done
config:
NAME STATE READ WRITE CKSUM
srv DEGRADED 0 0 0
raidz3-0 DEGRADED 0 0 0
sdb ONLINE 0 0 0
replacing-1 OFFLINE 0 0 0
sdd OFFLINE 0 0 0
sdc ONLINE 0 0 0 (resilvering)
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
...
allen@server:~$ sudo zpool status
pool: srv
state: ONLINE
scan: resilvered 405G in 6h58m with 0 errors on Sun Mar 24 17:05:50 2019
config:
NAME STATE READ WRITE CKSUM
srv ONLINE 0 0 0
raidz3-0 ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
...
The time? 6h58m. Longer, but safer.
7 - ZFS Cache
There is a lot out there about ZFS cache config. I’ve found the most significant feature to be putting your metadata on a dedicated NVMe device. This is known as a ‘special’ VDev. Here’s an example of a draid with such a device at the end.
Note: A 2x18 is bad practice - just more fun than a 3x12 with no spares.
zpool create -f pool02 \
draid3:14d:18c:1s \
ata-ST4000NM000A-2HZ100_WJG04M27 \
ata-ST4000NM000A-2HZ100_WJG09BH7 \
ata-ST4000NM000A-2HZ100_WJG0QJ7X \
ata-ST4000NM000A-2HZ100_WS20ECCD \
ata-ST4000NM000A-2HZ100_WS20ECFH \
ata-ST4000NM000A-2HZ100_WS20JXTA \
ata-ST4000NM0024-1HT178_Z4F14K76 \
ata-ST4000NM0024-1HT178_Z4F17SJK \
ata-ST4000NM0024-1HT178_Z4F17YBP \
ata-ST4000NM0024-1HT178_Z4F1BJR1 \
ata-ST4000NM002A-2HZ101_WJG0GBXB \
ata-ST4000NM002A-2HZ101_WJG11NGC \
ata-ST4000NM0035-1V4107_ZC1168N3 \
ata-ST4000NM0035-1V4107_ZC116F11 \
ata-ST4000NM0035-1V4107_ZC116MSW \
ata-ST4000NM0035-1V4107_ZC116NZM \
ata-ST4000NM0035-1V4107_ZC118WV5 \
ata-ST4000NM0035-1V4107_ZC118WW0 \
draid3:14d:18c:1s \
ata-ST4000NM0035-1V4107_ZC118X74 \
ata-ST4000NM0035-1V4107_ZC118X90 \
ata-ST4000NM0035-1V4107_ZC118XBS \
ata-ST4000NM0035-1V4107_ZC118Z23 \
ata-ST4000NM0035-1V4107_ZC11907W \
ata-ST4000NM0035-1V4107_ZC1192GG \
ata-ST4000NM0035-1V4107_ZC1195PR \
ata-ST4000NM0035-1V4107_ZC1195V5 \
ata-ST4000NM0035-1V4107_ZC1195ZJ \
ata-ST4000NM0035-1V4107_ZC11AHH9 \
ata-ST4000NM0035-1V4107_ZC11CDD0 \
ata-ST4000NM0035-1V4107_ZC11CE77 \
ata-ST4000NM0035-1V4107_ZC11CV5E \
ata-ST4000NM0035-1V4107_ZC11D2AQ \
ata-ST4000NM0035-1V4107_ZC11HRGR \
ata-ST4000NM0035-1V4107_ZC1B200R \
ata-ST4000NM0035-1V4107_ZC1CBXEH \
ata-ST4000NM0035-1V4107_ZC1DC98B \
special mirror \
ata-MICRON_M510DC_MTFDDAK960MBP_164614A1DBC4 \
ata-MICRON_M510DC_MTFDDAK960MBP_170615BD4A74
zfs set special_small_blocks=64K pool02
Metadata is stored on the special device automatically, but there’s a benefit in also directing the pool to use the special VDev for small files, which is what the special_small_blocks setting above does.
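To sanity-check that the special VDev is actually being used, you can watch its allocation relative to the data VDevs:
# Show per-VDev capacity and allocation, including the special mirror
zpool list -v pool02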
Sources
https://forum.level1techs.com/t/zfs-metadata-special-device-z/159954
8 - ZFS Encryption
You might want to store data such that it’s encrypted at rest, or replicate data to such a system. ZFS offers this as a per-dataset option.
Create an Encrypted Fileset
Let’s assume that you’re at a remote site and want to create an encrypted fileset to receive your replications.
zfs create -o encryption=on -o keylocation=prompt -o keyformat=passphrase pool02/encrypted
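The key isn’t loaded automatically after a reboot. With keylocation=prompt you load it by hand and then mount - a minimal sketch:
# Load the passphrase (prompts on the terminal), then mount the dataset
zfs load-key pool02/encrypted
zfs mount pool02/encrypted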
Replicating to an Encrypted Fileset
This example uses mbuffer and assumes a secure VPN. Replace with SSH as needed.
# On the receiving side
sudo zfs load-key -r pool02/encrypted
mbuffer -4 -s 128k -m 1G -I 8990 | sudo zfs receive -s -F pool02/encrypted
# On the sending side
zfs send -i pool01/archive@snap1 pool01/archive@snap2 | mbuffer -s 128k -m 1G -O some.server:8990
9 - VDev Sizing
Best practice from Oracle says a VDev should be less than 9 disks. So given 24 disks you should have 3 VDevs. However, when using RAIDZ, the math shows they should be as large as possible with multiple parity disks. I.e. with 24 disks you should have a single, 24 disk VDev.
The reason for the best practice seems to be about the speed of writing and recovering from disk failures.
It is not recommended to create a zpool with a single large vdev, say 20 disks, because write IOPS performance will be that of a single disk, which also means that resilver time will be very long (possibly weeks with future large drives).
With a single VDev, you break up the data to send a chunk to each drive, then wait for them all to finish writing before you send the next. With several VDevs, you can move on to the next while you wait for the others to finish.
Build a 3-Wide RAIDZ1
Create the pool across 24 disks
#
# -O sets filesystem properties on the pool's root dataset. Lowercase -o is for pool properties
# sudo zfs get compression to check. lz4 is now preferred
#
zpool create \
-m /srv srv \
-O compression=lz4 \
raidz sdb sdc sdd sde sdf sdg sdh sdi \
raidz sdj sdk sdl sdm sdn sdo sdp sdq \
raidz sdr sds sdt sdu sdv sdw sdx sdy -f
Copy a lot of random data to it.
#!/bin/bash
no_of_files=1000
counter=0
while [[ $counter -le $no_of_files ]]
do echo Creating file no $counter
touch random-file.$counter
shred -n 1 -s 1G random-file.$counter
let "counter += 1"
done
Yank out (literally) one of the physical disks and replace it.
sudo zpool status
pool: srv
state: DEGRADED
status: One or more devices could not be used because the label is missing or
invalid. Sufficient replicas exist for the pool to continue
functioning in a degraded state.
action: Replace the device using 'zpool replace'.
see: http://zfsonlinux.org/msg/ZFS-8000-4J
scan: none requested
config:
NAME STATE READ WRITE CKSUM
srv DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
sdb ONLINE 0 0 0
6847353731192779603 UNAVAIL 0 0 0 was /dev/sdc1
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
sdh ONLINE 0 0 0
sdi ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
sdr ONLINE 0 0 0
sds ONLINE 0 0 0
sdt ONLINE 0 0 0
sdu ONLINE 0 0 0
sdj ONLINE 0 0 0
sdk ONLINE 0 0 0
sdl ONLINE 0 0 0
sdm ONLINE 0 0 0
raidz1-2 ONLINE 0 0 0
sdv ONLINE 0 0 0
sdw ONLINE 0 0 0
sdx ONLINE 0 0 0
sdy ONLINE 0 0 0
sdn ONLINE 0 0 0
sdo ONLINE 0 0 0
sdp ONLINE 0 0 0
sdq ONLINE 0 0 0
errors: No known data errors
Insert a new disk and replace the missing one.
allen@server:~$ lsblk
NAME MAJ:MIN RM SIZE RO TYPE MOUNTPOINT
sda 8:0 0 465.8G 0 disk
├─sda1 8:1 0 449.9G 0 part /
├─sda2 8:2 0 1K 0 part
└─sda5 8:5 0 15.9G 0 part [SWAP]
sdb 8:16 1 931.5G 0 disk
├─sdb1 8:17 1 931.5G 0 part
└─sdb9 8:25 1 8M 0 part
sdc 8:32 1 931.5G 0 disk # <-- new disk showed up here
sdd 8:48 1 931.5G 0 disk
├─sdd1 8:49 1 931.5G 0 part
└─sdd9 8:57 1 8M 0 part
...
sudo zpool replace srv 6847353731192779603 /dev/sdc -f
sudo zpool status
pool: srv
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Fri Mar 22 15:50:21 2019
131G scanned out of 13.5T at 941M/s, 4h7m to go
5.40G resilvered, 0.95% done
config:
NAME STATE READ WRITE CKSUM
srv DEGRADED 0 0 0
raidz1-0 DEGRADED 0 0 0
sdb ONLINE 0 0 0
replacing-1 OFFLINE 0 0 0
6847353731192779603 OFFLINE 0 0 0 was /dev/sdc1/old
sdc ONLINE 0 0 0 (resilvering)
sdd ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
...
We can see it’s running at 941M/s. Not too bad.
Build a 1-Wide RAIDZ3
sudo zpool destroy srv
zpool create \
-m /srv srv \
-O compression=lz4 \
raidz3 \
sdb sdc sdd sde sdf sdg sdh sdi \
sdj sdk sdl sdm sdn sdo sdp sdq \
sdr sds sdt sdu sdv sdw sdx sdy -f
Copy a lot of random data to it again (as above)
Replace a disk (as above)
allen@server:~$ sudo zpool status
pool: srv
state: DEGRADED
status: One or more devices is currently being resilvered. The pool will
continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
scan: resilver in progress since Sun Mar 24 10:07:18 2019
27.9G scanned out of 9.14T at 362M/s, 7h19m to go
1.21G resilvered, 0.30% done
config:
NAME STATE READ WRITE CKSUM
srv DEGRADED 0 0 0
raidz3-0 DEGRADED 0 0 0
sdb ONLINE 0 0 0
replacing-1 OFFLINE 0 0 0
sdd OFFLINE 0 0 0
sdc ONLINE 0 0 0 (resilvering)
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
So that’s running quite a bit slower. Not exactly 1/3, but closer to it than not.
Surprise Ending
That was all about speed. What about reliability?
Our first resilver was going a lot faster, but it ended badly. Other errors popped up, some on the same VDev as was being resilvered, and so it failed.
sudo zpool status
pool: srv
state: DEGRADED
status: One or more devices has experienced an error resulting in data
corruption. Applications may be affected.
action: Restore the file in question if possible. Otherwise restore the
entire pool from backup.
see: http://zfsonlinux.org/msg/ZFS-8000-8A
scan: resilvered 571G in 5h16m with 2946 errors on Fri Mar 22 21:06:48 2019
config:
NAME STATE READ WRITE CKSUM
srv DEGRADED 208 0 2.67K
raidz1-0 DEGRADED 208 0 5.16K
sdb ONLINE 0 0 0
replacing-1 OFFLINE 0 0 0
6847353731192779603 OFFLINE 0 0 0 was /dev/sdc1/old
sdc ONLINE 0 0 0
sdd ONLINE 208 0 1
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
sdh ONLINE 0 0 0
sdi ONLINE 0 0 0
raidz1-1 ONLINE 0 0 0
sdr ONLINE 0 0 0
sds ONLINE 0 0 0
sdt ONLINE 0 0 0
sdu ONLINE 0 0 0
sdj ONLINE 0 0 1
sdk ONLINE 0 0 1
sdl ONLINE 0 0 0
sdm ONLINE 0 0 0
raidz1-2 ONLINE 0 0 0
sdv ONLINE 0 0 0
sdw ONLINE 0 0 0
sdx ONLINE 0 0 0
sdy ONLINE 0 0 0
sdn ONLINE 0 0 0
sdo ONLINE 0 0 0
sdp ONLINE 0 0 0
sdq ONLINE 0 0 0
errors: 2946 data errors, use '-v' for a list
Our second resilver was going very slowly, but did slow and sure win the race? It did, but very, very slowly.
allen@server:~$ sudo zpool status
[sudo] password for allen:
pool: srv
state: ONLINE
scan: resilvered 405G in 6h58m with 0 errors on Sun Mar 24 17:05:50 2019
config:
NAME STATE READ WRITE CKSUM
srv ONLINE 0 0 0
raidz3-0 ONLINE 0 0 0
sdb ONLINE 0 0 0
sdc ONLINE 0 0 0
sde ONLINE 0 0 0
sdf ONLINE 0 0 0
sdg ONLINE 0 0 0
...
...
It slowed even further down, as 400G in 7 hours is something like 16M/s. I didn’t see any checksum errors this time, but that time is abysmal.
Though, to paraphrase Livy, better late than never.
10 - ZFS Replication Script
#!/usr/bin/env bash
#
# replicate-zfs-pull.sh
#
# Pulls incremental ZFS snapshots from a remote (source) server to the local (destination) server.
# Uses snapshots made by zfs-auto-snapshot. Locates the latest snapshot common to both sides
# to perform an incremental replication; if none is found, it does a full send.
#
# Usage: replicate-zfs-pull.sh <SOURCE_HOST> <SOURCE_DATASET> <DEST_DATASET>
#
# Example:
# ./replicate-zfs-pull.sh mysourcehost tank/mydata tank/backup/mydata
#
# Assumptions/Notes:
# - The local server is the destination. The remote server is the source.
# - We're using "zfs recv -F" locally, which can forcibly roll back the destination
# dataset if it has diverging snapshots. Remove or change -F as desired.
# - This script is minimal and doesn't handle advanced errors or timeouts gracefully.
# - Key-based SSH authentication should be set up so that `ssh <SOURCE_HOST>` doesn't require a password prompt.
#
set -euo pipefail
##############################################################################
# 1. Parse command-line arguments
##############################################################################
if [[ $# -ne 3 ]]; then
echo "Usage: $0 <SOURCE_HOST> <SOURCE_DATASET> <DEST_DATASET>"
exit 1
fi
SOURCE_HOST="$1"
SOURCE_DATASET="$2"
DEST_DATASET="$3"
##############################################################################
# 2. Gather snapshot lists
#
# The command zfs list -H -t snapshot -o name -S creation -d 1
# -H : Output without headers for script-friendliness
# -t snapshot : Only list snapshots
# -o name : Only list the name
# -d 1 : Only descend one level - i.e. don't tree out child datasets
##############################################################################
# - Remote (source) snapshots: via SSH to the remote host
# - Local (destination) snapshots: from the local ZFS
echo "Collecting snapshots from remote source: ${SOURCE_HOST}:${SOURCE_DATASET}..."
REMOTE_SNAPSHOTS=$(ssh "${SOURCE_HOST}" zfs list -H -t snapshot -o name -S creation -d 1 "${SOURCE_DATASET}" 2>/dev/null \
| grep "${SOURCE_DATASET}@" \
| awk -F'@' '{print $2}' || true)
echo "Collecting snapshots from local destination: ${DEST_DATASET}..."
LOCAL_SNAPSHOTS=$(zfs list -H -t snapshot -o name -S creation -d 1 "${DEST_DATASET}" 2>/dev/null \
| grep "${DEST_DATASET}@" \
| awk -F'@' '{print $2}' || true)
##############################################################################
# 3. Find the latest common snapshot
#
# The snapshots names have prefixes like "zfs-auto-snap_daily" and "zfs-auto-snap_hourly"
# that confuse sorting for the linux comm program, so we strip the prefix with sed before
# using 'comm -12' to find common elements of input 1 and 2, and tail to get the last one.
#
COMMON_SNAPSHOT=$(comm -12 <(echo "$REMOTE_SNAPSHOTS" | sed 's/zfs-auto-snap_\w*-//' | sort) <(echo "$LOCAL_SNAPSHOTS" | sed 's/zfs-auto-snap_\w*-//' | sort) | tail -n 1)
# We need the full name back for the transfer, so grep it out of the local list. Make sure to
# quote the variable sent to grep or you'll lose the newlines. Skip the lookup if nothing matched.
if [[ -n "$COMMON_SNAPSHOT" ]]; then
  COMMON_SNAPSHOT=$(echo "$LOCAL_SNAPSHOTS" | grep "$COMMON_SNAPSHOT" | head -n 1)
fi
if [[ -n "$COMMON_SNAPSHOT" ]]; then
echo "Found common snapshot: $COMMON_SNAPSHOT"
else
echo "No common snapshot found—will perform a full send."
fi
##############################################################################
# 4. Identify the most recent snapshot on the remote source
#
# This works because we zfs list'ed the snapshots originally in order
# so we can just take the first line with 'head -n 1'
##############################################################################
LATEST_REMOTE_SNAPSHOT=$(echo "$REMOTE_SNAPSHOTS" | head -n 1)
if [[ -z "$LATEST_REMOTE_SNAPSHOT" ]]; then
echo "No snapshots found on the remote source. Check if zfs-auto-snapshot is enabled there."
exit 1
fi
##############################################################################
# 5. Perform replication
##############################################################################
echo "Starting pull-based replication from ${SOURCE_HOST}:${SOURCE_DATASET} to local ${DEST_DATASET}..."
if [[ -n "$COMMON_SNAPSHOT" ]]; then
echo "Performing incremental replication from @$COMMON_SNAPSHOT up to @$LATEST_REMOTE_SNAPSHOT."
ssh "${SOURCE_HOST}" zfs send -I "${SOURCE_DATASET}@${COMMON_SNAPSHOT}" "${SOURCE_DATASET}@${LATEST_REMOTE_SNAPSHOT}" \
| zfs recv -F "${DEST_DATASET}"
else
echo "Performing full replication of @$LATEST_REMOTE_SNAPSHOT."
ssh "${SOURCE_HOST}" zfs send "${SOURCE_DATASET}@${LATEST_REMOTE_SNAPSHOT}" \
| zfs recv -F "${DEST_DATASET}"
fi
echo "Replication completed successfully!"