VDev Sizing

Best practice from Oracle says a VDev should have fewer than 9 disks [1]. So given 24 disks, you should build 3 VDevs. However, when using RAIDZ, the math suggests VDevs should be as large as possible, with multiple parity disks [2]. That is, with 24 disks you should build a single 24-disk VDev.

The reasoning behind the best practice seems to come down to write speed and how quickly the pool can recover from a disk failure.

It is not recommended to create a zpool with a single large vdev, say 20 disks, because write IOPS performance will be that of a single disk, which also means that resilver time will be very long (possibly weeks with future large drives).

With a single VDev, each write is broken into chunks, one per drive, and the pool has to wait for every drive to finish before the next write can go out. With several VDevs, the next write can go to another VDev while the first is still finishing.
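
To put rough numbers on it, here is an illustrative sketch. The ~250 random write IOPS per spinning disk is an assumption for the sake of comparison, not a measurement from this hardware; the point is that write IOPS scale with the number of VDevs, not the number of disks.

#
# Rough, illustrative math (assumes ~250 random write IOPS per disk):
#
#   3 x 8-disk RAIDZ1  -> 3 VDevs -> ~3 x 250 = ~750 write IOPS, 21 disks of data
#   1 x 24-disk RAIDZ3 -> 1 VDev  -> ~1 x 250 = ~250 write IOPS, 21 disks of data
#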

Build a 3-Wide RAIDZ1

Create the pool across all 24 disks.

#
# -O sets properties on the pool's root dataset; lowercase -o sets pool properties.
# Verify with: sudo zfs get compression srv. lz4 is now preferred.
#
zpool create \
-m /srv srv \
-O compression=lz4 \
raidz sdb sdc sdd sde sdf sdg sdh sdi \
raidz sdj sdk sdl sdm sdn sdo sdp sdq \
raidz sdr sds sdt sdu sdv sdw sdx sdy -f
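
Assuming the create succeeds, a quick sanity check of the layout and the compression setting is worthwhile before loading data (these commands weren't part of the original run; output omitted).

sudo zpool status srv
sudo zpool list srv
sudo zfs get compression,mountpoint srv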

Copy a lot of random data to it.

#!/bin/bash
# Fill the pool with incompressible data: 1,000 files of 1 GiB of random bytes each.

no_of_files=1000

for (( counter = 1; counter <= no_of_files; counter++ )); do
    echo "Creating file no $counter"
    touch "random-file.$counter"
    shred -n 1 -s 1G "random-file.$counter"
done
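
Once the script finishes, a quick look confirms the data landed. Since the files are pure random bytes, lz4 won't buy much here; this is just a sanity check, not part of the original run.

sudo zfs list srv          # USED should reflect the random files
sudo zpool list -v srv     # allocation per RAIDZ1 VDev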

Yank out (literally) one of the physical disks and replace it.
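
(If you'd rather not pull hardware, a roughly similar degraded state can be simulated by temporarily offlining a disk with zpool offline -t. That's an alternative sketched here for convenience, not what was done in this test; the pulled disk shows up as UNAVAIL below rather than OFFLINE.)

sudo zpool offline -t srv sdc    # offline until next reboot
sudo zpool status srv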

sudo zpool status

  pool: srv                                                                                              
 state: DEGRADED                                                                                         
status: One or more devices could not be used because the label is missing or                            
        invalid.  Sufficient replicas exist for the pool to continue                                     
        functioning in a degraded state.                                                                 
action: Replace the device using 'zpool replace'.                                                        
   see: http://zfsonlinux.org/msg/ZFS-8000-4J                                                            
  scan: none requested                                                                                   
config:                                                                                                  
                                                                                                         
        NAME                     STATE     READ WRITE CKSUM                                              
        srv                      DEGRADED     0     0     0                                              
          raidz1-0               DEGRADED     0     0     0                                              
            sdb                  ONLINE       0     0     0                                              
            6847353731192779603  UNAVAIL      0     0     0  was /dev/sdc1                               
            sdd                  ONLINE       0     0     0                                              
            sde                  ONLINE       0     0     0                                              
            sdf                  ONLINE       0     0     0                                              
            sdg                  ONLINE       0     0     0                                              
            sdh                  ONLINE       0     0     0                                              
            sdi                  ONLINE       0     0     0                                              
          raidz1-1               ONLINE       0     0     0                                              
            sdr                  ONLINE       0     0     0                                              
            sds                  ONLINE       0     0     0                                              
            sdt                  ONLINE       0     0     0                                              
            sdu                  ONLINE       0     0     0                                              
            sdj                  ONLINE       0     0     0                                              
            sdk                  ONLINE       0     0     0                                              
            sdl                  ONLINE       0     0     0                                              
            sdm                  ONLINE       0     0     0  
          raidz1-2               ONLINE       0     0     0  
            sdv                  ONLINE       0     0     0  
            sdw                  ONLINE       0     0     0  
            sdx                  ONLINE       0     0     0  
            sdy                  ONLINE       0     0     0  
            sdn                  ONLINE       0     0     0  
            sdo                  ONLINE       0     0     0  
            sdp                  ONLINE       0     0     0  
            sdq                  ONLINE       0     0     0             
                                                                           
errors: No known data errors

Insert a new disk and replace the missing one.

allen@server:~$ lsblk                                                                                    
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT                                                              
sda      8:0    0 465.8G  0 disk                                                                         
├─sda1   8:1    0 449.9G  0 part /                                                                       
├─sda2   8:2    0     1K  0 part                                                                         
└─sda5   8:5    0  15.9G  0 part [SWAP]                                                                  
sdb      8:16   1 931.5G  0 disk                                                                         
├─sdb1   8:17   1 931.5G  0 part                                                                         
└─sdb9   8:25   1     8M  0 part                                                                         
sdc      8:32   1 931.5G  0 disk                       # <-- new disk showed up here
sdd      8:48   1 931.5G  0 disk                                                                         
├─sdd1   8:49   1 931.5G  0 part                                                                         
└─sdd9   8:57   1     8M  0 part    
...


sudo zpool replace srv 6847353731192779603 /dev/sdc -f
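
While the resilver runs, progress shows up in zpool status (below), or you can watch per-device throughput with zpool iostat; the 5-second interval here is just an example.

sudo zpool iostat -v srv 5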

sudo zpool status

  pool: srv
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
        continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Fri Mar 22 15:50:21 2019
    131G scanned out of 13.5T at 941M/s, 4h7m to go
    5.40G resilvered, 0.95% done
config:

        NAME                       STATE     READ WRITE CKSUM
        srv                        DEGRADED     0     0     0
          raidz1-0                 DEGRADED     0     0     0
            sdb                    ONLINE       0     0     0
            replacing-1            OFFLINE      0     0     0
              6847353731192779603  OFFLINE      0     0     0  was /dev/sdc1/old
              sdc                  ONLINE       0     0     0  (resilvering)
            sdd                    ONLINE       0     0     0
            sde                    ONLINE       0     0     0
            sdf                    ONLINE       0     0     0
...

We can see the resilver scanning at 941M/s. Not too bad.

Build a 1-Wide RAIDZ3

sudo zpool destroy srv

zpool create \
-m /srv srv \
-O compression=lz4 \
raidz3 \
 sdb sdc sdd sde sdf sdg sdh sdi \
 sdj sdk sdl sdm sdn sdo sdp sdq \
 sdr sds sdt sdu sdv sdw sdx sdy -f
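
Note that both layouts give up three of the 24 disks to parity, so usable capacity comes out roughly the same; only the VDev count changes. If you want to compare for yourself (sizes will differ a little because of RAIDZ padding and overhead):

sudo zpool list srv    # raw pool size
sudo zfs list srv      # usable space as seen by datasets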

Copy a lot of random data to it again (as above).

Replace a disk (as above).

allen@server:~$ sudo zpool status
  pool: srv
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool will
 continue to function, possibly in a degraded state.
action: Wait for the resilver to complete.
  scan: resilver in progress since Sun Mar 24 10:07:18 2019
    27.9G scanned out of 9.14T at 362M/s, 7h19m to go
    1.21G resilvered, 0.30% done
config:

 NAME             STATE     READ WRITE CKSUM
 srv              DEGRADED     0     0     0
   raidz3-0       DEGRADED     0     0     0
     sdb          ONLINE       0     0     0
     replacing-1  OFFLINE      0     0     0
       sdd        OFFLINE      0     0     0
       sdc        ONLINE       0     0     0  (resilvering)
     sde          ONLINE       0     0     0
     sdf          ONLINE       0     0     0

So that's running quite a bit slower: 362M/s versus 941M/s. Not exactly a third of the speed, but closer to that than not.

Surprise Ending

That was all about speed. What about reliability?

Our first resilver was much faster, but it ended badly. More read and checksum errors popped up during the rebuild, some on the same VDev that was being resilvered. Since RAIDZ1 has only a single parity disk per VDev, and that redundancy was already being consumed rebuilding the missing drive, those errors could not be repaired, and the resilver finished with data errors.

sudo zpool status

  pool: srv
 state: DEGRADED
status: One or more devices has experienced an error resulting in data
 corruption.  Applications may be affected.
action: Restore the file in question if possible.  Otherwise restore the
 entire pool from backup.
   see: http://zfsonlinux.org/msg/ZFS-8000-8A
  scan: resilvered 571G in 5h16m with 2946 errors on Fri Mar 22 21:06:48 2019
config:

 NAME                       STATE     READ WRITE CKSUM
 srv                        DEGRADED   208     0 2.67K
   raidz1-0                 DEGRADED   208     0 5.16K
     sdb                    ONLINE       0     0     0
     replacing-1            OFFLINE      0     0     0
       6847353731192779603  OFFLINE      0     0     0  was /dev/sdc1/old
       sdc                  ONLINE       0     0     0
     sdd                    ONLINE     208     0     1
     sde                    ONLINE       0     0     0
     sdf                    ONLINE       0     0     0
     sdg                    ONLINE       0     0     0
     sdh                    ONLINE       0     0     0
     sdi                    ONLINE       0     0     0
   raidz1-1                 ONLINE       0     0     0
     sdr                    ONLINE       0     0     0
     sds                    ONLINE       0     0     0
     sdt                    ONLINE       0     0     0
     sdu                    ONLINE       0     0     0
     sdj                    ONLINE       0     0     1
     sdk                    ONLINE       0     0     1
     sdl                    ONLINE       0     0     0
     sdm                    ONLINE       0     0     0
   raidz1-2                 ONLINE       0     0     0
     sdv                    ONLINE       0     0     0
     sdw                    ONLINE       0     0     0
     sdx                    ONLINE       0     0     0
     sdy                    ONLINE       0     0     0
     sdn                    ONLINE       0     0     0
     sdo                    ONLINE       0     0     0
     sdp                    ONLINE       0     0     0
     sdq                    ONLINE       0     0     0

errors: 2946 data errors, use '-v' for a list
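
The -v flag the output mentions lists the damaged files individually, so at least you know exactly what to restore from backup. After restoring (or deleting) them, zpool clear resets the error counters.

sudo zpool status -v srv    # list files with unrecoverable errors
sudo zpool clear srv        # clear the error counts afterwards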

Our second resilver was going very slowly, but did slow and steady win the race? It did, though very, very slowly.

allen@server:~$ sudo zpool status
[sudo] password for allen: 
  pool: srv
 state: ONLINE
  scan: resilvered 405G in 6h58m with 0 errors on Sun Mar 24 17:05:50 2019
config:

 NAME        STATE     READ WRITE CKSUM
 srv         ONLINE       0     0     0
   raidz3-0  ONLINE       0     0     0
     sdb     ONLINE       0     0     0
     sdc     ONLINE       0     0     0
     sde     ONLINE       0     0     0
     sdf     ONLINE       0     0     0
     sdg     ONLINE       0     0     0
      ...
      ...

It slowed down even further: 405G in roughly 7 hours works out to something like 16M/s. I didn't see any checksum errors this time, but that speed is abysmal.
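
As a follow-up (not something from the original test), a scrub after the resilver is a cheap way to confirm that every block is readable and checksums cleanly.

sudo zpool scrub srv
sudo zpool status srv    # check scrub progress and the final result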

Though, to paraphrase Livy, better late than never.

