Fedora ZFS

Further to my recent post, the good news is that the ZFS team have released a version of ZFS which works with the latest Fedora.

The 0.6.5.9 version supports kernels up to 4.10 (which is handy, as that’s the next kernel release I can upgrade my Fedora 25 to, so there’s some future-proofing going on there at last). And I can confirm that, indeed, ZFS installs correctly now.
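
For anyone wanting to run the same check after a kernel update, this is roughly the sequence I’d use. It’s only a sketch: it assumes the zfsonlinux.org Fedora repository has already been configured and that the dkms-based packages are in use.

sudo dnf install kernel-devel zfs   # dkms needs the headers for the running kernel
sudo modprobe zfs                   # fails if no module was built for this kernel
lsmod | grep zfs                    # confirm the module really is loaded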

I’m still a bit dubious about ZFS on a fast-moving distro such as Fedora, though, because just a single kernel update could potentially render your data inaccessible until the good folk at the ZFS development team decide to release new kernel drivers to match. But the situation is at least better than it was.

With a move to the UK pending, I am disinclined to suddenly start wiping terabytes of hard disk and dabbling with a new file system… but give it a few weeks and who knows?!

Fed Up, Zed Up?

Spot the problem shown in these two screen grabs:

You will immediately notice from the second screenshot that version 0.6.5.8 of whatever has been screenshotted only supports kernels up to 4.8, whereas the first screenshot shows a Fedora 25 installation running kernel 4.9. Clearly, that Fedora installation won’t be able to run whatever the second screenshot refers to.

So what is it that I’ve taken that second screenshot of? This:

Oops. It happens to be a screenshot of the current stable release number of the world’s greatest file system for Linux.

Put together, and in plain English, those two version numbers mean: I can’t install ZFS on Fedora.

Or rather, I could have done so when Fedora 25 was freshly installed, straight off the DVD: it ships with a 4.8 kernel, on which the 0.6.5.8 version of ZFS works just fine.

But if I had, say, copied 4.8TB of data onto a freshly created zpool and then updated Fedora, I would now not be able to access my 4.8TB of data at all (because the relevant ZFS kernel modules won’t be able to load into the newly-installed 4.9 kernel). Which sort of makes the ZFS file system a bit less than useful, no?!

Of course, once version 0.7 of ZFS is released (it is currently at release candidate 2), we’re back in business, because ZFS 0.7 supports 4.9 kernels. Unless Fedora go and update themselves to kernel 4.10, of course… in which case it’s presumably back to being inaccessible once more. And so, in cat-and-mouse fashion, ad infinitum…

But here’s the thing: Fedora is, by design, bleeding edge, cutting edge… you name your edge, Fedora is supposed to be on it! So it is likely to be getting new kernel releases every alternate Thursday afternoon, probably. What chance the ZFS developers will match that release cadence, do you think… given that their last stable release is now 4 months old?

About zilch I’d say. Which gives rise to a certain ‘impedance mismatch’, no? Try running ZFS on Fedora, it seems to me, and you’ll be consigning yourself to quite regularly not being able to access your data at all for weeks or months on end, several times a year. (Point releases of the 4.x kernel have been coming every two or three months since 4.0 was unleashed in April 2015, after all).

It strikes me that ZFS and Fedora are, in consequence, not likely to be good bed-fellows, which is a shame.

Perhaps it is time to investigate the data preservative characteristics of Btrfs at last?!

Incidentally, try installing ZFS on a 4.9-kernel-using-Fedora 25 whilst the 0.6.5.8 version of ZFS is the latest-and-greatest on offer and the error you’ll get is this:

The keywords to look for are ‘Bad return status’ and ‘spl-dkms scriptlet failed’. Both mean that the spl-dkms package failed to build its kernel module against the running kernel, and the net effect is that the ZFS kernel modules never get built or loaded. In turn, this means any ZFS-related command will fail:

Of course, you will think that you should then do as the error message tells you and run ‘/sbin/modprobe zfs’ manually. It’s only when you try to do so that you see the more fundamental problem:

And there’s no coming back from that. 🙁
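
If you want to confirm for yourself exactly where things break down, a few standard commands tell the story (this is a sketch of the general approach, not a transcript of my session):

uname -r             # the kernel actually running
dkms status          # shows whether the spl and zfs modules built against that kernel
/sbin/modprobe zfs   # fails when no matching module exists, as described above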

No practical ZFS for a distro? That’s a bit of a deal-breaker for me these days.

Expanding NAS

For various historical reasons (basically ToH being tight with the purse-strings!), my two HP servers got populated with different disks at different times. One got a set of four 3TB drives; the other got luckier and got a set of four 4TB drives.

4 x 3 in RAID0 gives you about 12TB of (vulnerable!) storage; 4 x 4 in RAID5 gives you about 12TB of (relatively safe!) storage. So the two were comparable, and I set them up in these configurations. The server that does media streaming around the house used RAID0; the server that was the ‘source of truth’ for everything used RAID5; and the RAID5 box regularly replicated itself to the RAID0 box, so the fact that the RAID0 box was vulnerable to a single disk failure wasn’t a concern for overall household data integrity.

But it was a bit kludgy, and I’d have preferred to use two equally-sized RAID5 arrays, if only ToH hadn’t minded me splashing out (or “wasting”, as it was called) about AUD$1000 on a set of four fresh 4TB disks.

Colour me surprised, therefore, when I discovered Amazon was offering my preferred Western Digital 4TB Red drives at near-enough half-price! Never being one to look a bargain in the mouth, ToH immediately approved the purchase… and thus put me in a bit of a tricky situation, because things have moved on since the days when I ran RAID0 and RAID5 arrays. I upgraded the two servers at Christmas, for example, and switched to ZFS on Solaris x86, with both servers running raidz (the near-equivalent of RAID5), but with arrays of different sizes because of the different disk sizes.

So my problem now is: how does one migrate a 3TB x 4-disk raidz zpool to being a 4TB x 4-disk zpool, assuming one doesn’t just want to destroy the original zpool, re-create it with the new disks and then re-copy all the data onto the new zpool? Well, here’s how I did it…

As root, check whether the existing zpool is auto-expandable and auto-replaceable, and make it so if not:

zpool get all safedata | grep auto

safedata  autoexpand     off                  local
safedata  autoreplace    off                  default

“Autoexpand” means that if I stick a bunch of 4TB disks in as replacements for a bunch of 3TB disks, the array will automatically see and make use of the extra space. “Autoreplace” means that if I swap out a single old disk for a single new one, the array will automatically begin ‘re-silvering’ the new disk (that is, writing the data which used to be on the old disk back onto the new one) without being manually instructed to do so. Both of these seem to be good ideas but, as the results above show, both features are currently off for my zpool, ‘safedata’. So let me start by fixing that:

zpool set autoexpand=on safedata
zpool set autoreplace=on safedata
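
To double-check before going any further, the same query from earlier can be re-run against just those two properties:

zpool get autoexpand,autoreplace safedata   # both should now report a VALUE of 'on'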

I should mention at this point that I nevertheless trigger my replacements manually in what follows, though I do end up relying on the auto-expansion capability, as you’ll discover by the end of the piece.

Next, list the disks which are part of the pool:

zpool status safedata
  pool: safedata
 state: ONLINE
  scan: scrub repaired 0 in 7h21m with 0 errors on Fri May  6 18:33:01 2016

config:

        NAME        STATE     READ WRITE CKSUM
        safedata    ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0

errors: No known data errors

So we know the device names from that listing and can see that all devices are online. All you need to do now is to offline one of the disks. I’ll start with the c3t0d0 drive (which I guess to be the one sitting in the left-most drive bay of my server… there’s no way to actually tell for certain with my hardware, unfortunately!):

zpool offline safedata c3t0d0

To confirm what has happened, I re-check the zpool status:

zpool status safedata
  pool: safedata
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 7h21m with 0 errors on Fri May  6 18:33:01 2016

config:

        NAME        STATE     READ WRITE CKSUM
        safedata    DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            c3t0d0  OFFLINE      0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0

errors: No known data errors

As expected, the whole pool is now running in DEGRADED status (just like RAID5 would). This isn’t fatal to the zpool as a whole, but isn’t a good thing, either: if I had a disk failure now, I’d lose all my data!

Another unfortunate thing about my HP microservers: the drives aren’t hot-swappable. So though I think I’ve offlined the left-most drive in my array, I can’t just pop it out and see. At this point, therefore, I have to shut the server down completely, remove drive 1 while it’s off, and reboot. I then check my zpool status again: if it still lists the pool as ‘degraded’, exactly as before, then we know for certain that c3t0d0 really was the left-most drive in the server. (If it wasn’t, we’d have pulled a good drive out of a degraded array, and at that point the zpool as a whole would become unavailable. You could then re-insert the drive you pulled, reboot once more, re-check the zpool status… and keep going through the drives until you finally pull the correct one).
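
In other words, the confirmation step looks something like this (Solaris syntax; the shutdown flags shown are simply the ones I’d normally use to power off cleanly):

shutdown -y -g0 -i5      # power the server off (init state 5 = powered down)
# ...physically remove the drive you believe to be c3t0d0, then power back on...
zpool status safedata    # still DEGRADED, with only c3t0d0 OFFLINE? You pulled the right drive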

Assuming you did pull the correct drive, shut the server down again, physically install the new 4TB drive in the newly-vacated slot, then power the server back up and issue these commands:

zpool online safedata c3t0d0
zpool replace safedata c3t0d0

Now check the zpool status once more:

zpool status safedata
  pool: safedata
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool 
        will continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Fri May 20 10:31:52 2016
    660G scanned out of 6.88T at 3.58G/s, 29m43s to go
    0 resilvered
config:

        NAME              STATE     READ WRITE CKSUM
        safedata          DEGRADED     0     0     0
          raidz1-0        DEGRADED     0     0     0
            replacing-0   DEGRADED     0     0     0
              c3t0d0/old  UNAVAIL      0     0     0
              c3t0d0      DEGRADED     0     0     0  (resilvering)
            c3t1d0        ONLINE       0     0     0
            c3t2d0        ONLINE       0     0     0
            c3t3d0        ONLINE       0     0     0

errors: No known data errors

Notice how the new drive has been detected and is being resilvered. You must wait for this resilvering process to complete: pull another disk out at this point and your zpool is toast. Happily, it gives you an estimate of when it will be finished. Note that there are two parts to the process: scanning the data on the surviving disks (which is what the output above is reporting) and then actually writing the data back onto the new disk, which takes a lot longer to complete. In any case, you aren’t left entirely guessing when its work might be done.

So: just keep doing ‘zpool status’ commands to see where it’s got to:

  scan: resilver in progress since Fri May 20 10:31:52 2016
    6.88T scanned
    1.61T resilvered at 364M/s, 93.28% done, 22m10s to go
config:

        NAME              STATE     READ WRITE CKSUM
        safedata          DEGRADED     0     0     0
          raidz1-0        DEGRADED     0     0     0
            replacing-0   DEGRADED     0     0     0
              c3t0d0/old  UNAVAIL      0     0     0
              c3t0d0      DEGRADED     0     0     0  (resilvering)
            c3t1d0        ONLINE       0     0     0
            c3t2d0        ONLINE       0     0     0
            c3t3d0        ONLINE       0     0     0

errors: No known data errors
date
Friday, May 20, 2016 04:00:35 PM AEST
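
If you’d rather not keep re-typing that, a simple loop can do the polling for you (a rough sketch; the ten-minute interval is entirely arbitrary):

while zpool status safedata | grep -q 'resilver in progress'; do
    sleep 600
done
echo "Resilver complete"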

When the re-silver has finished, you will see this sort of thing:

zpool status safedata
  pool: safedata
 state: ONLINE
  scan: resilvered 1.72T in 5h53m with 0 errors on Fri May 20 16:25:44 2016

config:

        NAME        STATE     READ WRITE CKSUM
        safedata    ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0

errors: No known data errors
date
Friday, May 20, 2016 04:30:03 PM AEST

Note how the pool is no longer degraded; neither is the c3t0d0 disk… and the old one is no longer even listed. Job done… for that disk, at least.

Once drive #1 has been re-silvered successfully, you can repeat the process for each of the other disks, one at a time. It will take a while but, eventually, if you do it carefully, you’ll have a newly-sized raidz zpool without ever having had to destroy and re-create the original.
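
For what it’s worth, once a replacement drive is physically in place, the per-disk cycle boils down to something like this (a sketch only; substitute whichever disk you have just swapped):

# offline the disk, power down, swap the physical drive, boot back up, then:
disk=c3t1d0                     # hypothetical example: the disk just swapped
zpool online safedata "$disk"
zpool replace safedata "$disk"
# ...and wait for the resilver to finish; the polling loop shown earlier works here too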

In my case, each disk took around 5 hours to re-silver. So, for all four disks, it basically took me about a day.

At the end of it, the file system declares itself satisfied:

zpool list
NAME       SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool      111G  24.9G  86.1G  22%  1.00x  ONLINE  -
safedata  14.5T  7.17T  7.33T  49%  1.00x  ONLINE  -

ZFS’s way of reporting physical disk or array sizes is never entirely transparent to me, but if it’s claiming I’ve got 14.5TB, I assume that’s what 4 x 4TB looks like in “real money” (i.e., accounting for disk manufacturers’ insistence on counting in base 10 rather than base 2, and for similar assorted space overheads). Compare it with the zpool list I got before I started all this, anyway:

zpool list
NAME       SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool      111G  11.3G  99.7G  10%  1.00x  ONLINE  -
safedata  10.9T  7.32T  3.55T  67%  1.00x  ONLINE  -

You see how safedata used to report just 10.9TB; now it reports 14.5TB: an increase of about 3.6TB, or roughly the extra 1TB per disk, across four disks, that I would hope to have seen! The specific figures might look a little ‘odd’, therefore, but they do indicate that the zpool has automatically grown to accommodate the new disks supplied.
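
If you want to check that arithmetic, bear in mind that (as far as I understand it) ‘zpool list’ reports the raw size of all member disks, parity included, and that drives are sold in decimal terabytes while ZFS reports binary ones. A quick back-of-envelope check with bc:

echo "scale=2; (4 * 4 * 10^12) / 2^40" | bc   # 14.55, i.e. four 4TB drives expressed in TiB
echo "scale=2; (4 * 3 * 10^12) / 2^40" | bc   # 10.91, i.e. four 3TB drives expressed in TiB

Those match the 14.5T and 10.9T figures reported above closely enough for me.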

Which is all good and makes me happy.