Fedora ZFS

Further to my recent post, the good news is that the ZFS team have released a version of ZFS which will work on the latest Fedora:

The 0.6.5.9 version supports kernels up to 4.10 (which is handy, as that’s the next kernel release I can upgrade my Fedora 25 to, so there’s some future-proofing going on there at last). And I can confirm that, indeed, ZFS installs correctly now.

I’m still a bit dubious about ZFS on a fast-moving distro such as Fedora, though, because just a single kernel update could potentially render your data inaccessible until the good folk at the ZFS development team decide to release new kernel drivers to match. But the situation is at least better than it was.
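If you do decide to take the plunge anyway, one partial mitigation is to stop Fedora from updating the kernel until a matching ZFS release actually exists. A rough sketch, using nothing more exotic than dnf's standard exclude directive (the versionlock plugin would be the politer way of doing the same thing):

# Hold the kernel at its current version until ZFS catches up.
# Remember to remove the line again afterwards, or you'll never
# see another kernel update at all.
echo "exclude=kernel*" >> /etc/dnf/dnf.conf

# ...and when a matching ZFS release appears, release the hold:
sed -i '/^exclude=kernel/d' /etc/dnf/dnf.conf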

With a move to the UK pending, I am disinclined to suddenly start wiping terabytes of hard disk and dabbling with a new file system… but give it a few weeks and who knows?!

Fed Up, Zed Up?

Spot the problem shown in these two screen grabs:

You will immediately notice from the second screenshot that version 0.6.5.8 of whatever it is that has been screenshotted only supports kernels up to 4.8, whereas the first screenshot shows a Fedora 25 installation running kernel 4.9. Clearly that Fedora installation won’t be able to run whatever the second screenshot is referring to.

So what is it that I’ve taken that second screen shot of? This:

Oops. It happens to be a screenshot of the current stable release number of the world’s greatest file system for Linux.

Put together, and in plain English, the combination of the two version numbers means: I can’t install ZFS on Fedora.

Or rather, I could have done so when Fedora 25 was freshly installed, straight off the DVD: it ships with a 4.8 kernel, so the 0.6.5.8 version of ZFS would have worked just fine on it.

But if I had, say, copied 4.8TB of data onto a freshly created zpool and then updated Fedora, I would now not be able to access my 4.8TB of data at all (because the relevant ZFS kernel modules won’t be able to load into the newly-installed 4.9 kernel). Which sort of makes the ZFS file system a bit less than useful, no?!

Of course, once they release version 0.7 of ZFS (which is currently at release candidate 2), then we’re back in business -because ZFS 0.7 supports 4.9 kernels. Unless Fedora go and update themselves to kernel 4.10, of course… in which case it’s presumably back to being inaccessible once more. And so, in cat-and-mouse fashion, ad infinitum…

But here’s the thing: Fedora is, by design, bleeding edge, cutting edge… you name your edge, Fedora is supposed to be on it! So it is likely to be getting new kernel releases every alternate Thursday afternoon, probably. What chance the ZFS developers will match that release cadence, do you think… given that their last stable release is now 4 months old?

About zilch, I’d say. Which gives rise to a certain ‘impedance mismatch’, no? Run ZFS on Fedora, it seems to me, and you’ll regularly find yourself locked out of your own data for weeks or months at a stretch, several times a year. (Point releases of the 4.x kernel have been coming every two or three months since 4.0 was unleashed in April 2015, after all).

It strikes me that ZFS and Fedora are, in consequence, not likely to be good bed-fellows, which is a shame.

Perhaps it is time to investigate the data preservative characteristics of Btrfs at last?!

Incidentally, try installing ZFS on a 4.9-kernel-using-Fedora 25 whilst the 0.6.5.8 version of ZFS is the latest-and-greatest on offer and the error you’ll get is this:

The keywords to look for are ‘Bad return status’ and ‘spl-dkms scriptlet failed’. Both mean that the spl-dkms package didn’t get installed, and the net effect of that is the ZFS kernel modules don’t get loaded. In turn, this means trying to issue any ZFS-related commands will fail:

Of course, you will think that you should then do as the error message tells you: run ‘/sbin/modprobe zfs’ manually. It’s only when you try to do so that you see the more fundamental problem:

And there’s no coming back from that. 🙁
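For what it’s worth, a hedged way of confirming the diagnosis, assuming the dkms tooling itself made it onto the system, is to ask dkms directly (0.6.5.8 being the version in play here):

# Show the state of every dkms-managed module; anything not reported
# as 'installed' against the running kernel never finished building.
dkms status

# Attempting the builds by hand makes the failure explicit: against a
# 4.9 kernel they simply error out again, because 0.6.5.8 has no
# support for it, and no amount of modprobe-ing will conjure a module
# that was never built.
dkms install spl/0.6.5.8
dkms install zfs/0.6.5.8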

No practical ZFS for a distro? That’s a bit of a deal-breaker for me these days.

Expanding NAS

For various historical reasons (basically ToH being tight with the purse-strings!), my two HP servers got populated with different disks at different times. One got a set of four 3TB drives; the other got luckier and got a set of four 4TB drives.

4 x 3 in RAID0 gives you about 12TB of (vulnerable!) storage; 4 x 4 in RAID5 gives you about 12TB of (relatively safe!) storage. So the two were comparable, and I set them up in these configurations. The server that does media streaming around the house used RAID0, the server that was the ‘source of truth’ for everything used RAID5, and the RAID5 box regularly replicated itself to the RAID0 box, so the fact that the RAID0 box was vulnerable to a single disk failure wasn’t a concern for overall household data integrity.

But it was a bit kludgy, and I’d have preferred to use two equally-sized RAID5 arrays, if only ToH hadn’t minded me splashing out (or “wasting”, as it was called) about AUD$1000 on a set of four fresh 4TB disks.

Colour me surprised, therefore, when I discovered Amazon was offering my preferred Western Digital 4TB Red drives at near-enough half-price! Never being one to look a bargain in the mouth, ToH immediately approved the purchase… and thus put me in a bit of a tricky situation, because things have moved on since the days I ran RAID0 and RAID5 arrays. I upgraded the two servers at Christmas, for example; and switched to using ZFS on Solaris x86 -both servers running raidz (the near-equivalent of RAID5), but of different array sizes because of the different disk sizes.

So my problem now is: how does one migrate a 3TB x 4-disk raidz zpool to being a 4TB x 4-disk zpool, assuming one doesn’t just want to destroy the original zpool, re-create it with the new disks and then re-copy all the data onto the new zpool? Well, here’s how I did it…

As root, check whether the existing zpool is auto-expandable and auto-replaceable, and make it so if not:

zpool get all safedata | grep auto

safedata  autoexpand     off                  local
safedata  autoreplace    off                  default

“Autoexpand” means that if I stick a bunch of 4TB disks in as replacements for a bunch of 3TB disks, the array will automatically see and make use of the extra space. “Autoreplace” means that if I swap out a single old disk for a single new one, the array will automatically begin ‘re-silvering’ the new disk -that is, writing the data which used to be on the old disk back onto the new one without being manually instructed to do so. Both of these seem like good ideas, but as the results above show, both features are currently off for my zpool, ‘safedata’. So let me start by fixing that:

zpool set autoexpand=on safedata
zpool set autoreplace=on safedata
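A quick re-run of the query confirms the change has taken; both properties should now report ‘on’, with ‘local’ as their source:

zpool get autoexpand,autoreplace safedata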

I should mention at this point that I nevertheless trigger my replacements manually in what follows, though I do end up relying on the auto-expansion capability, as you’ll discover by the end of the piece.

Next, list the disks which are part of the pool:

zpool status safedata
  pool: safedata
 state: ONLINE
  scan: scrub repaired 0 in 7h21m with 0 errors on Fri May  6 18:33:01 2016

config:

        NAME        STATE     READ WRITE CKSUM
        safedata    ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0

errors: No known data errors

So we know the device names from that listing and can see that all devices are online. All you need to do now is to offline one of the disks. I’ll start with the c3t0d0 drive (which I guess to be the one sitting in the left-most drive bay of my server… there’s no way to actually tell for certain with my hardware, unfortunately!):

zpool offline safedata c3t0d0

To confirm what has happened, I re-check the zpool status:

# zpool status safedata
  pool: safedata
 state: DEGRADED
status: One or more devices has been taken offline by the administrator.
        Sufficient replicas exist for the pool to continue functioning in a
        degraded state.
action: Online the device using 'zpool online' or replace the device with
        'zpool replace'.
  scan: scrub repaired 0 in 7h21m with 0 errors on Fri May  6 18:33:01 2016

config:

        NAME        STATE     READ WRITE CKSUM
        safedata    DEGRADED     0     0     0
          raidz1-0  DEGRADED     0     0     0
            c3t0d0  OFFLINE      0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0

errors: No known data errors

As expected, the whole pool is now running in DEGRADED status (just like RAID5 would). This isn’t fatal to the zpool as a whole, but isn’t a good thing, either: if I had a disk failure now, I’d lose all my data!

Another unfortunate thing about my HP microservers: the drives aren’t hot-swappable. So though I think I’ve offlined the left-most drive in my array, I can’t just pop that out and see. At this point, therefore, I have to shut down my server completely and, when it’s down, remove drive 1 and reboot. I check my zpool status again and if it still lists the pool as ‘degraded’, exactly as before, then we know for certain that c3t0d0 really was the left-most drive in the server. (If it wasn’t, we’d have pulled a good drive out of an already-degraded array, and at that point the zpool as a whole would be reported as unavailable for lack of sufficient replicas. You could then put back the drive you pulled, reboot once more and re-check the zpool status …and keep going through the drives until you finally pull the correct one).
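Incidentally, one way of taking some of the guesswork out of the pull-and-pray approach, assuming the drives have their serial numbers printed on their labels (most do), is to note down which serial number belongs to which device name before you power off. On Solaris, something along these lines will list them:

# List each disk device together with its reported serial number
iostat -En | egrep 'c3t|Serial'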

Assuming you did pull the correct drive, shut the server down again and physically install the new 4TB drive in the slot you’ve just vacated. Power the server up and issue these commands:

zpool online safedata c3t0d0
zpool replace safedata c3t0d0

Now check the zpool status once more:

# zpool status safedata
  pool: safedata
 state: DEGRADED
status: One or more devices is currently being resilvered.  The pool 
        will continue to function in a degraded state.
action: Wait for the resilver to complete.
        Run 'zpool status -v' to see device specific details.
  scan: resilver in progress since Fri May 20 10:31:52 2016
    660G scanned out of 6.88T at 3.58G/s, 29m43s to go
    0 resilvered
config:

        NAME              STATE     READ WRITE CKSUM
        safedata          DEGRADED     0     0     0
          raidz1-0        DEGRADED     0     0     0
            replacing-0   DEGRADED     0     0     0
              c3t0d0/old  UNAVAIL      0     0     0
              c3t0d0      DEGRADED     0     0     0  (resilvering)
            c3t1d0        ONLINE       0     0     0
            c3t2d0        ONLINE       0     0     0
            c3t3d0        ONLINE       0     0     0

errors: No known data errors
You have new mail in /var/mail/root

Notice how the new drive has been detected and is being resilvered. You must wait for this resilvering process to complete. You pull another disk out at this point and your zpool is toast. Happily, it gives you an estimate of when it will be finished. Note that there are two parts to the process: scanning the data on the surviving disks -which is what is being mentioned above- and then actually writing the data back to the new disk, which takes a lot longer to complete. In any case, you aren’t left entirely guessing when its work might be done.

So: just keep doing ‘zpool status’ commands to see where it’s got to:

  scan: resilver in progress since Fri May 20 10:31:52 2016
    6.88T scanned
    1.61T resilvered at 364M/s, 93.28% done, 22m10s to go
config:

        NAME              STATE     READ WRITE CKSUM
        safedata          DEGRADED     0     0     0
          raidz1-0        DEGRADED     0     0     0
            replacing-0   DEGRADED     0     0     0
              c3t0d0/old  UNAVAIL      0     0     0
              c3t0d0      DEGRADED     0     0     0  (resilvering)
            c3t1d0        ONLINE       0     0     0
            c3t2d0        ONLINE       0     0     0
            c3t3d0        ONLINE       0     0     0

errors: No known data errors
# date
Friday, May 20, 2016 04:00:35 PM AEST
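If repeatedly typing that gets tedious, a simple polling loop (a sketch only; adjust the pool name and interval to taste) will do the nagging for you:

# Report progress every ten minutes until the resilver finishes
while zpool status safedata | grep 'resilver in progress' > /dev/null
do
    zpool status safedata | grep 'to go'
    sleep 600
done
echo "Resilver finished at $(date)"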

When the re-silver has finished, you will see this sort of thing:

# zpool status safedata
  pool: safedata
 state: ONLINE
  scan: resilvered 1.72T in 5h53m with 0 errors on Fri May 20 16:25:44 2016

config:

        NAME        STATE     READ WRITE CKSUM
        safedata    ONLINE       0     0     0
          raidz1-0  ONLINE       0     0     0
            c3t0d0  ONLINE       0     0     0
            c3t1d0  ONLINE       0     0     0
            c3t2d0  ONLINE       0     0     0
            c3t3d0  ONLINE       0     0     0

errors: No known data errors
# date
Friday, May 20, 2016 04:30:03 PM AEST

Note how the pool is no longer degraded; neither is the c3t0d0 disk… and the old one is no longer even listed. Job done… for that disk, at least.

Once drive #1 has been re-silvered successfully, you can repeat the process for each of the other disks, one at a time, in turn. It will take a while, but eventually, if you do it carefully, you’ll have a newly-sized raidz zpool without having had to destroy and re-create the original.
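In other words, for each remaining disk the recipe is exactly the same. A recap in shell-shaped shorthand (the physical steps obviously can’t be scripted):

zpool offline safedata c3t1d0     # take the next disk out of the pool
# ...shut down, swap the physical 3TB drive for a 4TB one, power up...
zpool online safedata c3t1d0
zpool replace safedata c3t1d0     # kick off the resilver onto the new disk
# ...then wait for 'zpool status safedata' to report the pool ONLINE
# again before starting on c3t2d0, and finally c3t3d0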

In my case, each disk took around 5 hours to re-silver. So, for all four disks, it basically took me about a day.

At the end of it, the file system declares itself satisfied:

$ zpool list
NAME       SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool      111G  24.9G  86.1G  22%  1.00x  ONLINE  -
safedata  14.5T  7.17T  7.33T  49%  1.00x  ONLINE  -

ZFS’s ability to report on physical disk or array sizes is never entirely transparent to me, but if it’s claiming I’ve got 14.5TB, I assume that’s what 4 x 4TB looks like in “real money” (i.e., accounting for disk manufacturers’ inability to count in base 2, and for similar assorted space overheads). Compare it with the zpool list I got before I started all this, anyway:

# zpool list
NAME       SIZE  ALLOC   FREE  CAP  DEDUP  HEALTH  ALTROOT
rpool      111G  11.3G  99.7G  10%  1.00x  ONLINE  -
safedata  10.9T  7.32T  3.55T  67%  1.00x  ONLINE  -

You see how safedata used to report just 10.9TB; now it’s 14.5TB -an increase of about 3.6TB, or roughly the 1 extra TB per disk for four disks that I would hope to have seen! The figures might be a little bit ‘odd’, therefore, in their specific values; but they do generally indicate the zpool has automatically grown to accommodate the new disks supplied.
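A quick back-of-the-envelope check suggests those figures are about what they should be, too: four 4TB drives are 16 x 10^12 bytes in ‘manufacturer money’, which is a shade over 14.5TiB in honest base-2 terms, and the extra 1TB per disk similarly shrinks to about 3.6TiB:

echo 'scale=2; 4*4*10^12 / 2^40' | bc
14.55

echo 'scale=2; 4*10^12 / 2^40' | bc
3.63

(The reason the number looks so generous, incidentally, is that zpool list reports the raw capacity of a raidz pool, parity included; zfs list would show the rather smaller figure that’s actually usable.)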

Which is all good and makes me happy.

Stop the Rot

I was reading this article towards the end of the week: interesting, I thought. Then I got home to discover that a batch convert-FLAC-to-MP3 job that had been running all week had finally finished… and was reporting that 16 files couldn’t be converted because they contained internal errors. Knowing that the files were fine about three months ago, suddenly the idea of bitrot didn’t seem of quite such academic interest. I suspect a couple of power outages just before Christmas might have been to blame: one was planned and we were able to shut things down cleanly ahead of time, but the second came out of the blue and meant servers in full flight were suddenly crashed.

Fortunately, I have two servers that sync with each other, and I was able to fetch good copies of the affected files from the other server. That in itself is interesting: had the files’ date stamps been altered, the sync job would have copied the bad files over on top of the spare server’s good copies. That the bad versions hadn’t been copied over to the second server was itself proof that the problem had occurred on the first server’s file system, silently enough to not arouse the attention of the synchronisation software (thankfully!).

These servers have both been running Windows 2012 and a software RAID5 with NTFS for quite a while now -and this isn’t the first time I’ve noticed FLAC files which I know to have been good at some point become bad by some other point. The files are still playable, generally -but not convertible, because internal corruption confuses the converter.

Since that’s not acceptable, it’s time to do something about it.

This weekend, Windows 2012 bit the dust and was replaced with FreeNAS 9.2. It’s incredibly easy to install and set up, and within 20 minutes, I’d installed the OS, created a RAIDZ ZFS volume, shared it as a CIFS (Samba) share and started copying known good copies of everything onto it. By Sunday morning, the entire 5TB of data was across, and I had a shiny new NAS with the ZFS file system protecting me from further, future silent corruptions.
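For the curious, the volume-creation step in the GUI boils down, under the hood, to something not far off the following (a sketch only: the da* device names are illustrative FreeBSD ones, and FreeNAS does a fair bit of extra housekeeping, such as GPT-partitioning the disks, that this ignores):

# Create a raidz (RAID5-like) pool across the four data disks
zpool create SafeData raidz da1 da2 da3 da4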

I’d used FreeNAS before, when I first got the servers, but I’d ditched it for the more familiar realms of CentOS and mdadm (and subsequently for Windows 2012 by way of ‘testing it out’). I’m not sure what version of FreeNAS that was, but since it was mid-2012, it was probably version 8.something -and I had “issues” with it (stuttering streaming audio and video files, for example, indicating I/O woes). It put me off at the time, but -although it’s early days- I have no such qualms with this latest version. The web interface is self-explanatory, slick and fast; there have been no audio or video performance issues; the system status reports are nicely done and give me assurance that my modest server is coping well with what’s being asked of it:

This afternoon, it’s time for the other server to be similarly wiped and for the file copying to proceed in the reverse direction. Sometime in the middle of the week, I hope to work out how to get the two servers to sync with each other.

One unintended side-effect has been the loss of Windows-based Internet Connection Sharing (the now-FreeNAS’d server was doing duty as the ICS server, too). Where I live, there’s no wired broadband, so we have to use our mobile phones as Internet wifi hotspots …and the Windows server had been using an ancient Realtek-based wireless USB stick to connect to that before sharing it out via the wired network. Worked quite well, all things considered… but without Windows, it was always going to be a struggle getting the same USB stick working on either the same server in its new BSD guise, or on my Linux desktop.

I spent a few hours fiddling with it and at one point, Linux did happily connect to the Internet with it… but it would randomly disconnect after just a minute or two, so that was not something I was going to be able to rely on.

In the end, I just cut out the middleman: my phone now connects directly to my desktop via USB tethering and gives the desktop all the Internet it wants. A quick iptables -t nat -A POSTROUTING -o usb0 -j MASQUERADE then meant my desktop was doing Internet Connection Sharing duty to the rest of the house without further fuss.
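For anyone copying that at home, the MASQUERADE rule on its own usually isn’t quite the whole story: IP forwarding has to be switched on, and the forward chain has to permit the traffic. A hedged sketch, where usb0 is the tethered phone and eth0 stands in for whatever your wired NIC happens to be called:

# Let the kernel route packets between the wired LAN and the phone
sysctl -w net.ipv4.ip_forward=1

# Rewrite outbound traffic so it appears to come from the desktop itself
iptables -t nat -A POSTROUTING -o usb0 -j MASQUERADE

# ...and allow the forwarded traffic to flow in both directions
iptables -A FORWARD -i eth0 -o usb0 -j ACCEPT
iptables -A FORWARD -i usb0 -o eth0 -m state --state RELATED,ESTABLISHED -j ACCEPT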

New Year NAS

It was six months ago that I got a couple of HP microservers, slapped Windows 2008 R2 on them, software-RAID5’d three 2TB disks in each and thus acquired, in total, about 8TB of usable, protected storage. It now being the New Year, it was time I had a bit of a re-think about those servers.

First, I wanted to bump up the storage. The servers come with only 4 disk bays (each), and one of them was occupied by the Windows O/S disk -hence there were only 3 bays for the ‘safe storage’ drives. Happily, I find I own a couple of 60GB solid state disks that are sitting on the shelf doing nothing -and it’s trivially easy to plonk them in as boot disks, freeing up all four drive bays for safe storage duty. Even more happily, I had a couple of spare 2TB drives sitting around, so those newly freed-up drive bays could be put to good use.

So, each server now has a 60GB O/S drive, plus 8TB of raw storage (though if you RAID5 four 2TB drives, you end up with around 6TB of usable storage). Per server.

So then it came down to a choice of OS and raiding technology. There were three basic options:

  • Stick with Windows 2008 (or maybe upgrade to Windows 2012)
  • Switch to FreeNAS and start using ZFS
  • Switch to CentOS and use an mdadm software raid, plus a traditional file system (like ext3 or ext4)

In the end, I decided to dump the Windows option, simply because it works, is dull and isn’t very educational. I don’t have moral objections to Microsoft these days, and Win2008R2 has done sterling service for the past six months… but I just can’t get excited about NTFS and dynamic disks anymore!

Much more fun was to back everything up very carefully (only I wasn’t quite as careful as I should have been… see below!) and then wipe one of the servers with a FreeNAS install. If you haven’t met FreeNAS before, I can certainly recommend it. Your server ends up running a console-only BSD O/S which you access and manage via a nicely polished web interface from a remote PC of some kind. Setting up new volumes as RAID-Z (ZFS’s new take on the basic RAID5 principles) was a matter of mere minutes, even as a complete beginner; and setting up Samba shares to expose the newly-protected storage was trivial. It’s slick and I was happy… for about 3 days.

The samba shares are there because ToH still insists on using Windows and the home entertainment consists of a Windows 8 PC sitting under the telly running Windows Media Centre. It was as I was trying to play music via these shares that I encountered a couple of rare ‘stutters’, where the music would pause until “whatever it was” cleared and sorted itself out. That had never happened before. I did some research and discovered various tweaks you can apply to a virgin FreeNAS install to make samba shares perform better -and, as far as I can remember, they worked well enough, but I would occasionally still encounter a stutter or two.

I don’t think I’m alone in having found that the N40L is a great little server that lacks a bit of puff, though -and once I read that others have had “issues” with it and FreeNAS, I’m afraid my mind got a little bit closed on the matter. The N40L is a lovely little server, but it’s not a performance giant …and asking it to cope with ZFS raid, even with 8GB RAM fitted, was probably a bit on the ambitious side. I’ll confess, too, that whilst FreeNAS seems a great way to do things, it’s new (to me) and thus has a learning curve: this being the Christmas Hols, and me having more inclination to Deck the Halls than battle with ZFS, I fear the pull of the tried-and-true began to outweigh the excitement of doing something sexily unfamiliar.

Hence, I ended up wiping the Windows 2008 off my second server and installing ye olde CentOS 6.3. A quick refresher on mdadm (the software RAID application) and I had 6TB of protected storage up-and-running using nothing more exciting than ext3 pretty quickly. I then wandered about, lost and forlorn, for a couple of hours as I struggled to remember how to set up Samba shares manually. I was failing miserably until I remembered that although I had edited my SELinux configuration to be disabled, I hadn’t actually rebooted the server afterwards, so the ‘enforcing’ setting was still effectively in place. That which had remained unshareable for hours, after one quick reboot, became trivially easy to remotely access once more. (PS: I hate SELinux!!)
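In case it helps anyone else, the storage side of that build boils down to something like this (a sketch from memory; the device names are illustrative, and the Samba and SELinux wrangling is left as the exercise it proved to be):

# Build a 4-disk RAID5 array from the data drives (sdb to sde here)
mdadm --create /dev/md0 --level=5 --raid-devices=4 /dev/sdb /dev/sdc /dev/sdd /dev/sde

# Record the array so it re-assembles itself at boot
mdadm --detail --scan >> /etc/mdadm.conf

# Put a plain old ext3 file system on it and mount it
mkfs.ext3 /dev/md0
mkdir -p /data
mount /dev/md0 /data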

So, I now have one server running sexy FreeNAS -perfectly happily, as far as I can tell, though I am suspicious of its performance and suspect it might go horribly wrong at any moment. And then I have the other server running traditional CentOS -also perfectly happily, it would seem, with me basking in the glow of the comforting and familiar and sure that if it ever goes horribly wrong, it will only be because of something I typed.

The eventual plan, of course, is to have both servers configured identically and replicating amongst themselves with a scheduled rsync. So, ultimately, one of them will have to “win” and the other will become a mere clone of it.
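The replication itself will probably end up being nothing fancier than a cron’d rsync pull run on whichever box becomes the clone, along these lines (hostnames and paths are placeholders):

# Mirror the master's data share, deleting anything that no longer
# exists at the source
rsync -avh --delete master:/data/ /data/

# ...scheduled in root's crontab as, say, a 2am nightly run:
# 0 2 * * * rsync -avh --delete master:/data/ /data/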

Having to decide a winner is therefore a bit of a tricky one. In the red corner, there’s the fact that we watched The Dark Knight Rises on New Year’s Day streamed from the FreeNAS box, in high-def, with not one glitch, unscheduled pause or hiccough (though I did fall asleep after 50 minutes, so it’s possible the rest of the boredom actually streamed really badly, just without me noticing). Over in the blue corner, however, is the fact that a 1.5GHz AMD Turion is probably not up to running ZFS effectively, even with 8GB RAM to play with… whereas boring ext3 and a bit of mdadm probably suits that quite well. Difficult call to make, I think, without me spending a lot of time doing benchmarking I don’t actually have time to do…

There is one real nasty in all of this -and it’s an oldie and a goodie! Foreign characters in file names have been an issue once more. Take this bit of gibberish from an ssh session connected directly to the FreeNAS server:

/mnt/SafeData/Music/classical/Richard Wagner# ls
./                                Parsifal/
../                               Parsifal (Levine)/
Das Rheingold/                    Rienzi/
Der fliegende Holl??nder/         Siegfried/
Die Meistersinger von N??rnberg/  Siegfried-Idyll/
Die Walk??re/                     Tannh??user/
G??tterd??mmerung/                Tristan and Isolde/
Lohengrin (Keilberth)/            Tristan und Isolde (Bernstein)/
Orchestral Music from the Operas/ Wesendonk-Lieder/

Anyone wanting to listen to a bit of Walküre or Götterdämmerung is going to be a bit disappointed, I think. However, maybe not, judging by this screenshot of the very same directory of the very same server taken on my Fedora-running PC:

Even Windows clients have no problem with that directory listing:

Similarly, when I copy exactly the same files from exactly the same external USB drive (formatted with NTFS by the original Windows 2008 server) onto the CentOS server, I have absolutely no problem with foreign characters at all… so somewhere during the FreeNAS construction process, I suspect there was a file system mounted without UTF-8 being specified as one of the mount options. Since I know -from long and bitter experience- to do this on my Linux boxes without even thinking about it, I imagine this is one unfortunate result of my inexperience with all things FreeNAS and BSD! It is, therefore, another black mark against FreeNAS, despite it almost certainly being a result of a PEBKAC.
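By way of illustration, this is the sort of thing I mean when mounting by hand on a Linux box (the options vary by file system and the names here are made up, so treat it as indicative rather than gospel):

# A Samba share mounted with an explicit character set...
mount -t cifs -o iocharset=utf8,username=guest //server/SafeData /mnt/nas

# ...or a FAT-formatted USB stick likewise
mount -t vfat -o utf8 /dev/sdf1 /mnt/usb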

Anyway, I haven’t quite decided yet, but I suspect that CentOS will win out in this race, simply because I have other things to worry about at the moment.

Such as what operating system to run on my main PC this year? Yup: that’s still up for grabs, despite all the experimental, exploratory installing I did a few months ago. I spent some weeks with Windows 8, which wasn’t as horrible as I expected, but ultimately hit the boredom buffers, as often happens with me and Windows. Currently, as one of the above screenshots shows, I am running Fedora 17 with the XFCE window manager (Gnome 3 being utterly beyond the pale as far as I am concerned). It’s OK, but I am encountering a hell of a lot of bugs, crashes and other assorted weirdnesses. So I don’t think it’s for me long-term.

Funnily enough, I reckon I know what I’ll be installing on my new 32GB RAM PC (when it finally arrives): CentOS 6.3. It is stable, cuddly (in a Gnome 2ish sort-of way) and functional, yet still requires copious quantities of command line affection from time to time. Sounds like all I could really ask for in a desktop OS! Roll on PC delivery!!

Until then, or the next time, Happy New Year to all my readers.