Moving my server to a hypervisor

My server was, until recently, Ubuntu (ending up with 21.10), which ran my services as docker containers. I ran my server on zfs root & boot, which meant that I could switch between different operating systems relatively easily. Particularly nice was the ability to switch to a previous known running version in case of panic. I used this very occasionally, but it’s worthwhile insurance.

The services running on my system, as docker-compose stacks, included:

  • Mail (mailu)
  • Internal cloud file shares (owncloud & nextcloud oscillating between the two)
  • Home automation (home assistant)
  • NFS
  • Samba
  • FTP server
  • Jitsi
  • Database server
  • 3 WordPress sites
  • Piwigo gallery

I was generally happy with this for a few years. But the main shortcomings were:

  • I had to develop my own backup server solutions
  • The mailu container system was very fragile, requiring extreme orchestral manoeuvres in the dark for no particular reason on rebuilding
  • Owncloud likewise. Things like Samba access to external store required me to build my own docker image.
  • The home assistant stack was very large
  • Let’s Encrypt certificate management involved a fragile pas-de-deux between my pfsense firewall and my ftp server container, whose only job was to provide writable access to my filesystems in a form that pfsense could understand.

I had watched the emergence of hypervisors and drooled over the promised capabilities and ease of use they offered.

So I transitioned to a hypervisor. In fact, I did it several times. I knew I wanted to run TrueNas Scale as my nfs/samba server (because I’d used FreeNas way back when), but I did not know if it was good enough as a hypervisor.

A transition in my case needed everything to be running with minimal downtime. I bought a Fujitsu 32GB RAM 8-core 4th generation Intel i7 for £150 on ebay, and in each of the transitions make installed the target hypervisor on this machine and transferred all the services to it before hosing the main server and transferring them all back. How easy it was to “transfer” depended on the hypervisor. My default mode of operation involved lots of “zfs send | ssh <other machine> zfs recv” operations. I moved the entire set of docker-compose stacks into a virtual machine under all the hypervisors, and brought up each stack in turn. For web services, I kept an nginx proxy pointing at the working version by manually editing its config files.

XCP-NG was my first attempt at a hypervisor. It has local roots (I live in Cambridge, UK), which biased me. It promised the most efficient virtualization. But, it really doesn’t play nicely with a zfs root, and has minimal tools for looking after data it hosts. I really don’t know if it’s promise of efficiency is met, but as we’ll see below, efficiency isn’t everything.

TrueNas Scale running bare-metal was my next attempt. It more or less does what I wanted, except that its support for virtual machines is limited. As long as you want to do what it supports, you are fine. But if you want to start passing hardware through to the virtualized machine, it gets a lot harder.

So I ended up with proxmox. I like the following features it offers:

  • Can create a cluster and move virtual machines between nodes with little downtime
  • A proxmox host can be used as a backup destination, and honours systemctl suspend. The Proxmox GUI also supports wake-on-lan for a node, which is a nice touch. So backing up can be power-efficient, combined with cron jobs to wake and sleep the backup target.
  • It has support for backups (non-incremental) and replication (incremental). It supports snapshots – although it hides zfs backups. ZFS backups are sometimes used under the hood.
  • Allows pass-through of whole disks and PCI devices relatively easily.

The downside is that there is a definite performance hit. It’s not the 1-3 percent quoted for containers, it’s more like 25% for whole disks passed through, based on a non-scientific sticking a finger in the wind. But, given that the virtualized TrueNas can support my nfs and samba clients at 1Gbps, I can live with the loss of potential throughput.

The main server proxmox configuration includes:

  • My main server including
    • 16 Core Xeon server, about 5 years old on Supermicro motherboard
    • 64 GB ECC RAM
    • A 2TB Samsung EVO SSD as boot disk and main VM store
    • 4 x 8TB WD drives, which are passed through as whole block devices to the TrueNas VM
    • 2 x 1TB NVME drives, which are passed through as PCI devices to the TrueNas VM
    • 5 Virtual machines and one container running services
    • TrueNas Scale running as one of those VMs providing Samba and NFS shares, and managing an incremental backup of those shares to the backup server
    • Proxmox providing backups of those services to the local TrueNas Scale instance and one running on the backup server
  • My backup server
    • Fujitsu retired mid-range server
    • 8 Core Intel i7, about 8 years old
    • 32 GB RAM
    • 1TB Samsung EVO as boot disk and VM store
    • 1 x 8 TB WD drive, passed through as a whole block device to TrueNas
    • TrueNas Scale acting as an incremental backup target from the main server TrueNas instance, and as a backup targed from the main server proxmox instance.
    • Proxmox cluster member, replication target from main server proxmox instance.

So, I ended up with a two-cluster proxmox node, one of which is awake only during backup periods, and provides a backup and replication target for the other. I had unwound my overly-Byzantine docker-compose stacks into virtual machines with updates managed by the application and individual backup and disaster recovery. I had a potential migration target in case I needed to hose my main server, and a whole-server disaster recovery capability.

Switching services between backup and main servers is now easy. Switching between TrueNas Scale instances as hosting my data takes some work, as I have to bring everything down, perform an incremental zfs copy to the target. Bring up TrueNas shares on the backup server, adjust the local DNS server to point to the new location of the file server, and adjust the nfs mount paths, which embed the name of the pool providing the storage. <rant> Groan, thank you to TrueNas for actively preventing me from hiding the pool names. Given that one TrueNas server has fast and slow pools (NVMEs and hard disks) and the other does not, any NFS clients need to know this and be updated when moving to the backup server. This is a ridiculous state of affairs, and one TrueNas should have fixed years ago. But there you are. </rant>

Experience with the Linux Desktop

Just a quick and short update on my experience with the Linux Desktop. Generally positive, but I’m used to things crashing more frequently than Windows. This goes against the narrative of “linux is more stable”. Well, the operating system might be, but the apps are not.

I’ve tried a number of different operating systems: Kubuntu (20.04, 21.10, 22.04), Neon, Fedora, Manjaro. They each have their pros and cons. Essentially, I keep at least two installations “fresh” in the sense that I can boot them and they work, so that if I run into trouble with one installation, I can switch to another for a period.

My current favourite is Kubuntu 22.04. With Firefox browser (Opera as backup).

I use zfs to store separate boot/root datasets for each operating system. And I have a boot-menu grub2 configuration installed under UEFI to allow me to choose at boot time. This has its own zfs dataset.

I also have a persistent subset of my home directories that move with me between operating systems. I cannot just carry my entire home directory, because some applications (e.g. firefox) are temperamental about having an old (or newer than expected) configuration file.

I have a lower-powered mini-PC in my recording shed (fanless). I did run this with both Manjaro and Fedora root on zfs. But I switched to LVM to manage the multiple OSs, and it runs much better. I think 8GB is just too little to run zfs in. I’ve given up the snapshoting capabilities of zfs (which I don’t really need on this machine) for speed.

Moving from Windows to Linux

I have been a long-term Windows user. I started developing an application under Windows 1 (a disaster due to memory leaks in the OS), I developed device drivers under Windows 3.1. I have 4 computers at home quite happily running Windows 10, and with very little frustration.

Why then should I want to migrate to Linux? There are multiple reasons. Perhaps the most important is that Microsoft has branded most of my hardware unfit to run its next revision of Windows; at some point Windows 10 will cease to be supported and vendors of the various Apps I depend on will stop supporting it. The second reason (and perhaps an equal first) is that I like to tinker, and Linux has plenty of scope for tinkering. Thirdly, I am familiar with running Linux, which I use on my home server. Finally, I wanted to use ZFS to provide management and protection of my filesystem.

But can I drop Windows entirely? Unfortunately not. A couple of the apps I depend on run only in Windows. These are VideoPsalm (which is used to present in Church) and Microsoft Access (which I use to maintain multiple databases, e.g., for a publication I edit).

I would really like to have moved my existing Windows 10 installation into the virtual domain. I never did find a way to do this. I tried virtualizing the nvme PCI hardware in qemu-kvm. I moved my installation to an SSD (where it still booted) and then virtualized this as a raw block device. The boot up in the virtual machine (VM) gave an error on a fetchingly attractive light blue background (gone are the bad old days of dark blue). Booting the Windows 10 installation image and attempted to repair the installation failed. I could boot it in safe mode, but not in normal mode. Eventually I gave up and bought a one-time install license for Windows 10, and for Microsoft Office, which I was able to install in a fresh VM. There I was able to continue to use VideoPsalm and Microsoft Access (Office).

Here is a list of observations from this work:

  • I selected Zorin 16 Core (based on Ubuntu 20.04) as my OS. It has good reviews. The Core version is free, but the existence of paid versions means the features in Core get professional attention. It’s based on Ubuntu 20.04 LTS, which is also well maintaned.
  • I initially installed Zorin 16 to a SSD using ext4 as the filesystem, then I moved the installation to my nvme using the zfs filesystem (separate boot and root pools). I kind-of followed some of the ideas in the Ubuntu ZSYS project, having separate datasets for the root, and each user. However, I did not use the OS option to install on zfs because I have found zsys to be buggy, and in the past it failed me when I needed it most.
  • Ubuntu 20.04 domain name resolution is a right pain too if you have a local DNS server. I disabled Network Manager and used netplan (rendered by networkd) to define a bridge (needed by the virtual machines) with static IP and DNS settings. If I relied on the values supplied by my DHCP server, the OS would occasionally reset itself so that it resolved only external names, and sometimes didn’t work at all. I never did find out why.
  • Linux apps, in general, do not talk to data on the network directly. I mounted my network resources using nfs in the fstab.
  • Printing proved to be a right royal pain in the posterior. I spent a day messing about with different printer drivers and trying to coerce otherwise functional programs to behave rationally. I ended up insalling Gutenprint drivers for my Canon mx925 printer, and installing a printer using the same connection for each combination of job options (page size, borderless printing, paper type) that I wanted, because applications generally don’t seem to want to remember prior combinations of print options.
  • Sound. Managing sound devices is also a pain in the derriere. Ubuntu 20.04 / Zorin uses Pulse Audio. Some applications work seamlessly (e.g. Zoom). Others support ALSA device selection, or the Pulse Audio default device, such as Audacity and Skype. Evenually I learned to disable a device in Pulse Audio in order to allow Audacity to access it via ALSA, reserving headphones for editing, and my main speakers for everything else. But, Pulse has the annoying habits of randomly changing the “fallback”=default output device, for no reason that I could find. I ended up keeping the Pulse volume control open to manage which device audio should come out of. I also had to edit the pulse config files to specify initial default source and sink, because it appeared to have no memory from boot to boot of device selection.
  • I had to be sensitive to the source of a program: OS-supplied, supplied by a PPA (i.e., an unofficial release for the OS), Snap, Flatpack or AppImage. Snap and Flatpack applications are isolated from much of the machine – for example, limiting the user’s home directory subdirectories that are visible, and making the printer unavailable. Also start-up time for Snap, Flatpack and AppImage applications may be slow. Opera installed from Snap had a 10-second start-up time. This is not acceptable, IMHO.

And comments on specific applications:

  • Thunderbird is my go-to email, calendar and contacts app. Works seamlessly in Linux.
  • Audacity is used for my audio narration. I encountered a number of issues. The latest AppImage (3.1.2) has one feature I use (Noise Reduction) slowed down a factor of seval, and when installed as a Flatpack, it slows down a factor of several more. I had to install an old version (3.0.2) from a PPA, which restored the speed.
  • My image library is managed using Faststone on Windows. I evaluated a number of alternatives on Linux. I wanted to use Shotwell, but found it too unstable – crashing for some drag-and-drop operations. I settled on digiKam, which is way more powerful than I need. However, printing images in digiKam has issues. Using the print tool from the LightTable results in weird cropping aspect ratios, i.e., the first print is OK, but subsequent prints are stretched or squashed. I resorted to printing one-at-a-time using GIMP.
  • Google drive. I was unwilling to pay for a linux replacement. I evaluated some of the free alternatives with no success. So I reduced by dependence on it, and used the gnome ability to see it as a directory under nautilus to drag and drop files into it, rather than accessing files in the Google drive direct from applications.
  • Desktop search. Part of my self-employed work requires me to research in a large corpus of Microsoft Office files. I uses Lookeen (paid) on Windows. After some evaluation, I settled on Recoll under Linux. I did have to work around a system crash that occurred when Recoll indexed my 40GB corpus directly mounted on nfs. I synchronised the files to a local directory (using ownCloud) and Recoll indexed that without issue.
  • Voip client. I was using MicroSIP on Windows. I evaluated the usual suspects on Linux (Linphone, Twinkle). Eventually I was forced to drop the open source choices due to limitations or bugs and go with the free version of Zoiper, which perfectly meets my needs.
  • Brower. My Windows preference is Opera, although there are some websites that don’t like it. Under Linux the limitations of Opera are more evident. I moved to Firefox, which also has support for hardware video acceleration, an added plus.
  • What’s App: there is no native App, but web.whatsapp.com works well enough in a browser.
  • Applications which work seamlessly in both environments:
    • RSS reader: QuiteRSS
    • Video editor: Shotcut
    • Teamviewer
    • Ultimaker Cura
    • FreeCad
    • Zoom
    • Calibre
    • Heidi SQL
    • Inkscape

Towards a cheap and reliable PIR (infrared) motion detector

I thought it would be fun to “play” with the internet of things (IoT) and looked for a suitable project. I assembled a collection of cheap IoT devices into a box, mounted it on my garage wall, and configured software to make it turn on an exterior light when motion of detected.

This is the story of how I did that.

Caveat – this was all done in the sense of a hobby project. It’s not necessarily the best way of achieving the same goal. I’ll share the code at the bottom.

The hardware

I assembled a number of devices together, only two are relevant here, a cheap PIR detector (HW-416B) and a microprocessor ESP8266 NodeMCU. They both can be bought for about £4. I printed a box, wired them together and mounted them high up on the wall of the garage. I have a 20W LED spotlight mounted on the wall and controlled by a Sonoff basic wi-fi relay (costs a few pounds). Finally there is an indoor light (known as the “cat light”, because everybody should have one) controlled by another Sonoff switch, which is used to monitor motion detections.

The PIR sensor provides a digital output, and the NodeMCU just provides access to that digital output. The PIR has controls for sensitivity and hold time, both of which are turned to the minimum value.

Although not essential to the question of detection, the detector box also has a light sensor and a camera.

The software

I had previously experimented with NodeRed, an MQTT server, and Tasmota running on the Sonoff switches.

This time I abandoned NodeRed and switched to Home Assistant (HA), ESPHome and App Daemon. These are all installed in separate Docker containers on my home server (running Docker under Ubuntu). About the only non-standard part of the installation was to put the HA container on both a macvtap network (so it can be discovered by Alexa) and also a network shared with the other two containers.

I built an ESPHome image for the detector and installed it on the NodeMCU using a USB connection. Subsequent changes were done over the air (OTA) using WiFi. Home Assistant discovered the new device using its ESPHome integration.

I wrote an AppDaemon script that did the following:

  • Triggered on changes of state of the motion detector
  • Flashed the internal light for 2s on detected motion
  • Turned on the external light for 30s on detected motion

The light sensor was used to turn on the external light only if the light level was below a certain threshold. The camera was triggered on detected motion.

The thing I noticed (it was hard to miss) is the number of false positive detections of the PIR sensor, even if the sensitivity was turned to its minimum level. I can’t explain why. Sometimes it was stable for hours at a time, and other periods it triggered every 10s or so. I have no idea if this behaviour is electronic or environmental.

I built a tube to “focus” the detector on a patch of gravel on our drive, but that appeared to have little effect on the rate of false triggers.

Clearly this configuration is useless as an actual detector.

So I added another identical detector. I was hoping that false detections would be independent (uncorrelated) but true detections would be correlated. By “correlated” I mean that trigger events happened on both detectors within a certain period of time.

The two-detector configuration fixed the problem of false detections. If I walk up and down the drive, I get a detection. Although both detectors still spontaneously generate false detections, they generally don’t do so that they are close enough together in time to trigger the light.

Future ideas

Perhaps I might build in a microwave radar based proximity detector. I suspect this will be more reliable than PIR. It’s another thing to play with.

The Code

This code comes with no warrantee. It might not work for you. It might cause your house to explode and your cat to have apoplexy. If it does, I’m not to blame.

ESPHome code for motion detector

esphome:
  name: garage_2
  platform: ESP8266
  board: nodemcuv2

wifi:
  ssid: !secret ssid
  password: !secret password
  domain: !secret domain

captive_portal:
logger:
api:
ota:

binary_sensor:
  - platform: gpio
    pin: D1
    device_class: motion
    name: Motion Sensor 2

sensor:
  - platform: uptime
    name: Uptime Sensor
    update_interval: 10s

AppDaemon code

import hassapi as hass
import datetime

class MotionDetector(hass.Hass):

  def initialize(self):

    # Configuration variables
    self.trigInterval = 10    # Interval between m1/m2 triggers to be considered coincident
    self.luxMinPhoto = 10     # minimum light level for a photo
    self.luxMaxLight = 25     # maximum light level to turn on outside light
    self.durationCatFlash = 2 # seconds duration of cat light flash
    self.durationLight = 30   # seconds to turn on outside/garage light
    self.delayPhoto = 1       # seconds from turning on light to taking photo

    # State variables
    self.catTriggered = 0     # Cat light triggered
    self.m1Triggered = 0      # m1 triggered at most trigInterval previous
    self.m2Triggered = 0      # m2 triggered at most trigInterval previous

    # Listen for events
    self.listen_state(self.m1, "binary_sensor.motion_sensor", new='on')
    self.listen_state(self.m2, "binary_sensor.motion_sensor_2", new='on')

  # m1 has been triggered
  def m1(self, entity, attribute, old, new, kwargs):
    self.log(f"m1 {entity} changed from {old} to {new}")

    self.m1Triggered += 1
    self.run_in(self.m1Done, self.trigInterval)       

    # If m2 has been triggered within the last trigInterval
    if self.m2Triggered:
      self.triggered(entity, attribute, old, new, kwargs)

  # m1 trigger interval complete
  def m1Done(self, kwargs):
    self.log(f"m1 Done")
    self.m1Triggered -= 1

  def m2(self, entity, attribute, old, new, kwargs):
    self.log(f"m2 {entity} changed from {old} to {new}")

    self.m2Triggered += 1
    self.run_in(self.m2Done, self.trigInterval)       

    # If m1 has been triggered within the last trigInterval
    if self.m1Triggered:
      self.triggered(entity, attribute, old, new, kwargs)

  def m2Done(self, kwargs):
    self.log(f"m2 Done")
    self.m2Triggered -= 1

  def triggered(self, entity, attribute, old, new, kwargs):
    self.log(f"Triggered {entity} changed from {old} to {new}")
    light_state = self.get_state('switch.garage_light_relay')
    time_now = datetime.datetime.now().time()
    light_level = float(self.get_state('sensor.garage_light_level'))
    
    self.log(f'light level is {light_level}')

    too_early = time_now < datetime.datetime.strptime("06:30", "%H:%M").time()
    too_late = time_now > datetime.datetime.strptime("22:00", "%H:%M").time()
    too_bright = light_level > self.luxMaxLight
    already_on = light_state == 'on'

    self.log(f'time now: {time_now} too_early: {too_early} too_late: {too_late} too_bright: {too_bright} already_on: {already_on}') 

    light_triggered = not too_bright and not too_early and not too_late and not already_on
    if light_triggered:
      # Low light level during waking hours,  trigger garage light
      # don't trigger if already on to avoid turning off a manual turn-on
      self.triggerLight()

    if (light_level > self.luxMinPhoto):
      # enough light for a photo
      self.makePhoto(kwargs)
    else:
      if light_triggered:
        # Can do a photo, but have to wait a bit for it to turn on
        self.log('delayed photo')
        self.run_in(self.makePhoto, self.delayPhoto)   

    # Flash the cat light always
    self.triggerCat()


  # Flash the cat light for 2 s
  def triggerCat(self):
    if  not self.catTriggered:
        self.toggle('switch.cat_light')

    self.catTriggered += 1
    self.run_in(self.catDone, self.durationCatFlash)      


  def catDone(self, kwargs):
    self.log(f"cat Done")

    self.catTriggered -= 1
    if not self.catTriggered:
      self.toggle('switch.cat_light')


  # Turn on garage light for 30s
  def triggerLight(self):
    self.log(f"Trigger Light")    
    self.turn_on('switch.garage_light_relay')
    self.run_in(self.lightDone, self.durationLight)        


  def lightDone(self, kwargs):
    self.log(f"Light Done")
    self.turn_off('switch.garage_light_relay')


  def makePhoto(self, kwargs):
    date_string = datetime.datetime.now().strftime("%Y%m%d-%H%M%S")      
    file_name = f'/config/camera/{date_string}.jpg'
    self.log(f'Snapshot file_name: {file_name}')
    self.call_service('camera/snapshot', entity_id='camera.garage_camera', filename=file_name)

Dockerisation

How to install and use Portainer for easy Docker container management -  TechRepublic

I moved my services from a virtual-machine environment to docker. Here’s how and why.

Let’s start with the “why”. The answer is that it crept up on me. I thought I’d experiment with docker containers, to learn what they did. The proper way to experiment is to do something non-trivial. So I chose one of my virtual machines and decided to dockerise it.

I achieved my objective of learning docker, because I was forced to read and re-read the documentation and become familiar with docker commands and the docker-compose file. Having started I just, er, kept going, until 2 months later, my entire infrastructure was dockerised.

I ended up with the following docker-compose stacks:

apacheserves a few static pages, currently only used by my Let’s Encrypt configuration
wordpressThree WordPress websites for personal and business use. You are looking at one of them.
nextcloudNextcloud installation, using SMB to access my user files
postgresqldatabase for my Gnucash financial records
embyDLNA and media server. Replaces Plex. Used to share music and photos with the TV in the lounge.
freepbxA freepbx installation.
This container appears on my dhcp-net (see below), has its own IP addresses. This is, in part, because it reserves a large number of high-numbered ports for RTP, and I didn’t want to map them.
ftpftp server used by my Let’s Encrypt processes
iotNode-red installation, used for my very limited home automation. Rather than starting with an existing Node-red image, I rolled this one from a basic OS container, basing the Dockerfile on instructions for installing Node-red on Ubuntu. This is another container on my dhcp-net, because it has to respond to discovery protocols from Alexa, including on port 80.
mailIredmail installation. It is highly questionable whether this should have been done, because I ended up with a single container running a lot of processes: dovecot, postfix, amavis, apache. I should really split these out into separate containers, but that would take a lot of work to discover the dependencies between these various processes. Anyhow, it works.
nfsnfs exporter
piwigoGallery at piwigo.chezstephens.org.uk
portainerManage / control / debug containers
proxyNginx proxy directing various SNI (hostname based) http queries to the appropriate container
sambaSamba server
svnSVN server
tgtTarget (ISCSI) server
zabbixMonitor server. Does a good job of checking docker container status and emailing me if something breaks.

One thing missing from docker is the ability to express dependencies between services. For example, my nextcloud container depends on the samba server becuase it uses SMB external directories. I wrote a Makefile to allow me to start and stop (as well as up/down/build) all the services in a logical order.

My docker installation (/var/lib/docker) has its own zfs dataset. This causes docker to use zfs as the copy-on-write filesystem in which containers run, with a probable performance benefit. It also has the side effect of polluting my zfs dataset listing with hundreds (about 800) of meaningless datasets.

One of the needs of many of my servers is to persist data. For example the mail container has thousands of emails and a MySQL database. I needed to persist that data across container re-builds, which assume that you are rebuilding the container from scratch and want to initialse everything.

Each docker-compose stack had its own zfs dataset (to allow independent rollback), and each such stack was only depdenent on data within that dataset. The trick is to build the container, run it (to perform initialization), then docker copy data you want to keep (such as specific directories in /etc and /var) to the dataset, then modify docker-compose.yaml to mount that copy in the appropriate original location. The only fly in the ointment is that docker cp doesn’t properly preserve file ownership, so you may need to manually copy file ownership from the initial installation using chown commands.

Several of the stacks run using the devplayer0/net-dhcp plugin, which allows them to appear as indepdendent IP addresses. A macvtap network would have achieved the same effect, except I would have to have hard-coded the IP addressed into the docker-compose files. The net-dhcp plugin allows an existing dhcp server to provide the IP addresses, which fits better into my existing infrastructure.

At the end of all this, was it worth it? Well, I certainly enjoyed the learning experience, and proving that I was up the challenge. I also ended up with a system that is arguably easier to manage. Next time I update/reinstall my host OS, I think I will find it easier to bring docker up than to bring up the virtual machines, which requires the various virtual machine domains to be exported and imported using various virsh commands.

ZFS tiered storage

This post documents changes I made to my zfs server setup to resolve the issue of slow hard disk access to my performance-sensitive datasets.

The problem

When you access random data on hard disks,  the disks have to seek to find the data.   If you are lucky,  the data will  already be in a cache.  If you are unlucky the disk will have to seek to find it.   The average seek time on my WD Red disks is 30ms.

So although the disks are capable of perhaps hundress of MB/s,  given an optimal read request, given a typical read requests of a virtual hard disk from one of my virtual machine clients,  performance is very much lower.

ZFS already provides

ZFS provides performance optimations to help to alleviate this.   A ZIL (ZFS intention log) is written by ZFS before writing data proper.  This redundant writing provides integrity against loss of power part way through a write cycle,  but it also increases the load on the hard disk.

The ZIL can be moved to a separate disk, when it is called an SLOG (separate log).   Putting it on a faster (e.g. SSD) disk can improve performance of the system as a whole by making writes faster.  The SLOG doesn’t need to be very big – just the amount of data that would be written in a few seconds.   With a quiet server,  I see that the used space on my SLOG is 20 MB.

Secondly there is a read cache.   ZFS attempts to predict reads based on frequency of access,  and caches data in something called an ARC. You can also provide a cache on an SSD (or NVME) device,  which is called a level 2 ARC (L2ARC).   Adding an L2ARC effectively extends the size of the cache.  On my server,  when it’s not doing anything disk-intensive,  I see a used L2ARC of about 50 GB.

The SSD/NVME

A benefit an SSD is that it doesn’t have a physical seek time.  So the performance of random reads is much better than a rotating disk.   The transfer rate is limited by its electronics,  not the rotational speed of the physical disk platter.

NVMEs have an advantage over SATA that they can use multi-lane PCI interfaces and increase the transfer rate substantially over the 6 Gbps limit of today’s SATA.

Local backup

I wanted to improve my ZFS performance over and above the limitations of my Western Digital Red hard disks.   Replacing the 16TB mirrored pool (consisting of 4 x 8 TB disks,  plus a spare) would take 17 x 2 TB disks.  A 2TB Samsung Evo Pro disk in early 2020 costs £350,  and is intended for server applications (5 years warrantee or 2,400 TB written). At this cost,  replacing the entire pool would be almost £6,000 – which is way too expensive for me.  Perhaps I’ll do this in years to come when the cost has come down.

My current approach is to create a fast pool based on a single 2TB SSD,  and host only those datasets that need the speed on this pool.   The problem this approach then creates is that the 2TB SSD pool has no redundancy.

I already had a backup server in a different physical location.  The main server wakes the backup server once a day and pushes an incremental backup of all zfs datasets.

However,  I wanted a local copy on the slower pool that could be synchronized with the fast pool fairly frequently,  and more importantly,  which I could revert to quickly (e.g. by switching a virtual machine hard disk)  if the fast pool datatset was hosed.

So I decided to move the speed-critical datasets to the fast pool,  and perform an hourly incremental backup to a copy in the slow pool.

zrep mechanism

I already used zrep to back up most of my datasets to my backup server.

I added zrep backups from the fast pool datasets to their slow pool backups.   As all these datasets already had a backup on the backup server,  I set the  ZREPTAG environment variable to a new value “zrep-local” for this purpose so that zrep could treat the two backup destinations as distinct.

“I added” above hides some subtlely.  Zrep is not designed for local backup like this,  even though it treats a “localhost” destination as something special.   But the zrep init command with a localhost destination creates a broken configuration such that zrep subsequently consider both the original and the backup to be masters.  It is necessary to go one level under the hood of zrep to set the correct configuration thus:

export ZREPTAG=zrep-local
zrep changeconfig -f $fastPool/$1 localhost $slowPool/$1
zrep changeconfig -f -d $slowPool/$1 localhost $fastPool/$1
zrep sentsync $fastPool/$1@zrep-local_000001
zrep sync $fastPool/$1

A zrep backup can fail for various reasons,  so it is worth keeping an eye on it and making sure that failures are reported to you via email.   One reason it can fail is because some process has modified the backup destination.   If the dataset is not mounted,  such modification should not occur,  but my experience was that zrep found cause to complain anyway.   So I made as part of my local backup a rollback to the latest zrep snapshot before issuing a zrep sync.

Interaction with zfs-auto-snapshot

If you are running zfs-auto-snapshot on your system (and if not,  why not?),  this tool has two implications for local backup.   Firstly,  it attempts to modify your backup pool,  which upsets zrep.  Secondly,  if you address the first problem, you end up with lots of zfs-auto-snapshot snapshots on the backup pool as there is then no reason why these should expire.

You solve the first problem by setting the zfs attribute com.sun:zfs-auto-snapshot=false on all such datasets.

You solve the second problem by creating an equivalent of the zfs-auto-snapshot expire behaviour and running it on the slow pool after performing a backup.

The following code performs this operation:

fastPool=f1
slowPool=p5
zfsLabel=zfs-auto-snap

process_category()
{
# process snapshots for stated zrep-auto-snap category keeping stated number
ds=$1
category=$2
keep=$3
zfsCategoryLabel=${zfsLabel}_$category

snapsToDelete=`zfs list -rt snapshot -H -o name $slowPool/$ds | grep $zfsCategoryLabel | head –lines=-$keep`

for snap in $snapsToDelete
do
zfs destroy $snap
done

}

process()
{
ds=$1
# echo processing $ds
process_category $ds “frequent” 4
process_category $ds “hourly” 24
process_category $ds “daily” 7
process_category $ds “weekly” 4
process_category $ds “monthly” 12
}

# get list of datasets in fast pool
dss=`zfs get -r -s local -o name -H zrep-local:master $fastPool`

for ds in $dss
do
# remove pool name
ds=${ds#$fastPool/}
process $ds
done

 

 

 

ZFS rooted system disaster recovery

I recently had occasion to test my disaster recovery routine for my server, which is Ubuntu 18.04 LTS rooted on zfs.

The cause was that I did a command-line upgrade to 19.10. The resulting system did not boot. I am not exactly sure why. I’d hoped to enjoy a possible disc perforance improvement in 19.10.

Anyhow, after messing around for a while, I decided to revert to before the upgrade. I have a bootable flash drive with Ubuntu 18.04, not running from zfs, but with zfs tools installed.

I booted from the flash drive. Then I zfs rolled-back the /boot and root datasets to before the upgrade. Then I mounted the root dataset to /mnt/root, the boot to /mnt/root/boot, mount –rbind /dev /mnt/dev, and the same for sys, proc and run. Then chroot /mnt/root. Then update-grub (probably unnecessary) and grub-install /dev/sd{abcde}.

That’s it, although it did take 2 hours with inumerable boots and struggling to understand the Supermicro BIOS settings. I think the next time I have to do this, it will take about 30 minutes. Without knowing I could do this reasonably easily, I would never have tried the OS upgrade. The upgrade almost worked, but I’ll wait for the next LTS before trying again.

Christmas letter 2019

64 Lamb’s Lane, Cottenham, Cambridge, CB24 8TA
Tel: +44 1954 204610
Email: adrian.stephens@ntlworld.com

Christmas 2019 Letter

Dear Friends,

It is that time of year.  Our revolting cousins over the water have just consumed their Turkeys and are thankful.  Black Friday is upon us. The shops are full of Christmas presents, wrapping, and suggestions. My Good Woman ™ has the present shopping well in hand, courtesy of Amazon.

So we perform a retrospective. This year was the year of the Unexpected Holiday ™ (UH).

Our daughter, Ruth, left the shores of the UK for Virginia, USA – leaving us all Ruthless. Understandable, given that her Good Man ™ had been posted there to blue-tack things that go bang to the bottom of things that go whoosh.  So we arranged to visit her in April to see how they were settling down, and take John his fix of Lion Bars.

After that we had a lovely holiday in Portugal with Sarah, Derek and Isabella, which was a Christmas present and Tina’s 60th birthday present from them to us.

Then I unexpectedly received an award from the IEEE Computer Society, and travelled with Tina to Miami to pick it up.  So we had a week of holiday there, sampling the local delights; as well as seeing a family of racoons in the wild of the everglades and alligators as roadkill and lunch, in Tina’s case! We also had the opportunity to meet up with a couple of Adrian’s ex-colleagues who live in Florida.

Then we heard that John had been posted overseas, and Ruth was left on her own.  So we arranged, at fairly short notice, a trip to Hawaii to share with her in September; and Tina arranged to travel to Virginia in November to keep Ruth company.  Not much after making these arrangements, John was shipped back to the USA with a ruptured Achilles tendon.  So Ruth joined us on holiday, leaving John on his own at their place, hobbling round on crutches.

Fortunately Ruth took to snorkelling, which was one of our favourite occupations.  The other being dining out.  Adrian’s favourite is the Kona brewing company, who do great beers and pizzas.

Tina still made the trip out there in November and had a lovely couple of weeks, even hiring a car for the first time in the USA.

Ruth and John have settled down nicely in Virginia, making firm friends there, and a drivable distance from John’s family. They are coming home for a short visit in January 2020.

Sarah is now working evaluating people’s disabilities for UK benefits.  Derek is a full-time driving instructor with Red, and is making a decent business out of it.  Their children are all growing up far too quickly.  Luke is 6’ 5” at the age of 16.  Holly is teenager at the age of 12.  Mia (10) is almost as tall as Holly.  Isabella (2) is talking more, with less need to scream loudly – thank goodness.

David and Eleanor moved from Bar Hill to Northstowe – a new village being constructed about 5 miles from us. David is now a fire brigade crew chief in London. He is also really into his photography, and did some paid portraiture recently.  Eleanor is into sustainability etc. Florence (2) is a chatterbox.  Joe (born January this year) is a lovable brick. We spent a happy week, including my Mum and Eleanor’s parents and brother at Centre-parc Elvedon.  The rumours about Adrian’s crazy-golf abilities are just that.

We’re looking forward to spending time with them over the next month or so.  We’ll be having the usual family silliness – charades and murder in the dark, provided the house passes a safety assessment.

Adrian continues to be busy, even though largely retired.  He has a number of clients willing to pay for his expertise (adrianstephensconsulting.uk).  He also discovered librivox.org this year, and has participated in a number of dramatic and poetry readings. The shed is now the recording booth, draped with curtains.

Tina continues much as usual, growing older disgracefully and enjoying reading and hanging out with the family and the cat. She is still Treasurer at GBC, but is hoping there will be someone to take over when this term comes to an end in March 2021. We continue to worship at Girton Baptist Church where Adrian plays keyboard.  Adrian has also helped out with the occasional “open the book” assembly at the Junior school.

Wishing you a very merry Christmas and happy new year.   With hugs, kisses, manly handshake, nose rubs etc. as (in-)appropriate to our relationship.

Tina and Adrian Stephens

How to make a home zfs backup server

I have a home file and compute server that provides several TB of storage and supports a bunch of virtual machines (for setup see here: https://chezstephens.org.uk/server-upgrade/). The server uses a zfs 3-copy mirror and automated snapshots to provide what should be reliable storage.

However, reliable as the system might be, is is vulnerable to admin user error, theft, fire or a plane landing on the garage where it is kept. I wanted to create a backup in a different physical location that could step in in the case of disaster.

I serve a couple of community websites, so I’d like recovery from disaster to be assured and relatively painless.

I recently was given a surplus 4-core 2.2 GHz 3 GB HP desktop by a friend. I replaced her hard disk with a 100GB SSD for the system and a 4TB hard disk (Seagate Barracuda) for backup storage (which I already had). I upgraded the memory by replacing 2x512MB with 2x2GB for a few pounds. That’s the hardware complete.

Installing the software

The OS installed was Ubuntu 18.04 LTS. The choice was made because this has long term support (hence the LTS), supports zfs, and is the same OS as my main server. My other choice would have been Debian.

OS install

I wanted to run the OS from a small zfs pool on the SSD. To get the OS installed booted from zfs is relatively straightforward, but not directly supported by the install process. To set this up: install the OS on a small partition (say 10GB), leaving most of the SSD unused. Then install the zfs utilities, create a partition and pool (“os”) on the rest of the SSD, copy the OS to a dataset on “os”, create the grub configuration and install the boot loader. The details for a comprehensive root install are here: https://github.com/zfsonlinux/zfs/wiki/Ubuntu-18.04-Root-on-ZFS. I used a subset of these instructions to performs the steps shown above.

VNC install

I set up VNC to remotely manage this system. You can follow the instructions here: https://www.digitalocean.com/community/tutorials/how-to-install-and-configure-vnc-on-ubuntu-18-04. I never did get vncserver running from the user’s command-line to work. But it works fine as that user started as a system service, with the exception that I don’t have a .Xresources file in my home directory. This lack doesn’t appear to have any effect, apart from some icons are missing from the start menu. As this doesn’t bother me, I didn’t spend any time trying to fix it.

So the system boots up from zfs, and after a pause I can connect to it from VNC.

QEMU install

I followed the instructions here: https://www.linuxtechi.com/install-configure-kvm-ubuntu-18-04-server/, except I didn’t follow the instructions for creation of the bridge.

For the networking side, I followed the instructions here: https://linuxconfig.org/install-and-set-up-kvm-on-ubuntu-18-04-bionic-beaver-linux.

I installed virt-manager and libvirt-daemon-driver-storage-zfs. I could then create virtual machines with zvol device storage for virtual disks.

Backup Process

The backup server was configured to pull a backup from the production server once a day, and to sleep when not doing this. I suppose the reason to sleep is to save electricity. A Watt of power amounts to about £1.40 a year. So this saves me about £30 pounds a year.

The backup process runs from a crontab entry at 2:00am. The process ends by setting a suspend until 1:50 the next morning using this code:

!/bin/sh
target=date +%s -d 'tomorrow 01:50'
echo “sleeping”
/usr/sbin/rtcwake -m mem -t $target
echo “woken”

A pool (“n2”) on the backup server (“nas3”) was created in the 4TB disk to hold the backups. Each dataset on the production server (“shed9” **) was copied to the backup server using zfs send and recv commands. Because I had a couple of TB of data, this was done by temporarily attaching the 4TB disk to the production server.

The zrep program (http://www.bolthole.com/solaris/zrep/) provides a means of managing incremental zfs backups.

The initial setup of the dataset was equivalent to:

from=shed9
to=nas3
frompool=p5
topool=n2
ssh=”ssh root@shed9″

# Delete any old zrep snapshots on source

$ssh "zfs list -H -t snapshot -o name -r $frompool/$1 | grep zrep | xargs -n 1 zfs destroy"

# Use the following 3 lines if actually copying
zfs destroy -r $topool/$1

$ssh "zfs snapshot $frompool/$1@zrep_000001"

$ssh "zfs send $frompool/$1@zrep_000001" | zfs recv $topool/$1

# Set up the zrep book-keeping

$ssh "zrep changeconfig -f $frompool/$1 $to $topool/$1" zrep changeconfig -f -d $topool/$1 $from $frompool/$1

$ssh "zrep sentsync $frompool/$1@zrep_000001"

And the periodic backup script looks like this:

# Find last local zrep snapshot

lastZrep=`zfs list -H -t snapshot -o name -r $pool/$1 | grep zrep | tail -1`

# Undo any local changes since that last snapshot

zfs rollback -r $lastZrep

# Do the actual incremental backup

zrep refresh $pool/$1

The normal (i.e., what it was clearly designed to support) use of zrep involves a push from production to the backup server. My use of it requires a “pull” because the production server doesn’t know when the backup server is awake. This reversal of roles complicates the scripts resulting in those above, but it is all documents on the zrep website. Another side effect is that the resulting datasets remain read-write on the backup server. Any changes made locally are discarded by the rollback command each time the backup is performed.

Creating the backup virtual machines

Using virt-manager, I created virtual machines to match the essential virtual machines from my production server. The number of CPUs and memory space were squeezed to fit in the reduced resources.

Each virtual machine is created with the same MAC address as on the production server. This means that both copies cannot be run at the same time, but it also creates least disruption switching from one to the other as ARP caches do not need to be refreshed.

I also duplicated the NFS and Samba exports to match the production machine.

I tested the virtual machines one at a time by pausing the production VM and booting up the backup VM. Note that booting a backup copy of a device dataset is safe in the sense that any changes made during testing are rolled back before doing the incremental backup. It also means you can be cavalier with pulling the virtual plug on the running backup VM.

How would I do a failover?

I will assume the production server is dead, has kicked the bucket, has shuffled off this mortal coil, and is indeed an ex-server.

I would start by removing the backup & suspend crontab entry. I don’t think a rollback would work while a dataset is open by a virtual machine, but I don’t want to risk it.

I would bring up my pfsense virtual machine on the backup. Using the pfsense UI, I would update the “files” DNS server entry to point to the IP address of the backup. This is the only dependency that the other VMs have on the broken server. Then I would bring up the essential virtual machines. That should be it.

Given that the most likely cause of the failover is admin error (rm -rf in the wrong place), recovery from failover would be a hard-to-predict partial software rebuild of the production server. If a plane really did land on my garage, recovery from failover may take a little longer depending on how much of the server hardware is functional when I pull it from the wreckage. And on that happy note, it’s time to finish.

**
For what it’s worth, this is the 9th generation of file server. They started off running the shed before moving to the garage at about shed7. But the name stuck.

Christmas 2018 Letter

Version including pictures: Christmas letter 2018

64 Lambs Lane, Cottenham, Cambridge CB24 8TA, United Kingdom

Tel: 01954 204610, Email: adrian.stephens@ntlworld.com

Stephens Christmas Letter 2018

Well it’s that time of year when we review the address list, print the labels and think about what to say about the year.  The turkeys are starting to get nervous (or for our friends in the USA, the turkeys are past caring).

This year has been momentous and notable in many ways.  We are hoping that next year will prove less momentous.

In March, Adrian did not stand for re-election as IEEE 802.11 chair.  The then vice-chair Dorothy Stanley was elected.  Adrian attended the meeting in May in Warsaw to say goodbye, and has since bowed out of all things to do with the 802.11 working group.   Adrian is now mostly retired, but is doing a little consultancy for law firms related to 802.11. Adrian also took on the editor position for the local community association.

In April our daughter Ruth was married by Tina to John in Chicago, USA. This was a first in so many ways! Although it was cold outside (there was snow on the ground) the welcome was warm. We enjoyed the wonderful hospitality of Chip and Lito, who hosted the event.  John’s family are Filipino (I’d never typed that word before, isn’t spell-check wonderful?), which means some interesting cultural and culinary learnings for us.

In July we celebrated our 40th Wedding Anniversary.  We had a party and invited lots of people to celebrate our wedding anniversary, my retirement, Tina’s 60th birthday, and Ruth and John’s wedding (because it being in the USA folks could not attend easily from the UK).

In August, Tina’s Aunt Elsie died. She had been part of Tina’s life, and the relationship was closer to mother-daughter than aunt-niece. She will be greatly missed.

Aunt Elsie left her house to Tina.  Adrian’s Mum decided she wanted to move to be closer to family, and has just moved in to Elsie’s old house.  We’re still struggling with what to call it.  As you can imagine, we have been busy doing clearance at Elsie’s house and Mum’s house and are still working on unpacking and getting records updated.  With Mum just down the road we can support her much better.

We had three holidays-ish this year.  The wedding in March also gave us a couple of days of holiday.  We both attended the May Warsaw meeting and spent a couple of days touring the city.   We did have a proper holiday in Santorini.   Tina was adamant that we needed a rest from the various comings and goings, so we took off in November for a week.  It was really nice there.  Warm sunshine, photogenic buildings, sea, sunset boat cruise.  Hot and cold running cats.  Breakfast on the balcony overlooking the ocean. Walking up and down endless steps.

On the baby front David and Eleanor’s Florence continues to grow.  She is very good at understanding and making herself understood.  Sarah and Derek’s Isabella is likewise growing up, but not quite at the walking stage.  Her vocabulary is limited to “ootcha”.

David and Eleanor have signed contracts to move house to Northstowe, which is a new village being built about 6 miles from here.  This will give them more room as they are expecting number 2 baby soon.  Eleanor’s parents also moved to Northstowe from London earlier in the year.  And as it turns out they will be just round the corner from each other.

Ruth stayed in the UK after their marriage in order to get the paperwork to live in the USA.  It has taken her almost 9 months.  She leaves in December to join John living in Virginia.  They have a house there with lots of trees and visiting deer in the garden.

Tina has had problems with her knee, requiring a steroid injection.  Adrian’s health has been stable, for which thanks to God.

We wish you a blessed Christmas.

With hugs/kisses/etc… as (in-)appropriate to our relationship.

Adrian and Tina Stephens