Migrating a VirtualBox Windows installation

I have been using Linux as my primary OS since 1999ish, except for a brief period early in the history of 67 Bricks when I had an iMac. Whenever I have used Windows it has invariably been in some kind of virtualised form; this was necessary in the iMac days when I was developing .NET applications in Visual Studio, but these days I work solely on Scala / Play projects developed in IntelliJ in Linux. Nevertheless, I have found it convenient to have an installation of Windows available for the rare instances where it’s actually necessary (for example, to connect to someone’s VPN for which no Linux client is available).

My Windows version of choice is the venerable Windows 7. This is the last version of Windows which can be configured to look like “proper Windows” as I see it, by disabling the horrible Aero abomination. I tried running the Windows 10 installer once out of morbid curiosity; it started talking to me, so I stopped running it. I am old and set in my ways, and I feel strongly that an OS does not need to talk to me.

So anyway, I had a VirtualBox installation of Windows 7 and, because I am an extraordinarily kind and generous soul, I had given it a 250G SATA SSD all to itself using VirtualBox’s raw hard disk support.

Skip forward a few years, and I decided I would increase the storage in my laptop by replacing this SSD with a 2TB SSD since such storage is pretty cheap these days. The problem was – what to do with the Windows installation? I didn’t fancy the faff of reinstalling it and the various applications, and in fact I wasn’t even sure this would be possible given that it’s no longer supported. In any case, I didn’t want to give Windows the entire disk this time, I wanted a large partition available to Linux.

It turns out that VirtualBox’s raw disk support will let you expose specific partitions to the guest rather than the whole disk. The problem is that with a full raw disk the guest sees the boot sector (containing the partition table and probably the bootloader), whereas if you’re only exposing certain partitions you presumably don’t want the guest to see the real boot sector. How does this work?

The answer is that when you create a raw disk image and specify specific partitions to be available, in addition to the sda.vmdk file, you also get a sda-pt.vmdk file containing a copy of the boot sector, and this is presented to the guest OS instead of the real thing. Here, then, are the steps I took to clone my Windows installation onto partitions of the new SSD, keeping a partition free for Linux use, and ensuring Windows still boots. Be warned that messing about with this stuff and making a mistake can result in possibly irrecoverable data loss!

Step 1 – list the partitions on the current drive

My drive presented as /dev/sda:

$ fdisk -x /dev/sda
Disk /dev/sda: 232.89 GiB, 250059350016 bytes, 488397168 sectors
Disk model: Crucial_CT250MX2
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xc3df4459

Device     Boot  Start       End   Sectors Id Type            Start-C/H/S   End-C/H/S Attrs
/dev/sda1  *      2048    206847    204800  7 HPFS/NTFS/exFAT     0/32/33   12/223/19    80
/dev/sda2       206848 488394751 488187904  7 HPFS/NTFS/exFAT   12/223/20 1023/254/63 

Step 2 – create partitions on the new drive

I put the new SSD into a USB case and plugged it in, whereupon it showed up as /dev/sdb.

Then you need to do the following:

  1. Use fdisk /dev/sdb to edit the partition table on the new drive
  2. Create two partitions with the same number of sectors (204800 and 488187904) as the drive you are hoping to replace, and then you might as well turn all of the remaining space into a new partition
  3. Set the types of the partitions correctly – the first two should be type 7 (HPFS/NTFS/exFAT), if you’re creating another partition it should be type 83 (Linux)
  4. Toggle the bootable flag on the first partition
  5. Use the “expert” mode of fdisk to set the disk identifier to match that of the disk you are cloning; in my case this was 0xc3df4459 – Googling suggested that Windows may check for this and some software licenses may be tied to it.

So now we have /dev/sdb looking like this:

$ fdisk -x /dev/sdb
Disk /dev/sdb: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: 500SSD1         
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xc3df4459

Device     Boot     Start        End    Sectors Id Type            Start-C/H/S End-C/H/S Attrs
/dev/sdb1  *         2048     206847     204800  7 HPFS/NTFS/exFAT     0/32/33 12/223/19    80
/dev/sdb2          206848  488394751  488187904  7 HPFS/NTFS/exFAT   12/223/20 705/42/41 
/dev/sdb3       488394752 3907029167 3418634416 83 Linux             705/42/42 513/80/63

Note that the first two partitions are an exact match in terms of start, end and sector count. I think we can ignore the C/H/S (cylinder / head / sector) values.

Step 3 – clone the data

Taking great care to get this right, we can use the dd command, telling it to report its status from time to time so we know something is happening. There are two partitions to clone:

$ dd if=/dev/sda1 of=/dev/sdb1 bs=1G status=progress
... some output (this is quite quick as it's only 100M)
$ dd if=/dev/sda2 of=/dev/sdb2 bs=1G status=progress
... this took a couple of hours to copy the partition, the bottleneck probably being the USB interface

Step 4 – create (temporarily) a VirtualBox image representing the new disk

There are various orders in which one could do the remaining steps, but I did it like this, while I still had /dev/sda as the old disk and the new one plugged in via USB as /dev/sdb:

$ VBoxManage internalcommands createrawvmdk -filename sdb.vmdk -rawdisk /dev/sdb -partitions 1,2

This gives two files: sdb.vmdk and sdb-pt.vmdk. To be honest, I got to this point not knowing how the boot sector on the disk would show up, but having done it I was able to verify that sdb-pt.vmdk appeared to be a copy of the first sector of the real physical disk (at least the first 512 bytes of it). I did this by comparing the output of xxd sdb-pt.vmdk | head -n 32 with that of xxd /dev/sdb | head -n 32 – xxd shows a hex dump at 16 bytes per line, so the first 32 lines correspond to the first 512 bytes. In particular, you can see at offset 0x1be the start of the partition table, which is four 16-byte entries ending at 0x1fd, with the magic signature 55aa at 0x1fe. The disk identifier can also be seen as the 4 bytes starting at 0x1b8, and everything leading up to the disk identifier is all zeroes.

Step 5 – dump the boot sector of the existing disk

At this point, we have /dev/sda with the bootsector containing the Windows bootloader, we have a freshly initialised /dev/sdb containing a copy of the Windows partitions and a bootsector that’s empty apart from the partition table and disk identifier, and we have the sdb-pt.vmdk file containing a copy of that empty bootsector. We have also arranged for the new disk to have the same identifier as the old disk.

What is needed now is to create a new sdb-pt.vmdk file containing the bootloader code from /dev/sda and then the partition table from /dev/sdb. There are 444 bytes that we need from the former. So we can do something like this:

$ head -c 444 /dev/sda > sdb-pt-new.vmdk
$ tail -c +445 sdb-pt.vmdk >> sdb-pt-new.vmdk

We can confirm that the new file has the same size as the original, and use xxd again to check that it now starts with the bootloader code and that everything from that point on is as it was (including the new partition table).
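
If you prefer to check this in code rather than by eyeballing hex dumps, a quick sketch along these lines would do it (reading /dev/sda needs root, and the file names are the ones used above):

import java.io.{FileInputStream, InputStream}

object CheckBootSector {
  // Read the first n bytes of a file or block device
  def readBytes(path: String, n: Int): Array[Byte] = {
    val in: InputStream = new FileInputStream(path)
    try in.readNBytes(n) finally in.close()
  }

  def main(args: Array[String]): Unit = {
    val bootloader = readBytes("/dev/sda", 444)        // bootloader code from the old disk
    val newFile    = readBytes("sdb-pt-new.vmdk", 512) // the file we just assembled
    val origFile   = readBytes("sdb-pt.vmdk", 512)     // VirtualBox's copy of the new disk's boot sector

    // The first 444 bytes should now come from the old disk...
    assert(newFile.take(444).sameElements(bootloader), "bootloader bytes don't match /dev/sda")
    // ...and everything from offset 444 onwards (including the partition table) should be untouched
    assert(newFile.drop(444).sameElements(origFile.drop(444)), "partition table bytes have changed")
    println("Boot sector looks right")
  }
}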

Step 6 – the switch

We’re almost done. All that’s left to do is:

  1. Replace the old SATA SSD with the new 2TB SSD
  2. Run VirtualBox – do not start Windows but instead remove the existing hard drive attached to it (which is the “full raw disk” version of sda.vmdk), and use the media manager to delete this.
  3. Quit VirtualBox
  4. Recreate the raw disk with partitions file but for /dev/sda (which is what the new drive is): VBoxManage internalcommands createrawvmdk -filename sda.vmdk -rawdisk /dev/sda -partitions 1,2
  5. The previous command will have created the sda-pt.vmdk boot sector image from the drive, which will again be full of zeroes. Overwrite this with the sdb-pt-new.vmdk that we made earlier, obviously ensuring we preserve the original sda-pt.vmdk name and the file permissions.
  6. Start VirtualBox, add the newly created sda.vmdk image to the media manager, and then attach it as the hard disk for the Windows VM
  7. Start the Windows VM and hope for the best

I was very gratified to find that this worked more-or-less first time. In truth, I originally just tried replacing the old sda.vmdk with the new one, which gave the error “{5c6ebcb7-736f-4888-8c44-3bbc4e757ba7} of the medium ‘/var/daniel/virtualbox/Windows 7/sda.vmdk’ does not match the value {e6085c97-6a18-4789-b862-8cebcd8abbf7} stored in the media registry (‘/home/daniel/.config/VirtualBox/VirtualBox.xml’)”. It didn’t seem to want to let me remove this missing drive from the machine using the GUI, so I edited the Windows 7.vbox file to remove it, then removed it from the media library and added the replacement, which I was then able to add to the VM.

Programming a Tesla

I have a friend called Chris who is a big fan of the band China Drum. Many years ago he challenged me to program their song Last Chance as a custom ringtone on his Nokia phone and, being vaguely musical, I obliged.

Time has moved on since then. With his hitherto rock-star hair cut to a respectable length, he is now the CEO of a company providing disease model human cells. And he owns a Tesla, something he likes to remind me about from time to time. Now it turns out that one of the silly things you can do with a Tesla is program the lights to flash to make a custom light show for a piece of music of your choice. You can probably see where this is heading.

Since a) I was going up to meet him (and his Tesla) in Yorkshire last weekend and b) I wasn’t sure how long the “I’ve been busy” excuse would work in response to the “Where’s my light show?” question, I figured I’d probably better actually try to do the damn thing.

Problem 1 – I don’t have the song

Sadly I didn’t have a copy of China Drum’s Goosefair album, so I didn’t have the audio. But I do have Linux, and a Spotify subscription, and the command-line ncspot client. I reasoned that if I could play the audio, I could surely also record the audio, it just might mean losing the will to live while trying to understand how Pulseaudio works (or maybe it’s PipeWire now, or who knows?)

Cutting a long and tedious story short, by doing some mystical fiddling about, I was able to send the output of ncspot to Audacity and thereby record myself a WAV of the song. In the interests of remaining on the right side of the law, I made good faith attempts to locate an original copy of the album, but it seems to be out of print. So I now have a copy via eBay.

Problem 2 – I don’t have the application for building the light show

To build a light show, one needs an open-source application called xLights. Since this doesn’t appear to be in the Ubuntu package repository, I had to build it from source. For some reason, I’m a bit averse to installing random libraries and things on my machine, but fortunately there is a “build it in Docker” option, which I used and which seemed to work – except that I couldn’t figure out how to get at the final built application! It existed as a file in the filesystem of the Docker container, but since the executing script had finished, the container wasn’t running, and there seemed to be no obvious way to get at it (it is entirely possible that I was just being stupid, of course). In the end, I reasoned that any file on a container filesystem has to be somewhere under my /var/lib directory, and with a bit of poking around I located the xLights-2023.08-x86_64.AppImage file and copied it somewhere sensible.

Problem 3 – I don’t know where the beats are

I followed various instructions and got myself set up with a working application, a fresh musical sequence project for the file, and the .wav imported and displaying as a waveform.

The way xLights works is that you start with a bunch of horizontal lines representing available lights, and you set up a bunch of timing markers, which present as vertical lines dividing the work area up into a grid. You can then create light events in various cells of the resulting grid, with the start / stop transitions of each light aligned to the timing markers (though they can subsequently be moved). Thus you need a way of creating these markers. Fortunately, it is possible to download an audio plugin to figure these out for you. After doing this, you end up with a screen looking something like this:

An empty xLights grid.

Problem 4 – I can’t copy and paste

So I started filling things in with the idea that if I got something that I was happy with for a couple of bars, I could copy and paste it elsewhere rather than having to enter every note manually. Unfortunately, I ran into an unexpected problem, which is that the timing on the track is very variable. For example, I picked a couple of channels and used one to show where the “1” beats were in each bar and put lights on “2”, “3” and “4” in the other. But if you then try to copy and paste this bar into two bars, then four, then eight and so on, you quickly get out of sync with the beat lines, because the band speed up as the song goes on. Also, I couldn’t see any obvious option in xLights to quantise a track (i.e. to adjust the starts and ends of notes to match a set of timing marks).

Fortunately, the format that xLights uses to save these sequences is XML-based. Therefore I was able to write a Scala application to read in the sequence, make a note of all of the timestamps corresponding to the timing marks, and then shuffle all starts and ends of notes to the nearest timestamp. Actually, originally I wrote a thing to try to regularise the timestamp markers (i.e. keep the same number of them but make the spacing uniform) and quickly realised that the resulting markers were woefully out of sync with the music, which is when I realised the tempo of the song was variable.
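
I won’t reproduce the real application here, but the core of it was something like the following sketch. It uses scala-xml and assumes the effects carry startTime / endTime attributes in milliseconds, with timing marks stored as Effects under an Element with type="timing" – names guessed from the XPath later in this post rather than checked against a real xLights file.

import scala.xml._
import scala.xml.transform.{RewriteRule, RuleTransformer}

object Quantise {

  // Snap a time to the nearest timing mark
  def snap(t: Long, marks: Vector[Long]): Long = marks.minBy(m => math.abs(m - t))

  def main(args: Array[String]): Unit = {
    val doc = XML.loadFile("lightshow_sequence.xml") // hypothetical file name

    // Collect the timestamps of all timing marks
    val marks: Vector[Long] = (doc \\ "Element")
      .filter(e => (e \ "@type").text == "timing")
      .flatMap(e => (e \\ "Effect").map(eff => (eff \ "@startTime").text.toLong))
      .toVector.sorted

    // Rewrite every Effect so its start and end land on the nearest mark
    // (timing-mark Effects just snap to themselves, which is harmless)
    val rule = new RewriteRule {
      override def transform(n: Node): Seq[Node] = n match {
        case e: Elem if e.label == "Effect" =>
          val start = snap((e \ "@startTime").text.toLong, marks)
          val end = snap((e \ "@endTime").text.toLong, marks)
          e % Attribute("startTime", Text(start.toString),
            Attribute("endTime", Text(end.toString), Null))
        case other => other
      }
    }
    XML.save("lightshow_sequence_quantised.xml", new RuleTransformer(rule).transform(doc).head)
  }
}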

Problem 5 – I can’t count

I dimly remembered from the Tesla light-show instructions that there was a limit of 3500 lights in a show, so I had to make sure the total number didn’t go above that. I couldn’t see an obvious way to do this in xLights, but I could see that it was just a question of running a suitable XPath expression against the XML file. So I fired up oXygen and wrote one (the correct XPath – see below – is count(//Effect[not(parent::EffectLayer/parent::Element[@type='timing'])]))
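
Since I already had Scala tooling pointed at the file, the same count could also be done without oXygen – a rough equivalent of that XPath using scala-xml (with the same caveats as before about the exact element names) would be:

import scala.xml.XML

object CountEffects {
  def main(args: Array[String]): Unit = {
    val doc = XML.loadFile("lightshow_sequence.xml") // hypothetical file name

    // Count Effects, excluding those under an Element[@type='timing'] (the timing marks)
    val count = (doc \\ "Element")
      .filter(e => (e \ "@type").text != "timing")
      .flatMap(e => e \\ "Effect")
      .size

    println(s"Effects used: $count")
  }
}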

And so my development cycle was basically:

  • Stare at lots of blobs on the grid, adding and removing them and generally twiddling until it seems satisfactory
  • Run the Scala application to quantise it all to the timing marks, thus dealing with any slight mismatches caused by copying-and-pasting
  • Check in oXygen how many lights I’ve used (editing the raw XML was also useful to copy between channels e.g. to make the rear left turn indicators do the same as the front left turns)

Problem 6 – I can’t read instructions

Having got something I was reasonably happy with, I double-checked the instructions. Oh no! I am an idiot! It doesn’t say 3500 lights, it says 3500 commands where a command is “turn light on” or “turn light off”. So I now have twice the number of allowed lights, and drastic editing would be needed (naturally, this was the Friday night before I was due to drive up with it, after a number of late nights working on it).

Fortunately, I had also been an idiot (very slightly) with my counting XPath; because of the way the XML format works, each timing mark was being counted as an event. So having tweaked it not to count the timing marks, I had around 2600 lights, which is rather more than the 1750 budget but less bad than the 3400ish I started with.

So I had to scale things back a bit. I dropped the tracks I’d been using for the “1 2 3 4” beat. I removed some of the doubling up between channels. I dropped some of the shorter notes in tracks where I’d been trying to reproduce the rhythm of the vocals. I simplified bits and generally chopped it about until I had something that met the requirements.

Part of the finished product

Problem 7 – I don’t have a USB stick

Well, I didn’t at the start of the project. I did by the end; the smallest one I could find in Curry’s was 32GB. Which is ridiculous overkill for the size of the files needed (the audio was 27M, the compiled lightshow file was 372K). But, well, whatever…

And that was it. I followed the instructions about what needed to be on the stick (a folder called “LightShow” with that exact capitalisation containing “lightshow.wav” and “lightshow.fseq”) and took it with me to Yorkshire. Chris plugged it into the Tesla. It ran successfully. Hooray!

I’m not sure there’s a great moral to this story (other than maybe “read the instructions carefully”) but it was a fun challenge. Thanks to China Drum for a great song, Tesla for building the light show feature, and the xLights authors for an open-source application that made building the light show possible.

Embracing Impermanence (or how to check my sbt build works)

Stable trading relationships with nearby countries. Basic human rights. A planet capable of sustaining life. What do these three things have in common?

The answer is that they are all impermanent. One moment we have them, the next moment – whoosh! – they’re gone.

Today I decided I would embrace our new age of impermanence insofar as it pertains to my home directory. Specifically, I wondered whether I could configure a Linux installation so that my home directory was mounted in a ramdisk, created afresh each time I rebooted the server.

Why on earth would I want to do something like that?

The answer is that I have a Scala project, built using sbt (the Scala Build Tool), and I thought I’d clear some of the accumulated cruft out of the build.sbt file, starting with the configured resolvers. These are basically the repositories which will be searched for the project’s dependencies – there were a few special case ones (e.g. one for JGit, another for MarkLogic) and I strongly suspected that the dependencies in question would now be found in the standard Maven repository. So they could probably be removed, but how to check, since all of the dependencies would now exist in caches on my local machine?
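
For illustration, the sort of thing I mean looked roughly like this in build.sbt (the repository names and URLs here are made up, not copied from the real build):

// Illustrative only - names and URLs are hypothetical
resolvers ++= Seq(
  "jgit-repo" at "https://repo.eclipse.org/content/groups/releases/",
  "marklogic-repo" at "https://developer.marklogic.com/maven2/"
)

If the artifacts those were added for are now published to Maven Central, the whole block can simply be deleted – which is exactly what I wanted to verify.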

A simple solution would have been to delete the caches, but that involves putting some effort into finding them, plus I have developed a paranoid streak about triggering unnecessary file writes on my SSD. So I had a cunning plan – build a VirtualBox VM and arrange for the home directory on it to be a ramdisk, thus I could check the code out to it and verify that the code will build from such a checkout, and this would then be a useful resource for conducting similar experiments in the future.

Obviously this is not quite a trivial undertaking, because I need some bits of the home directory (specifically the .ssh directory) to persist so I can create the SSH keys needed to authenticate with GitHub (and our internal GitLab). Recreating those each time the machine booted would be a pain.

After a bit of fiddling, my home-grown solution went something like this:

  • Create a VirtualBox VM, give it 8G memory and a 4G disk (maybe a bit low if you think you’ll want Docker images on it; I subsequently ended up creating a bigger disk and mounting it on /var/lib/docker)
  • Log into VM (my user is daniel), install useful things like Git, curl, zip, unzip etc.
  • Create SSH keys, upload to GitHub / GitLab / wherever
  • Install SDKMAN! to manage Java versions
  • Create /var/daniel and copy into it all of the directories and files in my home directory which I wanted to be persisted; these were basically .ssh for SSH keys, .sdkman for java installations, .bashrc which now contains the SDKMAN! init code, and .profile
  • Save the following script as /usr/local/bin/create_home_dir.sh – this wipes out /home/daniel, recreates it and mounts it as tmpfs (i.e. a ramdisk) and then symlinks into it the stuff I want to persist (everything in /var/daniel)
#!/bin/bash
DIR=/home/daniel

mount | grep $DIR && umount $DIR

[ -d $DIR ] && rm -rf $DIR

mkdir $DIR

mount -t tmpfs tmpfs $DIR

chown -R daniel:daniel $DIR

ls -A /var/daniel | while read FILE
do
  sudo -u daniel ln -s /var/daniel/$FILE $DIR/
done
  • Save the following as /etc/systemd/system/create-home-dir.service
[Unit]
Description=Create home directory

[Service]
ExecStart=/usr/local/bin/create_home_dir.sh

[Install]
WantedBy=multi-user.target
  • Enable the service with systemctl enable create-home-dir
  • Reboot and hope

And it turns out that this worked; when the server came back I could ssh into it (i.e. the authorized_keys file was recognised in the symlinked .ssh directory) and I had a nice empty workspace; I could git clone the repo I wanted, then build it and watch all of the dependencies get downloaded successfully. I note with interest that having done this, .cache is 272M and .sbt is 142M in size. That seems to be quite a lot of downloading! But at least it’s all in memory, and will vanish when the VM is switched off…

Zen and the art of booking vaccinations

This is a slightly abridged version of a painful experience I had recently when trying to book a Covid vaccination for my 5-year-old daughter, and some musing about what went wrong (spoiler: IT systems). It’s absolutely not intended as a criticism of anyone involved in the process. All descriptions of the automated menu process describe how it was working today.

At the beginning of April, vaccinations were opened up for children aged 5 and over. Accordingly, on Saturday 2nd, we tried to book an appointment for our daughter using the NHS website (https://www.nhs.uk/conditions/coronavirus-covid-19/coronavirus-vaccination/book-coronavirus-vaccination/). After entering her NHS number and date of birth, we were bounced to an error page:

The error page after failing to book an appointment.

There’s no information on the page about what the error might be – possibly this is reasonable given patient confidentiality etc. (at no point had I authenticated myself). I noticed that the URL ended “/cannot-find-vaccination-record?code=9901” but other than that, all I can do is call 119 as suggested.

Dialling 119, of course, leads you to a menu-based system. After choosing the location you’re calling from, you get 4 options:

  1. Test and trace service
  2. Covid-19 vaccination booking service
  3. NHS Covid pass service
  4. Report an issue with Covid vaccination record

So the obvious choice here is “4”. This gives you a recorded message “If you have a problem with your UK vaccination records, the agent answering your call can refer you on to the data resolution team”. This sounds promising! Then there’s a further menu with 3 choices:

  1. If your vaccination record issue is stopping you from making a vaccination booking
  2. If your issue is with the Covid pass
  3. If your issue relates to a vaccination made overseas

Again, the obvious choice here is “1”. This results in you being sent to the vaccination booking service.

The first problem I encountered (in the course of the day I did this many times!) was that many of the staff on the other end seemed to be genuinely confused about how I’d ended up with them. They told me I should redial and choose option 4, and I kept explaining that I had done exactly that and this is where I had ended up. So either the menu system is not working and is sending me to the wrong place (although given the voice prompts it sounds like it was doing the right thing), or the staff taking the calls have not been briefed properly.

Eventually I was able to get myself referred to the slightly Portal-esque sounding Vaccination Data Resolution Service. They explained that my daughter did not appear to be registered with a GP, which surprised me because she definitely is. So, they said, I should get in touch with her GP practice and get them to make sure the records were correct on “the national system”.

This I did. I actually went down there (it was lunchtime), the staff at the surgery peered at her records and reported that everything seemed to be present and correct, with no issues being flagged up.

So, then I had more fun and games trying to get one of the 119 operators to refer me back to the VDRS. This was eventually successful, and someone else at the VDRS called me back. She took pity on me, and gave me some more specific information – the “summary care record” on the “NHS Spine portal” which should have listed my daughter’s GP did not.

I phoned the GP surgery, explained this, various people looked at the record and reported that everything seemed fine to them.

More phoning of 119, a third referral, to another person. He wasn’t able to suggest anything, sadly.

So, at this point, I was wracking my brains trying to work out where the problem could lie. I had the VDRS people saying that this data was missing from my daughter’s record, and the GP surgery insisting that all was well. The VDRS chap had mentioned something about it potentially taking up to 4 weeks (!) for updates to come through to them, which suggests that behind the scenes there must be some data synchronisation between different systems. I wondered if there was some kind of way of tagging bits of a patient record with permissions to say who is allowed to see them, and the GP surgery could and the VDRS people couldn’t.

Finally, on the school run to collect my daughter, I thought I’d have one last try at talking to the GP practice in person. I spoke to one of the ladies I’d spoken to on the phone; she took my bit of paper with “NHS Spine portal – summary care record” scrawled on it and went off to see the deputy practice manager. A short while later she returned; they’d looked at that bit of the record and spotted something in it about “linking” the local record with the NHS Spine (she claimed they’d never seen this before¹), and that this was not set. I got the impression that in fact it couldn’t be set, because her proposed fix (to be tried tomorrow) is going to be to deregister my daughter and re-register her. And then I should try the vaccine booking again in a couple of days.

As someone who is (for want of a better phrase) an “IT professional”, the whole experience was quite frustrating. As noted at the top, I’m not trying to criticise any person I dealt with – everyone seemed keen to help. I’m also not trying to cast aspersions on the GP’s surgery – as far as they were aware, there was nothing amiss with the record (until they discovered this “linking” thing). My suspicion is that the faults lie with IT systems and processes.

For example, it sounds like it’s an error for a GP surgery to have a patient record that’s not linked to a record in the NHS Spine, but since many people took a look at the screens showing the data without noticing anything amiss, I’d say that’s some kind of failure of UI design. I wonder how it ended up like that; maybe my daughter’s record predated linking with Spine and somehow got missed in a transitioning process (or no such process occurred)?

It would have been nice if I’d been able to get from the state of knowing there’s some kind of error with the record, to knowing the actual details of the error, without having to jump through so many telephonic hoops. I presume the error code 9901 means something to somebody, but it didn’t mean anything to any of the people I spoke to. In any case, I only spotted it because, as a developer, I thought I’d peep at the URL, but it didn’t seem to be helpful from the point of view of diagnosing the problem. It feels like there’s a missed opportunity here – since people seeing that error page are directed to the 119 service, it would have been helpful to have provided some kind of visible code to enable the call handlers to triage the calls effectively.

In terms of my own development, it was an important reminder that the systems I build will be used by people who are not necessarily IT-literate and don’t know how they work under the covers, and if they go wrong then it might be a bewildering and perplexing experience for them. Being at the receiving end of vague and generic-sounding errors, as I have been today, has not been a lot of fun.


¹ I found this curious, but a subsequent visit to the surgery a couple of days later to see how the “fix” was progressing clarified things somewhat. My understanding now is that the regular view of my daughter’s record suggested that all was well, and that it was linked with Spine, and it was only when the deputy practice manager clicked through to investigate that it popped up a message saying that it was not linked correctly.

Obviously I don’t know the internals of the system and this is purely speculation, but suppose that the local system had set an “I am linked with Spine” flag when it tried to set up the link, the linking failed for some reason, and the flag was never unset (or maybe it got set regardless of whether the linking succeeded or failed). Suppose furthermore that the “clicking through” process described actually tries to pull data from Spine. That could give a scenario in which the record looks fine at first glance and gives no reason to suppose anything is wrong, and you only see the problem with some deeper investigation. We can still learn a lesson from this hypothetical conjecture – if you are setting a flag to indicate that some fallible process has taken place, don’t set it before you run the process (or unconditionally afterwards); set it only once you have run it and confirmed that it succeeded.
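
To make that last point concrete, here is a completely hypothetical sketch (none of these types exist in any real NHS system, obviously) of setting the flag only on confirmed success:

// Hypothetical types, purely to illustrate the point
case class PatientRecord(nhsNumber: String, linkedWithSpine: Boolean = false)

sealed trait LinkResult
case class LinkSucceeded(spineId: String) extends LinkResult
case class LinkFailed(reason: String) extends LinkResult

object SpineLinking {
  def attemptLink(record: PatientRecord, link: PatientRecord => LinkResult): PatientRecord =
    link(record) match {
      // The flag is set here, and only here, once the fallible operation has definitely succeeded
      case LinkSucceeded(_) => record.copy(linkedWithSpine = true)
      // On failure the flag stays false, so the record never claims a link it doesn't have
      case LinkFailed(reason) =>
        println(s"Spine linking failed: $reason")
        record
    }
}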


Postscript

Sadly, I never found out what the underlying problem was. We just tried the booking process one day and it worked as expected (although by this point it was moot as my daughter had tested positive for Covid, shortly followed by myself and my wife). A week or so later, we got a registration confirmation letter confirming that she’d been registered with a GP practice, which was reassuring. I’d like to hope that somewhere a bug has been fixed as a result of this…

What have I been listening to?

A while ago, Tim suggested we could have a #now-listening channel in our company Slack, in which people could post details of what they were listening to. It occurred to me that it might be a fun challenge to try to figure out from what I’d posted on there who my favourite artist was, and which was my most-listened-to album. So I rolled up my sleeves and got to work. This is an account of what I did and my various thought processes as I went along…

Challenge: figure out how to get my posts from our #now-listening channel and do some statistics to them.

Session 1: After school run, but before work…

Start – there’s an API. https://api.slack.com

Read the documentation: https://api.slack.com/methods/search.messages looks useful – how do I call it?

I NEED A TOKEN! Aha – https://api.slack.com/apps – a “generate token” button…

Access token: xoxe.xoxp-blah-blah-blah. SUCCESS!

First obvious question: has someone done this already? Google knows everything: https://github.com/slack-scala-client/slack-scala-client

Create a new project: sbt new scala/scala-seed.g8 – add dependency on slack-scala-client, ready to rock! In such a hurry; I can’t even be bothered to set up a package, just hijack the Hello app that came in the skeletal project…
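
For reference, pulling the client into the project is a one-line addition to build.sbt – the version below is from memory, so check the project’s README for the current coordinates:

// Coordinates / version are from memory - check the slack-scala-client README
libraryDependencies += "com.github.slack-scala-client" %% "slack-scala-client" % "0.4.3"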

From docs:

val token = "MY TOP SECRET TOKEN"
implicit val system = ActorSystem("slack")
try {
  val client = BlockingSlackApiClient(token)
  client.searchMessages(WHAT TO PUT HERE?)
} finally {
  Await.result(system.terminate(), Duration.Inf)
}

… maybe something like…?:

val ret = client.searchMessages("* in:#67bricks-now-listening from:@Daniel", sort = Some("timestamp"), sortDir = Some("asc"), count = Some(5))

RUN IT

Fails. Because HelloSpec fails (I mentioned I just hijacked the OOTB Hello app). Fix with the delete key.

RUN IT

[WARN] [11/26/2021 08:40:37.418] [slack-akka.actor.default-dispatcher-2] [akka.actor.ActorSystemImpl(slack)] Illegal header: Illegal 'expires' header: Illegal weekday in date 1997-07-26T05:00:00: is 'Mon' but should be 'Sat'
Exception in thread "main" slack.api.ApiError: missing_scope
at slack.api.SlackApiClient$.$anonfun$makeApiRequest$3(SlackApiClient.scala:92)

:-(

Google: “missing_scope” and interpret results

The token used is not granted the specific scope permissions required to complete this request.

:-( :-(

Maybe I have to create an app and add it to the workspace? I’ll try that.

Created, figured out how to add the user token scope “search:read” – and I got a new token!

Token= xoxp-blahblahblah

Rerun: I got a response!

{
  "ok":true,
  "query":"* in:#67bricks-now-listening from:@Daniel",
  "messages": {
    "total":0,
    "pagination": {
      "total_count":0,
      "page":1,
      "per_page":5,
      "page_count":0,
      "first":1,
      "last":0
    },
    "paging": {
      "count":5,
      "total":0,
      "page":1,
      "pages":0
    },
    "matches": []
  }
}

:-(

Let’s just search in the channel without specifying a name…?

val ret = client.searchMessages("in:#67bricks-now-listening", sort = Some("timestamp"), sortDir = Some("asc"), count = Some(5))

Gives:

{
  "ok": true,
  "query": "in:#67bricks-now-listening",
  "messages": {
    "total": 7113,
    "pagination": {
      "total_count": 7113,
      "page": 1,
      "per_page": 5,
      "page_count": 1423,
      "first": 1,
      "last": 5
    },
    "paging": {
      "count": 5,
      "total": 7113,
      "page": 1,
      "pages": 1423
    },
    "matches": [
      {
        "username": "daniel.rendall",
        "other": "field_here"
      }
    ]
  }
}

Aha! My username is daniel.rendall, let’s try that:

val ret = client.searchMessages("in:#67bricks-now-listening from:@daniel.rendall", sort = Some("timestamp"), sortDir = Some("asc"), count = Some(5))

Gives:

{
  "ok": true,
  "query": "in:#67bricks-now-listening from:@daniel.rendall",
  "messages": {
    "total": 3213,
    "pagination": { ... etc }
  }
}

Success! Also – 3213 messages – sounds plausible. This is looking good… but sort direction seems wrong…? Try switching to “desc” => same result.

(Time spent so far: about half an hour – better stop or will miss the morning call!)

Session 2: Re-run – still works (hooray!)

Copy and paste output and save as response.json, fix up with jq so I can examine it:

cat response.json | jq '.' > response_tidied.json

And now:

"pagination": {
  "total_count": 3234,
  "page": 1,
  "per_page": 5,
  ... etc

Number has gone up – I’m still listening to things!

So, I could parse the responses to work out what the next page should be, or I could just loop – with pages of size 100 (if the API will return them) there should be 33. So we will loop and save these as 1.json, 2.json etc. First rule of scraping – aim to do it just once and save the result locally.

Horrible quick and dirty code alert!

val outDir = new File("/home/daniel/Scratch/slack/output")
def main(args: Array[String]): Unit = {
  outDir.mkdirs()
  implicit val system = ActorSystem("slack")
  try {
    val client = BlockingSlackApiClient(token)
    (1 to 33).foreach { pageNum =>
      try {
        val ret = client.searchMessages("in:#67bricks-now-listening from:@daniel.rendall",
          sort = Some("timestamp"),
          sortDir = Some("desc"),
          count = Some(100),
          page = Some(pageNum))
        Files.write(new File(outDir, "" + pageNum + ".json").toPath, ret.toString().getBytes(StandardCharsets.UTF_8), StandardOpenOption.CREATE)
        println(s"Got page $pageNum")
      } catch {
        case NonFatal(e) =>
          println(s"Couldn't get page $pageNum - ${e.getMessage}")
      }
      Thread.sleep(1000)
    }
  } finally {
    Await.result(system.terminate(), Duration.Inf)
  }
}

… prints up a reassuring list “Got page 1” => “Got page 33” and no (reported) errors!

Second rule of scraping – having done it and got the data, zip it up and put it somewhere just in case you destroy it…

Tidy it all (non essential, but makes it easier to look at):

mkdir tidied
ls output | while read JSON ; do cat output/$JSON | jq '.' > tidied/$JSON ; done

On scanning the data – it looks plausible, I can’t see an obvious “date” field but there’s a cryptic “ts” field (sample value: “1638290823.124400”) which is maybe a timestamp? A problem for another day…
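
(For what it’s worth, a quick sanity check suggests it is indeed a timestamp – Slack “ts” values appear to be Unix epoch seconds, with the fractional part used to keep message IDs unique. A minimal sketch:

import java.time.Instant

val ts = "1638290823.124400"
val instant = Instant.ofEpochMilli((ts.toDouble * 1000).toLong)
println(instant) // prints a timestamp at the end of November 2021, which fits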

(Time spent this session: about 20 minutes)

Session 3: I can haz stats?

Need to load it in. A new main method in a new object…

val outDir = new File("/home/daniel/Scratch/slack/output")
def main(args: Array[String]): Unit = {
  val jsObjects = outDir.listFiles().map { f =>
    Json.parse(new FileInputStream(f))
  }
  println(jsObjects.head)
}

Prints something sensible. Now need to get it into a useful form: define the simplest class that could possibly work.

case class Message(iid: UUID, ts: String, text: String, permalink: String)
object Message {
  implicit val messageReads: Reads[Message] = (
  (__ \ "iid").read[UUID] and
  (__ \ "ts").read[String] and
  (__ \ "text").read[String] and
  (__ \ "permalink").read[String]
  ) (Message.apply _)
}

Not sure if I need the ID, but I like IDs. Looks like a UUID.

… oh, also some classes to wrap the whole result with minimum of faff (and Reads, omitted for brevity):

case class SearchResult(messages: Messages)
case class Messages(total: Int, matches: Seq[Message])
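
For completeness, the omitted Reads would be something along these lines (my reconstruction rather than the original code, but play-json makes them fairly mechanical):

import play.api.libs.functional.syntax._
import play.api.libs.json._

object Messages {
  implicit val messagesReads: Reads[Messages] = (
    (__ \ "total").read[Int] and
    (__ \ "matches").read[Seq[Message]]
  )(Messages.apply _)
}

object SearchResult {
  implicit val searchResultReads: Reads[SearchResult] =
    (__ \ "messages").read[Messages].map(SearchResult.apply)
}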

Go for broke:

val jsObjects: Array[JsResult[Seq[Message]]] = outDir.listFiles().map { f =>
  Json.parse(new FileInputStream(f)).validate[SearchResult].map(_.messages.matches)
}

Unpleasant type signature alert – Array[JsResult[Seq[Message]]]. Let’s assume nothing will go wrong and just use “.get” and “.flatMap”:

val messages: Seq[Message] = outDir.listFiles().flatMap { f =>
  Json.parse(new FileInputStream(f)).validate[SearchResult].map(_.messages.matches).get
}.toList

That gives me 3234 Message objects, which is reassuring. They include top-level messages, and responses to threads. As far as I can see, the thread responses include a ?thread_ts parameter in their permalink, therefore filter them out – leaves 1792 remaining.

val filtered = messages.filterNot(_.permalink.contains("?thread_ts"))
filtered.take(10).map(_.text).foreach(println)

…and voila – a list of messages that look like what I was hoping for.

The things I’m looking for will all have the format “Artist – Album”. Regex time!

val ArtistAlbumRegex = "(.*?) - (.*)".r("artist", "album")

Wait, what…? “@deprecated(“use inline group names like (?<year>X) instead”, “2.13.7”)”

Didn’t know that had changed. Ho hum…

val ArtistAlbumRegex: Regex = "(?<artist>.*?) - (?<album>.*)".r

val artistsAndAlbums = messages.filterNot(_.permalink.contains("?thread_ts")).map(_.text).collect {
  case ArtistAlbumRegex(artist, album) => ArtistAndAlbum(artist, album)
}
artistsAndAlbums.take(10).foreach(println)

Even more promising – now I have a list of ArtistAndAlbum objects.

Getting there! Now, there are bound to be loads of duplicates. So I guess the most obvious thing to do is count them. Let’s see if I can find the albums I’ve listened to the most, and their counts. I’m going to define a canonical key for grouping an ArtistAndAlbum just in case I’ve not been completely consistent in capitalisation.

case class ArtistAndAlbum(artist: String, album: String) {
  val groupingKey: (String, String) = (artist.toLowerCase, album.toLowerCase)
}

Then we should be able to count by:

val mostCommonAlbums = artistsAndAlbums.groupBy(_.groupingKey)
  .view.map { case (_, seq) => seq.head -> seq.length }.toList.sortBy(_._2)
mostCommonAlbums.take(10).foreach(println)

(The Bob Lazar Story – Vanquisher,1)
(The Pretenders – Pretenders (152),1)
(Saxon – Lionheart,1)
(Leprous – Tall Poppy Syndrome,1)
(Tom Petty and the Heartbreakers – Damn the Torpedoes (231),1)
(2Pac – All Eyez on Me (436),1)
(Bon Iver – For Emma, Forever Ago (461),1)
(James – Stutter,1)
(Elton John – Honky Château (251),1)
(Ice Cube – AmeriKKKa’s Most Wanted,1)

Ooops – wrong way – also the numbers in brackets need to be removed. Not sure there’s a nicer way to invert the ordering than explicitly passing the Ordering that I want to use…

val mostCommonAlbums = artistsAndAlbums.groupBy(_.groupingKey)
  .view.map { case (_, seq) => seq.head -> seq.length }.toList.sortBy(_._2)(Ordering[Int].reverse)
mostCommonAlbums.take(10).foreach(println)

(Benny Andersson – Piano,8)
(Meilyr Jones – 2013),7)
(Brian Eno – Here Come The Warm Jets),4)
(Richard & Linda Thompson – I Want To See The Bright Lights Tonight),4)
(Admirals Hard – Upon a Painted Ocean,4)
(Neuronspoiler – Emergence,4)
(Global Communication – Pentamerous Metamorphosis),4)
(Steely Dan – Countdown To Ecstasy,4)
(Pole – 2,4)
(Faith No More – Angel Dust,4)

That looks plausible, actually. I like Piano. I’m guessing there are loads of other “4” albums…

But who is my most listened to artist? I have a shrewd idea I know who it will turn out to be – my prediction is that it will be a four word band name with the initials HMHB. Use the fact that I defined my grouping key to start with the artist:

val mostCommonArtists = artistsAndAlbums.groupBy(_.groupingKey._1)
.view.map { case (_, seq) => seq.head.artist -> seq.length }.toList.sortBy(_._2)(Ordering[Int].reverse)
mostCommonArtists.take(10).foreach(println)

(Half Man Half Biscuit,28)
(Fairport Convention,19)
(R.E.M.,18)
(Steeleye Span,15)
(Julian Cope,15)
(Various,13)
(James,12)
(Cardiacs,11)
(Faith No More,9)
(Brian Eno,9)

Bingo! The mighty Half Man Half Biscuit in there at #1. One flaw is immediately apparent – this naive approach doesn’t distinguish between “listening to lots of albums by an artist as part of business-as-usual” and “listening to an artist’s entire back catalogue in one go” (which accounts for the high showings of Fairport Convention, R.E.M. and Steeleye Span). Worry about that some other time.

How many albums have I listened to?

val distinctAlbums = artistsAndAlbums.distinctBy(_.groupingKey)
println("Total albums = " + artistsAndAlbums.length)
println("Distinct albums = " + distinctAlbums.length)

Total albums = 1371
Distinct albums = 1255

… but that will be wrong, because I’ve listened to some albums in the context of e.g. working through the Rolling Stone or NME’s list of top 500 albums, and in those cases I appended the album’s position in the list to its name, e.g. “Battles – Mirrored (NME 436)”. So chop that off the end of the album name:

val artistsAndAlbums = messages.filterNot(_.permalink.contains("?thread_ts")).map(_.text).collect {
  case ArtistAlbumRegex(artist, album) =>
    ArtistAndAlbum(artist, album.replaceAll("\\([^)]+\\)$", "").trim)
}

Distinct albums = 1191

This final session took about 50 minutes, so if my maths is correct, the total time spent on this was a little under 2 hours. TBH I’m slightly dubious about the results; after listing all of the albums I’ve listened to in alphabetical order I’m sure there are some missing (e.g. I tackled the entire Prince back catalogue, but there were only a handful of Prince albums in there, ditto for David Bowie). I suspect a bit more work and exploration of the Slack API might reveal what I’m missing. Or maybe my method for distinguishing main messages from responses is wrong (just had a thought; maybe a main message that begins a thread also gets the ?thread_ts parameter).  But it’s close enough for now, and appears to confirm my suspicion that Half Man Half Biscuit are my most listened to artist.

And now, what with it being the season of goodwill and all that, it’s time for my special Christmas Playlist

Session musician programmers

As a software consultancy, we are always in the business of trying to recruit good developers. One of the more annoying phrases that has cropped up in the industry lexicon of late is “rockstar programmer”, as both an ideal to which developers are assumed to aspire, and a glib description of the kind of programmer that software publishers are assumed to be desperate to employ.

In my very humble opinion, the software industry has no need for rock stars (or rockstar programmers, as it happens). If we are going to use a musical analogy, a better kind of programmer could be termed a “session musician programmer”. Session musicians are the (usually) fairly anonymous musicians who are relied upon for actually getting records recorded. Attributes of a good session musician would include at least some of the following.

  • Courtesy and professionalism towards colleagues and clients.
  • Playing, or being willing to learn, a number of different instruments.
  • Playing in a range of styles depending on what is required.
  • Being able to read music, in order to play someone else’s arrangement.
  • Ability to improvise, if the arrangement just calls for “a solo”.
  • Ability to work with other musicians to come up with a workable arrangement for a piece if there isn’t one.
  • Some understanding of music production, and an understanding of how what is played will translate to what is eventually heard on the recording.

Translated roughly into development terms, with a lot of artistic license, these might be equivalent to the following desirable attributes in a developer:

  • Courtesy and professionalism towards colleagues and clients.
  • Having some knowledge of a number of different languages.
  • Being able to fit one’s code into the “style” of the project being worked on.
  • Willingness to follow direction from a technical lead and implement a precisely written specification, where necessary.
  • Ability to understand and fill gaps in the technical description of a project.
  • Ability to work with other developers to come up with a plan for implementing something.
  • Some understanding of the tools used for building, testing and deploying code.

I am not sure that the concept of the “rockstar programmer” embodies these things. Rockstars exist to do one thing well, generally singing or playing guitar, and that often seems to come with a whole lot of tiresome baggage in the form of massive ego, tantrums, and demands for excessive amounts of money (yes, there are exceptions). It so happens that plenty of session musicians can sing, plenty of them can play guitar, and they are capable of generating far more in the way of recorded music in a year than most rockstars would manage in their careers.

My advice, therefore, is that aspiring software developers looking to the world of music for role models should aim to be part of The Wrecking Crew rather than Guns N’ Roses.