Coding principles 3: Favour simplicity over complexity

This is the 3rd part of a series about 67 Bricks’s coding principles. The previous posts are: 1 and 2.

The principle

Aim for simplicity over complexity. This applies to everything from overarching architectural decisions down to function implementations.

This principle is a close cousin of the previous one – aim for clear, readable code – but emphasises one particular aspect of what makes code clear and readable: simplicity.

Simpler solutions tend to be easier to implement, to maintain, to reason about and to discuss with colleagues and clients.

It can be tempting to think that for software to be good or valuable it must be complicated. There can be an allure to complexity, I think partly because we tend to equate hard work with good work. So if we write something labyrinthine and hard to understand, it’s tempting to think it must also be good. But this is a false instinct when it comes to software. In code, hard does not equal good. In general complexity for its own sake should be avoided. It’s important to remember that there’s absolutely nothing wrong with a simple solution if it does what’s needed.

There’s also value in getting a simple solution working quickly so that it can be demoed, reviewed and discussed early compared to labouring for a long time over a complex solution that might not be correct. Something we emphasise a lot working at 67 Bricks is the value of iteration in the agile process. It can be extremely powerful to implement a basic version of a feature, site or application so that stakeholders can see and play with it and then give feedback rather than trying to discuss an abstract idea. Here, simplicity really shines because often getting a simple thing in front of a stakeholder in a week can be a lot more valuable than getting a complicated thing in front of them in a month.

This principle applies at every level at which we work, from designing your architectural infrastructure, down through designing the architecture of each module in your system, down to writing individual functions, frontend components and tests. At every level, if you can achieve what you need with fewer moving parts, simpler abstractions and fewer layers of indirection, the maintainability of your whole system will benefit.

Of course there are caveats here. Some code has to be complicated because it’s modelling complicated business logic. Sometimes there must be layers of abstraction and indirection because the problem requires it. This principle is not an argument that code should never be complicated, because sometimes it is unavoidable. Instead, it is an argument that simplicity is a valuable goal in itself and should be favoured where possible.

Another factor that makes this principle deceptively tricky is that it is the system (the architecture, the application, the class etc) that should be simple, not necessarily each individual code change. A complex system can very quickly emerge from a number of simple changes. Equally, a complicated refactor may leave the larger system simpler. It’s important to see the wood for the trees here. What’s important isn’t necessarily the simplicity of an individual code change, but the simplicity of the system that results from it.

There’s also subjectivity here: what does “simple” really mean when talking about code? A good example of an overcomplicated solution is the FizzBuzz Enterprise Edition repo – a satirical implementation of the basic FizzBuzz code challenge using an exaggerated Enterprise Java approach, with layers of abstraction via factories, visitors and strategies. However, all the patterns in use there do have their purpose. In another context, a factory class can simplify rather than obfuscate. But it’s important not to bring in extra complexity or indirection before it’s necessary.
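For contrast with the Enterprise Edition, here is a minimal sketch of what a deliberately plain FizzBuzz might look like. The function names `fizzBuzz` and `fizzBuzzUpTo` are my own choices for illustration and are not taken from that repo.

```typescript
// A deliberately plain FizzBuzz: one small function, no factories or strategies.
function fizzBuzz(n: number): string {
  if (n % 15 === 0) return "FizzBuzz";
  if (n % 3 === 0) return "Fizz";
  if (n % 5 === 0) return "Buzz";
  return String(n);
}

// Builds the answers for 1..limit as a new array.
function fizzBuzzUpTo(limit: number): string[] {
  return Array.from({ length: limit }, (_, i) => fizzBuzz(i + 1));
}
```

If the requirements later grew (configurable divisors, say), that would be the moment to consider an extra abstraction, not before.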

Resources

The Wrong Abstraction – Sandi Metz

The Grug Brained Developer

Coding principles 2: Prioritise readability

This is the 2nd part of a series about 67 Bricks’s coding principles. The first post, containing the introduction and first principle is here.

The principle

Aim for clear, readable code. Write clear, readable comments where necessary.

You should make it a priority that your work be readable and understandable to those who might come to it after you. This means you should aim to write code that is as clear and unambiguous as possible. You should do this by:

  • using clear variable, function and class names
  • avoiding confusing, ambiguous or unnecessarily complicated logic
  • adhering to the conventions and idioms of the language or technology you’re using

What can’t be made clear through code alone should be explained in comments.

Comments should focus on “why” (or “why not” explanations) far more than “how” explanations. This is particularly true if there is some historical context to a coding decision that might not be clear to someone maintaining the code in the future.

Note however that just like code, comments must be maintained and can become stale or misleading if they don’t evolve with the code, so use them carefully and only where they add value.

It is important to recognise that your code will be read far more times than it is written, and it will be maintained by people who don’t know everything you knew when you wrote it, possibly including your future self. Aim to be kind to your future self and others by writing code that conveys as much information and relevant context as possible.

I expect we’ve all had the experience of coming to a piece of code and struggling to understand it, only to realise it was you who wrote it a few months or weeks (or even days?) ago. We should learn from this occasional experience and aim to identify what we could have changed about the code the first time that would have prevented it. Better variable names? More comments? More comprehensive tests?

“You’re not going to run out of ink,” is something a colleague once commented on a pull request of mine to say that I could clarify the purpose of a variable by giving it a longer, more descriptive name. I think that’s a point worth remembering. Use as many characters as you need to make the job of the next person easier.

Of course, there’s some subjectivity here. What you see as obscure, someone else might see as entirely clear and vice versa. And certainly there’s an element of experience in how easily one can read and understand any code. The point really is to make sure that at least a thought is spared for the person who comes to the code next.

Examples

Here is an example that does not follow this principle:

const a = getArticles('2020-01-01');
a && process(a);

This example is unclear because it uses meaningless variable names and somewhat ambiguous method names. For example, it’s not clear without reading further into each method what they do – what does the date string parameter mean in getArticles? It also uses a technique for conditionally executing a method that is likely to confuse someone trying to scan this code quickly.

Now, here’s an example that attempts to follow the principle:

// The client is only interested in articles published after 1st Jan
// 2020. Older articles are managed by a different system.
// See <ticket number>
const minDate = '2020-01-01';

const articlesResult = getArticlesSince(minDate);
if (articlesResult) {
  ingestArticles(articlesResult);
}

It provides a comment to explain the “why” of the hardcoded date, including relevant context; it uses much more meaningful names for variables and functions; and it uses a more standard, idiomatic pattern for conditionally executing a method.

Resources

Naming is Hard: Let’s Do Better (Kate Gregory, YouTube)

Coding principles 1: Favour functional code

Introduction to the principles

When I started working at 67 Bricks in 2017, in a small Oxford office already slightly struggling to contain about 15 developers, I found a strong and positive coding culture here. I learnt very quickly over my first few weeks what kind of code and practices the company valued. Some of that learning came via formal routes like on-boarding meetings and code review comments, but a lot of it came just by being in the office among many excellent developers and chatting or overhearing chats about opinions and preferences.

While there’s something very nice about this organic, osmosis-like way of ingesting a company’s values, practices and principles, it has been forced to evolve by a few factors over the last year. First we switched to home-working during the Covid lockdowns of 2020 and 2021 and then settled into a hybrid working model in which home-working is the default for most of us and the office is used somewhat less routinely. Secondly, we’ve been increasing our technical team quite significantly over the last several years. Thirdly, that growth has partly involved a focus on bringing in and developing more junior developers. Each of these changes has made the “osmosis” model for new starters to pick up the company’s values a bit less tenable.

So over recent months, the tech leads have undertaken a project to distil those unwritten values and principles into a set of slightly more formal statements that new starters and old hands alike can refer to to help guide our high level thinking.

We came up with 9 of these principles. This and the following 8 posts in this series will go through each principle describing it and explaining why we think it is important in our ultimate goal of producing good, well-functioning products that run robustly, meet customer needs and are easy to maintain. 67 Bricks’s semi-joking unofficial motto is “do sensible things competently”; these principles aim to formalise a little what we mean by “sensible” and “competent”.

Generally I’ve used TypeScript to write any code examples. The commonality of TypeScript and JavaScript should mean that examples are understandable to a good number of people.

About the principles

Before diving into the first principle, it’s worth briefly describing what these principles are and what they’re not.

These are high-level, general principles that aim to guide approaches to writing code in a way that is language/framework/technology agnostic. They should be seen more as rules of thumb or guidelines with plenty of room for exceptions and caveats depending on the situation. A good comparison might be Effective Java by Joshua Bloch where a statement like “Favor composition over inheritance” doesn’t rule out ever using inheritance, but aims to guide the reader to understand why – in some cases – inheritance can cause problems and composition may provide a more robust and flexible solution.

These principles are not a style guide – our individual project teams are self organising and perfectly capable of enforcing their own code style preferences as they see fit – nor a dogmatic, stone-carved attempt at absolute truth. They’re also not strongly opinionated hot takes that are likely to provoke flame wars. They are simply what we see as sensible guidelines towards good, easy-to-write, easy-to-maintain code, and therefore robust software.

That was a lot of ado, so without any further let’s get on with the first principle.

The principle

Favour functional, immutable code over imperative, mutable code

Functional code emphasises side effect free, pure, composable functions that deal with immutable objects and avoid mutable state. We believe this approach leads to more concise, more testable, more readable, less error-prone software and we advise that all code be written in this way unless there is a good reason not to.

Code written in this way is easier to reason about because it avoids side effects and state mutations; functions are pure, deterministic and predictable. This approach promotes writing small, modular functions that are easy to compose together and easy to test.

67 Bricks has a history of favouring Scala as a development language – which may be clear from browsing back through the history of this blog. While these days C# has become a more common language for the products we deliver, the functional-first spirit of Scala is still woven into the fabric of 67 Bricks development. I believe Martin Odersky’s Coursera course: Functional Programming Principles in Scala is an excellent starting point for anyone wanting to understand the functional programming mindset regardless of your interest in Scala as a language.

As an interesting aside, the implementations of many of the Scala collections library classes – such as ListMap and HashMap – use mutable data structures internally in some methods, presumably for purposes of optimisation. This illustrates the caveat mentioned above that there may be sensible, situation-specific reasons to override this principle and others. It’s worth noting however that while the internals of some functions may be implemented in an imperative way, those are implementation details that are entirely encapsulated and irrelevant to users of the API.
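To illustrate the same idea outside Scala, here is a small sketch of a function that mutates a local data structure internally while still behaving as a pure function from the caller’s point of view. The `countWords` function is a hypothetical example of mine and is not taken from any library.

```typescript
// Counts occurrences of each word. Internally this mutates a local Map for
// efficiency, but the mutation never escapes: the input array is not modified
// and the caller only sees the result through a read-only type.
function countWords(words: readonly string[]): ReadonlyMap<string, number> {
  const counts = new Map<string, number>();
  for (const word of words) {
    counts.set(word, (counts.get(word) ?? 0) + 1);
  }
  return counts;
}
```

Calling `countWords` twice with the same input gives the same counts and leaves the input untouched, so callers can treat it as functional code regardless of how the body is written.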

I think “functional programming” is better seen as a continuum than a black and white dichotomy. While certain languages – like Haskell and F# – may be strictly functional, most languages – including C#, Javascript/Typescript, Python and (increasingly) Java – have many features that allow you to write in a more functional way if you choose to use them.

Examples

There are many books describing and teaching functional programming and the various principles that make it up, so I don’t intend to go into too much detail, but I think a couple of examples may help illustrate what functional code is and why it’s useful.

The following is an example of some code that does not follow this principle:

let onOffer = false;

function applyOffersToPrices(prices: number[]) {
  onOffer = isOfferDate(new Date());
  if (onOffer) {
    for (let i = 0; i < prices.length; i++) {
      prices[i] /= 2;
    }
  }
  return onOffer;
}

const prices: number[] = await retrievePricesFromSomewhere();
const offerApplied = applyOffersToPrices(prices);
if (offerApplied) {
  // ... what values does `prices` contain here?
} else {
  // ... how about here?
}

This code is hard to reason about because applyOffersToPrices mutates one of its arguments in some instances. This makes it very hard to be sure what state the values in the prices array are in after that function is called.

The following is an example that attempts to follow the principle:

function discountedPrices(prices: number[], date: Date) {
  if (!isOfferDate(date)) {
    return prices;
  }
  return prices.map(price => price / 2)
}

const prices: number[] = await retrievePricesFromSomewhere();
const todayPrices = discountedPrices(prices, new Date());

In this example, discountedPrices is a pure function that does not mutate its input, but returns a new array containing the updated prices. It is unambiguous that prices still contains the original prices while todayPrices contains the prices that apply on the current date with the offer applied as necessary.

Note also that discountedPrices has everything it needs – the original prices and the current date – passed into it as arguments. This makes it very easy to test with different values.
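As a rough sketch of what that testability looks like, the snippet below calls discountedPrices with two fixed dates. The body of `isOfferDate` is an assumption made purely for illustration (the original example does not define it); here the offer is imagined to run on 26 December only.

```typescript
// Hypothetical rule for illustration only: the offer runs on 26 December.
function isOfferDate(date: Date): boolean {
  return date.getMonth() === 11 && date.getDate() === 26;
}

function discountedPrices(prices: number[], date: Date): number[] {
  if (!isOfferDate(date)) {
    return prices;
  }
  return prices.map(price => price / 2);
}

// Because the date is an argument, each case can be pinned down exactly
// without touching the system clock.
const original = [10, 20, 30];
const onOfferDay = discountedPrices(original, new Date(2024, 11, 26));
const onNormalDay = discountedPrices(original, new Date(2024, 5, 1));
```

The first call yields halved prices, the second yields the prices unchanged, and `original` is the same before and after both calls.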

Resources

Functional Programming Principles in Scala – Martin Odersky on Coursera

Why Functional Programming

How to Teach Programming to Kids

This post is a follow-up to one I wrote just over a year ago about my experience running a computing club at a local primary school before the Covid pandemic, and then resuming my STEM Ambassador activities last summer by running a retro games arcade at the school summer fair (https://blog.67bricks.com/?p=541).  I’ve since resumed my computing club and thought it would be worthwhile to give a proper account of my experiences.

I started the club in April 2018, full of enthusiasm but with little knowledge of appropriate techniques for imparting knowledge to 8-year-olds.  I was armed with five robots: two “Dash” robots and one “Cue” robot from Wonder Workshop (https://uk.makewonder.com) and two Lego Boost robots (https://www.lego.com/en-gb/product/boost-creative-toolbox-17101), a few ageing iPads and some Kindle Fires that I’d got cheap from a Black Friday deal.  I also had a working knowledge of Scratch (https://scratch.mit.edu/) and a bunch of ideas.

I spent an inordinate amount of time preparing a 10-week course for the first bunch of students that were unleashed upon me.  I prepared a full set of worksheets to cover concepts like algorithms, loops, functions and events – including an activity using the robots and an equivalent activity using Scratch.  Here’s an example of an activity to learn loops by getting the Dash robot to dance:

I turned up to my first session clutching my worksheets, with a suitcase full of robots and tablets, and half a plan for how to teach something useful to a group of 8-11 year-olds.  I learned a number of important things in that first session:

  • Kids don’t like worksheets.  At best they will be ignored.  At worst they will be crumpled up and trodden underfoot.  It doesn’t matter how beautiful or colourful they are, how carefully crafted – literally nobody is interested in them.  They will gather dust until you admit defeat and shove them in the recycling bin.
  • Any IT equipment the school has will either not work, or will be locked down to the extent that I won’t be able to use it.  If the school has anyone with IT expertise they’ll likely be a contractor who only turns up at the school for a few hours on a Tuesday morning, and their only interest in my club will be in making sure that I don’t break any of their kit.  The “smart screens” adorning the walls of the classrooms are pretty ornaments which are not to be used by the likes of me.  I got round this by bringing in my own projector and pointing it at a convenient wall.  The school might have iPads but nobody knows how to install any apps on them.  School laptops are always out of battery power and access to them is via some sort of free-for-all.
  • Kids are powered by snacks.  Lots and lots of them.  You have no power to stop them munching biscuits throughout your session, despite protestations that greasy fingers and school laptops are not a good combination.
  • Kids also have teeny tiny bladders (or at least claim to have) and so perpetually want to duck into and out of the session to visit the facilities.  Generally, preventing them from going in packs of four at a time is a good idea.
  • Robots are very popular – but robots made from Lego are very fragile and generally do not survive being driven off a desk.  It’s also tenuous as to whether they will survive the journey to school stuffed into a soft suitcase.
  • Children do not like to share.  Five robots between 12 children often resulted in tussles and gentle reminders that there was time for everyone to have a turn.
  • Some kids are better than me, and will storm ahead, completing all the exercises and then begin pestering me for more.  Some just want to draw pictures in Scratch and ignore whatever activity I have planned.  There are those who don’t get it at all, even if you sit beside them and write all their code.  Others just want to mess about and play with the robots.  All of these are fine.  An after school club should not be “just more school” and it’s OK as long as everyone is having fun.

The club progressed nicely for two years, with some just trying it out for a term and others returning again and again.  I gradually adapted the sessions to be a bit less planned.  What worked well was me working through a challenge on my screen step-by-step with the students following along. If some of them raced ahead I would encourage them to add their own ideas to their program.  If others lagged behind I would stop to help them, or pair them up with someone else who had already progressed to that point.

We would write simple games like Space Invaders or football.  We would simulate simple physical systems like diffusion, liquid flow or bouncing balls.  Or we would get the robots to draw pictures, make music, or dance.

Some of the highlights were:

Following a line on the floor:

Simulating a traffic crossing with three robots:

Tidying up Lego pieces:

Various types of digital (and not-so-digital) art:

Battling Wizards:  https://scratch.mit.edu/projects/738294798/

Football: https://scratch.mit.edu/projects/229839232/

Snooker: https://scratch.mit.edu/projects/726834506/

Bouncing: https://scratch.mit.edu/projects/214755030/

Liquid flow: https://scratch.mit.edu/projects/239536433/

When the COVID pandemic hit in March 2020 I had to shut the club down and I was only able to resume it again in September 2022.  I wrote in my previous blog post about the effects of the pandemic on education, but the lack of access to clubs and social activities is probably one of the less obvious impacts on a child’s wellbeing.

I was keen for my club to remain accessible to everyone and for activities to become more open-ended rather than just following my instructions step-by-step.  Some approaches I took were:

  • Giving guidance on some of the techniques required to write a game (e.g. getting a Scratch sprite to move, jump, bounce or fire projectiles) and then supervising while the children designed their own games.
  • Writing a “story” by creating a sequence of animated backdrops through which sprites moved (it’s amazing how many of these turned into horror tales involving zombies and vampires)
  • Designing a quiz with multiple choice questions
  • Exploring some of the excellent courses offered by Code.org at https://studio.code.org/courses
  • Using the Turing Tumble to build a mechanical computer and learn exactly how logic gates work:  https://upperstory.com/turingtumble/
  • Using Nintendo Labo to program cars, fishing rods and more: https://www.nintendo.co.uk/Nintendo-Labo/Nintendo-Labo-1328637.html

The broad aim of the STEM Ambassador program is to provide young people with a link from STEM subjects to the real world of work, so as to inspire the next generation in STEM. I hope that in a small way my club has helped to do this.

Migrating a VirtualBox Windows installation

I have been using Linux as my primary OS since 1999ish, except for a brief period early in the history of 67 Bricks when I had an iMac. Whenever I have used Windows it has invariably been in some kind of virtualised form; this was necessary in the iMac days when I was developing .NET applications in Visual Studio, but these days I work solely on Scala / Play projects developed in IntelliJ in Linux. Nevertheless, I have found it convenient to have an installation of Windows available for the rare instances where it’s actually necessary (for example, to connect to someone’s VPN for which no Linux client is available).

My Windows version of choice is the venerable Windows 7. This is the last version of Windows which can be configured to look like “proper Windows” as I see it by disabling the horrible Aero abomination. I tried running the Windows 10 installer once out of morbid curiosity; it started talking to me, so I stopped running it. I am old and set in my ways, and I feel strongly that an OS does not need to talk to me.

So anyway, I had a VirtualBox installation of Windows 7 and, because I am an extraordinarily kind and generous soul, I had given it a 250G SATA SSD all to itself using VirtualBox’s raw hard disk support.

Skip forward a few years, and I decided I would increase the storage in my laptop by replacing this SSD with a 2TB SSD since such storage is pretty cheap these days. The problem was – what to do with the Windows installation? I didn’t fancy the faff of reinstalling it and the various applications, and in fact I wasn’t even sure this would be possible given that it’s no longer supported. In any case, I didn’t want to give Windows the entire disk this time, I wanted a large partition available to Linux.

It turns out that VirtualBox’s raw disk support will let you expose specific partitions to the guest rather than the whole disk. The problem is that with a full raw disk, the guest sees the boot sector (containing the partition table and probably the bootloader), whereas you presumably don’t want the guest to see the boot sector if you’re only exposing certain partitions. How does this work?

The answer is that when you create a raw disk image and specify specific partitions to be available, in addition to the sda.vmdk file, you also get a sda-pt.vmdk file containing a copy of the boot sector, and this is presented to the guest OS instead of the real thing. Here, then, are the steps I took to clone my Windows installation onto partitions of the new SSD, keeping a partition free for Linux use, and ensuring Windows still boots. Be warned that messing about with this stuff and making a mistake can result in possibly irrecoverable data loss!

Step 1 – list the partitions on the current drive

My drive presented as /dev/sda

$ fdisk -x /dev/sda
Disk /dev/sda: 232.89 GiB, 250059350016 bytes, 488397168 sectors
Disk model: Crucial_CT250MX2
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 4096 bytes
I/O size (minimum/optimal): 4096 bytes / 4096 bytes
Disklabel type: dos
Disk identifier: 0xc3df4459

Device     Boot  Start       End   Sectors Id Type            Start-C/H/S   End-C/H/S Attrs
/dev/sda1  *      2048    206847    204800  7 HPFS/NTFS/exFAT     0/32/33   12/223/19    80
/dev/sda2       206848 488394751 488187904  7 HPFS/NTFS/exFAT   12/223/20 1023/254/63 

Step 2 – create partitions on the new drive

I put the new SSD into a USB case and plugged it in, whereupon it showed up as /dev/sdb

Then you need to do the following:

  1. Use fdisk /dev/sdb to edit the partition table on the new drive
  2. Create two partitions with the same number of sectors (204800 and 488187904) as the drive you are hoping to replace, and then you might as well turn all of the remaining space into a new partition
  3. Set the types of the partitions correctly – the first two should be type 7 (HPFS/NTFS/exFAT), if you’re creating another partition it should be type 83 (Linux)
  4. Toggle the bootable flag on the first partition
  5. Use the “expert” mode of fdisk to set the disk identifier to match that of the disk you are cloning; in my case this was 0xc3df4459 – Googling suggested that Windows may check for this and some software licenses may be tied to it.

So now we have /dev/sdb looking like this:

$ fdisk -x /dev/sdb
Disk /dev/sdb: 1.82 TiB, 2000398934016 bytes, 3907029168 sectors
Disk model: 500SSD1         
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: dos
Disk identifier: 0xc3df4459

Device     Boot     Start        End    Sectors Id Type            Start-C/H/S End-C/H/S Attrs
/dev/sdb1  *         2048     206847     204800  7 HPFS/NTFS/exFAT     0/32/33 12/223/19    80
/dev/sdb2          206848  488394751  488187904  7 HPFS/NTFS/exFAT   12/223/20 705/42/41 
/dev/sdb3       488394752 3907029167 3418634416 83 Linux             705/42/42 513/80/63

Note that the first two partitions are an exact match in terms of start, end and sector count. I think we can ignore the C/H/S (cylinder / head / sector) values.

Step 3 – clone the data

Taking great care to get this right, we can use the dd command and we can tell it to report its status from time to time so we know something is happening. There are two partitions to clone:

$ dd if=/dev/sda1 of=/dev/sdb1 bs=1G status=progress
... some output (this is quite quick as it's only 100M)
$ dd if=/dev/sda2 of=/dev/sdb2 bs=1G status=progress
... this took a couple of hours to copy the partition, the bottleneck probably being the USB interface

Step 4 – create (temporarily) a VirtualBox image representing the new disk

There are various orders in which one could do the remaining steps, but I did it like this, while I still had /dev/sda as the old disk and the new one plugged in via USB as /dev/sdb

$ VBoxManage internalcommands createrawvmdk -filename sdb.vmdk -rawdisk /dev/sdb -partitions 1,2

This gives two files; sdb.vmdk and sdb-pt.vmdk. To be honest, I got to this point not knowing how the boot sector on the disk would show up, but having done it I was able to verify that sdb-pt.vmdk appeared to be a copy of the first sector of the real physical disk (at least the first 512 bytes of it). I did this by comparing the output of xxd sdb-pt.vmdk | head -n 32 with xxd /dev/sdb | head -n 32 which uses xxd to show a hex dump and picks the first 32 lines which happen to correspond to the first 512 bytes. In particular, you can see at offset 0x1be the start of the partition table, which is 4 blocks of 16 bytes each ending at 0x1fd, with the magic signature 55aa at 0x1fe. The disk identifier can also be seen as the 4 bytes starting at 0x1b8. And note that everything leading up to the disk identifier is all zeroes.
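To make those offsets a bit more concrete, here is a minimal sketch of a function that picks the same fields out of a 512-byte boot sector. It assumes the classic MBR layout described above, with each 16-byte partition entry holding a status byte, a type byte at offset 4, and little-endian start and length fields at offsets 8 and 12. The names `summariseMbr`, `MbrSummary` and `PartitionEntry` are my own and do not come from VirtualBox or fdisk.

```typescript
interface PartitionEntry {
  bootable: boolean;   // status byte 0x80 marks the active partition
  type: number;        // partition type id, e.g. 0x07 or 0x83
  startSector: number; // first sector (LBA)
  sectorCount: number; // length in sectors
}

interface MbrSummary {
  diskIdentifier: string;
  partitions: PartitionEntry[];
  hasBootSignature: boolean;
}

function summariseMbr(sector: Uint8Array): MbrSummary {
  if (sector.length < 512) {
    throw new Error("Expected at least 512 bytes");
  }
  const view = new DataView(sector.buffer, sector.byteOffset, 512);

  // Disk identifier: 4 bytes at 0x1b8, stored little-endian.
  const id = view.getUint32(0x1b8, true);
  const diskIdentifier = "0x" + ("00000000" + id.toString(16)).slice(-8);

  // Partition table: four 16-byte entries starting at 0x1be.
  const partitions: PartitionEntry[] = [];
  for (let i = 0; i < 4; i++) {
    const offset = 0x1be + i * 16;
    partitions.push({
      bootable: view.getUint8(offset) === 0x80,
      type: view.getUint8(offset + 4),
      startSector: view.getUint32(offset + 8, true),
      sectorCount: view.getUint32(offset + 12, true),
    });
  }

  // Magic signature 55 aa in the final two bytes of the sector.
  const hasBootSignature =
    view.getUint8(0x1fe) === 0x55 && view.getUint8(0x1ff) === 0xaa;

  return { diskIdentifier, partitions, hasBootSignature };
}
```

Fed the first 512 bytes of sdb-pt.vmdk, a function like this should report the same disk identifier, start sectors and sector counts that fdisk printed earlier.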

Step 5 – dump the boot sector of the existing disk

At this point, we have /dev/sda with the bootsector containing the Windows bootloader, we have a freshly initialised /dev/sdb containing a copy of the Windows partitions and a bootsector that’s empty apart from the partition table and disk identifier, and we have the sdb-pt.vmdk file containing a copy of that empty bootsector. We have also arranged for the new disk to have the same identifier as the old disk.

What is needed now is to create a new sdb-pt.vmdk file containing the bootloader code from /dev/sda and then the partition table from /dev/sdb. There are 444 bytes that we need from the former. So we can do something like this:

$ head -c 444 /dev/sda > sdb-pt-new.vmdk
$ tail -c +445 sdb-pt.vmdk >> sdb-pt-new.vmdk

We can confirm that the new file has the same size as the original. We can also use xxd to confirm that we’ve got the bootloader code and that everything from that point on is as it was (including the new partition table).
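For anyone who prefers to see the splice as code rather than head and tail, the same byte-level operation can be sketched as below. This is only an illustration of the logic, not something I ran as part of the migration, and the names `spliceBootSector` and `BOOT_CODE_LENGTH` are hypothetical.

```typescript
// Number of leading bytes taken from the old disk (bootloader code plus
// the disk identifier), matching the 444 used with head and tail.
const BOOT_CODE_LENGTH = 444;

// Returns a new image: the first 444 bytes come from the old disk's boot
// sector, everything after that comes from the new partition table image.
// Neither input is modified.
function spliceBootSector(
  oldBootSector: Uint8Array,
  newPartitionTableImage: Uint8Array
): Uint8Array {
  if (
    oldBootSector.length < BOOT_CODE_LENGTH ||
    newPartitionTableImage.length < BOOT_CODE_LENGTH
  ) {
    throw new Error("Both inputs must be at least 444 bytes long");
  }
  // Copy the new image, then overwrite its first 444 bytes.
  const result = new Uint8Array(newPartitionTableImage);
  result.set(oldBootSector.subarray(0, BOOT_CODE_LENGTH), 0);
  return result;
}
```

The result always has the same length as the new image, which is the same property checked by comparing the file sizes.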

Step 6 – the switch

We’re almost done. All that’s left to do is:

  1. Replace the old SATA SSD with the new 2TB SSD
  2. Run VirtualBox – do not start Windows but instead remove the existing hard drive attached to it (which is the “full raw disk” version of sda.vmdk), and use the media manager to delete this.
  3. Quit VirtualBox
  4. Recreate the raw disk with partitions file but for /dev/sda (which is what the new drive is): VBoxManage internalcommands createrawvmdk -filename sda.vmdk -rawdisk /dev/sda -partitions 1,2
  5. The previous command will have created the sda-pt.vmdk boot sector image from the drive, which will again be full of zeroes. Overwrite this with the sdb-pt-new.vmdk that we made earlier, obviously ensuring we preserve the original sda-pt.vmdk name and the file permissions.
  6. Start VirtualBox, add the newly created sda.vmdk image to the media manager, and then as the HD for the Windows VM
  7. Start the Windows VM and hope for the best

I was very gratified to find that this worked more-or-less first time. In truth, I originally just tried replacing the old sda.vmdk with the new one, which gave the error “{5c6ebcb7-736f-4888-8c44-3bbc4e757ba7} of the medium ‘/var/daniel/virtualbox/Windows 7/sda.vmdk’ does not match the value {e6085c97-6a18-4789-b862-8cebcd8abbf7} stored in the media registry (‘/home/daniel/.config/VirtualBox/VirtualBox.xml’)”. It didn’t seem to want to let me remove this missing drive from the machine using the GUI, so I edited the Windows 7.vbox file to remove it, then removed it from the media library and added the replacement, which I was then able to add to the VM.

Programming a Tesla

I have a friend called Chris who is a big fan of the band China Drum. Many years ago he challenged me to program their song Last Chance as a custom ringtone on his Nokia phone and, being vaguely musical, I obliged.

Time has moved on since then. With his hitherto rock-star hair cut to a respectable length, he is now the CEO of a company providing disease model human cells. And he owns a Tesla, something he likes to remind me about from time to time. Now it turns out that one of the silly things you can do with a Tesla is program the lights to flash to make a custom light show for a piece of music of your choice. You can probably see where this is heading.

Since a) I was going up to meet him (and his Tesla) in Yorkshire last weekend and b) I wasn’t sure how long the “I’ve been busy” excuse would work in response to the “Where’s my light show?” question, I figured I’d probably better actually try to do the damn thing.

Problem 1 – I don’t have the song

Sadly I didn’t have a copy of China Drum’s Goosefair album, so I didn’t have the audio. But I do have Linux, and a Spotify subscription, and the command-line ncspot client. I reasoned that if I could play the audio, I could surely also record the audio, it just might mean losing the will to live while trying to understand how Pulseaudio works (or maybe it’s PipeWire now, or who knows?)

Cutting a long and tedious story short, by doing some mystical fiddling about, I was able to send the output of ncspot to Audacity and thereby record myself a WAV of the song. In the interests of remaining on the right side of the law, I made good faith attempts to locate an original copy of the album, but it seems to be out of print. So I now have a copy via eBay.

Problem 2 – I don’t have the application for building the light show

To build a light show, one needs an open-source application called xLights. Since this doesn’t appear to be in the Ubuntu package repository, I had to build it from source. For some reason, I’m a bit averse to installing random libraries and things on my machine, but fortunately there is a “build it in Docker” option which I used and seemed to work successfully, except that I couldn’t figure out how to get at the final built application! It existed as a file in the filesystem of the docker container, but since the executing script had finished, the container wasn’t running, and there seemed to be no obvious way to get at it (it is entirely possible that I was just being stupid, of course). In the end, I reasoned that any file on a container filesystem has to be in my /var/lib directory somewhere, and with a bit of poking around, I located the xLights-2023.08-x86_64.AppImage file and copied it somewhere sensible.

Problem 3 – I don’t know where the beats are

I followed various instructions and got myself set up with a working application, a fresh musical sequence project for the file, and the .wav imported and displaying as a waveform.

The way xLights works is that you start with a bunch of horizontal lines representing available lights, you set up a bunch of timing markers which present as vertical lines thus dividing the work area up into a grid, and you can create light events in various cells of the resulting grid and the start / stop transitions of each light are aligned with the timing markers (though they can subsequently be moved). Thus you need a way of creating these markers. Fortunately, it is possible to download an audio plugin to figure these out for you. After doing this, you end up with a screen looking something like this:

An empty xLights grid.

Problem 4 – I can’t copy and paste

So I started filling things in with the idea that if I got something that I was happy with for a couple of bars, I could copy and paste it elsewhere rather than having to enter every note manually. Unfortunately, I ran into an unexpected problem, which is that the timing on the track is very variable. For example, I picked a couple of channels and used one to show where the “1” beats were in each bar and put lights on “2”, “3” and “4” in the other. But if you then try to copy and paste this bar into two bars, then four, then eight and so on, you quickly get out of sync with the beat lines, because the band speed up as the song goes on. Also, I couldn’t see any obvious option in xLights to quantise a track (i.e. to adjust the starts and ends of notes to match a set of timing marks).

Fortunately, the format that xLights uses to save these sequences is XML-based. Therefore I was able to write a Scala application to read in the sequence, make a note of all of the timestamps corresponding to the timing marks, and then shuffle all starts and ends of notes to the nearest timestamp. Actually, originally I wrote a thing to try to regularise the timestamp markers (i.e. keep the same number of them but make the spacing uniform) and quickly realised that the resulting markers were woefully out of sync with the music, which is when I realised the tempo of the song was variable.
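The quantising step can be sketched roughly as follows. This is not the original Scala application, just a minimal Python illustration. It assumes the timing marks and the start/end times of each note have already been read out of the XML as millisecond values; the names nearest_mark and quantise are made up for the example.

```python
from bisect import bisect_left


def nearest_mark(marks, t):
    """Return the timing mark closest to time t (marks must be sorted and non-empty)."""
    i = bisect_left(marks, t)
    if i == 0:
        return marks[0]
    if i == len(marks):
        return marks[-1]
    before, after = marks[i - 1], marks[i]
    # When t is exactly halfway between two marks, prefer the earlier one.
    return before if t - before <= after - t else after


def quantise(effects, marks):
    """Snap every (start, end) pair to the nearest timing marks."""
    marks = sorted(marks)
    return [(nearest_mark(marks, start), nearest_mark(marks, end))
            for start, end in effects]


# Hypothetical data: the marks drift because the tempo is not constant.
marks = [0, 500, 980, 1440]
effects = [(10, 490), (520, 1000), (1200, 1450)]
snapped = quantise(effects, marks)
print(snapped)
```

One thing a real version would need to watch for is a very short note whose start and end both snap to the same mark, which would leave it with zero length.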

Problem 5 – I can’t count

I dimly remembered from the Tesla light-show instructions that there was a limit of 3500 lights in a show, so I had to make sure the total number didn’t go above that. I couldn’t see an obvious way to do this in xLights, but I could see that it was just a question of running a suitable XPath expression against the XML file. So I fired up oXygen and wrote one (the correct XPath – see below – is count(//Effect[not(parent::EffectLayer/parent::Element[@type='timing'])]))

And so my development cycle was basically:

  • Stare at lots of blobs on the grid, adding and removing them and generally twiddle until it seems satisfactory
  • Run the Scala application to quantise it all to the timing marks, thus dealing with any slight mismatches caused by copying-and-pasting
  • Check in oXygen how many lights I’ve used (editing the raw XML was also useful to copy between channels e.g. to make the rear left turn indicators do the same as the front left turns)

Problem 6 – I can’t read instructions

Having got something I was reasonably happy with, I double-checked the instructions. Oh no! I am an idiot! It doesn’t say 3500 lights, it says 3500 commands where a command is “turn light on” or “turn light off”. So I now have twice the number of allowed lights, and drastic editing would be needed (naturally, this was the Friday night before I was due to drive up with it, after a number of late nights working on it).

Fortunately, I had also been an idiot (very slightly) with my counting XPath; because of the way the XML format works, each timing mark was being counted as an event. So having tweaked it not to count the timing marks, I had around 2600 lights, which is rather more than the 1750 budget but less bad than the 3400ish I started with.
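To make the counting mistake concrete, here is a small Python sketch of the same idea as the corrected XPath: count every Effect, then subtract the ones that sit under an Element of type "timing". The sample XML is invented for illustration and only mirrors the element names that appear in the XPath (Element, EffectLayer, Effect); a real xLights file contains much more.

```python
import xml.etree.ElementTree as ET

# Invented sample: one timing track with two marks, two light channels with three effects.
SAMPLE = """
<xsequence>
  <ElementEffects>
    <Element type="timing" name="Beats">
      <EffectLayer>
        <Effect startTime="0" endTime="500"/>
        <Effect startTime="500" endTime="980"/>
      </EffectLayer>
    </Element>
    <Element type="model" name="Left turn signal">
      <EffectLayer>
        <Effect startTime="0" endTime="500"/>
      </EffectLayer>
    </Element>
    <Element type="model" name="Right turn signal">
      <EffectLayer>
        <Effect startTime="0" endTime="500"/>
        <Effect startTime="500" endTime="980"/>
      </EffectLayer>
    </Element>
  </ElementEffects>
</xsequence>
""".strip()


def count_lights(root):
    """Count Effect elements, excluding those whose grandparent is a timing Element."""
    total = sum(1 for _ in root.iter("Effect"))
    timing = sum(
        len(element.findall("./EffectLayer/Effect"))
        for element in root.iter("Element")
        if element.get("type") == "timing"
    )
    return total - timing


root = ET.fromstring(SAMPLE)
naive_count = sum(1 for _ in root.iter("Effect"))  # also counts the timing marks
lights = count_lights(root)
commands = 2 * lights  # each light needs an "on" and an "off" command
print(naive_count, lights, commands)
```

With a limit of 3500 commands, the value to keep an eye on is commands rather than lights, which is where the budget of 1750 lights comes from.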

So I had to scale things back a bit. I dropped the tracks I’d been using for the “1 2 3 4” beat. I removed some of the doubling up between channels. I dropped some of the shorter notes in tracks where I’d been trying to reproduce the rhythm of the vocals. I simplified bits and generally chopped it about until I had something that met the requirements.

Part of the finished product

Problem 7 – I don’t have a USB stick

Well, I didn’t at the start of the project. I did by the end; the smallest one I could find in Currys was 32GB, which is ridiculous overkill for the size of the files needed (the audio was 27M, the compiled lightshow file was 372K). But, well, whatever…

And that was it. I followed the instructions about what needed to be on the stick (a folder called “LightShow” with that exact capitalisation containing “lightshow.wav” and “lightshow.fseq”) and took it with me to Yorkshire. Chris plugged it into the Tesla. It ran successfully. Hooray!

I’m not sure there’s a great moral to this story (other than maybe “read the instructions carefully”) but it was a fun challenge. Thanks to China Drum for a great song, Tesla for building the light show feature, and the xLights authors for an open-source application that made building the light show possible.

How we do centralized logging at 67 Bricks

If you’ve had a look around the 67 Bricks website, you probably know that we work with quite a few clients. For most of the clients we host their infrastructure, which makes it easier for us to manage it and troubleshoot any issues when they occur. Each client’s infrastructure resides in its own AWS account, which is a part of AWS Organizations. We also have a logging AWS account which is used for infrastructure resources used by client accounts. In this shared account we have set up an ELK stack to collect logs from multiple clients in one place. In this post I will explain how it is set up.

What is ELK?

ELK stands for ElasticSearch, Logstash and Kibana.

A note: in this post I’m going to mention Amazon OpenSearch Service. In the past it was called Amazon ElasticSearch Service. Amazon OpenSearch Service uses a fork of an older version of ElasticSearch and Kibana. The name ELK, however, seems to have stuck even if OpenSearch is used instead of ElasticSearch (and ELK sounds nicer than OLK, in my opinion).

What is the infrastructure like?

The main elements are an AWS Managed OpenSearch instance and an EC2 reverse proxy, which directs requests to OpenSearch. In terms of networking we have VPC peering connections between the VPC of the logging account, where the OpenSearch instance resides, and the client account VPCs.

To clarify the above diagram:

Applications send log entries via peering connections. In order for them to be able to do that, the following is required:

1) The security group attached to the servers or containers running the applications must have a rule that allows traffic on port 443 from the CIDR block of the logging account VPC

2) The route table of the VPC in the client accounts must have a route with the logging account VPC CIDR as destination and peering connection as target

3) The security group attached to the OpenSearch instance must allow traffic on port 443 from CIDR ranges of client account VPCs

4) The route table of the VPC in the logging account must have routes with the client account VPC CIDRs as destination and peering connection as target

How do the applications send logs to the ELK instance?

Most applications that send logs are Scala Play applications – they use the Logback framework for logging, configured in a logback.xml file. We have an appender section for ELK logs – we add it to all applications that send their logs to ELK, thereby ensuring that log entries are the same regardless of the system and have the same fields:

 <appender name="ELK" class="com.internetitem.logback.elasticsearch.ElasticsearchAppender">
    <url>${elkEndpoint}/_bulk</url>
    <!-- This nested %replace expression takes the first letter of the level and maps D and T
    (for DEBUG and TRACE) to d and maps other levels to i -->
    <index>someClient-${environment}-someApp-logs-%replace(%replace(%.-1level){'[DT]', 'd'}){'[A-Z]', 'i'}-%date{yyyy-MM-dd}</index>
    <type>log</type>
    <loggerName>es-logger</loggerName>
    <errorsToStderr>false</errorsToStderr>
    <includeMdc>true</includeMdc>
    <maxMessageSize>4096</maxMessageSize>
    <properties>
      <property>
        <name>client</name>
        <value>iclr</value>
      </property>
      <property>
        <name>service</name>
        <value>ingestion</value>
      </property>
      <property>
        <name>host</name>
        <value>${HOSTNAME}</value>
        <allowEmpty>false</allowEmpty>
      </property>
      <property>
        <name>severity</name>
        <value>%level</value>
      </property>
      ...
  </appender>

Here, environment, elkEndpoint and HOSTNAME are environment variables. environment and elkEndpoint are injected in the EC2 launch template and are populated when the instances are being started.

You can also see the <index> element. Because the index name ends with the current date, this will create a new index every day. Before we start sending application logs to ELK, we create an index pattern for each application. An index pattern allows you to select data and can include one or more indices. For example, if we have an index pattern someClient-live-someApp-logs-d-*, it would include indices someClient-live-someApp-logs-d-2023-02-27, someClient-live-someApp-logs-d-2023-03-13, someClient-live-someApp-logs-d-2023-03-14 and so on.
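To illustrate how the index name is derived, here is a small Python sketch that mimics the nested %replace expression from the appender: take the first letter of the level, map D and T to d, map any other capital letter to i, then append the date. It is only an imitation of the Logback pattern for explanatory purposes; the function name index_name and its parameters are invented for the example.

```python
import re
from datetime import date
from fnmatch import fnmatchcase


def index_name(client, environment, app, level, day):
    """Mimic the <index> pattern: ...-logs-<d|i>-<yyyy-MM-dd>."""
    first_letter = level[:1]
    # Inner replace: DEBUG and TRACE become "d".
    suffix = re.sub(r"[DT]", "d", first_letter)
    # Outer replace: any remaining capital letter (INFO, WARN, ERROR) becomes "i".
    suffix = re.sub(r"[A-Z]", "i", suffix)
    return f"{client}-{environment}-{app}-logs-{suffix}-{day:%Y-%m-%d}"


debug_index = index_name("someClient", "live", "someApp", "DEBUG", date(2023, 3, 13))
error_index = index_name("someClient", "live", "someApp", "ERROR", date(2023, 3, 13))
print(debug_index)
print(error_index)

# The index pattern from the text selects the debug/trace indices only.
pattern = "someClient-live-someApp-logs-d-*"
print(fnmatchcase(debug_index, pattern), fnmatchcase(error_index, pattern))
```

The effect is that debug and trace entries land in their own daily indices, separate from everything at info level and above, so the two groups can be selected with different index patterns.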

How do you know if there is a problem with logs?

We have monitors set up in Kibana which check that there are logs coming in. This check is run every 10 minutes on each index pattern, and if it doesn’t find any log entries, it sends an alert to an SNS topic which in turn sends an email to inform us that there is a problem. The configuration of alarms and monitors can be done from the OpenSearch Plugins menu of Kibana.

At the moment these monitors are created manually for each index pattern, which is not ideal because it does take a bit of time setting them up; therefore one of the tasks on our to-do list is to automate monitor creation.

How do you make sure that the OpenSearch instance always has enough space?

When each new index pattern is created, we apply a lifecycle policy to it. For example, we delete info logs after a week; when the index is 7 days old, it starts to transition into the Delete state. We also have a CloudWatch alarm which monitors the FreeStorageSpace metric in the AWS/ES namespace.
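As a rough illustration of a "delete after a week" policy, the Python sketch below builds a policy document in the general shape used by OpenSearch Index State Management and prints it as JSON. It is not the policy we actually run: the state names, the priority value and the helper name delete_after are assumptions made for the example, and the field names are worth checking against the ISM documentation for your OpenSearch version.

```python
import json


def delete_after(index_pattern, days):
    """Build an ISM-style policy that deletes matching indices once they are `days` old."""
    return {
        "policy": {
            "description": f"Delete {index_pattern} indices after {days} days",
            "default_state": "active",
            "states": [
                {
                    # Indices start here and wait until they are old enough.
                    "name": "active",
                    "actions": [],
                    "transitions": [
                        {
                            "state_name": "delete",
                            "conditions": {"min_index_age": f"{days}d"},
                        }
                    ],
                },
                {
                    # Entering this state removes the index.
                    "name": "delete",
                    "actions": [{"delete": {}}],
                    "transitions": [],
                },
            ],
            # Assumed: attach the policy automatically to new matching indices.
            "ism_template": {"index_patterns": [index_pattern], "priority": 100},
        }
    }


policy = delete_after("someClient-live-someApp-logs-i-*", 7)
print(json.dumps(policy, indent=2))
```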

How do YOU centralize logs from multiple systems on AWS? 🙂

Setting up local AWS environment using Localstack

When Cloud services are used in an application, it might be tricky to mock them during local development. Some approaches include: 1) doing nothing thus letting your application fail when it makes a call to a Cloud service; 2) creating sets of fake data to return from calls to AWS S3, for example; 3) using an account in the Cloud for development purposes. A nice in-between solution is using Localstack, a Cloud service emulator. Whereas the number of services available and the functionality might be a bit limited compared to the real AWS environment, it works rather well for our team.

This article will describe how to set it up for local development in Docker.

Docker-compose setup:

In the services section of our docker-compose.yml we have the Localstack container definition:

localstack:
    image: localstack/localstack:latest
    hostname: localstack
    environment:
      - SERVICES=s3,sqs
      - HOSTNAME_EXTERNAL=localstack
      - DATA_DIR=/tmp/localstack/data
      - DEBUG=1
      - AWS_ACCESS_KEY_ID=test
      - AWS_SECRET_ACCESS_KEY=test
      - AWS_DEFAULT_REGION=eu-central-1
    ports:
      - "4566:4566"
    volumes:
      - localstack-data:/tmp/localstack:rw
      - ./create_localstack_resources.sh:/docker-entrypoint-initaws.d/create_localstack_resources.sh

Although we don’t need to connect to any AWS account, we do need dummy AWS variables (with any value). We specify which services we want to run using Localstack – in this case it’s SQS and S3.

We also need to set HOSTNAME_EXTERNAL because SQS API needs the container to be aware of the hostname that it can be accessed on.

Another point is that we cannot use the entrypoint definition because Localstack has a directory docker-entrypoint-initaws.d from where shell scripts are run when the container starts up. That’s why we’re mapping the container volume to a folder where those scripts are. In our case create_localstack_resources.sh will create all the necessary S3 buckets and the SQS queue:

EXPECTED_BUCKETS=("bucket1" "bucket2" "bucket3")
EXISTING_BUCKETS=$(aws --endpoint-url=http://localhost:4566 s3 ls --output text)

echo "creating buckets"
for BUCKET in "${EXPECTED_BUCKETS[@]}"
do
  echo "$BUCKET"
  # Only create the bucket if it is not already listed
  if [[ $EXISTING_BUCKETS != *"$BUCKET"* ]]; then
    aws --endpoint-url=http://localhost:4566 s3 mb "s3://$BUCKET"
  fi
done

# Look up the existing queues so that the queue is only created once
EXPECTED_QUEUE="my-queue"
EXISTING_QUEUE=$(aws --endpoint-url=http://localhost:4566 sqs list-queues --output text)

echo "creating queue"
if [[ $EXISTING_QUEUE != *"$EXPECTED_QUEUE"* ]]; then
    aws --endpoint-url=http://localhost:4566 sqs create-queue --queue-name "$EXPECTED_QUEUE" \
    --attributes '{
      "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:eu-central-1:000000000000:my-dead-letter-queue\",\"maxReceiveCount\":\"3\"}",
      "VisibilityTimeout": "120"
    }'
fi

Note that the AWS CLI commands differ from the ones you would run against real AWS (which would create resources in the account for which you have credentials set up!): they include the Localstack endpoint flag, --endpoint-url=http://localhost:4566

Configuration files

We use Scala with the Play framework for this particular application, and therefore have .conf files. In the local.conf file we have the following:

aws {
  localstack.endpoint = "http://localstack:4566"
  region = "eu-central-1"
  s3.bucket1 = "bucket1"
  s3.bucket2 = "bucket2"
  sqs.my_queue = "my-queue"
  sqs.queue_enabled = true
}

The real application.conf file has resource names injected at the instance startup. They live in an autoscaling group launch template where they are created by Terraform (out of scope of this post).

Initializing SQS client based on the environment

The example here is for creating an SQS client. Below are snippets most relevant to the topic.

In order to initialize the SQS Service so that it can be injected into other services we can do this:

lazy val awsSqsService: QueueService = createsSqsServiceFromConfig()

In createsSqsServiceFromConfig we check if the configuration has a Localstack endpoint and if so, we build LocalStack client:

protected def createsSqsServiceFromConfig(): QueueService = {
  readSqsClientConfig().map { config =>
    val sqsClient: SqsClient = config.localstackEndpoint match {
      case Some(endpoint) => new LocalStackSqsClient(endpoint, config.region)
      case None => new AwsSqsClient(config.region)
    }
    new SqsQueueService(config.queueName, sqsClient)
  }.getOrElse(fakeAwsSqsService)
}

readSqsClientConfig is used to get configuration values from .conf files:

private def readSqsClientConfig = {
  val sqsName = config.get[String]("aws.sqs.my_queue")
  val sqsRegion = config.get[String]("aws.region")
  val localStackEndpoint = config.getOptional[String]("aws.localstack.endpoint")
  SqsClientConfig(sqsName, sqsRegion, localStackEndpoint)
}

Finally LocalStackSqsClient initialization looks like this:

class LocalStackSqsClient(endpoint: String, region: String) extends SqsClient with Logging {
  private val sqsEndpoint = new EndpointConfiguration(endpoint, region)
  private val awsCreds = new BasicAWSCredentials("test", "test")
  private lazy val sqsClientBuilder = AmazonSQSClientBuilder.standard()
    .withEndpointConfiguration(sqsEndpoint)
    .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
  private lazy val client = sqsClientBuilder.build()

  override def BuildClient(): AmazonSQS = {
    log.debug("Initializing LocalStack SQS service")
    client
  }
}

Real AWS Client for the test/live environment (a snippet):

    AmazonSQSClientBuilder.standard()
      .withCredentials(new DefaultAWSCredentialsProviderChain)
      .withRegion(region)

Notice that we need fake BasicAWSCredentials, which allow us to pass in a dummy AWS access key and secret key, and then we use AWSStaticCredentialsProvider, an implementation of AWSCredentialsProvider that just wraps static AWSCredentials. When the real AWS environment is used, instead of AWSStaticCredentialsProvider we use DefaultAWSCredentialsProviderChain, which picks the EC2 Instance Role if it’s unable to find credentials by any other method.

And that’s it. Happy coding!

Unit testing 101 – mob rulz

In a recent developer forum I made the rather wild decision to try to demonstrate the principles of unit testing via an interactive mobbing session. I came prepared with some simple C# functions based around an ASP.NET Core API and said “let’s write the tests together”. The resultant session unfolded not quite how I anticipated, but it was still lively, fun and informative.

The first function I presented was fairly uncontentious – the humble fizzbuzz:

[HttpGet]
[Route("fizzbuzz")]
public string GetFizzBuzz(int i)
{
    string str = "";
    if (i % 3 == 0)
    {
        str += "Fizz";
    }
    if (i % 5 == 0)
    {
        str += "Buzz";
    }
    if (str.Length == 0)
    {
        str = i.ToString();
    }

    return str;
}

Uncontentious that was, until a bright spark (naming no names) piped up with questions like “Shouldn’t 6 return ‘fizzfizz’?”. Er… moving on…

I gave a brief introduction to writing tests using xUnit following the Arrange/Act/Assert pattern, and we collaboratively came up with the following tests:

[Fact]
public void GetFizzBuzz_FactTest()
{
    // Arrange
    var input = 1;

    // Act
    var response = _controller.GetFizzBuzz(input);

    // Assert
    Assert.Equal("1", response);
}

[Theory]
[InlineData(1, "1")]
[InlineData(2, "2")]
[InlineData(3, "Fizz")]
[InlineData(4, "4")]
[InlineData(5, "Buzz")]
[InlineData(9, "Fizz")]
[InlineData(15, "FizzBuzz")]
public void GetFizzBuzz_TheoryTest(int input, string output)
{
    var response = _controller.GetFizzBuzz(input);
    Assert.Equal(output, response);
}

So far so good. We had a discussion about the difference between “white box” and “black box” testing (where I nodded sagely and pretended I knew exactly what these terms meant before making the person who mentioned them provide a definition). We agreed that these tests were “white box” testing because we had full access to the source code and knew exactly what clauses we wanted to cover with our test cases. With “black box” testing we know nothing about the internals of the function and so might attempt to break it by throwing large integer values at it, or finding out exactly whether we got back “fizzfizz” with an input of 6.

Moving on – I presented a new function which does an unspecified “thing” to a string. It does a bit of error handling and returns an appropriate response depending on whether the thing was successful:

[Produces("application/json")]
[Route("api/[controller]")]
[ApiController]
public class AwesomeController : BaseController
{
    private readonly IAwesomeService _awesomeService;

    public AwesomeController(IAwesomeService awesomeService)
    {
        _awesomeService = awesomeService;
    }

    [HttpGet]
    [Route("stringything")]
    public ActionResult<string> DoAThingWithAString(
        string thingyString)
    {
        string response;

        try
        {
            response = _awesomeService
                           .DoAThingWithAString(thingyString);
        }
        catch (ArgumentException ex)
        {
            return BadRequest(ex.Message);
        }
        catch (Exception ex)
        {
            return StatusCode(500, ex.Message);
        }

        return Ok(response);
    }
}

This function is not stand-alone but instead calls a function in a service class, which does a bit of validation and then does the “thing” to the string:

public class AwesomeService : IAwesomeService
{
    private readonly IAmazonS3 _amazonS3Client;

    public AwesomeService(IAmazonS3 amazonS3Client)
    {
        _amazonS3Client = amazonS3Client;
    }

    public string DoAThingWithAString(string thingyString)
    {
        if (thingyString == null)
        {
            throw new ArgumentException("Where is the string?");
        }

        if (thingyString.Any(char.IsDigit))
        {
            throw new ArgumentException(
                @"We don't want your numbers");
        }

        var evens = 
            thingyString.Where((item, index) => index % 2 == 0);
        var odds = 
            thingyString.Where((item, index) => index % 2 == 1);

        return string.Concat(evens) + string.Concat(odds);
    }
}

And now the debates really began. The main point of contention was around the use of mocking. We can write an exhaustive test for the service function to exercise all the if clauses and check that the right exceptions are thrown. But when testing the controller function should we mock the service class or not?

Good arguments were provided for the “mocking” and “not mocking” cases. Some argued that it was easier to write tests for lower level functions, and if you did this then any test failures could be easily pinned down to a specific line of code. Others argued that for simple microservices with a narrow interface it is sufficient to just write tests that call the API, and only mock external services.

Being a personal fan of the mocking approach, and wanting to demonstrate how to do it, I prodded and cajoled the group into writing these tests to cover the exception scenarios:

public class AwesomeControllerTests
{
    private readonly AwesomeController _controller;
    private readonly Mock<IAwesomeService> _service;

    public AwesomeControllerTests()
    {
        _service = new Mock<IAwesomeService>();
        _controller = new AwesomeController(_service.Object);
    }

    [Fact]
    public void DoAThingWithAString_ArgumentException()
    {
        _service.Setup(x => x.DoAThingWithAString(It.IsAny<string>()))
            .Throws(new ArgumentException("boom"));

        var response = _controller.DoAThingWithAString("whatever")
                                  .Result;

        Assert.IsType<BadRequestObjectResult>(response);
        Assert.Equal(400, 
            ((BadRequestObjectResult)response).StatusCode);
        Assert.Equal("boom", 
            ((BadRequestObjectResult)response).Value);
    }

    [Fact]
    public void DoAThingWithAString_Exception()
    {
        _service.Setup(x => x.DoAThingWithAString(It.IsAny<string>()))
            .Throws(new Exception("boom"));

        var response = _controller.DoAThingWithAString("whatever")
                                  .Result;

        Assert.IsType<ObjectResult>(response);
        Assert.Equal(500, ((ObjectResult)response).StatusCode);
        Assert.Equal("boom", ((ObjectResult)response).Value);
    }        
}

Before the session descended into actual fisticuffs I rapidly moved on to discuss integration testing. I added a function to my service class that could read a file from S3:

public async Task<object> GetFileFromS3(string bucketName, string key)
{
    var obj = await _amazonS3Client.GetObjectAsync(
        new GetObjectRequest 
        { 
            BucketName = bucketName, 
            Key = key 
        });

    using var reader = new StreamReader(obj.ResponseStream);
    // Read asynchronously so the thread is not blocked while streaming from S3
    return await reader.ReadToEndAsync();
}

I then added a function to my controller which called this and handled a few types of exception:

[HttpGet]
[Route("getfilefroms3")]
public async Task<ActionResult<object>> GetFile(string bucketName, string key)
{
    object response;

    try
    {
        response = await _awesomeService.GetFileFromS3(
                             bucketName, key);
    }
    catch (AmazonS3Exception ex)
    {
        if (ex.Message.Contains("Specified key does not exist") ||
            ex.Message.Contains("Specified bucket does not exist"))
        {
            return NotFound();
        }
        else if (ex.Message == "Access Denied")
        {
            return Unauthorized();
        }
        else
        {
            return StatusCode(500, ex.Message);
        }
    }
    catch (Exception ex)
    {
        return StatusCode(500, ex.Message);
    }

    return Ok(response);
}

I argued that here we could write a full end-to-end test which read an actual file from an actual S3 bucket and asserted some things on the result. Something like this:

public class AwesomeControllerIntegrationTests : 
    IClassFixture<WebApplicationFactory<Api.Startup>>
{
    private readonly WebApplicationFactory<Api.Startup> _factory;

    public AwesomeControllerIntegrationTests(
        WebApplicationFactory<Api.Startup> factory)
    {
        _factory = factory;
    }

    [Fact]
    public async Task GetFileTest()
    {
        var client = _factory.CreateClient();

        var query = HttpUtility.ParseQueryString(string.Empty);
        query["bucketName"] = "mybucket";
        query["key"] = "mything/thing.xml";
        using var response = await client.GetAsync(
            $"/api/Awesome/getfilefroms3?{query}");
        using var content =  response.Content;
        var stringResponse = await content.ReadAsStringAsync();

        Assert.NotNull(stringResponse);
    }
}

At this point I was glad that the forum was presented as a video call because I could detect some people getting distinctly agitated. “Why do you need to call S3 at all?” Well maybe the contents of this file are super mega important and the whole application would fall over into a puddle if it was changed? Maybe there is some process which generates this file on a schedule and we need to test that it is there and contains the things we are expecting it to contain?

But … maybe it is not our job as a developer to care about the contents of this file and it should be some other team entirely who is responsible for checking it has been generated correctly? Fair point…

We then discussed some options for “integration testing” including producing some local instance of AWS, or building a local database in docker and testing against that.

And then we ran out of time. I enjoyed the session and I hope the other participants did too. It remains to be seen whether I will be brave enough to attempt another interactive mobbing session in this manner…

Spooky season special – tales of terrors and errors

Anyone who has been working in software development for more than a few months will know the ice-cold sensation that creeps over you when something isn’t working and you don’t know why. Luckily, all our team members have lived to tell the tale, and are happy to share their experiences so you might avoid these errors in future… 

The Legend of the Kooky Configuration – Rhys Parsons
In my first job, in the late 90s, I was working on a project for West Midlands Fire Service (WMFS). We were replacing a key component (the Data Flow Controller, or DFC) that controlled radio transmitters and was a central hub (GD92 router) for communicating with appliances (fire engines). Communication with the Hill Top Sites (radio transmitters) was via an X.25 network.

The project was going well, we had passed the Factory Acceptance Tests, and it was time to test it on-site. By this point, I was working on the project on my own, even though I only had about two years of experience. I drove down to Birmingham from Hull with the equipment in a hired car, a journey of around 3.5 hours. The project had been going on for about a year by this point, so there was a lot riding on this test. WMFS had to change their procedures to mobilise fire engines via mobile phones instead of radio transmitters, which, back in the late 90s, was quite a slow process (30 seconds call setup). I plugged in the computers and waited for the Hill Top Sites to come online. They didn’t. I scratched my head. A lot. For an entire day. Poring over code that looked fine. Then I packed it all up and drove back to Hull.

Back in the office, I plugged in the computer to test it. It worked immediately! Why?! How could it possibly have worked in Hull but not in Birmingham! It made absolutely no sense!

I hired a car for the next day and drove back down to Birmingham early, aiming to arrive just after 9, to avoid the shift change. By this point, I was tired and desperate.

I plugged the computer back in again. I had made absolutely no changes, but I could see no earthly reason why it wouldn’t work. “Sometimes,” I reasoned, “things just work.” That was my only hope. This was the second day WMFS were using slower backup communications. One day was quite a good test of their resilience. Two days were nudging towards the unacceptable. Station Officers were already complaining. I stared at the screen, willing the red graphical LEDs to turn green. They remained stubbornly red. At the end of the day, I packed up the computer and drove back to Hull.

The WMFS project manager phoned my boss. We had a difficult phone conversation, and we decided I should go again the next day.

Thankfully, a senior engineer who had experience of X.25 was in the office. I told him of this weird behaviour that made no sense. We spoke for about two minutes, and the conversation concluded with him saying, “What does the configuration look like?”

My mouth dropped. The most obvious explanation. I hadn’t changed the X.25 addresses! I was so busy wondering how the code could be so weirdly broken that I hadn’t considered looking at the configuration. So, so stupid! I hadn’t changed the configuration since I first set up the system, several months earlier; it just wasn’t in my mind as something that needed doing.

Day three. Drove to Birmingham, feeling very nervous and stupid. Plugged in the computer. Changed the X.25 addresses. Held my breath. The graphical LEDs went from red to orange, and then each Hill Top Site went green in turn, as the transmit token was passed back and forth between them and the replacement DFC. Finally, success!

A Nightmare on Character Street – Rosie Chandler
We recently implemented a database hosted in AWS, with the password stored in the AWS Secrets Manager. The password is pulled into the database connection string, which ends up looking something like this:

“Server=myfunkyserver;Port=1234;Database=mycooldatabase;User ID=bigboss;Password=%PASSWORD%”

Here, %PASSWORD% is substituted with the password pulled out of the Secrets Manager. One day, we found that calls to the database had started throwing connection exceptions, having worked perfectly well up until that point.

After spending a lot of time scratching my head and searching through logs, I decided to take a peek into the Secrets Manager to see what the password was. It turned out that day’s password was something like =*&^%$ (note it starts with “=”), which meant that the connection string for that day was invalid. After much facepalming, we implemented a one-line fix to ensure that “=” was added to the list of excluded characters for the password.
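To make the failure concrete, here is a minimal Python sketch of the same idea. It is not our actual fix, which was a configuration change in Secrets Manager. The names generate_password, build_connection_string and EXCLUDED_CHARACTERS are hypothetical, and the exact set of characters worth excluding is an assumption that depends on how your database driver parses connection strings.

```python
import secrets
import string

# Hypothetical exclusion list: characters assumed to be risky inside a
# "key=value;key=value" style connection string.
EXCLUDED_CHARACTERS = "=;'\" "


def generate_password(length: int = 16, excluded: str = EXCLUDED_CHARACTERS) -> str:
    """Generate a random password that avoids the excluded characters."""
    candidates = string.ascii_letters + string.digits + string.punctuation
    alphabet = [c for c in candidates if c not in excluded]
    return "".join(secrets.choice(alphabet) for _ in range(length))


def build_connection_string(password: str) -> str:
    """Substitute the password into the template, refusing unsafe characters."""
    unsafe = sorted(set(password) & set(EXCLUDED_CHARACTERS))
    if unsafe:
        raise ValueError(f"password contains unsafe characters: {unsafe}")
    template = (
        "Server=myfunkyserver;Port=1234;Database=mycooldatabase;"
        "User ID=bigboss;Password=%PASSWORD%"
    )
    return template.replace("%PASSWORD%", password)
```

With a check like this, a password such as =*&^%$ is rejected when the connection string is built, rather than surfacing later as a puzzling connection exception.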

The Case of the Phantom Invoices – Chris Rimmer
Many years ago I was responsible for writing code that would email out invoices to customers. I was very careful to set things up so that when the code was tested it would send messages to a fake email system, not a real one. Unfortunately, this wasn’t set up in a very fail-safe way, meaning that another developer managed to activate the email job in the test system and sent real emails to real customers with bogus invoices in them. This is not the sort of mistake you quickly forget. Since then I’ve been very careful configuring email systems in test environments so that they can only send emails to internal addresses.
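One way to make that kind of setup fail-safe is to filter recipients in code, so that anything other than an explicitly declared production environment can only reach internal addresses. The following is a rough Python sketch under assumptions: the domain example.com, the environment names and the functions is_internal and filter_recipients are all hypothetical, not taken from the original system.

```python
from email.utils import parseaddr

# Hypothetical internal domain; a real system would read this from configuration.
INTERNAL_DOMAINS = {"example.com"}


def is_internal(address: str) -> bool:
    """Return True if the address belongs to an internal domain."""
    _, email = parseaddr(address)
    domain = email.rpartition("@")[2].lower()
    return domain in INTERNAL_DOMAINS


def filter_recipients(recipients, environment):
    """Allow external recipients only when the environment is exactly 'production'.

    Any other value, including a missing or misspelt one, falls back to
    internal addresses only, so a misconfiguration fails safe.
    """
    if environment == "production":
        return list(recipients)
    return [r for r in recipients if is_internal(r)]
```

The design choice that matters here is the default: the restrictive behaviour applies unless production is named explicitly, so someone activating the job in a test system cannot reach real customers by accident.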


Tales from the Dropped Database – Rich Brown
It was a slow and rainy Thursday morning. I was just settling into my third cup of coffee when a fateful email appeared with the subject ‘Live site down!!!’

Ah, of course, nothing like a production issue to kick-start your morning. I checked the site: it was indeed down. Sadly, the coffee would have to wait.

Logging onto the server, I checked the logs. A shiver ran down my spine.

ERROR: SQL Error – Table ‘users’ does not exist

ERROR: SQL Error – Table ‘articles’ does not exist

ERROR: SQL Error – Table ‘authors’ does not exist

ERROR: SQL Error – Database ‘live-db’ does not exist

That’s… unusual…

Everything had been working and then suddenly it stopped. No data existed.

Hopping onto the database server showed exactly that. Everything was gone, every row, every table, even the database itself wasn’t there.

I pulled in the rest of the team and we scratched our collective heads. How could this even happen? The database migration system shouldn’t be able to delete everything. We’d taken all the right mitigations to prevent injection attacks. There was no reason for our application to do this.

I asked, “What was everyone doing when the database disappeared?”

Dev 1 – “Writing code for this new feature”

Dev 2 – “Updating my local database”

Dev 3 – “Having a meeting”

OK, follow-up question to Dev 2 – “How did you update your database?”

Dev 2 – “I dropped it and let the app rebuild it as I usually do”

Me – “Show me what you did”

ERROR: SQL Error – Cannot drop database ‘live-db’ because it does not exist

Turned out Dev 2 had multiple SQL Server Manager instances open, one connected to their local test database and the other connected to the live system to query some existing data.

They thought they were dropping their local database and ended up dropping live by mistake.

One quick database restore and everything was back to normal.

Moral of the story: follow the principle of least access. If a user only needs to read data, only grant them permission to read data.
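As a small illustration of that moral, the Python sketch below uses SQLite as a stand-in (the real system was SQL Server, where you would grant a read-only role instead). It opens the same database through a read-only connection and shows that a query succeeds while a destructive statement is refused. The helper names open_database and demo are hypothetical.

```python
import os
import sqlite3
import tempfile


def open_database(path: str, read_only: bool = True) -> sqlite3.Connection:
    """Open a SQLite database, read-only unless write access is requested."""
    mode = "ro" if read_only else "rwc"
    return sqlite3.connect(f"file:{path}?mode={mode}", uri=True)


def demo() -> str:
    """Create a small database, then try to drop a table via a read-only connection."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "live.db")

        # The "admin" connection is the only one allowed to write.
        admin = open_database(path, read_only=False)
        admin.execute("CREATE TABLE users (name TEXT)")
        admin.execute("INSERT INTO users VALUES ('alice')")
        admin.commit()
        admin.close()

        # The "reader" connection can query data but cannot change it.
        reader = open_database(path)
        rows = reader.execute("SELECT name FROM users").fetchall()
        try:
            reader.execute("DROP TABLE users")
            outcome = "dropped"
        except sqlite3.OperationalError:
            outcome = "refused"
        finally:
            reader.close()
        return f"{rows[0][0]}:{outcome}"
```

Had the connection used for querying live data been restricted like this, the accidental drop would have produced an error message instead of an outage.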