Coding principles 4: Test at multiple levels

This is the 3rd part of a series about 67 Bricks’s coding principles. The previous posts are: 12 and 3.

The principle

Test at multiple levels

I don’t think it’s controversial to say that tests are A Good Thing.

Functionality should be well tested so that we can be confident that it works correctly. Tests at different levels bring different benefits and should be combined to provide a high level of confidence in the software’s quality.

A rough common rule of thumb is that there should be:

  • lots of unit tests
    • these focus on individual units of code
    • they should be small, focused and quick to run
  • slightly fewer integration tests
    • these focus on testing multiple units together, potentially including external systems like databases
    • they tend to be a bit slower to run and more involved to set up
  • fewer again end-to-end tests
    • these test the whole stack
    • they generally test from the point of view of an end user or client of the system, so they might run via a headless browser or via a REST API
    • they tend to be comparatively slow and complex so they should be used sparingly and where they add real value
  • a small number of smoke tests
    • these are very basic end-to-end tests that can be run post-deployment or at regular intervals to check that the service is healthy

There is much that sensible people can disagree on in the above, like where the line sits between unit and integration tests; how much value there is in mocking in unit tests and much more. But I think the broader point that there is value in having tests at multiple levels stands.

By writing good tests at multiple levels, and running them often, it is possible to have a high level of confidence that a piece of software is in good working order.

Well written tests can bring a huge number of benefits, some of which are perhaps less obvious than others.

Tests verify that a piece of functionality works as intended

This is perhaps the most obvious benefit of tests: they test that some code does what you think it does.

While at 67 Bricks we are fairly agnostic to TDD (you’re welcome to use it, but you don’t have to), we do advocate for interleaving writing code with writing tests, rather than writing all the code first and leaving the tests till the end. Writing tests throughout the process can be hugely helpful in writing code that does what you intend it to with minimal bugs.

Tests encourage good, clean, modular code

It is a good rule of thumb that if a unit of code is hard to test, it’s probably an indication of a problem that you should fix. If it’s hard to test, perhaps it’s too tightly coupled or it’s making some undue assumptions or it has a confusing interface or it’s relying on hard-to-reason-about side effects… Wherever the difficulty springs from, the fact that it’s hard to test is a useful warning sign.

Tests act as specifications of behaviour

Each of your tests can act as an encapsulated description of how this unit of code is intended to act in particular circumstances. This is great as a way of documenting the developers’ intentions. If I come to a method and find myself wondering what to expect if null is passed into it, then my life will be made a lot easier if there’s a corresponding test like

it('throws an error when null is passed in', () => {

This example uses jest, a popular testing framework in the Javascript/Typescript world that allows you to write very spec-friendly, descriptive test names. In languages or frameworks that require you to use function names rather than strings for test names, I advocate making those function names as long and descriptive as possible, like

void throwsAnErrorWhenNullIsPassedIn()

Tests act as examples of how to use units of code

Related to the above point, they also act as written examples to future developers of how to use or consume the module under test.

Tests guard against regressions and other unintended changes

One of the most valuable things about adding tests as new features develop is that they remain in the codebase indefinitely as guards against unintended behaviour changes in the future. When working on a large system – particularly when developers come and go over time – it’s invaluable to be able to get instant feedback that a change you’ve made has caused a test to fail. This is especially true if that test is descriptively named, as recommended above, because it will help you understand what to do to fix the failure. Perhaps your change has had unintended consequences, or perhaps the test simply needs updating based on your change – a well named and well written test will help you make that call.

For this reason, it can sometimes be useful to test fairly trivial things that wouldn’t be worth testing if the only benefit were checking that your code works. Sometimes it’s valuable to simply enshrine something in a test to prevent accidental changes to important behaviour.

Tests help you refactor with confidence

When refactoring, tests are indispensable. If the code is well covered by good tests, and those tests pass after you’ve finished the refactor, you can have a high degree of confidence that you haven’t inadvertently changed any behaviour.

Resources

https://martinfowler.com/articles/practical-test-pyramid.html

Coding principles 3: Favour simplicity over complexity

This is the 3rd part of a series about 67 Bricks’s coding principles. The previous posts are: 1 and 2.

The principle

Aim for simplicity over complexity. This applies to everything from overarching architectural decisions down to function implementations.

This principle is a close cousin of the previous one – aim for clear, readable code – but emphasises one particular aspect of what makes code clear and readable: simplicity.

Simpler solutions tend to be easier to implement, to maintain, to reason about and to discuss with colleagues and clients.

It can be tempting to think that for software to be good or valuable it must be complicated. There can be an allure to complexity, I think partly because we tend to equate hard work with good work. So if we write something labyrinthine and hard to understand, it’s tempting to think it must also be good. But this is a false instinct when it comes to software. In code, hard does not equal good. In general complexity for its own sake should be avoided. It’s important to remember that there’s absolutely nothing wrong with a simple solution if it does what’s needed.

There’s also value in getting a simple solution working quickly so that it can be demoed, reviewed and discussed early compared to labouring for a long time over a complex solution that might not be correct. Something we emphasise a lot working at 67 Bricks is the value of iteration in the agile process. It can be extremely powerful to implement a basic version of a feature, site or application so that stakeholders can see and play with it and then give feedback rather than trying to discuss an abstract idea. Here, simplicity really shines because often getting a simple thing in front of a stakeholder in a week can be a lot more valuable than getting a complicated thing in front of them in a month.

This principle applies at every level at which we work, from designing your architectural infrastructure, down through designing the architecture of each module in your system, down to writing individual functions, frontend components and tests. At every level, if you can achieve what you need with fewer moving parts, simpler abstractions and fewer layers of indirection, the maintainability of your whole system will benefit.

Of course there are caveats here. Some code has to be complicated because it’s modelling complicated business logic. Sometimes there must be layers of abstraction and indirection because the problem requires it. This principle is not an argument that code should never be complicated, because sometimes it is unavoidable. Instead, it is an argument that simplicity is a valuable goal in itself and should be favoured where possible.

Another factor that makes this principle deceptively tricky is that it is the system (the architecture, the application, the class etc) that should be simple, not necessarily each individual code change. A complex system can very quickly emerge from a number of simple changes. Equally, a complicated refactor may leave the larger system simpler. It’s important to see the wood for the trees here. What’s important isn’t necessarily the simplicity of an individual code change, but the simplicity of the system that results from it.

There’s also subjectivity here: what does “simple” really mean when talking about code? A good example of an overcomplicated solution is the FizzBuzz Enterprise Edition repo – a satirical implementation of the basic FizzBuzz code challenge using an exaggerated Enterprise Java approach, with layers of abstraction via factories, visitors and strategies. However, all the patterns in use there do have their purpose. In another context, a factory class can simplify rather than obfuscate. But it’s important not to bring in extra complexity or indirection before it’s necessary.

Resources

The Wrong Abstraction – Sandi Metz

The Grug Brained Developer

Coding principles 2: Prioritise readability

This is the 2nd part of a series about 67 Bricks’s coding principles. The first post, containing the introduction and first principle is here.

The principle

Aim for clear, readable code. Write clear, readable comments where necessary

You should make it a priority that your work be readable and understandable to those who might come to it after you. This means you should aim to write code that is as clear and unambiguous as possible. You should do this by:

  • using clear variable, function and class names
  • avoiding confusing, ambiguous or unnecessarily complicated logic
  • adhering to the conventions and idioms of the language or technology you’re using

What can’t be made clear through code alone should be explained in comments.

Comments should focus on “why” (or “why not” explanations) far more than “how” explanations. This is particularly true if there is some historical context to a coding decision that might not be clear to someone maintaining the code in the future.

Note however that just like code, comments must be maintained and can become stale or misleading if they don’t evolve with the code, so use them carefully and only where they add value.

It is important to recognise that your code will be read far more times that it is written, and it will be maintained by people who don’t know everything you knew when you wrote it; possibly including your future self. Aim to be kind to your future self and others by writing code that conveys as much information and relevant context as possible.

I expect we’ve all had the experience of coming to a piece of code and struggling to understand it, only to realise it was you who wrote it a few months or weeks (or even days?) ago. We should learn from this occasional experience and aim to identify what we could have changed about the code the first time that would have prevented it. Better variable names? More comments? More comprehensive tests?

“You’re not going to run out of ink,” is something a colleague once commented on a pull request of mine to say that I could clarify the purpose of a variable by giving it a longer, more descriptive name. I think that’s a point worth remembering. Use as many characters as you need to make the job of the next person easier.

Of course, there’s some subjectivity here. What you see as obscure, someone else might see as entirely clear and vice versa. And certainly there’s an element of experience in how easily one can read and understand any code. The point really is to make sure that at least a thought is spared for the person who comes to the code next.

Examples

Here is an example that does not follow this principle:

const a = getArticles('2020-01-01');
a && process(a);

This example is unclear because it uses meaningless variable names and somewhat ambiguous method names. For example, it’s not clear without reading further into each method what they do – what does the date string parameter mean in getArticles? It also uses a technique for conditionally executing a method that is likely to confuse someone trying to scan this code quickly.

Now, here’s an example attempts to follow the principle:

// The client is only interested in articles published after 1st Jan
// 2020. Older articles are managed by a different system.
// See <ticket number>
const minDate = '2020-01-01';

const articlesResult = getArticlesSince(minDate);
if (articlesResult) {
  ingestArticles(articlesResult);
}

It provides a comment to explain the “why” of the hardcoded date, including relevant context; it uses much more meaningful names for variables and functions; and it uses a more standard, idiomatic pattern for conditionally executing a method.

Resources

Naming is Hard: Let’s Do Better (Kate Gregory, YouTube)

Coding principles 1: Favour functional code

Introduction to the principles

When I started working at 67 Bricks in 2017, in a small Oxford office already slightly struggling to contain about 15 developers, I found a strong and positive coding culture here. I learnt very quickly over my first few weeks what kind of code and practices the company valued. Some of that learning came via formal routes like on-boarding meetings and code review comments, but a lot of it came just by being in the office among many excellent developers and chatting or overhearing chats about opinions and preferences.

While there’s something very nice about this organic, osmosis-like way of ingesting a company’s values, practices and principles, it has been forced to evolve by a few factors over the last year. First we switched to home-working during the Covid lockdowns of 2020 and 2021 and then settled into a hybrid working model in which home-working is the default for most of us and the office is used somewhat less routinely. Secondly, we’ve increasing our technical team quite significantly over the last several years. Thirdly, that growth has partly involved a focus on bringing in and developing more junior developers. Each of these changes has made the “osmosis” model for new starters to pick up the company’s values a bit less tenable.

So over recent months, the tech leads have undertaken a project to distil those unwritten values and principles into a set of slightly more formal statements that new starters and old hands alike can refer to to help guide our high level thinking.

We came up with 9 of these principles. This and the following 8 posts in this series will go through each principle describing it and explaining why we think it is important in our ultimate goal of producing good, well-functioning products that run robustly, meet customer needs and are easy to maintain. 67 Bricks’s semi-joking unofficial motto is “do sensible things competently”; these principles aim to formalise a little what we mean by “sensible” and “competent”.

Generally I’ve used Typescript to write any code examples. The commonality of Typescript and Javascript should mean that examples are understandable to a good number of people.

About the principles

Before diving into the first principle, it’s worth briefly describing what these principles are and what they’re not.

These are high-level, general principles that aim to guide approaches to writing code in a way that is language/framework/technology agnostic. They should be seen more as rules of thumb or guidelines with plenty of room for exceptions and caveats depending on the situation. A good comparison might Effective Java by Joshua Bloch where a statement like “Favor composition over inheritance” doesn’t rule out ever using inheritance, but aims to guide the reader to understand why – in some cases – inheritance can cause problems and composition may provide a more robust and flexible solution.

These principles are not a style guide – our individual project teams are self organising and perfectly capable of enforcing their own code style preferences as they see fit – nor a dogmatic, stone-carved attempt at absolute truth. They’re also not strongly opinionated hot takes that are likely to provoke flame wars. They are simply what we see as sensible guidelines towards good, easy-to-write, easy-to-maintain code, and therefore robust software.

That was a lot of ado, so without any further let’s get on with the first principle.

The principle

Favour functional, immutable code over imperative, mutable code

Functional code emphasises side effect free, pure, composable functions that deal with immutable objects and avoid mutable state. We believe this approach leads to more concise, more testable, more readable, less error-prone software and we advise that all code be written in this way unless there is a good reason not to.

Code written in this way is easier to reason about because it avoids side effects and state mutations; functions are pure, deterministic and predictable. This approach promotes writing small, modular functions that are easy to compose together and easy to test.

67 Bricks has a history of favouring Scala as a development language – which may be clear from browsing back through the history of this blog. While these days C# has become a more common language for the products we deliver, the functional-first spirit of Scala is still woven into the fabric of 67 Bricks development. I believe Martin Odersky’s Coursera course: Functional Programming Principles in Scala is an excellent starting point for anyone wanting to understand the functional programming mindset regardless of your interest in Scala as a language.

As an interesting aside, the implementations of many of the Scala collections library classes – such as ListMap and HashMap – use mutable data structures internally in some methods, presumably for purposes of optimisation. This illustrates the caveat mentioned above that there may be sensible, situation-specific reasons to override this principle and others. It’s worth noting however that while the internals of some functions may be implemented in an imperative way, those are implementation details that are entirely encapsulated and irrelevant to users of the API.

I think “functional programming” is better seen as a continuum than a black and white dichotomy. While certain languages – like Haskell and F# – may be strictly functional, most languages – including C#, Javascript/Typescript, Python and (increasingly) Java – have many features that allow you to write in a more functional way if you choose to use them.

Examples

There are many books describing and teaching functional programming and the various principles that make it up, so I don’t intend to go into too much detail, but I think a couple of examples may help illustrate what functional code is and why it’s useful.

The following is an example of some code that does not follow this principle:

let onOffer = false;

function applyOffersToPrices(prices: number[]) {
  onOffer = isOfferDate(new Date());
  if (onOffer) {
    for (let i = 0; i < prices.length; i++) {
      prices[i] /= 2;
    }
  }
  return onOffer;
}

const prices: number[] = await retrievePricesFromSomewhere();
const onOffer = applyOffersToPrices(prices)
if (onOffer) {
  // ... what values does `prices` contain here?
} else {
  // ... how about here?
}

This code is hard to reason about because applyOffersToPrices mutates one of its arguments in some instances. This makes it very hard to be sure what state the values in the prices array are in after that function is called.

The following is an example that attempts to follow the principle:

function discountedPrices(prices: number[], date: Date) {
  if (!isOfferDate(date)) {
    return prices;
  }
  return prices.map(price => price / 2)
}

const prices: number[] = await retrievePricesFromSomewhere();
const todayPrices = discountedPrices(prices, new Date());

In this example, applyOffersToPrices is a pure function that does not mutate its input, but returns a new array containing the updated prices. It is unambiguous that prices still contains the original prices while todayPrices contains the prices that apply on the current date with the offer applied as necessary.

Note also that discountedPrices has everything it needs – the original prices and the current date – passed into it as arguments. This makes it very easy to test with different values.

Resources

Functional Programming Principles in Scala – Martin Odersky on Coursera

Why Functional Programming

Unit testing 101 – mob rulz

In a recent developer forum I made the rather wild decision to try demonstrate the principles of unit testing via an interactive mobbing session. I came prepared with some simple C# functions based around an Aspnetcore API and said “let’s write the tests together”. The resultant session unfolded not quite how I anticipated, but it was still lively, fun and informative.

The first function I presented was fairly uncontentious – the humble fizzbuzz:

[HttpGet]
[Route("fizzbuzz")]
public string GetFizzBuzz(int i)
{
    string str = "";
    if (i % 3 == 0)
    {
        str += "Fizz";
    }
    if (i % 5 == 0)
    {
        str += "Buzz";
    }
    if (str.Length == 0)
    {
        str = i.ToString();
    }

    return str;
}

Uncontentious that was, until a bright spark (naming no names) piped up with questions like “Shouldn’t 6 return ‘fizzfizz’?”. Er… moving on…

I gave a brief introduction to writing tests using XUnit following the Arrange/Act/Assert pattern, and we collaboratively came up with the following tests:

[Fact]
public void GetFizzBuzz_FactTest()
{
    // Arrange
    var input = 1;

    // Act
    var response = _controller.GetFizzBuzz(input);

    // Assert
    Assert.Equal("1", response);
}

[Theory]
[InlineData(1, "1")]
[InlineData(2, "2")]
[InlineData(3, "Fizz")]
[InlineData(4, "4")]
[InlineData(5, "Buzz")]
[InlineData(9, "Fizz")]
[InlineData(15, "FizzBuzz")]
public void GetFizzBuzz_TheoryTest(int input, string output)
{
    var response = _controller.GetFizzBuzz(input);
    Assert.Equal(output, response);
}

So far so good. We had a discussion about the difference between “white box” and “black box” testing (where I nodded sagely and pretended I knew exactly what these terms meant before making the person who mentioned them provide a definition). We agreed that these tests were “white box” testing because we had full access to the source code and new exactly what clauses we wanted to cover with our test cases. With “black box” testing we know nothing about the internals of the function and so might attempt to break it by throwing large integer values at it, or finding out exactly whether we got back “fizzfizz” with an input of 6.

Moving on – I presented a new function which does an unspecified “thing” to a string. It does a bit of error handling and returns an appropriate response depending on whether the thing was successful:

[Produces("application/json")]
[Route("api/[controller]")]
[ApiController]
public class AwesomeController : BaseController
{
    private readonly IAwesomeService _awesomeService;

    public AwesomeController(IAwesomeService awesomeService)
    {
        _awesomeService = awesomeService;
    }

    [HttpGet]
    [Route("stringything")]
    public ActionResult<string> DoAThingWithAString(
        string thingyString)
    {
        string response;

        try
        {
            response = _awesomeService
                           .DoAThingWithAString(thingyString);
        }
        catch (ArgumentException ex)
        {
            return BadRequest(ex.Message);
        }
        catch (Exception ex)
        {
            return StatusCode(500, ex.Message);
        }

        return Ok(response);
    }
}

This function is not stand-alone but instead calls a function in a service class, which does a bit of validation and then does the “thing” to the string:

public class AwesomeService : IAwesomeService
{
    private readonly IAmazonS3 _amazonS3Client;

    public AwesomeService(IAmazonS3 amazonS3Client)
    {
        _amazonS3Client = amazonS3Client;
    }

    public string DoAThingWithAString(string thingyString)
    {
        if (thingyString == null)
        {
            throw new ArgumentException("Where is the string?");
        }

        if (thingyString.Any(char.IsDigit))
        {
            throw new ArgumentException(
                @"We don't want your numbers");
        }

        var evens = 
            thingyString.Where((item, index) => index % 2 == 0);
        var odds = 
            thingyString.Where((item, index) => index % 2 == 1);

        return string.Concat(evens) + string.Concat(odds);
    }
}

And now the debates really began. The main point of contention was around the use of mocking. We can write an exhaustive test for the service function to exercise all the if clauses and check that the right exceptions are thrown. But when testing the controller function should we mock the service class or not?

Good arguments were provided for the “mocking” and “not mocking” cases. Some argued that it was easier to write tests for lower level functions, and if you did this then any test failures could be easily pinned down to a specific line of code. Others argued that for simple microservices with a narrow interface it is sufficient to just write tests that call the API, and only mock external services.

Being a personal fan of the mocking approach, and wanting to demonstrate how to do it, I prodded and cajoled the group into writing these tests to cover the exception scenarios:

public class AwesomeControllerTests
{
    private readonly AwesomeController _controller;
    private readonly Mock<IAwesomeService> _service;

    public AwesomeControllerTests()
    {
        _service = new Mock<IAwesomeService>();
        _controller = new AwesomeController(_service.Object);
    }

    [Fact]
    public void DoAThingWithAString_ArgumentException()
    {
        _service.Setup(x => x.DoAThingWithAString(It.IsAny<string>()))
            .Throws(new ArgumentException("boom"));

        var response = _controller.DoAThingWithAString("whatever")
                                  .Result;

        Assert.IsType<BadRequestObjectResult>(response);
        Assert.Equal(400, 
            ((BadRequestObjectResult)response).StatusCode);
        Assert.Equal("boom", 
            ((BadRequestObjectResult)response).Value);
    }

    [Fact]
    public void DoAThingWithAString_Exception()
    {
        _service.Setup(x => x.DoAThingWithAString(It.IsAny<string>()))
            .Throws(new Exception("boom"));

        var response = _controller.DoAThingWithAString("whatever")
                                  .Result;

        Assert.IsType<ObjectResult>(response);
        Assert.Equal(500, ((ObjectResult)response).StatusCode);
        Assert.Equal("boom", ((ObjectResult)response).Value);
    }        
}

Before the session descended into actual fisticuffs I rapidly moved on to discuss integration testing. I added a function to my service class that could read a file from S3:

public async Task<object> GetFileFromS3(string bucketName, string key)
{
    var obj = await _amazonS3Client.GetObjectAsync(
        new GetObjectRequest 
        { 
            BucketName = bucketName, 
            Key = key 
        });

    using var reader = new StreamReader(obj.ResponseStream);
    return reader.ReadToEnd();
}

I then added a function to my controller which called this and handled a few types of exception:

[HttpGet]
[Route("getfilefroms3")]
public async Task<ActionResult<object>> GetFile(string bucketName, string key)
{
    object response;

    try
    {
        response = await _awesomeService.GetFileFromS3(
                             bucketName, key);
    }
    catch (AmazonS3Exception ex)
    {
        if (ex.Message.Contains("Specified key does not exist") ||
            ex.Message.Contains("Specified bucket does not exist"))
        {
            return NotFound();
        }
        else if (ex.Message == "Access Denied")
        {
            return Unauthorized();
        }
        else
        {
            return StatusCode(500, ex.Message);
        }
    }
    catch (Exception ex)
    {
        return StatusCode(500, ex.Message);
    }

    return Ok(response);
}

I argued that here we could write a full end-to-end test which read an actual file from an actual S3 bucket and asserted some things on the result. Something like this:

public class AwesomeControllerIntegrationTests : 
    IClassFixture<WebApplicationFactory<Api.Startup>>
{
    private readonly WebApplicationFactory<Api.Startup> _factory;

    public AwesomeControllerIntegrationTests(
        WebApplicationFactory<Api.Startup> factory)
    {
        _factory = factory;
    }

    [Fact]
    public async Task GetFileTest()
    {
        var client = _factory.CreateClient();

        var query = HttpUtility.ParseQueryString(string.Empty);
        query["bucketName"] = "mybucket";
        query["key"] = "mything/thing.xml";
        using var response = await client.GetAsync(
            $"/api/Awesome/getfilefroms3?{query}");
        using var content =  response.Content;
        var stringResponse = await content.ReadAsStringAsync();

        Assert.NotNull(stringResponse);
    }
}

At this point I was glad that the forum was presented as a video call because I could detect some people getting distinctly agitated. “Why do you need to call S3 at all?” Well maybe the contents of this file are super mega important and the whole application would fall over into a puddle if it was changed? Maybe there is some process which generates this file on a schedule and we need to test that it is there and contains the things we are expecting it to contain?

But … maybe it is not our job as a developer to care about the contents of this file and it should be some other team entirely who is responsible for checking it has been generated correctly? Fair point…

We then discussed some options for “integration testing” including producing some local instance of AWS, or building a local database in docker and testing against that.

And then we ran out of time. I enjoyed the session and I hope the other participants did too. It remains to be seen whether I will be brave enough to attempt another interactive mobbing session in this manner…

Spooky season special – tales of terrors and errors

Anyone who has been working in software development for more than a few months will know the ice-cold sensation that creeps over you when something isn’t working and you don’t know why. Luckily, all our team members have lived to tell the tale, and are happy to share their experiences so you might avoid these errors in future… 

The Legend of the Kooky Configuration – Rhys Parsons
In my first job, in the late 90s, I was working on a project for West Midlands Fire Service (WMFS). We were replacing a key component (the Data Flow Controller, or DFC) that controlled radio transmitters and was a central hub (GD92 router) for communicating with appliances (fire engines). Communication with the Hill Top Sites (radio transmitters) was via an X.25 network.

The project was going well, we had passed the Factory Acceptance Tests, and it was time to test it on-site. By this point, I was working on the project on my own, even though I only had about two years of experience. I drove down to Birmingham from Hull with the equipment in a hired car, a journey of around 3.5 hours. The project had been going on for about a year by this point, so there was a lot riding on this test. WMFS had to change their procedures to mobilise fire engines via mobile phones instead of radio transmitters, which, back in the late 90s, was quite a slow process (30 seconds call setup). I plugged in the computers and waited for the Hill Top Sites to come online. They didn’t. I scratched my head. A lot. For an entire day. Pouring over code that looked fine. Then I packed it all up and drove back to Hull.

Back in the office, I plugged in the computer to test it. It worked immediately! Why?! How could it possibly have worked in Hull but not in Birmingham! It made absolutely no sense!

I hired a car for the next day and drove back down to Birmingham early, aiming to arrive just after 9, to avoid the shift change. By this point, I was tired and desperate.

I plugged the computer back in again. I had made absolutely no changes, but I could see no earthly reason why it wouldn’t work. “Sometimes,” I reasoned, “things just work.” That was my only hope. This was the second-day WMFS were using slower backup communications. One day was quite a good test of their resilience. Two days were nudging towards the unacceptable. Station Officers were already complaining. I stared at the screen, willing the red graphical LEDs to turn green. They remained stubbornly red. At the end of the day, I packed up the computer and drove back to Hull.

The WMFS project manager phoned my boss. We had a difficult phone conversation, and we decided I should go again the next day.

Thankfully, a senior engineer who had the experience of X.25 was in the office. I told him of this weird behaviour that made no sense. We spoke for about two minutes which concluded with him saying, “What does the configuration look like?”

My mouth dropped. The most obvious explanation. I hadn’t changed the X.25 addresses! I was so busy wondering how the code could be so weirdly broken that I hadn’t considered looking at the configuration. So, so stupid! I hadn’t changed the configuration since I first set up the system, several months earlier, it just wasn’t in my mind as something that needed doing.

Day three. Drove to Birmingham, feeling very nervous and stupid. Plugged in the computer. Changed the X.25 addresses. Held my breath. The graphical LEDs went from red to orange, and then each Hill Top Site went green in turn, as the transmit token was passed back and forth between them and the replacement DFC. Finally, success!

A Nightmare on Character Street – Rosie Chandler
We recently implemented a database hosted in AWS, with the password stored in the AWS Secrets Manager. The password is pulled into the database connection string, which ends up looking something like this:

“Server=myfunkyserver;Port=1234;Database=mycooldatabase;User ID=bigboss;Password=%PASSWORD%”

Where %PASSWORD% here is substituted with the password pulled out of the Secrets Manager. We found one day that calls to the database started throwing connection exceptions after working perfectly fine up until that point. 

After spending a lot of time scratching my head and searching through logs, I decided to take a peek into the Secrets Manager to see what the password was. Turns out that day’s password was something like =*&^%$ (note it starts with “=”) which means that the connection string for that day was invalid. After much facepalming, we implemented a one-line fix to ensure that “=” was added to the list of excluded characters for the password.

The Case of the Phantom Invoices – Chris Rimmer
Many years ago I was responsible for writing code that would email out invoices to customers. I was very careful to set things up so that when the code was tested it would send messages to a fake email system, not a real one. Unfortunately, this wasn’t set up in a very fail-safe way, meaning that another developer managed to activate the email job in the test system and sent real emails to real customers with bogus invoices in them. This is not the sort of mistake you quickly forget. Since then I’ve been very careful configuring email systems in test environments so that they can only send emails to internal addresses.


Tales from the Dropped Database – Rich Brown
It was a slow and rainy Thursday morning, I was just settling into my 3rd cup of coffee when a fateful email appeared with the subject ‘Live site down!!!’

Ah, of course, nothing like a production issue to kick-start your morning. I checked the site: it was indeed down. Sadly, the coffee would have to wait.

Logging onto the server, I checked the logs. A shiver ran down my spine.

ERROR: SQL Error – Table ‘users’ does not exist

ERROR: SQL Error – Table ‘articles’ does not exist

ERROR: SQL Error – Table ‘authors’ does not exist

ERROR: SQL Error – Database ‘live-db’ does not exist

That’s…. unusual…

Everything was working and then suddenly it stopped, no data existed.

Hopping onto the database server showed exactly that. Everything was gone, every row, every table, even the database itself wasn’t there.

I pulled in the rest of the team and we scratched our collective heads, how could this even happen? The database migration system shouldn’t be able to delete everything. We’ve taken all the right mitigations to prevent injection attacks. There’s no reason for our application to do this.

I asked, “What was everyone doing when the database disappeared?”

Dev 1 – “Writing code for this new feature”

Dev 2 – “Updating my local database”

Dev 3 – “Having a meeting”

Ok, follow up question to Dev 2 – “How did you update your database?”

Dev 2 – “I dropped it and let the app rebuild it as I usually do”

Me – “Show me what you did”

ERROR: SQL Error – Cannot drop database ‘live-db’ because it does not exist

Turned out Dev 2 had multiple SQL Server Manager instances open, one connected to their local test database and the other connected to the live system to query some existing data.

They thought they were dropping their local database and ended up dropping live by mistake.

One quick database restore and everything was back to normal.

Moral of the story, principle of least access. If you have a user who only needs to read data, only grant permissions to read data.

Things Customers Don’t Understand About Search

Note, this post is based on a dev forum put together by Chris.

Full-text search is a common feature of systems 67 Bricks build. We want to make it easy for users to find relevant information quickly often through a faceted search function. Understanding user needs and building a top notch user experience is vital. When building faceted search, we generally use either ElasticSearch (or AWS’s OpenSearch) or MarkLogic. Both databases offer very similar feature sets when it comes to search, though one is more targetted towards JSON based documents and the other, XML.

Search can seem magical at first glance and do some amazing things but this can lead to situations where customers (and UX/UI designers) assume the search mecahnism can do more than it can. We frequently find a disconnect between what is desired and what is feasable with search systems.

There are 2 main categories of problems we often see are:

  1. Customers / Designers asking for things that could be done, but often come with nasty performance implications
  2. Features that seem reasonable to ask for at first glance, but once dug into reveal logical problems that make developing the feature near impossible

Faceted Search

Faceted search systems are some of the most common systems we build at 67 Bricks. The user experience typicallly starts with a search box that they enter a number of terms into, hit enter and then be presented with a list of results in some kind of relevancy order. Often there is a count of results displayed alongside a pagination mechanism to iterate through the results (e.g. showing results 1-10 of 12,464). We also show facets, counts for how many results fit into different buckets. All this is handled in a single query that often takes less than 100ms which seems miraculous. Of course, this isn’t magic, full-text search systems use a variety of clever indexes to make searching and computing the facet counts quick.

Lets make a search system for a hypothetical website cottagesearch.com. Our first screen will present the user with some options to select a location, the date range they want to stay and how many guests are coming. We perform the search and show the matching results. How should we display the results and more importantly, how do we show the facets?

Let’s say we did a search for 2 bedroom cottages. We’ve seen wireframes for a number of occassions where the facet count for all bedroom numbers are displayed. So users see the number of results applicable to each bedroom count they would get if they didn’t limit the search to just 2 bedrooms (i.e. there aren’t that many 2 bed options, but look at how many 3 bed options are available). At first glance, this seems like a sensible design, but fundmanetally breaks how search systems work with faceting, they will return counts, but only for the search just done.

We could get around this by doing 2 searches, 1 limited by bedrooms and one that does not to retrieve the facet counts. This may seem like a sensible idea when we have 1 facet, but what do we do when we have more? Do we need to do multiple searches, effectively making an N+1 problem? How to we display numbers? Should the counts for the location facet include the limit of bedrooms or not? As soon as we start exploring additional situations we start to see the challenges the original design presents.

This gets harder when we consider non-exclusionary facets. Let’s say our cottage search system lets you filter by particular features, such as a wood burner, hot tub or dishwasher. Now, if we show counts of non-selected facets, what do these numbers represent? Do they include results that already include the selected facet or not? Here, the logic starts to break down and becomes ever more confusing to the end user and difficult to implement for the developer.

Other Complex Facet Situations

A common question we need to ask with non-exclusionary facets: Is it an AND or an OR search? The answer is very domain dependant, but either way we suggest steering away from facets counts in these situations.

Date ranges provide an interesting problem, some sites will purposefully search outside of the selected range so as to provide results near the selected date range. This may be a useful or annoying depending on what the user expects and is trying to achieve. Some users would want exact matches and would have no interest in results that do not meet the selected date range.

Ordering facets is also a questions that may be overlooked. Do you order lexographically or do you order by descending number of matches? What about names, year ranges or numeric values? Again, a lot of what users expect and would want comes down to the domain being dealt with and the needs of the users.

When users select a new facet, what should the UI do? Should the search immediately rerun and the results and facets update or should there be a manual refresh button the user has to select before the search is updated? An immediate refresh would be slower, but let users narrow down carefully, while a manual update would reduce the number of searches done, but then users may be able to select a number of facets in such a way that no results would be returned.

Hierarchies can also prove tricky. We often see taxonomies being used to inform facets, say subjects with sub categories. How should these be displayed? Again there are many solutions to pick from with different sets of trade-offs.

Advanced Search

Advanced search can often be a bit like a peacocks tail – something that looks impressive, but doesn’t contribute a fair share of value based on how much effort it takes to develop. A lot of designers and product owners love the idea of it but in practice, it can end up being somewhat confusing to use and many end users end up avoiding it.

Boolean builders exist in many systems where the designer of advanced search will insist on allowing users to build up some complex search with lots of boolean AND/OR options, but displaying this to users in a way they can understand is challenging. If a user builds a boolean search such as: GDP AND Argentina OR Brazil do we treat it as (GDP AND Argentina) OR Brazil or should it be interpreted as GDP AND (Argentina OR Brazil). We could include brackets in the builder, but this just further complicates the UI.

We frequently get bugs and feedback on advanced search, some of this feedback can amount to different users having contradictory opinions on how it should work. We would ask product owners to carefully consider “How many people will use it?” Google has a well build UI for advanced search that does away with the challenges of boolean logic by having separate fields for ANDs ORs and NOTs.

Google advanced search UI

An advanced search facility can introduce additional complexity when combined with facets. If an advances search lets you select some facets before completing the search, does this form part of the string in the search box? We have had mixed results with enabling power users to enter facets into search fields (e.g. bedrooms:3), but this can be tricky, some users can deal with it, but others may prefer a advanced search builder while others will rely on facets post search.

Summary

In conclusion, we have 3 main takeaways

  • Search is much more complex than it first appears
  • Facets are not magic, just because you can draw a nice wireframe doesn’t make it feasable to develop
  • Advanced search can be tricky to get right and even then, only used by a minority of users

We’ve build many different types of search and have experimented with a number of approaches in the past and we offer some tried and tested principles:

  • Make searches stateless – Don’t add complexity by trying to maintain state between facets changes, simply treat each change as a fresh search. That way URLs can act as a method of persistence and bookmarking common searches.
  • Have facets only display counts for the current search and do not display counts for other facets once one has been selected within that category.
  • Only use relevancy as the default ordering mechanism – You may be tempted to allow results to be ordered in different ways, such as published date, but this can cause problems with weakly matching, but recent results appearing first.
  • Don’t build an advanced search unless you really need to and if you have to, use a Google style interface over a boolean query builder.
  • Check that search is working as expected – Have domain experts check that searches are returning sensible results and look into using analytics to see if users are having a happy journey through the application (i.e. run a search and then find the right result within the first few hits).
  • Beware of exhaustive search use cases – As many search mechanisms work on some score based on relevancy to the terms entered, having a search that guarantees a return of everything relevant can be tricky to define and to develop.

Zen and the art of booking vaccinations

This is a slightly abridged version of a painful experience I had recently when trying to book a Covid vaccination for my 5-year-old daughter, and some musing about what went wrong (spoiler: IT systems). It’s absolutely not intended as a criticism of anyone involved in the process. All descriptions of the automated menu process describe how it was working today.

At the beginning of April, vaccinations were opened up for children aged 5 and over. Accordingly, on Saturday 2nd, we tried to book an appointment for our daughter using the NHS website (https://www.nhs.uk/conditions/coronavirus-covid-19/coronavirus-vaccination/book-coronavirus-vaccination/). After entering her NHS record and date of birth, we were bounced to an error page:

The error page after failing to book an appointment.

There’s no information on the page about what the error might be – possibly this is reasonable given patient confidentiality etc. (at no point had I authenticated myself). I noticed that the URL ended “/cannot-find-vaccination-record?code=9901” but other than that, all I can do is call 119 as suggested.

Dialling 119, of course, leads you to a menu-based system. After choosing the location you’re calling from, you get 4 options:

  1. Test and trace service
  2. Covid-19 vaccination booking service
  3. NHS Covid pass service
  4. Report an issue with Covid vaccination record

So the obvious choice here is “4”. This gives you a recorded message “If you have a problem with your UK vaccination records, the agent answering your call can refer you on to the data resolution team”. This sounds promising! Then there’s a further menu with 3 choices:

  1. If your vaccination record issue is stopping you from making a vaccination booking
  2. If your issue is with the Covid pass
  3. If your issue relates to a vaccination made overseas

Again, the obvious choice here is “1”. This results in you being sent to the vaccination booking service.

The first problem I encountered (in the course of the day I did this many times!) was that many of the staff on the other end seemed to be genuinely confused about how I’d ended up with them. They told me I should redial and choose option 4, and I kept explaining that I had done exactly that and this is where I had ended up. So either menu system is not working and is sending me to the wrong place (although given the voice prompts it sounds like it was doing the right thing), or the staff taking the calls have not been briefed properly.

Eventually I was able to get myself referred to the slightly Portal-esque sounding Vaccination Data Resolution Service. They explained that my daughter did not appear to be registered with a GP, which surprised me because she definitely is. So, they said, I should get in touch with her GP practice and get them to make sure the records were correct on “the national system”.

This I did. I actually went down there (it was lunchtime), the staff at the surgery peered at her records and reported that everything seemed to be present and correct, with no issues being flagged up.

So, then I had more fun and games trying to get one of the 119 operators to refer me back to the VDRS. This was eventually successful, and someone else at the VDRS called me back. She took pity on me, and gave me some more specific information – the “summary case record” on the “NHS Spine portal” which should have listed my daughter’s GP did not.

I phoned the GP surgery, explained this, various people looked at the record and reported that everything seemed fine to them.

More phoning of 119, a third referral, to another person. He wasn’t able to suggest anything, sadly.

So, at this point, I was wracking my brains trying to work out where the problem could lie. I had the VDRS people saying that this data was missing from my daughter’s record, and the GP surgery insisting that all was well. The VDRS chap had mentioned something about it potentially taking up to 4 weeks (!) for updates to come through to them, which suggests that behind the scenes there must be some data synchronisation between different systems. I wondered if there was some kind of way of tagging bits of a patient record with permissions to say who is allowed to see them, and the GP surgery could and the VDRS people couldn’t.

Finally, on the school run to collect my daughter, I thought I’d have one last try at talking to the GP practice in person. I spoke to one of the ladies I’d spoken to on the phone, she took my bit of paper with “NHS Spine portal – summary care record” scrawled on it and went off to see the deputy practice manager. A short while later she returned; they’d looked at that bit of the record and spotted something in it about “linking” the local record with the NHS Spine (she claimed they’d never seen this before1), and that this was not set. I got the impression that in fact it couldn’t be set, because her proposed fix (to be tried tomorrow) is going to be to deregister my daughter and re-register her. And then I should try the vaccine booking again in a couple of days.

As someone who is (for want of a better phrase) an “IT professional”, the whole experience was quite frustrating. As noted at the top, I’m not trying to criticise any person I dealt with – everyone seemed keen to help. I’m also not trying to cast aspersions on the GP’s surgery – as far as they were aware, there was nothing amiss with the record (until they discovered this “linking” thing). My suspicion is that the faults lie with IT systems and processes.

For example, it sounds like it’s an error for a GP surgery to have a patient record that’s not linked to a record in the NHS Spine, but since many people took a look at the screens showing the data without noticing anything amiss, I’d say that’s some kind of failure of UI design. I wonder how it ended up like that; maybe my daughter’s record predated linking with Spine and somehow got missed in a transitioning process (or no such process occurred)?

It would have been nice if I’d been able to get from the state of knowing there’s some kind of error with the record, to knowing the actual details of the error, without having to jump through so many telephonic hoops. I presume the error code 9901 means something to somebody, but it didn’t mean anything to any of the people I spoke to. In any case, I only spotted it because, as a developer, I thought I’d peep at the URL, but it didn’t seem to be helpful from the point of view of diagnosing the problem. It feels like there’s a missed opportunity here – since people seeing that error page are directed to the 119 service, it would have been helpful to have provided some kind of visible code to enable the call handlers to triage the calls effectively.

In terms of my own development, it was an important reminder that the systems I build will be used by people who are not necessarily IT-literate and don’t know how they work under the covers, and if they go wrong then it might be a bewildering and perplexing experience for them. Being at the receiving end of vague and generic-sounding errors, as I have been today, has not been a lot of fun.


1I found this curious, but a subsequent visit to the surgery a couple of days later to see how the “fix” was progressing clarified things somewhat. My understanding now is that the regular view of my daughter’s record suggested that all was well, and that it was linked with Spine, and it was only when the deputy practice manager clicked through to investigate that it popped up a message saying that it was not linked correctly.

Obviously I don’t know the internals of the system and this is purely speculation, but suppose that the local system system had set a “I am linked with Spine” flag when it tried to set up the link, the linking failed for some reason, and the flag was never unset (or maybe it got set regardless of whether the linking succeeded or failed). Suppose furthermore that the “clicking through” process described actually tries to pull data from Spine. That could give a scenario in which the record looks fine at first glance and gives no reason to suppose anything is wrong, and you only see the problem with some deeper investigation. We can still learn a lesson from this hypothetical conjecture – if you are setting a flag to indicate that some fallible process has taken place, don’t set it before or after you run this process, set it when you have run it and confirmed that it has been successful.


Postscript

Sadly, I never found out what the underlying problem was. We just tried the booking process one day and it worked as expected (although by this point it was moot as my daughter had tested positive for Covid, shortly followed by myself and my wife). A week or so later, we got a registration confirmation letter confirming that she’d been registered with a GP practice, which was reassuring. I’d like to hope that somewhere a bug has been fixed as a result of this…

Organising Git Pull Requests

Recently we had an interesting dev forum on Git workflows. Git is the de facto source control tool of choice for software development. Powerful and flexible, teams have a wide range of choices to make when deciding exactly how to make use of it. This discussion originally stemmed from a Changelog podcast titled “Git your reset on” based on the blog post “Git Organized” by Annie Sexton.

To briefly summarise the above, if we imagine a situation where a merge to main leads to a production deployment and then an issue is found in production, how do we roll back? We look in the git history, find the commit that introduced the bug, revert it and then redeploy to production. But oh no! This revert changed a number of source files beyond the scope of the bug and has ended up introducing a new bug.

The proposed solution to this is to ensure a sensible, clean history of commits is used within a branch. Annie Sexton suggests using a feature branch where you commit regularly with useless messages like WIP until you are ready to submit a pull request. You then run git reset origin/head to be presented with all the changes you have done and how this differs from main, you then progressively add files and make commits with sensible messages to build up a more coherent change.

This approach has a number of advantages we discussed:

  • It provides a neat history of compartmentalised changes
  • Reviewers can then follow a structured story of commits about how the developer arrived at their solution
  • Irrelevant changes are excluded
  • Ensure that all of the included changes are relevant only to the task at hand

Disadvantages of this approach may include:

  • Bad approaches that were tried and abandoned are not in the history
  • Changes aren’t going to always compartmentalise neatly into specific changes of files meaning we would need to commit specific changes within a file, some tools do a bad job supporting this use case

It turns out that a lot of developers at 67 Bricks use a form of history rewriting when developing code, but no one used the specific approach proposed above. git commit–amend is a very useful command to tweak the previous commit. Some (including myself) use git rebase -i as a way of rebuilding the commit history, though this has a limitation in cases where you may want to split a commit up. Finally, others create new branches and build up clean commits on the new branch.

Is force pushing a force for good?

This question really split the developers, some see using git push --force as an evil that indicates you’re using git wrongly and should only be used as a last resort. Others really like the idea of force pushing, but only for branches you’ve been developing on and have yet to create a merge request for.

Squashing?

A workflow some developers have seen before involves using the --squash flag when merging into main. This can create a nice history where each commit neatly maps to a single ticket but the general view was this caused the loss of helpful information from the git history.

Should the software be valid at every commit?

Some argued that this is a sensible thing to strive for, having confidence that you can revert back to any commit and the software will work is a nice place to be at. However, others criticised this as tricky to achieve in practice. Ideally the tests would all be valid and pass too, but this goes against how some developers work; they create a failing test to show the problem or new feature, commit it and then build out code to make the test green. Others claim that tests should change in the same commit as code changes to make it obvious these changes belong together.

Tools used by Developers

At 67 bricks, we try to avoid specifying which precise tool a developer should use for their craft. Different developers prefer different approaches when it comes to using git, some are more comfortable with their IDEs plugin (e.g. IntelliJ has good git integration which can add specific lines to commits) while others may prefer using the command line. Here’s a list of tools we’ve used when dealing with git:

To conclude, specific approaches and tools used vary, but everyone agreed on the benefits of striving for a clean, structured git history.

An often asked question

I thought I’d pen a short blog post on a question I frequently find myself asking.

You’re right, but are you relevant?

Me

As software developers, it’s all too easy to want to select the new hot technologies available to us. Building a simple web app is great, but building a distributed web-app based on an auto-scaling cluster making use of CQRS and an SPA front-end written in the latest JavaScript hotness is just more interesting. Software development is often about finding simple solutions to complex problems, allowing us to minimize complexity.

  • You’re right, distributed microservices will allow you to deploy separate parts of your app independently, but is that relevant to a back-office system which only needs to be available 9-5?
  • You’re right, CQRS will allow you to better scale the database to handle a tremendous number of queries in parallel, but is that relevant when we only expect 100 users a day?
  • You’re right, that new Javascript SPA framework will let you create really compelling, interactive applications, but is that relevant to a basic CRUD app?

Lots of modern technology can do amazing things and there are always compelling reasons to choose technology x, but are those reasons relevant to the problem at hand? If you spend a lot of time up front designing systems with tough-to-implement technologies and approaches which aren’t needed; a lot of development effort would have been needlessly spent adding all kinds of complexity which wasn’t needed. Worse, we could have added lots of complexity that then resists change, preventing us from adapting the system in the direction our users require.

At 67 Bricks, we encourage teams to start with something simple and then iterate upon it, adding complexity only when it is justified. Software systems can be designed to embrace change. An idea that’s the subject of one of my favourite books Building Evolutionary Architectures.

So next time you find yourself architecting a big complex system with lots of cutting edge technologies that allow for all kinds of “-ilities” to be achieved. Be sure to stop and ask yourself. “I’m right, but am I relevant?”