Setting up local AWS environment using Localstack

When cloud services are used in an application, it can be tricky to mock them during local development. Some approaches include: 1) doing nothing, and letting your application fail whenever it makes a call to a cloud service; 2) creating sets of fake data to return from calls to, for example, AWS S3; 3) using a dedicated cloud account for development purposes. A nice in-between solution is Localstack, a cloud service emulator. While the number of services available and their functionality can be a little limited compared to the real AWS environment, it works rather well for our team.

This article will describe how to set it up for local development in Docker.

Docker-compose setup

In the services section of our docker-compose.yml we have the Localstack container definition:

localstack:
    image: localstack/localstack:latest
    hostname: localstack
    environment:
      - SERVICES=s3,sqs
      - HOSTNAME_EXTERNAL=localstack
      - DATA_DIR=/tmp/localstack/data
      - DEBUG=1
      - AWS_ACCESS_KEY_ID=test
      - AWS_SECRET_ACCESS_KEY=test
      - AWS_DEFAULT_REGION=eu-central-1
    ports:
      - "4566:4566"
    volumes:
      - localstack-data:/tmp/localstack:rw
      - ./create_localstack_resources.sh:/docker-entrypoint-initaws.d/create_localstack_resources.sh
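
Note that the localstack-data named volume also needs declaring at the top level of docker-compose.yml (outside the services section shown here), presumably something like:

volumes:
  localstack-data: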

Although we don’t connect to any real AWS account, we do need dummy AWS variables (any value will do). We also specify which services we want Localstack to run – in this case S3 and SQS.

We also need to set HOSTNAME_EXTERNAL, because the SQS API needs the container to know the hostname it can be reached on – the queue URLs it returns include that hostname (e.g. http://localstack:4566/000000000000/my-queue).

Another point is that we cannot use the entrypoint definition, because Localstack runs any shell scripts it finds in the docker-entrypoint-initaws.d directory when the container starts up. That’s why we map a volume into the container pointing at the folder where those scripts live. In our case create_localstack_resources.sh will create all the necessary S3 buckets and the SQS queue:

EXPECTED_BUCKETS=("bucket1" "bucket2" "bucket3")
EXPECTED_QUEUE="my-queue"

# what already exists in Localstack
EXISTING_BUCKETS=$(aws --endpoint-url=http://localhost:4566 s3 ls --output text)
EXISTING_QUEUE=$(aws --endpoint-url=http://localhost:4566 sqs list-queues --output text)

echo "creating buckets"
for BUCKET in "${EXPECTED_BUCKETS[@]}"
do
  echo $BUCKET
  if [[ $EXISTING_BUCKETS != *"$BUCKET"* ]]; then
    aws --endpoint-url=http://localhost:4566 s3 mb s3://$BUCKET
  fi
done

echo "creating queue"
if [[ $EXISTING_QUEUE != *"$EXPECTED_QUEUE"* ]]; then
    # the dead-letter queue referenced in the redrive policy needs to exist first
    aws --endpoint-url=http://localhost:4566 sqs create-queue --queue-name my-dead-letter-queue
    aws --endpoint-url=http://localhost:4566 sqs create-queue --queue-name "$EXPECTED_QUEUE" \
    --attributes '{
      "RedrivePolicy": "{\"deadLetterTargetArn\":\"arn:aws:sqs:eu-central-1:000000000000:my-dead-letter-queue\",\"maxReceiveCount\":\"3\"}",
      "VisibilityTimeout": "120"
    }'
fi

Note that the AWS CLI commands are the same as for the real AWS CLI, except that they include the Localstack endpoint flag --endpoint-url=http://localhost:4566 (without it you’d create resources in the account for which you have credentials set up!).

Configuration files

We use Scala with the Play Framework for this particular application, and therefore have .conf files. In the local.conf file we have the following:

aws {
  localstack.endpoint = "http://localstack:4566"
  region = "eu-central-1"
  s3.bucket1 = "bucket1"
  s3.bucket2 = "bucket2"
  sqs.my_queue = "my-queue"
  sqs.queue_enabled = true
}

The real application.conf file has the resource names injected at instance startup. They live in an autoscaling group launch template, where they are created by Terraform (out of scope for this post).

Initializing SQS client based on the environment

The example here is for creating an SQS client. Below are snippets most relevant to the topic.

In order to initialize the SQS service so that it can be injected into other services, we can do this:

lazy val awsSqsService: QueueService = createsSqsServiceFromConfig()

In createsSqsServiceFromConfig we check whether the configuration has a Localstack endpoint and, if so, we build a LocalStack client:

protected def createsSqsServiceFromConfig(): QueueService = {
  readSqsClientConfig().map { config =>
    val sqsClient: SqsClient = config.localstackEndpoint match {
      case Some(endpoint) => new LocalStackSqsClient(endpoint, config.region)
      case None => new AwsSqsClient(config.region)
    }
    new SqsQueueService(config.queueName, sqsClient)
  }.getOrElse(fakeAwsSqsService)
}

readSqsClientConfig reads the configuration values from the .conf files:

private def readSqsClientConfig = {
  val sqsName = config.get[String]("aws.sqs.my_queue")
  val sqsRegion = config.get[String]("aws.region")
  val localStackEndpoint = config.getOptional[String]("aws.localstack.endpoint")
  SqsClientConfig(sqsName, sqsRegion, localStackEndpoint)
}
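
(The SqsClientConfig case class isn’t shown here; it’s just a small holder for these values, presumably something along these lines:)

// Assumed shape of the config holder (not shown in the original code)
case class SqsClientConfig(
  queueName: String,
  region: String,
  localstackEndpoint: Option[String]
)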

Finally, the LocalStackSqsClient initialization looks like this:

class LocalStackSqsClient(endpoint: String, region: String) extends SqsClient with Logging {
  private val sqsEndpoint = new EndpointConfiguration(endpoint, region)
  private val awsCreds = new BasicAWSCredentials("test", "test")
  private lazy val sqsClientBuilder = AmazonSQSClientBuilder.standard()
    .withEndpointConfiguration(sqsEndpoint)
    .withCredentials(new AWSStaticCredentialsProvider(awsCreds))
  private lazy val client = sqsClientBuilder.build()

  override def BuildClient(): AmazonSQS = {
    log.debug("Initializing LocalStack SQS service")
    client
  }
}

Real AWS Client for the test/live environment (a snippet):

    AmazonSQSClientBuilder.standard()
      .withCredentials(new DefaultAWSCredentialsProviderChain)
      .withRegion(region)

Notice that we need a fake BasicAWSCredentials instance, which lets us pass in a dummy AWS access key and secret key, wrapped in an AWSStaticCredentialsProvider – an implementation of AWSCredentialsProvider that simply wraps static AWSCredentials. When the real AWS environment is used, we use DefaultAWSCredentialsProviderChain instead of AWSStaticCredentialsProvider; it falls back to the EC2 instance role if it cannot find credentials by any other means.
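
Putting that together, the real client presumably mirrors the LocalStackSqsClient above – something like this sketch (the full class isn’t shown in this post):

// Sketch of the real client, assumed rather than taken from the original code
class AwsSqsClient(region: String) extends SqsClient with Logging {
  private lazy val client = AmazonSQSClientBuilder.standard()
    .withCredentials(new DefaultAWSCredentialsProviderChain)
    .withRegion(region)
    .build()

  override def BuildClient(): AmazonSQS = {
    log.debug("Initializing AWS SQS service")
    client
  }
}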

And that’s it. Happy coding!

Unit testing 101 – mob rulz

In a recent developer forum I made the rather wild decision to try to demonstrate the principles of unit testing via an interactive mobbing session. I came prepared with some simple C# functions based around an ASP.NET Core API and said “let’s write the tests together”. The resultant session unfolded not quite how I anticipated, but it was still lively, fun and informative.

The first function I presented was fairly uncontentious – the humble fizzbuzz:

[HttpGet]
[Route("fizzbuzz")]
public string GetFizzBuzz(int i)
{
    string str = "";
    if (i % 3 == 0)
    {
        str += "Fizz";
    }
    if (i % 5 == 0)
    {
        str += "Buzz";
    }
    if (str.Length == 0)
    {
        str = i.ToString();
    }

    return str;
}

Uncontentious that was, until a bright spark (naming no names) piped up with questions like “Shouldn’t 6 return ‘fizzfizz’?”. Er… moving on…

I gave a brief introduction to writing tests using XUnit following the Arrange/Act/Assert pattern, and we collaboratively came up with the following tests:

[Fact]
public void GetFizzBuzz_FactTest()
{
    // Arrange
    var input = 1;

    // Act
    var response = _controller.GetFizzBuzz(input);

    // Assert
    Assert.Equal("1", response);
}

[Theory]
[InlineData(1, "1")]
[InlineData(2, "2")]
[InlineData(3, "Fizz")]
[InlineData(4, "4")]
[InlineData(5, "Buzz")]
[InlineData(9, "Fizz")]
[InlineData(15, "FizzBuzz")]
public void GetFizzBuzz_TheoryTest(int input, string output)
{
    var response = _controller.GetFizzBuzz(input);
    Assert.Equal(output, response);
}

So far so good. We had a discussion about the difference between “white box” and “black box” testing (where I nodded sagely and pretended I knew exactly what these terms meant before making the person who mentioned them provide a definition). We agreed that these tests were “white box” testing because we had full access to the source code and knew exactly which clauses we wanted to cover with our test cases. With “black box” testing we know nothing about the internals of the function, so we might attempt to break it by throwing large integer values at it, or by finding out whether we really do get back “fizzfizz” for an input of 6.

Moving on – I presented a new function which does an unspecified “thing” to a string. It does a bit of error handling and returns an appropriate response depending on whether the thing was successful:

[Produces("application/json")]
[Route("api/[controller]")]
[ApiController]
public class AwesomeController : BaseController
{
    private readonly IAwesomeService _awesomeService;

    public AwesomeController(IAwesomeService awesomeService)
    {
        _awesomeService = awesomeService;
    }

    [HttpGet]
    [Route("stringything")]
    public ActionResult<string> DoAThingWithAString(
        string thingyString)
    {
        string response;

        try
        {
            response = _awesomeService
                           .DoAThingWithAString(thingyString);
        }
        catch (ArgumentException ex)
        {
            return BadRequest(ex.Message);
        }
        catch (Exception ex)
        {
            return StatusCode(500, ex.Message);
        }

        return Ok(response);
    }
}

This function is not stand-alone but instead calls a function in a service class, which does a bit of validation and then does the “thing” to the string:

public class AwesomeService : IAwesomeService
{
    private readonly IAmazonS3 _amazonS3Client;

    public AwesomeService(IAmazonS3 amazonS3Client)
    {
        _amazonS3Client = amazonS3Client;
    }

    public string DoAThingWithAString(string thingyString)
    {
        if (thingyString == null)
        {
            throw new ArgumentException("Where is the string?");
        }

        if (thingyString.Any(char.IsDigit))
        {
            throw new ArgumentException(
                @"We don't want your numbers");
        }

        var evens = 
            thingyString.Where((item, index) => index % 2 == 0);
        var odds = 
            thingyString.Where((item, index) => index % 2 == 1);

        return string.Concat(evens) + string.Concat(odds);
    }
}

And now the debates really began. The main point of contention was around the use of mocking. We can write an exhaustive test for the service function to exercise all the if clauses and check that the right exceptions are thrown. But when testing the controller function should we mock the service class or not?
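
(We didn’t actually write the service-level test in the session, but a sketch using xUnit and Moq – with a dummy IAmazonS3, since DoAThingWithAString never touches S3 – might look something like this:)

public class AwesomeServiceTests
{
    private readonly AwesomeService _service;

    public AwesomeServiceTests()
    {
        // The S3 client is not used by DoAThingWithAString, so a Moq dummy is enough
        _service = new AwesomeService(new Mock<IAmazonS3>().Object);
    }

    [Fact]
    public void DoAThingWithAString_NullInput_ThrowsArgumentException()
    {
        Assert.Throws<ArgumentException>(
            () => _service.DoAThingWithAString(null));
    }

    [Fact]
    public void DoAThingWithAString_DigitsInInput_ThrowsArgumentException()
    {
        Assert.Throws<ArgumentException>(
            () => _service.DoAThingWithAString("abc123"));
    }

    [Theory]
    [InlineData("abcd", "acbd")]   // even-indexed chars first, then odd-indexed
    [InlineData("a", "a")]
    public void DoAThingWithAString_ReordersCharacters(
        string input, string expected)
    {
        Assert.Equal(expected, _service.DoAThingWithAString(input));
    }
}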

Good arguments were provided for the “mocking” and “not mocking” cases. Some argued that it was easier to write tests for lower level functions, and if you did this then any test failures could be easily pinned down to a specific line of code. Others argued that for simple microservices with a narrow interface it is sufficient to just write tests that call the API, and only mock external services.

Being a personal fan of the mocking approach, and wanting to demonstrate how to do it, I prodded and cajoled the group into writing these tests to cover the exception scenarios:

public class AwesomeControllerTests
{
    private readonly AwesomeController _controller;
    private readonly Mock<IAwesomeService> _service;

    public AwesomeControllerTests()
    {
        _service = new Mock<IAwesomeService>();
        _controller = new AwesomeController(_service.Object);
    }

    [Fact]
    public void DoAThingWithAString_ArgumentException()
    {
        _service.Setup(x => x.DoAThingWithAString(It.IsAny<string>()))
            .Throws(new ArgumentException("boom"));

        var response = _controller.DoAThingWithAString("whatever")
                                  .Result;

        Assert.IsType<BadRequestObjectResult>(response);
        Assert.Equal(400, 
            ((BadRequestObjectResult)response).StatusCode);
        Assert.Equal("boom", 
            ((BadRequestObjectResult)response).Value);
    }

    [Fact]
    public void DoAThingWithAString_Exception()
    {
        _service.Setup(x => x.DoAThingWithAString(It.IsAny<string>()))
            .Throws(new Exception("boom"));

        var response = _controller.DoAThingWithAString("whatever")
                                  .Result;

        Assert.IsType<ObjectResult>(response);
        Assert.Equal(500, ((ObjectResult)response).StatusCode);
        Assert.Equal("boom", ((ObjectResult)response).Value);
    }        
}

Before the session descended into actual fisticuffs I rapidly moved on to discuss integration testing. I added a function to my service class that could read a file from S3:

public async Task<object> GetFileFromS3(string bucketName, string key)
{
    var obj = await _amazonS3Client.GetObjectAsync(
        new GetObjectRequest 
        { 
            BucketName = bucketName, 
            Key = key 
        });

    using var reader = new StreamReader(obj.ResponseStream);
    return reader.ReadToEnd();
}

I then added a function to my controller which called this and handled a few types of exception:

[HttpGet]
[Route("getfilefroms3")]
public async Task<ActionResult<object>> GetFile(string bucketName, string key)
{
    object response;

    try
    {
        response = await _awesomeService.GetFileFromS3(
                             bucketName, key);
    }
    catch (AmazonS3Exception ex)
    {
        if (ex.Message.Contains("Specified key does not exist") ||
            ex.Message.Contains("Specified bucket does not exist"))
        {
            return NotFound();
        }
        else if (ex.Message == "Access Denied")
        {
            return Unauthorized();
        }
        else
        {
            return StatusCode(500, ex.Message);
        }
    }
    catch (Exception ex)
    {
        return StatusCode(500, ex.Message);
    }

    return Ok(response);
}

I argued that here we could write a full end-to-end test which read an actual file from an actual S3 bucket and asserted some things on the result. Something like this:

public class AwesomeControllerIntegrationTests : 
    IClassFixture<WebApplicationFactory<Api.Startup>>
{
    private readonly WebApplicationFactory<Api.Startup> _factory;

    public AwesomeControllerIntegrationTests(
        WebApplicationFactory<Api.Startup> factory)
    {
        _factory = factory;
    }

    [Fact]
    public async Task GetFileTest()
    {
        var client = _factory.CreateClient();

        var query = HttpUtility.ParseQueryString(string.Empty);
        query["bucketName"] = "mybucket";
        query["key"] = "mything/thing.xml";
        using var response = await client.GetAsync(
            $"/api/Awesome/getfilefroms3?{query}");
        using var content =  response.Content;
        var stringResponse = await content.ReadAsStringAsync();

        Assert.NotNull(stringResponse);
    }
}

At this point I was glad that the forum was presented as a video call because I could detect some people getting distinctly agitated. “Why do you need to call S3 at all?” Well maybe the contents of this file are super mega important and the whole application would fall over into a puddle if it was changed? Maybe there is some process which generates this file on a schedule and we need to test that it is there and contains the things we are expecting it to contain?

But … maybe it is not our job as a developer to care about the contents of this file and it should be some other team entirely who is responsible for checking it has been generated correctly? Fair point…

We then discussed some options for “integration testing”, including standing up a local instance of AWS, or building a local database in Docker and testing against that.

And then we ran out of time. I enjoyed the session and I hope the other participants did too. It remains to be seen whether I will be brave enough to attempt another interactive mobbing session in this manner…

Spooky season special – tales of terrors and errors

Anyone who has been working in software development for more than a few months will know the ice-cold sensation that creeps over you when something isn’t working and you don’t know why. Luckily, all our team members have lived to tell the tale, and are happy to share their experiences so you might avoid these errors in future… 

The Legend of the Kooky Configuration – Rhys Parsons
In my first job, in the late 90s, I was working on a project for West Midlands Fire Service (WMFS). We were replacing a key component (the Data Flow Controller, or DFC) that controlled radio transmitters and was a central hub (GD92 router) for communicating with appliances (fire engines). Communication with the Hill Top Sites (radio transmitters) was via an X.25 network.

The project was going well, we had passed the Factory Acceptance Tests, and it was time to test it on-site. By this point, I was working on the project on my own, even though I only had about two years of experience. I drove down to Birmingham from Hull with the equipment in a hired car, a journey of around 3.5 hours. The project had been going on for about a year by this point, so there was a lot riding on this test. WMFS had to change their procedures to mobilise fire engines via mobile phones instead of radio transmitters, which, back in the late 90s, was quite a slow process (30 seconds call setup). I plugged in the computers and waited for the Hill Top Sites to come online. They didn’t. I scratched my head. A lot. For an entire day. Poring over code that looked fine. Then I packed it all up and drove back to Hull.

Back in the office, I plugged in the computer to test it. It worked immediately! Why?! How could it possibly have worked in Hull but not in Birmingham! It made absolutely no sense!

I hired a car for the next day and drove back down to Birmingham early, aiming to arrive just after 9, to avoid the shift change. By this point, I was tired and desperate.

I plugged the computer back in again. I had made absolutely no changes, but I could see no earthly reason why it wouldn’t work. “Sometimes,” I reasoned, “things just work.” That was my only hope. This was the second day WMFS were using the slower backup communications. One day was quite a good test of their resilience. Two days were nudging towards the unacceptable. Station Officers were already complaining. I stared at the screen, willing the red graphical LEDs to turn green. They remained stubbornly red. At the end of the day, I packed up the computer and drove back to Hull.

The WMFS project manager phoned my boss. We had a difficult phone conversation, and we decided I should go again the next day.

Thankfully, a senior engineer who had the experience of X.25 was in the office. I told him of this weird behaviour that made no sense. We spoke for about two minutes which concluded with him saying, “What does the configuration look like?”

My mouth dropped. The most obvious explanation. I hadn’t changed the X.25 addresses! I was so busy wondering how the code could be so weirdly broken that I hadn’t considered looking at the configuration. So, so stupid! I hadn’t changed the configuration since I first set up the system, several months earlier; it just wasn’t in my mind as something that needed doing.

Day three. Drove to Birmingham, feeling very nervous and stupid. Plugged in the computer. Changed the X.25 addresses. Held my breath. The graphical LEDs went from red to orange, and then each Hill Top Site went green in turn, as the transmit token was passed back and forth between them and the replacement DFC. Finally, success!

A Nightmare on Character Street – Rosie Chandler
We recently implemented a database hosted in AWS, with the password stored in the AWS Secrets Manager. The password is pulled into the database connection string, which ends up looking something like this:

“Server=myfunkyserver;Port=1234;Database=mycooldatabase;User ID=bigboss;Password=%PASSWORD%”

Where %PASSWORD% here is substituted with the password pulled out of the Secrets Manager. We found one day that calls to the database started throwing connection exceptions after working perfectly fine up until that point. 

After spending a lot of time scratching my head and searching through logs, I decided to take a peek into the Secrets Manager to see what the password was. Turns out that day’s password was something like =*&^%$ (note it starts with “=”) which means that the connection string for that day was invalid. After much facepalming, we implemented a one-line fix to ensure that “=” was added to the list of excluded characters for the password.

The Case of the Phantom Invoices – Chris Rimmer
Many years ago I was responsible for writing code that would email out invoices to customers. I was very careful to set things up so that when the code was tested it would send messages to a fake email system, not a real one. Unfortunately, this wasn’t set up in a very fail-safe way, meaning that another developer managed to activate the email job in the test system and sent real emails to real customers with bogus invoices in them. This is not the sort of mistake you quickly forget. Since then I’ve been very careful configuring email systems in test environments so that they can only send emails to internal addresses.


Tales from the Dropped Database – Rich Brown
It was a slow and rainy Thursday morning, I was just settling into my 3rd cup of coffee when a fateful email appeared with the subject ‘Live site down!!!’

Ah, of course, nothing like a production issue to kick-start your morning. I checked the site: it was indeed down. Sadly, the coffee would have to wait.

Logging onto the server, I checked the logs. A shiver ran down my spine.

ERROR: SQL Error – Table ‘users’ does not exist

ERROR: SQL Error – Table ‘articles’ does not exist

ERROR: SQL Error – Table ‘authors’ does not exist

ERROR: SQL Error – Database ‘live-db’ does not exist

That’s…. unusual…

Everything was working and then suddenly it stopped, no data existed.

Hopping onto the database server showed exactly that. Everything was gone, every row, every table, even the database itself wasn’t there.

I pulled in the rest of the team and we scratched our collective heads: how could this even happen? The database migration system shouldn’t be able to delete everything. We’ve taken all the right mitigations to prevent injection attacks. There’s no reason for our application to do this.

I asked, “What was everyone doing when the database disappeared?”

Dev 1 – “Writing code for this new feature”

Dev 2 – “Updating my local database”

Dev 3 – “Having a meeting”

Ok, follow up question to Dev 2 – “How did you update your database?”

Dev 2 – “I dropped it and let the app rebuild it as I usually do”

Me – “Show me what you did”

ERROR: SQL Error – Cannot drop database ‘live-db’ because it does not exist

Turned out Dev 2 had multiple SQL Server Manager instances open, one connected to their local test database and the other connected to the live system to query some existing data.

They thought they were dropping their local database and ended up dropping live by mistake.

One quick database restore and everything was back to normal.

Moral of the story: the principle of least privilege. If you have a user who only needs to read data, only grant permissions to read data.

Wortüberbreitedarstellungsproblem

Don’t worry if you don’t understand German, the title of this post will make sense if you read on…

We’ve been working for the last few years with De Gruyter to rebuild their delivery platform. This has worked well and we have picked up an award along the way. Part of our approach has been to push out new features and improvements to the site on a weekly basis. Yesterday we did this, deploying a new home page design that has been a month or two in the making. The release went fine, but then we started getting reports that the new home page didn’t look quite right for users on iPhones and iPads. I took a look – it seemed fine on my Android phone and on my daughter’s iPhone. A developer based in India looked on his iPhone with different browsers and everything was as expected. But somehow German users were seeing text that overflowed the edge of the page. So what was going on – how could German Apple devices be so different? Most odd.

It turned out that the problem was not a peculiarity of the German devices, but of the German language. German is famous for its long compound words (like the title of this post) and often uses one big word where English would use a phrase. Our new homepage includes a grid of subjects that are covered in books that De Gruyter publishes. In English these subjects mostly have quite short names, but in German they can be quite long. For smaller screens the subject grid would shift from three to two columns, but even so this was not enough to accommodate the long German words, meaning the page overflowed.

Subject grid in German

The fix was quite a simple one, for the German version of the page the grid would shift to two columns more readily and then to a single column for a phone screen. But I think the lesson is that there is more to catering for different languages than checking the site looks fine in English and that all the text has been translated. The features of the target language can have unexpected effects and need checking. It’s easy to overlook this when dealing with two languages that are apparently quite similar.

On a similar note, it can be easy to assume that your site is easy to use just because you understand it, or to believe it is accessible to those using a screen reader just because you have added alt text to images. Just because it works for you doesn’t mean it works for others, and that always needs bearing in mind.

Finally, that title? It translates as something like “Overly wide word display problem” and was suggested by someone at De Gruyter as a German compound word to describe the problem we saw.

Embracing Impermanence (or how to check my sbt build works)

Stable trading relationships with nearby countries. Basic human rights. A planet capable of sustaining life. What do these three things have in common?

The answer is that they are all impermanent. One moment we have them, the next moment – whoosh! – they’re gone.

Today I decided I would embrace our new age of impermanence insofar as it pertains to my home directory. Specifically, I wondered whether I could configure a Linux installation so that my home directory was mounted in a ramdisk, created afresh each time I rebooted the server.

Why on earth would I want to do something like that?

The answer is that I have a Scala project, built using sbt (the Scala Build Tool), and I thought I’d clear some of the accumulated cruft out of the build.sbt file, starting with the configured resolvers. These are basically the repositories which will be searched for the project’s dependencies – there were a few special case ones (e.g. one for JGit, another for MarkLogic) and I strongly suspected that the dependencies in question would now be found in the standard Maven repository. So they could probably be removed, but how to check, since all of the dependencies would now exist in caches on my local machine?
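
For context, the resolvers in question looked something like this in build.sbt (the URLs here are illustrative rather than copied from the real project):

// Special-case resolvers of the kind I suspected were no longer needed
resolvers ++= Seq(
  "Eclipse JGit releases" at "https://repo.eclipse.org/content/groups/releases/",
  "MarkLogic releases"    at "https://developer.marklogic.com/maven2/"
)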

A simple solution would have been to delete the caches, but that involves putting some effort into finding them, plus I have developed a paranoid streak about triggering unnecessary file writes on my SSD. So I had a cunning plan – build a VirtualBox VM and arrange for the home directory on it to be a ramdisk, thus I could check the code out to it and verify that the code will build from such a checkout, and this would then be a useful resource for conducting similar experiments in the future.

Obviously this is not quite a trivial undertaking, because I need some bits of the home directory (specifically the .ssh directory) to persist so I can create the SSH keys needed to authenticate with GitHub (and our internal GitLab). Recreating those each time the machine booted would be a pain.

After a bit of fiddling, my home-grown solution went something like this:

  • Create a Virtualbox VM, give it 8G memory and a 4G disk (maybe a bit low if you think you’ll want docker images on it; I subsequently ended up creating a bigger disk and mounting it on /var/lib/docker)
  • Log into VM (my user is daniel), install useful things like Git, curl, zip, unzip etc.
  • Create SSH keys, upload to GitHub / GitLab / wherever
  • Install SDKMAN! to manage Java versions
  • Create /var/daniel and copy into it all of the directories and files in my home directory which I wanted to be persisted; these were basically .ssh for SSH keys, .sdkman for java installations, .bashrc which now contains the SDKMAN! init code, and .profile
  • Save the following script as /usr/local/bin/create_home_dir.sh – this wipes out /home/daniel, recreates it and mounts it as tmpfs (i.e. a ramdisk) and then symlinks into it the stuff I want to persist (everything in /var/daniel)
#!/bin/bash
DIR=/home/daniel

# If the home directory is already mounted, unmount it first
mount | grep $DIR && umount $DIR

# Remove any existing directory and recreate it empty
[ -d $DIR ] && rm -rf $DIR

mkdir $DIR

# Mount the fresh directory as a ramdisk and hand it over to the user
mount -t tmpfs tmpfs $DIR

chown -R daniel:daniel $DIR

# Symlink in everything that should persist across reboots
ls -A /var/daniel | while read FILE
do
  sudo -u daniel ln -s /var/daniel/$FILE $DIR/
done
  • Save the following as /etc/systemd/system/create-home-dir.service
[Unit]
Description=Create home directory

[Service]
ExecStart=/usr/local/bin/create_home_dir.sh

[Install]
WantedBy=multi-user.target
  • Enable the service with systemctl enable create-home-dir
  • Reboot and hope

And it turns out that this worked; when the server came back I could ssh into it (i.e. the authorized_keys file was recognised in the symlinked .ssh directory) and I had a nice empty workspace; I could git clone the repo I wanted, then build it and watch all of the dependencies get downloaded successfully. I note with interest that having done this, .cache is 272M and .sbt is 142M in size. That seems to be quite a lot of downloading! But at least it’s all in memory, and will vanish when the VM is switched off…

STEM Ambassadors In the Field

(Or a fun way to introduce local kids to programming)

A previous employer encouraged me to join the STEM Ambassador program at the end of 2017 (https://www.stem.org.uk/stem-ambassadors) and I willingly joined, wanting to give something back to society. The focus of the program is to send ambassadors into schools and local communities, to act as role models and to demonstrate to young people the benefits and rewards that studying STEM subjects can bring. I approached my local primary school (at the time my daughter was a pupil there) about the possibility of setting up an after-school computing club, and they jumped at the chance.

I started the club unsure what to expect, but with a lot of hope and some amount of trepidation. I took on groups of 10 or so KS2 pupils, teaching them the basics of loops, events, variables and functions, largely using Scratch (https://scratch.mit.edu/) and an eclectic mix of programmable robots that I’d acquired over the years (I have a few from Wonder Workshop https://www.makewonder.com/robots/ and also a pair of Lego Boost robots https://www.lego.com/en-gb/product/boost-creative-toolbox-17101). Running the club was extremely rewarding. Some of the kids were brilliant, and will no doubt have a great future ahead of them. Others mainly wanted only to drive the robots around – but I figured that as long as they were having fun then their and my time was well spent.

Then, in March 2020, the Covid-19 pandemic hit. Kids were sent home for months, and all clubs were cancelled, with no knowing when they might start up again. The pandemic has obviously been tough for everyone, but one of the hidden effects has been the impact on the education of our children. It will take years, probably, to know exactly what effect two years of lockdown has had on the attainment opportunities and mental health of young people. Many of them missed out not only on in-person schooling, but also on the additional extra-curricular opportunities like school visits and things like the STEM Ambassador program.

So now, two years and a change of jobs later, I thought it was about time I got myself back in the field and started up my STEM activities again.

My first opportunity has been to run a “retro games arcade” stall at the school’s summer fair. This involved commandeering a tiny wooden cabin plonked the wrong-way-round on the edge of the school field, next to one of the temporary classrooms. To turn this into a games arcade I needed to black out the windows to make it dark enough inside to see a computer screen, then to run an extension lead out of the window of the classroom, and to quietly steal a few chairs and tables upon which to set up my “arcade consoles”. Blacking out the windows was achieved by covering them up with garden underlay and sticking drawing pins around the edges (much to the detriment of my poor thumbs).

The field and cabin in which I did my STEM Ambassadoring, with the (mostly willing) assistance of my daughter

For the arcade machines, I wrote two games in Scratch based around the classic arcade games “Defender” and “Frogger”. I set up two laptops to run these games, covering over all but the arrow keys, trackpad and spacebar with shiny card. My aim was to write games that the students could replicate themselves, if they wished. I wanted games simple enough that a small child could play, but would also be fun for an older child or a parent to play as well. The gameplay should ideally last for 1-4 minutes, and the player should be able to accumulate a high score. As the afternoon progressed I kept track of the highest two scores in each game so that the players with these scores could win a prize at the end of the afternoon.

If you’re interested in seeing these games then you can have a look here:

Defender: https://scratch.mit.edu/projects/711066827/

Frogger: https://scratch.mit.edu/projects/317968991/

Of course the afternoon in question was one of the hottest days of the year. I spent 3 hours diving into and out of the tiny sweltering cabin, caught between managing the queue, taking the 50p fee, handing out Pokémon cards to the players (I got a stack of them and gave one out to every player), explaining to the kids how to play the games, and keeping track of the ever-changing high scores. I did have willing help from my daughter (who especially liked taking the money) and my husband (who seemed adept at managing the queue). At some point I managed to eat a burger and grab a drink, but it was a pretty frenetic afternoon.

67 Bricks agreed to give me £50 to pay for prizes. I bought Sonic and Mario soft toys, a Lego Minecraft set, and a large pack of assorted Pokémon cards. I also washed up a Kirby soft toy that I found in my daughter’s “charity shop” pile and added that to the prize pool. Throughout the afternoon I kept track of the top two highest scores in both games, using the incredibly high-tech method of a white-board and dry-wipe marker. The hardest part was figuring out how to spell everyone’s name, and in moving the first-place score to second place every time a high-score was beaten. Oh, and making sure the overly-enthusiastic children didn’t wander off with poor Mario before the official prize giving ceremony.

As the afternoon progressed I encountered some kids who aced the games, and actively competed with each other to keep their place at the top of the leader board. Other children struggled to control the game and I had to give them a helping hand (quite literally – I said I would control the cursor keys while they controlled the space bar). And then there was the dad who was determined to win a prize for his child, and kept returning to make sure of his position on the leader board. But eventually the last burger was eaten, the arcade was closed, and the prizes announced. Four children went home happily clutching their prizes and the rest their collection of assorted Pokémon cards.

For the next step in my STEM Ambassador journey, I have agreed to start up the computing club again in September. I’m hoping to teach the children the skills to write their own arcade games in Scratch. Watch this space.

Things Customers Don’t Understand About Search

Note: this post is based on a dev forum put together by Chris.

Full-text search is a common feature of the systems 67 Bricks build. We want to make it easy for users to find relevant information quickly, often through a faceted search function, so understanding user needs and building a top-notch user experience is vital. When building faceted search, we generally use either ElasticSearch (or AWS’s OpenSearch) or MarkLogic. Both databases offer very similar feature sets when it comes to search, though one is more targeted towards JSON documents and the other towards XML.

Search can seem magical at first glance and do some amazing things, but this can lead to situations where customers (and UX/UI designers) assume the search mechanism can do more than it can. We frequently find a disconnect between what is desired and what is feasible with search systems.

The 2 main categories of problems we often see are:

  1. Customers / Designers asking for things that could be done, but often come with nasty performance implications
  2. Features that seem reasonable to ask for at first glance, but once dug into reveal logical problems that make developing the feature near impossible

Faceted Search

Faceted search systems are some of the most common systems we build at 67 Bricks. The user experience typically starts with a search box into which the user enters a number of terms; they hit enter and are presented with a list of results in some kind of relevancy order. Often there is a count of results displayed alongside a pagination mechanism for iterating through the results (e.g. showing results 1-10 of 12,464). We also show facets: counts of how many results fall into different buckets. All this is handled in a single query that often takes less than 100ms, which seems miraculous. Of course, this isn’t magic – full-text search systems use a variety of clever indexes to make searching and computing the facet counts quick.
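
As a rough illustration, a single Elasticsearch request can return the hits, the total count and the facet (aggregation) counts in one go – something like the sketch below, where the index and field names are invented for the cottage example that follows:

POST /cottages/_search
{
  "query": {
    "bool": {
      "must":   [ { "match": { "description": "sea view" } } ],
      "filter": [ { "term": { "bedrooms": 2 } } ]
    }
  },
  "aggs": {
    "bedrooms": { "terms": { "field": "bedrooms" } },
    "features": { "terms": { "field": "features" } }
  },
  "from": 0,
  "size": 10
}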

Let’s make a search system for a hypothetical website, cottagesearch.com. Our first screen will present the user with some options to select a location, the date range they want to stay and how many guests are coming. We perform the search and show the matching results. How should we display the results and, more importantly, how do we show the facets?

Let’s say we did a search for 2 bedroom cottages. We’ve seen wireframes on a number of occasions where the facet counts for all bedroom numbers are displayed, so users see the number of results for each bedroom count that they would get if they didn’t limit the search to just 2 bedrooms (i.e. there aren’t that many 2 bed options, but look at how many 3 bed options are available). At first glance this seems like a sensible design, but it fundamentally breaks how search systems work with faceting: they will return counts, but only for the search just done.

We could get around this by doing 2 searches: one limited by bedrooms, and one without that limit to retrieve the facet counts. This may seem like a sensible idea when we have 1 facet, but what do we do when we have more? Do we need to do multiple searches, effectively creating an N+1 problem? How do we display the numbers? Should the counts for the location facet include the limit on bedrooms or not? As soon as we start exploring additional situations, we start to see the challenges the original design presents.

This gets harder when we consider non-exclusionary facets. Let’s say our cottage search system lets you filter by particular features, such as a wood burner, hot tub or dishwasher. Now, if we show counts for non-selected facets, what do these numbers represent? Do they include results that already match the selected facet or not? Here the logic starts to break down, becoming ever more confusing to the end user and ever more difficult for the developer to implement.

Other Complex Facet Situations

A common question we need to ask with non-exclusionary facets: is it an AND or an OR search? The answer is very domain dependent, but either way we suggest steering away from facet counts in these situations.

Date ranges provide an interesting problem: some sites will purposefully search outside of the selected range so as to provide results near the selected dates. This may be useful or annoying depending on what the user expects and is trying to achieve – some users want exact matches and have no interest in results that fall outside the selected date range.

Ordering facets is also a question that may be overlooked. Do you order lexicographically, or by descending number of matches? What about names, year ranges or numeric values? Again, a lot of what users expect and want comes down to the domain being dealt with and the needs of the users.

When a user selects a new facet, what should the UI do? Should the search immediately rerun and the results and facets update, or should there be a manual refresh button the user has to press before the search is updated? An immediate refresh is slower but lets users narrow down carefully, while a manual update reduces the number of searches performed – though users may then be able to select a combination of facets that returns no results at all.

Hierarchies can also prove tricky. We often see taxonomies being used to inform facets, say subjects with sub-categories. How should these be displayed? Again there are many solutions to pick from, with different sets of trade-offs.

Advanced Search

Advanced search can often be a bit like a peacock’s tail – something that looks impressive, but doesn’t contribute a fair share of value given how much effort it takes to develop. A lot of designers and product owners love the idea of it, but in practice it can end up being somewhat confusing to use, and many end users end up avoiding it.

Boolean builders exist in many systems: the designer of an advanced search will insist on allowing users to build up a complex search with lots of boolean AND/OR options, but displaying this to users in a way they can understand is challenging. If a user builds a boolean search such as GDP AND Argentina OR Brazil, do we treat it as (GDP AND Argentina) OR Brazil, or should it be interpreted as GDP AND (Argentina OR Brazil)? We could include brackets in the builder, but this just further complicates the UI.

We frequently get bugs and feedback on advanced search; some of this feedback amounts to different users having contradictory opinions on how it should work. We would ask product owners to carefully consider “How many people will actually use it?” Google has a well-built UI for advanced search that does away with the challenges of boolean logic by having separate fields for ANDs, ORs and NOTs.

Google advanced search UI

An advanced search facility can introduce additional complexity when combined with facets. If an advanced search lets you select some facets before running the search, does this form part of the string in the search box? We have had mixed results with enabling power users to enter facets into search fields (e.g. bedrooms:3); some users can deal with it, but others may prefer an advanced search builder, while others again will rely on facets after the search.

Summary

In conclusion, we have 3 main takeaways:

  • Search is much more complex than it first appears
  • Facets are not magic – just because you can draw a nice wireframe doesn’t make it feasible to develop
  • Advanced search can be tricky to get right and even then, only used by a minority of users

We’ve built many different types of search and have experimented with a number of approaches in the past, and we can offer some tried and tested principles:

  • Make searches stateless – Don’t add complexity by trying to maintain state between facets changes, simply treat each change as a fresh search. That way URLs can act as a method of persistence and bookmarking common searches.
  • Have facets only display counts for the current search and do not display counts for other facets once one has been selected within that category.
  • Only use relevancy as the default ordering mechanism – you may be tempted to allow results to be ordered in different ways, such as by published date, but this can cause problems with weakly matching but recent results appearing first.
  • Don’t build an advanced search unless you really need to and if you have to, use a Google style interface over a boolean query builder.
  • Check that search is working as expected – Have domain experts check that searches are returning sensible results and look into using analytics to see if users are having a happy journey through the application (i.e. run a search and then find the right result within the first few hits).
  • Beware of exhaustive search use cases – As many search mechanisms work on some score based on relevancy to the terms entered, having a search that guarantees a return of everything relevant can be tricky to define and to develop.

My Journey to Getting AWS Certified

When I joined 67 Bricks in January 2021 I knew close to zero about AWS, and not-a-lot about cloud services in general. I had dabbled a bit in Azure in my previous job, and I understood the fundamentals of what “the cloud” was, but I was very aware that I’d have to get up to speed if I wanted to be useful at developing applications on AWS. I joined our team on the EIU project, and on day one I was exposed to discussions about S3 buckets, lambda functions, glue jobs and SNS topics – all things I knew nothing about.

I asked one of the EIU enablement team to give me an overview, and I was introduced to the AWS console and shown some of the key services. Over the next few months I gradually started to get to grips with the basics – I learned how to upload objects to and download them from S3, write to and query a DynamoDB table, and search for things in CloudWatch. I was very proud when I wrote my first lambda function, but I still felt like I was winging it.

I was encouraged by our development manager to look into obtaining some AWS certifications. The obvious starting point was Cloud Practitioner (https://aws.amazon.com/certification/certified-cloud-practitioner/?ch=sec&sec=rmg&d=1) which covers the basics of what “the cloud” is, and the applications of core AWS services. The best course I found to prepare for this was one from Amazon themselves https://explore.skillbuilder.aws/learn/course/134/aws-cloud-practitioner-essentials (you might need to sign in to the skill builder to access it, but the course is free). It uses the analogy of a coffee shop to explain the concepts of instances, scaling, load balancing, messaging and queueing, storage, networking etc, in an easy to understand manner. After a lot of procrastinating, and wondering if I was ready, I eventually took the exam in October 2021 and passed it with a respectable score.

The cloud practitioner course covers AWS services in an abstract manner – you learn about the core services without ever having to use them. In fact you could probably pass the course without ever logging into the AWS console. To demonstrate real experience and knowledge of AWS services, I decided that the certification to go for next was Developer Associate (https://aws.amazon.com/certification/certified-developer-associate/?ch=sec&sec=rmg&d=1). AWS doesn’t offer their own course to study for this certification – instead they provide links to numerous white papers, which make for fairly dry reading, and it is not clear exactly what knowledge is and is not required.

After doing a bit of research I decided that this course on Udemy https://www.udemy.com/course/aws-certified-developer-associate-dva-c01/ by Stephane Maarek was the most highly rated. With 32 hours of videos to absorb, this was not a trivial undertaking, but after slotting in a few hours of study either before work or in the evenings, I made it through with two books stuffed with notes.

The Developer Associate certification requires you to understand at a fairly deep level how the AWS compute, data, storage, messaging, monitoring and deployment services work, and also to understand architectural best practices, the AWS shared responsibility model, and application lifecycle management. A typical exam question for Developer Associate might ask you to calculate how many read-capacity-units or write-capacity-units a DynamoDB table consumes under various circumstances. Another one might test your understanding of how many EC2 instances a particular auto-scaling policy would add or remove. Another question might require you to understand what lambda concurrency limits are for.
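
To give a flavour of the arithmetic involved (my own illustrative numbers, not an actual exam question): one RCU covers one strongly consistent read per second of an item up to 4 KB, so reading 10 items of 6 KB each per second with strongly consistent reads needs 10 × ⌈6 ÷ 4⌉ = 20 RCUs, and half that with eventually consistent reads.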

After working my way through a number of practice exams (the best ones seem to be by Jon Bonzo, again on Udemy https://www.udemy.com/course/aws-certified-developer-associate-practice-exams-amazon/) I took the plunge and sat the exam in January 2022, again passing with a respectable score.

But what next? The knowledge I’d gained up until this point had given me real practical skills, and a deeper knowledge of how the various AWS services connect together. For example, it was no longer a mystery how lambda functions could be triggered by SNS topics or messages from an SQS queue, and could then call another API perhaps hosted on EC2 to initiate some other process. And I could understand how to utilise infrastructure-as-code (e.g. CloudFormation or CDK) along with services like CodePipeline and CodeDeploy, to automate build processes. But I wanted a greater understanding of the “bigger picture”, and so next I chose to go for the Solutions Architect Associate certification (https://aws.amazon.com/certification/certified-solutions-architect-associate/?ch=sec&sec=rmg&d=1).

The Solutions Architect Associate exam typically presents a scenario and then asks you to choose which option provides the best solution. One option is usually wrong, but there could be more than one solution which would work – but you have to scrutinise the question to see which one best meets the requirements of the scenario. Are they asking for the cheapest solution? Or the fastest? Or the most fault tolerant? (Look for clues like “must be highly available” – and so the correct answer will probably involve multi-AZ deployments). Is any down-time acceptable? Is data required in real time, or is a delay acceptable? (E.g. do we choose Kinesis or SQS?) If a customer is migrating to the cloud are there time constraints, and how much data is there to migrate? (E.g. it can take a month or two to set up a Direct Connect connection, but you could have a Snowmobile in a week. A VPN might work but there are limits to the data transfer rates).

Again, I chose Stephane Maarek’s course on Udemy (https://www.udemy.com/course/aws-certified-solutions-architect-associate-saa-c02/) – his study materials are clear and he also notes which sections are duplicates of those in the developer associate course. I again used Jon Bonzo’s practice exams (https://www.udemy.com/course/aws-certified-solutions-architect-associate-amazon-practice-exams-saa-c03/). There is a fairly hard-core section on VPC, which is something I struggled with. Stephane presents a spaghetti-like diagram showing the relationship between VPCs, public and private subnets, internet gateways, NAT gateways, security groups, route tables, on-premise set-ups, VPC endpoints, transit gateways, direct connections, VPC peering connections etc – and says “by the end of this section you’ll know what all of this means”. He was right, but as someone with limited networking experience and knowledge, I found it pretty tough.

I sat the exam in April 2022, a day before I figured out that the cough and fatigue I’d developed was actually Covid. I passed the exam respectably again, and then collapsed into bed for a few days to recover.

At this point it’s probably worth mentioning how the exam process works. If you like, you can book an exam in an approved test centre. However, I chose to go with the “online proctored” exams hosted by Pearson Vue. You book an exam slot – generally plenty are available at all times of the day and night, and you can usually find a slot within the next day or two that suits. For the exam you need to be sitting at a clear table with nothing within arms reach. Not even a tissue or a glass of water. You need to run some Pearson software on your laptop that checks no other processes are running (so turn off slack, email, shut down your docker containers etc etc), and then launches their exam platform. You will be asked to present photo ID, and then show the proctor your testing environment. They will want to see your chair and table from all angles, and will want to see your arms to make sure you’re not wearing a watch or have anything hiding up your sleeves. You need your mobile phone in the room, but out of reach, in case they need to call you. And you also need to make sure you are undisturbed in the room for the duration of the exam (which is typically 2-3 hours).

This last point was challenging for me. My home-office is not suitable – being far too crammed with potential cheat material – and I also share it with my husband. The only suitable place is my dining table, in the very open-plan ground floor of my house. Finding a time when I can have the ground floor to myself for 2-3 hours means scheduling the exam for around 7AM on a day when the kids are not at school. I ended up putting “do not disturb” signs on the door and issuing dire warnings to everyone that they mustn’t come downstairs until I’d given them the all-clear. Anyone wandering sleepily through the room on a quest for coffee could result in the exam proctor dropping my connection and disqualifying me from the exam. Fortunately, all was well and all the exams I’ve sat so far were carried out without incident.

After obtaining the Solutions Architect Associate certification I thought about taking a break. But then I took a look at the requirements for SysOps Administrator Associate (https://aws.amazon.com/certification/certified-sysops-admin-associate/?ch=sec&sec=rmg&d=1) and realised that I’d already covered about two-thirds of the required material. Now SysOps is not something I have a love for. I have a deep respect for people who understand deployments and pipelines and infrastructure. The Enablement team at the EIU, who I work closely with, are miracle workers who regularly perform magic to get things up and running. The idea that I could learn some of that wizardry seemed far-fetched. But I thought I might as well give it a go.

The SysOps certificate focusses a lot on configuration and monitoring. You learn a lot about load balancers, autoscaling policies, CloudFormation and CloudWatch. And yes, all that indepth knowledge about VPCs and hybrid-cloud set-ups is applicable here too. A typical exam question will present a scenario where something has gone wrong, and you have to pick the best option to fix it. For example, someone can’t SSH into an EC2 instance because something is wrong with the security group. Or someone in a child account of a parent organisation can’t access something in another child account. Yet again I went to Stephane Maarek’s course, which was again excellent (https://www.udemy.com/course/ultimate-aws-certified-sysops-administrator-associate/). And Jon Bonzo again provided the practice exams (https://www.udemy.com/course/aws-certified-sysops-administrator-associate-practice-exams-soa-c01/).

I sat the SysOps exam in June 2022. One thing that caused a little trepidation was that this exam includes “exam labs” – these are practical exercises carried out in the AWS console. It was hard to prepare for these because I could not find any practice labs on-line, and so I was going in cold. However, it turned out that the labs were well defined with clear steps on what was required. Even the ones where I had never really looked at the service before, I was able to find it in the console and figure out what I needed to do. I was asked to:

  • Create a backup plan for an EFS system with two types of retention policy
  • Update a CloudFormation stack to fiddle with some EC2 settings, roles, route tables etc
  • Create an S3 static website and configure some Route 53 failover policies

The second of these caused me the most difficulty – I hadn’t anticipated actually having to write a CloudFormation template – they provided one which I needed to edit and it took me a while to figure out how to actually do this. Turns out that you need to save a new version of the template locally and then re-upload it.

I passed the SysOps exam with a more modest mark than for the other certifications, and I definitely breathed a sigh of relief. I am now definitely taking a breather – perhaps in a few months I might take a look at some of the specialist certifications (maybe Data Analytics?) but for the moment I’m going to get back to some of my other neglected hobbies (I like to draw, and play the piano, and one day I’ll maybe finish my epic fantasy trilogy).

The key take-aways from my experience are:

  • The associate level certifications require you to acquire knowledge that is directly applicable in the day-to-day life of a developer or systems administrator.
  • I was initially concerned that the courses would be part of an AWS propaganda machine, encouraging us to spend ever larger amounts on AWS services. I found this not to be the case at all. Quite a large part of the material teaches you how to save costs, and how to incorporate your existing on-premises infrastructure with AWS rather than replacing it entirely.
  • Sitting an exam in your own home is definitely preferable to travelling to a test centre: you get far more flexibility over when you can take the exam. However, not everyone will have a suitable place at home, particularly if you share your home with other people or do not have a suitable table to work at.
  • Studying for these certifications will require a significant time commitment. The online courses run for 20-30 hours or more, assuming you never pause the videos to take notes, or repeat a section. And that is before you take time to revise or do practice exams.
  • The most valuable preparation by far is completing as many practice exams as you can find. The best ones include detailed explanations of why a particular answer is correct and the others are wrong.
  • Also note that these certifications have an expiry date (typically three years), and the exams themselves are refreshed periodically. For example, the Solutions Architect Associate is being refreshed at the end of August and the Solutions Architect Professional in November.

Dev Forum – Parsing Data

Last Friday we had a dev forum on parsing data, which came about because some devs had pressing questions about regexes. Dan provided us with a rather nice and detailed overview of different ways to parse data. Often we encounter situations where an input or a data file needs to be parsed so our code can make some sensible use of it.

After the presentation, we looked at some code using the parboiled library with Scala. The example was a simple check of whether a sequence of various types of brackets has matching opening and closing ones in the correct positions. For example, the sequence ({[<<>>]}) would be considered valid, while the sequence ((({(>>]) would be invalid.

First we define the set of classes that describes the parsed structure:

package com.sixtysevenbricks.examples.parboiled

object BracketParser {

  sealed trait Brackets

  case class RoundBrackets(content: Brackets)
     extends Brackets

  case class SquareBrackets(content: Brackets)
     extends Brackets

  case class AngleBrackets(content: Brackets)
     extends Brackets

  case class CurlyBrackets(content: Brackets)
     extends Brackets

  case object Empty extends Brackets

}
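
To make the structure concrete, here is the valid example from above, ({[<<>>]}), written out by hand as the value the parser is intended to produce (the wrapper object is purely for illustration):

import com.sixtysevenbricks.examples.parboiled.BracketParser._

object BracketExampleValue {
  // "({[<<>>]})" as the nested structure: round, curly, square,
  // then two levels of angle brackets around the empty expression.
  val example: Brackets =
    RoundBrackets(
      CurlyBrackets(
        SquareBrackets(
          AngleBrackets(
            AngleBrackets(Empty)))))
}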

Next, we define the matching rules that parboiled uses:

package com.sixtysevenbricks.examples.parboiled

import com.sixtysevenbricks.examples.parboiled.BracketParser._
import org.parboiled.scala._

class BracketParser extends Parser {

  /**
   * The input should consist of a bracketed expression
   * followed by the special "end of input" marker
   */
  def input: Rule1[Brackets] = rule {
    bracketedExpression ~ EOI
  }

  /**
   * A bracketed expression can be roundBrackets,
   * or squareBrackets, or... or the special empty 
   * expression (which occurs in the middle). Note that
   * because "empty" will always match, it must be listed
   * last
   */
  def bracketedExpression: Rule1[Brackets] = rule {
    roundBrackets | squareBrackets | 
    angleBrackets | curlyBrackets | empty
  }

  /**
   * The empty rule matches an EMPTY expression
   * (which will always succeed) and pushes the Empty
   * case object onto the stack
   */
  def empty: Rule1[Brackets] = rule {
    EMPTY ~> (_ => Empty)
  }

  /**
   * The roundBrackets rule matches a bracketed 
   * expression surrounded by parentheses. If it
   * succeeds, it pushes a RoundBrackets object 
   * onto the stack, containing the content inside
   * the brackets
   */
  def roundBrackets: Rule1[Brackets] = rule {
    "(" ~ bracketedExpression ~ ")" ~~>
         (content => RoundBrackets(content))
  }

  // Remaining matchers
  def squareBrackets: Rule1[Brackets] = rule {
    "[" ~ bracketedExpression ~ "]"  ~~>
        (content => SquareBrackets(content))
  }

  def angleBrackets: Rule1[Brackets] = rule {
    "<" ~ bracketedExpression ~ ">" ~~>
        (content => AngleBrackets(content))
  }

  def curlyBrackets: Rule1[Brackets] = rule {
    "{" ~ bracketedExpression ~ "}" ~~>
        (content => CurlyBrackets(content))
  }


  /**
   * The main entrypoint for parsing.
   *
   * @param expression the string to parse
   * @return the parsing result, containing the parsed structure
   *         if the expression was valid
   */
  def parseExpression(expression: String):
    ParsingResult[Brackets] = {
    ReportingParseRunner(input).run(expression)
  }

}

While this example requires a lot more code to be written than a regex, parsers are more powerful and adaptable. Parboiled seems to be an excellent library with a rather nice syntax for defining them.
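
For completeness, here is a minimal sketch of how the parser might be driven from calling code. The matched and result members are my reading of the parboiled-scala ParsingResult API, so treat the exact names as assumptions and check them against the library version you are using:

import com.sixtysevenbricks.examples.parboiled.BracketParser

object BracketParserDemo {
  def main(args: Array[String]): Unit = {
    val parser = new BracketParser()

    // Try the valid and invalid examples from above.
    for (expression <- Seq("({[<<>>]})", "((({(>>])")) {
      val result = parser.parseExpression(expression)
      // `matched` and `result` (an Option) are assumed member names on ParsingResult.
      if (result.matched) {
        println(s"$expression parsed as ${result.result}")
      } else {
        println(s"$expression is not a valid bracket sequence")
      }
    }
  }
}

On a failed parse, the ParsingResult should also carry the parse errors, which can be turned into a far more readable message than an inscrutable regex mismatch.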

To summarize, regexes are very useful, but so are parsers. Start with a regex (or better yet, a pre-existing library that specifically parses your data structure) and if it gets too complex to deal with, consider writing a custom parser.

Zen and the art of booking vaccinations

This is a slightly abridged version of a painful experience I had recently when trying to book a Covid vaccination for my 5-year-old daughter, and some musings about what went wrong (spoiler: IT systems). It’s absolutely not intended as a criticism of anyone involved in the process. All descriptions of the automated menu system reflect how it was working today.

At the beginning of April, vaccinations were opened up for children aged 5 and over. Accordingly, on Saturday 2nd, we tried to book an appointment for our daughter using the NHS website (https://www.nhs.uk/conditions/coronavirus-covid-19/coronavirus-vaccination/book-coronavirus-vaccination/). After entering her NHS number and date of birth, we were bounced to an error page:

The error page after failing to book an appointment.

There’s no information on the page about what the error might be; possibly this is reasonable given patient confidentiality etc. (at no point had I authenticated myself). I noticed that the URL ended “/cannot-find-vaccination-record?code=9901”, but other than that, all I could do was call 119 as suggested.

Dialling 119, of course, leads you to a menu-based system. After choosing the location you’re calling from, you get 4 options:

  1. Test and trace service
  2. Covid-19 vaccination booking service
  3. NHS Covid pass service
  4. Report an issue with Covid vaccination record

So the obvious choice here is “4”. This gives you a recorded message “If you have a problem with your UK vaccination records, the agent answering your call can refer you on to the data resolution team”. This sounds promising! Then there’s a further menu with 3 choices:

  1. If your vaccination record issue is stopping you from making a vaccination booking
  2. If your issue is with the Covid pass
  3. If your issue relates to a vaccination made overseas

Again, the obvious choice here is “1”. This results in you being sent to the vaccination booking service.

The first problem I encountered (in the course of the day I did this many times!) was that many of the staff on the other end seemed to be genuinely confused about how I’d ended up with them. They told me I should redial and choose option 4, and I kept explaining that I had done exactly that and this was where I had ended up. So either the menu system is not working and is sending me to the wrong place (although given the voice prompts it sounds like it was doing the right thing), or the staff taking the calls have not been briefed properly.

Eventually I was able to get myself referred to the slightly Portal-esque sounding Vaccination Data Resolution Service. They explained that my daughter did not appear to be registered with a GP, which surprised me because she definitely is. So, they said, I should get in touch with her GP practice and get them to make sure the records were correct on “the national system”.

This I did. I actually went down there (it was lunchtime); the staff at the surgery peered at her records and reported that everything seemed to be present and correct, with no issues being flagged up.

So then I had more fun and games trying to get one of the 119 operators to refer me back to the VDRS. This was eventually successful, and someone else at the VDRS called me back. She took pity on me and gave me some more specific information: the “summary care record” on the “NHS Spine portal”, which should have listed my daughter’s GP, did not.

I phoned the GP surgery, explained this, various people looked at the record and reported that everything seemed fine to them.

More phoning of 119, a third referral, to another person. He wasn’t able to suggest anything, sadly.

So, at this point, I was racking my brains trying to work out where the problem could lie. I had the VDRS people saying that this data was missing from my daughter’s record, and the GP surgery insisting that all was well. The VDRS chap had mentioned something about it potentially taking up to 4 weeks (!) for updates to come through to them, which suggests that behind the scenes there must be some data synchronisation between different systems. I wondered if there was some way of tagging bits of a patient record with permissions to say who is allowed to see them, such that the GP surgery could see them and the VDRS people couldn’t.

Finally, on the school run to collect my daughter, I thought I’d have one last try at talking to the GP practice in person. I spoke to one of the ladies I’d spoken to on the phone; she took my bit of paper with “NHS Spine portal – summary care record” scrawled on it and went off to see the deputy practice manager. A short while later she returned; they’d looked at that bit of the record and spotted something in it about “linking” the local record with the NHS Spine (she claimed they’d never seen this before¹), and that this was not set. I got the impression that in fact it couldn’t be set, because her proposed fix (to be tried the next day) was to deregister my daughter and re-register her. And then I should try the vaccine booking again in a couple of days.

As someone who is (for want of a better phrase) an “IT professional”, the whole experience was quite frustrating. As noted at the top, I’m not trying to criticise any person I dealt with – everyone seemed keen to help. I’m also not trying to cast aspersions on the GP’s surgery – as far as they were aware, there was nothing amiss with the record (until they discovered this “linking” thing). My suspicion is that the faults lie with IT systems and processes.

For example, it sounds like it’s an error for a GP surgery to have a patient record that’s not linked to a record in the NHS Spine, but since many people took a look at the screens showing the data without noticing anything amiss, I’d say that’s some kind of failure of UI design. I wonder how it ended up like that; maybe my daughter’s record predated linking with Spine and somehow got missed in a transitioning process (or no such process occurred)?

It would have been nice if I’d been able to get from the state of knowing there’s some kind of error with the record, to knowing the actual details of the error, without having to jump through so many telephonic hoops. I presume the error code 9901 means something to somebody, but it didn’t mean anything to any of the people I spoke to. In any case, I only spotted it because, as a developer, I thought I’d peep at the URL, but it didn’t seem to be helpful from the point of view of diagnosing the problem. It feels like there’s a missed opportunity here – since people seeing that error page are directed to the 119 service, it would have been helpful to have provided some kind of visible code to enable the call handlers to triage the calls effectively.

In terms of my own development, it was an important reminder that the systems I build will be used by people who are not necessarily IT-literate and don’t know how they work under the covers, and if they go wrong then it might be a bewildering and perplexing experience for them. Being at the receiving end of vague and generic-sounding errors, as I have been today, has not been a lot of fun.


¹ I found this curious, but a subsequent visit to the surgery a couple of days later to see how the “fix” was progressing clarified things somewhat. My understanding now is that the regular view of my daughter’s record suggested that all was well, and that it was linked with Spine, and it was only when the deputy practice manager clicked through to investigate that it popped up a message saying that it was not linked correctly.

Obviously I don’t know the internals of the system and this is purely speculation, but suppose that the local system had set an “I am linked with Spine” flag when it tried to set up the link, the linking failed for some reason, and the flag was never unset (or maybe it got set regardless of whether the linking succeeded or failed). Suppose furthermore that the “clicking through” process described actually tries to pull data from Spine. That could give a scenario in which the record looks fine at first glance and gives no reason to suppose anything is wrong, and you only see the problem with some deeper investigation. We can still learn a lesson from this hypothetical conjecture: if you are setting a flag to indicate that some fallible process has taken place, don’t set it before you run the process (or unconditionally afterwards); set it only once you have run it and confirmed that it succeeded.
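
Purely as an illustration of that last point (hypothetical names, and nothing to do with the real systems involved), the difference between setting the flag optimistically and setting it only on confirmed success might look like this:

object LinkingFlagSketch {

  final case class LocalRecord(id: String, linkedToSpine: Boolean = false)

  // Stand-in for a fallible call to an external system.
  def linkToSpine(record: LocalRecord): Either[String, Unit] =
    if (record.id.nonEmpty) Right(()) else Left("linking failed")

  // Risky: the flag is set whether or not linking actually succeeded,
  // so a failure leaves the record looking perfectly healthy.
  def linkNaively(record: LocalRecord): LocalRecord = {
    linkToSpine(record) // result ignored
    record.copy(linkedToSpine = true)
  }

  // Safer: the flag is only set once the operation is confirmed to have
  // succeeded, so a failure remains visible in the record.
  def linkCarefully(record: LocalRecord): LocalRecord =
    linkToSpine(record) match {
      case Right(_) => record.copy(linkedToSpine = true)
      case Left(_)  => record
    }
}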


Postscript

Sadly, I never found out what the underlying problem was. We just tried the booking process one day and it worked as expected (although by this point it was moot as my daughter had tested positive for Covid, shortly followed by myself and my wife). A week or so later, we got a registration confirmation letter confirming that she’d been registered with a GP practice, which was reassuring. I’d like to hope that somewhere a bug has been fixed as a result of this…