Things Customers Don’t Understand About Search

Note: this post is based on a dev forum put together by Chris.

Full-text search is a common feature of the systems 67 Bricks builds. We want to make it easy for users to find relevant information quickly, often through a faceted search function, so understanding user needs and building a top-notch user experience is vital. When building faceted search, we generally use either ElasticSearch (or AWS’s OpenSearch) or MarkLogic. Both databases offer very similar feature sets when it comes to search, though one is more targeted towards JSON-based documents and the other, XML.

Search can seem magical at first glance and do some amazing things, but this can lead to situations where customers (and UX/UI designers) assume the search mechanism can do more than it can. We frequently find a disconnect between what is desired and what is feasible with search systems.

The 2 main categories of problems we often see are:

  1. Customers / designers asking for things that could be done, but that often come with nasty performance implications
  2. Features that seem reasonable to ask for at first glance, but once dug into reveal logical problems that make developing the feature near impossible

Faceted Search

Faceted search systems are some of the most common systems we build at 67 Bricks. The user experience typically starts with a search box into which the user enters a number of terms; they hit enter and are then presented with a list of results in some kind of relevancy order. Often there is a count of results displayed alongside a pagination mechanism to iterate through the results (e.g. showing results 1-10 of 12,464). We also show facets: counts of how many results fall into different buckets. All this is handled in a single query that often takes less than 100ms, which seems miraculous. Of course, this isn’t magic; full-text search systems use a variety of clever indexes to make searching and computing the facet counts quick.
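As a very rough sketch of the idea (in a real system this would be a single aggregation/facet query against the search engine, not an in-memory scan; the Cottage class and data here are made up):

case class Cottage(name: String, location: String, bedrooms: Int)

// A handful of hypothetical documents standing in for the search index.
val cottages = Seq(
  Cottage("Rose Cottage", "Cornwall", 2),
  Cottage("Seaview", "Cornwall", 3),
  Cottage("The Old Mill", "Yorkshire", 2)
)

// One "query": filter to the matching results, then derive facet counts
// from that same result set.
val results = cottages.filter(_.location == "Cornwall")
val bedroomFacet = results.groupBy(_.bedrooms).view.mapValues(_.size).toMap
// bedroomFacet == Map(2 -> 1, 3 -> 1): the counts only ever describe the
// current result set, which is the crux of the issues discussed below.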

Let’s make a search system for a hypothetical website, cottagesearch.com. Our first screen will present the user with some options to select a location, the date range they want to stay and how many guests are coming. We perform the search and show the matching results. How should we display the results and, more importantly, how do we show the facets?

Let’s say we did a search for 2-bedroom cottages. We’ve seen wireframes for a number of occasions where the facet count for all bedroom numbers is displayed, so users see the number of results they would get for each bedroom count if they didn’t limit the search to just 2 bedrooms (i.e. there aren’t that many 2-bed options, but look at how many 3-bed options are available). At first glance this seems like a sensible design, but it fundamentally breaks how search systems work with faceting: they will return counts, but only for the search just done.

We could get around this by doing 2 searches: one limited by bedrooms and one without that limit to retrieve the facet counts. This may seem like a sensible idea when we have 1 facet, but what do we do when we have more? Do we need to do multiple searches, effectively creating an N+1 problem? How do we display the numbers? Should the counts for the location facet include the limit on bedrooms or not? As soon as we start exploring additional situations we start to see the challenges the original design presents.

This gets harder when we consider non-exclusionary facets. Let’s say our cottage search system lets you filter by particular features, such as a wood burner, hot tub or dishwasher. Now, if we show counts for non-selected facets, what do those numbers represent? Do they include results that already match the selected facet or not? Here the logic starts to break down, becoming ever more confusing to the end user and more difficult for the developer to implement.

Other Complex Facet Situations

A common question we need to ask with non-exclusionary facets: is it an AND or an OR search? The answer is very domain dependent, but either way we suggest steering away from facet counts in these situations.

Date ranges provide an interesting problem: some sites will purposefully search outside of the selected range so as to provide results near the selected dates. This may be useful or annoying depending on what the user expects and is trying to achieve. Some users want exact matches and have no interest in results that fall outside the selected date range.

Ordering facets is also a question that may be overlooked. Do you order lexicographically or by descending number of matches? What about names, year ranges or numeric values? Again, a lot of what users expect and want comes down to the domain being dealt with and the needs of the users.

When users select a new facet, what should the UI do? Should the search immediately rerun and the results and facets update, or should there be a manual refresh button the user has to press before the search is updated? An immediate refresh is slower but lets users narrow down carefully, while a manual update reduces the number of searches performed, but users may then select a combination of facets that returns no results at all.

Hierarchies can also prove tricky. We often see taxonomies being used to inform facets, say subjects with sub-categories. How should these be displayed? Again, there are many solutions to pick from, with different sets of trade-offs.

Advanced Search

Advanced search can often be a bit like a peacock’s tail – something that looks impressive, but doesn’t contribute a fair share of value relative to how much effort it takes to develop. A lot of designers and product owners love the idea of it, but in practice it can end up being somewhat confusing to use, and many end users end up avoiding it.

Boolean builders exist in many systems where the designer of advanced search insists on allowing users to build up some complex search with lots of boolean AND/OR options, but displaying this to users in a way they can understand is challenging. If a user builds a boolean search such as GDP AND Argentina OR Brazil, do we treat it as (GDP AND Argentina) OR Brazil, or should it be interpreted as GDP AND (Argentina OR Brazil)? We could include brackets in the builder, but this just further complicates the UI.
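To make the ambiguity concrete, here is a tiny, hypothetical Scala sketch in which the two groupings give different result counts over the same three documents (each document modelled as a set of terms):

val docs = Seq(
  Set("GDP", "Argentina"),
  Set("Brazil"),
  Set("GDP", "Brazil")
)

// (GDP AND Argentina) OR Brazil
val leftGrouping = docs.count(d => (d("GDP") && d("Argentina")) || d("Brazil"))   // 3 results
// GDP AND (Argentina OR Brazil)
val rightGrouping = docs.count(d => d("GDP") && (d("Argentina") || d("Brazil")))  // 2 results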

We frequently get bugs and feedback on advanced search, and some of this feedback amounts to different users having contradictory opinions on how it should work. We would ask product owners to carefully consider “How many people will use it?” Google has a well-built UI for advanced search that does away with the challenges of boolean logic by having separate fields for ANDs, ORs and NOTs.

Google advanced search UI

An advanced search facility can introduce additional complexity when combined with facets. If an advanced search lets you select some facets before completing the search, does this form part of the string in the search box? We have had mixed results with enabling power users to enter facets into search fields (e.g. bedrooms:3). This can be tricky: some users can deal with it, but others may prefer an advanced search builder, while others will rely on facets after the search.

Summary

In conclusion, we have 3 main takeaways:

  • Search is much more complex than it first appears
  • Facets are not magic – just because you can draw a nice wireframe doesn’t make it feasible to develop
  • Advanced search can be tricky to get right and even then, only used by a minority of users

We’ve built many different types of search and have experimented with a number of approaches in the past, and we offer some tried and tested principles:

  • Make searches stateless – Don’t add complexity by trying to maintain state between facet changes; simply treat each change as a fresh search. That way URLs can act as a method of persistence and of bookmarking common searches (see the sketch after this list).
  • Have facets only display counts for the current search and do not display counts for other facets once one has been selected within that category.
  • Only use relevancy as the default ordering mechanism – You may be tempted to allow results to be ordered in different ways, such as by published date, but this can cause problems with weakly matching but recent results appearing first.
  • Don’t build an advanced search unless you really need to and if you have to, use a Google style interface over a boolean query builder.
  • Check that search is working as expected – Have domain experts check that searches are returning sensible results and look into using analytics to see if users are having a happy journey through the application (i.e. run a search and then find the right result within the first few hits).
  • Beware of exhaustive search use cases – As many search mechanisms work on a score based on relevancy to the terms entered, a search that guarantees returning everything relevant can be tricky to define and to develop.
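As an illustration of the stateless principle above, all of the search state can live in the URL’s query string so that any search can be bookmarked, shared and re-run from scratch (the parameter names here are hypothetical):

// A hypothetical cottagesearch.com URL carrying the complete search state.
val searchUrl =
  "https://cottagesearch.com/search" +
    "?location=cornwall&bedrooms=2&features=wood-burner,hot-tub&page=3"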

My Journey to Getting AWS Certified

When I joined 67 Bricks in January 2021 I knew close to zero about AWS, and not-a-lot about cloud services in general. I had dabbled a bit in Azure in my previous job, and I understood the fundamentals of what “the cloud” was, but I was very aware that I’d have to get up to speed if I wanted to be useful at developing applications on AWS. I joined our team on the EIU project, and on day one I was exposed to discussions about S3 buckets, lambda functions, glue jobs and SNS topics – all things I knew nothing about.

I asked one of the EIU enablement team to give me an overview, and I was introduced to the AWS console and shown some of the key services. Over the next few months I gradually started to get to grips with the basics – I learned how to upload to and download objects from S3, write to and query a DynamoDB table, and search for things in CloudWatch. I was very proud when I wrote my first lambda function, but I still felt like I was winging it.

I was encouraged by our development manager to look into obtaining some AWS certifications. The obvious starting point was Cloud Practitioner (https://aws.amazon.com/certification/certified-cloud-practitioner/?ch=sec&sec=rmg&d=1) which covers the basics of what “the cloud” is, and the applications of core AWS services. The best course I found to prepare for this was one from Amazon themselves https://explore.skillbuilder.aws/learn/course/134/aws-cloud-practitioner-essentials (you might need to sign in to the skill builder to access it, but the course is free). It uses the analogy of a coffee shop to explain the concepts of instances, scaling, load balancing, messaging and queueing, storage, networking etc, in an easy to understand manner. After a lot of procrastinating, and wondering if I was ready, I eventually took the exam in October 2021 and passed it with a respectable score.

The cloud practitioner course covers AWS services in an abstract manner – you learn about the core services without ever having to use them. In fact you could probably pass the course without ever logging into the AWS console. To demonstrate real experience and knowledge of AWS services, I decided that the certification to go for next was Developer Associate (https://aws.amazon.com/certification/certified-developer-associate/?ch=sec&sec=rmg&d=1). AWS doesn’t offer their own course to study for this certification – instead they provide links to numerous white papers, which make for fairly dry reading, and it is not clear exactly what knowledge is and is not required.

After doing a bit of research I decided that this course on Udemy https://www.udemy.com/course/aws-certified-developer-associate-dva-c01/ by Stephane Maarek was the most highly rated. With 32 hours of videos to absorb, this was not a trivial undertaking, but after slotting in a few hours of study either before work or in the evenings, I made it through with two books stuffed with notes.

The Developer Associate certification requires you to understand at a fairly deep level how the AWS compute, data, storage, messaging, monitoring and deployment services work, and also to understand architectural best practices, the AWS shared responsibility model, and application lifecycle management. A typical exam question for Developer Associate might ask you to calculate how many read-capacity-units or write-capacity-units a DynamoDB table consumes under various circumstances. Another one might test your understanding of how many EC2 instances a particular auto-scaling policy would add or remove. Another question might require you to understand what lambda concurrency limits are for.
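As a rough illustration of the arithmetic involved (my own hypothetical numbers, using the standard DynamoDB rules that one RCU covers a strongly consistent read of up to 4 KB per second and one WCU covers a write of up to 1 KB per second):

// Hypothetical workload: 10 strongly consistent reads/sec and 5 writes/sec of 6 KB items.
val itemSizeKb = 6.0
val readsPerSecond = 10
val writesPerSecond = 5

val rcus = readsPerSecond * math.ceil(itemSizeKb / 4.0).toInt  // 10 * 2 = 20 RCUs
val wcus = writesPerSecond * math.ceil(itemSizeKb / 1.0).toInt // 5 * 6 = 30 WCUs
// Eventually consistent reads would halve the RCU figure; transactional reads double it.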

After working my way through a number of practice exams (the best ones seem to be by Jon Bonzo, again on Udemy https://www.udemy.com/course/aws-certified-developer-associate-practice-exams-amazon/) I took the plunge and sat the exam in January 2022, again passing with a respectable score.

But what next? The knowledge I’d gained up until this point had given me real practical skills, and a deeper knowledge of how the various AWS services connect together. For example, it was no longer a mystery how lambda functions could be triggered by SNS topics or messages from an SQS queue, and could then call another API perhaps hosted on EC2 to initiate some other process. And I could understand how to utilise infrastructure-as-code (e.g. CloudFormation or CDK) along with services like CodePipeline and CodeDeploy, to automate build processes. But I wanted a greater understanding of the “bigger picture”, and so next I chose to go for the Solutions Architect Associate certification (https://aws.amazon.com/certification/certified-solutions-architect-associate/?ch=sec&sec=rmg&d=1).

The Solutions Architect Associate exam typically presents a scenario and then asks you to choose which option provides the best solution. One option is usually wrong, but there could be more than one solution which would work – but you have to scrutinise the question to see which one best meets the requirements of the scenario. Are they asking for the cheapest solution? Or the fastest? Or the most fault tolerant? (Look for clues like “must be highly available” – and so the correct answer will probably involve multi-AZ deployments). Is any down-time acceptable? Is data required in real time, or is a delay acceptable? (E.g. do we choose Kinesis or SQS?) If a customer is migrating to the cloud are there time constraints, and how much data is there to migrate? (E.g. it can take a month or two to set up a Direct Connect connection, but you could have a Snowmobile in a week. A VPN might work but there are limits to the data transfer rates).

Again, I chose Stephane Maarek’s course on Udemy (https://www.udemy.com/course/aws-certified-solutions-architect-associate-saa-c02/) – his study materials are clear and he also notes which sections are duplicates of those in the developer associate course. I again used Jon Bonzo’s practice exams (https://www.udemy.com/course/aws-certified-solutions-architect-associate-amazon-practice-exams-saa-c03/). There is a fairly hard-core section on VPC, which is something I struggled with. Stephane presents a spaghetti-like diagram showing the relationship between VPCs, public and private subnets, internet gateways, NAT gateways, security groups, route tables, on-premise set-ups, VPC endpoints, transit gateways, direct connections, VPC peering connections etc – and says “by the end of this section you’ll know what all of this means”. He was right, but as someone with limited networking experience and knowledge, I found it pretty tough.

I sat the exam in April 2022, a day before I figured out that the cough and fatigue I’d developed was actually Covid. I passed the exam respectably again, and then collapsed into bed for a few days to recover.

At this point it’s probably worth mentioning how the exam process works. If you like, you can book an exam in an approved test centre. However, I chose to go with the “online proctored” exams hosted by Pearson Vue. You book an exam slot – generally plenty are available at all times of the day and night, and you can usually find a slot within the next day or two that suits. For the exam you need to be sitting at a clear table with nothing within arms reach. Not even a tissue or a glass of water. You need to run some Pearson software on your laptop that checks no other processes are running (so turn off slack, email, shut down your docker containers etc etc), and then launches their exam platform. You will be asked to present photo ID, and then show the proctor your testing environment. They will want to see your chair and table from all angles, and will want to see your arms to make sure you’re not wearing a watch or have anything hiding up your sleeves. You need your mobile phone in the room, but out of reach, in case they need to call you. And you also need to make sure you are undisturbed in the room for the duration of the exam (which is typically 2-3 hours).

This last point was challenging for me. My home office is not suitable, being far too crammed with potential cheat material, and I also share it with my husband. The only suitable place is my dining table, in the very open-plan ground floor of my house. Finding a time when I can have the ground floor to myself for 2-3 hours means scheduling the exam for around 7am on a day when the kids are not at school. I ended up putting “do not disturb” signs on the door and issuing dire warnings to everyone that they mustn’t come downstairs until I’d given them the all-clear. Anyone wandering sleepily through the room on a quest for coffee could result in the exam proctor dropping my connection and disqualifying me from the exam. Fortunately, all was well and all the exams I’ve sat so far were carried out without incident.

After obtaining the Solutions Architect Associate certification I thought about taking a break. But then I took a look at the requirements for SysOps Administrator Associate (https://aws.amazon.com/certification/certified-sysops-admin-associate/?ch=sec&sec=rmg&d=1) and realised that I’d already covered about two-thirds of the required material. Now SysOps is not something I have a love for. I have a deep respect for people who understand deployments and pipelines and infrastructure. The Enablement team at the EIU, who I work closely with, are miracle workers who regularly perform magic to get things up and running. The idea that I could learn some of that wizardry seemed far-fetched. But I thought I might as well give it a go.

The SysOps certificate focusses a lot on configuration and monitoring. You learn a lot about load balancers, autoscaling policies, CloudFormation and CloudWatch. And yes, all that in-depth knowledge about VPCs and hybrid-cloud set-ups is applicable here too. A typical exam question will present a scenario where something has gone wrong, and you have to pick the best option to fix it. For example, someone can’t SSH into an EC2 instance because something is wrong with the security group. Or someone in a child account of a parent organisation can’t access something in another child account. Yet again I went to Stephane Maarek’s course, which was again excellent (https://www.udemy.com/course/ultimate-aws-certified-sysops-administrator-associate/). And Jon Bonzo again provided the practice exams (https://www.udemy.com/course/aws-certified-sysops-administrator-associate-practice-exams-soa-c01/).

I sat the SysOps exam in June 2022. One thing that caused a little trepidation was that this exam includes “exam labs” – these are practical exercises carried out in the AWS console. It was hard to prepare for these because I could not find any practice labs on-line, and so I was going in cold. However, it turned out that the labs were well defined with clear steps on what was required. Even the ones where I had never really looked at the service before, I was able to find it in the console and figure out what I needed to do. I was asked to:

  • Create a backup plan for an EFS system with two types of retention policy
  • Update a CloudFormation stack to fiddle with some EC2 settings, roles, route tables etc
  • Create an S3 static website and configure some Route 53 failover policies

The second of these caused me the most difficulty – I hadn’t anticipated actually having to write a CloudFormation template – they provided one which I needed to edit and it took me a while to figure out how to actually do this. Turns out that you need to save a new version of the template locally and then re-upload it.

I passed the SysOps exam with a more modest mark than for the other certifications, and I definitely breathed a sigh of relief. I am now definitely taking a breather – perhaps in a few months I might take a look at some of the specialist certifications (maybe Data Analytics?) but for the moment I’m going to get back to some of my other neglected hobbies (I like to draw, and play the piano, and one day I’ll maybe finish my epic fantasy trilogy).

The key take-aways from my experience are:

  • The associate level certifications require you to acquire knowledge that is directly applicable in the day-to-day life of a developer or systems administrator.
  • I was initially concerned that the courses would be part of a propaganda machine from AWS, encouraging us to spend ever larger amounts on AWS services. I found this to not be the case at all. Quite a large part of the material teaches us how to save costs, and how to incorporate our existing on-premises infrastructure with AWS, rather than replacing it entirely.
  • Sitting an exam in your own home is definitely preferable to travelling to a test centre – you get far more flexibility over when you can take the exam. However, not everyone will have a suitable place at home to take the exam, particularly if you share your home with other people, or do not have a suitable table to work at.
  • Studying for these certifications will require a significant time commitment. The online courses run for 20-30 hours or more, assuming you never pause the videos to take notes, or repeat a section. And that is before you take time to revise or do practice exams.
  • Definitely the most valuable tool for preparing for the exams is by completing as many practice exams as you can find. The best ones include detailed explanations about why a particular answer is correct and the others are wrong.
  • Also note that these certifications have an expiry date – typically 3 years – and also the courses are refreshed periodically. For example, the Solutions Architect Associate is being refreshed at the end of August and Solutions Architect Professional is being refreshed in November.

Dev Forum – Parsing Data

Last Friday we had a dev forum on parsing data, which came up as some devs had pressing questions about regexes. Dan provided us with a rather nice and detailed overview of different ways to parse data. Often we encounter situations where an input or a data file needs to be parsed so our code can make some sensible use of it.

After the presentation, we looked at some code using the parboiled library with Scala. A simple example of checking if a sequence of various types of brackets has matching open and closing ones in the correct positions was given. For example the sequence ({[<<>>]}) would be considered valid, while the sequence ((({(>>]) would be invalid.

First we define the set of classes that describes the parsed structure:

object BracketParser {

  sealed trait Brackets

  case class RoundBrackets(content: Brackets)
     extends Brackets

  case class SquareBrackets(content: Brackets)
     extends Brackets

  case class AngleBrackets(content: Brackets)
     extends Brackets

  case class CurlyBrackets(content: Brackets)
     extends Brackets

  case object Empty extends Brackets

}

Next, we define the matching rules that parboiled uses:

package com.sixtysevenbricks.examples.parboiled

import com.sixtysevenbricks.examples.parboiled.BracketParser._
import org.parboiled.scala._

class BracketParser extends Parser {

  /**
   * The input should consist of a bracketed expression
   * followed by the special "end of input" marker
   */
  def input: Rule1[Brackets] = rule {
    bracketedExpression ~ EOI
  }

  /**
   * A bracketed expression can be roundBrackets,
   * or squareBrackets, or... or the special empty 
   * expression (which occurs in the middle). Note that
   * because "empty" will always match, it must be listed
   * last
   */
  def bracketedExpression: Rule1[Brackets] = rule {
    roundBrackets | squareBrackets | 
    angleBrackets | curlyBrackets | empty
  }

  /**
   * The empty rule matches an EMPTY expression
   * (which will always succeed) and pushes the Empty
   * case object onto the stack
   */
  def empty: Rule1[Brackets] = rule {
    EMPTY ~> (_ => Empty)
  }

  /**
   * The roundBrackets rule matches a bracketed 
   * expression surrounded by parentheses. If it
   * succeeds, it pushes a RoundBrackets object 
   * onto the stack, containing the content inside
   * the brackets
   */
  def roundBrackets: Rule1[Brackets] = rule {
    "(" ~ bracketedExpression ~ ")" ~~>
         (content => RoundBrackets(content))
  }

  // Remaining matchers
  def squareBrackets: Rule1[Brackets] = rule {
    "[" ~ bracketedExpression ~ "]"  ~~>
        (content => SquareBrackets(content))
  }

  def angleBrackets: Rule1[Brackets] = rule {
    "<" ~ bracketedExpression ~ ">" ~~>
        (content => AngleBrackets(content))
  }

  def curlyBrackets: Rule1[Brackets] = rule {
    "{" ~ bracketedExpression ~ "}" ~~>
        (content => CurlyBrackets(content))
  }


  /**
   * The main entrypoint for parsing.
   * @param expression
   * @return
   */
  def parseExpression(expression: String):
    ParsingResult[Brackets] = {
    ReportingParseRunner(input).run(expression)
  }

}

While this example requires a lot more code to be written than a regex, parsers are more powerful and adaptable. Parboiled seems to be an excellent library with a rather nice syntax for defining them.
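For completeness, here is a minimal sketch of how the parser might be invoked; it assumes parboiled-scala’s ParsingResult, whose result field is an Option of the parsed value:

val parser = new BracketParser()

val valid = parser.parseExpression("({[<<>>]})")
// valid.result is Some(RoundBrackets(CurlyBrackets(SquareBrackets(AngleBrackets(AngleBrackets(Empty))))))

val invalid = parser.parseExpression("((({(>>])")
// invalid.result is None and invalid.parseErrors describes where matching failed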

To summarize, regexes are very useful, but so are parsers. Start with a regex (or better yet, a pre-existing library that specifically parses your data structure) and if it gets too complex to deal with, consider writing a custom parser.

Zen and the art of booking vaccinations

This is a slightly abridged version of a painful experience I had recently when trying to book a Covid vaccination for my 5-year-old daughter, and some musing about what went wrong (spoiler: IT systems). It’s absolutely not intended as a criticism of anyone involved in the process. All descriptions of the automated menu process describe how it was working today.

At the beginning of April, vaccinations were opened up for children aged 5 and over. Accordingly, on Saturday 2nd, we tried to book an appointment for our daughter using the NHS website (https://www.nhs.uk/conditions/coronavirus-covid-19/coronavirus-vaccination/book-coronavirus-vaccination/). After entering her NHS number and date of birth, we were bounced to an error page:

The error page after failing to book an appointment.

There’s no information on the page about what the error might be – possibly this is reasonable given patient confidentiality etc. (at no point had I authenticated myself). I noticed that the URL ended “/cannot-find-vaccination-record?code=9901” but other than that, all I can do is call 119 as suggested.

Dialling 119, of course, leads you to a menu-based system. After choosing the location you’re calling from, you get 4 options:

  1. Test and trace service
  2. Covid-19 vaccination booking service
  3. NHS Covid pass service
  4. Report an issue with Covid vaccination record

So the obvious choice here is “4”. This gives you a recorded message “If you have a problem with your UK vaccination records, the agent answering your call can refer you on to the data resolution team”. This sounds promising! Then there’s a further menu with 3 choices:

  1. If your vaccination record issue is stopping you from making a vaccination booking
  2. If your issue is with the Covid pass
  3. If your issue relates to a vaccination made overseas

Again, the obvious choice here is “1”. This results in you being sent to the vaccination booking service.

The first problem I encountered (in the course of the day I did this many times!) was that many of the staff on the other end seemed to be genuinely confused about how I’d ended up with them. They told me I should redial and choose option 4, and I kept explaining that I had done exactly that and this is where I had ended up. So either the menu system is not working and is sending me to the wrong place (although given the voice prompts it sounds like it was doing the right thing), or the staff taking the calls have not been briefed properly.

Eventually I was able to get myself referred to the slightly Portal-esque sounding Vaccination Data Resolution Service. They explained that my daughter did not appear to be registered with a GP, which surprised me because she definitely is. So, they said, I should get in touch with her GP practice and get them to make sure the records were correct on “the national system”.

This I did. I actually went down there (it was lunchtime), the staff at the surgery peered at her records and reported that everything seemed to be present and correct, with no issues being flagged up.

So, then I had more fun and games trying to get one of the 119 operators to refer me back to the VDRS. This was eventually successful, and someone else at the VDRS called me back. She took pity on me, and gave me some more specific information – the “summary case record” on the “NHS Spine portal” which should have listed my daughter’s GP did not.

I phoned the GP surgery, explained this, various people looked at the record and reported that everything seemed fine to them.

More phoning of 119, a third referral, to another person. He wasn’t able to suggest anything, sadly.

So, at this point, I was wracking my brains trying to work out where the problem could lie. I had the VDRS people saying that this data was missing from my daughter’s record, and the GP surgery insisting that all was well. The VDRS chap had mentioned something about it potentially taking up to 4 weeks (!) for updates to come through to them, which suggests that behind the scenes there must be some data synchronisation between different systems. I wondered if there was some kind of way of tagging bits of a patient record with permissions to say who is allowed to see them, and the GP surgery could and the VDRS people couldn’t.

Finally, on the school run to collect my daughter, I thought I’d have one last try at talking to the GP practice in person. I spoke to one of the ladies I’d spoken to on the phone, she took my bit of paper with “NHS Spine portal – summary care record” scrawled on it and went off to see the deputy practice manager. A short while later she returned; they’d looked at that bit of the record and spotted something in it about “linking” the local record with the NHS Spine (she claimed they’d never seen this before [1]), and that this was not set. I got the impression that in fact it couldn’t be set, because her proposed fix (to be tried tomorrow) is going to be to deregister my daughter and re-register her. And then I should try the vaccine booking again in a couple of days.

As someone who is (for want of a better phrase) an “IT professional”, the whole experience was quite frustrating. As noted at the top, I’m not trying to criticise any person I dealt with – everyone seemed keen to help. I’m also not trying to cast aspersions on the GP’s surgery – as far as they were aware, there was nothing amiss with the record (until they discovered this “linking” thing). My suspicion is that the faults lie with IT systems and processes.

For example, it sounds like it’s an error for a GP surgery to have a patient record that’s not linked to a record in the NHS Spine, but since many people took a look at the screens showing the data without noticing anything amiss, I’d say that’s some kind of failure of UI design. I wonder how it ended up like that; maybe my daughter’s record predated linking with Spine and somehow got missed in a transitioning process (or no such process occurred)?

It would have been nice if I’d been able to get from the state of knowing there’s some kind of error with the record, to knowing the actual details of the error, without having to jump through so many telephonic hoops. I presume the error code 9901 means something to somebody, but it didn’t mean anything to any of the people I spoke to. In any case, I only spotted it because, as a developer, I thought I’d peep at the URL, but it didn’t seem to be helpful from the point of view of diagnosing the problem. It feels like there’s a missed opportunity here – since people seeing that error page are directed to the 119 service, it would have been helpful to have provided some kind of visible code to enable the call handlers to triage the calls effectively.

In terms of my own development, it was an important reminder that the systems I build will be used by people who are not necessarily IT-literate and don’t know how they work under the covers, and if they go wrong then it might be a bewildering and perplexing experience for them. Being at the receiving end of vague and generic-sounding errors, as I have been today, has not been a lot of fun.


[1] I found this curious, but a subsequent visit to the surgery a couple of days later to see how the “fix” was progressing clarified things somewhat. My understanding now is that the regular view of my daughter’s record suggested that all was well, and that it was linked with Spine, and it was only when the deputy practice manager clicked through to investigate that it popped up a message saying that it was not linked correctly.

Obviously I don’t know the internals of the system and this is purely speculation, but suppose that the local system had set an “I am linked with Spine” flag when it tried to set up the link, the linking failed for some reason, and the flag was never unset (or maybe it got set regardless of whether the linking succeeded or failed). Suppose furthermore that the “clicking through” process described actually tries to pull data from Spine. That could give a scenario in which the record looks fine at first glance and gives no reason to suppose anything is wrong, and you only see the problem with some deeper investigation. We can still learn a lesson from this hypothetical conjecture – if you are setting a flag to indicate that some fallible process has taken place, don’t set it before or after you run this process; set it when you have run it and confirmed that it has been successful.


Postscript

Sadly, I never found out what the underlying problem was. We just tried the booking process one day and it worked as expected (although by this point it was moot as my daughter had tested positive for Covid, shortly followed by myself and my wife). A week or so later, we got a registration confirmation letter confirming that she’d been registered with a GP practice, which was reassuring. I’d like to hope that somewhere a bug has been fixed as a result of this…

Organising Git Pull Requests

Recently we had an interesting dev forum on Git workflows. Git is the de facto source control tool of choice for software development. Powerful and flexible, teams have a wide range of choices to make when deciding exactly how to make use of it. This discussion originally stemmed from a Changelog podcast titled “Git your reset on” based on the blog post “Git Organized” by Annie Sexton.

To briefly summarise the above, if we imagine a situation where a merge to main leads to a production deployment and then an issue is found in production, how do we roll back? We look in the git history, find the commit that introduced the bug, revert it and then redeploy to production. But oh no! This revert changed a number of source files beyond the scope of the bug and has ended up introducing a new bug.

The proposed solution to this is to ensure a sensible, clean history of commits is used within a branch. Annie Sexton suggests using a feature branch where you commit regularly with throwaway messages like WIP until you are ready to submit a pull request. You then run git reset origin/head to be presented with all the changes you have made and how they differ from main; you then progressively add files and make commits with sensible messages to build up a more coherent change.
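A rough sketch of that workflow in git commands (branch names, file names and messages are made up; the post’s git reset origin/head is equivalent to resetting against the remote’s default branch):

git checkout -b feature/facet-counts
# ...hack away, committing freely...
git commit -am "WIP"
git commit -am "WIP"

# Ready to tidy up: reset the branch's index back to where it diverged from main,
# leaving all the changes in the working tree.
git reset origin/main

# Rebuild the history as a coherent story, staging hunks selectively if needed.
git add -p src/search/Facets.scala
git commit -m "Compute facet counts alongside search results"
git add src/web/SearchController.scala
git commit -m "Expose facet counts in the search API"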

This approach has a number of advantages we discussed:

  • It provides a neat history of compartmentalised changes
  • Reviewers can then follow a structured story of commits about how the developer arrived at their solution
  • Irrelevant changes are excluded
  • It ensures that all of the included changes are relevant only to the task at hand

Disadvantages of this approach may include:

  • Bad approaches that were tried and abandoned are not in the history
  • Changes won’t always compartmentalise neatly into whole files, meaning we sometimes need to commit specific changes within a file; some tools do a bad job of supporting this use case

It turns out that a lot of developers at 67 Bricks use a form of history rewriting when developing code, but no one used the specific approach proposed above. git commit --amend is a very useful command to tweak the previous commit. Some (including myself) use git rebase -i as a way of rebuilding the commit history, though this has a limitation in cases where you may want to split a commit up. Finally, others create new branches and build up clean commits on the new branch.

Is force pushing a force for good?

This question really split the developers: some see git push --force as an evil that indicates you’re using git wrongly, and think it should only be used as a last resort. Others really like the idea of force pushing, but only for branches you’ve been developing on and have yet to create a merge request for.

Squashing?

A workflow some developers have seen before involves using the --squash flag when merging into main. This can create a nice history where each commit neatly maps to a single ticket, but the general view was that this caused the loss of helpful information from the git history.

Should the software be valid at every commit?

Some argued that this is a sensible thing to strive for, having confidence that you can revert back to any commit and the software will work is a nice place to be at. However, others criticised this as tricky to achieve in practice. Ideally the tests would all be valid and pass too, but this goes against how some developers work; they create a failing test to show the problem or new feature, commit it and then build out code to make the test green. Others claim that tests should change in the same commit as code changes to make it obvious these changes belong together.

Tools used by Developers

At 67 Bricks, we try to avoid specifying which precise tool a developer should use for their craft. Different developers prefer different approaches when it comes to using git: some are more comfortable with their IDE’s git integration (e.g. IntelliJ’s, which can add specific lines to commits) while others prefer using the command line. Here’s a list of tools we’ve used when dealing with git:

To conclude, specific approaches and tools used vary, but everyone agreed on the benefits of striving for a clean, structured git history.

MarkLogic profiler

When investigating slow XQuery or XSLT queries in MarkLogic, one of the tools it provides is a query profiler. In this post I’ll explain how to make use of the profiler and how to interpret the results.

There are 3 ways you can make use of the profiler:

  1. use the MarkLogic QConsole;
  2. use an IDE that supports the profiler like IntelliJ with the XQuery and XSLT plugin;
  3. use the MarkLogic profiler API.

Using the QConsole

In the MarkLogic QConsole you can enable profiling by clicking on the Profile tab next to the Results tab and pressing the Run button. This will display a table with the profiling results. The profile can be saved by clicking on the download button on the right.

The profile result table contains the following information:

  1. Module:Line No.: Col No. – This is where the expression being profiled starts;
  2. Count – This is the number of times the expression was evaluated;
  3. Shallow %/Shallow ms – This is how long the expression took to run on its own across all Count times it was evaluated;
  4. Deep %/Deep ms – This is how long the expression took to run, including the time its subexpressions took to evaluate;
  5. Expression – This is the string representation of the query expression being evaluated.

NOTE: Here, ms is microseconds not milliseconds.

If you want to do further processing on the downloaded profile:report XML, you can load that XML by using:

let $data := xdmp:filesystem-file("C:\profiling\profile.xml")
let $profile := xdmp:unquote($data)/prof:report

Analysing Profile Results

Sorting by shallow time is useful for identifying the expressions that take a long time to evaluate. This can identify places where you should modify the query or add indices.

The count column is also useful. It can indicate places where you have nested loops that result in quadratic time queries.

When making optimizations, it can be tricky to know if a change results in a speed up of the query or is not significant, but shows up in the normal time variation when running a query multiple times. To handle this, I like to use a spreadsheet and measure the average (mean) time and standard deviation over 5 or 10 runs.
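For example, a small sketch of that calculation (here in Scala, with made-up timings in milliseconds):

val runsMs = Seq(412.0, 398.0, 425.0, 401.0, 417.0)
val mean = runsMs.sum / runsMs.size
// Sample standard deviation across the runs.
val stdDev = math.sqrt(runsMs.map(t => math.pow(t - mean, 2)).sum / (runsMs.size - 1))
// Compare mean +/- stdDev before and after a change: if the intervals overlap heavily,
// the "improvement" may just be normal run-to-run variation.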

This can also be useful when varying the number of items being processed (e.g. documents) in order to identify if the query scales with that item, and whether the scaling is linear, quadratic, or some other curve. For this, you can use line graphs to visualise the performance and linear or quadratic regression formulae to check what shape the graph is.

The stock chart graph is useful for comparing the variation between runs. Here, the minimum and maximum times form the extents (low, high) of the chart, and the mean +/- one standard deviation forms the inner bar (open, close) of the chart.

Using the Profiler API

Using the profiler API you can use the following to collect the profile data from a query expression:

let $_ := prof:enable(xdmp:request())
(: code to profile, e.g. -- let $_ := local:test() :)
let $_ := prof:disable(xdmp:request())
let $profile := prof:report(xdmp:request())

NOTE: If you are running this from an XQuery 1.0 query instead of a 1.0-ml query then you need to declare the prof namespace:

declare namespace prof = "http://marklogic.com/xdmp/profile";

You can then either save the profile to a file, e.g.:

let $_ := xdmp:save("C:\profiling\profile.xml", $profile)

or process the resulting profile:report XML in the XQuery script.

Alternatively, you can use prof:eval or prof:xslt-eval to evaluate an in-memory query, or prof:invoke or prof:xslt-invoke to evaluate a module file. These return the profile data as the first item and the result as the remaining items, so you can use the following:

let $ret := prof:eval("for $x in 1 to 10 return $x")
let $profile := $ret[1]
let $results := $ret[position() != 1]

With the profile report XML, you can then process it as you need. For example, to create a CSV version of the QConsole profile table you can use the following query:

declare function local:to-seconds($value) {
  ($value cast as xs:dayTimeDuration) div xs:dayTimeDuration("PT1S")
};

let $overall-elapsed := local:to-seconds($profile/prof:metadata/prof:overall-elapsed)
for $expression in $profile/prof:histogram/prof:expression
let $source := $expression/prof:expr-source/string()
let $uri := string(($expression/prof:uri/text(), "main")[1])
let $line := $expression/prof:line cast as xs:integer
let $column := ($expression/prof:column cast as xs:integer) + 1
let $count := $expression/prof:count cast as xs:integer
let $shallow-time := local:to-seconds($expression/prof:shallow-time)
let $deep-time := local:to-seconds($expression/prof:deep-time)
order by $shallow-time descending
return
  string-join((
    $uri, $line, $column, $count,
    $shallow-time, ($shallow-time div $overall-elapsed) * 100,
    $deep-time, ($deep-time div $overall-elapsed) * 100,
    $source
  ) ! string(.), ",")

NOTE: This does not handle escaping of commas in the $source string.

An often asked question

I thought I’d pen a short blog post on a question I frequently find myself asking.

You’re right, but are you relevant?

Me

As software developers, it’s all too easy to want to select the new hot technologies available to us. Building a simple web app is great, but building a distributed web-app based on an auto-scaling cluster making use of CQRS and an SPA front-end written in the latest JavaScript hotness is just more interesting. Software development is often about finding simple solutions to complex problems, allowing us to minimize complexity.

  • You’re right, distributed microservices will allow you to deploy separate parts of your app independently, but is that relevant to a back-office system which only needs to be available 9-5?
  • You’re right, CQRS will allow you to better scale the database to handle a tremendous number of queries in parallel, but is that relevant when we only expect 100 users a day?
  • You’re right, that new Javascript SPA framework will let you create really compelling, interactive applications, but is that relevant to a basic CRUD app?

Lots of modern technology can do amazing things and there are always compelling reasons to choose technology x, but are those reasons relevant to the problem at hand? If you spend a lot of time up front designing systems around tough-to-implement technologies and approaches which aren’t needed, a lot of development effort will be needlessly spent adding all kinds of complexity. Worse, we could have added lots of complexity that then resists change, preventing us from adapting the system in the direction our users require.

At 67 Bricks, we encourage teams to start with something simple and then iterate upon it, adding complexity only when it is justified. Software systems can be designed to embrace change, an idea that’s the subject of one of my favourite books, Building Evolutionary Architectures.

So next time you find yourself architecting a big, complex system with lots of cutting-edge technologies that allow for all kinds of “-ilities” to be achieved, be sure to stop and ask yourself: “I’m right, but am I relevant?”

In praise of functional programming

At 67 Bricks, we are big proponents of functional programming. We believe that projects which use it are easier to write, understand and maintain. Scala is one of the most common languages we make use of, and we often see more object-oriented languages like C# being written from a functional perspective.

This isn’t to say that functional languages are inherently better than any other paradigms. Like any language, it’s perfectly possible to use it poorly and produce something unmaintainable and unreadable.

Immutable First

At first glance, immutable programming seems like a challenging limitation. Not being able to update the value of a variable is restrictive, but that very restriction is what makes functional code easier to understand. Personally, I found this one of the hardest concepts to wrap my head around when I first started functional programming: how can I write code if I can’t change the value of variables?

In practice, reading code was made much easier: once a value was assigned, it never changed, no matter how long the function was or what was done with the value. No more trying to keep track in my head of how a long method changes a specific variable, and all the ways it might not, thanks to various control flows.

For example, in a typical object-oriented language, when we pass a list into a method we know that the reference to the list won’t change, but the values within the list could. We would hope that the function is named something sensible, perhaps with some documentation that makes it clear what it will do to our list, but we can never be too sure. The only way to know exactly what is happening is to dive into that method and check for ourselves, which adds to the cognitive load of understanding the code. With a more functional programming language, we know that our list cannot be changed, because it is an immutable data structure with an immutable reference to it.
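A tiny sketch of the difference (the function and values are made up): with Scala’s immutable List, a callee cannot alter the caller’s data and can only return a new value.

// applyDiscount cannot modify the caller's list; it can only return a new one.
def applyDiscount(prices: List[Double]): List[Double] =
  prices.map(_ * 0.9)

val prices = List(10.0, 20.0)
val discounted = applyDiscount(prices)
// prices is still List(10.0, 20.0); discounted is List(9.0, 18.0)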

High Level Constructs

Functional code can often be more readable than object oriented code thanks to the various higher level functions and constructs like pattern matching. Naturally readability of code is very dependent on the programmer; it’s easy to make unreadable code in any language, but in the right hands these higher level constructs make the intended logic easier to understand and change. For example, here are 2 examples of code using regular expressions in Scala and C#:

def getFriendlyTime(time: String): String = {
  val timestampRegex = """([0-9]{2}):([0-9]{2}):([0-9]{2})\.([0-9]{3})""".r
  time match {
    case timestampRegex(hour, minutes, _, _) => s"It's $minutes minutes after $hour"
    case _ => "We don't know"
  }
}

public static string GetFriendlyTime(string time) {
  var timestampRegex = new Regex("([0-9]{2}):([0-9]{2}):([0-9]{2})\\.([0-9]{3})");
  var match = timestampRegex.Match(time);
  if (!match.Success) {
    return "We don't know";
  } else {
    var hours = match.Groups[1].Value;
    var minutes = match.Groups[2].Value;
    return $"It's {minutes} minutes after {hours}";
  }
}

I would argue that the pattern matching of Scala really helps to create clear, concise code. As time goes on, features from functional languages, like pattern matching, keep appearing in less functional ones like C#. A more Java-based example would be the Streams library, inspired by approaches found in functional programming.

This isn’t always the case; there are some situations where, say, a simple for loop is easier to understand than a complex fold operation. Luckily, Scala is a hybrid language, providing access to object-oriented and procedural styles of programming that can be used when the situation calls for it. This flexibility helps programmers pick the style that best suits the problem at hand to make understandable and maintainable codebases.

Pure Functions and Composition

Pure functions are easier to test than impure functions. If it’s simply a case of calling a method with some arguments and always getting back the same answer, tests are going to be reliable and repeatable. If a method is part of a class which maintains lots of complex internal state, then testing is going to be unavoidably more complex and fragile. Worse is when the class requires lots of dependencies to be passed into it; these may need to be mocked or stubbed, which can then lead to even more fragile tests.

Scala and other functional languages encourage developers to write small pure functions and then compose them together in larger more complex functions. This helps to minimise the amount of internal state we make use of and makes automated testing that much easier. With side effect causing code pushed to the edges (such as database access, client code to call other services etc.), it’s much easier to change how they are implemented (say if we wanted to change the type of database or change how a service call is made) making evolution of the code easier.

For example, take some possible Play controller code:

def updateUsername(id: Int, name: String) = action {
  val someUser = userRepo.get(id)
  someUser match {
    case Some(user) =>
      user.updateName(name) match {
        case Some(updatedUser) =>
          userRepo.save(updatedUser)
          bus.emit(updatedUser.events)
          Ok()
        case None => BadRequest()
      }
    case None => NotFound()
  }
}

def updateEmail(id: Int, email: String) = action {
  val someUser = userRepo.get(id)
  someUser match {
    case Some(user) =>
      user.updateEmail(email) match {
        case Some(updatedUser) =>
          userRepo.save(updatedUser)
          bus.emit(updatedUser.events)
          Ok()
        case None => BadRequest()
      }
    case None => NotFound()
  }
}

Using composition, the common elements can be extracted and we can make a more dense method by removing duplication.

def updateUsername(id: Int, name: String) = action {
  updateEntityAndSave(userRepo)(id)(_.updateName(name))
}

def updateEmail(id: Int, email: String) = action {
  updateEntityAndSave(userRepo)(id)(_.updateEmail(email))
}

// HasEvents is assumed to be a base trait exposing the entity's `events`.
private def updateEntityAndSave[T <: HasEvents](repo: Repo[T])(id: Int)(f: T => Option[T]): Result = {
  repo.get(id) match {
    case Some(entity) =>
      f(entity) match {
        case Some(updatedEntity) =>
          repo.save(updatedEntity)
          bus.emit(updatedEntity.events)
          Ok()
        case None => BadRequest()
      }
    case None => NotFound()
  }
}

None Not Null

Dubbed the billion-dollar mistake by Tony Hoare, nulls can be a source of great nuisance for programmers. Null Pointer Exceptions are all too often encountered and confusion reigns over whether null is a valid value or not. In some situations a null value may never happen and need not be worried about while in others it’s a perfectly valid value that has to be considered. In many OOP languages, there is no way of knowing if an object we get back from a method can be null or not without either jumping into said method or relying on documentation that often drifts or does not exist.

Functional languages avoid these problems simply by not having null (often Unit is used to represent a method that doesn’t return anything). In situations where we may want it, Scala provides the helpful Option type. This makes it explicit to the reader that when they call this method, they will get back either Some value or None. Even C# is now introducing features like Nullable Reference Types to help reduce the possible harm of null.

Scala also goes a step further, providing an Either construct which can contain a success (Right) or failure (Left) value. These can be chained together using a railway style of programming. This chaining approach gives us an easily readable description of what’s happening and pushes all our error handling to the end, rather than sprinkling it among the code.
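As a minimal sketch (the validation functions are made up), chaining Eithers in a for comprehension keeps the happy path readable while the first failure short-circuits the rest:

def parseAge(input: String): Either[String, Int] =
  input.toIntOption.toRight(s"'$input' is not a number")

def checkAdult(age: Int): Either[String, Int] =
  if (age >= 18) Right(age) else Left("must be 18 or over")

// Railway style: errors flow along the Left track, values along the Right.
val result: Either[String, Int] =
  for {
    age   <- parseAge("21")
    adult <- checkAdult(age)
  } yield adult
// result == Right(21); parseAge("abc") would instead give Left("'abc' is not a number")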

Using Option can also improve readability when interacting with Java code. For example, a method returning null can be pumped through an Option and combined with a match statement to lead to a neater result.

Option(someNullReturningMethod()) match {
  case Some(result) =>
    // do something
  case None =>
    // method returned null. Bad method!
}

Conclusions

There are many reasons to prefer functional approaches to programming. Even in less functionally orientated languages I have found myself reaching for functional constructs to arrange my code. I think it’s something all software developers should try learning and applying in their code.

Naturally, not every developer enjoys FP, it is not a silver bullet that will overcome all our challenges. Even within 67 Bricks we have differing opinions on which parts of functional programming are useful and which are not (scala implicits can be quite divisive). It’s ultimately just another tool in our expansive toolkit that helps us to craft correct, readable, flexible and functioning software.

Introduction to 67 Bricks immutable deployments

Since the rise of cloud computing, the way of creating and maintaining infrastructure has changed. While some years ago an unhealthy machine had to be logged into and fixed manually, or, even worse, turned off and a new one configured from scratch by a systems administrator, thanks to modern technologies it has become possible to stop worrying about whether each resource has been configured in exactly the same way.

This post summarizes the main tools used at 67 Bricks for deployment of the services that we host and manage, and describes our deployment process.

A dictionary definition of immutability is “the state of not changing, or being unable to be changed”. And this is what deploying immutably is about – after being released, infrastructure does not change in place, and in order to make changes, a new version is provisioned and the old version is destroyed.

Some of the benefits of immutability include the following:

  1. Consistency of environments – the same processes and procedures are applied to create environments for staging, testing and live, thereby making it easier to test.
  2. An immutable deployment pipeline means that what happens during deployments is documented by using configuration as code – this enables developers to understand the services better and know exactly what happens at each stage.
  3. Having configuration stored as code in a repository facilitates spinning up new services thus improving speed of development.
  4. The ability to reproduce resources aids immensely in automated disaster recovery where it might only take a few minutes to replace an unhealthy resource with a new one.
  5. Immutable deployments allow dynamic scaling of resources in the cloud based on demand.

Tools used

At 67 Bricks we use a variety of modern tools and technologies to deploy our services.

  • Gitlab is our internal code repository and CI/CD system.
  • Packer is used to create a server image with the software baked onto it.
  • Ansible is used to configure servers; it is called by Packer during the image build phase.
  • Terraform is used to provision infrastructure.
  • AWS is typically where we deploy applications.

Deployment process

Firstly, one or more environments are created in an AWS account using Terraform, and resources such as a VPC (virtual private cloud) are launched.

Secondly, a base image is built daily from one of AWS’s AMIs (Amazon Machine Images). By means of Ansible, the latest packages are fetched, tools such as the AWS CLI and snap updates are installed, and automatic security updates are configured. This image is available for developers to use for their applications. Since a lot of our applications run on Linux, we use Ubuntu to build the base image.

When code is merged into the main/master branch in Gitlab, this starts a pipeline during which the application is built, tests are run, and an AMI with the application installed is created via Packer, with Ansible tasks configuring the machine for use by that specific application. This image is tagged so that we know which application is on it.

After that, the pipeline deploys the application to test and then to live. We use autoscaling for our applications (specifically, AWS Auto Scaling groups), so to roll out the newly built image, a script destroys the existing application servers and creates new ones running the updated version of the application. The Auto Scaling group configuration is also amended so that any future scaling events use the new AMI.

When servers (or EC2 instances, in AWS speak) are launched in an autoscaling group, a user_init script gets run on initialization. This is the stage at which any environment specific setup can be done.

Immutability is of paramount importance if you want to create a simpler, more predictable deployment pipeline whose outcome can be trusted. Our teams have adopted the immutable deployment approach, and it helps us to ensure that repeatable processes and reliable services and applications are in place.

Case Genie or How to Find Unknown Unknowns

It was Paul Magrath, Head of Product Development and Online Content at ICLR, who first used the late Donald Rumsfeld’s phrase to describe the business case for what later became known as Case Genie.

The idea was simple. Lawyers needed to discover historical cases that might impact a case they were preparing. Frequently-cited cases would already be known to them: they’d be at the tips of their fingers, ready to type into their skeleton argument; cases known to every barrister specialising in the field they were arguing; cases that had been cited many times both by other cases and in text books; or cases that had recently changed how the law should be interpreted, and were therefore big news within a narrowly-focused legal community.

But there may be some cases that are relatively unknown that might make all the difference. Enter Case Genie.

This blog post presents a technical overview of Case Genie. How, exactly, is it possible to find “unknown unknowns”?

Word embeddings

Document embeddings are considered the state-of-the-art way of finding similarities between documents. But before document embeddings come word embeddings. A good way to start thinking about word embeddings is to think about the words in a document. Let’s take Milton’s Areopagitica as an example. This single work is our corpus (the body of text we’re interested in). Let’s take a single sentence from it:

Many a man lives a burden to the Earth; but a good Book is the precious life-blood of a master spirit, embalmed and treasured up on purpose to a life beyond life.

Now, if we take each distinct word, we can give it a score based on how often it occurs in the text:

  many  — 1
  a     — 5
  man   — 1
  lives — 1
  …etc…

Now imagine graphing three of these words. I’m choosing three because this is easy to visualise, but rather than take the first three words in the sentence, let’s take the following:

  a         — 5
  life      — 3
  treasured — 1

Imagine these graphed across the x, y and z axes: ‘a’ has the value 5 on the x axis; ‘life’ has the value 3 on the y axis; and ‘treasured’ has the value 1 on the z axis. Further, imagine that for each value, there is an arrow from the origin to the point along each axis, so that we have 3 arrows of different lengths pointing in different directions. Now, in mathematics a value with a direction is called a vector, so we can say that each distinct word within a corpus can be represented as a vector; and within a document, a vector’s value is the number of occurrences of the word within that document.

Three dimensions aren’t too hard to visualise. Extrapolating, we can add further axes, which is much harder to visualise; as many axes as there are distinct words. Each distinct word in the corpus will therefore be represented by a vector.

Of course, that’s just one sentence and there are many in Areopagitica. The full vector space is defined by the number of unique words in the corpus, let’s suppose there are 2,000 in Areopagitica. Initially this will sound confusing, but the word embedding for a given word is made up of a value for every vector in the vector space, which is the same as saying a value for every word. For each distinct word, its word embedding will therefore comprise 1,999 zero values and one non-zero value — 1. If we order the distinct words alphabetically, we can represent an embedding as an implicitly ordered array of numbers. Therefore, for each word embedding, the non-zero value will be in a different place; ‘a’ would be:

  1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …

whereas ‘and’ would be:

  0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …

This is crazily sparse; the cleverness comes in using AI to make better use of all that empty space.

Algorithms like word2vec and fastText (there are other algorithms, but these are the two we looked at for ICLR) provide a more compact representation of word embeddings. Whereas with Areopagitica, using the algorithm I described above, we would need 2,000 values (dimensions) for each word, word2vec and fastText can use a few hundred dimensions. You can download a 300-dimensional model from the fastText website trained on 2 million words from Wikipedia. Word embeddings actually use floating point numbers, so the word ‘a’ in the fastText model we trained for ICLR starts:

  -0.20005 -0.019533 -0.15494 -0.11114 0.074793 0.1194 -0.046101 …

The model for ICLR has 600 dimensions, so there are 600 numbers for each word embedding.

Exactly how word2vec and fastText create these word vectors is outside the scope of this blog post. Essentially, you feed the corpus to the tool, which trains a model from it. The corpus and the training algorithms define relationships between words or character combinations, so the corpus used to train the model determines how word relationships are actually represented in the model. If you train with a French corpus, you will get good results when calculating document embeddings for French text, but bad results if you present English text. In the same vein, if you train a model using legal documents, you will get a model that is more sensitive to legal meanings and definitions.

So, if we were to use Areopagitica as our corpus, we would have a model trained to recognise only the ways words are used in Areopagitica. We could calculate sentence embeddings (which we could treat like document embeddings) for every sentence in Areopagitica to determine which sentences were most similar. However, the results would likely be quite poor. The reason for this is that the corpus is too small: ideally, you need a lot of words, and the more the better. For ICLR, we used all of the judgment transcripts and Case Reports contained in ICLR.online as our corpus. This represents around 200,000 documents; about 600,000,000 words.

But what are document embeddings, and how do you create them?

Document embeddings

Once you have word embeddings, you will want to create document embeddings. To create a document embedding, take all the word embeddings of the document and combine them into a single representation with the same number of dimensions.

The combination relies on the same machinery as cosine similarity. Cosine similarity compares two vectors by calculating the cosine of the angle between them; thinking about angles is a nice way to explain it visually, but algebraically it is simply the dot product of the two vectors once each has been normalised to unit length.

This algorithm combines word embeddings into document embeddings:

  1. Create an empty embedding (where all values are 0) called A. This is our aggregate.
  2. For each word embedding, w:
    1. normalise w;
    2. for each value in A, add the corresponding value in w (add the first number in A to the first number in w, the second number in A to the second number in w, and so on).
  3. Divide each value in A by the number of words used to calculate it.

To normalise a word embedding, take the square root of the sum of each value’s square (let’s call it n); then divide each value in the embedding by n.
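
As a minimal sketch of the algorithm above (plain Scala with embeddings as arrays of doubles; this is not the actual fastText implementation), normalising and averaging might look like this:

// Normalise a word embedding to unit length: divide each value by the
// square root of the sum of the squares (the vector's length, n).
def normalise(w: Array[Double]): Array[Double] = {
  val n = math.sqrt(w.map(v => v * v).sum)
  if (n == 0.0) w else w.map(_ / n)
}

// Combine word embeddings into a document embedding: sum the normalised
// word vectors element-wise, then divide by the number of words.
def documentEmbedding(words: Seq[Array[Double]]): Array[Double] = {
  val dimensions = words.head.length
  val aggregate  = Array.fill(dimensions)(0.0)   // the empty embedding, A
  for (w <- words.map(normalise); i <- 0 until dimensions)
    aggregate(i) += w(i)
  aggregate.map(_ / words.length)
}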

This is the implementation used by fastText. However we have tweaked the algorithm to make use of tf–idf. tf–idf weights words that are considered important. A word is considered important if its frequency within the corpus is low compared to its frequency within a document. Since words like ‘a’ occur throughout the corpus, it is weighted low; whereas a word like ‘theft’ will be weighted higher, because it occurs only in a subset of documents within the corpus.
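
A hedged sketch of that tweak, reusing normalise from the previous example (the tf–idf weights are assumed to be precomputed elsewhere, and the weighting scheme shown is illustrative rather than our exact implementation):

// Weighted variant: scale each normalised word vector by its tf-idf weight
// and divide by the total weight rather than by the number of words.
def weightedDocumentEmbedding(weightedWords: Seq[(Array[Double], Double)]): Array[Double] = {
  val dimensions = weightedWords.head._1.length
  val aggregate  = Array.fill(dimensions)(0.0)
  for ((w, weight) <- weightedWords; (v, i) <- normalise(w).zipWithIndex)
    aggregate(i) += v * weight
  aggregate.map(_ / weightedWords.map(_._2).sum)
}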

Calculating the distance between document embeddings

The final puzzle piece is to quantify how similar documents are based on their document embeddings. The distance between two documents is used as an indication of how similar they are. Documents that are close together are more similar than those that are further away.

To calculate the distance between two document embeddings, cosine similarity can be used again. This time, the two document embeddings are collapsed to a single value between 0 and 1: identical documents have the value 1, while completely orthogonal documents have the value 0.
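
Reusing normalise from the earlier sketch, a minimal cosine similarity calculation between two document embeddings could look like this:

// Cosine similarity: the dot product of the two embeddings once each has been
// normalised to unit length; 1.0 means identical direction, 0.0 means orthogonal.
def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  val (na, nb) = (normalise(a), normalise(b))
  na.zip(nb).map { case (x, y) => x * y }.sum
}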

We use a library called Faiss, created by Facebook Research, to store and query document embeddings. This is very fast, and you can query multiple input embeddings simultaneously.

And finally…

This post describes some of the major components used in Case Genie and delves a little into the concepts and some of the algorithms; but there is a lot more that I don’t have time to cover: specifically, how do we prepare the text of the corpus for training the model? I will cover that in another post.