Organising Git Pull Requests

Recently we had an interesting dev forum on Git workflows. Git is the de facto source control tool of choice for software development. Because it is powerful and flexible, teams have a wide range of choices to make when deciding exactly how to use it. This discussion originally stemmed from a Changelog podcast titled “Git your reset on”, based on the blog post “Git Organized” by Annie Sexton.

To briefly summarise the above, if we imagine a situation where a merge to main leads to a production deployment and then an issue is found in production, how do we roll back? We look in the git history, find the commit that introduced the bug, revert it and then redeploy to production. But oh no! This revert changed a number of source files beyond the scope of the bug and has ended up introducing a new bug.

The proposed solution to this is to ensure a sensible, clean history of commits is used within a branch. Annie Sexton suggests using a feature branch where you commit regularly with useless messages like WIP until you are ready to submit a pull request. You then run git reset origin/head, which presents you with all the changes you have made and how they differ from main; you then progressively add files and make commits with sensible messages to build up a more coherent change.

This approach has a number of advantages we discussed:

  • It provides a neat history of compartmentalised changes
  • Reviewers can then follow a structured story of commits about how the developer arrived at their solution
  • Irrelevant changes are excluded
  • It ensures that all of the included changes are relevant only to the task at hand

Disadvantages of this approach may include:

  • Bad approaches that were tried and abandoned are not in the history
  • Changes won’t always compartmentalise neatly into whole files, meaning we may need to commit specific changes within a file; some tools do a poor job of supporting this use case

It turns out that a lot of developers at 67 Bricks use a form of history rewriting when developing code, but no one used the specific approach proposed above. git commit --amend is a very useful command for tweaking the previous commit. Some (including myself) use git rebase -i as a way of rebuilding the commit history, though this has a limitation in cases where you may want to split a commit up. Finally, others create new branches and build up clean commits on the new branch.

Is force pushing a force for good?

This question really split the developers. Some see git push --force as an evil that indicates you’re using git wrongly and something that should only be used as a last resort. Others really like the idea of force pushing, but only for branches you’ve been developing on and have yet to create a merge request for.

Squashing?

A workflow some developers have seen before involves using the --squash flag when merging into main. This can create a nice history where each commit neatly maps to a single ticket, but the general view was that this causes the loss of helpful information from the git history.

Should the software be valid at every commit?

Some argued that this is a sensible thing to strive for; having confidence that you can revert back to any commit and the software will still work is a nice place to be. However, others criticised this as tricky to achieve in practice. Ideally the tests would all be valid and pass at every commit too, but this goes against how some developers work: they create a failing test to show the problem or new feature, commit it, and then build out code to make the test green. Others claim that tests should change in the same commit as code changes, to make it obvious that these changes belong together.

Tools used by Developers

At 67 Bricks, we try to avoid specifying which precise tool a developer should use for their craft. Different developers prefer different approaches when it comes to using git: some are more comfortable with their IDE’s git plugin (e.g. IntelliJ has good git integration which can add specific lines to commits) while others prefer using the command line. We’ve used a range of tools when dealing with git, from IDE integrations to command-line clients.

To conclude, specific approaches and tools used vary, but everyone agreed on the benefits of striving for a clean, structured git history.

MarkLogic profiler

When investigating slow XQuery or XSLT queries in MarkLogic, one of the tools it provides is a query profiler. In this post I’ll explain how to make use of the profiler and how to interpret the results.

There are 3 ways you can make use of the profiler:

  1. use the MarkLogic QConsole;
  2. use an IDE that supports the profiler like IntelliJ with the XQuery and XSLT plugin;
  3. use the MarkLogic profiler API.

Using the QConsole

In the MarkLogic QConsole you can enable profiling by clicking on the Profile tab next to the Results tab and pressing the Run button. This will display a table with the profiling results. The profile can be saved by clicking on the download button on the right.

The profile result table contains the following information:

  1. Module:Line No.: Col No. – This is where the expression being profiled starts;
  2. Count – This is the number of times the expression was evaluated;
  3. Shallow %/Shallow ms – This is how long the expression took to run on its own across all Count times it was evaluated;
  4. Deep %/Deep ms – This is how long the expression took to run, including the time its subexpressions took to evaluate;
  5. Expression – This is the string representation of the query expression being evaluated.

NOTE: Here, ms is microseconds not milliseconds.

If you want to do further processing on the downloaded profile:report XML, you can load that XML by using:

let $data := xdmp:filesystem-file("C:\profiling\profile.xml")
let $profile := xdmp:unquote($data)/prof:report

Analysing Profile Results

Sorting by shallow time is useful for identifying the expressions that take a long time to evaluate. This can identify places where you should modify the query or add indices.

The count column is also useful. It can indicate places where you have nested loops that result in quadratic time queries.

When making optimizations, it can be tricky to know whether a change actually speeds up the query, or whether the apparent difference just falls within the normal variation seen when running a query multiple times. To handle this, I like to use a spreadsheet and measure the average (mean) time and standard deviation over 5 or 10 runs.

This can also be useful when varying the number of items being processed (e.g. documents) in order to identify whether the query scales with that number, and whether the scaling is linear, quadratic, or some other curve. For this, you can use line graphs to visualise the performance, and linear or quadratic regression formulae to check what shape the curve is.

The stock chart graph is useful for comparing the variation between runs. Here, the minimum and maximum times form the extents (low, high) of the chart, and the mean +/- one standard deviation forms the inner bar (open, close) of the chart.

Using the Profiler API

With the profiler API, you can use the following to collect the profile data for a query expression:

let $_ := prof:enable(xdmp:request())
(: code to profile, e.g. -- let $_ := local:test() :)
let $_ := prof:disable(xdmp:request())
let $profile := prof:report(xdmp:request())

NOTE: If you are running this from an XQuery 1.0 query instead of a 1.0-ml query then you need to declare the prof namespace:

declare namespace prof = "http://marklogic.com/xdmp/profile";

You can then either save the profile to a file, e.g.:

let $_ := xdmp:save("C:\profiling\profile.xml", $profile)

or process the resulting profile:report XML in the XQuery script.

Alternatively you can use one of prof:eval or prof:xslt-eval to evaluate an in-memory query, or prof:invoke or prof:xslt-invoke to evaluate a module file. These return the profile data as the first item and the result as the remaining items, so you can use the following:

let $ret := prof:eval("for $x in 1 to 10 return $x")
let $profile := $ret[1]
let $results := $ret[position() != 1]

With the profile report XML, you can then process it as you need. For example, to create a CSV version of the QConsole profile table you can use the following query:

declare function local:to-seconds($value) {
  ($value cast as xs:dayTimeDuration) div xs:dayTimeDuration("PT1S")
};

let $overall-elapsed := local:to-seconds($profile/prof:metadata/prof:overall-elapsed)
for $expression in $profile/prof:histogram/prof:expression
let $source := $expression/prof:expr-source/string()
let $uri := string(($expression/prof:uri/text(), "main")[1])
let $line := $expression/prof:line cast as xs:integer
let $column := ($expression/prof:column cast as xs:integer) + 1
let $count := $expression/prof:count cast as xs:integer
let $shallow-time := local:to-seconds($expression/prof:shallow-time)
let $deep-time := local:to-seconds($expression/prof:deep-time)
order by $shallow-time descending
return
  string-join((
    $uri, $line, $column, $count,
    $shallow-time, ($shallow-time div $overall-elapsed) * 100,
    $deep-time, ($deep-time div $overall-elapsed) * 100,
    $source
  ) ! string(.), ",")

NOTE: This does not handle escaping of commas in the $source string.

An often asked question

I thought I’d pen a short blog post on a question I frequently find myself asking.

You’re right, but are you relevant?

Me

As software developers, it’s all too easy to want to select the new hot technologies available to us. Building a simple web app is great, but building a distributed web app based on an auto-scaling cluster making use of CQRS and an SPA front-end written in the latest JavaScript hotness is just more interesting. Yet software development is often about finding simple solutions to complex problems, keeping complexity to a minimum.

  • You’re right, distributed microservices will allow you to deploy separate parts of your app independently, but is that relevant to a back-office system which only needs to be available 9-5?
  • You’re right, CQRS will allow you to better scale the database to handle a tremendous number of queries in parallel, but is that relevant when we only expect 100 users a day?
  • You’re right, that new Javascript SPA framework will let you create really compelling, interactive applications, but is that relevant to a basic CRUD app?

Lots of modern technology can do amazing things and there are always compelling reasons to choose technology x, but are those reasons relevant to the problem at hand? If you spend a lot of time up front designing systems around tough-to-implement technologies and approaches that aren’t needed, a lot of development effort will be spent needlessly adding complexity. Worse, that complexity can then resist change, preventing us from adapting the system in the direction our users require.

At 67 Bricks, we encourage teams to start with something simple and then iterate upon it, adding complexity only when it is justified. Software systems can be designed to embrace change, an idea that’s the subject of one of my favourite books, Building Evolutionary Architectures.

So next time you find yourself architecting a big, complex system with lots of cutting-edge technologies that allow for all kinds of “-ilities” to be achieved, be sure to stop and ask yourself: “I’m right, but am I relevant?”

In praise of functional programming

At 67 Bricks, we are big proponents of functional programming. We believe that projects which use it are easier to write, understand and maintain. Scala is one of the most common languages we make use of, and we find that more object-oriented languages like C# are often written from a functional perspective too.

This isn’t to say that functional languages are inherently better than any other paradigms. Like any language, it’s perfectly possible to use it poorly and produce something unmaintainable and unreadable.

Immutable First

At first glance, immutable programming seems a challenging limitation. Not being able to update the value of a variable is restrictive, but that very restriction is what makes functional code easier to understand. Personally, I found this one of the hardest concepts to wrap my head around when I first started functional programming: how can I write code if I can’t change the value of variables?

In practice, reading code became much easier: once a value was assigned, it never changed, no matter how long the function was or what was done with the value. No more trying to keep track in my head of how a long method changes a specific variable, and all the ways it may not, thanks to various control flows.

For example, in a typical object-oriented language, when we pass a list into a method we know that the reference to the list won’t change, but the values within the list could. We would hope that the method is named something sensible, perhaps with some documentation that makes it clear what it will do to our list, but we can never be too sure. The only way to know exactly what is happening is to dive into that method and check for ourselves, which adds to the cognitive load of understanding the code. With a more functional programming language, we know that our list cannot be changed because it is an immutable data structure with an immutable reference to it.
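
To make the contrast concrete, here is a small sketch (Scala 2.13, with made-up names): a function handed a mutable Java list can change it under the caller’s feet, whereas a function handed an immutable Scala List can only return a new one.

import scala.jdk.CollectionConverters._

// Mutates the caller's list in place: the caller cannot tell without reading the body.
def tidyJava(names: java.util.List[String]): Unit =
  names.removeIf(_.isEmpty)

// Returns a new list: the caller's list is guaranteed to be untouched.
def tidyScala(names: List[String]): List[String] =
  names.filter(_.nonEmpty)

val mutable = new java.util.ArrayList[String](List("Ada", "").asJava)
tidyJava(mutable)                  // mutable is now ["Ada"]

val immutable = List("Ada", "")
val tidied = tidyScala(immutable)  // immutable is still List("Ada", ""); tidied is List("Ada")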

High Level Constructs

Functional code can often be more readable than object oriented code thanks to various higher level functions and constructs like pattern matching. Naturally, readability is very dependent on the programmer; it’s easy to write unreadable code in any language, but in the right hands these higher level constructs make the intended logic easier to understand and change. For example, here are two equivalent pieces of code using regular expressions, in Scala and C#:

def getFriendlyTime(time: String): String = {
  val timestampRegex = "([0-9]{2}):([0-9]{2}):([0-9]{2})\\.([0-9]{3})".r
  time match {
    case timestampRegex(hour, minutes, _, _) => s"It's $minutes minutes after $hour"
    case _ => "We don't know"
  }
}

public static String GetFriendlyTime(String time) {
  var timestampRegex = new Regex("([0-9]{2}):([0-9]{2}):([0-9]{2})\\.([0-9]{3})");
  var match = timestampRegex.Match(time);
  if (!match.Success) {
    return "We don't know";
  } else {
    var hours = match.Groups[1].Value;
    var minutes = match.Groups[2].Value;
    return $"It's {minutes} minutes after {hours}";
  }
}

I would argue that Scala’s pattern matching really helps to create clear, concise code. As time goes on, features from functional languages like pattern matching keep appearing in less functional ones like C#. A Java-based example would be the Streams library, inspired by approaches found in functional programming.

This isn’t always the case; there are some situations where, say, a simple for loop is easier to understand than a complex fold operation. Luckily, Scala is a hybrid language, providing access to object oriented and procedural styles of programming that can be used when the situation calls for them. This flexibility helps programmers pick the style that best suits the problem at hand to make understandable and maintainable codebases.

Pure Functions and Composition

Pure functions are easier to test than impure functions. If it’s simply a case of calling a method with some arguments and always getting back the same answer, tests are going to be reliable and repeatable. If a method is part of a class which maintains lots of complex internal state, then testing is going to be unavoidably more complex and fragile. Worse is when the class requires lots of dependencies to be passed into it; these may need to be mocked or stubbed, which can then lead to even more fragile tests.
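
As a small illustration (a sketch with made-up names, not code from a real project), a pure function needs no setup to test, whereas a stateful class needs its state arranged before every assertion:

// Pure: same inputs, same output, so the test is trivially repeatable.
def addVatPence(netPence: Int, ratePercent: Int): Int =
  netPence + netPence * ratePercent / 100

assert(addVatPence(10000, 20) == 12000)

// Stateful: the result depends on what has been called before,
// so tests must carefully arrange that state first.
class Basket {
  private var totalPence: Int = 0
  def add(netPence: Int): Unit = totalPence += netPence
  def totalWithVatPence(ratePercent: Int): Int =
    totalPence + totalPence * ratePercent / 100
}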

Scala and other functional languages encourage developers to write small pure functions and then compose them together into larger, more complex functions. This helps to minimise the amount of internal state we make use of and makes automated testing that much easier. With side-effect-causing code pushed to the edges (such as database access, or client code calling other services), it’s much easier to change how those parts are implemented (say, if we wanted to change the type of database or how a service call is made), making evolution of the code easier.

For example, take some possible Play controller code:

def updateUsername(id: Int, name: String) = Action {
  val someUser = userRepo.get(id)
  someUser match {
    case Some(user) =>
      user.updateName(name) match {
        case Some(updatedUser) =>
          userRepo.save(updatedUser)
          bus.emit(updatedUser.events)
          Ok()
        case None => BadRequest()
      }
    case None => NotFound()
  }
}

def updateEmail(id: Int, email: String) = Action {
  val someUser = userRepo.get(id)
  someUser match {
    case Some(user) =>
      user.updateEmail(email) match {
        case Some(updatedUser) =>
          userRepo.save(updatedUser)
          bus.emit(updatedUser.events)
          Ok()
        case None => BadRequest()
      }
    case None => NotFound()
  }
}

Using composition, the common elements can be extracted and the duplication removed, leaving much more concise methods.

def updateUsername(id: Int, name: String) = Action {
  updateEntityAndSave(userRepo)(id)(_.updateName(name))
}

def updateEmail(id: Int, email: String) = Action {
  updateEntityAndSave(userRepo)(id)(_.updateEmail(email))
}

// Assumes a common Entity trait exposing the events emitted by an update.
private def updateEntityAndSave[T <: Entity](repo: Repo[T])(id: Int)(f: T => Option[T]): Result = {
  repo.get(id) match {
    case Some(entity) =>
      f(entity) match {
        case Some(updatedEntity) =>
          repo.save(updatedEntity)
          bus.emit(updatedEntity.events)
          Ok()
        case None => BadRequest()
      }
    case None => NotFound()
  }
}

None Not Null

Dubbed the billion-dollar mistake by Tony Hoare, nulls can be a source of great nuisance for programmers. Null pointer exceptions are all too often encountered, and confusion reigns over whether null is a valid value or not. In some situations a null value may never happen and need not be worried about, while in others it’s a perfectly valid value that has to be considered. In many OOP languages, there is no way of knowing whether an object we get back from a method can be null without either jumping into said method or relying on documentation that often drifts or does not exist.

Functional languages avoid these problems simply by not having null (often Unit is used to represent a method that doesn’t return anything). In situations where we may want it, Scala provides the helpful Option type. This makes it explicit to the reader that, when you call this method, you will get back either Some value or None. Even C# is now introducing features like nullable reference types to help reduce the possible harm of null.

Scala also goes a step further, providing an Either construct which can contain either a success (Right) or a failure (Left) value. These can be chained together using a railway-oriented style of programming. This chaining approach gives us an easily readable description of what’s happening and pushes all the error handling to the end, rather than sprinkling it throughout the code.
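
As a minimal sketch of that railway style (the User type and validation rules here are made up for illustration), each step returns an Either and a for-comprehension short-circuits on the first Left:

final case class User(id: Int, name: String)

def validateName(name: String): Either[String, String] =
  if (name.trim.nonEmpty) Right(name.trim) else Left("Name must not be empty")

def saveUser(id: Int, name: String): Either[String, User] =
  if (id > 0) Right(User(id, name)) else Left(s"Invalid id: $id")

def updateName(id: Int, name: String): Either[String, User] =
  for {
    validated <- validateName(name)      // stops here with a Left if validation fails
    saved     <- saveUser(id, validated) // only runs if the previous step succeeded
  } yield saved

// updateName(1, "Ada") == Right(User(1, "Ada"))
// updateName(1, "")    == Left("Name must not be empty")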

Using Option can also improve readability when interacting with Java code. For example, a method returning null can be pumped through an Option and combined with a match statement to lead to a neater result.

Option(someNullReturningMethod()) match {
  case Some(result) =>
    // do something
  case None =>
    // method returned null. Bad method!
}

Conclusions

There are many reasons to prefer functional approaches to programming. Even in less functionally orientated languages I have found myself reaching for functional constructs to arrange my code. I think it’s something all software developers should try learning and applying in their code.

Naturally, not every developer enjoys FP, and it is not a silver bullet that will overcome all our challenges. Even within 67 Bricks we have differing opinions on which parts of functional programming are useful and which are not (Scala implicits can be quite divisive). It’s ultimately just another tool in our expansive toolkit that helps us to craft correct, readable, flexible and functioning software.

Introduction to 67 Bricks immutable deployments

Since the rise of cloud computing, the way infrastructure is created and maintained has changed. Some years ago, an unhealthy machine had to be logged into and fixed manually or, even worse, turned off while a new one was configured from scratch by a systems administrator. Thanks to modern technologies, we no longer need to worry about hand-maintaining individual resources to keep them configured in exactly the same way.

This post summarizes the main tools used at 67 Bricks for deployment of the services that we host and manage, and describes our deployment process.

A dictionary definition of immutability is “the state of not changing, or being unable to be changed”. And this is what deploying immutably is about – after being released, infrastructure does not change in place, and in order to make changes, a new version is provisioned and the old version is destroyed.

Some of the benefits of immutability include the following:

  1. Consistency of environments – the same processes and procedures are applied to create environments for staging, testing and live, thereby making it easier to test.
  2. An immutable deployment pipeline means that what happens during deployments is documented by using configuration as code – this enables developers to understand the services better and know exactly what happens at each stage.
  3. Having configuration stored as code in a repository facilitates spinning up new services thus improving speed of development.
  4. The ability to reproduce resources aids immensely in automated disaster recovery where it might only take a few minutes to replace an unhealthy resource with a new one.
  5. Immutable deployments allow dynamic scaling of resources in the cloud based on demand.

Tools Used

At 67 Bricks we use a variety of modern tools and technologies to deploy our services:

  • Gitlab is our internal code repository and CI/CD system.
  • Packer is used to create a server image with the software baked onto it.
  • Ansible is used to configure servers, called by Packer during the image build phase.
  • Terraform is used to provision infrastructure.
  • AWS is typically where we deploy applications.

Deployment process

Firstly, an environment or multiple environments are created in an AWS account using Terraform, and resources such as VPC (virtual private cloud) are launched.

Secondly, a base image is built daily from one of the AWS AMIs (Amazon Machine Images). Ansible fetches the latest packages, installs tools such as the AWS CLI and snap updates, and configures automatic security updates. This image is then available for developers to use for their applications. Since a lot of our applications run on Linux, we use Ubuntu to build the base image.

When code is merged into the main/master branch in Gitlab, this starts a pipeline during which the application is built, tests are run and an AMI with the application installed is created via Packer, with Ansible tasks configuring the machine for use by the specific application. This image is tagged so that we know which application is on it.

After that, the pipeline deploys the application to test and then to live. We use autoscaling for our applications (specifically, AWS AutoScaling Groups), and to use the newly built image with the new version of the application, a script is used to destroy existing application servers and create new ones with the updated version of the application installed. Autoscaling Group configuration is also amended to use the new AMI for any autoscaling events.

When servers (or EC2 instances, in AWS speak) are launched in an autoscaling group, a user_init script gets run on initialization. This is the stage at which any environment specific setup can be done.

Immutability is of paramount importance if you want to create a simpler, more predictable deployment pipeline whose outcome is trusted. Our teams have adopted the immutable deployment approach, and it helps us to ensure repeatable processes and reliable services and applications are in place.

Case Genie or How to Find Unknown Unknowns

It was Paul Magrath, Head of Product Development and Online Content at ICLR, who first used the late Donald Rumsfeld’s phrase to describe the business case for what later became known as Case Genie.

The idea was simple. Lawyers needed to discover historical cases that might impact a case they were preparing. Frequently-cited cases would already be known to them: they’d be at the tips of their fingers, ready to type into their skeleton argument; cases known to every barrister specialising in the field they were arguing; cases that had been cited many times both by other cases and in text books; or cases that had recently changed how the law should be interpreted, and were therefore big news within a narrowly-focused legal community.

But there may be some cases that are relatively unknown that might make all the difference. Enter Case Genie.

This blog post presents a technical overview of Case Genie. How, exactly, is it possible to find “unknown unknowns”?

Word embeddings

Document embeddings are considered the state-of-the-art way of finding similarities between documents, but before document embeddings come word embeddings. A good way to start thinking about word embeddings is to think about the words in a document. Let’s take Milton’s Areopagitica as an example. This single work is our corpus (the body of text we’re interested in). Here is a single sentence from it:

Many a man lives a burden to the Earth; but a good Book is the precious life-blood of a master spirit, embalmed and treasured up on purpose to a life beyond life.

Now, if we take each distinct word, we can give it a score based on how often it occurs in the text:

  many  — 1
  a     — 5
  man   — 1
  lives — 1
  …etc…
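
Those counts are easy to reproduce; here is a rough sketch in Scala (the tokenisation is deliberately crude and splits on hyphens, so that ‘life-blood’ contributes to the count for ‘life’, matching the figures above):

val sentence = "Many a man lives a burden to the Earth; but a good Book is the " +
  "precious life-blood of a master spirit, embalmed and treasured up on purpose " +
  "to a life beyond life."

// Lowercase, split on anything that isn't a letter, and count each distinct word.
val counts: Map[String, Int] = sentence
  .toLowerCase
  .split("[^a-z]+")
  .filter(_.nonEmpty)
  .groupBy(identity)
  .view.mapValues(_.length)
  .toMap

// counts("a") == 5, counts("life") == 3, counts("treasured") == 1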

Now imagine graphing three of these words. I’m choosing three because this is easy to visualise, but rather than take the first three words in the sentence, let’s take the following:

  a         — 5
  life      — 3
  treasured — 1

Imagine these graphed across the x, y and z axes: ‘a’ has the value 5 on the x axis; ‘life’ has the value 3 on the y axis; and ‘treasured’ has the value 1 on the z axis. Further, imagine that for each value, there is an arrow from the origin to the point along each axis, so that we have 3 arrows of different lengths pointing in different directions. Now, in mathematics a value with a direction is called a vector, so we can say that each distinct word within a corpus can be represented as a vector; and within a document, a vector’s value is the number of occurrences of the word within that document.

Three dimensions aren’t too hard to visualise. Extrapolating, we can add further axes, which is much harder to visualise; as many axes as there are distinct words. Each distinct word in the corpus will therefore be represented by a vector.

Of course, that’s just one sentence and there are many in Areopagitica. The full vector space is defined by the number of unique words in the corpus; let’s suppose there are 2,000 in Areopagitica. Initially this will sound confusing, but the word embedding for a given word is made up of a value for every vector in the vector space, which is the same as saying a value for every word. For each distinct word, its word embedding will therefore comprise 1,999 zero values and one non-zero value — 1. If we order the distinct words alphabetically, we can represent an embedding as an implicitly ordered array of numbers. Therefore, for each word embedding, the non-zero value will be in a different place; ‘a’ would be:

  1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …

whereas ‘and’ would be:

  0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 …

This is crazily sparse; the cleverness comes when using AI to use all that empty space more wisely.

Algorithms like word2vec and fastText (there are other algorithms, but these are the two we looked at for ICLR) provide a more compact representation of word embeddings. Whereas with Areopagitica, using the algorithm I described above, we would need 2,000 values (dimensions) for each word, word2vec and fastText can use a few hundred dimensions. You can download a 300-dimensional model from the fastText website trained on 2 million words from Wikipedia. Word embeddings actually use floating point numbers, so the word ‘a’ in the fastText model we trained for ICLR starts:

  -0.20005 -0.019533 -0.15494 -0.11114 0.074793 0.1194 -0.046101 …

The model for ICLR has 600 dimensions, so there are 600 numbers for each word embedding.

Exactly how word2vec and fastText create these word vectors is outside the scope of this blog post. Essentially, you start by training a model from the corpus, which you feed to the tool. The corpus and the training algorithms define relationships between words or character combinations. The corpus used to train the model therefore determines how the words’ relationships are actually represented in the model. If you train with a French corpus, you will get good results when calculating document embeddings for French text; but you would get bad results if you presented English text. In the same vein, if you train a model using legal documents, you will get a model that is more sensitive to legal meanings and definitions.

So, if we were to use Areopagitica as our corpus, we would have a model trained to recognise only uses of the words as used in Areopagitica. We could calculate sentence embeddings (which we could treat like document embeddings) for every sentence in Areopagitica to determine which sentences were most similar. However, the results would likely be quite poor. The reason for this is that the corpus is too small. Ideally, you need a lot of words, the more the better. For ICLR, we used all of the judgment transcripts and Case Reports contained in ICLR.online as our corpus. This represents around 200,000 documents; about 600,000,000 words.

But what are document embeddings, and how do you create them?

Document embeddings

Once you have word embeddings, you will want to create document embeddings. To create a document embedding, take all the word embeddings of the document and use cosine similarity to combine the dimensions into a single representation with the same number of dimensions.

Cosine similarity combines two vectors by calculating the cosine of the angle between them and multiplying by the length of the side not participating in the cosine calculation.

Cosine similarity is a nice way to explain it visually, but there is an algebraic isomorphism using dot products of normalised vectors.

This algorithm combines word embeddings into document embeddings:

  1. Create an empty embedding (where all values are 0) called A. This is our aggregate.
  2. For each word embedding, w:
    1. normalise w;
    2. for each value in A, add the corresponding value in w (add the first number in A to the first number in w, the second number in A to the second number in w, and so on).
  3. Divide each value in A by the number of words used to calculate it.

To normalise a word embedding, take the square root of the sum of each value’s square (let’s call it n); then divide each value in the embedding by n.

This is the implementation used by fastText. However, we have tweaked the algorithm to make use of tf–idf, which weights words that are considered important. A word is considered important if its frequency within the corpus is low compared to its frequency within a document. Since words like ‘a’ occur throughout the corpus, they are weighted low, whereas a word like ‘theft’ will be weighted higher because it occurs only in a subset of documents within the corpus.
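
Here is a rough sketch in Scala of the averaging algorithm described above (without the tf–idf weighting); embeddings are represented simply as arrays of doubles, one value per dimension:

// Normalise a word embedding: divide each value by the square root of the sum of squares.
def normalise(v: Array[Double]): Array[Double] = {
  val n = math.sqrt(v.map(x => x * x).sum)
  if (n == 0.0) v else v.map(_ / n)
}

def documentEmbedding(wordEmbeddings: Seq[Array[Double]]): Array[Double] = {
  val dimensions = wordEmbeddings.head.length
  val aggregate = Array.fill(dimensions)(0.0)                // step 1: the empty embedding A
  wordEmbeddings.foreach { w =>
    val norm = normalise(w)                                  // step 2.1: normalise w
    for (i <- 0 until dimensions) aggregate(i) += norm(i)    // step 2.2: add w's values to A
  }
  aggregate.map(_ / wordEmbeddings.length)                   // step 3: divide by the word count
}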

Calculating the distance between document embeddings

The final puzzle piece is to quantify how similar documents are based on their document embeddings. The distance between two documents is used as an indication of how similar they are. Documents that are close together are more similar than those that are further away.

To calculate the distance between two document embeddings, cosine similarity can be used again. This time, the vectors of the document embeddings are collapsed to a single value between 0 and 1: identical documents have the value 1, while completely orthogonal documents have the value 0.
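
A sketch of that calculation: cosine similarity is the dot product of the two vectors divided by the product of their lengths.

def cosineSimilarity(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "embeddings must have the same number of dimensions")
  val dot   = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  dot / (normA * normB)
}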

We use a library called Faiss, created by Facebook Research, to store and query document embeddings. This is very fast, and you can query multiple input embeddings simultaneously.

And finally…

This post describes some of the major components used in Case Genie and delves a little into the concepts and some of the algorithms; but there is a lot more that I don’t have time to cover: specifically, how do we prepare the text of the corpus for training the model? I will cover that in another post.

Understanding an “impossible” error

As discussed in the previous post on Sharing Failures, seeing how other people have dealt with bugs and errors can often help you avoid your own or give you ways to track down the source of a problem when one does make its appearance. So in that spirit, here is the story of a baffling error we fixed recently.

The error came from a content delivery platform we have been working on for a publisher client. At the point of a release, and for several hours afterwards, we were seeing some errors, and there were a few reasons why this was very confusing.

The site is built using Scala / Play and uses Akka HTTP to make API calls between services. The error we were seeing was one that generally means that requests are coming in to a frontend service faster than the backend can service them:

BufferOverflowException: Exceeded configured max-open-requests value of [256]. This means that the request queue of this pool (........) has completely filled up because the pool currently does not process requests fast enough to handle the incoming request load. Please retry the request later. See https://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.]]

So apparently the pool of requests was filling up and causing a problem. But the first thing that was strange was that this was persisting for several hours after the release. At the point of a release it’s understandable that this error could occur with various services being started and stopped, causing requests to back up. After that the system was not under particularly high load, so why was this not just a transient issue?

The next thing that was strange was that we were only seeing this when users were accessing very particular content. We were only seeing it for access to content in reference works. These are what publishers confusingly call “databases” and cover things like encyclopedias, directories or dictionaries. But it wasn’t all databases, only certain ones and different ones at different times. On one occasion we would see a stream of errors for Encyclopedia A and then the next time we hit this error it would be Dictionary B generating the problems instead. If the cause was a pool of requests filling up, why would it affect particular pieces of content and not others, when they all use the same APIs?

Another thing that was puzzling – not every access to that database would generate an error. We’d either get an error or the content would be rendered fine, both very quickly. The error we were seeing suggested that things were running slowly somewhere, but the site seemed to be snappy, just producing intermittent errors for some content some of the time.

We spent lots of time reading Akka HTTP documentation trying to figure out how we could be seeing these problems, but it didn’t seem to make any sense. I had the feeling that I was missing something because the error seemed to be “impossible”. I even commented to a colleague that it felt like once we worked out what was going on I would talk about it at one of our dev forums. That prediction turned out to be true. Looking at Akka HTTP documentation would not help because the error message itself was in some sense a misdirection.

The lightbulb moment came when I spotted this code in our frontend code:

private lazy val databaseNameCache: LoadingCache[String, Future[DatabaseIdAndName]] = 
    CacheBuilder.newBuilder().refreshAfterWrite(4, TimeUnit.HOURS).....

We are using Guava’s LoadingCache to cache the mapping between the id of a database and its name since this almost never changes. (Sidenote: Guava’s cache support is great, also check out the Caffeine library inspired by it). The problem here is that we are not storing a DatabaseIdAndName object in the cache, but a Future. So we are in some sense putting the operation to fetch the database name into the cache. If that fails with an Exception, then every time we look in the cache for it we will replay the exception. Suddenly all the pieces fell into place. A transient error looking up a database name at release time was being put in a cache on one frontend server and replayed for hours. The whole akka pool thing was more or less irrelevant.

In the short term we fixed the problem by waiting for the concrete data to be returned to store that in the cache rather than a Future object. In that scenario, a failure to fetch the value would just yield an error and nothing would be cached for future look ups. However, much of the code using this cache is asynchronous, so it’s cleaner and probably better from a performance perspective if you can continue to use Future where possible. So the longer term solution was to revert to putting Future objects in the cache but carefully adding code to invalidate any cache entries that resolve to an exception.
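
As a rough sketch of that longer term approach (the DatabaseService trait and names here are hypothetical stand-ins, not the real code), the cache still holds Future values, but any lookup that fails evicts its own entry so the error is not replayed:

import java.util.concurrent.TimeUnit
import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}
import scala.concurrent.{ExecutionContext, Future}
import scala.util.Failure

case class DatabaseIdAndName(id: String, name: String)
trait DatabaseService { def fetchIdAndName(id: String): Future[DatabaseIdAndName] }

class DatabaseNameCache(service: DatabaseService)(implicit ec: ExecutionContext) {
  private lazy val cache: LoadingCache[String, Future[DatabaseIdAndName]] =
    CacheBuilder.newBuilder()
      .refreshAfterWrite(4, TimeUnit.HOURS)
      .build(new CacheLoader[String, Future[DatabaseIdAndName]] {
        override def load(id: String): Future[DatabaseIdAndName] = lookup(id)
      })

  private def lookup(id: String): Future[DatabaseIdAndName] = {
    val result = service.fetchIdAndName(id)
    // Evict failed lookups so a transient error is not replayed for hours.
    // (A failure that completes before the entry is stored still needs extra care.)
    result.onComplete {
      case Failure(_) => cache.invalidate(id)
      case _          => ()
    }
    result
  }

  def idAndName(id: String): Future[DatabaseIdAndName] = cache.get(id)
}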

I think the lesson here is – if an error doesn’t make sense then maybe some technical sleight-of-hand is going on and the error you are seeing is not the real problem. Maybe it’s all an illusion…

Women in Tech Festival Global 2021

If you have worked in the tech industry for some time, you are likely to have noticed the issue with diversity. Information Technology has long been thought of as a male domain, and we can now see the consequences of such thinking at a global level.

67 Bricks strives to be a diverse and inclusive workplace, and we continuously improve our D&I awareness and practices. That is why, for the second year in a row, we attended the Women in Tech Festival, which aims to champion diversity and empower companies and individuals to be allies for underrepresented groups. I gave a presentation titled “It’s good to give back” at the event, which I immensely enjoyed because, as a woman in tech myself, the topic of diversity is very close to my heart and I take great interest in it.

This blog post gives a summary of some sessions I attended virtually.

Opening Note

The event started with the opening note from the Belonging, Inclusion and Diversity Lead of Investec, Zandi Nkhata. She spoke about reasons why women leave the tech industry: 

  • the lack of female role models;
  • the experience of microaggressions – that is, things people say to you that subtly remind you that you do not belong;
  • the fact that your experience at a company might vary depending on whether or not you have an inclusive leader.

She also explained the difference between diversity and inclusion in a way I think is excellent: diversity is inviting someone to a party, and inclusion is asking them to dance. She also highlighted that only 20% of the workforce in the industry are women.

What can companies do to make their places diverse and inclusive? As an example, Investec’s vision is to make it a place where it is easy for people to be themselves, and to achieve that they set up different networks for people to speak up and listen to their feedback, provide learning and training opportunities about bullying, harassment and discrimination and have an allies programme.

Zandi also mentioned that it’s good to set KPIs with regards to diversity and inclusion, but these are not quotas; you have to be fair in how you achieve the targets.

Glass Ceiling or Sticky Floor

This panel discussion was about career progression – either knowing you’re probably the best candidate for a promotion yet not getting this promotion, or being capable enough but being obstructed by impostor syndrome, not having a career plan or a mentor.

The main point of the discussion was that a person finding themselves not progressing needs to ask: “what is limiting my growth and what is in my control?”. You need to create a career plan and ensure you are in control of it. The importance of networking for women was highlighted, and events like Women in Tech are a great opportunity to do that.

Another piece of advice was to focus on progress rather than perfection, and to learn to not be scared of asking questions even if you might think they are stupid (because they are not!).

Employers also have a duty to help with career progression. It is important to create career paths, understand them and enable employees to understand them as well, making it clear what is needed to get from A to B. It’s also vital to identify the strengths of each individual and know the exact purpose of each person in a team.

The panel also spoke about those who are in search of a new job and what question they might want to ask potential employers to decide about the suitability of a company for them; the suggestions were to look at the leadership gender balance and whether the company is doing any work regarding diversity and inclusion, among others.

Companies should not be scared to bring in people outside of the tech industry, reskill them and tap into their wealth of experience and transferable skills because the mixture of these experiences, strengths and insights can enable the team to grow.

How Old Are You?

This was about progressing in your career when you’re older. A lot of the focus here was on menopause awareness. This topic is still taboo, so safe spaces need to be created to make this conversation more visible, allowing people to speak about it without embarrassment. A lot of people still don’t know much about it even though their female relatives or friends might be experiencing menopause.

The speaker suggested that companies start with things like short talks about it in staff meetings. Some employers hold regular menopause cafes; others hold sessions on what to expect during this challenging time. Emphasis was placed on educating men (especially line managers) to feel comfortable discussing menopause, and on strategies for coping with it in all-male environments, which mainly come down to pushing for diversity and inclusion, having company policies around menopause, and working together.

You Do Belong Here

This session was focused on combating impostor syndrome. This is typically associated with women (men do experience it too though) and the panellists shared useful tips that help them to overcome it:

  • Try to understand if it’s impostor syndrome or the culture that doesn’t let you grow. Some level of self-doubt is experienced by everyone.
  • Having a conversation about it helps combat it. It is the manager’s responsibility to create space where people can discuss it.
  • Some people use journaling. For example, you made a mistake, and 5 days later it’s still eating at you, and you still think about what you could have done. So to avoid that, by writing down what happened and what you could do next time, you get it out of your system.
  • Keep a list of your successes to read from time to time.
  • Refresh your CV and bio regularly as it allows you to focus on your achievements.
  • Educate yourself in neuroscience; humans are programmed to think negatively, and understanding this enables you to interpret your behaviour and thoughts.
  • Instead of changing yourself and trying to adopt a new personality type in certain situations to suit someone else, decide for yourself how you want to come across. However, beware of going too far: if you’re not genuine, it’s not a sensible place to be. The best thing is to be your authentic self.

To sum up, thanks to this year and last year’s events I spoke to several inspiring females and got a bigger picture of what issues exist for women and LGBTQ+ communities in the tech industry, and what we can do to deal with them. I definitely learnt a lot from the Women in Tech Festival 2021. It was also great to realise that we, 67 Bricks, are doing all the right things to be as diverse and inclusive a workplace as we can. I look forward to sharing more of my learnings with my colleagues.

The First 67 Bricks Architectural Kata

On the 13th of October, 67 Bricks held its first Architectural Kata as an opportunity for developers to practice and experience architectural design. Ted Neward, the original creator of architectural katas, puts forward the argument for them quite succinctly:

  “So how are we supposed to get great architects, if they only get the chance to architect fewer than a half-dozen times in their career?”

Ted Neward

The kata exercise was quite simple to prepare for and run. Beforehand I picked a number of interesting katas from Neal Ford’s list, suitably anglicised them and prepared handouts for both online and in-person teams. Then on the day, it was a case of assigning everyone into teams and explaining the steps we’d be going through that day:

  1. Gather into teams and read through the assigned case study
  2. Discuss for the next 50 minutes and come up with an architecture, making sure to document any assumptions. Teams could ask me questions about the case study and I’d happily provide extra detail
  3. Each team would then take turns presenting their architecture to the other teams and fielding questions from them
  4. Finally a voting phase where everyone else gets the chance to show thumbs up/down/sideways to indicate how well they thought the presenting team did

Drivers

As a technical consultancy, we are often involved in architectural design of systems we create or take part in creating. Each client is individual; they seek different opportunities, are subject to different constraints and have different technical strategies. It’s critical for 67 Bricks’ success to have skilled architects able to influence and design suitable systems. 

But how can we develop these skills? As Ted Neward identified, most architects only get half a dozen tries at it over their career. While books, online courses and talks help, knowledge needs to be applied and feedback loops closed to truly improve. We’d never tried an event like a kata before, and I was interested to see what we could learn from it.

Sharing ideas and learning from one another effectively can be a challenge for any organization and made all the harder thanks to COVID-19 prompting a rapid move to remote working. The exercise looked to provide a great opportunity for participants to meet and work with others they may not otherwise have the chance to.

Kata Afternoon

Initially I hoped to get some of the more experienced technical leads to kick off the afternoon by talking about how they architect systems. This proved difficult: those I spoke to either thought they didn’t know that much or didn’t feel they could speak well to an audience about how to do architecture. As such, these initial talks didn’t happen, which raised a lot of questions around how to arm participants with enough knowhow to feel comfortable tackling the task. I fear this may have led to some teams struggling too much to learn and make effective progress.

Splitting up teams took some thought: I wanted to be sure each team had a mix of experience, and tried to aim for teams of individuals who hadn’t worked together before. I’m glad I spent the time to do this; the kata would not have had the same impact if everyone was in their usual team (one of the kata rules is to try and break up regular teams).

Running the actual kata was reasonably straightforward, if a little awkward thanks to having to manage both online and in-person groups. Some online teams got stuck waiting for me to join their Zoom room to answer their questions, while it was easy to see when the in-person group needed a question answered or a nudge in the right direction.

Although I tried to prepare some broader scope for each case study before the event in anticipation of the questions that would be asked, I quickly found myself having to improvise. I actually found this surprisingly fun. Some teams may have made things harder for themselves by asking too many detailed questions and trying to cover every single detail in their architecture.

The presentation phase had mixed results. Some teams did a strong job: they presented their ideas well, answered questions clearly and came up with really suitable architectures. Other teams struggled a bit more, both with coming up with a suitable architecture and with communicating it to everyone else.

Feedback and Lessons Learned

Feedback was overwhelmingly positive, lots of participants really enjoyed the experience. They liked working with different people and felt they learned a lot from one another about how they tackle this type of task.

I think aiming for mixed team compositions was good and allowed individuals to have interactions with colleagues they wouldn’t normally work with. I’ve had feedback a number of times that with the move to remote, developers felt a lot more insular in their teams and I hoped this gave a chance to break out of those groups.

Doing both in-person and remote teams did make running the kata a little challenging. I felt that I couldn’t effectively keep an eye on all the teams, making it tough to know when to nudge a team that might be getting stuck or to provide an answer to a pressing question.

Next time I would look at including publishing consultants to act as clients. I perhaps enjoyed being a fickle client a little too much. Having a more dedicated individual to represent the client creates opportunities for encouraging more of a discussion, simulating something closer to the real world when we work with our clients. It would also make it easier to manage the event, and our publishing consultants could gain some insight into the architecture process.

Conclusions

Given the positive response, I would happily organise another Kata in the future. If you are thinking about running one for your own organisation, I would highly recommend it. It’s a great chance to meet and learn from other devs. I certainly learned a lot simply observing the various teams taking different approaches to their case studies and the variety of solutions they came up with.

What have I been listening to?

A while ago, Tim suggested we could have a #now-listening channel in our company Slack, in which people could post details of what they were listening to. It occurred to me that it might be a fun challenge to try to figure out from what I’d posted on there who my favourite artist was, and which was my most-listened-to album. So I rolled up my sleeves and got to work. This is an account of what I did and my various thought processes as I went along…

Challenge: figure out how to get my posts from our #now-listening channel and do some statistics to them.

Session 1: After school run, but before work…

Start – there’s an API. https://api.slack.com

Read the documentation: https://api.slack.com/methods/search.messages looks useful – how do I call it?

I NEED A TOKEN! Aha – https://api.slack.com/apps – a “generate token” button…

Access token: xoxe.xoxp-blah-blah-blah. SUCCESS!

First obvious question: has someone done this already? Google knows everything: https://github.com/slack-scala-client/slack-scala-client

Create a new project: sbt new scala/scala-seed.g8 – add a dependency on slack-scala-client, ready to rock! I’m in such a hurry I can’t even be bothered to set up a package; I’ll just hijack the Hello app that came in the skeletal project…

From docs:

val token = "MY TOP SECRET TOKEN"
implicit val system = ActorSystem("slack")
try {
  val client = BlockingSlackApiClient(token)
  client.searchMessages(WHAT TO PUT HERE?)
} finally {
  Await.result(system.terminate(), Duration.Inf)
}

… maybe something like…?:

val ret = client.searchMessages("* in:#67bricks-now-listening from:@Daniel", sort = Some("timestamp"), sortDir = Some("asc"), count = Some(5))

RUN IT

Fails. Because HelloSpec fails (I mentioned I just hijacked the OOTB Hello app). Fix with the delete key.

RUN IT

[WARN] [11/26/2021 08:40:37.418] [slack-akka.actor.default-dispatcher-2] [akka.actor.ActorSystemImpl(slack)] Illegal header: Illegal 'expires' header: Illegal weekday in date 1997-07-26T05:00:00: is 'Mon' but should be 'Sat'
Exception in thread "main" slack.api.ApiError: missing_scope
at slack.api.SlackApiClient$.$anonfun$makeApiRequest$3(SlackApiClient.scala:92)

:-​(

Google: “missing_scope” and interpret results

The token used is not granted the specific scope permissions required to complete this request.

:-​( :-​(

Maybe I have to create an app and add it to the workspace? I’ll try that.

Created, figured out how to add the user token scope “search:read” – and I got a new token!

Token= xoxp-blahblahblah

Rerun: I got a response!

{
  "ok":true,
  "query":"* in:#67bricks-now-listening from:@Daniel",
  "messages": {
    "total":0,
    "pagination": {
      "total_count":0,
      "page":1,
      "per_page":5,
      "page_count":0,
      "first":1,
      "last":0
    },
    "paging": {
      "count":5,
      "total":0,
      "page":1,
      "pages":0
    },
    "matches": []
  }
}

:-​(

Let’s just search in the channel without specifying a name…?

val ret = client.searchMessages("in:#67bricks-now-listening", sort = Some("timestamp"), sortDir = Some("asc"), count = Some(5))

Gives:

{
  "ok": true,
  "query": "in:#67bricks-now-listening",
  "messages": {
    "total": 7113,
    "pagination": {
      "total_count": 7113,
      "page": 1,
      "per_page": 5,
      "page_count": 1423,
      "first": 1,
      "last": 5
    },
    "paging": {
      "count": 5,
      "total": 7113,
      "page": 1,
      "pages": 1423
    },
    "matches": [
      {
        "username": "daniel.rendall",
        "other": "field_here"
      }
    ]
  }
}

Aha! My username is daniel.rendall, let’s try that:

val ret = client.searchMessages("in:#67bricks-now-listening from:@daniel.rendall", sort = Some("timestamp"), sortDir = Some("asc"), count = Some(5))

Gives:

{
  "ok": true,
  "query": "in:#67bricks-now-listening from:@daniel.rendall",
  "messages": {
    "total": 3213,
    "pagination": { ... etc }
  }
}

Success! Also – 3213 messages – sounds plausible. This is looking good… but sort direction seems wrong…? Try switching to “desc” => same result.

(Time spent so far: about half an hour – better stop or will miss the morning call!)

Session 2: Re-run – still works (hooray!)

Copy and paste output and save as response.json, fix up with jq so I can examine it:

cat response.json | jq '.' > response_tidied.json

And now:

"pagination": {
  "total_count": 3234,
  "page": 1,
  "per_page": 5,
  ... etc

Number has gone up – I’m still listening to things!

So, I could parse the responses to work out what the next page should be, or I could just loop – with pages of size 100 (if the API will return them) there should be 33. So we will loop and save these as 1.json, 2.json etc. First rule of scraping – aim to do it just once and save the result locally.

Horrible quick and dirty code alert!

val outDir = new File("/home/daniel/Scratch/slack/output")
def main(args: Array[String]): Unit = {
  outDir.mkdirs()
  implicit val system = ActorSystem("slack")
  try {
    val client = BlockingSlackApiClient(token)
    (1 to 33).foreach { pageNum =>
      try {
        val ret = client.searchMessages("in:#67bricks-now-listening from:@daniel.rendall",
          sort = Some("timestamp"),
          sortDir = Some("desc"),
          count = Some(100),
          page = Some(pageNum))
        Files.write(new File(outDir, "" + pageNum + ".json").toPath, ret.toString().getBytes(StandardCharsets.UTF_8), StandardOpenOption.CREATE)
        println(s"Got page $pageNum")
      } catch {
        case NonFatal(e) =>
          println(s"Couldn't get page $pageNum - ${e.getMessage}")
      }
      Thread.sleep(1000)
    }
  } finally {
    Await.result(system.terminate(), Duration.Inf)
  }
}

… prints a reassuring list “Got page 1” => “Got page 33” and no (reported) errors!

Second rule of scraping – having done it and got the data, zip it up and put it somewhere just in case you destroy it…

Tidy it all (non-essential, but makes it easier to look at):

mkdir tidied
ls output | while read JSON ; do cat output/$JSON | jq '.' > tidied/$JSON ; done

On scanning the data it looks plausible; I can’t see an obvious “date” field, but there’s a cryptic “ts” field (sample value: “1638290823.124400”) which is maybe a timestamp? A problem for another day…
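(Incidentally, that “ts” value looks like Unix epoch seconds with a uniqueness suffix after the dot – an assumption on my part, not checked against the Slack docs, but a quick sketch for decoding the sample value would be:)

import java.time.Instant

// Assumption: Slack's "ts" is seconds-since-epoch, with the part after the dot just a uniqueness suffix
val sampleTs = "1638290823.124400"
val instant = Instant.ofEpochSecond(sampleTs.takeWhile(_ != '.').toLong)
println(instant) // 2021-11-30T16:47:03Z - at least a plausible date for this channel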

(Time spent this session: about 20 minutes)

Session 3: I can haz stats?

Need to load it in. A new main method in a new object…

import java.io.{File, FileInputStream}
import play.api.libs.json.Json

val outDir = new File("/home/daniel/Scratch/slack/output")
def main(args: Array[String]): Unit = {
  // parse each saved page of search results back into a JsValue
  val jsObjects = outDir.listFiles().map { f =>
    Json.parse(new FileInputStream(f))
  }
  println(jsObjects.head)
}

Prints something sensible. Now I need to get it into a useful form: define the simplest class that could possibly work.

import java.util.UUID
import play.api.libs.functional.syntax._
import play.api.libs.json._

case class Message(iid: UUID, ts: String, text: String, permalink: String)
object Message {
  implicit val messageReads: Reads[Message] = (
    (__ \ "iid").read[UUID] and
    (__ \ "ts").read[String] and
    (__ \ "text").read[String] and
    (__ \ "permalink").read[String]
  )(Message.apply _)
}

Not sure if I need the ID, but I like IDs. Looks like a UUID.

… oh, and also some classes to wrap the whole result with a minimum of faff (their Reads omitted for brevity – a possible sketch follows):

case class SearchResult(messages: Messages)
case class Messages(total: Int, matches: Seq[Message])
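(For completeness, the omitted Reads could plausibly just be the Play JSON macros – a minimal sketch, assuming the field names line up with the JSON:)

import play.api.libs.json.{Json, Reads}

object Messages {
  // relies on the implicit Reads[Message] defined above
  implicit val messagesReads: Reads[Messages] = Json.reads[Messages]
}
object SearchResult {
  implicit val searchResultReads: Reads[SearchResult] = Json.reads[SearchResult]
}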

Go for broke:

val jsObjects: Array[JsResult[Seq[Message]]] = outDir.listFiles().map { f =>
  Json.parse(new FileInputStream(f)).validate[SearchResult].map(_.messages.matches)
}

Unpleasant type signature alert – Array[JsResult[Seq[Message]]]. Let’s assume nothing will go wrong and just use “.get” and “.flatMap”:

val messages: Seq[Message] = outDir.listFiles().flatMap { f =>
  Json.parse(new FileInputStream(f)).validate[SearchResult].map(_.messages.matches).get
}.toList

That gives me 3234 Message objects, which is reassuring. They include top-level messages and responses to threads. As far as I can see, the thread responses include a ?thread_ts parameter in their permalink, so I filter them out – which leaves 1792 remaining.

val filtered = messages.filterNot(_.permalink.contains("?thread_ts"))
filtered.take(10).map(_.text).foreach(println)

…and voila:

The things I’m looking for will all have the format “Artist – Album”. Regex time!

val ArtistAlbumRegex = "(.*?) - (.*)".r("artist", "album")

Wait, what…? @deprecated("use inline group names like (?<year>X) instead", "2.13.7")

Didn’t know that had changed. Ho hum…

val ArtistAlbumRegex: Regex = "(?<artist>.*?) - (?<album>.*)".r

val artistsAndAlbums = messages.filterNot(_.permalink.contains("?thread_ts")).map(_.text).collect {
  case ArtistAlbumRegex(artist, album) => ArtistAndAlbum(artist, album)
}
artistsAndAlbums.take(10).foreach(println)

Even more promising:

Getting there! Now, there are bound to be loads of duplicates. So I guess the most obvious thing to do is count them. Let’s see if I can find the albums I’ve listened to the most, and their counts. I’m going to define a canonical key for grouping an ArtistAndAlbum just in case I’ve not been completely consistent in capitalisation.

case class ArtistAndAlbum(artist: String, album: String) {
  val groupingKey: (String, String) = (artist.toLowerCase, album.toLowerCase)
}

Then we should be able to count by:

val mostCommonAlbums = artistsAndAlbums.groupBy(_.groupingKey)
.view.map { case (_, seq) => seq.head -> seq.length }.toList.sortBy(_._2)
mostCommonAlbums.take(10).foreach(println)

(The Bob Lazar Story – Vanquisher,1)
(The Pretenders – Pretenders (152),1)
(Saxon – Lionheart,1)
(Leprous – Tall Poppy Syndrome,1)
(Tom Petty and the Heartbreakers – Damn the Torpedoes (231),1)
(2Pac – All Eyez on Me (436),1)
(Bon Iver – For Emma, Forever Ago (461),1)
(James – Stutter,1)
(Elton John – Honky Château (251),1)
(Ice Cube – AmeriKKKa’s Most Wanted,1)

Oops – that’s sorted the wrong way – and the numbers in brackets need to be removed. Not sure there’s a nicer way to invert the ordering than explicitly passing the Ordering that I want to use (though there’s an alternative sketched a little further down)…

val mostCommonAlbums = artistsAndAlbums.groupBy(_.groupingKey)
.view.map { case (_, seq) => seq.head -> seq.length }.toList.sortBy(_._2)(Ordering[Int].reverse)
mostCommonAlbums.take(10).foreach(println)

(Benny Andersson – Piano,8)
(Meilyr Jones – 2013),7)
(Brian Eno – Here Come The Warm Jets),4)
(Richard & Linda Thompson – I Want To See The Bright Lights Tonight),4)
(Admirals Hard – Upon a Painted Ocean,4)
(Neuronspoiler – Emergence,4)
(Global Communication – Pentamerous Metamorphosis),4)
(Steely Dan – Countdown To Ecstasy,4)
(Pole – 2,4)
(Faith No More – Angel Dust,4)

That looks plausible, actually. I like Piano. I’m guessing there are loads of other “4” albums…
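(Coming back to the ordering grumble from above: one alternative to passing an explicit reverse Ordering is simply to negate the sort key – a sketch, doing the same grouping as before:)

val byCountDescending = artistsAndAlbums.groupBy(_.groupingKey)
.view.map { case (_, seq) => seq.head -> seq.length }.toList
.sortBy { case (_, count) => -count } // negated key => descending
byCountDescending.take(10).foreach(println)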

But who is my most listened-to artist? I have a shrewd idea I know who it will turn out to be – my prediction is that it will be a four-word band name with the initials HMHB. I can use the fact that I defined my grouping key to start with the artist:

val mostCommonArtists = artistsAndAlbums.groupBy(_.groupingKey._1)
.view.map { case (_, seq) => seq.head.artist -> seq.length }.toList.sortBy(_._2)(Ordering[Int].reverse)
mostCommonArtists.take(10).foreach(println)

(Half Man Half Biscuit,28)
(Fairport Convention,19)
(R.E.M.,18)
(Steeleye Span,15)
(Julian Cope,15)
(Various,13)
(James,12)
(Cardiacs,11)
(Faith No More,9)
(Brian Eno,9)

Bingo! The mighty Half Man Half Biscuit in there at #1. One flaw is immediately apparent – this naive approach doesn’t distinguish between “listening to lots of albums by an artist as part of business-as-usual” and “listening to an artist’s entire back catalogue in one go” (which accounts for the high showings of Fairport Convention, R.E.M. and Steeleye Span). Worry about that some other time.

How many albums have I listened to?

val distinctAlbums = artistsAndAlbums.distinctBy(_.groupingKey)
println("Total albums = " + artistsAndAlbums.length)
println("Distinct albums = " + distinctAlbums.length)

Total albums = 1371
Distinct albums = 1255

… but that will be wrong, because I’ve listened to some albums in the context of e.g. working through the Rolling Stone or NME top 500 albums lists, and in those cases I appended the list position to the album name, e.g. “Battles – Mirrored (NME 436)”. So chop that off the end of the album name:

val artistsAndAlbums = messages.filterNot(_.permalink.contains("?thread_ts")).map(_.text).collect {
  case ArtistAlbumRegex(artist, album) =>
    ArtistAndAlbum(artist, album.replaceAll("\\([^)]+\\)$", "").trim)
}

Distinct albums = 1191

This final session took about 50 minutes, so if my maths is correct, the total time spent on this was a little under 2 hours. TBH I’m slightly dubious about the results; after listing all of the albums I’ve listened to in alphabetical order I’m sure there are some missing (e.g. I tackled the entire Prince back catalogue, but there were only a handful of Prince albums in there, ditto for David Bowie). I suspect a bit more work and exploration of the Slack API might reveal what I’m missing. Or maybe my method for distinguishing main messages from responses is wrong (just had a thought; maybe a main message that begins a thread also gets the ?thread_ts parameter).  But it’s close enough for now, and appears to confirm my suspicion that Half Man Half Biscuit are my most listened to artist.
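(On that last thought – a hypothetical check, assuming a thread-starting message’s permalink carries a thread_ts equal to its own ts, would be to count how many of the messages I discarded reference themselves:)

// Hypothetical: count messages whose permalink's thread_ts is their own ts -
// if this is large, I've probably been throwing away thread parents as well as replies
val possibleThreadParents = messages.count(m => m.permalink.contains("thread_ts=" + m.ts))
println(s"Messages referencing their own ts as thread_ts: $possibleThreadParents")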

And now, what with it being the season of goodwill and all that, it’s time for my special Christmas Playlist