Embracing Impermanence (or how to check my sbt build works)

Stable trading relationships with nearby countries. Basic human rights. A planet capable of sustaining life. What do these three things have in common?

The answer is that they are all impermanent. One moment we have them, the next moment – whoosh! – they’re gone.

Today I decided I would embrace our new age of impermanence insofar as it pertains to my home directory. Specifically, I wondered whether I could configure a Linux installation so that my home directory was mounted in a ramdisk, created afresh each time I rebooted the server.

Why on earth would I want to do something like that?

The answer is that I have a Scala project, built using sbt (the Scala Build Tool), and I thought I’d clear some of the accumulated cruft out of the build.sbt file, starting with the configured resolvers. These are basically the repositories which will be searched for the project’s dependencies – there were a few special case ones (e.g. one for JGit, another for MarkLogic) and I strongly suspected that the dependencies in question would now be found in the standard Maven repository. So they could probably be removed, but how to check, since all of the dependencies would now exist in caches on my local machine?

A simple solution would have been to delete the caches, but that involves putting some effort into finding them, plus I have developed a paranoid streak about triggering unnecessary file writes on my SSD. So I had a cunning plan – build a VirtualBox VM and arrange for the home directory on it to be a ramdisk, thus I could check the code out to it and verify that the code will build from such a checkout, and this would then be a useful resource for conducting similar experiments in the future.

Obviously this is not quite a trivial undertaking, because I need some bits of the home directory (specifically the .ssh directory) to persist so I can create the SSH keys needed to authenticate with GitHub (and our internal GitLab). Recreating those each time the machine booted would be a pain.

After a bit of fiddling, my home-grown solution went something like this:

  • Create a Virtualbox VM, give it 8G memory and a 4G disk (maybe a bit low if you think you’ll want docker images on it; I subsequently ended up creating a bigger disk and mounting it on /var/lib/docker)
  • Log into VM (my user is daniel), install useful things like Git, curl, zip, unzip etc.
  • Create SSH keys, upload to GitHub / GitLab / wherever
  • Install SDKMAN! to manage Java versions
  • Create /var/daniel and copy into it all of the directories and files in my home directory which I wanted to be persisted; these were basically .ssh for SSH keys, .sdkman for java installations, .bashrc which now contains the SDKMAN! init code, and .profile
  • Save the following script as /usr/local/bin/create_home_dir.sh – this wipes out /home/daniel, recreates it and mounts it as tmpfs (i.e. a ramdisk) and then symlinks into it the stuff I want to persist (everything in /var/daniel)
#!/bin/bash
DIR=/home/daniel

mount | grep $DIR && umount $DIR

[ -d $DIR ] && rm -rf $DIR

mkdir $DIR

mount -t tmpfs tmpfs $DIR

chown -R daniel:daniel $DIR

ls -A /var/daniel | while read FILE
do
  sudo -u daniel ln -s /var/daniel/$FILE $DIR/
done
  • Save the following as /etc/systemd/system/create-home-dir.service
[Unit]
description=Create home directory

[Service]
ExecStart=/usr/local/bin/create_home_dir.sh

[Install]
WantedBy=multi-user.target
  • Enable the service with systemctl enable create-home-dir
  • Reboot and hope

And it turns out that this worked; when the server came back I could ssh into it (i.e. the authorized_keys file was recognised in the symlinked .ssh directory) and I had a nice empty workspace; I could git clone the repo I wanted, then build it and watch all of the dependencies get downloaded successfully. I note with interest that having done this, .cache is 272M and .sbt is 142M in size. That seems to be quite a lot of downloading! But at least it’s all in memory, and will vanish when the VM is switched off…

Dev Forum – Parsing Data

Last Friday we had a dev forum on parsing data that came up as some devs had pressing question on Regex. Dan provided us with a rather nice and detailed overview of different ways to parse data. Often we encounter situations where an input or a data file needs to be parsed so our code can make some sensible use of it.

After the presentation, we looked at some code using the parboiled library with Scala. A simple example of checking if a sequence of various types of brackets has matching open and closing ones in the correct positions was given. For example the sequence ({[<<>>]}) would be considered valid, while the sequence ((({(>>]) would be invalid.

First we define the set of classes that describes the parsed structure:

object BracketParser {

  sealed trait Brackets

  case class RoundBrackets(content: Brackets)
     extends Brackets

  case class SquareBrackets(content: Brackets)
     extends Brackets

  case class AngleBrackets(content: Brackets)
     extends Brackets

  case class CurlyBrackets(content: Brackets)
     extends Brackets

  case object Empty extends Brackets

}

Next, we define the matching rules that parboiled uses:

package com.sixtysevenbricks.examples.parboiled

import com.sixtysevenbricks.examples.parboiled.BracketParser._
import org.parboiled.scala._

class BracketParser extends Parser {

  /**
   * The input should consist of a bracketed expression
   * followed by the special "end of input" marker
   */
  def input: Rule1[Brackets] = rule {
    bracketedExpression ~ EOI
  }

  /**
   * A bracketed expression can be roundBrackets,
   * or squareBrackets, or... or the special empty 
   * expression (which occurs in the middle). Note that
   * because "empty" will always match, it must be listed
   * last
   */
  def bracketedExpression: Rule1[Brackets] = rule {
    roundBrackets | squareBrackets | 
    angleBrackets | curlyBrackets | empty
  }

  /**
   * The empty rule matches an EMPTY expression
   * (which will always succeed) and pushes the Empty
   * case object onto the stack
   */
  def empty: Rule1[Brackets] = rule {
    EMPTY ~> (_ => Empty)
  }

  /**
   * The roundBrackets rule matches a bracketed 
   * expression surrounded by parentheses. If it
   * succeeds, it pushes a RoundBrackets object 
   * onto the stack, containing the content inside
   * the brackets
   */
  def roundBrackets: Rule1[Brackets] = rule {
    "(" ~ bracketedExpression ~ ")" ~~>
         (content => RoundBrackets(content))
  }

  // Remaining matchers
  def squareBrackets: Rule1[Brackets] = rule {
    "[" ~ bracketedExpression ~ "]"  ~~>
        (content => SquareBrackets(content))
  }

  def angleBrackets: Rule1[Brackets] = rule {
    "<" ~ bracketedExpression ~ ">" ~~>
        (content => AngleBrackets(content))
  }

  def curlyBrackets: Rule1[Brackets] = rule {
    "{" ~ bracketedExpression ~ "}" ~~>
        (content => CurlyBrackets(content))
  }


  /**
   * The main entrypoint for parsing.
   * @param expression
   * @return
   */
  def parseExpression(expression: String):
    ParsingResult[Brackets] = {
    ReportingParseRunner(input).run(expression)
  }

}

While this example requires a lot more code to be written than a regex, parsers are more powerful and adaptable. Parboiled seems to be an excellent library with a rather nice syntax for defining them.

To summarize, regexes are very useful, but so are parsers. Start with a regex (or better yet, a pre-existing library that specifically parses your data structure) and if it gets too complex to deal with, consider writing a custom parser.

In praise of functional programming

At 67 Bricks, we are big proponents of functional programming. We believe that projects which use it are easier to write, understand and maintain. Scala is one of the most common languages we make use of and we see more object oriented languages like C# are often written with a functional perspective.

This isn’t to say that functional languages are inherently better than any other paradigms. Like any language, it’s perfectly possible to use it poorly and produce something unmaintainable and unreadable.

Immutable First

At first glance immutable programming would be a challenging limitation. Not being able to update the value of a variable is restrictive but that very restriction is what makes functional code easier to understand. Personally I found this to be one of the hardest concepts to wrap my head around when I first started functional programming. How can I write code if I can’t change the value of variables?

In practice, reading code was made much easier as once a value was assigned, it never changed, no matter how long the function was or what was done with the value. No more trying to keep track in my head how a long method changes a specific variable and all the ways it may not thanks to various control flows.

For example, when we pass a list into a method, we know that the reference to the list won’t be able to change, but the values within the list could. We would hope that the function was named something sensible, perhaps with some documentation that makes it clear what it will do to our list but we can never be too sure. The only way to know exactly what is happening is to dive into that method and check for ourselves which then adds to the cognitive load of understanding the code. With a more functional programming language, we know that our list cannot be changed because it is an immutable data structure with an immutable reference to it.

High Level Constructs

Functional code can often be more readable than object oriented code thanks to the various higher level functions and constructs like pattern matching. Naturally readability of code is very dependent on the programmer; it’s easy to make unreadable code in any language, but in the right hands these higher level constructs make the intended logic easier to understand and change. For example, here are 2 examples of code using regular expressions in Scala and C#:

def getFriendlyTime(string time): String = {
  val timestampRegex = "([0-9]{2}):([0-9]{2}):([0-9]{2}).([0-9]{3})".r
  time match {
    case timestampRegex(hour, minutes, _, _) => s"It's $minutes minutes after $hour"
    case _ => "We don't know"
  }
}
public static String GetFriendlyTime(String time) {
  var timestampRegex = new Regex("([0-9]{2}):([0-9]{2}):([0-9]{2}).([0-9]{3})");
  var match = timestampRegex.Match(time);
  if (match == null) {
    return "We don't know";
  } else {
    var hours = match.Groups[2].Value;
    var minutes = match.Groups[1].Value;
    return $"It's {minutes} minutes after {hour}";
  }
}

I would argue that the pattern matching of Scala really helps to create clear, concise code. As time goes on features from functional languages like pattern matching keep appearing in less functional ones like C#. A more Java based example would be the streams library, inspired by approaches found in functional programming.

This isn’t always the case, there are some situations where say a simple for loop is easier to understand than a complex fold operation. Luckily, Scala is a hybrid language, providing access to object oriented and procedural styles of programming that can be used when the situation calls for it. This flexibility helps programmers pick the style that best suits the problem at hand to make understandable and maintainable codebases.

Pure Functions and Composition

Pure functions are easier to test than impure functions. If it’s simply a case of calling a method with some arguments and always getting back the same answer, tests are going to be reliable and repeatable. If a method is part of a class which maintains lots of complex internal state, then testing is going to be unavoidably more complex and fragile. Worse is when the class requires lots of dependencies to be passed into it, these may need to be mocked or stubbed which can then lead to even more fragile tests.

Scala and other functional languages encourage developers to write small pure functions and then compose them together in larger more complex functions. This helps to minimise the amount of internal state we make use of and makes automated testing that much easier. With side effect causing code pushed to the edges (such as database access, client code to call other services etc.), it’s much easier to change how they are implemented (say if we wanted to change the type of database or change how a service call is made) making evolution of the code easier.

For example, take some possible Play controller code:

def updateUsername(id: int, name: string) = action {
  val someUser = userRepo.get(id)
  someUser match {
    case Some(user) =>
      user.updateName(name) match {
        case Some(updatedUser) =>     
         userRepo.save(updatedUser)
         bus.emit(updatedUser.events)
         Ok()
        None => BadRequest()
    case None => NotFound()
}

def updateEmail(id: int, email: string) = action {
  val someUser = userRepo.get(id)
  someUser match {
    case Some(user) =>
      user.updateEmail(email) match {
        case Some(updatedUser) =>
          userRepo.save(updatedUser)
          bus.emit(updatedUser.events)
          Ok()
        case None => BadRequest()
    case None => NotFound()
}

Using composition, the common elements can be extracted and we can make a more dense method by removing duplication.

def updateUsername(id: int, name: string) = action {
  updateEntityAndSave(userRepo)(id)(_.updateName(name))
}

def updateEmail(id: int, email: string) = action {
  updateEntityAndSave(userRepo)(id)(_.updateEmail(email))
  }
}

private def updateEntityAndSave(repo: Repo[T])(id: int)(f: T => Option[T]): Result = {
  repo.get(id) match {
    case Some(entity) =>
      f(entity) match {
        case Some(updatedEntity) =>
         repo.save(updatedEntity)
         bus.emit(updatedEntity.events)
         Ok()
        None => BadRequest()
    case None => NotFound()
}

None Not Null

Dubbed the billion-dollar mistake by Tony Hoare, nulls can be a source of great nuisance for programmers. Null Pointer Exceptions are all too often encountered and confusion reigns over whether null is a valid value or not. In some situations a null value may never happen and need not be worried about while in others it’s a perfectly valid value that has to be considered. In many OOP languages, there is no way of knowing if an object we get back from a method can be null or not without either jumping into said method or relying on documentation that often drifts or does not exist.

Functional languages avoid these problems simply by not having null (often Unit is used to represent a method that doesn’t return anything). In situations where we may want it, Scala provides the helpful Option type. This makes it explicit to the reader that you call this method, you will get back Some value or None. Even C# is now introducing features like Nullable Reference Types to help reduce the possible harm of null.

Scala also goes a step further, providing an Either construct which can contain a success (Right) of failure (Left) value. These can be chained together to using a railway styles of programming. This chaining approach can lead us to have an easily readable description of what’s happening and push all our error handling to the end, rather than sprinkle it among the code.

Using Option can also improve readability when interacting with Java code. For example, a method returning null can be pumped through an Option and combined with a match statement to lead to a neater result.

Option(someNullReturningMethod()) match {
  case Some(result) =>
    // do something
  case None =>
    // method returned null. Bad method!
}

Conclusions

There are many reasons to prefer functional approaches to programming. Even in less functionally orientated languages I have found myself reaching for functional constructs to arrange my code. I think it’s something all software developers should try learning and applying in their code.

Naturally, not every developer enjoys FP, it is not a silver bullet that will overcome all our challenges. Even within 67 Bricks we have differing opinions on which parts of functional programming are useful and which are not (scala implicits can be quite divisive). It’s ultimately just another tool in our expansive toolkit that helps us to craft correct, readable, flexible and functioning software.

Understanding an “impossible” error

As discussed in the previous post on Sharing Failures, seeing how other people have dealt with bugs and errors can often help you avoid your own or give you ways to track down the source of a problem when one does make its appearance. So in that spirit, here is the story of a baffling error we fixed recently.

The error came from a content delivery platform we have been working on for a publisher client. At the point of a release and for several hours after we were seeing some errors, but there were a few reasons why this was very confusing.

The site is built using Scala / Play and uses Akka HTTP to make API calls between services. The error we were seeing was one that generally means that requests are coming in to a frontend service faster than the backend can service them:

BufferOverflowException: Exceeded configured max-open-requests value of [256]. This means that the request queue of this pool (........) has completely filled up because the pool currently does not process requests fast enough to handle the incoming request load. Please retry the request later. See https://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.]]

So apparently the pool of requests was filling up and causing a problem. But the first thing that was strange was that this was persisting for several hours after the release. At the point of a release it’s understandable that this error could occur with various services being started and stopped, causing requests to back up. After that the system was not under particularly high load, so why was this not just a transient issue?

The next thing that was strange was that we were only seeing this when users were accessing very particular content. We were only seeing it for access to content in reference works. These are what publishers confusingly call “databases” and cover things like encyclopedias, directories or dictionaries. But it wasn’t all databases, only certain ones and different ones at different times. On one occasion we would see a stream of errors for Encyclopedia A and then the next time we hit this error it would be Dictionary B generating the problems instead. If the cause was a pool of requests filling up, why would it affect particular pieces of content and not others, when they all use the same APIs?

Another thing that was puzzling – not every access to that database would generate an error. We’d either get an error or the content would be rendered fine, both very quickly. The error we were seeing suggested that things were running slowly somewhere, but the site seemed to be snappy, just producing intermittent errors for some content some of the time.

We spent lots of time reading Akka HTTP documentation trying to figure out how we could be seeing these problems, but it didn’t seem to make any sense. I had the feeling that I was missing something because the error seemed to be “impossible”. I even commented to a colleague that it felt like once we worked out what was going on I would talk about it at one of our dev forums. That prediction turned out to be true. Looking at Akka HTTP documentation would not help because the error message itself was in some sense a misdirection.

The lightbulb moment came when I spotted this code in our frontend code:

private lazy val databaseNameCache: LoadingCache[String, Future[DatabaseIdAndName]] = 
    CacheBuilder.newBuilder().refreshAfterWrite(4, TimeUnit.HOURS).....

We are using Guava’s LoadingCache to cache the mapping between the id of a database and its name since this almost never changes. (Sidenote: Guava’s cache support is great, also check out the Caffeine library inspired by it). The problem here is that we are not storing a DatabaseIdAndName object in the cache, but a Future. So we are in some sense putting the operation to fetch the database name into the cache. If that fails with an Exception, then every time we look in the cache for it we will replay the exception. Suddenly all the pieces fell into place. A transient error looking up a database name at release time was being put in a cache on one frontend server and replayed for hours. The whole akka pool thing was more or less irrelevant.

In the short term we fixed the problem by waiting for the concrete data to be returned to store that in the cache rather than a Future object. In that scenario, a failure to fetch the value would just yield an error and nothing would be cached for future look ups. However, much of the code using this cache is asynchronous, so it’s cleaner and probably better from a performance perspective if you can continue to use Future where possible. So the longer term solution was to revert to putting Future objects in the cache but carefully adding code to invalidate any cache entries that resolve to an exception.

I think the lesson here is – if an error doesn’t make sense then maybe some technical sleight-of-hand is going on and the error you are seeing is not the real problem. Maybe it’s all an illusion…

What have I been listening to?

A while ago, Tim suggested we could have a #now-listening channel in our company Slack, in which people could post details of what they were listening to. It occurred to me that it might be a fun challenge to try to figure out from what I’d posted on there who my favourite artist was, and which was my most-listened-to album. So I rolled up my sleeves and got to work. This is an account of what I did and my various thought processes as I went along…

Challenge: figure out how to get my posts from our #now-listening channel and do some statistics to them.

Session 1: After school run, but before work…

Start – there’s an API. https://api.slack.com

Read the documentation: https://api.slack.com/methods/search.messages looks useful – how do I call it?

I NEED A TOKEN! Aha – https://api.slack.com/apps – a “generate token” button…

Access token: xoxe.xoxp-blah-blah-blah. SUCCESS!

First obvious question: has someone done this already? Google knows everything: https://github.com/slack-scala-client/slack-scala-client

Create a new project: sbt new scala/scala-seed.g8 – add dependency on slack-scala-client, ready to rock! In such a hurry; I can’t even be bothered to set up a package, just hijack the Hello app that came in the skeletal project…

From docs:

val token = "MY TOP SECRET TOKEN"
implicit val system = ActorSystem("slack")
try {
  val client = BlockingSlackApiClient(token)
  client.searchMessages(WHAT TO PUT HERE?)
} finally {
  Await.result(system.terminate(), Duration.Inf)
}

… maybe something like…?:

val ret = client.searchMessages("* in:#67bricks-now-listening from:@Daniel", sort = Some("timestamp"), sortDir = Some("asc"), count = Some(5))

RUN IT

Fails. Because HelloSpec fails (I mentioned I just hijacked the OOTB Hello app). Fix with the delete key.

RUN IT

[WARN] [11/26/2021 08:40:37.418] [slack-akka.actor.default-dispatcher-2] [akka.actor.ActorSystemImpl(slack)] Illegal header: Illegal 'expires' header: Illegal weekday in date 1997-07-26T05:00:00: is 'Mon' but should be 'Sat'
Exception in thread "main" slack.api.ApiError: missing_scope
at slack.api.SlackApiClient$.$anonfun$makeApiRequest$3(SlackApiClient.scala:92)

:-​(

Google: “missing_scope” and interpret results

The token used is not granted the specific scope permissions required to complete this request.

:-​( :-​(

Maybe I have to create an app and add it to the workspace? I’ll try that.

Created, figured out how to add the user token scope “search:read” – and I got a new token!

Token= xoxp-blahblahblah

Rerun: I got a response!

{
  "ok":true,
  "query":"* in:#67bricks-now-listening from:@Daniel",
  "messages": {
    "total":0,
    "pagination": {
      "total_count":0,
      "page":1,
      "per_page":5,
      "page_count":0,
      "first":1,
      "last":0
    },
    "paging": {
      "count":5,
      "total":0,
      "page":1,
      "pages":0
    },
    "matches": []
  }
}

:-​(

Let’s just search in the channel without specifying a name…?

val ret = client.searchMessages("in:#67bricks-now-listening", sort = Some("timestamp"), sortDir = Some("asc"), count = Some(5))

Gives:

{
  "ok": true,
  "query": "in:#67bricks-now-listening",
  "messages": {
    "total": 7113,
    "pagination": {
      "total_count": 7113,
      "page": 1,
      "per_page": 5,
      "page_count": 1423,
      "first": 1,
      "last": 5
    },
    "paging": {
      "count": 5,
      "total": 7113,
      "page": 1,
      "pages": 1423
    },
    "matches": [
      {
        "username": "daniel.rendall",
        "other": "field_here"
      }
    ]
  }
}

Aha! My username is daniel.rendall, let’s try that:

val ret = client.searchMessages("in:#67bricks-now-listening from:@daniel.rendall", sort = Some("timestamp"), sortDir = Some("asc"), count = Some(5))

Gives:

{
  "ok": true,
  "query": "in:#67bricks-now-listening from:@daniel.rendall",
  "messages": {
    "total": 3213,
    "pagination": { ... etc }
  }
}

Success! Also – 3213 messages – sounds plausible. This is looking good… but sort direction seems wrong…? Try switching to “desc” => same result.

(Time spent so far: about half an hour – better stop or will miss the morning call!)

Session 2: Re-run – still works (hooray!)

Copy and paste output and save as response.json, fix up with jq so I can examine it:

cat response.json | jq '.' > response_tidied.json

And now:

"pagination": {
  "total_count": 3234,
  "page": 1,
  "per_page": 5,
  ... etc

Number has gone up – I’m still listening to things!

So, I could parse the responses to work out what the next page should be, or I could just loop – with pages of size 100 (if the API will return them) there should be 33. So we will loop and save these as 1.json, 2.json etc. First rule of scraping – aim to do it just once and save the result locally.

Horrible quick and dirty code alert!

val outDir = new File("/home/daniel/Scratch/slack/output")
def main(args: Array[String]): Unit = {
  outDir.mkdirs()
  implicit val system = ActorSystem("slack")
  try {
    val client = BlockingSlackApiClient(token)
    (1 to 33).foreach { pageNum =>
      try {
        val ret = client.searchMessages("in:#67bricks-now-listening from:@daniel.rendall",
sort = Some("timestamp"),
sortDir = Some("desc"),
count = Some(100),
page = Some(pageNum))
        Files.write(new File(outDir, "" + pageNum + ".json").toPath, ret.toString().getBytes(StandardCharsets.UTF_8), StandardOpenOption.CREATE)
        println(s"Got page $pageNum")
      } catch {
        case NonFatal(e) =>
          println(s"Couldn't get page $pageNum - ${e.getMessage}")
      }
      Thread.sleep(1000)
    }
  } finally {
    Await.result(system.terminate(), Duration.Inf)
  }
}

… prints up a reasuring list “Got page 1” => “Got page 33” and no (reported) errors!

Second rule of scraping – having done it and got the data, zip it up and put it somewhere just in case you destroy it…

Tidy it all (non essential, but makes it easier to look at):

mkdir tidied
ls output | while read JSON ; do cat output/$JSON | jq '.' > tidied/$JSON ; done

On scanning the data – it looks plausible, I can’t see an obvious “date” field but there’s a cryptic “ts” field (sample value: “1638290823.124400”) which is maybe a timestamp? A problem for another day…

(Time spent this session: about 20 minutes)

Session 3: I can haz stats?

Need to load it in. A new main method in a new object…

val outDir = new File("/home/daniel/Scratch/slack/output")
def main(args: Array[String]): Unit = {
  val jsObjects = outDir.listFiles().map { f =>
    Json.parse(new FileInputStream(f))
  }
  println(jsObjects.head)
}

Prints something sensible. Now need to get it in a useful form: define simplest class that could possibly work.

case class Message(iid: UUID, ts: String, text: String, permalink: String)
object Message {
  implicit val messageReads: Reads[Message] = (
  (__ \ "iid").read[UUID] and
  (__ \ "ts").read[String] and
  (__ \ "text").read[String] and
  (__ \ "permalink").read[String]
  ) (Message.apply _)
}

Not sure if I need the ID, but I like IDs. Looks like a UUID.

… oh, also some classes to wrap the whole result with minimum of faff (and Reads, omitted for brevity):

case class SearchResult(messages: Messages)
case class Messages(total: Int, matches: Seq[Message])

Go for broke:

val jsObjects: Array[JsResult[Seq[Message]]] = outDir.listFiles().map { f =>
  Json.parse(new FileInputStream(f)).validate[SearchResult].map(_.messages.matches)
}

Unpleasant type signature alert – Array[JsResult[Seq[Message]]] Let’s assume nothing will go wrong and just use “.get” and “.flatMap”:

val messages: Seq[Message] = outDir.listFiles().flatMap { f =>
  Json.parse(new FileInputStream(f)).validate[SearchResult].map(_.messages.matches).get
}.toList

That gives me 3234 Message objects, which is reassuring. They include top-level messages, and responses to threads. As far as I can see, the thread responses include a ?thread_ts parameter in their permalink, therefore filter them out – leaves 1792 remaining.

val filtered = messages.filterNot(_.permalink.contains("?thread_ts"))
filtered.take(10).map(_.text).foreach(println)

…and voila:

The things I’m looking for will all have the format “Artist – Album”. Regex time!

val ArtistAlbumRegex = "(.*?) - (.*)".r("artist", "album")

Wait, what…? “@deprecated(“use inline group names like (?<year>X) instead”, “2.13.7”)”

Didn’t know that had changed. Ho hum…

val ArtistAlbumRegex: Regex = "(?<artist>.*?) - (?<album>.*)".r

  case ArtistAlbumRegex(artist, album) => ArtistAndAlbum(artist, album)
}
artistsAndAlbums.take(10).foreach(println)

val artistsAndAlbums = messages.filterNot(_.permalink.contains("?thread_ts")).map(_.text).collect {
  case ArtistAlbumRegex(artist, album) => ArtistAndAlbum(artist, album)
}
artistsAndAlbums.take(10).foreach(println)

Even more promising:

Getting there! Now, there are bound to be loads of duplicates. So I guess the most obvious thing to do is count them. Let’s see if I can find the albums I’ve listened to the most, and their counts. I’m going to define a canonical key for grouping an ArtistAndAlbum just in case I’ve not been completely consistent in capitalisation.

case class ArtistAndAlbum(artist: String, album: String) {
  val groupingKey: (String, String) = (artist.toLowerCase, album.toLowerCase)
}

Then we should be able to count by:

val mostCommonAlbums = artistsAndAlbums.groupBy(_.groupingKey)
.view.map { case (_, seq) => seq.head -> seq.length }.toList.sortBy(_._2)sorted.take(10).foreach(println)

(The Bob Lazar Story – Vanquisher,1)
(The Pretenders – Pretenders (152),1)
(Saxon – Lionheart,1)
(Leprous – Tall Poppy Syndrome,1)
(Tom Petty and the Heartbreakers – Damn the Torpedoes (231),1)
(2Pac – All Eyez on Me (436),1)
(Bon Iver – For Emma, Forever Ago (461),1)
(James – Stutter,1)
(Elton John – Honky Château (251),1)
(Ice Cube – AmeriKKKa’s Most Wanted,1)

Ooops – wrong way – also the numbers in brackets need to be removed. Not sure there’s a nicer way to invert the ordering then explicitly passing the Ordering that I want to use…

val mostCommonAlbums = artistsAndAlbums.groupBy(_.groupingKey)
.view.map { case (_, seq) => seq.head -> seq.length }.toList.sortBy(_._2)(Ordering[Int].reverse)
mostCommonAlbums.take(10).foreach(println

(Benny Andersson – Piano,8)
(Meilyr Jones – 2013),7)
(Brian Eno – Here Come The Warm Jets),4)
(Richard &amp; Linda Thompson – I Want To See The Bright Lights Tonight),4)
(Admirals Hard – Upon a Painted Ocean,4)
(Neuronspoiler – Emergence,4)
(Global Communication – Pentamerous Metamorphosis),4)
(Steely Dan – Countdown To Ecstasy,4)
(Pole – 2,4)
(Faith No More – Angel Dust,4)

That looks plausible, actually. I like Piano. I’m guessing there are loads of other “4” albums…

But who is my most listened to artist? I have a shrewd idea I know who it will turn out to be – my prediction is that it will be a four word band name with the initials HMHB. Use the fact that I defined my grouping key to start with the artist

val mostCommonArtists = artistsAndAlbums.groupBy(_.groupingKey._1)
.view.map { case (_, seq) => seq.head.artist -> seq.length }.toList.sortBy(_._2)(Ordering[Int].reverse)
mostCommonArtists.take(10).foreach(println)

(Half Man Half Biscuit,28)
(Fairport Convention,19)
(R.E.M.,18)
(Steeleye Span,15)
(Julian Cope,15)
(Various,13)
(James,12)
(Cardiacs,11)
(Faith No More,9)
(Brian Eno,9)

Bingo! The mighty Half Man Half Biscuit in there at #1. One flaw is immediately apparent – this naive approach doesn’t distinguish between “listening to lots of albums by an artist as part of business-as-usual” and “listening to an artist’s entire back catalogue in one go” (which accounts for the high showings of Fairport Convention, R.E.M. and Steeleye Span). Worry about that some other time.

How many albums have I listened to?

val distinctAlbums = artistsAndAlbums.distinctBy(_.groupingKey)
println("Total albums = " + artistsAndAlbums.length)
println("Distinct albums = " + distinctAlbums.length)

Total albums = 1371
Distinct albums = 1255

.. but that will be wrong because I’ve listened to some albums in the context of e.g. working through the Rolling Stone or NME’s list of top 500 albums, and in those cases I appended the number to the list e.g. “Battles – Mirrored (NME 436)”. So chop that off the end of the album name:

val artistsAndAlbums = messages.filterNot(_.permalink.contains("?thread_ts")).map(_.text).collect {
  case ArtistAlbumRegex(artist, album) =>
    ArtistAndAlbum(artist, album.replaceAll("\\([^)]+\\)$", "").trim)
}

Distinct albums = 1191

This final session took about 50 minutes, so if my maths is correct, the total time spent on this was a little under 2 hours. TBH I’m slightly dubious about the results; after listing all of the albums I’ve listened to in alphabetical order I’m sure there are some missing (e.g. I tackled the entire Prince back catalogue, but there were only a handful of Prince albums in there, ditto for David Bowie). I suspect a bit more work and exploration of the Slack API might reveal what I’m missing. Or maybe my method for distinguishing main messages from responses is wrong (just had a thought; maybe a main message that begins a thread also gets the ?thread_ts parameter).  But it’s close enough for now, and appears to confirm my suspicion that Half Man Half Biscuit are my most listened to artist.

And now, what with it being the season of goodwill and all that, it’s time for my special Christmas Playlist

Command line parsing with Scala dynamic and macros

In our developer meeting this week, we talked about using the Scala features of “dynamic” and macros to simplify code for command-line parsing.

The problem we were trying to solve is a recent project in which we have lots of separate command-line apps, with a lot of overlap between their arguments, but with a need for app-specific arguments and to apply different default parameters to each app.

Continue reading “Command line parsing with Scala dynamic and macros”