A while ago, Tim suggested we could have a #now-listening channel in our company Slack, in which people could post details of what they were listening to. It occurred to me that it might be a fun challenge to try to figure out from what I’d posted on there who my favourite artist was, and which was my most-listened-to album. So I rolled up my sleeves and got to work. This is an account of what I did and my various thought processes as I went along…
Challenge: figure out how to get my posts from our #now-listening channel and do some statistics on them.
Session 1: After school run, but before work…
Start – there’s an API. https://api.slack.com
Read the documentation: https://api.slack.com/methods/search.messages looks useful – how do I call it?
I NEED A TOKEN! Aha – https://api.slack.com/apps – a “generate token” button…
Access token: xoxe.xoxp-blah-blah-blah. SUCCESS!
First obvious question: has someone done this already? Google knows everything: https://github.com/slack-scala-client/slack-scala-client
Create a new project: sbt new scala/scala-seed.g8
– add dependency on slack-scala-client, ready to rock! I’m in such a hurry that I can’t even be bothered to set up a package, so I just hijack the Hello app that came with the skeleton project…
From docs:
val token = "MY TOP SECRET TOKEN"
implicit val system = ActorSystem("slack")
try {
val client = BlockingSlackApiClient(token)
client.searchMessages(WHAT TO PUT HERE?)
} finally {
Await.result(system.terminate(), Duration.Inf)
}
… maybe something like…?:
val ret = client.searchMessages("* in:#67bricks-now-listening from:@Daniel", sort = Some("timestamp"), sortDir = Some("asc"), count = Some(5))
RUN IT
Fails. Because HelloSpec fails (I mentioned I just hijacked the OOTB Hello app). Fix with the delete key.
RUN IT
[WARN] [11/26/2021 08:40:37.418] [slack-akka.actor.default-dispatcher-2] [akka.actor.ActorSystemImpl(slack)] Illegal header: Illegal 'expires' header: Illegal weekday in date 1997-07-26T05:00:00: is 'Mon' but should be 'Sat'
Exception in thread "main" slack.api.ApiError: missing_scope
at slack.api.SlackApiClient$.$anonfun$makeApiRequest$3(SlackApiClient.scala:92)
:-(
Google: “missing_scope” and interpret results
The token used is not granted the specific scope permissions required to complete this request.
:-( :-(
Maybe I have to create an app and add it to the workspace? I’ll try that.
Created, figured out how to add the user token scope “search:read” – and I got a new token!
Token= xoxp-blahblahblah
Rerun: I got a response!
{
"ok":true,
"query":"* in:#67bricks-now-listening from:@Daniel",
"messages": {
"total":0,
"pagination": {
"total_count":0,
"page":1,
"per_page":5,
"page_count":0,
"first":1,
"last":0
},
"paging": {
"count":5,
"total":0,
"page":1,
"pages":0
},
"matches": []
}
}
:-(
Let’s just search in the channel without specifying a name…?
val ret = client.searchMessages("in:#67bricks-now-listening", sort = Some("timestamp"), sortDir = Some("asc"), count = Some(5))
Gives:
{
"ok": true,
"query": "in:#67bricks-now-listening",
"messages": {
"total": 7113,
"pagination": {
"total_count": 7113,
"page": 1,
"per_page": 5,
"page_count": 1423,
"first": 1,
"last": 5
},
"paging": {
"count": 5,
"total": 7113,
"page": 1,
"pages": 1423
},
"matches": [
{
"username": "daniel.rendall",
"other": "field_here"
}
]
}
}
Aha! My username is daniel.rendall, let’s try that:
val ret = client.searchMessages("in:#67bricks-now-listening from:@daniel.rendall", sort = Some("timestamp"), sortDir = Some("asc"), count = Some(5))
Gives:
{
"ok": true,
"query": "in:#67bricks-now-listening from:@daniel.rendall",
"messages": {
"total": 3213,
"pagination": { ... etc }
}
}
Success! Also – 3213 messages – sounds plausible. This is looking good… but sort direction seems wrong…? Try switching to “desc” => same result.
(Time spent so far: about half an hour – better stop or will miss the morning call!)
Session 2: Re-run – still works (hooray!)
Copy and paste output and save as response.json, fix up with jq so I can examine it:
cat response.json | jq '.' > response_tidied.json
And now:
"pagination": {
"total_count": 3234,
"page": 1,
"per_page": 5,
... etc
Number has gone up – I’m still listening to things!
So, I could parse the responses to work out what the next page should be, or I could just loop – with pages of size 100 (if the API will return them) there should be 33. So we will loop and save these as 1.json, 2.json etc. First rule of scraping – aim to do it just once and save the result locally.
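The figure of 33 is just a ceiling division of the total by the page size. A minimal sketch of that arithmetic (the PageMaths object and pageCount name are made up for illustration, they are not part of the Slack client); it also agrees with the page_count of 1423 that Slack reported earlier for 7113 messages at 5 per page:

```scala
object PageMaths {
  // Ceiling division: how many pages are needed to hold `total` items
  // when each page holds at most `perPage` of them.
  def pageCount(total: Int, perPage: Int): Int =
    (total + perPage - 1) / perPage

  def main(args: Array[String]): Unit = {
    println(pageCount(3234, 100)) // 33
    println(pageCount(7113, 5))   // 1423
  }
}
```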
Horrible quick and dirty code alert!
val outDir = new File("/home/daniel/Scratch/slack/output")
def main(args: Array[String]): Unit = {
outDir.mkdirs()
implicit val system = ActorSystem("slack")
try {
val client = BlockingSlackApiClient(token)
(1 to 33).foreach { pageNum =>
try {
val ret = client.searchMessages("in:#67bricks-now-listening from:@daniel.rendall",
sort = Some("timestamp"),
sortDir = Some("desc"),
count = Some(100),
page = Some(pageNum))
Files.write(new File(outDir, s"$pageNum.json").toPath, ret.toString().getBytes(StandardCharsets.UTF_8), StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
println(s"Got page $pageNum")
} catch {
case NonFatal(e) =>
println(s"Couldn't get page $pageNum - ${e.getMessage}")
}
Thread.sleep(1000)
}
} finally {
Await.result(system.terminate(), Duration.Inf)
}
}
… prints up a reassuring list “Got page 1” => “Got page 33” and no (reported) errors!
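One thing to watch if a scrape like this is ever re-run: as far as I know, Files.write given only StandardOpenOption.CREATE opens an existing file without truncating it, so a shorter response written over a longer one would leave stale bytes at the end of the file. A minimal sketch of a save helper that asks for truncation explicitly (SavePage and save are hypothetical names, and the JSON strings are placeholders rather than real Slack responses):

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, StandardOpenOption}

object SavePage {
  // Hypothetical helper: writes one page of JSON, replacing any earlier copy.
  def save(dir: Path, pageNum: Int, json: String): Path = {
    Files.createDirectories(dir)
    Files.write(
      dir.resolve(s"$pageNum.json"),
      json.getBytes(StandardCharsets.UTF_8),
      StandardOpenOption.CREATE,
      StandardOpenOption.TRUNCATE_EXISTING // discard whatever was there before
    )
  }

  def main(args: Array[String]): Unit = {
    val dir = Files.createTempDirectory("slack-pages")
    save(dir, 1, """{"ok":true,"note":"a longer first attempt"}""")
    val path = save(dir, 1, """{"ok":true}""")
    // Only the second, shorter payload should remain in the file.
    println(new String(Files.readAllBytes(path), StandardCharsets.UTF_8))
  }
}
```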
Second rule of scraping – having done it and got the data, zip it up and put it somewhere just in case you destroy it…
Tidy it all (non essential, but makes it easier to look at):
mkdir tidied
ls output | while read -r JSON ; do jq '.' "output/$JSON" > "tidied/$JSON" ; done
On scanning the data – it looks plausible, I can’t see an obvious “date” field but there’s a cryptic “ts” field (sample value: “1638290823.124400”) which is maybe a timestamp? A problem for another day…
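For what it’s worth, the sample value does parse cleanly as Unix epoch seconds. A minimal sketch of the conversion, assuming “ts” always has the shape "<seconds>.<six digits>" and treating the fractional part as microseconds (both are assumptions; SlackTs and toInstant are hypothetical names):

```scala
import java.time.Instant

object SlackTs {
  // Hypothetical helper: assumes ts is "<epoch seconds>.<six-digit fraction>".
  def toInstant(ts: String): Instant = ts.split('.') match {
    case Array(seconds, micros) =>
      // Instant wants the sub-second part in nanoseconds.
      Instant.ofEpochSecond(seconds.toLong, micros.toLong * 1000L)
    case _ =>
      throw new IllegalArgumentException(s"Unexpected ts format: $ts")
  }

  def main(args: Array[String]): Unit =
    println(toInstant("1638290823.124400")) // 2021-11-30T16:47:03.124400Z
}
```

The result lands on 30 November 2021, which fits the dates in the log output earlier, so the timestamp guess looks reasonable.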
(Time spent this session: about 20 minutes)
Session 3: I can haz stats?
Need to load it in. A new main method in a new object…
val outDir = new File("/home/daniel/Scratch/slack/output")
def main(args: Array[String]): Unit = {
val jsObjects = outDir.listFiles().map { f =>
Json.parse(new FileInputStream(f))
}
println(jsObjects.head)
}
Prints something sensible. Now need to get it in a useful form: define simplest class that could possibly work.
case class Message(iid: UUID, ts: String, text: String, permalink: String)
object Message {
implicit val messageReads: Reads[Message] = (
(__ \ "iid").read[UUID] and
(__ \ "ts").read[String] and
(__ \ "text").read[String] and
(__ \ "permalink").read[String]
) (Message.apply _)
}
Not sure if I need the ID, but I like IDs. Looks like a UUID.
… oh, also some classes to wrap the whole result with minimum of faff (and Reads, omitted for brevity):
case class SearchResult(messages: Messages)
case class Messages(total: Int, matches: Seq[Message])
Go for broke:
val jsObjects: Array[JsResult[Seq[Message]]] = outDir.listFiles().map { f =>
Json.parse(new FileInputStream(f)).validate[SearchResult].map(_.messages.matches)
}
Unpleasant type signature alert – Array[JsResult[Seq[Message]]]. Let’s assume nothing will go wrong and just use “.get” and “.flatMap”:
val messages: Seq[Message] = outDir.listFiles().flatMap { f =>
Json.parse(new FileInputStream(f)).validate[SearchResult].map(_.messages.matches).get
}.toList
That gives me 3234 Message objects, which is reassuring. They include top-level messages, and responses to threads. As far as I can see, the thread responses include a ?thread_ts parameter in their permalink, therefore filter them out – leaves 1792 remaining.
val filtered = messages.filterNot(_.permalink.contains("?thread_ts"))
filtered.take(10).map(_.text).foreach(println)
…and voila:
The things I’m looking for will all have the format “Artist – Album”. Regex time!
val ArtistAlbumRegex = "(.*?) - (.*)".r("artist", "album")
Wait, what…? @deprecated("use inline group names like (?<year>X) instead", "2.13.7")
Didn’t know that had changed. Ho hum…
val ArtistAlbumRegex: Regex = "(?<artist>.*?) - (?<album>.*)".r
val artistsAndAlbums = messages.filterNot(_.permalink.contains("?thread_ts")).map(_.text).collect {
case ArtistAlbumRegex(artist, album) => ArtistAndAlbum(artist, album)
}
artistsAndAlbums.take(10).foreach(println)
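To make the behaviour of that pattern match concrete, here is a small self-contained sketch. The sample strings are illustrative rather than real channel data, and RegexDemo is a hypothetical wrapper. It shows that the lazy artist group splits at the first " - ", and that collect silently drops anything that doesn’t match the whole pattern:

```scala
import scala.util.matching.Regex

object RegexDemo {
  final case class ArtistAndAlbum(artist: String, album: String)

  val ArtistAlbumRegex: Regex = "(?<artist>.*?) - (?<album>.*)".r

  // Illustrative posts only, not real channel data.
  val samples: List[String] = List(
    "Battles - Mirrored (NME 436)",
    "Some Artist - Some Album - Deluxe Edition",
    "Good morning all!"
  )

  // A Regex used as an extractor must match the entire string,
  // so the last sample falls through and is dropped by collect.
  val parsed: List[ArtistAndAlbum] = samples.collect {
    case ArtistAlbumRegex(artist, album) => ArtistAndAlbum(artist, album)
  }

  def main(args: Array[String]): Unit = parsed.foreach(println)
}
```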
Even more promising:
Getting there! Now, there are bound to be loads of duplicates. So I guess the most obvious thing to do is count them. Let’s see if I can find the albums I’ve listened to the most, and their counts. I’m going to define a canonical key for grouping an ArtistAndAlbum just in case I’ve not been completely consistent in capitalisation.
case class ArtistAndAlbum(artist: String, album: String) {
val groupingKey: (String, String) = (artist.toLowerCase, album.toLowerCase)
}
Then we should be able to count by:
val mostCommonAlbums = artistsAndAlbums.groupBy(_.groupingKey)
.view.map { case (_, seq) => seq.head -> seq.length }.toList.sortBy(_._2)
mostCommonAlbums.take(10).foreach(println)
(The Bob Lazar Story – Vanquisher,1)
(The Pretenders – Pretenders (152),1)
(Saxon – Lionheart,1)
(Leprous – Tall Poppy Syndrome,1)
(Tom Petty and the Heartbreakers – Damn the Torpedoes (231),1)
(2Pac – All Eyez on Me (436),1)
(Bon Iver – For Emma, Forever Ago (461),1)
(James – Stutter,1)
(Elton John – Honky Château (251),1)
(Ice Cube – AmeriKKKa’s Most Wanted,1)
Ooops – wrong way – also the numbers in brackets need to be removed. Not sure there’s a nicer way to invert the ordering than explicitly passing the Ordering that I want to use…
val mostCommonAlbums = artistsAndAlbums.groupBy(_.groupingKey)
.view.map { case (_, seq) => seq.head -> seq.length }.toList.sortBy(_._2)(Ordering[Int].reverse)
mostCommonAlbums.take(10).foreach(println)
(Benny Andersson – Piano,8)
(Meilyr Jones – 2013),7)
(Brian Eno – Here Come The Warm Jets),4)
(Richard & Linda Thompson – I Want To See The Bright Lights Tonight),4)
(Admirals Hard – Upon a Painted Ocean,4)
(Neuronspoiler – Emergence,4)
(Global Communication – Pentamerous Metamorphosis),4)
(Steely Dan – Countdown To Ecstasy,4)
(Pole – 2,4)
(Faith No More – Angel Dust,4)
That looks plausible, actually. I like Piano. I’m guessing there are loads of other “4” albums…
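On the question of inverting the ordering: passing Ordering[Int].reverse works fine, but for plain Int counts an alternative is simply to sort by the negated count. A minimal sketch of the whole group-and-count step written that way (CountDemo and countDescending are hypothetical names, and the listening history is made up for illustration):

```scala
object CountDemo {
  final case class ArtistAndAlbum(artist: String, album: String) {
    val groupingKey: (String, String) = (artist.toLowerCase, album.toLowerCase)
  }

  // Groups by key, keeps the first item seen in each group as its label,
  // and returns (label, count) pairs with the largest count first.
  def countDescending[A, K](items: Seq[A])(key: A => K): List[(A, Int)] =
    items.groupBy(key).toList
      .map { case (_, group) => group.head -> group.length }
      .sortBy { case (_, count) => -count }

  def main(args: Array[String]): Unit = {
    // Made-up listening history, for illustration only.
    val plays = List(
      ArtistAndAlbum("Benny Andersson", "Piano"),
      ArtistAndAlbum("Pole", "2"),
      ArtistAndAlbum("benny andersson", "piano"),
      ArtistAndAlbum("Benny Andersson", "Piano")
    )
    countDescending(plays)(_.groupingKey).foreach(println)
  }
}
```

Albums with equal counts come out in no particular order, since they pass through a Map on the way.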
But who is my most listened to artist? I have a shrewd idea I know who it will turn out to be – my prediction is that it will be a four word band name with the initials HMHB. Use the fact that I defined my grouping key to start with the artist:
val mostCommonArtists = artistsAndAlbums.groupBy(_.groupingKey._1)
.view.map { case (_, seq) => seq.head.artist -> seq.length }.toList.sortBy(_._2)(Ordering[Int].reverse)
mostCommonArtists.take(10).foreach(println)
(Half Man Half Biscuit,28)
(Fairport Convention,19)
(R.E.M.,18)
(Steeleye Span,15)
(Julian Cope,15)
(Various,13)
(James,12)
(Cardiacs,11)
(Faith No More,9)
(Brian Eno,9)
Bingo! The mighty Half Man Half Biscuit in there at #1. One flaw is immediately apparent – this naive approach doesn’t distinguish between “listening to lots of albums by an artist as part of business-as-usual” and “listening to an artist’s entire back catalogue in one go” (which accounts for the high showings of Fairport Convention, R.E.M. and Steeleye Span). Worry about that some other time.
How many albums have I listened to?
val distinctAlbums = artistsAndAlbums.distinctBy(_.groupingKey)
println("Total albums = " + artistsAndAlbums.length)
println("Distinct albums = " + distinctAlbums.length)
Total albums = 1371
Distinct albums = 1255
… but that will be wrong because I’ve listened to some albums while working through e.g. the Rolling Stone or NME lists of the top 500 albums, and in those cases I appended the position in the list to the album name, e.g. “Battles – Mirrored (NME 436)”. So chop that off the end of the album name:
val artistsAndAlbums = messages.filterNot(_.permalink.contains("?thread_ts")).map(_.text).collect {
case ArtistAlbumRegex(artist, album) =>
ArtistAndAlbum(artist, album.replaceAll("\\([^)]+\\)$", "").trim)
}
Distinct albums = 1191
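A quick sketch of what that replaceAll does in isolation, using album names that appear earlier in this post (AlbumName and stripListSuffix are hypothetical names for illustration):

```scala
object AlbumName {
  // Drops one trailing "(...)" group, then trims the leftover space.
  def stripListSuffix(album: String): String =
    album.replaceAll("\\([^)]+\\)$", "").trim

  def main(args: Array[String]): Unit = {
    println(stripListSuffix("Mirrored (NME 436)")) // Mirrored
    println(stripListSuffix("Pretenders (152)"))   // Pretenders
    println(stripListSuffix("Piano"))              // Piano
  }
}
```

One caveat with this approach: the pattern doesn’t check what is inside the brackets, so an album whose real title ends in a bracketed phrase would lose that phrase too and be merged with any plain-titled namesake.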
This final session took about 50 minutes, so if my maths is correct, the total time spent on this was a little under 2 hours. TBH I’m slightly dubious about the results; after listing all of the albums I’ve listened to in alphabetical order I’m sure there are some missing (e.g. I tackled the entire Prince back catalogue, but there were only a handful of Prince albums in there, ditto for David Bowie). I suspect a bit more work and exploration of the Slack API might reveal what I’m missing. Or maybe my method for distinguishing main messages from responses is wrong (just had a thought; maybe a main message that begins a thread also gets the ?thread_ts parameter). But it’s close enough for now, and appears to confirm my suspicion that Half Man Half Biscuit are my most listened to artist.
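One way to test that last thought would be to compare the thread_ts value in each permalink with the message’s own ts. A minimal sketch, resting on two assumptions I haven’t verified against the search results: that the permalink carries the parent’s timestamp as a thread_ts query parameter, and that a thread-starting message (if it carries the parameter at all) points at its own ts. ThreadCheck, Msg and the example URLs are all made up for illustration:

```scala
object ThreadCheck {
  // Hypothetical minimal shape: only the two fields this check needs.
  final case class Msg(ts: String, permalink: String)

  private val ThreadTsParam = "[?&]thread_ts=([0-9.]+)".r

  // The thread_ts query parameter of a permalink, if it has one.
  def threadTs(permalink: String): Option[String] =
    ThreadTsParam.findFirstMatchIn(permalink).map(_.group(1))

  // Assumption: a reply is a message whose thread_ts points at some
  // other message; a thread parent points at itself or has no parameter.
  def isReply(m: Msg): Boolean = threadTs(m.permalink).exists(_ != m.ts)

  def main(args: Array[String]): Unit = {
    // Made-up permalinks, for illustration only.
    val parent = Msg("1638290823.124400",
      "https://example.slack.com/archives/C000/p1638290823124400?thread_ts=1638290823.124400")
    val reply = Msg("1638290900.000200",
      "https://example.slack.com/archives/C000/p1638290900000200?thread_ts=1638290823.124400")
    println(isReply(parent)) // false
    println(isReply(reply))  // true
  }
}
```

If the count of messages kept by isReply-based filtering came out noticeably higher than 1792, that would suggest the simpler contains("?thread_ts") filter was throwing away thread-starting posts as well.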
And now, what with it being the season of goodwill and all that, it’s time for my special Christmas Playlist…