Dev Forum – Parsing Data

Last Friday we had a dev forum on parsing data that came up as some devs had pressing question on Regex. Dan provided us with a rather nice and detailed overview of different ways to parse data. Often we encounter situations where an input or a data file needs to be parsed so our code can make some sensible use of it.

After the presentation, we looked at some code using the parboiled library with Scala. A simple example of checking if a sequence of various types of brackets has matching open and closing ones in the correct positions was given. For example the sequence ({[<<>>]}) would be considered valid, while the sequence ((({(>>]) would be invalid.

First we define the set of classes that describes the parsed structure:

object BracketParser {

  sealed trait Brackets

  case class RoundBrackets(content: Brackets)
     extends Brackets

  case class SquareBrackets(content: Brackets)
     extends Brackets

  case class AngleBrackets(content: Brackets)
     extends Brackets

  case class CurlyBrackets(content: Brackets)
     extends Brackets

  case object Empty extends Brackets

}

Next, we define the matching rules that parboiled uses:

package com.sixtysevenbricks.examples.parboiled

import com.sixtysevenbricks.examples.parboiled.BracketParser._
import org.parboiled.scala._

class BracketParser extends Parser {

  /**
   * The input should consist of a bracketed expression
   * followed by the special "end of input" marker
   */
  def input: Rule1[Brackets] = rule {
    bracketedExpression ~ EOI
  }

  /**
   * A bracketed expression can be roundBrackets,
   * or squareBrackets, or... or the special empty 
   * expression (which occurs in the middle). Note that
   * because "empty" will always match, it must be listed
   * last
   */
  def bracketedExpression: Rule1[Brackets] = rule {
    roundBrackets | squareBrackets | 
    angleBrackets | curlyBrackets | empty
  }

  /**
   * The empty rule matches an EMPTY expression
   * (which will always succeed) and pushes the Empty
   * case object onto the stack
   */
  def empty: Rule1[Brackets] = rule {
    EMPTY ~> (_ => Empty)
  }

  /**
   * The roundBrackets rule matches a bracketed 
   * expression surrounded by parentheses. If it
   * succeeds, it pushes a RoundBrackets object 
   * onto the stack, containing the content inside
   * the brackets
   */
  def roundBrackets: Rule1[Brackets] = rule {
    "(" ~ bracketedExpression ~ ")" ~~>
         (content => RoundBrackets(content))
  }

  // Remaining matchers
  def squareBrackets: Rule1[Brackets] = rule {
    "[" ~ bracketedExpression ~ "]"  ~~>
        (content => SquareBrackets(content))
  }

  def angleBrackets: Rule1[Brackets] = rule {
    "<" ~ bracketedExpression ~ ">" ~~>
        (content => AngleBrackets(content))
  }

  def curlyBrackets: Rule1[Brackets] = rule {
    "{" ~ bracketedExpression ~ "}" ~~>
        (content => CurlyBrackets(content))
  }


  /**
   * The main entrypoint for parsing.
   * @param expression
   * @return
   */
  def parseExpression(expression: String):
    ParsingResult[Brackets] = {
    ReportingParseRunner(input).run(expression)
  }

}

While this example requires a lot more code to be written than a regex, parsers are more powerful and adaptable. Parboiled seems to be an excellent library with a rather nice syntax for defining them.

To summarize, regexes are very useful, but so are parsers. Start with a regex (or better yet, a pre-existing library that specifically parses your data structure) and if it gets too complex to deal with, consider writing a custom parser.

An often asked question

I thought I’d pen a short blog post on a question I frequently find myself asking.

You’re right, but are you relevant?

Me

As software developers, it’s all too easy to want to select the new hot technologies available to us. Building a simple web app is great, but building a distributed web-app based on an auto-scaling cluster making use of CQRS and an SPA front-end written in the latest JavaScript hotness is just more interesting. Software development is often about finding simple solutions to complex problems, allowing us to minimize complexity.

  • You’re right, distributed microservices will allow you to deploy separate parts of your app independently, but is that relevant to a back-office system which only needs to be available 9-5?
  • You’re right, CQRS will allow you to better scale the database to handle a tremendous number of queries in parallel, but is that relevant when we only expect 100 users a day?
  • You’re right, that new Javascript SPA framework will let you create really compelling, interactive applications, but is that relevant to a basic CRUD app?

Lots of modern technology can do amazing things and there are always compelling reasons to choose technology x, but are those reasons relevant to the problem at hand? If you spend a lot of time up front designing systems with tough-to-implement technologies and approaches which aren’t needed; a lot of development effort would have been needlessly spent adding all kinds of complexity which wasn’t needed. Worse, we could have added lots of complexity that then resists change, preventing us from adapting the system in the direction our users require.

At 67 Bricks, we encourage teams to start with something simple and then iterate upon it, adding complexity only when it is justified. Software systems can be designed to embrace change. An idea that’s the subject of one of my favourite books Building Evolutionary Architectures.

So next time you find yourself architecting a big complex system with lots of cutting edge technologies that allow for all kinds of “-ilities” to be achieved. Be sure to stop and ask yourself. “I’m right, but am I relevant?”

In praise of functional programming

At 67 Bricks, we are big proponents of functional programming. We believe that projects which use it are easier to write, understand and maintain. Scala is one of the most common languages we make use of and we see more object oriented languages like C# are often written with a functional perspective.

This isn’t to say that functional languages are inherently better than any other paradigms. Like any language, it’s perfectly possible to use it poorly and produce something unmaintainable and unreadable.

Immutable First

At first glance immutable programming would be a challenging limitation. Not being able to update the value of a variable is restrictive but that very restriction is what makes functional code easier to understand. Personally I found this to be one of the hardest concepts to wrap my head around when I first started functional programming. How can I write code if I can’t change the value of variables?

In practice, reading code was made much easier as once a value was assigned, it never changed, no matter how long the function was or what was done with the value. No more trying to keep track in my head how a long method changes a specific variable and all the ways it may not thanks to various control flows.

For example, when we pass a list into a method, we know that the reference to the list won’t be able to change, but the values within the list could. We would hope that the function was named something sensible, perhaps with some documentation that makes it clear what it will do to our list but we can never be too sure. The only way to know exactly what is happening is to dive into that method and check for ourselves which then adds to the cognitive load of understanding the code. With a more functional programming language, we know that our list cannot be changed because it is an immutable data structure with an immutable reference to it.

High Level Constructs

Functional code can often be more readable than object oriented code thanks to the various higher level functions and constructs like pattern matching. Naturally readability of code is very dependent on the programmer; it’s easy to make unreadable code in any language, but in the right hands these higher level constructs make the intended logic easier to understand and change. For example, here are 2 examples of code using regular expressions in Scala and C#:

def getFriendlyTime(string time): String = {
  val timestampRegex = "([0-9]{2}):([0-9]{2}):([0-9]{2}).([0-9]{3})".r
  time match {
    case timestampRegex(hour, minutes, _, _) => s"It's $minutes minutes after $hour"
    case _ => "We don't know"
  }
}
public static String GetFriendlyTime(String time) {
  var timestampRegex = new Regex("([0-9]{2}):([0-9]{2}):([0-9]{2}).([0-9]{3})");
  var match = timestampRegex.Match(time);
  if (match == null) {
    return "We don't know";
  } else {
    var hours = match.Groups[2].Value;
    var minutes = match.Groups[1].Value;
    return $"It's {minutes} minutes after {hour}";
  }
}

I would argue that the pattern matching of Scala really helps to create clear, concise code. As time goes on features from functional languages like pattern matching keep appearing in less functional ones like C#. A more Java based example would be the streams library, inspired by approaches found in functional programming.

This isn’t always the case, there are some situations where say a simple for loop is easier to understand than a complex fold operation. Luckily, Scala is a hybrid language, providing access to object oriented and procedural styles of programming that can be used when the situation calls for it. This flexibility helps programmers pick the style that best suits the problem at hand to make understandable and maintainable codebases.

Pure Functions and Composition

Pure functions are easier to test than impure functions. If it’s simply a case of calling a method with some arguments and always getting back the same answer, tests are going to be reliable and repeatable. If a method is part of a class which maintains lots of complex internal state, then testing is going to be unavoidably more complex and fragile. Worse is when the class requires lots of dependencies to be passed into it, these may need to be mocked or stubbed which can then lead to even more fragile tests.

Scala and other functional languages encourage developers to write small pure functions and then compose them together in larger more complex functions. This helps to minimise the amount of internal state we make use of and makes automated testing that much easier. With side effect causing code pushed to the edges (such as database access, client code to call other services etc.), it’s much easier to change how they are implemented (say if we wanted to change the type of database or change how a service call is made) making evolution of the code easier.

For example, take some possible Play controller code:

def updateUsername(id: int, name: string) = action {
  val someUser = userRepo.get(id)
  someUser match {
    case Some(user) =>
      user.updateName(name) match {
        case Some(updatedUser) =>     
         userRepo.save(updatedUser)
         bus.emit(updatedUser.events)
         Ok()
        None => BadRequest()
    case None => NotFound()
}

def updateEmail(id: int, email: string) = action {
  val someUser = userRepo.get(id)
  someUser match {
    case Some(user) =>
      user.updateEmail(email) match {
        case Some(updatedUser) =>
          userRepo.save(updatedUser)
          bus.emit(updatedUser.events)
          Ok()
        case None => BadRequest()
    case None => NotFound()
}

Using composition, the common elements can be extracted and we can make a more dense method by removing duplication.

def updateUsername(id: int, name: string) = action {
  updateEntityAndSave(userRepo)(id)(_.updateName(name))
}

def updateEmail(id: int, email: string) = action {
  updateEntityAndSave(userRepo)(id)(_.updateEmail(email))
  }
}

private def updateEntityAndSave(repo: Repo[T])(id: int)(f: T => Option[T]): Result = {
  repo.get(id) match {
    case Some(entity) =>
      f(entity) match {
        case Some(updatedEntity) =>
         repo.save(updatedEntity)
         bus.emit(updatedEntity.events)
         Ok()
        None => BadRequest()
    case None => NotFound()
}

None Not Null

Dubbed the billion-dollar mistake by Tony Hoare, nulls can be a source of great nuisance for programmers. Null Pointer Exceptions are all too often encountered and confusion reigns over whether null is a valid value or not. In some situations a null value may never happen and need not be worried about while in others it’s a perfectly valid value that has to be considered. In many OOP languages, there is no way of knowing if an object we get back from a method can be null or not without either jumping into said method or relying on documentation that often drifts or does not exist.

Functional languages avoid these problems simply by not having null (often Unit is used to represent a method that doesn’t return anything). In situations where we may want it, Scala provides the helpful Option type. This makes it explicit to the reader that you call this method, you will get back Some value or None. Even C# is now introducing features like Nullable Reference Types to help reduce the possible harm of null.

Scala also goes a step further, providing an Either construct which can contain a success (Right) of failure (Left) value. These can be chained together to using a railway styles of programming. This chaining approach can lead us to have an easily readable description of what’s happening and push all our error handling to the end, rather than sprinkle it among the code.

Using Option can also improve readability when interacting with Java code. For example, a method returning null can be pumped through an Option and combined with a match statement to lead to a neater result.

Option(someNullReturningMethod()) match {
  case Some(result) =>
    // do something
  case None =>
    // method returned null. Bad method!
}

Conclusions

There are many reasons to prefer functional approaches to programming. Even in less functionally orientated languages I have found myself reaching for functional constructs to arrange my code. I think it’s something all software developers should try learning and applying in their code.

Naturally, not every developer enjoys FP, it is not a silver bullet that will overcome all our challenges. Even within 67 Bricks we have differing opinions on which parts of functional programming are useful and which are not (scala implicits can be quite divisive). It’s ultimately just another tool in our expansive toolkit that helps us to craft correct, readable, flexible and functioning software.

How to read more interesting books

In our dev meeting, we talked about interesting development related books we had read, and how to read more interesting books. Various people each talked about a book that they had read recently or recommended.

Chris talked about “The Nature of Software Development“, by Ron Jeffries.

It’s by Ron Jeffries, one of the signers of the Agile Manifesto signer. It’s a good book to give to people who don’t know about Agile, and who might be scared of Agile jargon – it’s entirely jargon free. It explains the fundamentals, and the reasons why software development is better done Agile. It cuts through the cargo culting that is common with Agile implementations.

Tim talked about “What to look for in a code review“, by Trisha Gee.

The author is from Jetbrains, and there is a little undercover marketing for Jetbrains tools, but it’s not excessive. It’s written pragmatically (although not from the Pragmatic Press!). It’s very short, which is a plus, particularly if you have a young child and limited reading time. It is split into chapters based on things to look for in code reviews, like tests, security, performance, good principles. Although it is fairly Java focussed, it is still useful for any language. The tips are fairly obvious to most experienced developers, but there is some good stuff in there.

Pasquale talked about “Clean Coder” by Bob Martin.

This is about “saying no” and “saying yes” – the language used when committing to a task – you should say what you’re going to do and then do it. It recommends and explains Test Driven Development. It recommends coding practice, using katas. It discusses time management and managing your own time; and doing estimation. It is good for intermediate developers.

Loic talked about “Designing the user interface“.

This is a classic of Human-Computer Interaction. It’s primarily aimed at students. It covers a little of everything you need to know about HCI; usability, mice, collaboration, etc. It is useful for all developers to learn a bit about usability. It’s 20 years old, and a bit dated – but still relevant.

Reece talked about “The Design and Evaluation of C++“.

This book describes how the early C++ compiler was created, and how various features were designed, and the compromises made to overcome problems. It goes up to C++ 98. It is good for anyone interested in the history of programming languages and how they are created and evolve.

Simon talked about “My job went to India” (now renamed to “The Passionate Programmer“).

Simon read this about 10 years ago, and still recommends it. It is organized into 6 sections, with 52 tips across that. It is useful to think about how to do more than just what’s needed right now in the short-term. Some particular tips are: “Be a generalist” – don’t get bogged down in particular technologies; “Be the worst” – have better people around you; “Be a mind reader” – think about the software being developed and spike out future requirements that haven’t yet been considered, and think about what a customer might need, but they aren’t saying; “Be where you’re at” – do your job well; “How much are you worth” – was the work I did today worth what I am paid? The book could be read by anyone working in tech, but it’s aimed more at programming than other tech jobs. The ‘outsourcing’ part of the title is just a marketing vehicle.

Stephen talked about the “Mythical Man Month

It was written in 1975, but is still true, although a bit dated. Its core message is that a job that takes two months for two people, does not take one month for four people – because of communication, training, etc. It famously coined “Brooks’ Law” – adding more people to a late project makes it later. That’s still true, although perhaps less so nowadays since people it’s easier to split up tasks and make them independent. Sometimes you think “I could have done that in an hour” – but it actually takes a day: because it takes nine times longer to write good quality deployable shareable code, than it does to write something for yourself only. A modern insight on this is that reuse of code, better tools, and TDD can all help.

We had technology problems talking to Bart, but he passed on his notes on “Algorithms To Live By“.

  • It is a book on algorithms that either technical and non-technical people can get an insight on most important algorithms in computer science. Algorithms are presented in form of brief history and application in daily life. For people with computing background this could be a great supplement to already gained knowledge , i.e in the university. For non-computing people this would explain that computer science is not only about installing/fixing Windows. Some of the chapters most interesting to Bart are:
  • Optimal stopping – you will learn the most efficient methods to employ secretary, find a wife or parking space, or sell the house
  • Explore and exploit – you will learn how medical trials with unknown drugs are conducted, when to stop looking for a good restaurant, how the founder of Amazon explains Regret Optimism
  • Caching: How to store data, clean up the house, and human memory issues
  • Scheduling – importance of time, various methods to resolve problem with tasks dependency, how Pathfinder stop responding and was fixed while on Mars
  • Overfitting: how not to over-learn (self and machines)
  • Randomness – how to calculate PI by dropping a pin, how playing cards helped in building the atomic bomb, how to find gigantic prime numbers
  • Games theory – recursion in humans decisions, the “Prisoners’ dilemma” explained

Rhys also wasn’t present, but passed on his recommendation for “Functional Programming in Scala

This discusses how to solve problems in a functional way, how to think about combining functions in algebraic way. It includes exercises too. Particularly when coming from a Java / procedural background, it’s quite a mind shift.

We also briefly discussed why those particular books:

  • Chris had seen it on PragProg.com, and read about it on the mailing list, and liked Ron Jeffries as an author.
  • Tim has a young child and wanted to read a short book.
  • Pasquale realized that human processes around coding are important, and had already read and liked Clean Code.
  • Loic studied HCI at university, and it’s a heavily recommended book from ACM curriculum.
  • Reece is interested in language design, and this was written by the creator of C++. It was interesting to see thought processes of language designers.
  • Simon was doing a lot of Ruby development at the time, and reading blog posts by the author of the book, and found his attitude of interest – and the book was unexpectedly good.
  • Stephen read Mythical Man Month because it was famous.

 

Jumble sale developer meeting

At our developer meeting this week, we had a “jumble sale” theme, where various of us talked briefly about interesting technologies or things we’d been working with recently.

Inigo talked about Runnertrack and Heroku. Runnertrack is an app that he’s written that tracks runners running marathons, such as the London Marathon. Heroku is a very convenient hosting option for applications that are run for brief periods with low traffic like this. It’s easy to set up, and applications can be launched based on pushing to a Git repo.

Dan discussed Project Euler – he’s been working through the (currently 546) Euler problems and has just completed the first 100 problems. The early ones are all easy – the later ones are more challenging. The maths is not hugely complicated. You can get statistics for how many have been completed, and in what languages. Dan is using Scala. It’s a fun thing to do if you like maths and programming.

Chris talked about Serverless Cloud Computing. We’re using Amazon AWS for managing our servers – but that still means that there’s a server somewhere that needs to be managed, and have security updates applied, and apps installed, and so on. However, if you’re using one of their more specific servers like Route 53, or the Simple Queue Service, it’s just a service. API Gateway pro. Amazon Lambda is a mechanism for running code. A Serverless app has a single-page webapp hosted as HTML in S3, and then it triggers code running on Lambda, using data from Dynamo. The major disadvantage is that it’s entirely tied in to Amazon.

Simon talked about BioJS. It has a good set of visualizations, such as tree stuctures, heatmaps of interactions, graph plots, and so on. Some of these are very life science specific, but some are more widely applicable. The BioJS group is also trying to promote good practice in writing biological code – such as version control, common structure, and having demonstrations. Because of this, the BioJS site is essentially a list of references to other GitHub repos.

Nikolay talked about parsing Word documents. One of our clients has supplied us with documents without any real structure, because they’ve been converted from PDF. We’ve written a set of macros to add appropriate structure and reformat those documents, using Word search and replace. Word search and replace is surprisingly powerful – it can match and replace formatting and styles. Annoyingly, there are two forms of search-and-replace available in Word.

Rhys talked about mocking in Specs2 tests for Scala – and capturing how many times a given call has been made and what it is. There is a way of doing this easily – using the “capture” method. This allows you to capture the calls made to the mock, and then you can retrieve the actual values from the captor.

Loic talked about units, and using strong typing in Scala to attach units to your numeric quantities. There is a Squants library that defines a number of standard units, and the type system can then apply dimensional analysis to check that you are performing legal operations on them – not adding a distance to a time, for example.