Last Friday we held a dev forum on parsing data, prompted by some pressing questions from developers about regexes. Dan gave us a detailed overview of the different ways to parse data. We often encounter situations where an input or a data file needs to be parsed so that our code can make sensible use of it.
After the presentation, we looked at some code using the parboiled library with Scala. The example was a simple one: checking whether a sequence of various types of brackets has matching opening and closing brackets in the correct positions. For example, the sequence ({[<<>>]}) would be considered valid, while the sequence ((({(>>]) would be invalid.
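To make "matching in the correct positions" concrete before turning to the parser, here is a minimal sketch of the same check done by hand with a stack. This is not the approach shown at the forum; the names `BracketCheck` and `isBalanced` are made up for illustration, and it assumes the input contains nothing but bracket characters.

```scala
object BracketCheck {
  // Maps each closing bracket to the opening bracket it must pair with.
  private val pairs: Map[Char, Char] =
    Map(')' -> '(', ']' -> '[', '}' -> '{', '>' -> '<')
  private val openers: Set[Char] = pairs.values.toSet

  /** True if every bracket in `s` is closed by its partner, in the right order. */
  def isBalanced(s: String): Boolean = {
    // `stack` holds the opening brackets that are still waiting to be closed.
    @annotation.tailrec
    def loop(rest: List[Char], stack: List[Char]): Boolean = rest match {
      case Nil => stack.isEmpty
      case c :: tail if openers(c) => loop(tail, c :: stack)
      case c :: tail if pairs.contains(c) =>
        stack match {
          case top :: remaining if top == pairs(c) => loop(tail, remaining)
          case _ => false // wrong closer, or nothing left to close
        }
      case _ => false // assumption: any non-bracket character is rejected
    }
    loop(s.toList, Nil)
  }

  def main(args: Array[String]): Unit = {
    println(isBalanced("({[<<>>]})"))
    println(isBalanced("((({(>>])"))
  }
}
```

Note that this sketch is slightly more permissive than the grammar used later in this post: it also accepts sibling groups such as ()[], whereas the parser only accepts a single chain of nested brackets. It also only answers yes or no, while the parser returns the parsed structure.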
First we define the set of classes that describes the parsed structure:
    // Companion object: lives in the same file and package
    // as the BracketParser class
    object BracketParser {
      sealed trait Brackets
      case class RoundBrackets(content: Brackets)
        extends Brackets
      case class SquareBrackets(content: Brackets)
        extends Brackets
      case class AngleBrackets(content: Brackets)
        extends Brackets
      case class CurlyBrackets(content: Brackets)
        extends Brackets
      case object Empty extends Brackets
    }
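As an illustration of how these classes nest, the sketch below builds by hand the tree we would expect for the valid sequence ({[<<>>]}) and turns it back into a string. The wrapper object `BracketTreeExample` and the `render` helper are hypothetical names added for this example; the case classes are repeated only so that the snippet compiles on its own.

```scala
object BracketTreeExample {
  // Copy of the structure defined above, repeated so the snippet is standalone.
  sealed trait Brackets
  case class RoundBrackets(content: Brackets) extends Brackets
  case class SquareBrackets(content: Brackets) extends Brackets
  case class AngleBrackets(content: Brackets) extends Brackets
  case class CurlyBrackets(content: Brackets) extends Brackets
  case object Empty extends Brackets

  // Hand-built tree for ({[<<>>]}): each class wraps whatever sits inside
  // its pair of brackets, with Empty at the very centre.
  val example: Brackets =
    RoundBrackets(CurlyBrackets(SquareBrackets(AngleBrackets(AngleBrackets(Empty)))))

  /** Hypothetical helper: turns a tree back into its bracket string. */
  def render(b: Brackets): String = b match {
    case RoundBrackets(c)  => "(" + render(c) + ")"
    case SquareBrackets(c) => "[" + render(c) + "]"
    case AngleBrackets(c)  => "<" + render(c) + ">"
    case CurlyBrackets(c)  => "{" + render(c) + "}"
    case Empty             => ""
  }

  def main(args: Array[String]): Unit =
    println(render(example)) // prints ({[<<>>]})
}
```

Because the trait is sealed, the compiler can warn if a pattern match like the one in `render` forgets one of the bracket types.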
Next, we define the matching rules that parboiled uses:
    package com.sixtysevenbricks.examples.parboiled

    import com.sixtysevenbricks.examples.parboiled.BracketParser._
    import org.parboiled.scala._

    class BracketParser extends Parser {

      /**
       * The input should consist of a bracketed expression
       * followed by the special "end of input" marker
       */
      def input: Rule1[Brackets] = rule {
        bracketedExpression ~ EOI
      }

      /**
       * A bracketed expression can be roundBrackets,
       * or squareBrackets, or... or the special empty
       * expression (which occurs in the middle). Note that
       * because "empty" will always match, it must be listed
       * last
       */
      def bracketedExpression: Rule1[Brackets] = rule {
        roundBrackets | squareBrackets |
          angleBrackets | curlyBrackets | empty
      }

      /**
       * The empty rule matches an EMPTY expression
       * (which will always succeed) and pushes the Empty
       * case object onto the stack
       */
      def empty: Rule1[Brackets] = rule {
        EMPTY ~> (_ => Empty)
      }

      /**
       * The roundBrackets rule matches a bracketed
       * expression surrounded by parentheses. If it
       * succeeds, it pushes a RoundBrackets object
       * onto the stack, containing the content inside
       * the brackets
       */
      def roundBrackets: Rule1[Brackets] = rule {
        "(" ~ bracketedExpression ~ ")" ~~>
          (content => RoundBrackets(content))
      }

      // The remaining matchers follow the same pattern
      def squareBrackets: Rule1[Brackets] = rule {
        "[" ~ bracketedExpression ~ "]" ~~>
          (content => SquareBrackets(content))
      }

      def angleBrackets: Rule1[Brackets] = rule {
        "<" ~ bracketedExpression ~ ">" ~~>
          (content => AngleBrackets(content))
      }

      def curlyBrackets: Rule1[Brackets] = rule {
        "{" ~ bracketedExpression ~ "}" ~~>
          (content => CurlyBrackets(content))
      }

      /**
       * The main entry point for parsing.
       * @param expression the bracket sequence to parse
       * @return the parsing result, which holds the parsed
       *         Brackets structure if the input matched
       */
      def parseExpression(expression: String):
        ParsingResult[Brackets] = {
        ReportingParseRunner(input).run(expression)
      }
    }
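To round this off, here is a rough sketch of how the parser might be called. It assumes parboiled's Scala module is on the classpath and that the `BracketParser` class and its companion object shown above are compiled alongside it. The `BracketParserDemo` object and its `isValid` helper are hypothetical names added for illustration; `matched` and `result` are the fields on parboiled's Scala `ParsingResult` that report whether the input matched and carry the value left on the stack.

```scala
import com.sixtysevenbricks.examples.parboiled.BracketParser
import com.sixtysevenbricks.examples.parboiled.BracketParser._

object BracketParserDemo {

  /** Hypothetical helper: true if the parser accepts the whole input. */
  def isValid(expression: String): Boolean =
    new BracketParser().parseExpression(expression).matched

  def main(args: Array[String]): Unit = {
    val parser = new BracketParser

    for (input <- Seq("({[<<>>]})", "((({(>>])")) {
      val parsed = parser.parseExpression(input)
      parsed.result match {
        // On success, `result` holds the Brackets tree built by the rules
        case Some(tree) if parsed.matched => println(s"$input is valid: $tree")
        case _                            => println(s"$input is invalid")
      }
    }
  }
}
```

A fresh parser is created inside `isValid` to keep the helper free of shared state; reusing one instance, as `main` does, should also be fine in single-threaded code.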
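While this example requires far more code than a regex would, parsers are more powerful and adaptable: arbitrarily deep nesting like this is beyond what a classical regular expression can match, and the parser hands back a structured result rather than a simple yes or no. Parboiled seems to be an excellent library with a rather nice syntax for defining parsers.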
To summarize, regexes are very useful, but so are parsers. Start with a regex (or better yet, a pre-existing library that specifically parses your data structure) and if it gets too complex to deal with, consider writing a custom parser.