We had another “jumble sale” developer meeting this week, in which a number of people talked briefly about topics that interested them.
Rhys, Reece and Richard H talked about integrating with third-party authentication and authorization systems. For some of our projects, we have specified a simple API that the third-party authentication provider needs to support. This has been an effective approach, and we’ve found it particularly useful that we wrote a complete test suite to begin with against a dummy provider – that meant that we could very easily tell if the third-party had followed the specification, and potentially they could have run the tests themselves (although they didn’t). For another project, we have used SAML to support single-sign-on to integrate with existing sign-on systems. We’ve used pac4j from Scala to do this integration, and combined it with MarkLogic authentication to allow login to our systems.
Inigo talked about his investigation of why an AWS-hosted MarkLogic cluster was somehow transferring many terabytes of data per month. The possible causes of this were:
- the communication between end users and Play
- … between Play and MarkLogic
- … between the hosts in the ML cluster,
- … and from the servers to their S3 backups.
He used stats from the AWS load balancer combined with Firefox’s network tab to analyze page weight to rule out the communication with end users. By enabling logging on the load balancer used between Play and MarkLogic, he was able to rule out the second cause. Storing data from AWS in S3 is free as long as it’s all in the same region – which it was. Running iftop on the cluster nodes showed which sockets the data was flowing between and when – which pinned it down to the MarkLogic intra-cluster communication, in regular bursts. Looking at what the system was doing, there was a regular background task running that queried all of the documents. This took about 30s, but we hadn’t previously been concerned about the inefficiency since it didn’t affect users. However, because it wasn’t using indices, it needed to do a lot of data transfer between the cluster nodes to check the content – which led to our data transfer volume. Rewriting the query to use indices made it much faster, but also prevented it from having to do any data transfer, significantly dropping our AWS data transfer bill. The lesson of this is that when running a cluster on AWS, we care about different sorts of inefficiency than we would do running it on local machines.
Richard B talked about Effective Scala Futures – how to use Scala futures without using global variables that make testing hard, and how to bulkhead separate threadpools to manage different tasks. His slides are here: Effective Scala Futures dev talk