Wortüberbreitedarstellungsproblem

Don’t worry if you don’t understand German; the title of this post will make sense if you read on…

We’ve been working for the last few years with De Gruyter to rebuild their delivery platform. This has worked well and we have picked up an award along the way. Part of our approach has been to push out new features and improvements to the site on a weekly basis. Yesterday we did this, deploying a new home page design that has been a month or two in the making. The release went fine, but then we started getting reports that the new home page didn’t look quite right for users on iPhones and iPads. I took a look – it seemed fine on my Android phone and on my daughter’s iPhone. A developer based in India looked on his iPhone with different browsers and everything was as expected. But somehow German users were seeing text that overflowed the edge of the page. So what was going on – how could German Apple devices be so different? Most odd.

It turned out that the problem was not a peculiarity of the German devices, but of the German language. German is famous for its long compound words (like the title of this post) and often uses one big word where English would use a phrase. Our new homepage includes a grid of subjects that are covered in books that De Gruyter publishes. In English these subjects mostly have quite short names, but in German they can be quite long. For smaller screens the subject grid would shift from three to two columns, but even so this was not enough to accommodate the long German words, meaning the page overflowed.

Subject grid in German

The fix was quite a simple one: for the German version of the page, the grid shifts to two columns more readily and then down to a single column on a phone screen. But I think the lesson is that there is more to catering for different languages than checking the site looks fine in English and that all the text has been translated. The features of the target language can have unexpected effects and need checking. It’s easy to overlook this when dealing with two languages that are apparently quite similar.

On a similar note, it can be easy to be complacent that your site is easy to use because you understand it, or to believe it is accessible to those using a screen reader because you have added alt text to images. Just because it works for you doesn’t mean it works for others, and that always needs bearing in mind.

Finally, that title? It translates as something like “Overly wide word display problem” and was suggested by someone at De Gruyter as a German compound word to describe the problem we saw.

Understanding an “impossible” error

As discussed in the previous post on Sharing Failures, seeing how other people have dealt with bugs and errors can often help you avoid your own or give you ways to track down the source of a problem when one does make its appearance. So in that spirit, here is the story of a baffling error we fixed recently.

The error came from a content delivery platform we have been working on for a publisher client. At the point of a release, and for several hours afterwards, we were seeing some errors, but there were a few reasons why this was very confusing.

The site is built using Scala / Play and uses Akka HTTP to make API calls between services. The error we were seeing was one that generally means that requests are coming in to a frontend service faster than the backend can service them:

BufferOverflowException: Exceeded configured max-open-requests value of [256]. This means that the request queue of this pool (........) has completely filled up because the pool currently does not process requests fast enough to handle the incoming request load. Please retry the request later. See https://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.]]

So apparently the pool of requests was filling up and causing a problem. But the first thing that was strange was that this was persisting for several hours after the release. At the point of a release it’s understandable that this error could occur with various services being started and stopped, causing requests to back up. After that the system was not under particularly high load, so why was this not just a transient issue?

The next thing that was strange was that we were only seeing this when users were accessing very particular content. We were only seeing it for access to content in reference works. These are what publishers confusingly call “databases” and cover things like encyclopedias, directories or dictionaries. But it wasn’t all databases, only certain ones and different ones at different times. On one occasion we would see a stream of errors for Encyclopedia A and then the next time we hit this error it would be Dictionary B generating the problems instead. If the cause was a pool of requests filling up, why would it affect particular pieces of content and not others, when they all use the same APIs?

Another thing that was puzzling – not every access to that database would generate an error. We’d either get an error or the content would be rendered fine, both very quickly. The error we were seeing suggested that things were running slowly somewhere, but the site seemed to be snappy, just producing intermittent errors for some content some of the time.

We spent lots of time reading Akka HTTP documentation trying to figure out how we could be seeing these problems, but it didn’t seem to make any sense. I had the feeling that I was missing something because the error seemed to be “impossible”. I even commented to a colleague that it felt like once we worked out what was going on I would talk about it at one of our dev forums. That prediction turned out to be true. Looking at Akka HTTP documentation would not help because the error message itself was in some sense a misdirection.

The lightbulb moment came when I spotted this in our frontend code:

private lazy val databaseNameCache: LoadingCache[String, Future[DatabaseIdAndName]] = 
    CacheBuilder.newBuilder().refreshAfterWrite(4, TimeUnit.HOURS).....

We are using Guava’s LoadingCache to cache the mapping between the id of a database and its name, since this almost never changes. (Sidenote: Guava’s cache support is great; also check out the Caffeine library inspired by it.) The problem here is that we are not storing a DatabaseIdAndName object in the cache, but a Future. So we are in some sense putting the operation to fetch the database name into the cache. If that operation fails with an exception, then every time we look up that entry in the cache we replay the exception. Suddenly all the pieces fell into place. A transient error looking up a database name at release time was being cached on one frontend server and replayed for hours. The whole Akka pool thing was more or less irrelevant.

In the short term we fixed the problem by waiting for the concrete data to be returned and storing that in the cache rather than a Future object. In that scenario, a failure to fetch the value would just yield an error and nothing would be cached for future lookups. However, much of the code using this cache is asynchronous, so it’s cleaner and probably better from a performance perspective if you can continue to use Future where possible. So the longer-term solution was to revert to putting Future objects in the cache, while carefully adding code to invalidate any cache entries that resolve to an exception.
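
To make that concrete, here is a minimal sketch of the longer-term approach. The DatabaseNames object and the fetchDatabaseIdAndName method are hypothetical stand-ins for our real code; the point is that a Future is still cached, but a failed Future evicts its own entry so the exception is not replayed on later lookups.

import java.util.concurrent.TimeUnit

import com.google.common.cache.{CacheBuilder, CacheLoader, LoadingCache}

import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.Future
import scala.util.Failure

case class DatabaseIdAndName(id: String, name: String)

object DatabaseNames {

  // Hypothetical backend call that looks up the name for a database id.
  def fetchDatabaseIdAndName(id: String): Future[DatabaseIdAndName] = ???

  lazy val databaseNameCache: LoadingCache[String, Future[DatabaseIdAndName]] =
    CacheBuilder.newBuilder()
      .refreshAfterWrite(4, TimeUnit.HOURS)
      .build(new CacheLoader[String, Future[DatabaseIdAndName]] {
        override def load(id: String): Future[DatabaseIdAndName] = {
          val result = fetchDatabaseIdAndName(id)
          // If the lookup fails, evict the entry so the failed Future is not
          // replayed on every subsequent cache hit.
          result.onComplete {
            case Failure(_) => databaseNameCache.invalidate(id)
            case _          => ()
          }
          result
        }
      })
}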

I think the lesson here is – if an error doesn’t make sense then maybe some technical sleight-of-hand is going on and the error you are seeing is not the real problem. Maybe it’s all an illusion…

Lead Developer Conference 2018

I attended the Lead Developer conference in London a couple of weeks ago. I enjoyed it and came back with lots of ideas buzzing around in my head. It’s a single track conference, which is good because you don’t have to make decisions about what to see and what to miss, but also you get to see some things you might not have chosen just based on the title. Many of the speakers have given longer versions of the talks elsewhere, or have written articles on the subject, so if particular topics are of interest it is possible to go and dig in further. You can think of it like a taster menu at a fancy restaurant.

Photo by White October Events

At our developer meeting on Friday I talked about some of the talks I had seen. I couldn’t cover all of them (23 in total, I think), so I concentrated on a few that had particularly resonated. The full set of conference videos is available to view on YouTube, so go and check them out. Here are some details of the handful of talks I discussed with the team:

Alex Hill – Giving and receiving code reviews gracefully

Alex has written up a longer form in this blog post, while the video of her talk is on YouTube.

This talk was about the psychology of code reviews and how to take that into account to get the best outcomes. People sometimes feel defensive about code reviews as it feels as if they are being criticized rather than the code under review.

She talks about dividing code review comments into four quadrants along two axes: High vs Low Conflict and High vs Low Reward. The Low Reward, High Conflict things tend to be preferences like where to put brackets and so on; the best way to handle these is to agree code format standards and automate them away. The Low Conflict things don’t cause problems between team members because they are non-contentious – things like obvious bugs (High Reward) and debug statements (Low Reward). It’s the High Reward, High Conflict things that are tricky. She suggests considering conflict resolution archetypes – Avoiding vs Yielding vs Competing vs Collaborating. We are aiming for collaboration, and she has some suggestions on how to achieve that.

These include:

  • Doing more pair programming and having more discussion before implementing a feature.
  • Ensuring everyone reviews and is reviewed, so there is a level playing field.
  • Using “we” rather than “you” or the passive voice to keep the whole tone of the review more neutral.
  • Asking questions rather than making demands.
  • Just being positive rather than negative or confrontational.

As the receiver of the review, say thank you, and also think about how someone else might respond to the same comments.

Adrian Howard – Points don’t mean prizes

There is a longer version of this talk, in video form, from the ACE conference, while the short version from Lead Dev is on YouTube.

Adrian works at the intersection of development, UX and product, helping companies build the right things. This talk was about various dysfunctions he sees in the way people think about Scrum, Agile and requirements.

The default Scrum model that people use is kind of broken. Someone comes up with the vision that everyone is heading towards. Someone comes up with the user journeys to get to that place, and those get split up into stories. Those stories are given to the developers and everyone lives happily ever after. But that’s a lie.

Problems arise because the different stories are different sizes. So it’s hard to put them into fixed sprint-sized boxes or to get flow in a Kanban approach. So break them up into smaller ones and we get smoother flow. Give those to the developers and we’re done. Again that’s a lie.

Stories focus on size and effort, not on actual value. So we may have split up the story and actually delivered little value. So think about:

  • Bin – can we discard or postpone a story?
  • Thin – can we deliver less and still get value?
  • Split – can we break up a story and still get value from the pieces?

Once that is done, give the stories to the development team and we are done. Once again, it’s a lie.

The problem is that often the people who want to follow this approach don’t have the authority to make it happen.

Adrian recommends User Story Mapping as a way to get good stories and keep the big picture in mind. He particularly likes the book that describes it, because if you give someone a book, it has much more weight than just, “hey try this technique”. The output is a map rather than a flat backlog. People tend to do this at the start, but it’s best to keep refining. Some of the ideas of this approach are described in Jeff Patton’s blog post that predates the book.

Nickolas Means – Who destroyed Three Mile Island?

The final talk I discussed from Day One of the conference was about the nuclear reactor meltdown at Three Mile Island. I recommend watching the video of this one as he is a good storyteller and I am not going to retell it in detail here.

He first outlined the events that led to the partial meltdown occurring and then discussed the ideas of the “first story” and “second story” as described in Sidney Dekker’s book “The Field Guide to Understanding Human Error”. The “first story” is written with hindsight and outcome bias and generally seeks to blame someone for the results. The “second story” seeks to look at what happened through the eyes of those who were there and what they knew at the time. The idea is to start with the assumption that everyone was doing the best they could with the information they had at the time, so human error is never the cause of the event. This leads into the idea of blame-free post-mortems as a way to discover and fix systemic problems rather than seeking someone to blame.

Uberto Barbini – Legacy Code – Big Rewrite or Progressive Rejuvenation?

The first talk I discussed from the second day of the conference was this one about legacy systems. The video of this talk is on YouTube.

A legacy system is old, but it works and usually makes money for the company, or it would have been retired.  One of the options for dealing with such a system is to just keep patching it as changes are required. The downside to this is that the system slowly degrades as more and more changes are added.

Another option is the big rewrite. This rarely works out. The thing you are replacing was successful, so it is not as simple to replace as you might think. The old system contains quite a bit of knowledge that can be lost in the transition. Finally, data migration is nearly always harder than expected.

The best approach seems to be the “Strangler” pattern as described by Martin Fowler, whereby the new application wraps the old one and then slowly replaces it over time. This has the advantage of showing results quickly and not requiring a risky “big bang” switchover.
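
To make the shape of the pattern concrete, here is a tiny Scala sketch of a strangler-style facade. This is my own illustration rather than code from the talk, and the Handler trait and the prefix-based routing rule are assumptions.

trait Handler {
  def handle(path: String): String
}

class StranglerFacade(legacy: Handler,
                      replacement: Handler,
                      migratedPrefixes: Set[String]) extends Handler {

  // Requests for routes that have already been rebuilt go to the new system;
  // everything else still falls through to the legacy application.
  def handle(path: String): String =
    if (migratedPrefixes.exists(path.startsWith)) replacement.handle(path)
    else legacy.handle(path)
}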

Uberto Barbini has a similar technique which he calls “Alchemical Rejuvenation” – turning legacy code into gold. It has the following steps:

    • Seal with external tests. First of all you need some high-level assurance that the system is working after you make changes. These tests may be discarded later, once there is better testing in place.
    • Split into modules. Start improving the internal architecture to separate into logical pieces.
    • Clean the module you need to work in, adding tests as you go.
    • Repeat as needed.

He had an interesting take on code quality – it’s not clean code, TDD, patterns and so on; those are just tools for getting code quality. The real test is this: if your application has been running for 10 years and you can still add features and fix bugs quickly, then you have high code quality.

Kevin Goldsmith – Using Agile to Build Inclusive Teams

The final talk I discussed was about using agile techniques to improve the way teams are run. The video for this talk is also available on YouTube.

He talked about using post-its to work with one of his reports to work out what they each expected of each other, similar to the idea of the “Manager Read Me”.

In a similar theme, he talked about mentoring a new lead – again, working out where different responsibilities lie. For each responsibility, does the manager keep it, does the manager approve it, does the new lead inform the manager of their decisions, or does the new lead take full responsibility?

He also talked about improving team meetings. When it comes to making a decision he has two approaches: polling – everyone gives their opinion, but in the end the manager decides; and voting – everyone votes, and the manager has to accept and defend the outcome. He talked about having a collaborative team meeting agenda in a shared Google Doc. For larger groups he recommends the Lean Coffee approach.

Finally he talked about having more inclusive meetings. The lead needs to resist talking, as other people will yield to them. He also suggested having an observer who points out interruptions, people not getting credit, and so on. This role should be rotated, though, so that the same person does not always end up not contributing.

Release It – 2nd edition – part 2!

Chris talked again about Release It 2nd edition.

Last time, Chris talked about “Creating Stability” – things that can go wrong, and how to prevent that.

The next section, “Living in Production”, is about how a system works in production. Part of this is physical (networks, IPs, etc.). There can be clock problems, particularly with VMs. It also covers “12 factor apps” – which we’ve discussed before in the context of microservices – and the idea is all about making the app not depend on things on the box.

We discussed “Stucco apps” – where if you install subsequent versions 1, 2, 3 of an app on a box, then there will be bits of version 1 and 2 left over – so the app isn’t exactly any of those versions. Instead, you should rebuild from scratch each time (you could use Nix and NixOS for this…). We also discussed configuration – getting environment-appropriate configuration onto each box.

We had a digression about ambient sound from services – like putting microphones in the JET torus – so you can tell whether the system is running normally. Because humans are good at recognizing unusual noises or unusual changes in noise patterns, this can let you pick up on patterns of behaviour that aren’t otherwise obvious.

We talked about setting up logging to demonstrate that the high-level goals of the system are being met. For example, in some systems, it might be really important if page loads have become slow, or users cannot log in, or if the number of purchases per hour has significantly dropped; and these are the important business needs rather than just whether a box is up.

We briefly discussed the merits of the Unix command “uniq -c” for monitoring services, for example to find the counts of unique IP addresses. This is very useful for spotting patterns in your logs.

When upgrading data in SQL databases, the upgrade path is typically straightforward – you migrate it all via a migration, or the apps don’t work. In NoSQL databases, there is no schema, there may be multiple clients using the data imposing their own restrictions, and so it’s not so straightforward. The author suggests a “trickle then batch” approach of first converting the high priority items, then after a while, converting all the other items.
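
As a rough illustration of how “trickle then batch” might look, here is a small Scala sketch. This is my own example, not code from the book; the DocumentStore, the schemaVersion field and the placeholder transformation are all assumptions.

case class Document(id: String, schemaVersion: Int, payload: Map[String, String])

trait DocumentStore {
  def readRaw(id: String): Document
  def write(doc: Document): Unit
  def allIds: Iterator[String]
}

object TrickleThenBatch {
  val CurrentVersion = 2

  // Bring a single document up to the current schema version
  // (the transformation here is just a placeholder).
  def upgrade(doc: Document): Document =
    if (doc.schemaVersion >= CurrentVersion) doc
    else doc.copy(schemaVersion = CurrentVersion,
                  payload = doc.payload + ("newField" -> "default"))

  // Trickle: documents are upgraded as they are read, so the items that are
  // actually being used get converted first.
  def read(store: DocumentStore, id: String): Document = {
    val upgraded = upgrade(store.readRaw(id))
    store.write(upgraded)
    upgraded
  }

  // Batch: later, sweep through everything to convert whatever the trickle
  // phase has not already touched.
  def migrateRemaining(store: DocumentStore): Unit =
    store.allIds.foreach(id => read(store, id))
}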

We talked about API changes, and contract tests created by consumers, and versioning of APIs.

The final part of the book is about systemic problems – a grab bag of issues that didn’t come up elsewhere. Load testing scripts can be overly polite and well behaved, and then sites break when hit with real users that aren’t well behaved – so the load testing scripts should be more impolite. We talked about chaos – we’ve discussed chaos monkeys before, but there are various refinements to this idea. For example, a default “opt-in” for chaos monkeys, with the ability to opt-out if your service cannot tolerate chaos. Also, a “zombie apocalypse” – you send home a bunch of people, and see whether any of them are indispensable or not.

Fantastic Lambdas and How to Deploy Them

As mentioned previously, we are using terraform to spin up resources in AWS in an automated and repeatable fashion. Mostly it just works, but now and again things get tricky. We hit such a situation when automating the deployment of AWS Lambdas. We were using terraform to create AWS resources and then continuously deploying with ansible. So if the lambda source code changed, ansible would deploy the new version, while privileges and other plumbing were taken care of by terraform. It all seemed to work well, but trouble was lurking.

The problem was that when setting up the initial version of the lambda in terraform we were effectively creating it empty and leaving it up to ansible to deploy the actual code. This is fine up to the point you need to run your terraform script once again. Terraform defines its resources declaratively, so if additional resources or changes are needed you simply run the script again and everything is brought up to date. But when it came to the lambda it would say to itself “This lambda is declared as being empty, but it isn’t. I’ll fix that!”. So running the terraform script would wipe the source code. Oops.

We got around this by storing the lambda source code in s3 and always deploying from there. The terraform script ensures that the bucket and source zip exist, and creates the lambda using that source:

resource "aws_s3_bucket" "source_bucket" {
  bucket = "my-bucket-for-source"
}

resource "aws_s3_bucket_object" "lambda_source" {
  bucket = "${aws_s3_bucket.source_bucket.bucket}"
  key = "source.zip"
  source = "initial_empty_lambda.zip"
}

resource "aws_lambda_function" "my_lambda" {
  function_name = "my_lambda_function"
  s3_bucket = "${aws_s3_bucket.source_bucket.bucket}"
  s3_key = "source.zip"
  runtime = "nodejs4.3"
  environment {
    variables = {
      foo = "bar"
      bez = "baz"
    }
  }
}

Note that creating the zip in the way specified (without using the etag attribute) means that terraform only checks if the file exists in s3. Importantly it won’t overwrite an updated zip with the empty one later on…

Meanwhile, the ansible playbook uploads the latest zip to the s3 bucket and updates the lambda source using that. So now running terraform will not break the lambda, and sanity is restored.


Terraform tricks to cope with conditionally created resources

Terraform is a great tool and we use it extensively to spin up resources in AWS. It’s very easy to get started: the documentation is great, and once you have built yourself a development environment it’s just a matter of changing a few config settings to give yourself a staging and prod environment too.

It gets more tricky when the different environments are not exact copies of each other. For example, in development you might want to create an SNS topic to use for testing, whereas in staging and production you might want to use an external one.

Creating resources in some environments and not others can be controlled using the “count” parameter on a resource. So for an SNS topic that we only want to create in development, we can set an sns_count variable to 1 in development and 0 elsewhere, and use it to create the topic only in development.

But suppose we want to use the ARN of the created topic if it exists, or a fixed ARN of an external topic otherwise. Now it becomes even harder. Suppose that we are creating the SNS topic like this:

resource "aws_sns_topic" "sns_topic" {
  name = "our_topic"
  count = "${var.sns_count}"
}

and we have a variable topic_arn that will contain a fixed topic ARN to use if we have not created one. Terraform does not have any kind of general conditional logic, but it has a simple ternary operator, and there is a coalesce function which takes the first non-empty string from a list. So you might think that the following string interpolation would work:

"${coalesce(aws_sns_topic.sns_topic.arn, var.topic_arn)}"

Unfortunately it doesn’t because in the case where we do not create the topic, Terraform complains that there is no arn attribute.

So, to get around this we use this syntax instead:

"${coalesce(join("", aws_sns_topic.sns_topic.*.arn), var.topic_arn)}"

The .*.arn syntax returns a list of the ARNs for all the created topics, which will be either one or none. So we flatten this to a string with the join function and finally, it works as we want. Phew!

Lessons learnt from production problems

At last week’s dev meeting we swapped war stories about problems encountered in production, how we tracked down the root cause and lessons we learnt. Here are some highlights:

  • Environment is very important. When trying to reproduce a bug, you need to try to replicate the environment as closely as possible. Even very minor changes to libraries, frameworks, subsystems or operating system patches can make a major difference to whether the bug is seen or not.
  • “We haven’t touched anything”. The person telling you that nothing has changed may well believe this is the case. However, if a system mysteriously stops working, it’s best to actually verify that nothing has changed. There was a bit of moaning about Windows at this point. At least with text-file-based configuration you can easily do a diff; it’s not so easy when someone has unchecked a vital property in Windows config settings.
  • Test your systems like a real user. We discussed a problem on a website where the functionality in question had been tested thoroughly by people within the organisation. Unfortunately, when real external users came to use the site, it didn’t work as expected. This was due to a private address being used that could only be seen within the corporate network. So it worked for internal users, but not for real customers. If it had been tested via an external network this would have come to light before real users hit problems.
  • The bug may be in old code. It’s possible that the bug you are seeing has been lying dormant for years and is being triggered by some innocuous change. We talked about a situation where a new release would cause a site to have severe performance problems. Much time was spent looking at all the changes going into that release to see if there was some change to database access or the like causing the problem. In the end it transpired that a small change to the cookies used was triggering a latent bug in a script used for load balancing. This script had been running fine in production for years until this seemingly minor change caused it to fork processes like crazy and bring down the site.

Surprising output from Specs2

I noticed yesterday that some integration tests running on our CI server were producing large amounts of output. It turned out that 41,000 of the 43,000 lines were coming from a single test. The reason for this is the way Specs2 handles the contain() matcher with lists of Strings. It means that the following:

listOfIds must not(contain(badId))

is effectively the following, checking each string to see if it contains the given one:

listOfIds(0) must not(contain(badId))
listOfIds(1) must not(contain(badId))
listOfIds(2) must not(contain(badId))
....

So if you are looking at a long list, this yields some very verbose output. With types other than String, this seems to work as expected.

Virtuoso Jena Provider Problem

In a project that we’ve just started, we are using OpenLink Virtuoso as a triple store. I encountered a frustrating bug when accessing it via the Jena Provider where submitting a SPARQL query with a top-level LIMIT clause would return one less result than expected. In my case, the first query I tried was an existential query with LIMIT 1, so it caused much head scratching as to why I was getting no results.

Luckily OpenLink are responsive to issues raised on GitHub, so once I raised this issue and created an example project, it was quickly established that using the latest version of their JDBC4 jar fixed it. Problem solved.