Understanding an “impossible” error

As discussed in the previous post on Sharing Failures, seeing how other people have dealt with bugs and errors can often help you avoid your own or give you ways to track down the source of a problem when one does make its appearance. So in that spirit, here is the story of a baffling error we fixed recently.

The error came from a content delivery platform we have been working on for a publisher client. At the point of a release and for several hours after we were seeing some errors, but there were a few reasons why this was very confusing.

The site is built using Scala / Play and uses Akka HTTP to make API calls between services. The error we were seeing was one that generally means that requests are coming in to a frontend service faster than the backend can service them:

BufferOverflowException: Exceeded configured max-open-requests value of [256]. This means that the request queue of this pool (........) has completely filled up because the pool currently does not process requests fast enough to handle the incoming request load. Please retry the request later. See https://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.]]

So apparently the pool of requests was filling up and causing a problem. But the first thing that was strange was that this was persisting for several hours after the release. At the point of a release it’s understandable that this error could occur with various services being started and stopped, causing requests to back up. After that the system was not under particularly high load, so why was this not just a transient issue?

The next thing that was strange was that we were only seeing this when users were accessing very particular content. We were only seeing it for access to content in reference works. These are what publishers confusingly call “databases” and cover things like encyclopedias, directories or dictionaries. But it wasn’t all databases, only certain ones and different ones at different times. On one occasion we would see a stream of errors for Encyclopedia A and then the next time we hit this error it would be Dictionary B generating the problems instead. If the cause was a pool of requests filling up, why would it affect particular pieces of content and not others, when they all use the same APIs?

Another thing that was puzzling – not every access to that database would generate an error. We’d either get an error or the content would be rendered fine, both very quickly. The error we were seeing suggested that things were running slowly somewhere, but the site seemed to be snappy, just producing intermittent errors for some content some of the time.

We spent lots of time reading Akka HTTP documentation trying to figure out how we could be seeing these problems, but it didn’t seem to make any sense. I had the feeling that I was missing something because the error seemed to be “impossible”. I even commented to a colleague that it felt like once we worked out what was going on I would talk about it at one of our dev forums. That prediction turned out to be true. Looking at Akka HTTP documentation would not help because the error message itself was in some sense a misdirection.

The lightbulb moment came when I spotted this code in our frontend code:

private lazy val databaseNameCache: LoadingCache[String, Future[DatabaseIdAndName]] = 
    CacheBuilder.newBuilder().refreshAfterWrite(4, TimeUnit.HOURS).....

We are using Guava’s LoadingCache to cache the mapping between the id of a database and its name since this almost never changes. (Sidenote: Guava’s cache support is great, also check out the Caffeine library inspired by it). The problem here is that we are not storing a DatabaseIdAndName object in the cache, but a Future. So we are in some sense putting the operation to fetch the database name into the cache. If that fails with an Exception, then every time we look in the cache for it we will replay the exception. Suddenly all the pieces fell into place. A transient error looking up a database name at release time was being put in a cache on one frontend server and replayed for hours. The whole akka pool thing was more or less irrelevant.

In the short term we fixed the problem by waiting for the concrete data to be returned to store that in the cache rather than a Future object. In that scenario, a failure to fetch the value would just yield an error and nothing would be cached for future look ups. However, much of the code using this cache is asynchronous, so it’s cleaner and probably better from a performance perspective if you can continue to use Future where possible. So the longer term solution was to revert to putting Future objects in the cache but carefully adding code to invalidate any cache entries that resolve to an exception.

I think the lesson here is – if an error doesn’t make sense then maybe some technical sleight-of-hand is going on and the error you are seeing is not the real problem. Maybe it’s all an illusion…