Understanding an “impossible” error

As discussed in the previous post on Sharing Failures, seeing how other people have dealt with bugs and errors can often help you avoid your own or give you ways to track down the source of a problem when one does make its appearance. So in that spirit, here is the story of a baffling error we fixed recently.

The error came from a content delivery platform we have been working on for a publisher client. At the point of a release and for several hours after we were seeing some errors, but there were a few reasons why this was very confusing.

The site is built using Scala / Play and uses Akka HTTP to make API calls between services. The error we were seeing was one that generally means that requests are coming in to a frontend service faster than the backend can service them:

BufferOverflowException: Exceeded configured max-open-requests value of [256]. This means that the request queue of this pool (........) has completely filled up because the pool currently does not process requests fast enough to handle the incoming request load. Please retry the request later. See https://doc.akka.io/docs/akka-http/current/scala/http/client-side/pool-overflow.html for more information.]]

So apparently the pool of requests was filling up and causing a problem. But the first thing that was strange was that this was persisting for several hours after the release. At the point of a release it’s understandable that this error could occur with various services being started and stopped, causing requests to back up. After that the system was not under particularly high load, so why was this not just a transient issue?

The next thing that was strange was that we were only seeing this when users were accessing very particular content. We were only seeing it for access to content in reference works. These are what publishers confusingly call “databases” and cover things like encyclopedias, directories or dictionaries. But it wasn’t all databases, only certain ones and different ones at different times. On one occasion we would see a stream of errors for Encyclopedia A and then the next time we hit this error it would be Dictionary B generating the problems instead. If the cause was a pool of requests filling up, why would it affect particular pieces of content and not others, when they all use the same APIs?

Another thing that was puzzling – not every access to that database would generate an error. We’d either get an error or the content would be rendered fine, both very quickly. The error we were seeing suggested that things were running slowly somewhere, but the site seemed to be snappy, just producing intermittent errors for some content some of the time.

We spent lots of time reading Akka HTTP documentation trying to figure out how we could be seeing these problems, but it didn’t seem to make any sense. I had the feeling that I was missing something because the error seemed to be “impossible”. I even commented to a colleague that it felt like once we worked out what was going on I would talk about it at one of our dev forums. That prediction turned out to be true. Looking at Akka HTTP documentation would not help because the error message itself was in some sense a misdirection.

The lightbulb moment came when I spotted this code in our frontend code:

private lazy val databaseNameCache: LoadingCache[String, Future[DatabaseIdAndName]] = 
    CacheBuilder.newBuilder().refreshAfterWrite(4, TimeUnit.HOURS).....

We are using Guava’s LoadingCache to cache the mapping between the id of a database and its name since this almost never changes. (Sidenote: Guava’s cache support is great, also check out the Caffeine library inspired by it). The problem here is that we are not storing a DatabaseIdAndName object in the cache, but a Future. So we are in some sense putting the operation to fetch the database name into the cache. If that fails with an Exception, then every time we look in the cache for it we will replay the exception. Suddenly all the pieces fell into place. A transient error looking up a database name at release time was being put in a cache on one frontend server and replayed for hours. The whole akka pool thing was more or less irrelevant.

In the short term we fixed the problem by waiting for the concrete data to be returned to store that in the cache rather than a Future object. In that scenario, a failure to fetch the value would just yield an error and nothing would be cached for future look ups. However, much of the code using this cache is asynchronous, so it’s cleaner and probably better from a performance perspective if you can continue to use Future where possible. So the longer term solution was to revert to putting Future objects in the cache but carefully adding code to invalidate any cache entries that resolve to an exception.

I think the lesson here is – if an error doesn’t make sense then maybe some technical sleight-of-hand is going on and the error you are seeing is not the real problem. Maybe it’s all an illusion…

The Trials and Tribulations of a Working Parent

I’ve worked in the science and technology sector for my whole career, starting off by completing a PhD in Physics, then migrating into computer modelling, and then into software development.  I had my son during the final year of my PhD (oh how naïve I was about how easy that would be) and then immediately hit the dilemma of “how are we going to pay for this?”

Students are not entitled to any kind of maternity leave or pay – when I made enquiries about this I was advised to quit my studies, which would have made me eligible for various benefits.  My department was much more accommodating and gave me four months paid maternity leave – something they were under no obligation to do.  I also managed to claim Maternity Allowance  (https://www.gov.uk/maternity-allowance/how-to-claim), because I had done some part-time maths lecturing which made me eligible.

When I made my first forays into real paid employment, I had to tackle the thorny issue of childcare.  Childminder or nursery?  Enlisting the grand-parents wasn’t an option, and neither was either myself or my husband becoming a stay-at-home parent.  We initially picked a nursery at my husband’s work-place, and then found one closer to home.  The fees were astronomical – larger than the mortgage – but we scraped by.  I had some comments (addressed to me, never to my husband) along the lines of “why bother having kids if you’re never going to see them?” which I shrugged off.

We timed having our second child so that we would not have two children in nursery at the same time for long – if nursery fees for one child were astronomical then double nursery fees were on a whole other level.  My employer at the time (not 67 Bricks, I should add) only offered the barest minimum maternity package, and so I could only afford to take four months’ maternity leave (my take-home pay went down to around £120 a week after the first 6-weeks of leave).  This was before the change in the law that would have allowed my husband to take some extended leave himself – when he requested to do so his employer said something along the lines of “you can do that when men start giving birth to babies”.

Despite the deficiencies in that company’s maternity policy, my immediate line manager was wonderfully accommodating.  He allowed me to have an arrangement where I worked in the office for about 5 hours a day and then completed the rest of my hours at home.  This enabled me to reduce the nursery hours from 9-4 rather than 8-6, and save money while spending more time with the kids.  He also gave me considerable flexibility around school holidays, and working from home on days when I had various child-related errands such as school plays, parent appointments etc.

Speaking of school, nobody ever tells you that having a school-age child is actually harder to fit a job around than having a nursery age child.  The school day ends at 3pm – and who is ready to finish work at 3pm???  Also, there are 13 weeks a year of school holidays to somehow cover.  At the time, the school did not have an after-school club (they started one up in later years) and so I had to find a child-minder who could do after-school pickups.  There was a holiday club at a school in the next village which I used – it was heavily sports-related, which my son in particular did not like, but I told him there was no choice and he had to go.

Over the next decade I went through various child-care arrangements, including nursery, childminders, after-school clubs, holiday clubs, and various forms of flexible working arrangement for both myself and my husband.  Our days were organised with military precision.  Drop the kids at school at 8:40.  Drive like a maniac to work, never being able to arrive earlier than 9:30.  And then having to leave at 5pm on the dot so that I could once more drive like a maniac to pick the kids up by 6pm or risk being fined (typically an immediate fine of £15 and then £10 for every extra 15 minutes you were late).  And the stress of sitting in a traffic jam on the motorway, watching the clock tick, wondering exactly how late I would be.  I changed jobs and my new line manager was equally wonderful – I was never quite able to complete my full hours during office-time, but he was perfectly fine with me making up time in the evenings.  And he never once quibbled when I said that I could not get in any earlier than 9:30, or stay any later than 5pm.

My kids are now in secondary school and make their own way to and from school, and are old enough to be left alone during school holidays.  Therefore I finally no longer have to worry about pick-ups, or astronomical childcare fees, and I don’t have to rush around like a loon trying to pick them up by a specific time.

The key takeaways from all this are:

  • Parents need flexibility.  We have schedules to meet, parents evenings and school plays to attend, and sometimes sick children to tend to.  Having an understanding manager who doesn’t watch the clock, and allows you to complete your working hours according to whatever pattern works, makes our lives so much easier.
  • Child-rearing is expensive.  Attractive maternity packages will improve your staff retention and employee satisfaction no end. The 67 Bricks maternity policy is better than many (Employees who have been here 2 years get 12 weeks on full pay and 12 weeks on half pay).
  • Dads need flexibility too.  For every dad who you allow to leave early to do a school run, there is probably a grateful mum who is able to get on with her own job without worry.  The number of dads standing at the school gate is getting bigger year-on-year.  When I was a child my dad was the only one at the gate, but these days it is much higher and that is only a good thing.  I have seen male colleagues experience discrimination in previous jobs, for example expressions of incredulity when they state an intention to take paternity leave.

Sharing Failures Dev Forum: The most “interesting” bugs you’ve caused

Ely (n.)

The first, tiniest inkling you get that something, somewhere, has gone terribly wrong.

The Deeper Meaning of Liff – Douglas Adams & John Lloyd (Thanks to Stephen for pointing this quote out)

At 67 Bricks we value open and honest communication, and aim to treat each other with respect and humility. Normalising failure helps everyone understand that mistakes happen and shouldn’t be used as a mechanism to punish but instead are an opportunity to learn and become more resilient to future failures. This week’s Dev Forum was on the topic of sharing failures, specifically talking about the most “interesting” bugs you have caused.

64 Bit vs 32 Bit Ubuntu

Our first fail came from a developer working on converting data from base 64 as part of a side project. Developing on the 64 bit version of Ubuntu, the code was stable and the tests were green and so they continued to build the application out. The application was intended to work on a number of different environments including 32 bit systems and this is where the first problems emerged.

The 64 bit OS zeroed out data after it was read and the code was designed to work with this, the 32 bit OS didn’t and instead exhibited random test failures. Sometimes expected values would match up and other times it would be wrong by a character or two, typically at the end of the strings. Finding the root cause required a trial and error approach to eliminate possibilities and eventually get a grasp on the problem and come up with a solution.

Transactions and Logging

Working on a client application this developer made two perfectly reasonable design decisions:

  1. Use transactions with the SQL database to ensure integrity of data changes
  2. Write logs to a table to ensure they are persisted in a convenient location

This approach was perfectly fine while the system worked, but the mystery began when the client started to complain about errors with the system. The logs didn’t show any error messages that lined up with what the client was reporting. Why? The same transaction was used to store logs and to update the data in SQL. If an exception was thrown, the transaction was rolled back preventing any data corruption problems, but also rolled back all the logs!

The application was changed to use a different transaction for logging to ensure logs were persisted. Using these logs meant the root cause of the client problem could be resolved and a lesson learnt about unintended consequences.

Overlogging

Another mystery around missing logs. A web application used a separate log shipping application to take logs from the server and send them to a remote server. However, under heavy loads the logs would become spotty and clear gaps appeared. The reason for this was due to the sheer volume of logs the shipper had to deal with eventually becoming too great and causing it to crash.

The solution was to reduce the number of logs written so the shipper would continue to function even when the main application was under heavy load. This triggered an interesting conversation at the dev forum on the ideas of how many logs should be written. Should you write everything including debug level logs to help with debugging faulty systems? Or should you write no logs whatsoever and only enable them when something starts going wrong?

Naturally we seemed to settle somewhere in the middle, but there were disagreements. Possibly a future dev forum topic?

The Wrong Emails

A client needed to receive important data daily via fax and the application did this by sending an email from a VB app to an address which would convert the body to a fax and send it on. The application needed testing and so had a separate test environment that was presumed to be safe to use for running tests.

It turned out that the test system was using the same email address as the live system, meaning the client was receiving both test and live data. Luckily the client caught on and asked why they were receiving two emails a day and the test system was quickly updated.

From then on this developer now always uses some form of override mechanism to prevent sending actual emails to end users. Others use apps like smtp4dev which will never send on emails (and just capture them so they can be inspected) or services like SES which can be configured to run in a sandbox mode.

Hidden Database Costs

AWS can be very costly, especially if no one is watching the monthly spend. In this case, one database drove up costs to astronomical new highs due to a single lambda running regularly and reading the entire database. The lambda was supposed to delete old data, but had a bug which meant it never deleted anything. Several months of operating the system meant a lot of data had piled up and so did the costs. The fix was simple enough to apply and we added additional monitoring to catch cases of runaway costs.

Similar experiences have been had before with runaway costs on clustered databases like MarkLogic. Normally a well-built index will be able to efficiently query the right data on the right node, but if the index is missing, MarkLogic will pull all the data across from the other node to evaluate it. This can drive up some eye-wateringly high data transfer prices in AWS and as such we now always monitor data transfer costs to ensure we haven’t missed an index.

Caching Fun

Our next issue is about users appearing to be logged in as other users. The system in question used CloudFront CDN to reduce server load and response times. The CDN differentiated between authenticated and unauthenticated users so different pages could be served for users who were authenticated.

The system made use of various headers set by lambdas to differentiate between authenticated and unauthenticated users. The problem arose when session handling was changed and the identifier used was accidentally stored within the CDN. This caused an issue where a user could load a page with a set-cookie header that set the identifier used for a different user’s session.

The team solved this bug by tweaking the edge lambdas to ensure only non-user specific data was cached. Caching in authenticated contexts can be challenging and need to be very carefully considered how they will be used reliably.

Deactivate / Activate

In this bug the business asked for a feature where user accounts could have a deactivated time set in the future that when reached, the user would be considered inactive and unable to access the system. This feature was implemented with a computed field in the SQL server which could be used to determine if the user is active or not.

As the system was already in use, migration scripts were developed to update the database. These needed to be applied in a certain order, however, deployment practices for this system meant that someone else applied the scripts and ended up causing an error where all users ended up deactivated preventing any users from accessing the system. To restore service, the database was rolled back and ultimately the feature was abandoned as too risky by the business.

Some viewed this as an example of why services should be responsible for their own database state and handle database migrations automatically.

Creating Bugs for Google to Index

Magic links can be a very useful feature for making authenticated access to a website easy, as long as these urls remain private to the correct user. In one case the url got cached by Google including the authentication token meaning anyone who could find the link would be able to access the authenticated content! This was fixed by asking Google to remove the URL, invalidating the token in the database and ensuring metadata was added to appropriate pages to prevent bots from indexing pages.

Another Google based bug next; after building a new system, the old one needed to be retired and part of this involved setting up permanent redirects to the new system. However, Google continues to serve up the old site’s urls as opposed to the new system’s urls, and a fix is still being worked out. A lesson learned on how important it is to carefully consider how search engines will crawl and store web sites.

As we see, failures can come from any number of sources, including ourselves. Bugs are a perfectly normal occurrence and working through them is an unavoidable part of building a robust system. By not fearing them, we can become more adept at fixing them and build better systems as a result.