failing – 67 Bricks blog

Ely (n.)

The first, tiniest inkling you get that something, somewhere, has gone terribly wrong.

The Deeper Meaning of Liff – Douglas Adams & John Lloyd (Thanks to Stephen for pointing this quote out)

At 67 Bricks we value open and honest communication, and aim to treat each other with respect and humility. Normalising failure helps everyone understand that mistakes happen and shouldn’t be used as a reason to punish; instead, they are an opportunity to learn and become more resilient to future failures. This week’s Dev Forum was on the topic of sharing failures, specifically talking about the most “interesting” bugs you have caused.

64 Bit vs 32 Bit Ubuntu

Our first fail came from a developer working on converting data from Base64 as part of a side project. Developing on the 64 bit version of Ubuntu, the code was stable and the tests were green, so they continued to build the application out. The application was intended to work in a number of different environments, including 32 bit systems, and this is where the first problems emerged.

The 64 bit OS zeroed out data after it was read, and the code was designed to work with this. The 32 bit OS didn’t, and instead exhibited random test failures: sometimes the expected values would match up, and other times the output would be wrong by a character or two, typically at the end of the strings. Finding the root cause required a trial and error approach to eliminate possibilities, get a grasp on the problem and eventually come up with a solution.
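The original code is not shown, so the following is only a Python simulation of this class of bug, not the developer's actual code. It assumes a buffer that is reused between reads; `read_into`, `decode_assuming_zeroed` and `decode_with_length` are hypothetical names. The fragile decoder only works if something zeroes the buffer between reads, while the robust one uses the length reported by the read.

```python
def read_into(buffer: bytearray, payload: bytes) -> int:
    """Copy payload into the start of buffer, as a low-level read would.

    Returns the number of bytes written. Bytes beyond that are left untouched.
    """
    buffer[: len(payload)] = payload
    return len(payload)


def decode_assuming_zeroed(buffer: bytearray) -> str:
    """Fragile: treats the first zero byte as the end of the data."""
    end = buffer.find(0)
    if end == -1:
        end = len(buffer)
    return buffer[:end].decode("ascii")


def decode_with_length(buffer: bytearray, length: int) -> str:
    """Robust: trusts the length returned by the read, not the buffer contents."""
    return buffer[:length].decode("ascii")


buffer = bytearray(16)                   # starts out zeroed
read_into(buffer, b"aGVsbG8h")           # first Base64 string, 8 bytes
length = read_into(buffer, b"aGk=")      # shorter second string, 4 bytes

stale = decode_assuming_zeroed(buffer)   # picks up leftovers from the first read
clean = decode_with_length(buffer, length)
```

Here `stale` carries trailing characters left over from the earlier, longer read, which is the same kind of symptom as strings that are wrong by a character or two at the end.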

Transactions and Logging

Working on a client application this developer made two perfectly reasonable design decisions:

Use transactions with the SQL database to ensure integrity of data changes
Write logs to a table to ensure they are persisted in a convenient location

This approach was perfectly fine while the system worked, but the mystery began when the client started to complain about errors with the system. The logs didn’t show any error messages that lined up with what the client was reporting. Why? The same transaction was used to store logs and to update the data in SQL. If an exception was thrown, the transaction was rolled back preventing any data corruption problems, but also rolled back all the logs!

The application was changed to use a different transaction for logging to ensure logs were persisted. Using these logs meant the root cause of the client problem could be resolved and a lesson learnt about unintended consequences.
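The effect is easy to reproduce. The sketch below uses Python's built-in `sqlite3` module rather than the client's actual database, and the table names, function names and simulated failure are all assumptions made for illustration. The first function logs inside the data transaction, so the rollback erases the log rows; the second writes the log on a separate connection, in its own transaction, after the data transaction has been rolled back.

```python
import os
import sqlite3
import tempfile


def new_database() -> str:
    """Create a throwaway database file with one account and an empty log table."""
    path = os.path.join(tempfile.mkdtemp(), "app.db")
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, balance INTEGER)")
    conn.execute("CREATE TABLE logs (message TEXT)")
    conn.execute("INSERT INTO accounts VALUES (1, 100)")
    conn.commit()
    conn.close()
    return path


def log_messages(conn: sqlite3.Connection) -> list:
    """Return every persisted log message."""
    return [row[0] for row in conn.execute("SELECT message FROM logs")]


def update_in_shared_transaction(conn: sqlite3.Connection) -> None:
    """Buggy pattern: the log rows live in the same transaction as the data."""
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute("INSERT INTO logs VALUES ('starting update')")
            conn.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
            conn.execute("INSERT INTO logs VALUES ('error: simulated failure')")
            raise RuntimeError("simulated failure")
    except RuntimeError:
        pass  # the rollback has already discarded both log rows


def update_with_separate_log(data_conn: sqlite3.Connection,
                             log_conn: sqlite3.Connection) -> None:
    """Fixed pattern: the error is logged in a transaction of its own."""
    try:
        with data_conn:
            data_conn.execute("UPDATE accounts SET balance = balance - 10 WHERE id = 1")
            raise RuntimeError("simulated failure")
    except RuntimeError as exc:
        # The data transaction is rolled back by now; this commit is independent.
        with log_conn:
            log_conn.execute("INSERT INTO logs VALUES (?)", (f"error: {exc}",))
```

In both cases the balance is left untouched by the rollback, but only the second version leaves an error message behind in the `logs` table.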

Overlogging

Another mystery around missing logs. A web application used a separate log shipping application to take logs from the server and send them to a remote server. However, under heavy load the logs would become spotty and clear gaps appeared. The sheer volume of logs eventually became too great for the shipper to deal with and caused it to crash.

The solution was to reduce the number of logs written so the shipper would continue to function even when the main application was under heavy load. This triggered an interesting conversation at the dev forum about how much should be logged. Should you write everything, including debug level logs, to help with debugging faulty systems? Or should you write no logs whatsoever and only enable them when something starts going wrong?

Naturally we seemed to settle somewhere in the middle, but there were disagreements. Possibly a future dev forum topic?
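One possible middle ground, sketched here with Python's standard `logging` module, is to ship only warnings and errors by default and to raise the verbosity when a problem needs investigating. This is an illustration rather than the configuration of the system described above: `CollectingHandler` is a hypothetical stand-in for the log shipper, and `build_logger` and its default level are assumptions.

```python
import logging

LEVELS = {
    "DEBUG": logging.DEBUG,
    "INFO": logging.INFO,
    "WARNING": logging.WARNING,
    "ERROR": logging.ERROR,
}


class CollectingHandler(logging.Handler):
    """Stand-in for the log shipper: remembers every message it would ship."""

    def __init__(self) -> None:
        super().__init__()
        self.messages = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def build_logger(name: str, level_name: str = "WARNING"):
    """Return a logger and its handler; records below level_name are dropped."""
    logger = logging.getLogger(name)
    logger.setLevel(LEVELS.get(level_name.upper(), logging.WARNING))
    logger.propagate = False          # keep the example isolated from root handlers
    handler = CollectingHandler()
    logger.handlers = [handler]
    return logger, handler


# Normal running: debug and info records never reach the shipper.
quiet_logger, quiet_shipper = build_logger("app.quiet")
quiet_logger.debug("cache miss for key 42")
quiet_logger.error("payment failed")

# Investigating a fault: the same code path ships everything.
verbose_logger, verbose_shipper = build_logger("app.verbose", "DEBUG")
verbose_logger.debug("cache miss for key 42")
verbose_logger.error("payment failed")
```

The trade-off is the one discussed at the forum: the quiet setting protects the shipper under load, but the debug detail is not there after the fact unless the level was raised before the fault occurred.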

The Wrong Emails

A client needed to receive important data daily via fax and the application did this by sending an email from a VB app to an address which would convert the body to a fax and send it on. The application needed testing and so had a separate test environment that was presumed to be safe to use for running tests.

It turned out that the test system was using the same email address as the live system, meaning the client was receiving both test and live data. Luckily the client caught on and asked why they were receiving two emails a day and the test system was quickly updated.

Since then, this developer has always used some form of override mechanism to prevent sending actual emails to end users. Others use apps like smtp4dev, which never sends emails on (it just captures them so they can be inspected), or services like SES, which can be configured to run in a sandbox mode.
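An override mechanism of this kind can be very small. The sketch below is a hypothetical Python version, not the developer's VB code: outside production it replaces every recipient with a single safe inbox and records the original recipients in a header. The function name, the environment names, the `X-Original-Recipients` header and the example addresses are all assumptions.

```python
from email.message import EmailMessage

RECIPIENT_HEADERS = ("To", "Cc", "Bcc")


def redirect_unless_production(message: EmailMessage,
                               environment: str,
                               override_address: str) -> EmailMessage:
    """Outside production, send the message to override_address only."""
    if environment == "production":
        return message
    originals = []
    for header in RECIPIENT_HEADERS:
        originals.extend(message.get_all(header, []))
        del message[header]           # no error if the header is absent
    message["To"] = override_address
    # Keep the intended recipients visible for anyone inspecting the test email.
    message["X-Original-Recipients"] = ", ".join(originals)
    return message


report = EmailMessage()
report["To"] = "client@example.com"
report["Subject"] = "Daily data"
report.set_content("Test run output")

safe = redirect_unless_production(report, "test", "dev-inbox@example.com")
```

Making the safe behaviour the default for every environment that is not explicitly production means a misconfigured test system fails towards the developer's inbox rather than the client's.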

Hidden Database Costs

AWS can be very costly, especially if no one is watching the monthly spend. In this case, one database drove up costs to astronomical new highs due to a single lambda running regularly and reading the entire database. The lambda was supposed to delete old data, but had a bug which meant it never deleted anything. Several months of operating the system meant a lot of data had piled up and so did the costs. The fix was simple enough to apply and we added additional monitoring to catch cases of runaway costs.

Similar experiences have been had before with runaway costs on clustered databases like MarkLogic. Normally a well-built index will be able to efficiently query the right data on the right node, but if the index is missing, MarkLogic will pull all the data across from the other node to evaluate it. This can drive up some eye-wateringly high data transfer prices in AWS and as such we now always monitor data transfer costs to ensure we haven’t missed an index.

Caching Fun

Our next issue is about users appearing to be logged in as other users. The system in question used CloudFront CDN to reduce server load and response times. The CDN differentiated between authenticated and unauthenticated users so different pages could be served for users who were authenticated.

The system made use of various headers set by lambdas to differentiate between authenticated and unauthenticated users. The problem arose when session handling was changed and the identifier used was accidentally stored within the CDN. This caused an issue where a user could load a page with a set-cookie header that set the identifier used for a different user’s session.

The team solved this bug by tweaking the edge lambdas to ensure only non-user-specific data was cached. Caching in authenticated contexts can be challenging, and how cached responses will be reused needs to be considered very carefully.
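The team's edge lambdas are not shown, so the following Python function is only a sketch of the kind of check such a fix might involve. It assumes that a response is user specific if it sets a cookie or is marked `private` or `no-store` in its `Cache-Control` header; `is_safe_to_cache` is a hypothetical name.

```python
from typing import Mapping


def is_safe_to_cache(response_headers: Mapping[str, str]) -> bool:
    """Return True only when a response looks safe to share between users."""
    headers = {name.lower(): value.lower() for name, value in response_headers.items()}
    # A cached Set-Cookie header would hand one user's session to the next visitor.
    if "set-cookie" in headers:
        return False
    directives = {part.strip() for part in headers.get("cache-control", "").split(",")}
    if directives & {"private", "no-store"}:
        return False
    return True


shared_page = {"Content-Type": "text/html", "Cache-Control": "public, max-age=60"}
login_response = {"Content-Type": "text/html", "Set-Cookie": "session=abc123"}

cache_shared_page = is_safe_to_cache(shared_page)
cache_login_response = is_safe_to_cache(login_response)
```

A check like this is deliberately conservative: it would rather skip caching a harmless response than store one that identifies a user.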

Deactivate / Activate

In this case the business asked for a feature where a user account could have a deactivation time set in the future. Once that time was reached, the user would be considered inactive and unable to access the system. The feature was implemented with a computed field in the SQL server which could be used to determine whether the user was active.
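The rule behind such a computed field can be written down in a few lines. This Python sketch is not the original SQL; it assumes that an account with no deactivation time set never expires, and `is_active` is a hypothetical name.

```python
from datetime import datetime, timezone
from typing import Optional


def is_active(deactivated_at: Optional[datetime], now: datetime) -> bool:
    """A user is active until their deactivation time, if one is set."""
    if deactivated_at is None:
        # Assumption: no deactivation time means the account stays active.
        return True
    return now < deactivated_at


now = datetime(2024, 1, 15, tzinfo=timezone.utc)
never_deactivated = is_active(None, now)
deactivates_later = is_active(datetime(2024, 2, 1, tzinfo=timezone.utc), now)
already_deactivated = is_active(datetime(2024, 1, 1, tzinfo=timezone.utc), now)
```

The case with no deactivation time is worth testing explicitly, because it covers every existing account at the moment the feature is rolled out.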

As the system was already in use, migration scripts were developed to update the database, and these needed to be applied in a certain order. However, deployment practices for this system meant that someone else applied the scripts, which caused an error that left all users deactivated and unable to access the system. To restore service, the database was rolled back, and ultimately the business abandoned the feature as too risky.

Some viewed this as an example of why services should be responsible for their own database state and handle database migrations automatically.

Creating Bugs for Google to Index

Magic links can be a very useful feature for making authenticated access to a website easy, as long as these URLs remain private to the correct user. In one case the URL, including the authentication token, got cached by Google, meaning anyone who could find the link would be able to access the authenticated content! This was fixed by asking Google to remove the URL, invalidating the token in the database and ensuring metadata was added to the appropriate pages to prevent bots from indexing them.
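Beyond the fixes described above, one way to limit the damage from a leaked link is to make tokens short-lived and single-use. The sketch below is an illustration in Python, not the system's actual implementation: `MagicLinkStore` is a hypothetical in-memory stand-in for the token table, and the 15 minute lifetime is an arbitrary assumption. The `X-Robots-Tag` response header shown at the end is a standard way of asking crawlers not to index a page, alongside a robots meta tag in the HTML.

```python
import secrets
import time
from typing import Dict, Optional, Tuple


class MagicLinkStore:
    """In-memory stand-in for a token table: tokens expire and work only once."""

    def __init__(self, ttl_seconds: float = 900.0) -> None:
        self._ttl = ttl_seconds
        self._tokens: Dict[str, Tuple[str, float]] = {}

    def issue(self, user_id: str, now: Optional[float] = None) -> str:
        """Create an unguessable token for user_id and remember when it expires."""
        now = time.time() if now is None else now
        token = secrets.token_urlsafe(32)
        self._tokens[token] = (user_id, now + self._ttl)
        return token

    def redeem(self, token: str, now: Optional[float] = None) -> Optional[str]:
        """Return the user id for a valid token, or None. The token is consumed."""
        now = time.time() if now is None else now
        entry = self._tokens.pop(token, None)   # removed even if it has expired
        if entry is None:
            return None
        user_id, expires_at = entry
        return user_id if now < expires_at else None


# Headers to send with any page reached through a magic link.
NOINDEX_HEADERS = {"X-Robots-Tag": "noindex, nofollow"}

store = MagicLinkStore(ttl_seconds=60)
token = store.issue("alice", now=1000.0)
first_use = store.redeem(token, now=1010.0)
second_use = store.redeem(token, now=1011.0)
```

With this design, a link that ends up in a search index is useless once its owner has clicked it or the lifetime has passed.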

Another Google-based bug next: after building a new system, the old one needed to be retired, and part of this involved setting up permanent redirects to the new system. However, Google continues to serve up the old site’s URLs as opposed to the new system’s URLs, and a fix is still being worked out. It was a lesson in how important it is to carefully consider how search engines will crawl and store websites.

As we see, failures can come from any number of sources, including ourselves. Bugs are a perfectly normal occurrence and working through them is an unavoidable part of building a robust system. By not fearing them, we can become more adept at fixing them and build better systems as a result.


Sharing Failures Dev Forum: The most “interesting” bugs you’ve caused