Anyone who has been working in software development for more than a few months will know the ice-cold sensation that creeps over you when something isn’t working and you don’t know why. Luckily, all our team members have lived to tell the tale, and are happy to share their experiences so you might avoid these errors in future…
The Legend of the Kooky Configuration – Rhys Parsons
In my first job, in the late 90s, I was working on a project for West Midlands Fire Service (WMFS). We were replacing a key component (the Data Flow Controller, or DFC) that controlled radio transmitters and was a central hub (GD92 router) for communicating with appliances (fire engines). Communication with the Hill Top Sites (radio transmitters) was via an X.25 network.
The project was going well, we had passed the Factory Acceptance Tests, and it was time to test it on-site. By this point, I was working on the project on my own, even though I only had about two years of experience. I drove down to Birmingham from Hull with the equipment in a hired car, a journey of around 3.5 hours. The project had been going on for about a year by this point, so there was a lot riding on this test. WMFS had to change their procedures to mobilise fire engines via mobile phones instead of radio transmitters, which, back in the late 90s, was quite a slow process (30 seconds call setup). I plugged in the computers and waited for the Hill Top Sites to come online. They didn’t. I scratched my head. A lot. For an entire day. Pouring over code that looked fine. Then I packed it all up and drove back to Hull.
Back in the office, I plugged in the computer to test it. It worked immediately! Why?! How could it possibly have worked in Hull but not in Birmingham! It made absolutely no sense!
I hired a car for the next day and drove back down to Birmingham early, aiming to arrive just after 9, to avoid the shift change. By this point, I was tired and desperate.
I plugged the computer back in again. I had made absolutely no changes, but I could see no earthly reason why it wouldn’t work. “Sometimes,” I reasoned, “things just work.” That was my only hope. This was the second-day WMFS were using slower backup communications. One day was quite a good test of their resilience. Two days were nudging towards the unacceptable. Station Officers were already complaining. I stared at the screen, willing the red graphical LEDs to turn green. They remained stubbornly red. At the end of the day, I packed up the computer and drove back to Hull.
The WMFS project manager phoned my boss. We had a difficult phone conversation, and we decided I should go again the next day.
Thankfully, a senior engineer who had the experience of X.25 was in the office. I told him of this weird behaviour that made no sense. We spoke for about two minutes which concluded with him saying, “What does the configuration look like?”
My mouth dropped. The most obvious explanation. I hadn’t changed the X.25 addresses! I was so busy wondering how the code could be so weirdly broken that I hadn’t considered looking at the configuration. So, so stupid! I hadn’t changed the configuration since I first set up the system, several months earlier, it just wasn’t in my mind as something that needed doing.
Day three. Drove to Birmingham, feeling very nervous and stupid. Plugged in the computer. Changed the X.25 addresses. Held my breath. The graphical LEDs went from red to orange, and then each Hill Top Site went green in turn, as the transmit token was passed back and forth between them and the replacement DFC. Finally, success!
A Nightmare on Character Street – Rosie Chandler
We recently implemented a database hosted in AWS, with the password stored in the AWS Secrets Manager. The password is pulled into the database connection string, which ends up looking something like this:
Where %PASSWORD% here is substituted with the password pulled out of the Secrets Manager. We found one day that calls to the database started throwing connection exceptions after working perfectly fine up until that point.
After spending a lot of time scratching my head and searching through logs, I decided to take a peek into the Secrets Manager to see what the password was. Turns out that day’s password was something like =*&^%$ (note it starts with “=”) which means that the connection string for that day was invalid. After much facepalming, we implemented a one-line fix to ensure that “=” was added to the list of excluded characters for the password.
The Case of the Phantom Invoices – Chris Rimmer
Many years ago I was responsible for writing code that would email out invoices to customers. I was very careful to set things up so that when the code was tested it would send messages to a fake email system, not a real one. Unfortunately, this wasn’t set up in a very fail-safe way, meaning that another developer managed to activate the email job in the test system and sent real emails to real customers with bogus invoices in them. This is not the sort of mistake you quickly forget. Since then I’ve been very careful configuring email systems in test environments so that they can only send emails to internal addresses.
Tales from the Dropped Database – Rich Brown
It was a slow and rainy Thursday morning, I was just settling into my 3rd cup of coffee when a fateful email appeared with the subject ‘Live site down!!!’
Ah, of course, nothing like a production issue to kick-start your morning. I checked the site: it was indeed down. Sadly, the coffee would have to wait.
Logging onto the server, I checked the logs. A shiver ran down my spine.
ERROR: SQL Error – Table ‘users’ does not exist
ERROR: SQL Error – Table ‘articles’ does not exist
ERROR: SQL Error – Table ‘authors’ does not exist
ERROR: SQL Error – Database ‘live-db’ does not exist
Everything was working and then suddenly it stopped, no data existed.
Hopping onto the database server showed exactly that. Everything was gone, every row, every table, even the database itself wasn’t there.
I pulled in the rest of the team and we scratched our collective heads, how could this even happen? The database migration system shouldn’t be able to delete everything. We’ve taken all the right mitigations to prevent injection attacks. There’s no reason for our application to do this.
I asked, “What was everyone doing when the database disappeared?”
Dev 1 – “Writing code for this new feature”
Dev 2 – “Updating my local database”
Dev 3 – “Having a meeting”
Ok, follow up question to Dev 2 – “How did you update your database?”
Dev 2 – “I dropped it and let the app rebuild it as I usually do”
Me – “Show me what you did”
ERROR: SQL Error – Cannot drop database ‘live-db’ because it does not exist
Turned out Dev 2 had multiple SQL Server Manager instances open, one connected to their local test database and the other connected to the live system to query some existing data.
They thought they were dropping their local database and ended up dropping live by mistake.
One quick database restore and everything was back to normal.
Moral of the story, principle of least access. If you have a user who only needs to read data, only grant permissions to read data.