Spooky season special – tales of terrors and errors

Anyone who has been working in software development for more than a few months will know the ice-cold sensation that creeps over you when something isn’t working and you don’t know why. Luckily, all our team members have lived to tell the tale, and are happy to share their experiences so you might avoid these errors in future… 

The Legend of the Kooky Configuration – Rhys Parsons
In my first job, in the late 90s, I was working on a project for West Midlands Fire Service (WMFS). We were replacing a key component (the Data Flow Controller, or DFC) that controlled radio transmitters and was a central hub (GD92 router) for communicating with appliances (fire engines). Communication with the Hill Top Sites (radio transmitters) was via an X.25 network.

The project was going well: we had passed the Factory Acceptance Tests, and it was time to test it on-site. By this point, I was working on the project on my own, even though I only had about two years of experience. I drove down to Birmingham from Hull with the equipment in a hired car, a journey of around 3.5 hours. The project had been going on for about a year by this point, so there was a lot riding on this test. WMFS had to change their procedures to mobilise fire engines via mobile phones instead of radio transmitters, which, back in the late 90s, was quite a slow process (30-second call setup). I plugged in the computers and waited for the Hill Top Sites to come online. They didn’t. I scratched my head. A lot. For an entire day. Poring over code that looked fine. Then I packed it all up and drove back to Hull.

Back in the office, I plugged in the computer to test it. It worked immediately! Why?! How could it possibly have worked in Hull but not in Birmingham! It made absolutely no sense!

I hired a car for the next day and drove back down to Birmingham early, aiming to arrive just after 9, to avoid the shift change. By this point, I was tired and desperate.

I plugged the computer back in again. I had made absolutely no changes, but I could see no earthly reason why it wouldn’t work. “Sometimes,” I reasoned, “things just work.” That was my only hope. This was the second day WMFS were using slower backup communications. One day was quite a good test of their resilience. Two days were nudging towards the unacceptable. Station Officers were already complaining. I stared at the screen, willing the red graphical LEDs to turn green. They remained stubbornly red. At the end of the day, I packed up the computer and drove back to Hull.

The WMFS project manager phoned my boss. We had a difficult phone conversation, and we decided I should go again the next day.

Thankfully, a senior engineer who had experience of X.25 was in the office. I told him of this weird behaviour that made no sense. We spoke for about two minutes, which concluded with him asking, “What does the configuration look like?”

My mouth dropped. The most obvious explanation. I hadn’t changed the X.25 addresses! I was so busy wondering how the code could be so weirdly broken that I hadn’t considered looking at the configuration. So, so stupid! I hadn’t changed the configuration since I first set up the system, several months earlier; it just wasn’t in my mind as something that needed doing.

Day three. Drove to Birmingham, feeling very nervous and stupid. Plugged in the computer. Changed the X.25 addresses. Held my breath. The graphical LEDs went from red to orange, and then each Hill Top Site went green in turn, as the transmit token was passed back and forth between them and the replacement DFC. Finally, success!

A Nightmare on Character Street – Rosie Chandler
We recently implemented a database hosted in AWS, with the password stored in the AWS Secrets Manager. The password is pulled into the database connection string, which ends up looking something like this:

“Server=myfunkyserver;Port=1234;Database=mycooldatabase;User ID=bigboss;Password=%PASSWORD%”

where %PASSWORD% is substituted with the password pulled out of Secrets Manager. We found one day that calls to the database started throwing connection exceptions after working perfectly fine up until that point.

After spending a lot of time scratching my head and searching through logs, I decided to take a peek into the Secrets Manager to see what the password was. It turned out that day’s password was something like =*&^%$ (note that it starts with “=”), which meant that the connection string for that day was invalid. After much facepalming, we implemented a one-line fix to ensure that “=” was added to the list of excluded characters for the password.
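
If you generate the password programmatically, Secrets Manager can exclude troublesome characters at generation time. As a rough sketch (using the AWS SDK for Java v2 from Scala – this is illustrative, not the exact fix we shipped):

import software.amazon.awssdk.services.secretsmanager.SecretsManagerClient
import software.amazon.awssdk.services.secretsmanager.model.GetRandomPasswordRequest

object DatabasePasswords {
  private val client = SecretsManagerClient.create()

  // Characters that would break a "key=value;key=value" connection string
  private val unsafeCharacters = "=;"

  def generate(): String = {
    val request = GetRandomPasswordRequest.builder()
      .passwordLength(32L)
      .excludeCharacters(unsafeCharacters)
      .build()
    client.getRandomPassword(request).randomPassword()
  }
}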

The Case of the Phantom Invoices – Chris Rimmer
Many years ago I was responsible for writing code that would email out invoices to customers. I was very careful to set things up so that when the code was tested it would send messages to a fake email system, not a real one. Unfortunately, this wasn’t set up in a very fail-safe way, meaning that another developer managed to activate the email job in the test system and sent real emails to real customers with bogus invoices in them. This is not the sort of mistake you quickly forget. Since then I’ve been very careful configuring email systems in test environments so that they can only send emails to internal addresses.
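
One way to make that kind of configuration fail-safe is to enforce the restriction in code as well. Here is a minimal sketch in Scala – the names and the allowed domain are hypothetical, not the actual implementation:

case class EmailMessage(to: String, subject: String, body: String)

// Wraps a real email sender and refuses to send to external addresses
// unless we are genuinely running in production.
class SafeEmailSender(deliver: EmailMessage => Unit, isProduction: Boolean) {
  private val allowedDomains = Set("ourcompany.example") // internal domains only

  def send(message: EmailMessage): Unit = {
    val domain = message.to.dropWhile(_ != '@').drop(1).toLowerCase
    if (isProduction || allowedDomains.contains(domain))
      deliver(message)
    else
      println(s"Blocked email to ${message.to}: test environments may only email internal addresses")
  }
}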


Tales from the Dropped Database – Rich Brown
It was a slow and rainy Thursday morning, I was just settling into my 3rd cup of coffee when a fateful email appeared with the subject ‘Live site down!!!’

Ah, of course, nothing like a production issue to kick-start your morning. I checked the site: it was indeed down. Sadly, the coffee would have to wait.

Logging onto the server, I checked the logs. A shiver ran down my spine.

ERROR: SQL Error – Table ‘users’ does not exist

ERROR: SQL Error – Table ‘articles’ does not exist

ERROR: SQL Error – Table ‘authors’ does not exist

ERROR: SQL Error – Database ‘live-db’ does not exist

That’s… unusual…

Everything had been working, and then it suddenly stopped; no data existed.

Hopping onto the database server showed exactly that. Everything was gone, every row, every table, even the database itself wasn’t there.

I pulled in the rest of the team and we scratched our collective heads: how could this even happen? The database migration system shouldn’t be able to delete everything. We’d taken all the right precautions to prevent injection attacks. There was no reason for our application to do this.

I asked, “What was everyone doing when the database disappeared?”

Dev 1 – “Writing code for this new feature”

Dev 2 – “Updating my local database”

Dev 3 – “Having a meeting”

OK, follow-up question to Dev 2 – “How did you update your database?”

Dev 2 – “I dropped it and let the app rebuild it as I usually do”

Me – “Show me what you did”

ERROR: SQL Error – Cannot drop database ‘live-db’ because it does not exist

Turned out Dev 2 had multiple SQL Server Manager instances open, one connected to their local test database and the other connected to the live system to query some existing data.

They thought they were dropping their local database and ended up dropping live by mistake.

One quick database restore and everything was back to normal.

Moral of the story: the principle of least access. If you have a user who only needs to read data, grant them permission only to read data.

Wortüberbreitedarstellungsproblem

Don’t worry if you don’t understand German, the title of this post will make sense if you read on…

We’ve been working for the last few years with De Gruyter to rebuild their delivery platform. This has worked well and we have picked up an award along the way. Part of our approach has been to push out new features and improvements to the site on a weekly basis. Yesterday we did this, deploying a new home page design that has been a month or two in the making. The release went fine, but then we started getting reports that the new home page didn’t look quite right for users on iPhones and iPads. I took a look – it seemed fine on my Android phone and on my daughter’s iPhone. A developer based in India looked on his iPhone with different browsers and everything was as expected. But somehow German users were seeing text that overflowed the edge of the page. So what was going on – how could German Apple devices be so different? Most odd.

It turned out that the problem was not a peculiarity of the German devices, but of the German language. German is famous for its long compound words (like the title of this post) and often uses one big word where English would use a phrase. Our new homepage includes a grid of subjects that are covered in books that De Gruyter publishes. In English these subjects mostly have quite short names, but in German they can be quite long. For smaller screens the subject grid would shift from three to two columns, but even so this was not enough to accommodate the long German words, meaning the page overflowed.

Subject grid in German

The fix was quite a simple one: for the German version of the page, the grid would shift to two columns more readily and then to a single column for a phone screen. But I think the lesson is that there is more to catering for different languages than checking that the site looks fine in English and that all the text has been translated. The features of the target language can have unexpected effects and need checking. It’s easy to overlook this when dealing with two languages that are apparently quite similar.

On a similar note, it can be easy to be complacent that your site is easy to use because you understand it or believe it is accessible to those using a screen reader because you have added alt text onto images. Just because it works for you doesn’t mean it works for others and that always needs bearing in mind.

Finally, that title? It translates as something like “Overly wide word display problem” and was suggested by someone at De Gruyter as a German compound word to describe the problem we saw.

Embracing Impermanence (or how to check my sbt build works)

Stable trading relationships with nearby countries. Basic human rights. A planet capable of sustaining life. What do these three things have in common?

The answer is that they are all impermanent. One moment we have them, the next moment – whoosh! – they’re gone.

Today I decided I would embrace our new age of impermanence insofar as it pertains to my home directory. Specifically, I wondered whether I could configure a Linux installation so that my home directory was mounted in a ramdisk, created afresh each time I rebooted the server.

Why on earth would I want to do something like that?

The answer is that I have a Scala project, built using sbt (the Scala Build Tool), and I thought I’d clear some of the accumulated cruft out of the build.sbt file, starting with the configured resolvers. These are basically the repositories which will be searched for the project’s dependencies – there were a few special case ones (e.g. one for JGit, another for MarkLogic) and I strongly suspected that the dependencies in question would now be found in the standard Maven repository. So they could probably be removed, but how to check, since all of the dependencies would now exist in caches on my local machine?
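
For illustration, the kind of build.sbt fragment I wanted to remove looked roughly like this (the resolver names and URLs here are placeholders, not the real entries):

// Special-case resolvers that may no longer be needed
resolvers ++= Seq(
  "JGit releases"      at "https://example.org/jgit-repo/",
  "MarkLogic releases" at "https://example.org/marklogic-repo/"
)
// If every dependency still resolves from Maven Central with these lines
// deleted, they can go for good.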

A simple solution would have been to delete the caches, but that involves putting some effort into finding them, plus I have developed a paranoid streak about triggering unnecessary file writes on my SSD. So I had a cunning plan – build a VirtualBox VM and arrange for the home directory on it to be a ramdisk; that way I could check the code out to it and verify that it would build from such a checkout, and this would then be a useful resource for conducting similar experiments in the future.

Obviously this is not quite a trivial undertaking, because I need some bits of the home directory (specifically the .ssh directory) to persist so I can create the SSH keys needed to authenticate with GitHub (and our internal GitLab). Recreating those each time the machine booted would be a pain.

After a bit of fiddling, my home-grown solution went something like this:

  • Create a VirtualBox VM, give it 8G memory and a 4G disk (maybe a bit low if you think you’ll want Docker images on it; I subsequently ended up creating a bigger disk and mounting it on /var/lib/docker)
  • Log into VM (my user is daniel), install useful things like Git, curl, zip, unzip etc.
  • Create SSH keys, upload to GitHub / GitLab / wherever
  • Install SDKMAN! to manage Java versions
  • Create /var/daniel and copy into it all of the directories and files in my home directory which I wanted to be persisted; these were basically .ssh for SSH keys, .sdkman for Java installations, .bashrc which now contains the SDKMAN! init code, and .profile
  • Save the following script as /usr/local/bin/create_home_dir.sh – this wipes out /home/daniel, recreates it and mounts it as tmpfs (i.e. a ramdisk) and then symlinks into it the stuff I want to persist (everything in /var/daniel)
#!/bin/bash
DIR=/home/daniel

# If a previous ramdisk is still mounted, unmount it first
mount | grep $DIR && umount $DIR

# Remove any existing home directory and recreate it empty
[ -d $DIR ] && rm -rf $DIR
mkdir $DIR

# Mount the new home directory as a tmpfs ramdisk and hand it to daniel
mount -t tmpfs tmpfs $DIR
chown -R daniel:daniel $DIR

# Symlink in everything we want to persist from /var/daniel
ls -A /var/daniel | while read FILE
do
  sudo -u daniel ln -s /var/daniel/$FILE $DIR/
done
  • Save the following as /etc/systemd/system/create-home-dir.service
[Unit]
Description=Create home directory

[Service]
ExecStart=/usr/local/bin/create_home_dir.sh

[Install]
WantedBy=multi-user.target
  • Enable the service with systemctl enable create-home-dir
  • Reboot and hope

And it turns out that this worked; when the server came back I could ssh into it (i.e. the authorized_keys file was recognised in the symlinked .ssh directory) and I had a nice empty workspace; I could git clone the repo I wanted, then build it and watch all of the dependencies get downloaded successfully. I note with interest that having done this, .cache is 272M and .sbt is 142M in size. That seems to be quite a lot of downloading! But at least it’s all in memory, and will vanish when the VM is switched off…

STEM Ambassadors In the Field

(Or a fun way to introduce local kids to programming)

A previous employer encouraged me to join the STEM Ambassador program at the end of 2017 (https://www.stem.org.uk/stem-ambassadors) and I willingly joined, wanting to give something back to society. The focus of the program is to send ambassadors into schools and local communities, to act as role models and to demonstrate to young people the benefits and rewards that studying STEM subjects can bring. I approached my local primary school (at the time my daughter was a pupil there) about the possibility of setting up an after-school computing club, and they jumped at the chance.

I started the club unsure what to expect, but with a lot of hope and some amount of trepidation. I took on groups of 10 or so KS2 pupils, teaching them the basics of loops, events, variables and functions, largely using Scratch (https://scratch.mit.edu/) and an eclectic mix of programmable robots that I’d acquired over the years (I have a few from Wonder Workshop https://www.makewonder.com/robots/ and also a pair of Lego Boost robots https://www.lego.com/en-gb/product/boost-creative-toolbox-17101). Running the club was extremely rewarding. Some of the kids were brilliant, and will no doubt have a great future ahead of them. Others mainly just wanted to drive the robots around – but I figured that as long as they were having fun, their time and mine was well spent.

Then, in March 2020, the Covid-19 pandemic hit. Kids were sent home for months, and all clubs were cancelled, with no way of knowing when they might start up again. The pandemic has obviously been tough for everyone, but one of the hidden effects has been the impact on the education of our children. It will take years, probably, to know exactly what effect two years of lockdown has had on the attainment opportunities and mental health of young people. Many of them missed out not only on in-person schooling, but also on the additional extra-curricular opportunities like school visits, and things like the STEM Ambassador program.

So now, two years and a change of jobs later, I thought it was about time I got myself back in the field and started my STEM activities up again.

My first opportunity has been to run a “retro games arcade” stall at the school’s summer fair. This involved commandeering a tiny wooden cabin plonked the wrong-way-round on the edge of the school field, next to one of the temporary classrooms. To turn this into a games arcade I needed to black out the windows to make it dark enough inside to see a computer screen, then to run an extension lead out of the window of the classroom, and to quietly steal a few chairs and tables upon which to set up my “arcade consoles”. Blacking out the windows was achieved by covering them up with garden underlay and sticking drawing pins around the edges (much to the detriment of my poor thumbs).

The field and cabin in which I did my STEM Ambassadoring, with the (mostly willing) assistance of my daughter

For the arcade machines, I wrote two games in Scratch based around the classic arcade games “Defender” and “Frogger”. I set up two laptops to run these games, covering over all but the arrow keys, trackpad and spacebar with shiny card. My aim was to write games that the students could replicate themselves, if they wished. I wanted games simple enough for a small child to play, but that would also be fun for an older child or a parent. The gameplay should ideally last for 1-4 minutes, and the player should be able to accumulate a high score. As the afternoon progressed I kept track of the highest two scores in each game so that the players with these scores could win a prize at the end of the afternoon.

If you’re interested in seeing these games then you can have a look here:

Defender: https://scratch.mit.edu/projects/711066827/

Frogger: https://scratch.mit.edu/projects/317968991/

Of course the afternoon in question was one of the hottest days of the year. I spent 3 hours diving into and out of the tiny sweltering cabin, caught between managing the queue, taking the 50p fee, handing out Pokémon cards to the players (I got a stack of them and gave one out to every player), explaining to the kids how to play the games, and keeping track of the ever-changing high scores. I did have willing help from my daughter (who especially liked taking the money) and my husband (who seemed adept at managing the queue). At some point I managed to eat a burger and grab a drink, but it was a pretty frenetic afternoon.

67 Bricks agreed to give me £50 to pay for prizes. I bought Sonic and Mario soft toys, a Lego Minecraft set, and a large pack of assorted Pokémon cards. I also washed up a Kirby soft toy that I found in my daughter’s “charity shop” pile and added that to the prize pool. Throughout the afternoon I kept track of the top two highest scores in both games, using the incredibly high-tech method of a white-board and dry-wipe marker. The hardest part was figuring out how to spell everyone’s name, and in moving the first-place score to second place every time a high-score was beaten. Oh, and making sure the overly-enthusiastic children didn’t wander off with poor Mario before the official prize giving ceremony.

As the afternoon progressed I encountered some kids who aced the games, and actively competed with each other to keep their place at the top of the leader board. Other children struggled to control the game and I had to give them a helping hand (quite literally – I said I would control the cursor keys while they controlled the space bar). And then there was the dad who was determined to win a prize for his child, and kept returning to make sure of his position on the leader board. But eventually the last burger was eaten, the arcade was closed, and the prizes announced. Four children went home happily clutching their prizes and the rest their collection of assorted Pokémon cards.

For the next step in my STEM Ambassador journey, I have agreed to start up the computing club again in September. I’m hoping to teach the children the skills to write their own arcade games in Scratch. Watch this space.

Things Customers Don’t Understand About Search

Note, this post is based on a dev forum put together by Chris.

Full-text search is a common feature of systems 67 Bricks build. We want to make it easy for users to find relevant information quickly, often through a faceted search function. Understanding user needs and building a top-notch user experience is vital. When building faceted search, we generally use either ElasticSearch (or AWS’s OpenSearch) or MarkLogic. Both databases offer very similar feature sets when it comes to search, though one is targeted more towards JSON-based documents and the other towards XML.

Search can seem magical at first glance and do some amazing things, but this can lead to situations where customers (and UX/UI designers) assume the search mechanism can do more than it can. We frequently find a disconnect between what is desired and what is feasible with search systems.

The two main categories of problem we often see are:

  1. Customers / Designers asking for things that could be done, but often come with nasty performance implications
  2. Features that seem reasonable to ask for at first glance, but once dug into reveal logical problems that make developing the feature near impossible

Faceted Search

Faceted search systems are some of the most common systems we build at 67 Bricks. The user experience typically starts with a search box into which the user enters a number of terms; they hit enter and are then presented with a list of results in some kind of relevancy order. Often there is a count of results displayed alongside a pagination mechanism to iterate through the results (e.g. showing results 1-10 of 12,464). We also show facets: counts of how many results fit into different buckets. All this is handled in a single query that often takes less than 100ms, which seems miraculous. Of course, this isn’t magic; full-text search systems use a variety of clever indexes to make searching and computing the facet counts quick.

Let’s make a search system for a hypothetical website, cottagesearch.com. Our first screen will present the user with some options to select a location, the date range they want to stay and how many guests are coming. We perform the search and show the matching results. How should we display the results and, more importantly, how do we show the facets?

Let’s say we did a search for 2-bedroom cottages. We’ve seen wireframes on a number of occasions where the facet counts for all bedroom numbers are displayed, so users see the number of results they would get for each bedroom count if they didn’t limit the search to just 2 bedrooms (i.e. there aren’t that many 2-bed options, but look at how many 3-bed options are available). At first glance this seems like a sensible design, but it fundamentally breaks how search systems handle faceting: they will return counts, but only for the search just performed.
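
To make that concrete, here is a sketch of the kind of Elasticsearch request body we might send for the 2-bedroom search (the field names are invented for cottagesearch.com). The terms aggregations that drive the facet counts are computed over the same filtered result set as the hits, which is why they cannot tell us how many 3-bed or 4-bed options we would otherwise have matched:

// Sketch of an Elasticsearch _search body; field names are hypothetical
val searchBody: String =
  """{
    |  "query": { "bool": { "filter": [ { "term": { "bedrooms": 2 } } ] } },
    |  "aggs": {
    |    "location": { "terms": { "field": "location" } },
    |    "features": { "terms": { "field": "features" } }
    |  },
    |  "size": 10
    |}""".stripMargin
// The aggregation buckets come back scoped to the filtered hits only.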

We could get around this by doing two searches: one limited by bedrooms and one without that limit, to retrieve the facet counts. This may seem like a sensible idea when we have one facet, but what do we do when we have more? Do we need to do multiple searches, effectively creating an N+1 problem? How do we display the numbers? Should the counts for the location facet include the bedroom limit or not? As soon as we start exploring additional situations we start to see the challenges the original design presents.

This gets harder when we consider non-exclusionary facets. Let’s say our cottage search system lets you filter by particular features, such as a wood burner, hot tub or dishwasher. Now, if we show counts of non-selected facets, what do these numbers represent? Do they include results that already include the selected facet or not? Here, the logic starts to break down and becomes ever more confusing to the end user and difficult to implement for the developer.

Other Complex Facet Situations

A common question we need to ask with non-exclusionary facets: Is it an AND or an OR search? The answer is very domain-dependent, but either way we suggest steering away from facet counts in these situations.

Date ranges provide an interesting problem: some sites will purposefully search outside the selected range so as to provide results near the selected dates. This may be useful or annoying depending on what the user expects and is trying to achieve; some users want exact matches and have no interest in results that do not meet the selected date range.

Ordering facets is also a question that is easily overlooked. Do you order lexicographically or by descending number of matches? What about names, year ranges or numeric values? Again, a lot of what users expect and want comes down to the domain being dealt with and the needs of the users.

When users select a new facet, what should the UI do? Should the search immediately rerun and the results and facets update or should there be a manual refresh button the user has to select before the search is updated? An immediate refresh would be slower, but let users narrow down carefully, while a manual update would reduce the number of searches done, but then users may be able to select a number of facets in such a way that no results would be returned.

Hierarchies can also prove tricky. We often see taxonomies being used to inform facets, say subjects with sub categories. How should these be displayed? Again there are many solutions to pick from with different sets of trade-offs.

Advanced Search

Advanced search can often be a bit like a peacock’s tail – something that looks impressive, but doesn’t contribute a fair share of value relative to how much effort it takes to develop. A lot of designers and product owners love the idea of it, but in practice it can end up being somewhat confusing to use, and many end users end up avoiding it.

Boolean builders exist in many systems where the designer of advanced search will insist on allowing users to build up some complex search with lots of boolean AND/OR options, but displaying this to users in a way they can understand is challenging. If a user builds a boolean search such as GDP AND Argentina OR Brazil, do we treat it as (GDP AND Argentina) OR Brazil, or should it be interpreted as GDP AND (Argentina OR Brazil)? We could include brackets in the builder, but this just further complicates the UI.

We frequently get bugs and feedback on advanced search; some of this feedback amounts to different users having contradictory opinions on how it should work. We would ask product owners to carefully consider “How many people will use it?” Google has a well-built UI for advanced search that does away with the challenges of boolean logic by having separate fields for ANDs, ORs and NOTs.

Google advanced search UI

An advanced search facility can introduce additional complexity when combined with facets. If an advanced search lets you select some facets before completing the search, does this form part of the string in the search box? We have had mixed results with enabling power users to enter facets into search fields (e.g. bedrooms:3); this can be tricky – some users can deal with it, others may prefer an advanced search builder, while others will rely on facets after the search.

Summary

In conclusion, we have three main takeaways:

  • Search is much more complex than it first appears
  • Facets are not magic – just because you can draw a nice wireframe doesn’t make it feasible to develop
  • Advanced search can be tricky to get right and even then, only used by a minority of users

We’ve built many different types of search and experimented with a number of approaches in the past, and we offer some tried and tested principles:

  • Make searches stateless – Don’t add complexity by trying to maintain state between facet changes; simply treat each change as a fresh search. That way URLs can act as a method of persistence and a way of bookmarking common searches.
  • Have facets only display counts for the current search and do not display counts for other facets once one has been selected within that category.
  • Only use relevancy as the default ordering mechanism – You may be tempted to allow results to be ordered in different ways, such as by published date, but this can cause problems with weakly matching but recent results appearing first.
  • Don’t build an advanced search unless you really need to and if you have to, use a Google style interface over a boolean query builder.
  • Check that search is working as expected – Have domain experts check that searches are returning sensible results and look into using analytics to see if users are having a happy journey through the application (i.e. run a search and then find the right result within the first few hits).
  • Beware of exhaustive search use cases – As many search mechanisms work on some score based on relevancy to the terms entered, having a search that guarantees a return of everything relevant can be tricky to define and to develop.

My Journey to Getting AWS Certified

When I joined 67 Bricks in January 2021 I knew close to zero about AWS, and not-a-lot about cloud services in general. I had dabbled a bit in Azure in my previous job, and I understood the fundamentals of what “the cloud” was, but I was very aware that I’d have to get up to speed if I wanted to be useful at developing applications on AWS. I joined our team on the EIU project, and on day one I was exposed to discussions about S3 buckets, lambda functions, glue jobs and SNS topics – all things I knew nothing about.

I asked one of the EIU enablement team to give me an overview, and I was introduced to the AWS console and shown some of the key services. Over the next few months I gradually started to get to grips with the basics – I learned how to upload to and download objects from S3, write to and query a DynamoDB table, and search for things in CloudWatch. I was very proud when I wrote my first lambda function, but I still felt like I was winging it.

I was encouraged by our development manager to look into obtaining some AWS certifications. The obvious starting point was Cloud Practitioner (https://aws.amazon.com/certification/certified-cloud-practitioner/?ch=sec&sec=rmg&d=1) which covers the basics of what “the cloud” is, and the applications of core AWS services. The best course I found to prepare for this was one from Amazon themselves https://explore.skillbuilder.aws/learn/course/134/aws-cloud-practitioner-essentials (you might need to sign in to the skill builder to access it, but the course is free). It uses the analogy of a coffee shop to explain the concepts of instances, scaling, load balancing, messaging and queueing, storage, networking etc, in an easy to understand manner. After a lot of procrastinating, and wondering if I was ready, I eventually took the exam in October 2021 and passed it with a respectable score.

The cloud practitioner course covers AWS services in an abstract manner – you learn about the core services without ever having to use them. In fact you could probably pass the course without ever logging into the AWS console. To demonstrate real experience and knowledge of AWS services, I decided that the certification to go for next was Developer Associate (https://aws.amazon.com/certification/certified-developer-associate/?ch=sec&sec=rmg&d=1). AWS doesn’t offer their own course to study for this certification – instead they provide links to numerous white papers, which make for fairly dry reading, and it is not clear exactly what knowledge is and is not required.

After doing a bit of research I decided that this course on Udemy https://www.udemy.com/course/aws-certified-developer-associate-dva-c01/ by Stephane Maarek was the most highly rated. With 32 hours of videos to absorb, this was not a trivial undertaking, but after slotting in a few hours of study either before work or in the evenings, I made it through with two books stuffed with notes.

The Developer Associate certification requires you to understand at a fairly deep level how the AWS compute, data, storage, messaging, monitoring and deployment services work, and also to understand architectural best practices, the AWS shared responsibility model, and application lifecycle management. A typical exam question for Developer Associate might ask you to calculate how many read-capacity-units or write-capacity-units a DynamoDB table consumes under various circumstances. Another one might test your understanding of how many EC2 instances a particular auto-scaling policy would add or remove. Another question might require you to understand what lambda concurrency limits are for.
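
For a flavour of the arithmetic involved: with provisioned capacity, one RCU covers one strongly consistent read per second of an item up to 4 KB (or two eventually consistent reads), and one WCU covers one write per second of an item up to 1 KB. So an application doing 10 strongly consistent reads per second of 6 KB items needs 10 × ⌈6 / 4⌉ = 20 RCUs.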

After working my way through a number of practice exams (the best ones seem to be by Jon Bonzo, again on Udemy https://www.udemy.com/course/aws-certified-developer-associate-practice-exams-amazon/) I took the plunge and sat the exam in January 2022, again passing with a respectable score.

But what next? The knowledge I’d gained up until this point had given me real practical skills, and a deeper knowledge of how the various AWS services connect together. For example, it was no longer a mystery how lambda functions could be triggered by SNS topics or messages from an SQS queue, and could then call another API perhaps hosted on EC2 to initiate some other process. And I could understand how to utilise infrastructure-as-code (e.g. CloudFormation or CDK) along with services like CodePipeline and CodeDeploy, to automate build processes. But I wanted a greater understanding of the “bigger picture”, and so next I chose to go for the Solutions Architect Associate certification (https://aws.amazon.com/certification/certified-solutions-architect-associate/?ch=sec&sec=rmg&d=1).

The Solutions Architect Associate exam typically presents a scenario and then asks you to choose which option provides the best solution. One option is usually wrong, but there could be more than one solution which would work – but you have to scrutinise the question to see which one best meets the requirements of the scenario. Are they asking for the cheapest solution? Or the fastest? Or the most fault tolerant? (Look for clues like “must be highly available” – and so the correct answer will probably involve multi-AZ deployments). Is any down-time acceptable? Is data required in real time, or is a delay acceptable? (E.g. do we choose Kinesis or SQS?) If a customer is migrating to the cloud are there time constraints, and how much data is there to migrate? (E.g. it can take a month or two to set up a Direct Connect connection, but you could have a Snowmobile in a week. A VPN might work but there are limits to the data transfer rates).

Again, I chose Stephane Maarek’s course on Udemy (https://www.udemy.com/course/aws-certified-solutions-architect-associate-saa-c02/) – his study materials are clear and he also notes which sections are duplicates of those in the developer associate course. I again used Jon Bonzo’s practice exams (https://www.udemy.com/course/aws-certified-solutions-architect-associate-amazon-practice-exams-saa-c03/). There is a fairly hard-core section on VPC, which is something I struggled with. Stephane presents a spaghetti-like diagram showing the relationship between VPCs, public and private subnets, internet gateways, NAT gateways, security groups, route tables, on-premise set-ups, VPC endpoints, transit gateways, direct connections, VPC peering connections etc – and says “by the end of this section you’ll know what all of this means”. He was right, but as someone with limited networking experience and knowledge, I found it pretty tough.

I sat the exam in April 2022, a day before I figured out that the cough and fatigue I’d developed was actually Covid. I passed the exam respectably again, and then collapsed into bed for a few days to recover.

At this point it’s probably worth mentioning how the exam process works. If you like, you can book an exam in an approved test centre. However, I chose to go with the “online proctored” exams hosted by Pearson Vue. You book an exam slot – generally plenty are available at all times of the day and night, and you can usually find a slot within the next day or two that suits. For the exam you need to be sitting at a clear table with nothing within arms reach. Not even a tissue or a glass of water. You need to run some Pearson software on your laptop that checks no other processes are running (so turn off slack, email, shut down your docker containers etc etc), and then launches their exam platform. You will be asked to present photo ID, and then show the proctor your testing environment. They will want to see your chair and table from all angles, and will want to see your arms to make sure you’re not wearing a watch or have anything hiding up your sleeves. You need your mobile phone in the room, but out of reach, in case they need to call you. And you also need to make sure you are undisturbed in the room for the duration of the exam (which is typically 2-3 hours).

This last point was challenging for me. My home office is not suitable – being far too crammed with potential cheat material, and I also share it with my husband. The only suitable place is my dining table, in the very open-plan ground floor of my house. Finding a time when I can have the ground floor to myself for 2-3 hours means scheduling the exam for around 7 AM on a day when the kids are not at school. I ended up putting “do not disturb” signs on the door and issuing dire warnings to everyone that they mustn’t come downstairs until I’d given them the all-clear. Anyone wandering sleepily through the room on a quest for coffee could result in the exam proctor dropping my connection and disqualifying me from the exam. Fortunately, all was well and all the exams I’ve sat so far were carried out without incident.

After obtaining the Solutions Architect Associate certification I thought about taking a break. But then I took a look at the requirements for SysOps Administrator Associate (https://aws.amazon.com/certification/certified-sysops-admin-associate/?ch=sec&sec=rmg&d=1) and realised that I’d already covered about two-thirds of the required material. Now SysOps is not something I have a love for. I have a deep respect for people who understand deployments and pipelines and infrastructure. The Enablement team at the EIU, who I work closely with, are miracle workers who regularly perform magic to get things up and running. The idea that I could learn some of that wizardry seemed far-fetched. But I thought I might as well give it a go.

The SysOps certificate focusses a lot on configuration and monitoring. You learn a lot about load balancers, autoscaling policies, CloudFormation and CloudWatch. And yes, all that in-depth knowledge about VPCs and hybrid-cloud set-ups is applicable here too. A typical exam question will present a scenario where something has gone wrong, and you have to pick the best option to fix it. For example, someone can’t SSH into an EC2 instance because something is wrong with the security group. Or someone in a child account of a parent organisation can’t access something in another child account. Yet again I went to Stephane Maarek’s course, which was again excellent (https://www.udemy.com/course/ultimate-aws-certified-sysops-administrator-associate/). And Jon Bonzo again provided the practice exams (https://www.udemy.com/course/aws-certified-sysops-administrator-associate-practice-exams-soa-c01/).

I sat the SysOps exam in June 2022. One thing that caused a little trepidation was that this exam includes “exam labs” – these are practical exercises carried out in the AWS console. It was hard to prepare for these because I could not find any practice labs on-line, and so I was going in cold. However, it turned out that the labs were well defined with clear steps on what was required. Even the ones where I had never really looked at the service before, I was able to find it in the console and figure out what I needed to do. I was asked to:

  • Create a backup plan for an EFS system with two types of retention policy
  • Update a CloudFormation stack to fiddle with some EC2 settings, roles, route tables etc
  • Create an S3 static website and configure some Route 53 failover policies

The second of these caused me the most difficulty – I hadn’t anticipated actually having to write a CloudFormation template – they provided one which I needed to edit and it took me a while to figure out how to actually do this. Turns out that you need to save a new version of the template locally and then re-upload it.

I passed the SysOps exam with a more modest mark than for the other certifications, and I definitely breathed a sigh of relief. I am now definitely taking a breather – perhaps in a few months I might take a look at some of the specialist certifications (maybe Data Analytics?) but for the moment I’m going to get back to some of my other neglected hobbies (I like to draw, and play the piano, and one day I’ll maybe finish my epic fantasy trilogy).

The key take-aways from my experience are:

  • The associate level certifications require you to acquire knowledge that is directly applicable in the day-to-day life of a developer or systems administrator.
  • I was initially concerned that the courses would be part of a propaganda machine from AWS, encouraging us to spend ever larger amounts on AWS services. I found this to not be the case at all. Quite a large part of the material teaches us how to save costs, and how to incorporate our existing on-premises infrastructure with AWS, rather than replacing it entirely.
  • Sitting an exam in your own home is definitely preferable to travelling to a test centre – you get far more flexibility over when you can take the exam. However, not everyone will have a suitable place at home to take the exam, particularly if you share your home with other people, or you do not have a suitable table to work at.
  • Studying for these certifications will require a significant time commitment. The online courses run for 20-30 hours or more, assuming you never pause the videos to take notes, or repeat a section. And that is before you take time to revise or do practice exams.
  • Definitely the most valuable tool for preparing for the exams is by completing as many practice exams as you can find. The best ones include detailed explanations about why a particular answer is correct and the others are wrong.
  • Also note that these certifications have an expiry date – typically 3 years – and also the courses are refreshed periodically. For example, the Solutions Architect Associate is being refreshed at the end of August and Solutions Architect Professional is being refreshed in November.

Dev Forum – Parsing Data

Last Friday we had a dev forum on parsing data, which came up because some devs had pressing questions about regexes. Dan provided us with a rather nice and detailed overview of different ways to parse data. We often encounter situations where an input or a data file needs to be parsed so our code can make some sensible use of it.

After the presentation, we looked at some code using the parboiled library with Scala. We were given a simple example: checking whether a sequence of various types of brackets has matching opening and closing brackets in the correct positions. For example, the sequence ({[<<>>]}) would be considered valid, while the sequence ((({(>>]) would be invalid.

First we define the set of classes that describes the parsed structure:

object BracketParser {

  sealed trait Brackets

  case class RoundBrackets(content: Brackets)
     extends Brackets

  case class SquareBrackets(content: Brackets)
     extends Brackets

  case class AngleBrackets(content: Brackets)
     extends Brackets

  case class CurlyBrackets(content: Brackets)
     extends Brackets

  case object Empty extends Brackets

}

Next, we define the matching rules that parboiled uses:

package com.sixtysevenbricks.examples.parboiled

import com.sixtysevenbricks.examples.parboiled.BracketParser._
import org.parboiled.scala._

class BracketParser extends Parser {

  /**
   * The input should consist of a bracketed expression
   * followed by the special "end of input" marker
   */
  def input: Rule1[Brackets] = rule {
    bracketedExpression ~ EOI
  }

  /**
   * A bracketed expression can be roundBrackets,
   * or squareBrackets, or... or the special empty 
   * expression (which occurs in the middle). Note that
   * because "empty" will always match, it must be listed
   * last
   */
  def bracketedExpression: Rule1[Brackets] = rule {
    roundBrackets | squareBrackets | 
    angleBrackets | curlyBrackets | empty
  }

  /**
   * The empty rule matches an EMPTY expression
   * (which will always succeed) and pushes the Empty
   * case object onto the stack
   */
  def empty: Rule1[Brackets] = rule {
    EMPTY ~> (_ => Empty)
  }

  /**
   * The roundBrackets rule matches a bracketed 
   * expression surrounded by parentheses. If it
   * succeeds, it pushes a RoundBrackets object 
   * onto the stack, containing the content inside
   * the brackets
   */
  def roundBrackets: Rule1[Brackets] = rule {
    "(" ~ bracketedExpression ~ ")" ~~>
         (content => RoundBrackets(content))
  }

  // Remaining matchers
  def squareBrackets: Rule1[Brackets] = rule {
    "[" ~ bracketedExpression ~ "]"  ~~>
        (content => SquareBrackets(content))
  }

  def angleBrackets: Rule1[Brackets] = rule {
    "<" ~ bracketedExpression ~ ">" ~~>
        (content => AngleBrackets(content))
  }

  def curlyBrackets: Rule1[Brackets] = rule {
    "{" ~ bracketedExpression ~ "}" ~~>
        (content => CurlyBrackets(content))
  }


  /**
   * The main entrypoint for parsing.
   * @param expression
   * @return
   */
  def parseExpression(expression: String):
    ParsingResult[Brackets] = {
    ReportingParseRunner(input).run(expression)
  }

}
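
To round the example off, here is a quick sketch of how the parser might be exercised (assuming parboiled-scala’s ParsingResult exposes result and parseErrors):

object BracketParserExample extends App {
  val parser = new BracketParser()

  // A balanced sequence parses to a nested Brackets structure
  val valid = parser.parseExpression("({[<<>>]})")
  println(valid.result)   // e.g. Some(RoundBrackets(CurlyBrackets(SquareBrackets(...))))

  // An unbalanced sequence fails to match
  val invalid = parser.parseExpression("((({(>>])")
  println(invalid.result) // None; details are available in invalid.parseErrors
}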

While this example requires a lot more code to be written than a regex, parsers are more powerful and adaptable. Parboiled seems to be an excellent library with a rather nice syntax for defining them.

To summarize, regexes are very useful, but so are parsers. Start with a regex (or better yet, a pre-existing library that specifically parses your data structure) and if it gets too complex to deal with, consider writing a custom parser.

Zen and the art of booking vaccinations

This is a slightly abridged version of a painful experience I had recently when trying to book a Covid vaccination for my 5-year-old daughter, and some musing about what went wrong (spoiler: IT systems). It’s absolutely not intended as a criticism of anyone involved in the process. All descriptions of the automated menu process describe how it was working today.

At the beginning of April, vaccinations were opened up for children aged 5 and over. Accordingly, on Saturday 2nd, we tried to book an appointment for our daughter using the NHS website (https://www.nhs.uk/conditions/coronavirus-covid-19/coronavirus-vaccination/book-coronavirus-vaccination/). After entering her NHS record and date of birth, we were bounced to an error page:

The error page after failing to book an appointment.

There’s no information on the page about what the error might be – possibly this is reasonable given patient confidentiality etc. (at no point had I authenticated myself). I noticed that the URL ended “/cannot-find-vaccination-record?code=9901” but other than that, all I could do was call 119 as suggested.

Dialling 119, of course, leads you to a menu-based system. After choosing the location you’re calling from, you get 4 options:

  1. Test and trace service
  2. Covid-19 vaccination booking service
  3. NHS Covid pass service
  4. Report an issue with Covid vaccination record

So the obvious choice here is “4”. This gives you a recorded message “If you have a problem with your UK vaccination records, the agent answering your call can refer you on to the data resolution team”. This sounds promising! Then there’s a further menu with 3 choices:

  1. If your vaccination record issue is stopping you from making a vaccination booking
  2. If your issue is with the Covid pass
  3. If your issue relates to a vaccination made overseas

Again, the obvious choice here is “1”. This results in you being sent to the vaccination booking service.

The first problem I encountered (in the course of the day I did this many times!) was that many of the staff on the other end seemed to be genuinely confused about how I’d ended up with them. They told me I should redial and choose option 4, and I kept explaining that I had done exactly that and this was where I had ended up. So either the menu system was not working and was sending me to the wrong place (although given the voice prompts it sounded like it was doing the right thing), or the staff taking the calls had not been briefed properly.

Eventually I was able to get myself referred to the slightly Portal-esque sounding Vaccination Data Resolution Service. They explained that my daughter did not appear to be registered with a GP, which surprised me because she definitely is. So, they said, I should get in touch with her GP practice and get them to make sure the records were correct on “the national system”.

This I did. I actually went down there (it was lunchtime), the staff at the surgery peered at her records and reported that everything seemed to be present and correct, with no issues being flagged up.

So, then I had more fun and games trying to get one of the 119 operators to refer me back to the VDRS. This was eventually successful, and someone else at the VDRS called me back. She took pity on me, and gave me some more specific information – the “summary case record” on the “NHS Spine portal” which should have listed my daughter’s GP did not.

I phoned the GP surgery, explained this, various people looked at the record and reported that everything seemed fine to them.

More phoning of 119, a third referral, to another person. He wasn’t able to suggest anything, sadly.

So, at this point, I was wracking my brains trying to work out where the problem could lie. I had the VDRS people saying that this data was missing from my daughter’s record, and the GP surgery insisting that all was well. The VDRS chap had mentioned something about it potentially taking up to 4 weeks (!) for updates to come through to them, which suggests that behind the scenes there must be some data synchronisation between different systems. I wondered if there was some kind of way of tagging bits of a patient record with permissions to say who is allowed to see them, and the GP surgery could and the VDRS people couldn’t.

Finally, on the school run to collect my daughter, I thought I’d have one last try at talking to the GP practice in person. I spoke to one of the ladies I’d spoken to on the phone, she took my bit of paper with “NHS Spine portal – summary care record” scrawled on it and went off to see the deputy practice manager. A short while later she returned; they’d looked at that bit of the record and spotted something in it about “linking” the local record with the NHS Spine (she claimed they’d never seen this before1), and that this was not set. I got the impression that in fact it couldn’t be set, because her proposed fix (to be tried tomorrow) is going to be to deregister my daughter and re-register her. And then I should try the vaccine booking again in a couple of days.

As someone who is (for want of a better phrase) an “IT professional”, the whole experience was quite frustrating. As noted at the top, I’m not trying to criticise any person I dealt with – everyone seemed keen to help. I’m also not trying to cast aspersions on the GP’s surgery – as far as they were aware, there was nothing amiss with the record (until they discovered this “linking” thing). My suspicion is that the faults lie with IT systems and processes.

For example, it sounds like it’s an error for a GP surgery to have a patient record that’s not linked to a record in the NHS Spine, but since many people took a look at the screens showing the data without noticing anything amiss, I’d say that’s some kind of failure of UI design. I wonder how it ended up like that; maybe my daughter’s record predated linking with Spine and somehow got missed in a transitioning process (or no such process occurred)?

It would have been nice if I’d been able to get from the state of knowing there’s some kind of error with the record, to knowing the actual details of the error, without having to jump through so many telephonic hoops. I presume the error code 9901 means something to somebody, but it didn’t mean anything to any of the people I spoke to. In any case, I only spotted it because, as a developer, I thought I’d peep at the URL, but it didn’t seem to be helpful from the point of view of diagnosing the problem. It feels like there’s a missed opportunity here – since people seeing that error page are directed to the 119 service, it would have been helpful to have provided some kind of visible code to enable the call handlers to triage the calls effectively.

In terms of my own development, it was an important reminder that the systems I build will be used by people who are not necessarily IT-literate and don’t know how they work under the covers, and if they go wrong then it might be a bewildering and perplexing experience for them. Being at the receiving end of vague and generic-sounding errors, as I have been today, has not been a lot of fun.


1I found this curious, but a subsequent visit to the surgery a couple of days later to see how the “fix” was progressing clarified things somewhat. My understanding now is that the regular view of my daughter’s record suggested that all was well, and that it was linked with Spine, and it was only when the deputy practice manager clicked through to investigate that it popped up a message saying that it was not linked correctly.

Obviously I don’t know the internals of the system and this is purely speculation, but suppose that the local system had set an “I am linked with Spine” flag when it tried to set up the link, the linking failed for some reason, and the flag was never unset (or maybe it got set regardless of whether the linking succeeded or failed). Suppose furthermore that the “clicking through” process described actually tries to pull data from Spine. That could give a scenario in which the record looks fine at first glance and gives no reason to suppose anything is wrong, and you only see the problem with some deeper investigation. We can still learn a lesson from this hypothetical conjecture – if you are setting a flag to indicate that some fallible process has taken place, don’t set it before or after you run this process; set it when you have run it and confirmed that it has been successful.


Postscript

Sadly, I never found out what the underlying problem was. We just tried the booking process one day and it worked as expected (although by this point it was moot as my daughter had tested positive for Covid, shortly followed by myself and my wife). A week or so later, we got a registration confirmation letter confirming that she’d been registered with a GP practice, which was reassuring. I’d like to hope that somewhere a bug has been fixed as a result of this…

Organising Git Pull Requests

Recently we had an interesting dev forum on Git workflows. Git is the de facto source control tool of choice for software development. It is powerful and flexible, and teams have a wide range of choices to make when deciding exactly how to make use of it. This discussion originally stemmed from a Changelog podcast titled “Git your reset on”, based on the blog post “Git Organized” by Annie Sexton.

To briefly summarise the above, if we imagine a situation where a merge to main leads to a production deployment and then an issue is found in production, how do we roll back? We look in the git history, find the commit that introduced the bug, revert it and then redeploy to production. But oh no! This revert changed a number of source files beyond the scope of the bug and has ended up introducing a new bug.

The proposed solution to this is to ensure a sensible, clean history of commits is used within a branch. Annie Sexton suggests using a feature branch where you commit regularly with useless messages like WIP until you are ready to submit a pull request. You then run git reset origin/head to be presented with all the changes you have made and how they differ from main; you then progressively add files and make commits with sensible messages to build up a more coherent change.

This approach has a number of advantages we discussed:

  • It provides a neat history of compartmentalised changes
  • Reviewers can then follow a structured story of commits about how the developer arrived at their solution
  • Irrelevant changes are excluded
  • All of the included changes are relevant only to the task at hand

Disadvantages of this approach may include:

  • Bad approaches that were tried and abandoned are not in the history
  • Changes won’t always compartmentalise neatly into whole files, which means we sometimes need to commit specific changes within a file, and some tools do a poor job of supporting this use case

It turns out that a lot of developers at 67 Bricks use some form of history rewriting when developing code, but no one uses the specific approach proposed above. git commit --amend is a very useful command for tweaking the previous commit. Some (including myself) use git rebase -i as a way of rebuilding the commit history, though this has a limitation in cases where you want to split a commit up. Finally, others create new branches and build up clean commits on the new branch.
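For example (a sketch of those two commands rather than a prescribed workflow; the file name is just a placeholder):

# fold a forgotten change into the previous commit, keeping its message
git add src/RecordValidator.scala
git commit --amend --no-edit

# interactively reorder, reword or squash the last three commits
git rebase -i HEAD~3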

Is force pushing a force for good?

This question really split the developers. Some see git push --force as an evil that indicates you’re using git wrongly, and think it should only be used as a last resort. Others really like the idea of force pushing, but only for branches you’ve been developing on and have yet to create a merge request for.

Squashing?

A workflow some developers have seen before involves using the --squash flag when merging into main. This can create a nice history where each commit neatly maps to a single ticket, but the general view was that this loses helpful information from the git history.
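For reference, a squash merge from the command line (hosting platforms such as GitHub offer an equivalent option at merge time; the branch name and message here are made up) looks something like:

git checkout main
git merge --squash feature/csv-export   # stage the branch's changes as a single set
git commit -m "Add CSV export"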

Should the software be valid at every commit?

Some argued that this is a sensible thing to strive for: having confidence that you can revert to any commit and the software will still work is a nice place to be. However, others criticised this as tricky to achieve in practice. Ideally the tests would all be valid and pass too, but this goes against how some developers work: they create a failing test to show the problem or new feature, commit it, and then build out code to make the test green. Others claim that tests should change in the same commit as the code changes, to make it obvious that these changes belong together.
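If you do want to check that every commit on a branch builds and passes the tests, one option (a sketch; origin/main is assumed to be the base branch, and the test command is just an example) is to replay the branch with git rebase --exec:

# re-run the test suite on each commit of the branch, stopping at the first failure
git rebase --exec "sbt test" origin/main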

Tools used by Developers

At 67 Bricks, we try to avoid specifying precisely which tools a developer should use for their craft. Different developers prefer different approaches when it comes to using git: some are more comfortable with their IDE’s git integration (IntelliJ, for example, can add specific lines within a file to a commit), while others prefer using the command line.

To conclude, specific approaches and tools used vary, but everyone agreed on the benefits of striving for a clean, structured git history.

MarkLogic profiler

When investigating slow XQuery or XSLT queries in MarkLogic, one of the tools it provides is a query profiler. In this post I’ll explain how to make use of the profiler and how to interpret the results.

There are 3 ways you can make use of the profiler:

  1. use the MarkLogic QConsole;
  2. use an IDE that supports the profiler like IntelliJ with the XQuery and XSLT plugin;
  3. use the MarkLogic profiler API.

Using the QConsole

In the MarkLogic QConsole you can enable profiling by clicking on the Profile tab next to the Results tab and pressing the Run button. This will display a table with the profiling results. The profile can be saved by clicking on the download button on the right.

The profile result table contains the following information:

  1. Module:Line No.: Col No. – This is where the expression being profiled starts;
  2. Count – This is the number of times the expression was evaluated;
  3. Shallow %/Shallow ms – This is how long the expression took to run on its own (excluding its subexpressions), across all Count times it was evaluated;
  4. Deep %/Deep ms – This is how long the expression took to run including the time its subexpressions took to evaluate;
  5. Expression – This is the string representation of the query expression being evaluated.

NOTE: Here, ms is microseconds not milliseconds.

If you want to do further processing on the downloaded profile:report XML, you can load that XML by using:

let $data := xdmp:filesystem-file("C:\profiling\profile.xml")  (: read the saved file as a string :)
let $profile := xdmp:unquote($data)/prof:report                (: parse the XML and select the report element :)

Analysing Profile Results

Sorting by shallow time is useful for identifying the expressions that take a long time to evaluate. This can identify places where you should modify the query or add indices.

The count column is also useful. It can indicate places where you have nested loops that result in quadratic time queries.
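For example, assuming $profile has been loaded as above, a quick way to list the most frequently evaluated expressions is something like:

for $expression in $profile/prof:histogram/prof:expression
let $count := $expression/prof:count cast as xs:integer
order by $count descending
return concat($count, ": ", $expression/prof:expr-source/string())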

When making optimizations, it can be tricky to know whether a change actually speeds up the query or whether the difference is just the normal variation you see when running a query multiple times. To handle this, I like to use a spreadsheet and record the average (mean) time and standard deviation over 5 or 10 runs.
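If you prefer to stay in the QConsole, a small helper along these lines (a sketch; the timings are made-up numbers, and this calculates the population standard deviation) will do the same calculation:

declare function local:std-dev($times as xs:double*) as xs:double {
  let $mean := fn:avg($times)
  let $variance := fn:avg(for $t in $times return ($t - $mean) * ($t - $mean))
  return math:sqrt($variance)
};

let $times := (1.52, 1.48, 1.61, 1.50, 1.55)  (: elapsed seconds from five runs :)
return (fn:avg($times), local:std-dev($times))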

This can also be useful when varying the number of items being processed (e.g. documents), in order to identify how the query scales with the number of items, and whether the scaling is linear, quadratic, or some other curve. For this, you can use line graphs to visualise the performance, and linear or quadratic regression formulae to check what shape the curve is.

The stock chart graph is useful for comparing the variation between runs. Here, the minimum and maximum times form the extents (low, high) of the chart, and the mean +/- one standard deviation forms the inner bar (open, close) of the chart.

Using the Profiler API

With the profiler API, you can use the following to collect the profile data for a query expression:

let $_ := prof:enable(xdmp:request())        (: start profiling the current request :)
(: code to profile, e.g. -- let $_ := local:test() :)
let $_ := prof:disable(xdmp:request())       (: stop profiling :)
let $profile := prof:report(xdmp:request())  (: retrieve the profile report for the request :)

NOTE: If you are running this from an XQuery 1.0 query instead of a 1.0-ml query then you need to declare the prof namespace:

declare namespace prof = "http://marklogic.com/xdmp/profile";

You can then either save the profile to a file, e.g.:

let $_ := xdmp:save("C:\profiling\profile.xml", $profile)

or process the resulting profile:report XML in the XQuery script.

Alternatively, you can use prof:eval or prof:xslt-eval to evaluate an in-memory query, or prof:invoke or prof:xslt-invoke to evaluate a module file. These return the profile data as the first item and the query results as the remaining items, so you can use the following:

let $ret := prof:eval("for $x in 1 to 10 return $x")
let $profile := $ret[1]
let $results := $ret[position() != 1]

With the profile report XML, you can then process it as you need. For example, to create a CSV version of the QConsole profile table you can use the following query:

declare function local:to-seconds($value) {
  ($value cast as xs:dayTimeDuration) div xs:dayTimeDuration("PT1S")
};

let $overall-elapsed := local:to-seconds($profile/prof:metadata/prof:overall-elapsed)
for $expression in $profile/prof:histogram/prof:expression
let $source := $expression/prof:expr-source/string()
let $uri := string(($expression/prof:uri/text(), "main")[1])
let $line := $expression/prof:line cast as xs:integer
let $column := ($expression/prof:column cast as xs:integer) + 1
let $count := $expression/prof:count cast as xs:integer
let $shallow-time := local:to-seconds($expression/prof:shallow-time)
let $deep-time := local:to-seconds($expression/prof:deep-time)
order by $shallow-time descending
return
  string-join((
    $uri, $line, $column, $count,
    $shallow-time, ($shallow-time div $overall-elapsed) * 100,
    $deep-time, ($deep-time div $overall-elapsed) * 100,
    $source
  ) ! string(.), ",")

NOTE: This does not handle escaping of commas in the $source string.
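If that matters for your data, one way to handle it (a sketch, not part of the original query) is to quote any field that contains a comma, quote, or newline, doubling embedded quotes as per the usual CSV convention:

declare function local:csv-field($value as xs:string) as xs:string {
  (: wrap in quotes and double any embedded quotes if the value needs escaping :)
  if (matches($value, '[",\r\n]'))
  then concat('"', replace($value, '"', '""'), '"')
  else $value
};

(: e.g. use local:csv-field($source) in place of $source in the query above :)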