Technical Debt - what can you do?

A cautionary tale of technical debt.

In preparation for Black Friday three years ago, I came upon an issue with our cache: we were caching a big dictionary as a single object instead of caching each entry in the dictionary under its own key. The reason you don't want it as one big object is obvious. If you need a single item from the dictionary, you don't want to fetch all the data and then do a lookup in memory - and even worse, if you just want to update a single entry in the dictionary, you don't want to rewrite the full dictionary each time.

This particular case was a dictionary of products to campaigns on an e-commerce site, where each lookup in the dictionary would provide you with data on the campaign affecting the price of that product. Of course you don't need to get all of the products/campaigns to get info on a single product, and we often had to update entries as campaigns were updated. The bigger issue was that we didn't bulk update the cached dictionary, as we were using an event-based pattern for updating the campaign cache. I.e. updating a product, or a campaign affecting it, fired off an update event that would update the cache for that product or campaign.

This means that even if updates to products were bulked, a single event for updating the cache was still fired for each product. For every one of those events, the system was reading out the entire cache for all campaigns, mapping it to a dictionary, making a single change in one entry and writing the whole thing back to the cache, invalidating it for all consumers of the cache - in this case 4 web servers, an app server and a job server. This could happen hundreds of times on top of each other when the customer was doing large product or campaign updates, or when integrations caused prices or campaigns to be updated.
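To make the difference concrete, here is a minimal sketch of the two strategies. It is not the code from the actual system: the cache is a hypothetical in-memory stand-in (FakeCache), the key names ("campaigns", "campaign:<id>") and the campaign fields are made up for illustration, and the only thing it measures is how many bytes get serialized in and out of the cache when a burst of update events arrives.

```python
import json


class FakeCache:
    """Hypothetical in-memory stand-in for a shared cache; counts bytes moved."""

    def __init__(self):
        self._store = {}
        self.bytes_read = 0
        self.bytes_written = 0

    def get(self, key):
        raw = self._store.get(key)
        if raw is None:
            return None
        self.bytes_read += len(raw)
        return json.loads(raw)

    def set(self, key, value):
        raw = json.dumps(value)
        self.bytes_written += len(raw)
        self._store[key] = raw


def update_blob(cache, product_id, campaign):
    """Single-object strategy: read everything, change one entry, write everything."""
    mapping = cache.get("campaigns") or {}
    mapping[product_id] = campaign
    cache.set("campaigns", mapping)


def update_per_key(cache, product_id, campaign):
    """Per-entry strategy: write only the entry that changed."""
    cache.set(f"campaign:{product_id}", campaign)


def simulate(strategy, n_products=1000, n_events=100):
    """Seed the cache, fire n_events single-product updates, return bytes moved."""
    cache = FakeCache()
    campaigns = {
        f"p{i}": {"campaign": "summer", "discount": 10} for i in range(n_products)
    }
    if strategy == "blob":
        cache.set("campaigns", campaigns)
        update = update_blob
    else:
        for product_id, campaign in campaigns.items():
            cache.set(f"campaign:{product_id}", campaign)
        update = update_per_key

    # Only count the traffic caused by the update events, not the seeding.
    cache.bytes_read = 0
    cache.bytes_written = 0

    # One event per updated product, as in the event-based pattern described above.
    for i in range(n_events):
        update(cache, f"p{i}", {"campaign": "black-friday", "discount": 25})
    return cache.bytes_read + cache.bytes_written


if __name__ == "__main__":
    print("single object:", simulate("blob"), "bytes moved")
    print("per-entry keys:", simulate("per_key"), "bytes moved")
```

With the single-object strategy, the cost of every event grows with the size of the whole dictionary, while with per-entry keys it only depends on the size of the one entry being changed. The sketch leaves out the invalidation that every rewrite triggered on the other servers, which only makes the single-object strategy look better than it was.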

Even though it wasn't an optimal way of doing it, it wasn't a bigger problem than the servers could handle. It was just the wrong way of doing things.
It was put on a list of technical debt, and because it wasn't deemed critical for the system to be running, it wasn't important enough to fix. And even if it was, there was a very small chance that the customer would spend the hours (and money) fixing something, when it already worked - this meant that it would have to be an investment from our own company.

A situation like this is what we call technical debt and it exists in all projects. There are several types of tech debt and there are several approaches to handling it.

Types of technical debt

It's called technical debt because it draws 'interest' and it needs to be paid at some point. Sometimes interest is paid over time, and sometimes the debt needs to be paid all at once. Not all technical debt is bad, just like not all debt in real life is bad. Having a mortgage is fine, but taking out a quick high-interest loan to buy a new flatscreen TV is not a good idea.

When you pay interest over time

Technical debt in the form of poorly designed, bad or overly complex code draws interest over time, and in the worst case it can accumulate to make a codebase so annoying and difficult to work with that you just want to abandon the whole thing and rewrite it from scratch.

The risk of mis-estimating a feature because you hit an unexpectedly complex or rigid part of the codebase gets higher the more of this kind of debt there is in your system. Over time this leads to bigger and bigger 'buffers' being added to initial estimates, because the developers keep missing them.
The developers feel shitty having to explain to their superiors that another task needs to go over estimate. The customer gets annoyed because they feel simple features are getting too expensive (which they are). You get fewer and fewer new features done, because everything takes longer and when you ship fewer features, there is even less chance that you get to spend time fixing up some of that technical debt you are accumulating. This can be crippling to a project and very hard on developer morale.

A lot of the time this kind of technical debt leads to even more debt, because new features are built on top of code that is difficult to understand, which makes the result even more difficult to understand.

When you have to repay your entire debt at once - with interest

Sometimes, to get something to work and deliver within a deadline, you have to make a decision you know is wrong. Usually a decision like this is considered okay if it works in the current state of the system, and if you write down somewhere that this decision was made, knowing it was wrong and that it needs to be changed later. Some verbal agreement is often made that you can go back and fix it at some point.

Unfortunately, when you have tech debt like this it usually gets ignored for a time, and then forgotten. That is, right up until something breaks. Many times this happens when something that was assumed never to change changes anyway, due to increased load on the system, a new feature being added, or a change in the way the system is used.

Especially in the cases where increased load or a change in use is the issue, the problem will likely not show up until after it is deployed to production, which is where things start breaking. Then you have to go back and pull out the thing that was knowingly done in the wrong way and redo it the right way - all while the system is unstable. From the customer's point of view, it looks like you made a mistake, and they will most likely be reluctant to pay for fixing the issue - after all, it was your (or your team's) decision to do something that wasn't right the first time around.

The effect of tech debt like this is that you drop everything else to fix the problem, pushing back what you should be working on in order to redo something the way it should have been done in the first place. Then you have a struggle with an upset customer about who should pay for fixing the issue, when they have already experienced a disruption to their system (and possibly business) and delays on every other feature in the pipeline.

Needless to say, this is a really crappy situation to be in and it can really hurt the relationship between the customer and your team - not to mention the morale of the developers scrambling to fix things.

So what do we do?

Creating technical debt is impossible to avoid - everyone has deadlines and estimates to hit. The key, in my opinion, is to explain this to all stakeholders. The customer needs to know what technical debt is, that it is unavoidable, and what effect it has or can have on their interests. They want you to hit the estimates and deadlines and ship the features, but they also don't want you to get slower and slower over time and they definitely don't want the system to get unstable. This means that you need to reach an agreement where some amount of time is earmarked for getting rid of technical debt. This has to be done continuously because a system is constantly evolving.

They need to understand that an investment of 10 hours to simplify some part of the system that has grown overly complicated over time can easily save them multiples of that when future features that touch that area of the code need to be implemented - not only saving them money but also getting features implemented faster.

Even though they won't see any difference in functionality after the change, it is still a clear-cut business case that can save them money and time - and the side effect is that the developers are happier and more motivated.

The Professional Responsibility of The Craftsman

Technical debt needs to be handled continuously, and therefore it needs to be spoken about often. You can Boy Scout your way out of some of it, but often it can take hours or days to handle the larger pieces of tech debt, and that needs to be prioritized as part of roadmaps and sprints. If you don't do this you will end up in a scenario like the one I started this post with, where a change in the customer's use of the system causes an issue that affects the whole system and you have to go into crisis mode. This is exactly what happened - suddenly the needs of the customer meant that several times more campaigns needed to be updated several times a day, and the limit of what the cache could handle was hit, causing timeouts and sometimes even crashes. This would take down the whole site, meaning we had to change the caching to be done the right way while we were also handling breakdowns in production.

This meant working weekends and after hours to keep the impact to a minimum while implementing a fix and hotfixing the issue.

Ultimately the fix wasn't very difficult to implement and could easily have been done at any time in the years since the wrong caching strategy had been implemented. It just wasn't a priority because it was working fine.

If it had been something that was regularly brought up, and the risks associated with doing it this way had been discussed, it would have been identified that an increase in the rate of updates could make everything unstable. The case would have been made to the customer that it only works right now under the assumption that the amount of updates to campaigns stays around the current level. The customer could then have warned us that a change might be coming and that this needed to be handled prior to that.

It wasn't. It was just written down on a list of technical debt somewhere and wasn't a priority... Right up until it was the top priority.

We as developers, as Craftsmen, have a professional responsibility to keep pointing these things out to all stakeholders, internally and externally, in order to avoid major issues like this. An airplane engineer would never let a plane in the air if they knew that the plane would crash if it started raining - even if the weather forecast said that it wouldn't rain. Weather forecasts are sometimes wrong, and if the consequence of rain would be catastrophic, the engineer has the responsibility to either keep the plane on the ground or fix it so that it can handle the rain.

In the same way we have a professional responsibility to point out potential points of failure in our systems. To live up to this responsibility we need to look at our technical debt constantly and make sure that we know about the possible points of failure, so that we can handle them before things start to fail. The cost of handling technical debt is always much greater if we only do it when something breaks.