Software Reliability with Dishwashers

Some weeks ago my dishwasher started leaking water in the front. At first, I thought the water filter was full, and after cleaning it properly, the issue didn’t happen anymore.

Until some days ago.

This time it came with much more stubbornness, showing errors on the display. The dishwasher is an AEG, and the error code looked like ,10 (yeah, comma included). I googled a bit, and it seems to mean that the dishwasher can’t load any more water. That was quite surprising because, just a few minutes after the cleaning program had started, I could clearly hear water flowing through the pipes. Then suddenly silence and that error message.

The QA side of me forced me to randomly press the buttons present on the machine, with some more persistence for the Reset one, hoping it would just heal itself – who knows, maybe it was overwhelmed – pressing F5 typically works 🙂 Strangely enough, the error code disappeared when I pressed the arrow down button, totally unexpected, and the program resumed. I was happy until the dishwasher started leaking water again, and this time another error code appeared: ,30.

I googled again, and I found what seems to be a fail-safe mechanism from the manufacturer to prevent the appliance from leaking too much water – which could be dangerous if there are kids, pets, or cables on the ground, I guess. It seems this is a feature called Aquastop.

That was an interesting finding. I googled (again) a bit to understand how this is all connected together, and I found a short video explaining in five minutes how the Aquastop works. It immediately reminded me of the Circuit Breakers that we use for reliability patterns, for example in microservices.

I like to think that this is what happens when a team of smart engineers sits down together and tries to solve a real problem in a creative way. It’s astonishing what we can learn from electrical engineers, or more specifically, from the products we use on a daily basis, if only we had the time to disassemble stuff and see how it was done. In this case the product obviously can’t heal itself: maybe the water hose is perforated and keeps leaking. The gist of it, however, is this: a simple monitoring tool that sends a signal to close the water inlet valve when there is too much water where it shouldn’t be.
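To make the parallel with software concrete, here is a minimal circuit breaker sketch in Python. It is not how the Aquastop or any particular library works; the class name, the `max_failures` and `reset_after` parameters, and the injectable `clock` are all assumptions made for illustration. After too many consecutive failures the breaker "closes the valve" and rejects calls until a cool-down has passed, then lets a single trial call through.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call because the circuit is open."""


class CircuitBreaker:
    """Hypothetical breaker: opens after `max_failures` consecutive errors."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after = reset_after  # cool-down in seconds
        self.clock = clock              # injectable so it can be tested
        self.failures = 0
        self.opened_at = None           # None means the circuit is closed

    def call(self, func, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open, call rejected")
            # Cool-down elapsed: allow one trial call (half-open state).
            # A single failure from here re-opens the circuit.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = func(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success resets the failure count
        return result
```

Wrapping a remote call as `breaker.call(fetch_orders, customer_id)` (where `fetch_orders` is whatever function talks to the other service) is then enough to stop hammering a dependency that is already "leaking".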

Brilliant.

What’s next? Pingdom for washing machines?

Reactive Systems – Responsiveness

This article is the first of a series that I plan to write about Reactive Systems.

A brief analysis: it’s 2018 now, and we often read words like “react”, “reactive”, and similar. As if the word “reactive” itself was not ambiguous enough, someone even started baking frameworks naming them “react.js” – this is unfortunately completely unrelated to the concept of reactive systems as we conceive them.

Ambiguities aside, a few years ago a group of people decided to put together a few non-functional requirements that, in their experience, are essential if you want to build good software – then they named this document the Reactive Manifesto.

Granted that I dislike manifestos, because they scream for attention and, due to their PowerPoint-like nature, they tend to be misunderstood and often overlooked, I have to admit that in this case, if you are truly, positively led by curiosity about what Reactive Systems are, without preconceptions (oh, yet another manifesto, …), you will surely agree that Reactive is the correct term to use here. I still wouldn’t have called it a manifesto, though, and I wouldn’t have asked people to sign it – but these are personal preferences 🙂

So, this document – the manifesto – describes Reactive Systems as responsive, resilient, elastic and message-driven.

 

In this article I will focus primarily on the first principle: responsive.

What Does It Mean?

Responsive means that our systems need to respond in a timely manner to offer a smooth experience. What does timely mean? 1 second? 3 seconds? There are tons of non-academic studies showing that if your customers have to wait more than X seconds, then Y % of them will choose a competitor. This is of course only one of the measurements you might be interested in. You could also be interested in how soon your brand new printer starts processing the text once it receives a new request. Is it OK to have the user of your service/product wait for 10 seconds? Maybe. Everything depends on the use case.

Why Is It So Difficult?

The challenge here is dealing with real systems – stuff that needs to be maintained, deployed, reviewed, etc., not proofs of concept or your Sunday experiments. For example, in the following picture we can see what a real-world architecture based on micro-services (at Netflix) looks like:

[picture: micro-services architecture at Netflix]

Saying that your system is responsive means that you have been able to solve many of the challenges that responsiveness brings along, and they are not always easy to solve: legacy systems that are still needed, more micro-services than necessary, network latency, performance overhead in the software and technologies used, unoptimized code, software running on old hardware. This is all part of responsiveness.

Old Hardware? What…?

Before we get into more details, let me share a little observation: since the Cloud has taken off, I have noticed that we developers are less and less careful about the resources that our software needs.

Nowadays, we think in terms of CPU units, which translate differently from one cloud vendor to another (1 vCPU can be a full core or just a half, more or less – there are lots of comparisons between AWS/Azure/GCP out there). However, with these machines on demand, we barely know what processors they have! Who cares? Just give it a t2.large instance and that’s it. Serverless architectures have widened this disconnection between developers and machines even further.

This is certainly part of a broader topic that involves cost optimization, resource consumption, and so forth, yet I consider it important, because it has an impact on responsiveness as well. If you use old machines, you may be disappointed.

What Is Responsiveness About?

Responsiveness has the noble goal of providing the best usability – it’s not fun to have to wait for 2 minutes to do something that we think could or should take less. Responsiveness is built on reliability and availability, and all three serve the goal of creating a valid SLA.

SLA: you may have in your contract that certain API calls will take no more than 10 seconds to respond, averaged over a day. By having numbers that define upper boundaries (how long should a response take?), we can quickly decide whether some event is exceptional and deserves attention – for example, did the last 10 requests take more than 5 seconds? If so, send an alert.
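The alert rule from that example fits in a few lines. This is only a sketch under assumptions: the `LatencyMonitor` name, the window of 10 samples and the 5-second threshold come from the example above rather than from any real monitoring product, and "send an alert" is reduced to a boolean.

```python
from collections import deque


class LatencyMonitor:
    """Hypothetical monitor: alert when every request in the window is slow."""

    def __init__(self, window=10, threshold_seconds=5.0):
        self.threshold_seconds = threshold_seconds
        # A bounded deque keeps only the most recent `window` samples.
        self.samples = deque(maxlen=window)

    def record(self, seconds):
        self.samples.append(seconds)

    def should_alert(self):
        # Only judge once the window is full, to avoid alerting on start-up.
        window_full = len(self.samples) == self.samples.maxlen
        return window_full and all(
            s > self.threshold_seconds for s in self.samples
        )
```

Requiring all samples in the window to be slow is a deliberately strict choice; a real setup might prefer a percentile or an average, depending on what the SLA actually promises.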

Availability: the famous 99.many-nines% that lots of cloud vendors offer. This simply measures the uptime/total-time, or the % of operable state of your service.

Reliability: often confused with availability, this is more related to stability and fault-tolerance. It measures how long a system performs its function without failing (over a given interval). For example, if the service is systematically down for 6 minutes every hour, its availability is 90%. However, its reliability is less than one hour, because it never runs for a full hour without an outage, which could be way more interesting than the overall availability percentage.
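The arithmetic is worth spelling out, because two services can share the same availability and behave very differently. In the sketch below I use mean time between failures as a stand-in for reliability, which is my assumption rather than a definition taken from the manifesto; service B (one single 144-minute outage per day) is made up for comparison.

```python
def availability(uptime_minutes, total_minutes):
    """Fraction of the observed period in which the service was up."""
    return uptime_minutes / total_minutes


def mean_time_between_failures(uptime_minutes, failure_count):
    """Average uninterrupted running time, in minutes, per failure."""
    return uptime_minutes / failure_count


DAY = 24 * 60  # minutes in one day

# Service A: down 6 minutes every hour, i.e. 24 outages per day.
a_uptime = DAY - 24 * 6
# Service B (hypothetical): one outage of 144 minutes per day.
b_uptime = DAY - 144

for name, uptime, failures in (("A", a_uptime, 24), ("B", b_uptime, 1)):
    print(
        name,
        f"availability={availability(uptime, DAY):.0%}",
        f"mtbf={mean_time_between_failures(uptime, failures):.0f} min",
    )
```

Both services come out at 90% availability, yet A runs for 54 minutes at a time while B runs for 1296 minutes, which is the difference the percentage alone hides.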

A responsive system should be available and reliable, otherwise it can’t stay responsive. Even responding with an error is certainly better than not responding at all. Also, when we have numbers we can act on error conditions, we can offer guarantees, and we can sell a service that always returns something.

Why Responsiveness?

In fact, responsiveness is often perceived as an optimization “feature”, like security. The infamous misunderstanding of Donald Knuth’s words “premature optimization is the root of all evil” didn’t help here.

Now, I love quality and I strongly believe it’s the main differentiator between multiple products – why choose X instead of Y, W, and Z. I also see the value in trying to get stuff done, though. So, why don’t we build, from the ground up, a mentality that leads to high quality products? A mentality that doesn’t procrastinate, that is not lazy and that believes that the product under development will take off and will be successful. I see more and more often that, due to this time-to-market madness, products lack a lot of non-functional features: security, quick responses, usability, depending of course on the domain. For some reason, we tend to think that non-functional requirements are useless. However, the fact that we have multiple search engines, multiple e-commerce sites, etc., should tell us that time to market is important, yes, but in the long term what matters is also the set of non-functional requirements. You can’t always think that your customers will use your products because you were the first. Eventually someone will do the same, do it better, and add a non-functional feature to it, like security, which seems pretty important lately.

Responsiveness is also important considering that most people are connected to the internet via a mobile phone, and when something is slow on a mobile phone, it looks twice as slow as on a laptop – probably because our focus is higher on the little screen.

Long story short: plan for responsiveness as early as possible in your product roadmap. Don’t procrastinate, trust developers and define a threshold with them – it can be as simple as “this call has to take up to X seconds”. Only if you have numbers can you brag about it, otherwise it’s pure speculation.

Reality Check – The Role of Technology

There is a sort of myth about responsiveness that tells us that one of the first steps to have responsive services is to choose a great technology. In fact, in the Web Services/SaaS world, it seems that technologies are often chosen by following a trend. As if that wasn’t enough, there are tons of benchmarks online, like https://www.techempower.com, that are often considered as a starting point to choose the next framework or whatever.

Now, it’s stupidly simple to say your API is responsive if all it does is return a canned response. No framework will disappoint you here, even some old CGI script is able to handle a gazillion calls per minute on a modern machine. There are also benchmarks offering some “dynamic” features – like querying. Still, the question I ask myself is: how relevant are those benchmarks for what we want to achieve?

I still believe it’s good to have such informative websites, because they give a rough idea about the computing power needed (which could decrease costs, like how big your EC2 instances have to be), yet we have to evaluate a technology properly before falling for it just because it’s in the top-10 fastest/quickest/<superlative-positive-adjective> technologies. If you look at the charts on the website mentioned above, as of today, Django is in deep trouble compared to almost any other technology out there. However, there are dozens of highly responsive websites using Django, for example Instagram, Disqus, Pinterest – you can find more here: https://stackshare.io/.

How Do We Achieve Responsiveness?

Having good technologies helps here. The same applies to good code, good design patterns, and so forth. However, if we are able to implement elasticity and resilience, we are set.

The next article will focus on those two principles.

 

A need for a better hiring process

Software engineering is a relatively new field. We are still trying to find a common denominator, a bunch of patterns, something that gives us a better overview about how things work, so that we can narrow them down and apply a process – like we do in any other engineering field, e.g., automotive, pharmaceuticals, etc. Something measurable, because when we have numbers we can actually do something. Without numbers it’s chaos and gut feelings, unicorns and failures.

One of those things we need to get better at is certainly the hiring process.

I am amazed by the number of jobs out there that don’t always match their own description – it almost seems they were written by a person within the company who dreams about that job day and night, a sort of Promised Land:

  • You will design and implement highly scalable services that need to serve customers all around the globe – we already have 20 million customers!
  • We work hard but also play hard – we have regular kicker tournaments and the best Espresso machine
  • We want you to learn and get better, we pay you $2000 per year that you can spend on books and courses of your choice

These are only a few of the promises. Sometimes they are quite good – I would not mind having 1k to spend on books and courses, to be honest. However, you will often find on Glassdoor that the same company offering you the Promised Land position also has the following cons, which often contradict what they claim in their job description:

  • management should trust employees more
  • no career opportunities
  • employees are expendable
  • high turnover rate

Similarly, I am quite certain that not every job applicant plays by the rules, and those boring years maintaining that legacy software will need more than some polishing.

While this is certainly a grey area, where ethics, self-promotion and personal/professional needs are different forces fighting against each other – or I should say “together”, maybe –  I am quite aware that people lie, so I prefer to focus on what the person and the company have to offer and how good their interactions are.

Ideally, everyone tells the truth during the hiring process. However, in order for this truth to come out, it needs to be told, written down somewhere, it has to become visible, and all potential ambiguities need to be cleared up once and for all during the interview process – questions need to be asked, especially difficult ones if you have doubts – there are lots of websites helping with what needs to be asked, just do your homework.

Also, let’s not forget that companies need to find the right candidate, because the first few months are just a cost (I think I read about it for the first time in Peopleware).

How can companies and candidates get to know each other better?

I believe that the best way to improve the hiring process in our field is by providing constant feedback and putting the candidate right into some of the team’s dynamics. Some companies I interviewed with in the last few years offer some feedback, but it’s almost never meant as a way to get to know each other. It’s like being at school over and over: how did I score? oh, cool, I made it!

For example, by analyzing what our daily tasks are, we could try to find out how candidates perform during a normal working day. This way, the feedback is mutual – candidates maybe don’t like the office, the managers, the colleagues, etc., while the company maybe doesn’t like how the candidate fits into the team. If that’s not possible, we could provide remote tools to support them. Accepting a candidate or a company shouldn’t happen as soon as possible, but as soon as they are convinced about each other.

I do understand that we, as human beings, tend to repeat what we know best. It’s smart, because we use what we already know and we don’t need to invest more time. However, it doesn’t have to be that way. At the very least, when you find out that something doesn’t work, maybe it has to be adapted a bit. So, I am talking to all HR people – please, do something about it. Don’t just hire people to fulfill the expectations, to fill the empty desks, to reach the desired head count. Talk to your HR managers, to the CTOs and so forth, and discuss it. Find a better way to hire people.

Struggling with CI and CD

Continuous Integration and Continuous Delivery are two practices that are supposed to boost the development, integration and delivery of features, bug fixes, etc.

Someone may claim that they actually improve only the QA/release side of it, but I like to take the position that “done is released” – this is certainly not something I invented – I can’t remember which book I took this from (maybe Continuous Delivery?).

Software development has changed and has become an industry, yet we still struggle to reach common definitions. We are full of best practices, de-facto standards, and so forth, yet it’s so easy to end up with people without a clear understanding of certain topics, or of the reasons things could be done in a certain way instead of doing them “because it has always worked for us”.

However, I have noticed a particular pattern: when something is a “practice” it often gets misunderstood. The best example is a REST web service. I really don’t want to get into this discussion again, because REST is an interesting concept, yet it has a million implementations. So many that developers in the end wind up frustrated with different ideas of REST.

Pragmatism is certainly important in our field, yet sometimes it’s important to have a common dictionary, so that we can refer to roughly the same concept, especially within the same company. This is unfortunately not always the case. CI and CD are only the tip of the iceberg, and there are many other examples, but I single them out because I think they are difficult to implement.

Why so? Because it’s actually difficult to understand how you want to do things and what you want to do with them. CI/CD is a practice, not a tool, so it needs to be understood first. It can be adapted to your needs, why not, but it should not be seen as a savior, because it’s not going to solve all your problems.

If you want to “do” CI and CD correctly, you need to have the right problem, the right mindset, and the right tools. I would personally not do CD with medical software or space components – my experience with these fields is too little, so I can’t say much – it’s just a rough idea.

However, given the right problem (which is difficult to define!), you need to have people with the right mindset: we need to do something new, and this is going to have a severe impact on how we do things here – you won’t be a QA or a DEV anymore, but you will take care of everything from A to Z. Literally.

And, of course, you need the right tools, otherwise it becomes a pain in the *** to manage all these pipelines, tasks, failures, rollbacks, etc. Fortunately, nowadays we have plenty of them, even open source.

So, what sort of problems does CI/CD try to help us solve?

I would say that the first and foremost problem they help us with is the time to market – which is essential in business. Then everything else comes almost as a consequence, like “batteries included”:

  • code is always in a deployable state
    • it has passed all the QA rounds of testing, etc.
      • it offers a feeling that things are safe/green, which is always good to have
    • it was built, so it’s ready to be installed, etc.
  • tendency to have metrics-based pipelines
    • for example, if some component doesn’t reach X% of coverage etc., then it won’t be promoted to the next stage
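A metrics-based gate like the coverage example can be as small as the function below. It is a sketch, not the API of any CI server: `can_promote`, the metric names and the threshold values are hypothetical, and a missing metric is treated as 0 so that it fails the gate rather than silently passing.

```python
def can_promote(metrics, thresholds):
    """Return (ok, reasons): ok is True only if every metric meets its minimum.

    Both arguments map a metric name to a number, e.g. {"coverage": 82.0}.
    """
    reasons = [
        f"{name}: {metrics.get(name, 0.0):.1f} < {minimum:.1f}"
        for name, minimum in thresholds.items()
        if metrics.get(name, 0.0) < minimum
    ]
    return (not reasons, reasons)


# Hypothetical numbers for one build of one component.
ok, reasons = can_promote({"coverage": 72.5}, {"coverage": 80.0})
print("promote" if ok else f"blocked: {reasons}")
```

Returning the reasons along with the verdict keeps the pipeline honest: a blocked build tells people which number to look at instead of just turning red.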

What kind of mindset does it require?

It asks people to take what they have always done, wrap it up, and throw it away. Sometimes it even asks them to wipe their *** with it. Pardon my French.

It requires people to think in a way that is deterministic, repeatable, stateless, yet in units that have to be integrated to make “the whole greater than the sum of its parts” (just to mention Aristotle).

It’s not enough to say “we need to commit on a daily basis”. It just doesn’t work that way. The same applies to “we need to achieve X% coverage so that the builds are self-testing”, where X is a ridiculously high number (considering that the coverage is now below sea level). That will not only be counterproductive, but will also end in frustration.

Depending on where you work, this may be harder or easier to implement. Having the right tools here helps a lot, educating people helps even more. In my opinion, the best way to achieve something here is by taking the time to explain the value this new approach offers, compared to what has been done so far, together with all the challenges this implies.

What’s the lesson here?

I think progress is never easy to achieve. It takes courage, an open mindset, some stubbornness, and sometimes also the honesty to say “ok, it doesn’t work this way, maybe it needs some improvements”.

Further Readings

There’s plenty of material to learn about CI/CD, however some of the most important articles about these two practices are from Martin Fowler:

Neolithic

I guess that when most people hear the word “Neolithic”, they simply think about something old, not very old, but old enough to be dated back to a past when our ancestors could not eat a juicy pizza or enjoy a fresh beer – it seems they’d be wrong in both cases.

Funny thing is that most people use the term “Paleolithic” to label something that is very old, something that has “aged”, which you wouldn’t do any more, that has most of the time a relatively negative connotation. Paleolithic and Neolithic, though, are both part of the same larger period known as Stone Age.

However, going back to Neolithic, I think it should be considered again, this time with a more correct and modern interpretation – this word should actually be associated with the “Neolithic revolution”, which according to Wikipedia was:

[…] the wide-scale transition of many human cultures from a lifestyle of hunting and gathering to one of agriculture and settlement, making possible an increasingly larger population.

Now, looking at our history, we can often find that we humans, or I should say Western people (as I don’t have enough knowledge about how this whole Stone Age thing is perceived in the East), tend to think that there are some periods that produced “wide-scale transitions […]”. One recent example is industrialization.

I frankly dislike imagining that such periods happened suddenly, as if our ancestors in the Paleolithic period were all limited or not capable of farming, or as if people in the Middle Ages were all ignorant and illiterate – let’s try to recall a few things that we use and that were created roughly in such a “bad period”: glasses, universities, printed books. Progress is something that doesn’t happen right away, and if people in the Neolithic had the chance to improve their farming techniques I am quite sure that they also had to thank the “poor” guys who were born in the Paleolithic.

Why am I being biased towards Paleolithic then, claiming that they were “poor” guys? Well, that’s the point – you see. This is the main issue that we face when we look at something that happened in a relatively remote historical period – we almost always think that in the past it was all bad, all terrible, etc.

Why Neolithic then? Well, at this point you should have guessed it already. If we are who we are and we have what we have, we certainly have to say thanks to the progress and improvements created over the years. Oh, I almost forgot: of course, today is better than yesterday.