Software Reliability with Dishwashers

Some weeks ago my dishwasher started leaking water from the front. At first, I thought the water filter was full, and after I cleaned it properly, the issue went away.

Until some days ago.

This time it came back with much more stubbornness, showing errors on the display. The dishwasher is an AEG, and the error code looked like ,10 (yeah, comma included). I googled a bit, and it seems to mean that the dishwasher can’t load any more water. That was quite surprising, as just a few minutes after the cleaning program had started, I could clearly hear water flowing through the pipes. Then suddenly silence, and that error message.

The QA side of me forced me to press the buttons on the machine at random, with extra persistence on the Reset one, hoping it would just heal itself – who knows, maybe it was overwhelmed, and pressing F5 typically works 🙂 Strangely enough, the error code disappeared when I pressed the arrow-down button, which was totally unexpected, and the program resumed. I was happy until the dishwasher started leaking water again, and this time another error code appeared: ,30.

I googled again, and I found what seems to be a fail-safe mechanism from the manufacturer to prevent the appliance from leaking too much water – which could be dangerous if there are kids, pets, or cables on the ground, I guess. It seems this is a feature called Aquastop.

That was an interesting finding. I googled (again) a bit to understand how this is all connected together, and I found a short video explaining in five minutes how the Aquastop works. It immediately reminded me of the Circuit Breakers that we use for reliability patterns, for example in microservices.
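For readers who haven’t met the pattern, here is a minimal sketch of the circuit breaker idea in plain Scala. It is only an illustration: the names (SimpleCircuitBreaker, maxFailures, call, reset) are made up for this post, it is not thread-safe, and it has no half-open state or timeout like production implementations usually do. Like the appliance, it simply refuses to let anything through once too much has gone wrong, until someone resets it.

```scala
// Minimal, single-threaded sketch of the circuit breaker idea.
// All names here are hypothetical and exist only for illustration.
import scala.util.{Failure, Success, Try}

final class SimpleCircuitBreaker(maxFailures: Int) {
  private var failures = 0

  // Once open, the breaker stays open until it is reset by hand,
  // much like the appliance waiting for a repair.
  def isOpen: Boolean = failures >= maxFailures

  def reset(): Unit = { failures = 0 }

  // Runs `body` only while the breaker is closed and tracks consecutive failures.
  def call[A](body: => A): Try[A] =
    if (isOpen) Failure(new IllegalStateException("circuit open: call rejected"))
    else Try(body) match {
      case ok @ Success(_) =>
        failures = 0 // a success clears the failure streak
        ok
      case ko @ Failure(_) =>
        failures += 1
        ko
    }
}
```

With maxFailures = 2, two failing calls in a row open the breaker, and every later call is rejected without running its body until reset() is called.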

I like to think that something like the Aquastop happens when a team of smart engineers sits down together and tries to solve a real problem in a creative way. It’s astonishing what we can learn from electrical engineers, or more specifically, from the products we use on a daily basis, if only we had the time to disassemble stuff and see how it was done. In this case, obviously the product can’t heal itself: maybe the water hose is perforated, it leaks water, and so on. However, the gist of it is this: a simple monitoring tool that sends a signal to close the water inlet valve when there is too much water where it shouldn’t be.

Brilliant.

What’s next? Pingdom for washing machines?

Akka metrics and traces with kamon.io

Lately I have been working again with Akka, a fantastic framework for building concurrent, fault-tolerant systems.

At first, it came as a surprise to me that besides Lightbend telemetry there was almost nothing “officially developed” for something that I consider essential to build a reactive system.

As you may have seen in the previous post, responsiveness without numbers is a bit weird:

– we are cool

– do you have numbers to say that?

– no

– then you are not cool. Not at all 🙂

I am not entirely sure how much a Lightbend subscription would cost – you need to get in touch with sales, and the contract probably depends on the volume of your apps, the number of nodes, etc. I am quite sure it would not be as expensive as one might think. Still, I would prefer not to pay for something that I consider basic; for me, this is not ancillary.

Enter Kamon.io.

Kamon.io is a quite powerful set of open source libraries that you can plug into your JVM-based projects to gather metrics, traces, spans, and so forth. It’s based on AspectJ, which may not be the most standard way to do things in Scala, but we have to admit that Akka is another kind of beast. In plain Scala you might use stackable traits to provide metrics, but in Akka they feel like hacks (see here, for example) – it’s not fun that you can’t really “stack” the receive method. Even then, how would you intercept messages going through system actors? You couldn’t; that would have to be done by the Akka core team.

Now, the library is quite easy to integrate with – it takes more time to understand what you actually want to measure – see the quickstart. I am going to skip this part, because it’s already documented.

What I would like to show you is how we collect custom metrics – as this is not documented anywhere.

Custom Metrics

As we are going to need Kamon.io to collect metrics, it might be a good idea to use the same approach based on AspectJ, so that the final result is like an extension of the original library that we create based on our needs.

Bear in mind that you could write something like this every time you want to add something to your metrics:

Kamon.counter("app.orders.sent").increment()

but eventually you’ll get tired of it, considering it will bloat your actors’ code. It’s like writing a log line for each new request your web server handles – most of the time, web frameworks provide filters that you can apply before/after certain events, so there is no need to add a “log.info” statement everywhere – just create and apply a filter. If you have many actors and many events to record, extracting the handling part might be a better option.

Now, all you need to do is the following: create a new module in your project dedicated to handling custom metrics. Then create the aspect that intercepts the events and takes the corresponding action (in this case, simply incrementing a metric):


package metrics.instrumentation

import org.aspectj.lang.annotation._
import models.messages.CustomActorEvent
import metrics.CustomMetrics

@Aspect
class CustomActorInstrumentation {

  // Matches every message delivered to CustomActor; `msg` is bound to the second argument.
  @Pointcut("execution(* org.mypackage.actors.CustomActor.aroundReceive(..)) && args(*, msg)")
  def onCustomActorMessagePointcut(msg: Any): Unit = {}

  // Runs before the actor handles the message and records the events we care about.
  @Before("onCustomActorMessagePointcut(msg)")
  def onCustomActorMessageHandler(msg: Any): Unit = {
    val customMetrics = CustomMetrics.forSystem("my-system")
    msg match {
      case _: CustomActorEvent =>
        customMetrics.customEvent.increment()
      case _ =>
        // Ignore everything else; without this case any other message throws a MatchError.
        ()
    }
  }
}

and the CustomMetrics object that wraps all the metrics you want to record – you can find an interesting way to do it here.
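CustomMetrics itself is not shown in this post, so here is a minimal sketch of one shape it could take, assuming only what the aspect uses: a forSystem(name) lookup and a customEvent counter with an increment() method. To keep the sketch runnable without Kamon on the classpath, the counter is a stand-in (SketchCounter, a hypothetical name) backed by a LongAdder; in a real module that field would be a Kamon counter, created the same way as Kamon.counter("app.orders.sent") above.

```scala
import java.util.concurrent.atomic.LongAdder
import scala.collection.concurrent.TrieMap

// Stand-in for a metrics counter (hypothetical), so the sketch runs without Kamon.
final class SketchCounter(val name: String) {
  private val total = new LongAdder
  def increment(): Unit = total.increment()
  def value: Long = total.sum()
}

// One instance per actor system, holding every custom metric we want to record.
final class CustomMetrics private (system: String) {
  val customEvent = new SketchCounter(s"$system.custom-event")
}

object CustomMetrics {
  private val instances = TrieMap.empty[String, CustomMetrics]

  // The aspect calls this on every intercepted message,
  // so instances are cached by system name instead of being rebuilt.
  def forSystem(system: String): CustomMetrics =
    instances.getOrElseUpdate(system, new CustomMetrics(system))
}
```

The caching matters because the handler looks the metrics up on each message; creating new counters every time would lose the running totals.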

Now, CustomActorEvent is a trait. Why do I use pattern matching on a trait, instead of the real message that is received by the actor? As mentioned here:

  • It is a good practice to put an actor’s associated messages in its companion object. This makes it easier to understand what type of messages the actor expects and handles.

Therefore, we define messages inside the companion object that extend a trait that can be easily put into another package, so that we don’t have a tight coupling between the metric-handler and the actor itself.
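As a rough illustration of that layout, the sketch below uses made-up names (OrderActor, OrderSent, OrderCancelled, Tick, MetricHandler) and leaves Akka out so that it runs on its own. The point is only that the metric side depends on the shared trait and never on the actor or its concrete messages.

```scala
// Shared marker trait; in the post it lives in models.messages.
trait CustomActorEvent

// Stand-in for the actor class (hypothetical); a real one would extend akka.actor.Actor.
class OrderActor

object OrderActor {
  // Messages are declared in the companion object; only some are worth counting.
  final case class OrderSent(id: String) extends CustomActorEvent
  final case class OrderCancelled(id: String) extends CustomActorEvent
  case object Tick // internal message, deliberately not a CustomActorEvent
}

object MetricHandler {
  // Knows only the trait, so new messages can be added to the actor
  // without touching the metrics module.
  def shouldCount(msg: Any): Boolean = msg match {
    case _: CustomActorEvent => true
    case _                   => false
  }
}
```

Adding a new countable message then means extending the trait in the companion object; the handler stays as it is.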

One last thing worth mentioning: don’t forget to create the corresponding aop.xml file (under META-INF) in your new module with the content you need:


<!DOCTYPE aspectj PUBLIC "-//AspectJ//DTD//EN" "http://www.eclipse.org/aspectj/dtd/aspectj.dtd">
<aspectj>
  <aspects>
    <aspect name="metrics.instrumentation.CustomActorInstrumentation"/>
  </aspects>
  <weaver>
    <include within="metrics.instrumentation..*"/>
  </weaver>
</aspectj>


You can find very useful information about the configuration in the AspectJ documentation.

Good to Know

You will need the following sbt plugins (in plugins.sbt) if you plan to use the approach described above:


addSbtPlugin("io.kamon" % "sbt-aspectj-runner" % "1.1.0")
addSbtPlugin("com.lightbend.sbt" % "sbt-javaagent" % "0.1.4")
addSbtPlugin("com.lightbend.sbt" % "sbt-aspectj" % "0.11.0")


Now, a question I would like to ask you is the following: what metrics are you collecting?