Different strategies that drive software quality
Despite much debate about the ‘right’ way to test, software development is a process, and one that is almost impossible to get perfectly right. At some point in the 2000s, when PHP was no longer considered just a templating language and Ruby had just got on Rails, the programming community decided that dynamically typed languages were a great way to reduce cognitive load on the programmer, stay ruthlessly pragmatic, and avoid the factory-like culture of Java.
Since then, the Facebook mantra of ‘move fast and break things’ has spread all over the start-up scene, and is treated as the way to build businesses, products, houses, or anything else that can collapse on you.
And, if you’ve ever worked on software, you know that what can happen, generally will happen.
Somewhere along the way clean code arrived, and the rise of TDD gathered a crowd around ‘writing tests first’ and the belief in ‘the coverage’ – and other mantras that are fun to recite, but significantly less fun to do.
A lot of self-help books, like Clean Code, are vaguely based on personal experience and bring certain programming patterns into domains where they had traditionally not been used, for example simplifying C++-esque object orientation with functional programming concepts such as small, simple, pure functions.
In reality, there has been a lot of research on the software crisis and how to get out of the mess we’re in, and it often contradicts the wisdom of the crowd.
So, let’s take a look at different strategies that drive software quality, and where they actually make a difference. From bottom to top, I generally look at software verification at the following layers: type system, unit tests, integration tests, and organisational management structure.
Organisational management structure?
Well, maybe we can start with that then.
Organisational behaviour is a social science in its own right, and studies the subtle art of people-management structures. Software is usually made by humans, who have needs and internal and external motivators, and who normally have to work together to deliver something.
Google ran a study on its teams as working units to identify what made them more effective, while Microsoft’s research focused on how organisational structure determined software failure rates.
Both approaches are interesting in their own way, and Microsoft’s study gives us a revealing view into the development of Windows Vista.
The study tells us that smaller, more focused teams produce more reliable software. High churn of engineers lowers software quality, while tighter collaboration between teams working on the same project results in lower failure rates.
These might seem like statements from Captain Obvious. However, quantifiable results from the study show that team and collaboration structure can be a better predictor of quality than tooling, testing strategies, or other code-based metrics.
The way any individual team controls the quality of their output is the next step from here. Code reviews in particular are a great way to create and maintain a common set of standards.
Written code reviews force engineers to communicate their concerns clearly. This increased technical communication helps everyone on the team learn about different styles and perspectives, and at the same time helps level the skills across the team.
Integration tests, surprise surprise, test the integration of components or modules in a system. You can also test the integration of integrated modules, and there are turtles all the way down (infinite regress).
It’s often easier to write correct code in isolation – a large number of bugs occur at system boundaries: validating inputs and formatting outputs, failing to check permission levels, or badly implementing interface schemas.
This problem is amplified by the current trend of microservices, where interface versions can fall out of sync between various services within the system. At this level, we’re best off writing pass-through end-to-end tests for features, and trying to leverage the fact that we have so many other layers of protection against failures – something will eventually trip those wires.
In fact, a study by Niedermayr, Juergens and Wagner has shown that code coverage in integration tests is not a reliable indicator of failure rates. If you look at it another way, production is just one big integration test.
A trick I love is to create a mute-production instance, which receives a portion of the actual production traffic but never returns its responses to the users.
With enough investment in a stateless orchestration layer, we can even mute-test subtrees of services at strategic places, then make them active and discard the old subtree once the workload is gone.
Coupled with principles behind building highly observable systems, this kind of test environment removes a lot of anxiety around what happens when we deploy to production, because the mute-prod will receive precisely the same data.
The more knobs and probes we expose in live systems, the better visibility we get into the internals.
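As a rough illustration of the mute-production idea, here is a minimal Python sketch under some loud assumptions: the names `MirroringRouter`, `live_version` and `candidate_version` are hypothetical, requests are plain strings, and the mirroring happens in-process and synchronously. A real setup would copy traffic out of band at the load balancer or orchestration layer, so that the mute instance can never slow down a user request.

```python
import random
from typing import Callable, List, Tuple

Handler = Callable[[str], str]


class MirroringRouter:
    """Serves users from `live` and copies a sample of traffic to `mute`."""

    def __init__(self, live: Handler, mute: Handler,
                 sample_rate: float, rng: random.Random) -> None:
        self.live = live
        self.mute = mute
        self.sample_rate = sample_rate
        self.rng = rng
        # Crashes and mismatches seen on the mute instance, kept for inspection.
        self.findings: List[Tuple[str, str]] = []

    def handle(self, request: str) -> str:
        response = self.live(request)
        if self.rng.random() < self.sample_rate:
            self._mirror(request, response)
        # The user only ever sees the live response.
        return response

    def _mirror(self, request: str, live_response: str) -> None:
        try:
            mute_response = self.mute(request)
        except Exception as exc:  # a failing mute instance must never hurt users
            self.findings.append((request, f"mute raised {exc!r}"))
            return
        if mute_response != live_response:
            self.findings.append((request, f"mute returned {mute_response!r}"))


def live_version(request: str) -> str:
    return request.upper()


def candidate_version(request: str) -> str:
    # Hypothetical regression: the candidate chokes on empty requests.
    if not request:
        raise ValueError("empty request")
    return request.upper()


router = MirroringRouter(live_version, candidate_version,
                         sample_rate=1.0, rng=random.Random(0))
responses = [router.handle(r) for r in ["hello", "", "world"]]
```

Users get a response for every request, including the empty one, while `router.findings` records that the candidate would have failed on it. That record is the kind of signal you want before making the candidate active.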
So, in order to integrate modules, we want some confidence that the modules themselves work according to specification.
This is where unit testing enters the picture. Unit tests are usually fast, more or less comprehensive tests of isolated, easy-to-grasp building blocks.
Ruby on Rails’ master repo runs about 67 tests, and 176 assertions per second. As a rule of thumb, one test should cover one scenario that can happen to a module.
In comparison to integration testing, the same study by Niedermayr, Juergens and Wagner shows that code coverage on a unit testing level does influence failure rates, if done well.
A 1994 study by Hutchins et al. found that test sets with coverage levels over 90% usually showed better fault detection than randomly chosen test sets of the same size, and that ‘significant’ improvements occurred as coverage increased from 90% to 100%.
The BDD movement has this fun practice of developing specifications and turning them straight into unit tests.
Clear, human-readable unit tests help document the code, and ease some of the knowledge transfer that needs to happen when developers inevitably come and go, or when the requirements of existing components change.
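To show what specification-as-test can look like, here is a small sketch using Python’s standard `unittest` module. The function `apply_discount` and its rules are made up for the example. Each test covers exactly one scenario, and each test name reads like a line from the specification.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical unit under test: reduce `price` by `percent`."""
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class DiscountSpec(unittest.TestCase):
    # One scenario per test; the name states the expected behaviour.
    def test_a_zero_percent_discount_leaves_the_price_unchanged(self):
        self.assertEqual(apply_discount(80.0, 0), 80.0)

    def test_a_quarter_off_reduces_the_price_by_a_quarter(self):
        self.assertEqual(apply_discount(80.0, 25), 60.0)

    def test_a_discount_above_100_percent_is_rejected(self):
        with self.assertRaises(ValueError):
            apply_discount(80.0, 150)


def run_spec() -> unittest.TestResult:
    """Run the spec and return the result object for inspection."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(DiscountSpec)
    return unittest.TextTestRunner(verbosity=2).run(suite)


if __name__ == "__main__":
    run_spec()
```

With `verbosity=2` the runner prints each test name as it runs, so the output doubles as a short description of the component for whoever inherits it.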
Unit testing in my book also includes QuickCheck-style generated tests. The idea of QuickCheck is that instead of having some imperative code of if-this-then-that to walk through, the programmer can list assumptions that need to hold true for the output of the function given some inputs.
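To make the idea concrete, here is a hand-rolled sketch in Python. It is not QuickCheck itself or any real library’s API: `find_counterexample`, `shrink` and the deliberately broken `buggy_sort` are all made up for the example. The sketch generates random lists, checks a stated property against them, and greedily reduces any failing input by dropping elements. Real implementations are far more sophisticated on both counts.

```python
import random
from collections import Counter
from typing import Callable, List, Optional

Property = Callable[[List[int]], bool]


def shrink(prop: Property, failing: List[int]) -> List[int]:
    """Greedily drop single elements for as long as the property still fails."""
    current = failing
    progress = True
    while progress:
        progress = False
        for i in range(len(current)):
            smaller = current[:i] + current[i + 1:]
            if not prop(smaller):
                current, progress = smaller, True
                break
    return current


def find_counterexample(prop: Property, rng: random.Random,
                        runs: int = 200) -> Optional[List[int]]:
    """Try random integer lists; return a shrunk failing input, or None."""
    for _ in range(runs):
        size = rng.randint(0, 10)
        candidate = [rng.randint(-50, 50) for _ in range(size)]
        if not prop(candidate):
            return shrink(prop, candidate)
    return None


def buggy_sort(xs: List[int]) -> List[int]:
    # Deliberate bug for the demo: going through a set drops duplicates.
    return sorted(set(xs))


def sorting_keeps_every_element(xs: List[int]) -> bool:
    """Property: the output is ordered and contains exactly the input elements."""
    out = buggy_sort(xs)
    ordered = all(a <= b for a, b in zip(out, out[1:]))
    return ordered and Counter(out) == Counter(xs)


counterexample = find_counterexample(sorting_keeps_every_element,
                                     random.Random(42))
```

The property only fails when the input contains a repeated value, so whatever random list triggers the failure gets reduced to a list holding the same number twice. A two-element counterexample is much easier to debug than the ten-element list that first exposed the problem.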
QuickCheck then generates tests that try to falsify the stated assumptions using the implementation and, if it finds a counterexample, reduces it to a minimal input that proves them wrong. Interestingly, the number of scenarios a unit test normally has to cover is heavily influenced by the programming language it’s written in.
Which necessarily leads me to the holy flamewar about static and dynamic typing.
Hindley-Milner: languages with Hindley-Milner-style static type systems, such as Haskell or Rust, force the programmer to establish contracts that are checked before the program can be run. What the programmer finds, then, is that they’re suddenly programming two languages in parallel: the type system provides a proof for the program, while there is also a program that fulfils the requirements.
This allows a style of reasoning about correctness that, coupled with a helpful compiler, will allow the programmer to focus on things that cannot be proven by the type system: business logic. Of course, this is not a net win. In many cases, a solution using a dynamic type system is much more straightforward and elegant, or, in other cases, the type system constraints make certain implementations basically impossible.
In other cases, using Haskell allowed a much smaller program to be written significantly faster than the alternatives, so much so that those running the test had to repeat it in disbelief.
Elegance is in the eye of the beholder, and a beautifully typed abstraction that reduces to a simple state machine during compilation can be just as attractive as a quick and dirty LISP macro.
Sometimes, all that complexity is hard to justify only to please the compiler. Knowing when it is worth it comes through experience, taste, and applying reasoned judgement in the right situation.
People often look at programming as craftsmanship. Yes, we do know how to do the math, but with this neat trick we get 90% there with 10% of the effort, and it may just be good enough – and it will explode in that edge case I think would never actually happen – but I digress.
Weak but static: somewhat more approachable, but providing less stringent verification, are languages in the C++/Java/C#-style OOP family, as well as the likes of C and Go. The type systems here allow for a different kind of flexibility, and more desirable escape hatches to the dynamic world.
A weaker, but still static, type system provides fewer guarantees about the correctness of the programs, something that we have to make up for in testing, and/or coding standards. NASA’s Jet Propulsion Lab, a mass manufacturer of Mars rovers, maintains a set of safe programming guidelines for C.
Their guidelines seem to be effective. Opportunity exceeded its originally planned 90 days of activity by 14 years via careful maintenance. Curiosity is still cruising the surface of Mars, and is being patched on a regular basis.
Dynamic: speaking of JPL, internet folklore preserves the tale of the NASA lab using LISP, a dynamic, functional programming language dating from the late 1950s that’s still regarded as one of the most influential inventions in computer science. One of today’s most commonly used LISP dialects is Clojure, which is increasingly popular in data science circles. Dynamic languages provide ultimate freedom, and little safety.
Most commonly, the only way to determine if a piece of code is in any way reasonable is to run it, which means our testing strategy needs to be more principled and, indeed, thorough, as there’s no ‘next layer’ to fall back to.
In the end, a lot of the arguments boil down to subjective ideas and tastes about software architecture. It seems difficult to deny, however, that some form of static type checking provides several benefits to scaling and maintaining software projects.
Software quality slips away
The list of things we can do to ensure the correctness of software is far from complete. The ‘state of the art’ keeps pushing further, and new approaches gain popularity quickly, especially within the security community.
In absolutely critical modules, such as anything cryptography- or safety-related, formal verification can increase confidence in parts of the system, but it’s hard to scale. A similar sentiment can be seen behind the principles of LangSec (langsec.org).
In many cases, the power and expressiveness of our languages allow inadvertent bugs to creep in. LangSec says: ‘make all the invalid states unrepresentable by the language itself’. Make the language limit what the programmer can do, so they can avoid what they shouldn’t.
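One way to read that principle in everyday code is to recognise input fully at the boundary and turn it into a type that cannot hold an invalid value. Below is a minimal Python sketch of that reading. The `Port` type and `listen_address` function are made up for illustration, and since Python only enforces this at construction time, a language like Haskell or Rust would give stronger guarantees at compile time.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Port:
    """A TCP port that is valid by construction (1..65535)."""
    number: int

    def __post_init__(self) -> None:
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(self.number, bool) or not isinstance(self.number, int):
            raise TypeError("port must be an integer")
        if not 1 <= self.number <= 65535:
            raise ValueError(f"port out of range: {self.number}")

    @classmethod
    def parse(cls, text: str) -> "Port":
        """Recognise the whole input up front and reject anything else."""
        if not text.isascii() or not text.isdigit():
            raise ValueError(f"not a port number: {text!r}")
        return cls(int(text))


def listen_address(host: str, port: Port) -> str:
    # No range check needed here: a Port cannot hold an invalid number.
    return f"{host}:{port.number}"


address = listen_address("localhost", Port.parse("8080"))
```

Because the instance is frozen and validated when it is created, every function that accepts a `Port` can skip its own checks. The set of things that can go wrong shrinks to the single place where untrusted text enters the program.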
Limiting what the programmer can do is also the motivation behind coding standards such as JPL’s, which allow for easier reasoning about state and data flow throughout the program code. When we’re reasonably sure that what we have is good enough, we can start fuzzing it. Fuzzing is great.
It is all about feeding unexpected states into a system, and waiting for them to cause failures. This simple idea helps discover security holes in popular software, and can help engineer chaos in the cloud.
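A toy version of the idea fits in a few lines of Python. Everything here is hypothetical: `last_payload_byte` is a made-up parser with a deliberate bug, and `fuzz` is a naive random fuzzer, nowhere near the coverage-guided tools used in practice. The only design decision worth noting is the split between documented rejections (`ValueError`) and everything else, which counts as a finding.

```python
import random
from typing import Callable, List, Tuple


def last_payload_byte(data: bytes) -> int:
    """Toy parser: byte 0 declares the payload length; return the last payload byte."""
    if len(data) < 2:
        raise ValueError("message too short")
    length = data[0]
    if length == 0:
        raise ValueError("empty payload")
    # Deliberate bug for the demo: the declared length is trusted without
    # checking it against the number of bytes that actually arrived.
    return data[length]


def fuzz(target: Callable[[bytes], object], rng: random.Random,
         runs: int = 500, max_len: int = 8) -> List[Tuple[bytes, Exception]]:
    """Throw random byte strings at `target` and collect unexpected failures."""
    crashes = []
    for _ in range(runs):
        size = rng.randint(0, max_len)
        data = bytes(rng.randrange(256) for _ in range(size))
        try:
            target(data)
        except ValueError:
            pass  # documented rejection of bad input, not a bug
        except Exception as exc:  # anything else is a finding
            crashes.append((data, exc))
    return crashes


crashes = fuzz(last_payload_byte, random.Random(7))
```

Every entry in `crashes` pairs an input with the exception it caused, which is enough to reproduce the failure and turn it into a regression test.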
As always, producing a stable and secure system requires principled engineering, in software just as much as in architecture. We need to understand the pieces that make up the whole, then analyse and verify their interactions, both internally and with the environment.
Despite our best efforts, software quality will diminish, bugs will always creep in, and all we can do is try to ensure the ones that remain are not catastrophic.
However, once software goes live, verification does not stop. Designing for observability by exposing knobs, tracing, alerting, and collecting a set of operational metrics all help us reason about the state of the system while it’s running, which is the ultimate test of it all.
Software development is a process, and one that is practically impossible to get perfectly right.
As long as the team has a plan to approximate it, and everybody is committed to that plan, we can call it good enough – and then get out of the office and enjoy the sunshine.
Peter Parkanyi, lead security architect, Red Sift