Creating Reliable Tests: Ensuring Confidence in Test Code
Chapter 1: The Dilemma of Trusting Tests
When tests are intended to confirm that production code operates correctly, how can we ascertain that the tests themselves are reliable?
This question has puzzled me for quite some time. Why do we trust test code while remaining skeptical of production code? Tests are code too, so they are susceptible to bugs and can fail in numerous ways. A test might report that the code is faulty when it is actually working, or it might report that the code is correct when it is not. To have faith in a test, it must be free of bugs: it should fail when the code fails and pass when the code behaves as expected.
The challenge is that test code can, and often does, fail. Throughout my career, I have encountered numerous flawed tests. A common example is a Selenium test that fails intermittently due to timeouts, an issue I have heard excused far too many times. Such failures should not be waved away, because they often mask underlying race conditions. I have seen many race conditions hidden behind unreliable tests like these.
The other prevalent problem is the opposite one: tests that do not fail even when the code is broken. This typically happens when a test is written after the code has been verified manually. The code runs correctly and the test passes, so we assume the test is accurate, yet we never actually tested the test itself. These silent failures often go unnoticed, despite being among the most common issues.
Without a robust strategy for validating test code, we should refrain from placing our trust in it.
This conundrum has led me to actively seek solutions. Ultimately, it appears we cannot blindly trust tests; we must implement techniques to ensure their reliability.
Section 1.1: Strategies for Reliable Testing
The first strategy I discovered is the AAA pattern, also known as GWT. AAA stands for Arrange, Act, and Assert, while GWT stands for Given, When, Then. The pattern is a framework for structuring tests in three distinct parts: Arrange sets up the code in a known state, Act executes the operation being tested, and Assert checks that the outcome matches expectations. This structure keeps test complexity low: it allows only one action per test, discourages interleaving Act and Assert, and promotes brevity. According to Gherkin guidelines, scenarios should ideally contain three to five steps, which improves the quality of tests by simplifying their logic.
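As a rough illustration of the three parts, the sketch below tests a small `ShoppingCart` class. The class, its methods, and the prices are assumptions made up for this example, not taken from any real library.

```python
class ShoppingCart:
    """Hypothetical class under test; prices are kept in integer cents."""

    def __init__(self):
        self._items = []

    def add(self, name, price_cents):
        self._items.append((name, price_cents))

    def total_cents(self):
        return sum(price for _, price in self._items)


def test_total_sums_item_prices():
    # Arrange (Given): put the cart into a known state
    cart = ShoppingCart()
    cart.add("book", 1250)
    cart.add("pen", 199)

    # Act (When): perform the single operation being tested
    total = cart.total_cents()

    # Assert (Then): compare the outcome with the expected value
    assert total == 1449
```

There is exactly one Act step, and no further actions follow the assertion, so a failure points at one operation only.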
Subsection 1.1.1: The Importance of Specific Examples
The second strategy involves utilizing concrete examples. I've encountered tests that derive data from various sources, perform calculations, and then expect the code to yield the same results. Such tests can become tautological, essentially validating the code against something too similar to itself. Instead, we should treat tests like exam questions: we must know precisely what we want to ask the code and what the expected answers should be. This approach enhances our confidence in the tests, as it provides a clear description of our expectations.
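The difference between a tautological test and an exam-question test can be sketched as follows. The function `vat_inclusive` and the numbers are hypothetical and chosen only for illustration.

```python
def vat_inclusive(net_cents, rate_percent):
    """Hypothetical function under test: add VAT to a net amount in cents."""
    return net_cents + net_cents * rate_percent // 100


def test_vat_tautological():
    # The expected value is computed with the same formula as the code,
    # so a mistake in the formula would be copied into the test as well.
    net, rate = 1000, 20
    expected = net + net * rate // 100
    assert vat_inclusive(net, rate) == expected


def test_vat_concrete():
    # The expected value is a literal worked out by hand: 20% of 1000 is 200.
    assert vat_inclusive(1000, 20) == 1200
```

Both tests pass here, but only the second one states an answer that was decided independently of the implementation.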
However, these two strategies alone do not fully address the issue. They reduce the likelihood of bugs in tests, but they do not guarantee that any given test is itself correct. Thus, I sought a deeper solution.
Section 1.2: The Fail-First Technique
I discovered a technique that allows us to verify the accuracy of the test code: the fail-first approach.
This is where Test-Driven Development (TDD) becomes significant. Why do I place my trust in TDD? Consider this: when you write a test before implementing the associated code, the expectation is that the test will fail. If the test passes before the code is written, it indicates one of two possibilities: either the functionality is already in place (which is rare) or the test itself is flawed. This leads to the first validation of the test.
Now, suppose you create a test before writing the code and, as expected, it fails. Can you trust the test code? Not necessarily; it might be checking for something different from what you intended. You then write the minimal code necessary to pass the test. If the test now passes, that is evidence it was checking the intended behavior. If it still fails even though the code does what you meant it to do, the test was not checking the intended functionality and needs to be adjusted. This is the second layer of validation for the test.
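The two validations can be simulated in a single script. This is only a sketch: in a real TDD cycle the stub and the implementation would be the same function at two points in time, and a test runner would report red and green. The names `slugify_stub`, `slugify`, `check_slugify`, and `passes` are all hypothetical.

```python
def slugify_stub(title):
    """Step 1: placeholder that exists only so the test can run and fail."""
    return ""


def slugify(title):
    """Step 2: minimal implementation, written after watching the test fail."""
    return "-".join(title.lower().split())


def check_slugify(impl):
    # One concrete example with a hand-written expected value.
    assert impl("Hello Wide World") == "hello-wide-world"


def passes(check, impl):
    """Run a check against an implementation and report whether it passed."""
    try:
        check(impl)
    except AssertionError:
        return False
    return True


# First validation: the test must fail before the code exists.
red = passes(check_slugify, slugify_stub)

# Second validation: the same test must pass once the minimal code is in place.
green = passes(check_slugify, slugify)

print(red, green)  # prints: False True
```

If `red` came out as `True`, the test would be passing without any implementation, which is the warning sign described above.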
Surprisingly, it is the code that ultimately verifies the test's accuracy. Furthermore, TDD encourages small changes and rapid iterations, and this matters for the same reason. If you write three tests and then develop code to satisfy all three at once, you may not be able to tell which test was incorrect. The problem gets worse when the test code is lengthy. Using the AAA pattern and concrete examples helps keep tests concise.
Chapter 2: Trusting QA Tests
Can we place our trust in QA tests? This is a critical question. QA tests are typically developed after the code is written, so they never go through the fail-first check, which undermines their reliability. Moreover, it is common to see QA tests violate AAA principles and use complex logic to compute expected outcomes. While it may seem counterintuitive, we should remain cautious about the trustworthiness of QA tests. However, there are solutions. Techniques such as Behavior-Driven Development (BDD) and Continuous Delivery advocate shifting QA responsibilities to earlier stages of the development process, which offers more advantages than one might initially think. I will delve deeper into this topic soon.