Ben Northrop

  Decisions and software development

The Siren Song of Automated UI Testing

(14 comments)
  February 19th 2014

For more than a decade now, cautionary tales have been written about the costs, perils, and complexities of automating UI tests, and yet we, on the business and development sides, still seek the promised land (as depicted by the graph below). It often starts with a familiar story...


The application is launched, and it's a big success. Users are happy. Unfortunately, there's little time to celebrate, because with success comes requests for new features and enhancements. And they're coming in fast. Fortunately, the development team is up for the challenge. They start churning out code, and the team does frequent releases to keep pace with the feature demand.

Things are going well for a while, but as the application grows, things get a little more complicated. The team finds that even with solid automated unit testing practices in place, some new features end up breaking old features in quite unexpected ways. Regression bugs. Ugh. Users get frustrated, and so do the business owners.

To avoid these embarrassing snafus, the team decides to slow down and implement a more rigorous regression testing practice. The problem is that, in these nascent stages of the application, regression testing is all manual, and as everyone knows, manual testing is time-consuming. Exorbitantly so.

After just a few releases of comprehensive manual regression tests, it becomes obvious that this is a big bottleneck. The business demands frequent releases, but there just isn't the QA capacity to keep up. The team has two choices. It can slim down its testing scope so that regression tests can be completed more quickly, but of course this would introduce a greater likelihood of regression bugs. Alternatively, the team could release less frequently so as not to repeatedly incur the high cost of regression testing, but then users would have to wait longer for new features, and time-to-market opportunities would be missed. Neither choice is pleasant, and the business grows concerned.

It's here that someone introduces the idea of automating the UI regression tests. Eureka! By automating tests, the team can have its cake and eat it too. The scope of the regression tests needn't be narrowed, since automated tests can execute faster than a human. Also, releases can be frequent, because automated tests can be run as often as needed. The graph above captures the promise of automated UI testing.

Note that the cost function for manual testing is linear - it costs a fixed X for each run of the test. The cost of automated testing on the other hand is a step-wise function: to run it once costs more than running it manually (because you have to write the test), but then every subsequent run costs nothing at all. Overall return on investment is achieved when the two lines intersect, and so the more often you run the tests, the more you save.
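To make the break-even arithmetic concrete, here's a minimal sketch (the cost figures are hypothetical, not from the post, and it deliberately encodes the promise's assumption that automated runs cost nothing):

```python
import math

def break_even_runs(manual_cost_per_run, automation_cost):
    """Return the run count at which automation pays for itself, assuming
    manual testing costs a fixed amount per run while an automated test
    costs a one-time amount to write and nothing per run thereafter."""
    return math.ceil(automation_cost / manual_cost_per_run)

# e.g. if a manual pass costs 2 hours and automating it costs 10 hours,
# the automated test breaks even on its 5th run
print(break_even_runs(2, 10))  # → 5
```

The rest of the post argues that the "nothing per run" assumption is exactly what fails in practice.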

This is the promise at least, and both sides (for once!) are completely on board. The business owners love the idea - they can get features to market more quickly and with high quality, and still not have to hire a larger QA staff. The developers love the idea as well - automation is basically a virtue of every good programmer, and it's always exciting to play with the cool testing framework du jour to boot. A win-win!

Sadly, as anyone who has experience with automated UI testing knows, it never quite works out this way. While some organizations do eventually achieve some ROI automating UI tests, quite often the effort is a fruitless boondoggle - the tests that result are either unmaintainable, ineffective, brittle, or all of the above. Time and money are sunk, and little value is really extracted. The problem, in my opinion, is not that automated UI testing is hopeless, but rather that the expectation that is set up front about automated testing's ROI, as captured in the graph above, is overly simplistic. In my experience, effective automation is more expensive to implement up-front, and still requires non-trivial "care and feeding" throughout its life. In other words, automated UI testing is no silver bullet.

In the rest of this post I'd like to explore some of the subtle or unexpected challenges in automated UI testing I've come across in my project experiences, and hopefully re-frame a better image of its expected ROI. Ok, here we go...

Test resiliency vs. brittleness

The graph above obscures probably the most important insight about automated tests: they're not all equal. Tests can be written to be resilient, such that a failure of the test almost always indicates a failure of the feature. These are tests you can trust. Alternatively, they can be written to be brittle, such that trivial changes to the application or data cause the tests to fail. These are tests to be wary of. When they fail, they must be examined, re-run, and then either tweaked or re-written altogether (i.e. test death).

In practice, tests can fail for all sorts of reasons. The location or order of UI elements on the screen might change. A brittle test would break, but a resilient test would account for this. Data might be valid when the test was written, but may have changed when the test was run. A brittle test would break, but a resilient test would set up fresh data and tear it down after. There might be a network hiccup that causes the UI to be unresponsive. A brittle test would break, but a resilient test would recognize this and try again. And so on.
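As an illustration of the "try again" behavior, here's a minimal retry helper of the kind a resilient test might wrap around flaky UI interactions (a sketch; the function and parameter names are my own, not from any particular framework):

```python
import time

def retry(action, attempts=3, delay_seconds=1.0):
    """Run a flaky action (e.g. a UI interaction that can hit a network
    hiccup), retrying a few times before letting the failure propagate.
    Only transient errors should be routed through this; a genuine
    assertion failure should fail the test immediately."""
    last_error = None
    for _ in range(attempts):
        try:
            return action()
        except Exception as error:
            last_error = error
            time.sleep(delay_seconds)
    raise last_error

# usage sketch (hypothetical Selenium-style call):
#   element = retry(lambda: driver.find_element("id", "name-field"))
```

Even this tiny helper hints at the extra skill resilience demands: deciding which failures are transient and which are real is a judgment call the brittle record-and-play script never makes.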

As expected, it takes much more time and skill to create a resilient test than a brittle one, but the resilient test will live longer. Given this distinction then, it's clear that the cost function of automated tests should be bifurcated. The line below shows how brittle tests are relatively cheap to write, but will die quickly. The line above shows how a resilient test is more expensive to write, but will last longer:


It's important to note that this picture is actually generous for brittle tests. In many cases they will die before they reach the point of intersection with manual testing (i.e. no ROI). The canonical example of a brittle test is a record-and-play script created, say, with a simple Selenium plugin with almost zero effort. For a variety of reasons (like the ones given above: UI order, data, network), this type of test will probably die before the week ends.

Strong vs. weak verification

Another important distinction between tests is their verification effectiveness. Tests can be written to be extremely observant, or completely clueless. For example, it's not uncommon that a suite of automated UI tests will run, pass with 100% success, and then later that day a business owner will notice that all the text is lime green and the content areas are in random places. The application is obviously broken, and any human tester would notice it in a single glance, but the automated test was never written to verify the look-and-feel of the application. Given that the entire point of a test is to catch defects, this test failed its mission.

The point here is that tests can be written with weak or strong verification abilities. On one end of the spectrum, the test could wander through the application and just verify that there were no errors. This would be cheap to implement, but it might not give much confidence that the application is truly working. On the other end of the spectrum, the test could try to mimic a human tester's judgement, verifying anything and everything a human would look for. This, unfortunately, is not easy (if even possible). There is a choice to be made, therefore, about the type of tests that should be created, and the more effective a test is at spotting defects, the more costly that test will be to write.
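The two ends of that spectrum can be sketched over a toy "page snapshot" (a hypothetical dict standing in for whatever the test framework actually observes; the field names and expected color are invented for illustration):

```python
def weak_check(page):
    """Weak verification: pass as long as the page loaded without errors."""
    return page["status"] == 200 and not page["errors"]

def strong_check(page):
    """Stronger verification: also assert properties a human would notice,
    like the rendered text color (the lime-green scenario above)."""
    return (
        weak_check(page)
        and page["body_color"] == "#333333"  # expected text color
        and page["title"] == "Welcome"
    )

# a page that loads cleanly but renders all its text lime green
broken_page = {"status": 200, "errors": [],
               "body_color": "#00ff00", "title": "Welcome"}
print(weak_check(broken_page))    # → True  (the weak check misses the defect)
print(strong_check(broken_page))  # → False (the strong check catches it)
```

Every extra property the strong check asserts is one more thing to write, and one more thing that can break when the design legitimately changes - which is the cost trade-off described above.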

Maintenance is never free

Even if you choose to write resilient tests, maintaining them is still not free (contrary to what the horizontal line in the first graph suggests). Someone must monitor test runs, and when a test fails, that person must investigate. Resilient tests should only fail for valid reasons, but a human still needs to make sure. In these cases, the maintainer will most likely first try to re-create the test failure, either by re-running the test and watching it, or stepping through it manually. If it breaks again, great, a defect report can be written up. If not (as often happens), the maintainer needs to dig in and figure out why it would break once but not again. What do the application logs say at the time the test broke? What else was happening at that time? What did the data look like? There are plenty of variables, and it often takes a non-trivial amount of time to understand what happened. In a nutshell, maintaining automated tests is not free.

The automation infrastructure

Beyond examining test failures, someone must also be responsible for kicking the tests off, or better yet configuring them to run on some schedule (from a continuous integration server, say). The more robust the setup, however, the more technical complexity is taken on. Managing this complexity can cost a non-trivial amount of time and effort. Tests can suck resources from machines, causing headaches (e.g. memory errors) for administrators. Tests can take a prohibitive amount of time to run and need to be forked/clustered. Tests can hit security blocks (e.g. tests need to mimic user behavior, and users often need to log in, but how does the test log in without storing someone's actual password?). Or finally, tests can depend on frameworks and libraries that must be upgraded. The point is that automated tests introduce a slew of technical challenges never encountered in the world of manual testing.
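One common way around the stored-password problem is to have the CI server inject credentials into the environment at run time, so no real password ever lives in the test script or repository (a sketch; the variable names are hypothetical):

```python
import os

def get_test_credentials():
    """Read the test user's credentials from environment variables so the
    test code never contains a real password. The CI server supplies these
    values at run time via its own secrets mechanism."""
    username = os.environ.get("TEST_APP_USER")
    password = os.environ.get("TEST_APP_PASSWORD")
    if not username or not password:
        raise RuntimeError("TEST_APP_USER / TEST_APP_PASSWORD are not set")
    return username, password
```

Even this simple pattern is infrastructure that someone has to set up and keep working - another cost that manual testing never incurs.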

Making the app testable

A final often-overlooked cost is the time it takes to make the application itself testable. For an automated test to drive a screen, it must have some hook for the UI elements. For example, in order to "Enter 'John Doe' in the 'name' field", the test needs some way to find the 'name' field. Some approaches for locating UI elements (e.g. using preceding label names, XPath) can be brittle, since small, seemingly innocuous changes to the UI can still confuse the test and leave it unable to find the UI elements it needs to drive. When this happens, tests fail.

Resilient methods for referencing UI elements can sometimes require changes to the application itself (e.g. adding an "id" attribute to input elements, if the application is HTML based). In some instances, making these changes can require a significant effort. Again, this is a challenge avoided when testing manually.
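The difference between the two locating strategies can be shown with a toy model of a form (pure illustration; the element ids and labels are invented, and a real test would use a framework's locator API rather than dicts):

```python
# A toy DOM: each element is a dict with an id and a visible label.
form_v1 = [
    {"id": "email-field", "label": "Email"},
    {"id": "name-field",  "label": "Name"},
]
# After a redesign, the fields are reordered.
form_v2 = list(reversed(form_v1))

def find_by_position(form, index):
    """Brittle: 'the second input on the page' breaks when fields move."""
    return form[index]

def find_by_id(form, element_id):
    """Resilient: an explicit id survives reordering and restyling."""
    return next(e for e in form if e["id"] == element_id)

print(find_by_position(form_v1, 1)["label"])       # → Name
print(find_by_position(form_v2, 1)["label"])       # → Email (wrong field!)
print(find_by_id(form_v2, "name-field")["label"])  # → Name  (still correct)
```

The catch, as noted above, is that the id-based approach only works if the application actually exposes those ids - which may mean changing the application.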

Conclusion

In this post I painted a gloomy picture of automated UI testing, probably too gloomy. My experience has not been that automated UI testing is necessarily ineffective or a waste of time, but rather that the promised land of easy, cheap, human-quality testing via automation is a chimera. I've seen many teams go down the path of automated testing with rosy expectations, only to emerge 6 months later with hundreds of man-hours burned, a set of brittle tests with weak verification abilities, and only a handful of successes (i.e. defects prevented). In a future post, however, I'd like to correct this gloomy course and talk about situations in which automated UI testing can work, in my experience. Until then, I'd love to hear any comments you have, good or bad. Thanks!




I believe that software development is fundamentally about making decisions, and so this is what I write about (mostly). I'm a Distinguished Technical Consultant for Summa and have two degrees from Carnegie Mellon University, most recently one in philosophy (thesis here). I live in Pittsburgh, PA with my wife, 3 energetic boys, and dog. Subscribe here or write me at ben at summa-tech dot com.




Comments (14)



Rodney E February 19, 2014
Good article. This rings very true. Are your concerns just against specifically Automated UI Testing? What about Automated Functional or Integration tests? What about Functional Tests against a Service Layer?

Steve Wedig February 19, 2014
Great article Ben! I'm writing a series of articles about continuous delivery for webapps and I'll definitely be linking to it.

I've come to the conclusion that for Selenium tests to be resilient to UI changes, it is necessary to put ids everywhere. Essentially the UI has two interfaces: one for humans, and the other for Selenium tests. Writing tests that use XPath seems like a modularity disaster to me. If a team isn't able to modify the UI to provide an id interface for Selenium, I think it is in a potentially unworkable situation.


Danny K February 19, 2014
Software QA professional here with 16 years of writing manual and automated tests. Your article is spot-on and should be a must read for every PM, Dev, and QA professional. Thanks for articulating what I've known for so long and yet could not explain.

Brennan Fee February 19, 2014
Because you have never achieved it does not mean it is not possible or effective. I have successfully led numerous teams on various sized projects to automated testing success. It is true that the graphs never look as you described... but I would maintain that they aren't supposed to.

In the end, the challenge many people have is not understanding what is needed but how to get there. That's when you call a true professional.

Tyler B February 19, 2014
This is a wonderful article. After spending several years in QA and QA Management, attempting to limit expectations on automation's direct effect towards a product has been extremely difficult. Many do see it as a magical bullet to streamline productivity, give developers a tool to give immediate feedback on their code, and a reason to limit QA resources.

Unfortunately, those same people don't see the issues that you bring up above, in which it takes time to maintain and establish resilient code. This same amount of time, those same people want used towards manual testing to deliver their product on time.

I agree that automation can and should be used in places where there are more straight ahead regression tests. For instance, billing, to make sure purchases are processing properly, but even then, there might be items that are outside of the testing framework, that may not be caught unless manual testing is used.

Anyway, thank you for the article, as it's something I can use to point others towards to help them understand that it takes time and effort to get the proper balance to help in pushing development forward quickly and efficiently.

M February 20, 2014
Interesting post. Can you provide some real world code examples of the unmaintainable, brittle tests that you describe here? Having examples of what not to do would greatly benefit anyone considering automating their UI tests.

One point I have to disagree with is this: -

"Given that the entire point of a test is to catch defects, this test failed its mission."

Tests have many reasons to exist, and not just to validate the thing under test.

Michael Herrmann February 20, 2014
Very interesting article - thanks. I'm developing a web automation library that aims to make tests easier to write as well as more resilient to changes in the application under test. Maybe you'd like to have a peek: http://heliumhq.com

Ben Northrop February 20, 2014
Thanks for the comments!

@Rodney E - In my opinion, it's more difficult to write automated tests for the UI that are both resilient and have strong verification abilities, so there's more cost/risk here than for Functional and Integration tests. Moreover, it seems (to me) that there's something "flashier" or attractive (to both business and technical folks) about UI tests (i.e. watching the screen "drive itself")...and so I think there's more potential for hype/high expectations than for Functional or Integration tests (since they're all just code...that business owners can't see/watch). So putting that together...greater cost/risk...greater potential for hype/high expectations, I think makes automated UI tests more vulnerable to disaster.

@Steve Wedig - Yup...I've come to the same exact conclusion. Thanks for sharing that.

@Danny K - Thanks!

@Brennan Fee - Respectfully, that wasn't really the message I was trying to send with this post. I have had success with automated UI tests, but not in the way that I and others expected when we first got into it (i.e. we were hoping for the promise land of easy, cheap, human-quality testing via automation). Thanks for the comment though.

@Tyler B - Thanks much! I've had the similar experiences...and I finally felt like capturing it. Very glad it resonated.

@M - Good question - I'm hoping to follow up with another post - maybe a little more tangible. Very fair point about the "mission" of tests...some of my colleagues took issue with this as well. Even if a test doesn't find defects, there is some value in the confidence it's given that things are working (relatively) well. I would argue that the confidence is just a function of how much trust you have in your tests to actually find defects...and so it all comes back to finding defects.

@Michael Herrman - Very cool. I will definitely check it out! Thanks.

Olan B February 20, 2014
Great article! Agree with everything. I've been re-factoring Selenium tests for a while now - keeping them resilient and maintainable is by far the biggest challenge.

A few tips I've used to make our web UI tests more robust are:

- We used data-* attributes (data-test) to provide a testing hook in the UI. This makes it easy to spot the 'testable' units of HTML from source view, and keeps things maintainable. It also allows you to give them useful names.

- We used XPATH a lot. Commands like "//*[contains(text(),' [copy]')]/ancestor::*[@data-test='campaign']" or "get the campaign element that has the text '[copy]' somewhere in its subtree" make it very useful. (Although obviously relying on text output won't make the most robust test!)

- We started moving towards 'page objects', or well, objects. Which is obvious really, but once you start out with the Selenium IDE, it's too late before you realise...

Seth February 21, 2014
The cost of making a GUI app testable ought to be subsumed by making the program accessible, since assistive devices and software will typically be accessing your program the very same way you'd have your tests do so.

Or if you had no plans to make your app accessible then at least making it testable can have the added benefit of making it available to impaired users.

Noam Kfir February 25, 2014
My experience has been very much like yours. Your post is right on the money. I would add what I believe are a few more factors, in addition to the ones you mentioned.

Lack of training: Most manual testers I've worked with had little or no coding or scripting skills, but all UI automation tools require at least a minimal set. At the very least, they need some code to generate synthetic or random data. This often requires significant training. It usually also requires the active involvement and participation of the coders, not just to add accessibility and automation features to the product code base, but also to help the testers write the more complicated tests.

The learning curve: Companies tend to assume (or hope) that giving the testers some time to practice using the tool, or even paid professional training, is enough to make the testers productive. But it never is. It takes time. Not just to use the tool, but also to get accustomed to the different approach. Building UI automation tests is a very different discipline.

Mismatched expectations: It seems to me that these issues are magnified in certain environments. For example, in companies whose QA team or UI automation engineers are managed directly by a (possibly former) coder, the expectations will often lead the company to choose a UI automation tool that focuses on coding tools that offer greater flexibility, instead of simpler tools with a better GUI. There are other similar cases in which the people evaluating and choosing the tools are not the same people that will be using them, inevitably leading to failure.

On a positive note, when companies do account for these factors in their plans and make well-informed decisions, the UI automation efforts are often highly successful, or at least better match expectations.

Looking forward to your followup post!

Corporately Disenchanted February 25, 2014
Oh my, you just described part of my worst nightmare in hi-tech - being pushed to accept a huge test automation project to build a complete set of automated tests for a product employing my team of 8 that was already overworked. I asked my mgr who supposedly had run a larger engineering org than we were what errors she saw in my scoping of the project. She said nothing. Nothing. I presume she just didn't want to push back up the chain but she was no *** help to me.

The company contracted an outside company to do the work while my team was treated as lame. Took that team a year and in the end they had 20 developers locally and 40 remotely. The one who bid that contract confided in me later that knowing what he knew now, he would have bid it at 100x. I was fired before I ever saw the results of that effort.

I really enjoyed doing good QA. I hated the stupidity around how it was run (like so many other corporate environments these days).

uw06670 March 31, 2014

Thank you for summarizing so well what I've learned and observed over my time in software. It's so frustrating at times trying to explain it to managers or even others in testing; I'll simply point them to your article from now on.

I'll modify something you said, and hopefully address "M"'s objection. The entire point of a test case (automated in this case) is that it will catch a defect IF one is present. Having a set of automated BVTs passing at 100 out of 100 every day is great, as long as if something it's supposed to be testing breaks, it actually finds it. If the 100 tests always pass, yet sometimes you have bad builds, it's probably time to add a few new ones that will look for the stuff that does break from time to time, as this will help the team know earlier that something is wrong.

thanks for your post.
