For more than a decade now, cautionary tales have been written about the costs, perils, and complexities of automating UI tests, and yet we, on both the business and development sides, still seek the promised land (as depicted by the graph below). It often starts with a familiar story...

The application is launched, and it's a big success. Users are happy. Unfortunately, there's little time to celebrate, because with success come requests for new features and enhancements. And they're coming in fast. Fortunately, the development team is up for the challenge. They start churning out code and releasing frequently to keep pace with the demand for features.
Things go well for a while, but as the application grows, things get a little more complicated. The team finds that, even with solid automated unit testing practices in place, some new features end up breaking old features in quite unexpected ways. Regression bugs. Ugh. Users get frustrated, and so do business owners.
To avoid these embarrassing snafus, the team decides to slow down and implement a more rigorous regression testing practice. The problem is that, in these early stages of the application, regression testing is all manual, and as everyone knows, manual testing is time-consuming. Exorbitantly so.
After just a few releases of comprehensive manual regression testing, it becomes obvious that this is a big bottleneck. The business demands frequent releases, but there just isn't the QA capacity to keep up. The team has two choices. They can slim down their testing scope so that regression tests can be completed more quickly, but of course this would increase the likelihood of regression bugs. Alternatively, they could release less frequently so as not to repeatedly incur the high cost of regression testing, but then users would have to wait longer for new features, and time-to-market opportunities would be missed. Neither choice is pleasant, and the business grows concerned.
It's here that someone introduces the idea of automating the UI regression tests. Eureka! By automating tests, the team can have its cake and eat it too. The scope of the regression tests needn't be narrowed, since automated tests execute faster than a human. And releases can be frequent, because automated tests can be run as often as needed. The graph above captures the promise of automated UI testing.
Note that the cost function for manual testing is linear: it costs a fixed amount X for each run of the test. The cost of automated testing, on the other hand, is a step function: the first run costs more than a manual run (because you have to write the test), but every subsequent run costs nothing at all. Return on investment begins where the two lines intersect, and so the more often you run the tests, the more you save.
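To make the break-even arithmetic concrete, here's a minimal sketch of that idealized cost model in Python. The dollar figures are hypothetical and exist only to illustrate where the lines cross.

```python
# Idealized cost model for a single regression test (figures are hypothetical).
MANUAL_COST_PER_RUN = 50     # what it costs each time a human executes the test
AUTOMATION_BUILD_COST = 400  # one-time cost to script the test
AUTOMATED_COST_PER_RUN = 0   # the "promise": every subsequent run is free

def cumulative_cost(runs, fixed_cost, cost_per_run):
    """Total cost after `runs` executions of the test."""
    return fixed_cost + cost_per_run * runs

for runs in (1, 4, 8, 12, 16):
    manual = cumulative_cost(runs, 0, MANUAL_COST_PER_RUN)
    automated = cumulative_cost(runs, AUTOMATION_BUILD_COST, AUTOMATED_COST_PER_RUN)
    print(f"{runs:>2} runs: manual ${manual:>4}, automated ${automated:>4}")

# The lines cross at AUTOMATION_BUILD_COST / MANUAL_COST_PER_RUN = 8 runs;
# in this idealized model, every run after that is pure savings.
```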
This is the promise at least, and both sides (for once!) are completely on board. The business owners love the idea - they can get features to market more quickly and with high quality, and still not have to hire a larger QA staff. The developers love the idea as well - automation is practically a virtue of every good programmer, and it's always exciting to play with the cool testing framework du jour to boot. A win-win!
Sadly, as anyone who has experience with automated UI testing knows, it never quite works out this way. While some organizations do eventually achieve some ROI from automating UI tests, quite often the effort is a boondoggle - the resulting tests are unmaintainable, ineffective, brittle, or all of the above. Time and money are sunk, and little value is really extracted. The problem, in my opinion, is not that automated UI testing is hopeless, but rather that the expectation set up front about its ROI, as captured in the graph above, is overly simplistic. In my experience, effective automation is more expensive to implement up front, and still requires non-trivial "care and feeding" throughout its life. In other words, automated UI testing is no silver bullet.
In the rest of this post I'd like to explore some of the subtle or unexpected challenges in automated UI testing I've come across in my project experience, and hopefully paint a more realistic picture of its expected ROI. Ok, here we go...
Test resiliency vs. brittleness
The graph above glosses over probably the most important insight about automated tests: they're not all equal. Tests can be written to be resilient, such that a failure of the test almost always indicates a failure of the feature. These are tests you can trust. Alternatively, tests can be written in a brittle way, such that trivial changes to the application or data cause them to fail. These are tests to be wary of. When they fail, they must be examined, re-run, and then either tweaked or re-written altogether (i.e. test death).
In practice, tests can fail for all sorts of reasons. The location or order of UI elements on the screen might change. A brittle test would break, but a resilient test would account for this. Data might have been valid when the test was written, but may have changed by the time the test is run. A brittle test would break, but a resilient test would set up fresh data and tear it down afterward. There might be a network hiccup that causes the UI to be unresponsive. A brittle test would break, but a resilient test would recognize this and try again. And so on.
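To make those tactics a bit more concrete, here's a minimal sketch of a resilient test in Python with Selenium. The URL, the 'profile-name' locator, and the create_test_user / delete_test_user helpers are hypothetical stand-ins for whatever your application and test harness actually provide.

```python
# Sketch of a resilient UI test (Python + Selenium). The URL, the locator, and the
# create_test_user / delete_test_user helpers are hypothetical placeholders.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

def test_profile_shows_user_name():
    user = create_test_user(name="John Doe")    # set up fresh data for this run...
    driver = webdriver.Chrome()
    try:
        for attempt in range(3):                # retry to absorb transient hiccups
            try:
                driver.get(f"https://example.test/profile/{user.id}")
                # Locate by a stable id (not screen position) and wait explicitly
                # rather than assuming the page has finished rendering.
                name_field = WebDriverWait(driver, 10).until(
                    EC.visibility_of_element_located((By.ID, "profile-name"))
                )
                assert name_field.text == "John Doe"
                break
            except Exception:
                if attempt == 2:
                    raise                       # a genuine failure, not a hiccup
    finally:
        driver.quit()
        delete_test_user(user)                  # ...and tear it down afterward
```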
As expected, it takes much more time and skill to create a resilient test than a brittle one, but the resilient test will live longer. Given this distinction, it's clear that the cost function of automated tests should be bifurcated. The lower line shows how brittle tests are relatively cheap to write but die quickly; the upper line shows how resilient tests are more expensive to write but last longer:

It's important to note that this picture is actually generous to brittle tests. In many cases they will die before they ever reach the point of intersection with manual testing (i.e. no ROI at all). The canonical example of a brittle test is a record-and-playback script created with almost zero effort using, say, a simple Selenium plugin. For a variety of reasons (like the ones given above: UI order, data, network), this type of test will probably die before the week ends.
Strong vs. weak verification
Another important distinction between tests is their verification effectiveness. Tests can be written to be extremely observant, or completely clueless. For example, it's not uncommon for a suite of automated UI tests to run, pass with 100% success, and then later that day a business owner notices that all the text is lime green and the content areas are in random places. The application is obviously broken, and any human tester would notice it at a glance, but the automated tests were never written to verify the look-and-feel of the application. Given that the entire point of a test is to catch defects, these tests failed their mission.
The point here is that tests can be written with weak or strong verification abilities. At one end of the spectrum, a test could wander through the application and simply verify that no errors occurred. This would be cheap to implement, but it wouldn't give much confidence that the application is truly working. At the other end of the spectrum, a test could try to mimic a human tester's judgment, verifying anything and everything a human would look for. This, unfortunately, is not easy (if it's even possible). There is a choice to be made, therefore, about the type of tests to create, and the more effective a test is at spotting defects, the more costly it will be to write.
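As a rough illustration of the two ends of that spectrum, here's a hedged sketch in Python + Selenium; the locators and expected values are hypothetical.

```python
# Weak vs. stronger verification (Python + Selenium); locators and values are hypothetical.
from selenium.webdriver.common.by import By

def weak_check(driver):
    # Weak: just confirm the page rendered without an obvious error banner.
    assert not driver.find_elements(By.CSS_SELECTOR, ".error-banner")

def stronger_check(driver):
    # Stronger: verify the content a human would actually look for.
    total = driver.find_element(By.ID, "order-total")
    assert total.text == "$42.00"
    # Even a crude look-and-feel assertion guards against the lime-green scenario.
    color = total.value_of_css_property("color")
    assert "0, 255, 0" not in color
```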
Maintenance is never free
Even if you choose to write resilient tests, maintaining them is still not free (despite what the horizontal line in the first graph suggests). Someone must monitor test runs, and when a test fails, that person must investigate. Resilient tests should only fail for valid reasons, but a human still needs to make sure. The maintainer will most likely first try to re-create the failure, either by re-running the test and watching it, or by stepping through it manually. If it breaks again, great - a defect report can be written up. If not (as often happens), the maintainer needs to dig in and figure out why it broke once but not again. What do the application logs say at the time the test broke? What else was happening at that time? What did the data look like? There are plenty of variables, and it often takes a non-trivial amount of time to understand what happened. In a nutshell, maintaining automated tests is not free.
The automation infrastructure
Beyond examining test failures, someone must also be responsible for kicking the tests off, or better yet configuring them to run on a schedule (from a continuous integration server, say). The more robust the setup, however, the more technical complexity is taken on, and managing that complexity can cost a non-trivial amount of time and effort. Tests can hog machine resources, causing headaches (memory errors, for example) for administrators. Tests can take a prohibitive amount of time to run and need to be forked or clustered. Tests can hit security blocks (tests need to mimic user behavior, and users often need to log in, but how does a test log in without storing someone's actual password?). And tests can depend on frameworks and libraries that must be upgraded. The point is that automated tests introduce a slew of technical challenges never encountered in the world of manual testing.
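On the password question, for example, one common pattern is to give the suite a dedicated test account and inject its credentials from the CI server's secret store at run time, rather than hard-coding them in the scripts. A minimal sketch, assuming environment variables named TEST_USER and TEST_PASSWORD (the names are made up):

```python
# Minimal sketch: pull a dedicated test account's credentials from environment
# variables populated by the CI server's secret store, never from the test code.
# TEST_USER and TEST_PASSWORD are assumed names; use whatever your CI exposes.
import os

def get_test_credentials():
    user = os.environ.get("TEST_USER")
    password = os.environ.get("TEST_PASSWORD")
    if not user or not password:
        raise RuntimeError("Test credentials are not configured in this environment")
    return user, password
```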
Making the app testable
A final, often-overlooked cost is the time it takes to make the application itself testable. For an automated test to drive a screen, it must have some hook into the UI elements. For example, in order to "Enter 'John Doe' in the 'name' field", the test needs some way to find the 'name' field. Some approaches to locating UI elements (e.g. using preceding label text, or XPath expressions) can be brittle, since small, seemingly innocuous changes to the UI can confuse the test and leave it unable to find the elements it needs to drive. When this happens, tests fail.
Resilient methods for referencing UI elements can sometimes require changes to the application itself (e.g. adding an "id" attribute to input elements, if the application is HTML-based). In some instances, making these changes can take significant effort. Again, this is a challenge avoided entirely when testing manually.
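To make the trade-off concrete, here's a small sketch of a brittle hook versus a resilient one in Python + Selenium; the XPath, the 'name' id, and the page structure are all hypothetical.

```python
# Brittle vs. resilient ways to locate the hypothetical 'name' field (Python + Selenium).
from selenium.webdriver.common.by import By

def find_name_field_brittle(driver):
    # Tied to layout and element order; any reshuffling of the markup breaks it.
    return driver.find_element(
        By.XPATH, "/html/body/div[2]/form/table/tbody/tr[3]/td[2]/input"
    )

def find_name_field_resilient(driver):
    # Requires the application to add id="name" to the input,
    # but survives layout and ordering changes.
    return driver.find_element(By.ID, "name")
```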
Conclusion
In this post I painted a gloomy picture of automated UI testing - probably too gloomy. My experience has not been that automated UI testing is necessarily ineffective or a waste of time, but rather that the promised land of easy, cheap, human-quality testing via automation is a chimera. I've seen many teams go down the path of automated testing with rosy expectations, only to emerge six months later with hundreds of man-hours burned, a set of brittle tests with weak verification abilities, and only a handful of successes (i.e. defects prevented). In a future post, however, I'd like to correct this gloomy course and talk about situations in which, in my experience, automated UI testing can work. Until then, I'd love to hear any comments you have, good or bad. Thanks!