Automated visual verification is hard

Date: Wed Dec 21 2005

Tags: Java

In the Quality Team we try to automate our testing as much as possible. This is easy for tests of the core library or other functionality where there's no GUI. But when you bring in a GUI, as with the AWT/Swing tests, the test complexity goes up dramatically, because for some scenarios you need to verify that the graphics rendered correctly.

There's a general strategy we have in GUI testing that we call "layered testing". There are always aspects of the GUI APIs that we can test programmatically. E.g. if you click on a Button, you can programmatically test that the ActionListener gets called, etc. We use that trick to avoid automating visual verification in most cases.
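To make the Button example concrete, here is a minimal sketch of that kind of programmatic check. It is not one of our actual tests; the class and method names (`ButtonClickCheck`, `countActionEvents`) are made up for illustration, and it uses Swing's `JButton.doClick` to simulate the click rather than driving a real mouse.

```java
import java.util.concurrent.atomic.AtomicInteger;
import javax.swing.JButton;

// Hypothetical helper, for illustration only.
final class ButtonClickCheck {

    // Simulates the given number of clicks on a fresh JButton and
    // returns how many times its ActionListener was called.
    static int countActionEvents(int clicks) {
        JButton button = new JButton("Clickme");
        AtomicInteger calls = new AtomicInteger();
        button.addActionListener(e -> calls.incrementAndGet());

        for (int i = 0; i < clicks; i++) {
            // doClick arms, presses and releases the button model, which
            // fires the ActionListener. A press time of 0 ms avoids the
            // default pause between press and release.
            button.doClick(0);
        }
        return calls.get();
    }
}
```

A test would then assert that `ButtonClickCheck.countActionEvents(3)` returns 3. Nothing here looks at pixels, which is the whole point of the layer: it tells you the event plumbing works, not that the button was drawn correctly.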

But I've been thinking about a really hard problem in this area for a while. A few months ago I watched a video shot by Scoble (that Microsoft guy) about this "Sparkle" product they're going to ship (they==Microsoft). I don't remember the name of the technology, but that doesn't matter. The capability is to change GUI components so they render as a pseudo-3D animation. In the video the team showed radio buttons that would dance around if they were selected, or listboxes that had been subclassed beyond recognition to shift their contents around a 3D space.

Looks cool ... but if you can't afford to test it, then how good will the result be?

That is -- what I've observed from 16 years in software development is that there's always a limited budget for testing. In testing the GUI you can rely on manual testing, but it's labor intensive and your budget quickly runs out. The common quip in testing is "Testing is never finished, only abandoned", which is what happens when the budget runs out. If you can automate the testing then your budget ought to stretch further (except that automation doesn't come free, as it carries some overhead) and you can do more testing within your budget and time constraints.

Which gets me to dancing buttons.

Today buttons and other GUI components don't dance; they sit still. But we can see the writing on the wall. CPUs are getting faster, memory is more abundant, and more importantly graphics acceleration is becoming relatively common. In a relatively short time it will be fairly common for GUIs to no longer be flat 2D still-lifes. Instead they'll be 2.5D dancing animations, and maybe they'll sing too with sound output.

I can just see the marketeers salivating over that coolness.

Anyway ... I'm wondering how we're going to test that capability when it becomes ubiquitous enough to worry about it.

With the flat 2D still-life style components we have today it's hard to automate visual verification. You can take a screen capture of the component and compare it against a known-good image. But there are a lot of caveats and overhead, so in practice it's not that simple. With dancing (and maybe singing) GUI components the complexity is amplified even more. It's not a still-life, it's moving, and making the screen capture is like trying to shoot a moving target.

What are we going to do when GUIs are dancing and singing away? How will we be able to automatically verify it's dancing the right way and not stumbling around?

That's what has been in the back of my mind since I saw Scoble's video.

Once again we should turn to video games developers. Maybe they do have automated tests, in which case they could give us their solution to this tricky problem. But maybe they just don't do that :)

Posted by: gfx on December 21, 2005 at 04:51 PM

Yes, yes, yes... the ones I've reached do manual testing.

Posted by: robogeek on December 21, 2005 at 05:01 PM

I've pondered this question a lot and find myself wondering how you prove that your visuals are right. The only thing I've come up with is to take a picture of a properly visualised instance, take a picture of the test, and see if the images are identical by comparing the data in both pictures. This is a tough problem which I have never been able to answer right. leouser

Posted by: leouser on December 23, 2005 at 08:52 AM

GUI testing might become much easier if you change your approach. Here's my idea. Implement the visitor pattern on your UI, to represent it in XML:

  • A JButton outputs <button text="Clickme">. This doesn't need to be a complete dump of the JButton's state, just what's significant.
  • A JTextField outputs <textfield text="XYZ">.
  • A JPanel outputs <panel>, with the hosted components nested.
  • Similarly for the other components and custom components in your app. You can make this automatic with the java.beans encoder.

Now you've reduced verification from a graphical problem to a textual one, which I believe is easier to solve. You'll still need to figure out how your screens are all navigated to, which might be done by hooking a script up to something that can read that XML and 'click buttons' in response!
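A rough sketch of the idea, assuming a simple recursive walk over the component tree instead of a full visitor implementation or the java.beans encoder. The class name `UiXmlDump` and its methods are hypothetical, and the elements are written self-closing so the output is well-formed XML.

```java
import java.awt.Component;
import java.awt.Container;
import javax.swing.JButton;
import javax.swing.JPanel;
import javax.swing.JTextField;

// Hypothetical helper, for illustration only.
final class UiXmlDump {

    // Returns an XML-ish description of the component and its children.
    static String toXml(Component root) {
        StringBuilder out = new StringBuilder();
        append(root, out, 0);
        return out.toString();
    }

    private static void append(Component c, StringBuilder out, int depth) {
        StringBuilder pad = new StringBuilder();
        for (int i = 0; i < depth; i++) {
            pad.append("  ");
        }
        if (c instanceof JButton) {
            // Only the significant state is dumped, not the whole object.
            out.append(pad).append("<button text=\"")
               .append(escape(((JButton) c).getText())).append("\"/>\n");
        } else if (c instanceof JTextField) {
            out.append(pad).append("<textfield text=\"")
               .append(escape(((JTextField) c).getText())).append("\"/>\n");
        } else if (c instanceof JPanel) {
            out.append(pad).append("<panel>\n");
            for (Component child : ((Container) c).getComponents()) {
                append(child, out, depth + 1);
            }
            out.append(pad).append("</panel>\n");
        } else {
            // Anything not handled above is recorded by class name only.
            out.append(pad).append("<component class=\"")
               .append(escape(c.getClass().getName())).append("\"/>\n");
        }
    }

    // Escapes the characters that are unsafe inside an attribute value.
    private static String escape(String s) {
        if (s == null) {
            return "";
        }
        return s.replace("&", "&amp;").replace("<", "&lt;")
                .replace(">", "&gt;").replace("\"", "&quot;");
    }
}
```

For a JPanel holding a JButton labelled "Clickme" and a JTextField containing "XYZ", `UiXmlDump.toXml(panel)` produces a `<panel>` element with `<button text="Clickme"/>` and `<textfield text="XYZ"/>` nested inside it, which a test can compare against a stored string.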

Posted by: jessewilson on December 23, 2005 at 01:12 PM

What about using a mock Graphics2D object? I see two scenarios:

  • simple component testing: the Graphics2D calls are few and well known... so you can record the calls and programmatically check the sequence of calls in a unit test
  • complex component or group of components: the first time you do visual testing, once you get the expected result, you use a Graphics2D mock that performs a recording of calls (along with parameters) and stores it in a file.

Then you use this file to perform an automated comparison to make sure the calls to the Graphics2D objects stay the same, and thus the visual appearance of the component does not change.

This approach is more detailed than a reference image approach, and moreover I found during tests performed on reference images that fonts are not always rendered the same among different versions of the JDK and on different platforms.

The disadvantages are that it's most probably slower and probably too picky (so I guess it should be used as a way to perform regression testing, but probably not test driven design ;-)

Posted by: aaime on December 25, 2005 at 01:10 AM

Apple might be another good source for suggestions. Expose already does some automation. (And it's surprisingly useful. I expected it to mostly be a distraction.)

Posted by: dwalend on December 26, 2005 at 07:06 AM

Err. That's animation, not automation. I'll blame the egg nog.

Posted by: dwalend on December 26, 2005 at 07:07 AM

Games have nice-looking, yet simple GUI. Often they are even vertical (similar to what we had on host systems) and don't deal with lots of data or complex dependencies, restrictions, rules... And, gfx, I can tell you they don't have automatic tests!! (Been there, seen it, it sucked)

Posted by: herkules on December 27, 2005 at 02:58 AM

Probably it is time to admit defeat? The arms race between GUI implementation and GUI testing was lost long ago. Automatic GUI testing software doesn't really cut it. You have to put intensive manual labour into it to get out very little. And what you get is very fragile.

What you try to do in automatic GUI testing is to test a human-computer interface, but without a human in the loop. Ultimately this requires simulating a human. Unfortunately, the problem of simulating a human isn't solved at all in computer science. So you are short of a key component in automatic GUI testing. And you probably will be short of this component for a long time.

My prediction therefore is that for a long time to come useful GUI testing will require a real human in the loop. And humans will be the most scarce resource in GUI testing. So test planning and execution will have to center around some effective usage of the people. Or more people.

The need for using more people might result in a very perverse system. For some time you can hire very cheap labour for extremely boring tasks. E.g. you can find offshore companies who retype telephone books. You provide them with physical copies of the phone book(s), and these people read them and type them in. Or if you need to move data from one closed database to some other, you can print out the whole data on paper and these offshore companies manually re-enter it into your new database. One can now think of a perverse system, where a semi-automatic GUI test system captures large amounts of data, automatically sends this data to such a company, and people at the company compare the data pixel by pixel, event by event, GUI response by response and give a thumbs-up or down.

Posted by: ewin on December 27, 2005 at 12:50 PM

First, I've been on vacation the last week (all of Sun shuts down between Christmas and New Years). So I'm just getting to looking at this. My, have y'all been busy with good comments.

Two of you have similar suggestions. Basically, capture textual data out of the program. For example, capture the component's state as an XML thingy, or trace the calls to 2D as some kind of log. Those are interesting, but they won't solve the biggest problem, only a small part of it. We are testing Java itself, and so we have to test the full stack all the way to where pixels come out on the screen or onto paper. Even a mock Graphics2D object isn't enough, as there is a whole slew of stuff Java2D does under the covers for which we have to verify the results.

We do follow some similar strategies. We call it "layered testing", where some of our tests check programmatic states. e.g. If you call setText, and later call getText, does the value reflect what you set? Or more importantly, if you instantiate a Font and check all its characteristics, do they make sense? That latter is important because the 2D team does occasionally change font settings, and we can validate that the fonts still have sane values.
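Those two checks might look something like the sketch below. The class and method names are invented for this example, and the definition of a "sane" font here (positive size, non-empty name, a style within the PLAIN/BOLD/ITALIC range) is only an assumption about what such a check could cover, not what our tests actually verify.

```java
import java.awt.Font;
import javax.swing.JTextField;

// Hypothetical helper, for illustration only.
final class ProgrammaticStateChecks {

    // Does getText return what setText was given?
    static boolean textRoundTrips(String value) {
        JTextField field = new JTextField();
        field.setText(value);
        return value.equals(field.getText());
    }

    // A deliberately loose notion of "sane" font characteristics.
    static boolean fontLooksSane(Font font) {
        String name = font.getName();
        int style = font.getStyle();
        return font.getSize() > 0
                && name != null && !name.isEmpty()
                && style >= Font.PLAIN
                && style <= (Font.BOLD | Font.ITALIC);
    }
}
```

For example, `fontLooksSane(new Font("Serif", Font.BOLD, 12))` passes, while a font constructed with size 0 does not.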

But you guys out there would be rather upset if we didn't test rendering quality. Right? Hence, we do have to check the final result of rendering, and that's what I've been pondering ... how can we do so without huge cost to our organization.

leouser suggested a screen capture and comparison against some known image. We already have such a system in-house which I called the "media server". In our in-house terminology we call the "known good image" a "golden image", and the "media server" provides a way to categorize/organize the golden images by test, platform, etc. The original system used a database to store golden images and we found the overhead of maintaining that database was too great. During development of a major release the 2D/AWT/Swing teams are changing so much (trying to satisfy all you guys complaining we don't match native rendering quality) that the old golden images are no longer golden, and as a result we have to recapture the golden images once a month or more.

ewin kind of hits the nail on the head, and I agree. It is a tough problem like I said, and you're essentially trying to emulate human judgement. For example a major feature for Mustang (1.6) is to improve font rendering to more closely match native font rendering. Well, how do you test that? The 2D SQE team studied the issue for a while and came to a similar conclusion, that there isn't much we could do to automate the validation of "it's better".

It is similar to the overhead issue with maintaining golden images. If we had software which knew there were pixel-level differences, but those pixel level differences were okay and appropriate ... well ... we'd be done. But I think that software resides between our ears, and can't be encoded in a computer.

I once had a cockamamie idea about neural networks. Basically a neural network is a trainable pattern recognizer. You feed it some inputs, and for each input you tell the neural net whether it properly recognized it. Over time it is trained and can recognize correct or incorrect patterns. But I couldn't gather enough of an idea of whether it could be used for recognizing graphics or not, much less recognizing moving graphics. Probably you'd have to train a new neural network for each rendering you want to recognize, and even at that you might have to retrain it when the Java2D team fixes a bug that changes the rendering details.

Posted by: robogeek on December 31, 2005 at 03:59 PM

hmmm.... if you could teach a neural network about taste then maybe. I guess a flaw with testing the good image against a fresh image is that it doesn't allow for slight variations in value. If something is off by a few pixels shouldn't this be ok? It would depend on the situation. If you could specify some kind of 'variance allowed' value then the test becomes more flexible and more useful. leouser

Posted by: leouser on January 01, 2006 at 02:32 PM

Like you said that depends on the situation -- are a few pixels variance okay? It's fairly easy to compare images ... even I, who know zip about graphics, was able to write an image comparison function. An image is just an array of pixels, in other words an array of numbers. Nothing magical, so you can write a function that goes through the pixels and sees whether they're the same or not.

What I wrote was a DiffImage class which takes two images and calculates a "difference image", which is an image of the same size whose pixel values are the absolute value of the difference between the corresponding pixel values. If they all come out as '0' for the difference, then the images are the same. It is simple to add to that function a parameter allowing for 'n' pixels worth of difference, or a threshold of difference in pixel value. That's about the limit of what I, who know zip about graphics, can do.

Thinking about it right now, a lot depends on where the different pixels are. E.g. if you drew a line somewhere, then pixel differences along the edge of the line might be okay, but a pixel difference completely elsewhere could be a problem. Or some of the new features in Java2D are what they call sub-pixel rendering, which I barely understand, but are about making the edge of something look smooth regardless of its angle. At the edge they'll use pixel values that vary to help create the smooth illusion, to avoid getting jaggies. But then that introduces a little variability in the correct pixel values. That variability is acceptable along the edge of some rendered shape but is not acceptable within the body of that rendered shape. As the total rendering gets more complex it's going to be really hard to describe which parts have pixels with acceptable variability, and which pixels do not. I hope this is making sense; as I said I know zip about graphics, and have been getting this by osmosis.
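For the curious, here is a minimal sketch of that kind of comparison. It is a reconstruction of the idea, not the actual DiffImage class; the names `DiffImageSketch`, `diff`, `countDiffering` and `matches` are made up, and the per-channel treatment of the threshold is an assumption.

```java
import java.awt.image.BufferedImage;

// Hypothetical reconstruction, for illustration only.
final class DiffImageSketch {

    // Builds a "difference image": each channel of each pixel is the
    // absolute difference between the two inputs at that position.
    static BufferedImage diff(BufferedImage a, BufferedImage b) {
        if (a.getWidth() != b.getWidth() || a.getHeight() != b.getHeight()) {
            throw new IllegalArgumentException("images differ in size");
        }
        BufferedImage out = new BufferedImage(
                a.getWidth(), a.getHeight(), BufferedImage.TYPE_INT_RGB);
        for (int y = 0; y < a.getHeight(); y++) {
            for (int x = 0; x < a.getWidth(); x++) {
                int pa = a.getRGB(x, y);
                int pb = b.getRGB(x, y);
                int dr = Math.abs(((pa >> 16) & 0xFF) - ((pb >> 16) & 0xFF));
                int dg = Math.abs(((pa >> 8) & 0xFF) - ((pb >> 8) & 0xFF));
                int db = Math.abs((pa & 0xFF) - (pb & 0xFF));
                out.setRGB(x, y, (dr << 16) | (dg << 8) | db);
            }
        }
        return out;
    }

    // Counts pixels where any channel differs by more than the threshold.
    static int countDiffering(BufferedImage diffImage, int channelThreshold) {
        int count = 0;
        for (int y = 0; y < diffImage.getHeight(); y++) {
            for (int x = 0; x < diffImage.getWidth(); x++) {
                int p = diffImage.getRGB(x, y);
                int max = Math.max((p >> 16) & 0xFF,
                        Math.max((p >> 8) & 0xFF, p & 0xFF));
                if (max > channelThreshold) {
                    count++;
                }
            }
        }
        return count;
    }

    // True if at most maxDifferingPixels exceed the per-channel threshold.
    static boolean matches(BufferedImage a, BufferedImage b,
                           int channelThreshold, int maxDifferingPixels) {
        return countDiffering(diff(a, b), channelThreshold) <= maxDifferingPixels;
    }
}
```

Calling `matches(golden, captured, 0, 0)` demands an exact match, while raising either parameter gives the 'variance allowed' knob. What this sketch cannot do is tell an acceptable difference on an anti-aliased edge from an unacceptable one in the middle of a shape; both tolerances apply to the whole image.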

Posted by: robogeek on January 01, 2006 at 04:53 PM

About the Author(s)

David Herron : David Herron is a writer and software engineer focusing on the wise use of technology. He is especially interested in clean energy technologies like solar power, wind power, and electric cars. David worked for nearly 30 years in Silicon Valley on software ranging from electronic mail systems, to video streaming, to the Java programming language, and has published several books on Node.js programming and electric vehicles.