Deterministic Game Testing

Noel Llopis recently wrote a great two-part article in Game Developer on deterministic playback systems for game state testing. It’s a fantastic read that hits a lot of the high points for testing. After reading it, I compared it with some of the systems we use at Emergent. We run a battery of tests every 24 hours on an automated system, so my comments are largely about automation and less about QA and user input testing, which are well covered in the articles. Some of the differences are warranted, and I wanted to share them because I think more information on testing and determinism is a good thing. Games, and 3D graphics specifically, are hard to test.

Using Fixed Timing

Noel mentioned recording the time and frame ID for each frame as a good way to control issues with timing across machines. For automated testing, I prefer using a fixed time step. The upside is that you don’t have to record the time. Once testing is turned on, you just override the clock so that each frame advances by exactly 1/60 of a second. As noted in the article, this can also accelerate things if you turn off VSync. For us, most of our tests are simply files that reproduce a single feature. The real frame time is much less than 1/60 of a second, so we can slam through 100+ frames for a test file in less than a second. It’s a huge win for an automated suite of thousands of tests that takes in excess of 20 hours to complete. The downside, obviously, is that you have no variability in your times and frame deltas. A fixed time step won’t catch the odd edge cases in your math that only show up with unusual deltas. Given that a large portion of our tests are focused on rendering artifacts, as I’ll note below, this isn’t a huge deal for us, but you should consider it when planning any system for yourself.
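
To make the clock override concrete, here is a minimal sketch in Python. It is not our actual code; `GameClock`, `fixed_step`, and `run_frames` are names I made up for illustration, and a real engine would hook this in wherever it already reads the timer.

```python
import time


class GameClock:
    """Hands the game loop one delta time per frame."""

    def __init__(self, fixed_step=None):
        # fixed_step is a hypothetical test-mode override, in seconds.
        # None means "use the real wall-clock delta".
        self.fixed_step = fixed_step
        self._last = time.perf_counter()
        self.frame_id = 0
        self.elapsed = 0.0

    def tick(self):
        now = time.perf_counter()
        real_dt = now - self._last
        self._last = now
        # In test mode the real delta is ignored entirely.
        dt = self.fixed_step if self.fixed_step is not None else real_dt
        self.frame_id += 1
        self.elapsed += dt
        return dt


def run_frames(clock, update, frame_count):
    """Run a fixed number of frames with no sleeping and no vsync wait."""
    for _ in range(frame_count):
        update(clock.tick())


# Test mode: every frame advances the simulation by exactly 1/60 s,
# no matter how quickly the loop really runs.
clock = GameClock(fixed_step=1.0 / 60.0)
state = {"x": 0.0}


def update(dt):
    state["x"] += 3.0 * dt  # move at 3 units per simulated second


run_frames(clock, update, 120)
```

The 120 frames finish almost instantly in real time, but the simulation sees two full seconds, and it sees the same two seconds on every machine.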

Image Differencing vs. Perceptual Differencing

The first part of the article mentions a perceptual image differencing utility that determines how visible a change in the image will be to the human eye. In our tests, we use exact image differencing to compare known good results with the current test run. This technique will catch even the smallest change: a single bit in a single color channel of a single pixel. When there’s a failure, we actually have to amplify the difference most of the time to see what went wrong. For game visual testing, perceptual differencing is probably more than sufficient.
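
As a rough illustration of exact differencing plus amplification, here is a small Python sketch. It assumes images are plain lists of rows of 8-bit (r, g, b) tuples; `diff_images`, `amplify`, and the `gain` value are hypothetical and are not the tools we actually use.

```python
def diff_images(expected, actual):
    """Return (mismatch_count, diff_image) for two same-sized RGB images.

    Images are lists of rows of (r, g, b) tuples with 8-bit channels.
    Any difference at all, even one bit in one channel, counts.
    """
    if len(expected) != len(actual) or any(
        len(e_row) != len(a_row) for e_row, a_row in zip(expected, actual)
    ):
        raise ValueError("image dimensions differ")

    mismatches = 0
    diff = []
    for e_row, a_row in zip(expected, actual):
        out_row = []
        for e_px, a_px in zip(e_row, a_row):
            delta = tuple(abs(e - a) for e, a in zip(e_px, a_px))
            if any(delta):
                mismatches += 1
            out_row.append(delta)
        diff.append(out_row)
    return mismatches, diff


def amplify(diff, gain=64):
    """Scale the per-channel differences so a human can see them."""
    return [[tuple(min(255, c * gain) for c in px) for px in row] for row in diff]


# A 4x4 baseline and a copy that is off by one in one channel of one pixel.
baseline = [[(10, 20, 30)] * 4 for _ in range(4)]
current = [row[:] for row in baseline]
current[2][1] = (10, 21, 30)

count, diff = diff_images(baseline, current)
visible = amplify(diff)
```

A test would fail whenever `count` is nonzero, and the amplified image is what a person looks at to find the offending pixels.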

For our purposes as a middleware provider, and because each file in our test only runs for about 2 seconds, exact differencing actually serves us much better because it will catch small math errors throughout the pipeline which may only manifest as a tiny difference on screen. For example, a perceptual utility might not flag an error for an animated character that is one frame off because only a small number of pixels differ in the animation. If that error were to accumulate, it could cause problems. We should chase that down. The other issue as a middleware provider is that we have no idea what our customers will rely upon or are currently relying upon in our code. That minor math error may be a big problem for their assumptions, so we try to chase such errors down.

Using image differencing does lead to some real problems with spurious test failures due to hardware changes, which leads to my next point.

Use Fixed Hardware

Even if you think you’re doing everything the same, the hardware can still cause you problems. For example:

  • There are differences in internal floating-point precision between some Intel and AMD chips.
  • Ditto for graphics cards.
  • Filtering precision can differ for graphics cards.
    • Color interpolation may occur in 32-bit space vs. 16-bit space for textures.
    • Mipmapping calculations may differ from card to card.

I could add more, but the point is that you can’t rely on anything if you’re testing raw pixel outputs. You need to use fixed hardware. For consoles, this is no problem. For PC, however, make sure you are using the same CPU, GPU, and drivers. Obviously, this reduces your hardware coverage, but automated testing can’t solve everything.
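
One way to enforce this is to store a description of the machine alongside the known good images and refuse to compare when it doesn’t match. The Python sketch below is only an assumption about how that could look; the field names and the example values are made up, and how you actually query the CPU, GPU, and driver version is platform-specific and left out.

```python
import hashlib
import json


def fingerprint(config):
    """Stable hash of the fields that can change raw pixel output."""
    keys = ("cpu", "gpu", "driver")
    canonical = json.dumps({k: config[k] for k in keys}, sort_keys=True)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def can_compare(baseline_config, current_config):
    """Only diff against baselines captured on an identical setup."""
    return fingerprint(baseline_config) == fingerprint(current_config)


# Hypothetical values recorded when the baseline images were blessed.
baseline_machine = {"cpu": "example-cpu", "gpu": "example-gpu", "driver": "1.0.0"}

# Same hardware, but someone updated the graphics driver.
updated_machine = dict(baseline_machine, driver="1.0.1")
```

If `can_compare` returns false, the run should be reported as a configuration problem rather than as hundreds of image failures.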

Multithreading

Noel noted multithreading as one of the biggest pains in his article. I’ve left it for last as well. There’s a bunch of pain there, and I only have tips.

  • If you leave threading turned on, look out for thread-local data. It will introduce non-determinism that you didn’t think of. One of the most common culprits is random number generation. Even if you seed with a consistent value, the RNG state is usually stored per thread for performance. If the number of calls to the RNG varies per thread from run to run, you may get spurious failures in your testing.
  • Consider a serial mode for your threading systems. We have a system called Floodgate for stream processing in Gamebryo. When we run with “-test” for some applications, we simply flip it into serial execution mode. Instead of running the tasks on worker threads, it simply runs them when submitted. It’s much slower and really a different configuration, but it’s much more testable.
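
Both tips can be sketched together in a few lines of Python. To be clear, this is not Floodgate and says nothing about how Floodgate is implemented; `TaskRunner`, its `serial` flag, and `jitter_task` are hypothetical names. The runner executes tasks inline when it is in serial mode, and each task builds its RNG from a base seed and its own task ID, so the numbers it draws don’t depend on which thread happens to run it.

```python
import random
from concurrent.futures import ThreadPoolExecutor


class _Done:
    """Minimal stand-in for a future whose result is already known."""

    def __init__(self, value):
        self._value = value

    def result(self):
        return self._value


class TaskRunner:
    """Runs submitted tasks on worker threads, or inline in serial mode."""

    def __init__(self, serial=False, workers=4):
        self.serial = serial
        self._pool = None if serial else ThreadPoolExecutor(max_workers=workers)
        self._pending = []

    def submit(self, fn, *args):
        if self.serial:
            # Serial mode: run right now on the calling thread.
            self._pending.append(_Done(fn(*args)))
        else:
            self._pending.append(self._pool.submit(fn, *args))

    def wait_all(self):
        # Results come back in submission order in both modes.
        results = [p.result() for p in self._pending]
        self._pending = []
        return results

    def shutdown(self):
        if self._pool is not None:
            self._pool.shutdown()


def jitter_task(base_seed, task_id, count):
    # The RNG belongs to the task, not to the thread that runs it.
    rng = random.Random(base_seed * 1000003 + task_id)
    return [rng.random() for _ in range(count)]


def run(serial):
    runner = TaskRunner(serial=serial)
    for task_id in range(8):
        runner.submit(jitter_task, 1234, task_id, 3)
    results = runner.wait_all()
    runner.shutdown()
    return results
```

With per-task seeding, `run(True)` and `run(False)` produce the same values. If the tasks shared a per-thread RNG instead, the threaded results could change from run to run depending on scheduling.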

Wrap It Up, B

I think that’s plenty of words, and I can see why Noel ended up writing a two-part article. Regardless, I hope my additional ramblings add a bit of color to his work.

This post has not been fully tested. dba

3 Responses to “Deterministic Game Testing”

  1. shaunkime Says:

    One of the moments I like to share was from very early on in the development of the testing system. We were still getting the kinks worked out and one day 90% of the test cases failed. We only had pretty simple test cases back then. I stared at the screen for a long time trying to rationalize what was different. In the end, I chalked it up to driver issues and blessed the images. This was one of the dumbest things that I’ve ever done.

    It turns out that a bug had been introduced in our backbuffer format detection code, so it now selected 16-bit backbuffers by default. A couple of days later I saw something that looked like Mach banding on a sphere and eventually tracked down the issue.

    Even your automated systems are only as good as the weakest link in the chain… humans. Someone has to act on the results that they are getting back from the system. If your reporting stinks, the value of the automated testing system stinks. This taught me an important lesson, though.

  2. whatmakesyouthinkimnot Says:

    It’s worth noting that Shaun has gone on to write the Bless-O-Matic 3000 to help with this process and improve the reporting and human interaction in our testing by an order of magnitude.

    dba

  3. […] – bookmarked by 3 members originally found by thekossack on 2008-09-29 Deterministic Game Testing https://whatmakesyouthinkimnot.wordpress.com/2008/06/24/deterministic-game-testing/ – bookmarked by […]
