ROAM Algorithm -- AGP Chunking


Chunking the output gives a hundredfold speedup, for some loss in quality

There are several levels of increased capability when chunking, but they come with increased challenge in implementation and in just understanding the ideas. Let me start with the very simplest case, uniform chunks, and see how far we can go with that, and what the pros and cons are.

First of all, although the primary data structure is now diamonds, you still need to split the diamond into its two bintree triangle sub-parts for the purpose of rendering. Here's a quick picture to see why (as always, this will be ASCII art and requires a fixed-width font to display properly). Take a diamond d:

                o-------o-------o
                |.     /|\     /|
                | .   / | \   / |
                |  . /  |  \ /  |
                |   a   d   b   |
                |  / .  |  / \  |
                | /   . | /   \ |
                |/     .|/     \|
                o-------o-------o
Now this diamond comes into existence because either of its parents (a or b) was split. Typically one will split first, say b; then sometime later a may also split. But before a is split, only the rightmost triangle of diamond d will be drawn. So you can see that a diamond cannot be treated as an indivisible unit for rendering purposes.

So now we want a speed boost. This involves two things: (1) reduce the amount of CPU work to compute the mesh updates, and (2) present the mesh to the graphics hardware in the best possible layout. The first objective is accomplished by thinking about triangles in groups, not as individuals. The second objective is obtained by using vendors' extensions (or special settings in the case of Micro$oft's API), which for current hardware means AGP chunks, specifically (in OpenGL) the Vertex Array Range on Nvidia hardware (I am not so familiar with ATI's variant on this, but I know they have one). I will speak in terms of the Nvidia extensions, but this will port to other hardware.

The simplest kind of chunk to make is a uniformly refined one. But we still need to maintain continuity (no cracks etc). If we only split each triangle once, here's what the figure earlier would look like (the little dots in the previous figure were not edges, just the ghosts of edges yet to come...here they come):

                o-------o---o---o
                |\     /|\  |  /|
                | \   / | \ | / |
                |  \ /  |  \|/  |
                |   a   d---b---o
                |  / \  |  /|\  |
                | /   \ | / | \ |
                |/     \|/  |  \|
                o-------o---o---o
There is a problem going from the left side of diamond d to the right side. The left is unsplit yet the right is split, so a crack will form there. What if we split each triangle twice instead of once?
                o---o---o---o---o
                |\  |  /|\ /|\ /|
                | \ | / | o | o |
                |  \|/  |/ \|/ \|
                o---a---d---b---o
                |  /|\  |\ /|\ /|
                | / | \ | o | o |
                |/  |  \|/ \|/ \|
                o---o---o---o---o
Now everything seems to work fine. The general rule is this: split the insides of any bintree triangle chunk any way you like, but make sure the edges are all split in exactly the same way regardless of bintree refinement. The simplest way to fulfill this requirement is to make sure each edge is refined the same amount. Look what happens when a triangle is split once, twice, three times or more, paying attention to what happens on all three tri-chunk edges:
  split once:
                          o
                         /|\
                        / | \
                       /  |  \
                      /   |   \
                     /    |    \
                    /     |     \
                    /      |      \     BAD -- base edge in 2 segments, side
                   o-------o-------o           edges in only 1
  split twice:
                          o
                         /|\
                        / | \
                       /  |  \
                      o   |   o
                     / \  |  / \
                    /   \ | /   \
                    /     \|/     \     GOOD -- all edges in 2 segments
                  o-------o-------o
  split thrice:
                          o
                         /|\
                        / | \
                       /  |  \
                      o---o---o
                     /|\  |  /|\
                    / | \ | / | \
                    /  |  \|/  |  \     BAD -- base in 4 segments != sides in 2
                  o---o---o---o---o
  split four times:
                          o
                         /|\
                        o | o
                       / \|/ \
                      o---o---o
                     /|\ /|\ /|\
                    o | o | o | o
                    / \|/ \|/ \|/ \     GOOD -- all edges in 4 segments
                  o---o---o---o---o
You can verify that this BAD-GOOD alternation continues forever. Splitting an odd number of times is bad, while splitting an even number of times is good: only an even number of splits leaves the base and side edges with the same number of segments, so neighboring chunks can match up.

So now to the CPU think-reduction part of our goal. Do your ROAM just as you have before, splitting/merging etc, but when you go to draw a triangle, draw the tri-chunk for that triangular region instead. This means you take the height values for all these points from the actual height map, and take the error for the bintree tri-chunk, but otherwise make no changes whatsoever. Let's say that your data is already in memory in chunked form (whatever is needed for the rendering calls). Then the CPU work for ROAM "thinking" goes down by a factor proportional to the number of triangles in a chunk. For example, if we split eight times we get 256 triangles per chunk and a 256x speed boost for the "thinking" effort. This does not mean 256 times faster overall, just a reduction of the CPU work for this one thing. We have fulfilled the first part of our goal.
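As a quick sanity check on those counts, here is the per-chunk bookkeeping in C (my own arithmetic, not taken from any of the ROAM sources; the formulas just follow the figures above):

    /* per-chunk counts for an even split count LEVELS */
    #define LEVELS    8                               /* number of splits, must be even */
    #define TRIS      (1 << LEVELS)                   /* 2^8 = 256 leaf triangles       */
    #define EDGE_SEGS (1 << (LEVELS / 2))             /* 2^4 = 16 segments on each edge */
    #define VERTS     ((EDGE_SEGS + 1) * (EDGE_SEGS + 2) / 2)  /* 153 unique vertices   */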

The second step is sending these triangle chunks in the best possible way to the graphics hardware. Of course this will change as graphics hardware changes, but in general there will be a packing, loading and rendering step. For Nvidia hardware currently there are actually *two* different "best" ways: one is best for total triangles per second throughput, the other is best for minimizing the CPU work per triangle. Both use the VAR (Vertex Array Range) extension and AGP memory. The difference is that the call glDrawRangeElements(GL_TRIANGLES,...) allows better vertex cache reuse, but at the expense of a CPU-side index array, compared to the call glDrawRangeElements(GL_TRIANGLE_STRIP,...). I've implemented both of these (plus a lot of others that didn't work as well, including using display lists per chunk). For this discussion, let's focus on getting the most spectacular speed we can per triangle, and not worry quite so much about CPU use (after all we reduced it a LOT by using chunks, and furthermore I know that ATI allows AGP-side index arrays, so we can hope that other vendors will follow suit).
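To make the rendering call concrete, here is a minimal sketch of the indexed GL_TRIANGLES path. This is my own illustration, not the LibGen/Roam code; the Chunk struct and the CHUNK_* constants are placeholders, and the vertex data is assumed to live in the AGP range handed to glVertexArrayRangeNV:

    #include <GL/gl.h>   /* glDrawRangeElements needs GL 1.2 (or the EXT entry point) */

    #define CHUNK_VERTS   153                /* unique vertices in one chunk           */
    #define CHUNK_INDICES (256 * 3)          /* 256 triangles, 3 indices per triangle  */

    typedef struct {
        float    *verts;                     /* xyz triples, resident in AGP memory    */
        GLushort  indices[CHUNK_INDICES];    /* CPU-side index array                   */
    } Chunk;

    void draw_chunk(const Chunk *c)
    {
        glEnableClientState(GL_VERTEX_ARRAY);
        glVertexPointer(3, GL_FLOAT, 0, c->verts);

        /* GL_TRIANGLES plus an index array gives the best vertex-cache reuse;
           GL_TRIANGLE_STRIP would minimize CPU cost per triangle instead.     */
        glDrawRangeElements(GL_TRIANGLES, 0, CHUNK_VERTS - 1,
                            CHUNK_INDICES, GL_UNSIGNED_SHORT, c->indices);
    }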

So now everything comes down to two things: (1) packing the vertices and indices in some order for each triangle chunk, and (2) managing the loading of chunks into the limited AGP memory as needed. The second task is rather simple since all the triangle chunks are exactly the same size: just keep a least-recently-used cache of chunks and replace accordingly.
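For what it's worth, here is one way that cache might look (a sketch under my own naming; upload_chunk_to_agp() stands in for whatever routine copies a chunk's vertices into its slot of the AGP range):

    #define NUM_SLOTS 512                    /* how many chunks fit in the AGP range  */

    static int      slot_chunk[NUM_SLOTS];   /* chunk id held by each slot, -1 = free */
    static unsigned slot_stamp[NUM_SLOTS];   /* frame in which the slot was last used */
    static unsigned frame;                   /* bumped once per rendered frame        */

    extern void upload_chunk_to_agp(int chunk_id, int slot);   /* placeholder */

    void init_chunk_cache(void)
    {
        int s;
        for (s = 0; s < NUM_SLOTS; s++) { slot_chunk[s] = -1; slot_stamp[s] = 0; }
    }

    /* Return the slot holding chunk_id; if it is not resident, evict the
       least-recently-used slot and upload the chunk into it.              */
    int acquire_chunk_slot(int chunk_id)
    {
        int s, lru = 0;
        for (s = 0; s < NUM_SLOTS; s++) {
            if (slot_chunk[s] == chunk_id) { slot_stamp[s] = frame; return s; }
            if (slot_stamp[s] < slot_stamp[lru]) lru = s;
        }
        slot_chunk[lru] = chunk_id;
        slot_stamp[lru] = frame;
        upload_chunk_to_agp(chunk_id, lru);
        return lru;
    }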

Finally we are down to just ordering the vertices in the vertex array and ordering the triangles in the index array that references the vertex array. We use a simple idea with a fancy name: Sierpinski space-filling curves. Here's how to make one for triangle (v0,v1,v2) where (v0,v2) is the base edge:

    split(v0,v1,v2,level)
    {
        vc=(v0+v2)/2
        split(v0,vc,v1,level+1)
        split(v1,vc,v2,level+1)
    }
Be careful when you implement this: *the order of the vertices is important*. If you draw out on a piece of graph paper what is happening, then the order of the vertices should make a curve where you don't have to lift your pen off the paper as you draw it. It is NOT okay to just go to the left and right sub-triangles while forgetting the vertex order.
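For concreteness, here is a hand trace of the recursion to two levels, for an example triangle I picked arbitrarily: v0=(0,0), apex v1=(1,1), v2=(2,0). Notice that each leaf shares an edge with the leaf emitted just before it -- that is the pen-never-lifts property:

    split((0,0),(1,1),(2,0),0)            vc=(1,0)
      split((0,0),(1,0),(1,1),1)          vc=(.5,.5)
        split((0,0),(.5,.5),(1,0),2)      leaf 1: (0,0) (.5,.5) (1,0)
        split((1,0),(.5,.5),(1,1),2)      leaf 2: (1,0) (.5,.5) (1,1)
      split((1,1),(1,0),(2,0),1)          vc=(1.5,.5)
        split((1,1),(1.5,.5),(1,0),2)     leaf 3: (1,1) (1.5,.5) (1,0)
        split((1,0),(1.5,.5),(2,0),2)     leaf 4: (1,0) (1.5,.5) (2,0)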

This is a good order, but how do we get a vertex array or index array out of it? We have to terminate the recursion, which we neglected to do so far, and assign vertex indices as we go:

  1) set all vertex indices index(v) to UNUSED (e.g. -1)
     set the next_vertexi (index into the vertex array vertices[]) to 0
     set the next_indexi (index into the index array indices[]) to 0

  2) split(v0,v1,v2,0) where:

    split(v0,v1,v2,level)
    {
        if (level<8) {
            vc=(v0+v2)/2
            split(v0,vc,v1,level+1)
            split(v1,vc,v2,level+1)
            return
        }

        // level = 8 (leaf)

        putvert(v0)
        putvert(v1)
        putvert(v2)
    }

    putvert(v)
    {
        i = index(v)
        if (i == UNUSED) {
            index(v) = i = next_vertexi++
            vertices[i] = v
        }
        indices[next_indexi++] = i
    }
This is pretty much all there is to it. Chunked ROAM!
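If it helps, here is the same packing step written out as compilable C. This is a sketch of my own (not the LibGen/Roam code), under these assumptions: chunk vertices are identified by integer coordinates on the chunk's 17x17 grid (LEVELS=8, so 16 segments per edge), the index(v) lookup becomes a small 2-D table, and height_at() is a placeholder for your height-map lookup:

    #include <string.h>

    #define LEVELS    8                            /* splits per chunk, must be even */
    #define GRID      ((1 << (LEVELS / 2)) + 1)    /* 17 grid points per edge        */
    #define NUM_TRIS  (1 << LEVELS)                /* 256 leaf triangles             */
    #define NUM_VERTS (GRID * (GRID + 1) / 2)      /* 153 unique vertices            */

    typedef struct { int x, y; } GridVert;

    static int            vert_index[GRID][GRID];  /* index(v); -1 means UNUSED      */
    static float          vertices[NUM_VERTS][3];  /* packed vertex array            */
    static unsigned short indices[NUM_TRIS * 3];   /* packed index array             */
    static int            next_vertexi, next_indexi;

    extern float height_at(int x, int y);          /* placeholder height-map lookup  */

    static void putvert(GridVert v)
    {
        int i = vert_index[v.y][v.x];
        if (i < 0) {                               /* first time this vertex is seen */
            i = next_vertexi++;
            vert_index[v.y][v.x] = i;
            vertices[i][0] = (float)v.x;
            vertices[i][1] = height_at(v.x, v.y);
            vertices[i][2] = (float)v.y;
        }
        indices[next_indexi++] = (unsigned short)i;
    }

    /* (v0,v2) is the base edge and v1 the apex, exactly as in the text above */
    static void split(GridVert v0, GridVert v1, GridVert v2, int level)
    {
        if (level < LEVELS) {
            GridVert vc = { (v0.x + v2.x) / 2, (v0.y + v2.y) / 2 };
            split(v0, vc, v1, level + 1);
            split(v1, vc, v2, level + 1);
            return;
        }
        putvert(v0);                               /* level == LEVELS: leaf triangle */
        putvert(v1);
        putvert(v2);
    }

    /* Pack one chunk; the corners must be the bintree triangle's corners on the
       grid, e.g. pack_chunk((GridVert){0,0},(GridVert){16,0},(GridVert){16,16}). */
    void pack_chunk(GridVert v0, GridVert v1, GridVert v2)
    {
        memset(vert_index, -1, sizeof(vert_index));   /* step 1: all UNUSED          */
        next_vertexi = next_indexi = 0;
        split(v0, v1, v2, 0);                         /* step 2                      */
    }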

Well, let's see what the good AND the bad are for what we have just done. Good: 256 times less CPU "ROAM think", almost perfect peak performance (tris/sec) on modern graphics cards, all the nice ROAM efficiencies still intact, and your ROAM code lives on another day. Bad: since you didn't think about all those little triangles, either at preprocess time or at runtime, you didn't make very good decisions about exactly where to put details, and your accuracy per triangle suffers (a lot!). How much it suffers is data dependent. If your terrain is a fractal, you are hardly hurt at all. For real terrain data that has both smooth and rough areas in the same view, you can suffer a huge amount (several times more triangles than you need for the same accuracy).

There are solutions to this quandary that get ALL the good and NONE of the bad, but that is the stuff of the full ROAM 2.0 paper. I will give a hint: what I call "progressive vertex arrays", which certain game-development list guys mis-name "View-Independent Progressive Meshes (VIPM)". See the GDAlgorithms archives for my older posts and related ones by Charles Bloom and Tom Forsyth (amongst others) on this. Search for my name since I don't post that often, and look for posts in the same thread in nearby days.

As for the specific calls, I've detailed those in several places: the VTP list, the LibGen/Roam code (which I gave you a pointer to a while back), and perhaps on the GDAlgorithms list. The Roamstep4 code, I think, may have had some of the Nvidia calls, but it didn't have the chunk-management part.

So that should help get you started on ROAM chunking.

Look at Alex Pomeranz's RUSTiC work (done with me at LLNL) to get an idea of the more adaptive patches that can replace the uniform ones presented above.

-- Mark Duchaineau


Updated 2003-02-20 -- duchaine@llnl.gov