Casting a Critical Eye on GPU PTex


Storing data on the surfaces of meshes is something of a pain. It involves unwrapping the surface into a 2D UV layout, which is time-consuming and can lead to tricky issues such as seams and packing inefficiencies. I’m sure you know all about it.

For this reason, it makes sense to want to switch to some kind of automatic parameterization. Recently PTex has been proposed as a suitable approach for real-time graphics. There have been a couple of GPU implementations here and here.

Now, I’ve heard PTex mentioned by a few people as being the solution to our parameterization woes, and figured I should write down the main reasons I think PTex is currently not a viable solution.

PTex only really works well with quads. The solution for triangles presumably involves trying to quadrangulate your mesh, or trying to pair up triangles into the same rectangular texture tile (with gutter regions in between). This is annoying, but I’m going to make a couple of generous concessions to PTex in the interest of space, and the first one is: let’s just assume for the purposes of discussion that all our meshes are made up of quads (or that PTex for triangles can be solved elegantly).

Tallying up the overheads

In short, the issue with PTex is the sheer amount of memory overhead you need to introduce for it to work robustly and efficiently.

MIP mapping

The idea of PTex is that each quad gets its own little texture stored in some way that can be dynamically addressed from the pixel shader. In principle you could support MIP mapping by just letting each quad’s texture have a full MIP chain, but there are a few reasons why you don’t do this (increased border regions, and NVIDIA’s implementation relies on storing data in texture arrays which means that MIP 0 of one texture may be stored in the same array as MIP 1 from another texture). Instead each MIP level of each PTex texture is stored separately and you address it by computing the MIP level in the shader. I’ll use the term “tile” for the data associated with a quad at a specific MIP level. So each quad has one tile per MIP level.

In order to take advantage of hardware trilinear filtering, we need to make sure that each PTex tile has a single extra MIP level in addition to its top level. The idea is that the shader picks the MIP level it needs, and truncates down to the nearest integer level, then finds the PTex tile corresponding to this MIP level and samples it with the fractional part of the original MIP level. This means you will only ever touch MIP 0 and 1 (if you went past 1, then it would simply address a different tile altogether, and touch MIP 0 and 1 of that tile).
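As a rough sketch of that selection step (written in Python for readability rather than as shader code; the function name, its parameters, and the clamping behaviour are assumptions made for illustration, not taken from either GPU implementation):

```python
import math

def select_tile_and_lod(mip_level, num_levels):
    """Split a continuous MIP level into (tile index, hardware LOD).

    mip_level  -- the fractional MIP level the shader computed
    num_levels -- how many per-MIP tiles this quad has (hypothetical parameter)
    """
    # Clamp to the range of tiles that actually exist for this quad.
    clamped = min(max(mip_level, 0.0), float(num_levels - 1))
    # Truncate to find which tile to address...
    tile_index = int(math.floor(clamped))
    # ...and hand the fractional part to the sampler, so the hardware
    # only ever blends between that tile's MIP 0 and MIP 1.
    hardware_lod = clamped - tile_index
    return tile_index, hardware_lod

print(select_tile_and_lod(2.25, 5))
```

The returned LOD is always below 1, which is why a single extra MIP level per tile is enough for trilinear filtering.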

In other words each tile needs only a single extra MIP level to get hardware trilinear filtering, but this extra MIP level is redundant (it’s the same data as the tile associated with the next MIP level). So, we get a 25% overhead for hardware trilinear filtering.

Border regions

In the most naïve versions of PTex, seamless filtering across neighboring quads is achieved by adding a single texel border region. In reality, your texture data will be block compressed. The colors you get out of the sampler will therefore not exactly match the input colors. In order to get the exact right colors, you need to add a whole 4 texel border so that the BC block in the border region exactly matches the neighboring BC block.

You might just use a single texel border and think it doesn’t matter too much if it doesn’t exactly match the texels of the neighbor quad, and that’s pretty much what people do for current UV parameterization (with some manual tweaks every now and then). However, with PTex you’re adding texture seams on every single edge, rather than just at artist-specified locations, so you really can’t afford for the seams to be anything but completely invisible or you’ll get artifacts all over the place.

The next complication is that we’re storing two MIP levels per tile, but we need the second MIP level to filter correctly too. So really, we need a full 4 texel border in the second MIP level, which means the top level needs 8 texels of border region.
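The border arithmetic so far can be sketched as a small helper. This is only an illustration of the reasoning above: the function name and its parameters are hypothetical, and it assumes 4×4 BC blocks and that each finer MIP level needs twice the border (in texels) of the level below it.

```python
def top_level_border(kernel_radius_texels, block_size=4, extra_mips=1):
    """Border width (in texels) needed in the top MIP level of a tile.

    kernel_radius_texels -- how far the filter reaches past the tile edge,
                            measured in the coarsest stored MIP level
    block_size           -- BC block edge length (4 for the usual formats)
    extra_mips           -- redundant hardware MIP levels stored per tile
    """
    # Round up to whole BC blocks so border blocks can be copied
    # verbatim from the neighboring tile.
    blocks = -(-kernel_radius_texels // block_size)  # ceiling division
    coarse_border = blocks * block_size
    # Every finer level covers the same UV range with twice the texels.
    return coarse_border << extra_mips

print(top_level_border(1))                 # one texel of reach, one extra MIP
print(top_level_border(1, extra_mips=0))   # no extra MIP level stored
```

With one texel of filter reach and one extra MIP level this gives the 8 texel figure from the text; dropping the extra MIP level gives the 4 texel figure.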

If trilinear filtering were the only reason we needed such big borders you might be tempted to try to do clever tricks in the BC encoder logic and use just one texel border. E.g. maybe BC blocks on the edges of the quad could share anchor points, so that the border texels are exactly identical. This sacrifices compression quality for space, and may not be acceptable given that block compression degrades quality by quite a bit already.

However, I think most agree that the days of straight bi- or trilinear filtering are gone or soon-to-be-gone. We need anisotropic filtering too. So since we probably need at least 8 texels of border to cover the anisotropic filter kernel anyway, we might as well just copy the BC blocks directly from the neighbors without messing with the encoder. It seems reasonable to proceed assuming that we’re going to use 8 texels of border.

Border region overhead varies with the size of the PTex texture. For a tile of size NxN the total number of texels in the top MIP level is then (N+16)^2. Then, for all but the smallest resolution tile for a quad we add in the overhead of a single MIP level and we end up with the total number of texels per tile: 1.25*(N+16)^2

For next gen games we’re going to have to assume that primitives will be pretty small. Ideally larger than a few 2×2 pixel blocks to avoid pixel shading inefficiencies for forward rendering, but still not massively larger than that because it would make the geometry look too blocky. Let’s be generous here and say that the representative size of a quad is about 16×16 pixels, which means we’re going to need 16×16 texels to texture it at the ideal resolution.

For 16×16 PTex tiles you get 2400 texels for the three MIP levels from 16×16 down to 4×4, using the formula above. The total number of real texels is 16×16+8×8+4×4 = 336 texels. This gives an overhead of roughly 7x.
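To make the arithmetic easy to check, here is a minimal sketch that walks the MIP chain and applies the formula above. The function name and defaults are made up for illustration; the assumptions are 8 texels of border on every side, a chain that stops at 4×4, and no redundant MIP level on the smallest tile.

```python
def chain_texels(top_size, border=8, smallest=4):
    """Return (stored texels, real texels) for one quad's chain of tiles."""
    stored = 0
    real = 0
    size = top_size
    while size >= smallest:
        padded = (size + 2 * border) ** 2
        if size == smallest:
            # The smallest tile has no redundant hardware MIP level.
            stored += padded
        else:
            # Top level plus one half-resolution copy: 1.25 * padded.
            stored += 5 * padded // 4
        real += size * size
        size //= 2
    return stored, real

stored, real = chain_texels(16)
print(stored, real, stored / real)
```

For a 16×16 quad this yields 1280 + 720 + 400 stored texels against 256 + 64 + 16 real ones.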

Size quantization overhead

In order to filter textures between quads of different sizes, both the AMD and NVIDIA implementations require that PTex quads use power-of-two texture sizes. This is because it’s easy to make the border texels for the larger tile equal the upsampled version of the smaller tile whenever the ratio of texture sizes is an integer. The easiest way to ensure this is to pick power-of-two texture sizes for each quad.

But the ideal texture size for a quad is unlikely to be exactly a power-of-two size! This means we have to bump up each quad’s texture size to the next power of two. You may only need 17×17 texels, but you’ll get 32×32!

If we assume that the ideal size for a quad is roughly a uniform distribution around the two nearest power-of-two sizes, the average overhead is about 1.71x. Add this in to our existing number and we get approximately 12x overhead. I’m assuming square tiles here, btw, to simplify the comparison (NVIDIA’s sample implementation in particular deals poorly with non-squares anyway, since each different aspect ratio adds a draw call).
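The post doesn’t spell out how the 1.71x figure is computed. One reading that lands on the same number (an assumption, not necessarily the original calculation) is to spread ideal edge lengths uniformly between a power of two and the next one, round them all up, and divide total texels allocated by total texels wanted. That ratio is 4 / (7/3) = 12/7 ≈ 1.71. A small numeric sketch, with hypothetical function names:

```python
def next_pow2(n):
    """Smallest power of two that is >= n (for positive integers)."""
    p = 1
    while p < n:
        p *= 2
    return p

def mean_round_up_overhead(samples=100_000):
    """Allocated / wanted texels for ideal edge lengths uniform on (1, 2],
    all rounded up to 2. Scale-free, so the same ratio holds for (N, 2N]."""
    total_wanted = 0.0
    for i in range(samples):
        edge = 1.0 + (i + 0.5) / samples   # midpoint sampling of (1, 2]
        total_wanted += edge * edge
    total_allocated = samples * 2.0 ** 2
    return total_allocated / total_wanted

print(next_pow2(17))              # the 17x17 example from the text
print(mean_round_up_overhead())   # close to 12/7
```

Note that this averages over totals; averaging the per-quad ratios instead would give a different (larger) number, so the exact figure depends on which definition you pick.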

Isn’t this a bit pessimistic, what about other tile sizes?

Okay, so I tried to be conservative with the 16×16 tile size, but let’s say you have extremely low-resolution meshes for whatever reason (e.g. maybe you use tessellation to dice it up on the GPU).

Running the same numbers again for a few tile sizes, we end up with overheads like so:

Tile size | Border Overhead | MIP+Size Overhead | Total Overhead
----------|-----------------|-------------------|---------------
16×16     | 5.7x            | 1.25*1.71x        | 12.3x
32×32     | 3.1x            | 1.25*1.71x        | 6.7x
64×64     | 1.9x            | 1.25*1.71x        | 4.2x
128×128   | 1.4x            | 1.25*1.71x        | 3.1x
256×256   | 1.2x            | 1.25*1.71x        | 2.6x
512×512   | 1.1x            | 1.25*1.71x        | 2.4x

At larger sizes, most of the overhead comes from the inability to select arbitrary tile sizes, as well as the 25% overhead for the extra MIP level, but I actually expect real-world games would require the smaller tile sizes where the overhead from border regions dominates.
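For anyone who wants to re-run the numbers for other tile sizes, the sweep can be sketched as below. Several things here are assumptions rather than facts from the post: the size factor is taken as 12/7 (about 1.71), MIP chains stop at 4×4, and the “Border Overhead” column is read as the combined border-and-MIP ratio divided by 1.25. Under those assumptions the results land within about 0.1 of the table’s values.

```python
SIZE_FACTOR = 12 / 7   # assumed origin of the 1.71x size quantization figure
MIP_FACTOR = 1.25      # one redundant hardware MIP level per tile

def overhead_row(top_size, border=8, smallest=4):
    """Return (border overhead, total overhead) for one top-level tile size."""
    stored = 0.0
    real = 0
    size = top_size
    while size >= smallest:
        padded = (size + 2 * border) ** 2
        # The smallest tile carries no redundant MIP level.
        stored += padded if size == smallest else MIP_FACTOR * padded
        real += size * size
        size //= 2
    with_mips = stored / real
    border_only = with_mips / MIP_FACTOR   # assumed definition of the column
    total = with_mips * SIZE_FACTOR
    return border_only, total

for n in (16, 32, 64, 128, 256, 512):
    border_only, total = overhead_row(n)
    print(f"{n}x{n}: border {border_only:.1f}x, total {total:.1f}x")
```

Changing `border` or `smallest` shows how sensitive the small-tile rows are to the border width, while the large-tile rows stay pinned near 1.25 × 1.71.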

Why does this work for movies then?

Movie renderers have a lot of flexibility that real-time renderers don’t. For example, they store no border regions, they just use more complicated shading logic instead (read the original PTex paper – it’s very clever, but doesn’t map to GPU filtering hardware). For the same reason, they don’t store redundant MIP levels to use hardware filtering, nor do they need to restrict their texture sizes to powers of two so the overhead caused by the size restrictions is gone too. Basically none of the sources of overhead exist for offline renderers.

Could PTex work for games with tessellated meshes?

PTex could potentially work okay for models with very big polygons. One situation where this might seem like it could be the case is for a renderer that uses tessellation and displacement mapping pervasively, so that all the base meshes use very large polygons, with tessellation and displacement adding the fine-scale detail back. However, as I have discussed in a previous post a couple of years ago, DX11-style tessellation is by no means a no-brainer for “everything” (even though the solution I suggest in that post – which amounts to using the standard trick of reducing minification aliasing by band-limiting the input signal using MIP-mapping – seems workable, even if it’s not ideal).

There are plenty of issues with pervasive tessellation, and I won’t rehash all of them here. Suffice to say I’m not aware of anyone really adopting it as a mechanism for drastically reducing the polygon counts for the base meshes in general. It seems to mostly be used on high-polycount meshes for making earlobes slightly smoother and adding some small scale details up close, as well as some “special” surfaces like terrain or water.

In any case, tying the viability of PTex to a workable tessellation strategy seems like a pretty serious caveat. So if that’s the argument for PTex, it should be stated that way by its proponents.

What would we need to change to make PTex work more generally?

It’s tempting to try to avoid many of these issues by just doing manual filtering near edges, but for high quality next-gen games the requirement for anisotropic filtering means you need large filter kernels, which are expensive to fetch and filter, and also means it’s highly likely that at least one thread in a thread group needs to go through the costly path.

Edit:

I originally forgot to mention this old trick (that you may have seen in the context of virtual texturing):

One thing you can do to improve the border overhead is to store only one border “strip” for each pair of neighboring tiles. Basically instead of sampling into your left border region, you’d sample into the right border region of your left neighbor (by shifting/rotating the U-coordinate before looking up the tile, and then shifting it back just before sampling). You’d always keep the border of the higher resolution tile, if they don’t match.

You can skip quite a few border regions this way at the cost of some translation overhead to figure out where to sample from. So in a sense, this is a compromise between full hardware filtering and manual filtering. You do some UV transforms for pixels near edges, but still rely on the hardware to do the actual filtering. It would also be tricky to make this work for e.g. NVIDIA’s implementation, since it stores all the tiles in the same array – a variable number of border “strips” would increase the number of draw calls. For AMD’s implementation it’s less problematic, although it would still lead to problems (the more different sized quads you store, the more troublesome the atlas packing becomes).

So you could cut the border region overhead by about a factor of two if you’re lucky (again, it always keeps the “high res” border if there’s a mismatch) – it’s a big win, but there are still many compounding sources of overhead.

End edit.

We might be able to get rid of at least the cost of the size restriction by letting some texture tiles snap to a lower power-of-two instead of always rounding up. This would reduce the overhead, at the expense of some quality in the form of blurrier textures. For example, clamping tiles to the nearest power-of-two instead of rounding up would change the overhead factor from 1.71x to 1.3x, but now half your textures will be slightly too blurry. This is an improvement, but isn’t really sufficient on its own to bring down the overheads enough.

Basically, I think the GPU needs to know about mesh primitive adjacency and do filtering for us, even across edges. This includes anisotropic filtering. You’d pass in the primitive ID and a GPU generated intraprimitive coordinate (e.g. barycentric coordinates for triangles). If the GPU is already doing filtering, you might as well relax the power-of-two restriction too. And of course, it needs to support triangles as well as quads. At this point I’m not sure if you can call it PTex or if it’s become some hardware implementation of Mesh Colors, but it’s not really important what you call it.

Conclusion

I generally sympathize with the desire to store data directly on the surfaces of meshes without a complicated and tedious parameterization to a 2D UV-space, but PTex is not quite there yet. Even on a PC with 64-bit address space and gigs and gigs of RAM (and clever management of GPU memory), overheads of up to 12x are too high. Even with various mitigating strategies, the overheads remain substantial.

As much as it annoys me that we still don’t have an efficient way to store and retrieve filtered data on the surfaces of meshes, I think realistically we have to live with the 2D parameterization workaround for now.

9 thoughts on “Casting a Critical Eye on GPU PTex”

  1. You might want to think about the problem a bit more — it is possible to support ptex on current GPU hardware (Just with OpenGL 3.1, no extensions) with no seams, good filtering, good performance, and *way* less overhead than what you are coming up with. See the current release of Autodesk Mudbox for an existence proof of this that runs on Windows, OSX, and Linux on DX10 level gfx hardware (1 or more gig of GPU ram recommended, 8+ gigs of cpu ram.)

    On the other hand, you’re right that handling triangles with ptex is a serious pain.

    Cheers,
    Ian.

    • Care to clarify? Without knowing what it does I can’t really say one way or the other if they avoid the issues.

      Can’t seem to find any good description of ptex in mudbox. I’m not too familiar with mudbox, but I can imagine a tool like that might for example not have to deal with block compression. Also, an authoring tool might just do manual filtering to avoid seams altogether since performance isn’t quite as critical as a final runtime renderer. Does it allow arbitrary texture sizes, or does it stick to power-of-two restrictions?

      The only two GPU implementations I’ve seen described in detail (Nvidia’s and AMD’s) seem to share the problems I mention.

      • As I said, if you think about the problem, you will be able to figure it out. (I’m leaving it as an exercise for the reader…. 🙂 )

        As for Po2, yes, but only because the PTEX spec restricts similarly. The techniques used by Mudbox internally do not have this restriction. And we do not filter “manually” — drawing the textured polys runs at full speed.

        We don’t do block compression, but I see no reason why it wouldn’t work. (In an authoring tool, performance would suck — those algorithms are slow to compress, fast to decompress — and our textures are changing constantly — we would have to disable compression on those textures that are being modified, compress in the background later — doable, but we have other fish to fry at the moment. And not much use to our film customers — they are working mostly at 16bit float per channel for color channels, and 32 bit float for displacement textures. And their texture sets are massive — for example King Kong had over 40 gigabytes in the diffuse color channel alone — more recent productions dwarf that.)

        For a bit more info on Mudbox, see;

        Massive texture handling;

        This video shows Ptex workflows.
        We also allow you to select regions of the mesh and locally increase/decrease texture resolution. That’s where Ptex really shines — doing that with conventional UVs is very difficult Ptex makes it very easy.

        Virtually all of the heavy computation in the texture painting part of Mudbox is done on the GPU in fragment shaders.

        (No Rickrolls, I promise 🙂 )

        Cheers,
        — Ian Ameline
        (Mudbox Tech Lead, but speaking for myself, not my employer.)

  2. Hmm.. Cryptic reply. Maybe I’m missing something blindingly obvious here, but as far as I can tell there are some hard limitations here and something has to give. Either you’re doing manual filtering, or you have to work within the constraints of hardware filtering, and those constraints are really the only thing that plays into my argument.

    If you’re doing hardware filtering then you have to have enough padding around each tile to cover whatever size the hardware filter kernel is. You can share this gutter region between tiles (at the expense of some more transformations to find out where to sample from) and this can save you 50% of the gutter overhead, but by the time you issue a sample request you have to have enough texels around that point to support the filter kernel. This seems to me to be a mandatory aspect of how the hardware works. Right?

    So:
    * Smooth filtering across seams requires some sort of redundant texels along borders
    * For wide filter kernels you need more redundant texels.
    * For anisotropic filtering you need very wide kernels.
    * For block compression you need to round up the border regions to the nearest multiple of the block size (4×4) so that you don’t get seams due to compression artifacts being different on neighboring texels.
    * For hardware MIP mapping you need to have at least one extra hardware MIP level.
    * For each hardware MIP level the border region requirements double because you need the same sized border region (in texels) for smallest MIP chain, which will therefore occupy twice the size in UV-space. It’s probably best to have one redundant MIP level, than to bloat border texels.

    Power-of-two restrictions causes substantial overheads and would be enough all on its own to make this unreasonable for games, IMHO. I mean, imagine having two quads next to each other that you want to texture uniformly, one is ever so slightly larger than the other (a few percent uniform scale, say), does it really make sense that the cost of texturing the larger one is 4x higher because you needed to constrain it to a power-of-two size?

    • You are missing something (perhaps a couple of somethings). It’s not exactly blindingly obvious, but you are missing it.

      For typical production meshes, we are getting Texture usage rates not too much worse than hand-laid out UVs. There are some more recent GL extensions that let us tell the gfx pipe when rectangular areas of textures are unused. With those extensions, we can get very close to matching the memory efficiency of typical hand-laid out UVs. (usually around 70 to 80% or so; game devs may pay more attention to this and get better usage rates than we see in the film world.) And we are using the filtering in the hardware samplers.

      I agree the Po2 issue means that PTEX is still not great for games — but Mudbox lets you paint at very high res with PTex, without having to worry about UVs. You can finalize the UVs very late in the process and transfer the textures over only once (with just one resampling at that time). That workflow works well for games. Further, you can (if you want) use different UV layouts for different texture channels.

      Cheers,
      Ian.
      (All bets are off for those texture usage claims if the mesh has more than a very small handful of triangles.)

      • I probably should’ve been clearer that I’m talking about games here. A lot of problems go away when you’re doing authoring tools. E.g. no block compression (it’s not even desirable while authoring!) means you don’t have to worry about a given pixel all of a sudden looking different if you copy it somewhere else in the texture because its neighbors change.

        I’m guessing you’re talking about PRT here to unload sparse regions of memory (a la AMD’s PTex implementation). That’s all well and good (though not portable), but even if you assume that texels_used / texels_allocated is the same as memory efficiency (it’s not: you want texels_desired / texels_allocated), 70 to 80% is not awesome. Not bad, but not great. I’ve seen (proprietary) texture atlasers do way better than that.

        I’m guessing that you reduce gutter regions by coalescing blocks before packing them into the texture (and then deallocating empty regions). I guess that may technically count, but it’s not really much different from any old UV atlasing that we already do (and what I’m arguing we regrettably need to keep doing – split the mesh up into large charts and pack those so that most edges have natural adjacency in UV space).

  3. The 70 to 80% assumes generous border regions where needed (and the texels in the border regions are not counted as “wanted”) There are other issues in an authoring app — some coherence with placement in textures and position in world space is nice for texture painting performance, and being able to re-layout the faces in texture space very fast is also something nice. We could get tighter packing but there would be other costs, and what we have is “good enough”. 🙂

    Your assumption re atlasing and coalescing is reasonably close to what we are doing. The use of PRT Api extensions is optional. Nice if they’re present, not too bad if not.

    I will agree — PTEX is unlikely to replace UVs in games. But I still contend that the situation is not nearly as bad as you made it out to be originally — you do not need borders around every face — not even close.

    — Ian.

    • I get the feeling that we’re almost arguing semantics now. My argument against PTex is an argument against “pure” PTex where you store a completely separate texture per quad. I count both AMD’s and Nvidia’s implementations as part of that category as they only do trivial “packing” of the tiles so they stick to the original idea in spirit.

      Sounds like your implementation is more of a hybrid where you’re moving towards more of a traditional atlas approach where quads are merged into larger patches. And that’s fine. In fact, that’s what I’m arguing that we’re going to need to do since the overheads of pure PTex are so high.

      I’d probably argue that you should keep going all the way to a full blown generic atlasing solution instead of constraining yourself to quads of power-of-two sizes, but I can see why an authoring tool in particular may not want that (harder to adjust on the fly, etc.). You can always bake it to a different representation at the end.

  4. I think you’re missing a significant point. Take the Rage engine, for example, where artists can paint textures over every little aspect of a map. That’s not at all practical from a runtime perspective, it’s completely uneconomical. And from a pure runtime execution/GPU perspective, you would be right in that PTex would end up requiring you to jump through hoops just to pack what you have up into a UV texture coordinate atlas.

    The point I think you’re missing is *content creation*. The point of these technologies is to make it easier for the artists creating the content. From that standpoint, something like Rage or PTex makes things very practical in allowing artists to go further. Creating a UV map is a very painstaking process where even the best mesh parameterization algorithms and tools result in artists fiddling to try to avoid seams and distortion. The results tend to be very imperfect.

    At the same time, these techniques are very impractical in games for reasons beyond texture space. If you imagine dynamically fracturing a textured object with PTex, that involves an on-the-fly mesh topology change, and unless we’ve already converted the PTex representation to UV coordinates and tossed the PTex data aside, the only way to make such topology changes would be to reinterpolate texels: a very expensive process. So there’s a lot more to PTex making it impractical for direct use in games, and I’d agree overall that we’re very far from seeing its use in these most demanding realtime scenarios.
