Dates and details on the GeForce 6800 family
TR: When will the GeForce 6800 Ultra arrive in stores?
Tamasi: By Memorial Day the 6800 Ultra will be available, and by July 4th, the full line of the 6800 series will be broadly available.
TR: On the non-Ultra, how much memory will it have?
Tamasi: The $299 card?
Tamasi: That’s actually up to the add-in card guys. There will be versions, I suspect, with 128 and 256MB, but that’s more up to the add-in card guys than us, really.
TR: Will that card have a 256-bit path to memory?
Tamasi: Yes it will.
TR: Will it be DDR, DDR2, or DDR3 memory?
TR: That combination of specs sounds like a tall order at $299. Can you guys make money selling it at that price?
Tamasi: If we couldn’t, we wouldn’t. [Laughter.]
TR: We’ve heard that the GeForce 6800 Ultra GPU is 222 million transistors. How do you guys count transistors? Do you count all SRAM/cache, etc?
Tamasi: The only way we really know how to give an accurate transistor count is to count up all transistors on the chip, and that’s everything. So that number includes caches, FIFOs, register files. It’s all transistors. It’s not just logic transistors.
TR: Are you willing to divulge die sizes?
Tamasi: No, we don’t typically divulge that stuff. It’s big. [Laughter.]
TR: Are you counting the same way for this one as for the NV30 series and past GPUs?
Tamasi: Yep. We’ve counted transistors the same way since we’ve talked about transistor counts. In fact, I’m not sure why anyone would ever throw out a transistor count for a chip that wasn’t actually the transistor count of the chip.
TR: We noticed some interesting things about GeForce 6-series antialiasing in our review. Is the GeForce 6800’s 8X antialiasing mode 4X supersampling plus 2X multisampling?
Tamasi: The current mode that’s actually in the control panel is a 4X super/2X multi, and that will work in both OpenGL and D3D. We actually do have a 4X multi/2X super mode that a driver, probably within the next few weeks, is going to enable as well.
TR: Does GeForce 6800 antialiasing do anything at scan-out that won’t be picked up in screenshots? If so, what is it doing and in which modes?
Tamasi: The resolve pass (when you multisample, you typically have to do a resolve pass) can either be done as another pass in the frame buffer or at scan-out. In the case of, like, if you’re doing, say, 4X multisampling, that resolve pass is actually done what we call “on the fly.” We don’t take a separate pass and write another buffer.
So if you take screenshots, you need to… there’s a couple of utilities that will do the right thing and a couple of them which will not do the right thing. In fact, our drivers now basically do the right thing. In other words, when you grab a frame, it will give you a post-resolve image as opposed to a non-multisampled image.
TR: Now, does that apply in all your multisampled modes?
Tamasi: Yeah. This resolve on the fly technology works for any multisampling mode.
TR: What about screenshots from 3DMark03? When you use its image quality tool, does it produce the correct output?
Tamasi: If you select AA with 3DMark, then you’ll get the correct frame grabs.
TR: ATI has touted “gamma-correct blending” for antialiasing in the R300 series. Does the GeForce 6800 Ultra have this feature, and if not, why not?
Tamasi: It does, and I want to be really specific about this, because there’s a lot of confusion about it. There’s a great deal of difference between gamma correction and gamma adjustment. What ATI does is a gamma adjustment to gamma 2.2, which can be correct depending on your display, and that’s essentially what we do, as well. Gamma correction would typically mean you could do an adjustment to any gamma, and that would require a shader pass.
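To make the distinction concrete, here is a minimal Python sketch of why a fixed gamma-2.2 adjustment matters when resolving antialiasing samples. It is illustrative only: the function names and the two-sample edge case are our own, not either vendor's implementation.

```python
def to_linear(c, gamma=2.2):
    """Convert a display-space value in [0, 1] to linear light."""
    return c ** gamma

def to_display(c, gamma=2.2):
    """Convert linear light back to display space."""
    return c ** (1.0 / gamma)

def naive_resolve(samples):
    # Average display-space values directly: edges come out too dark.
    return sum(samples) / len(samples)

def gamma_adjusted_resolve(samples, gamma=2.2):
    # Linearize, average, re-encode: perceptually smoother edges.
    linear = [to_linear(s, gamma) for s in samples]
    return to_display(sum(linear) / len(linear), gamma)

# A black/white edge covered half-and-half by two AA samples:
print(naive_resolve([0.0, 1.0]))           # 0.5
print(gamma_adjusted_resolve([0.0, 1.0]))  # ≈ 0.73, visibly brighter
```

The gamma-adjusted result only matches the display if the display really is gamma 2.2, which is Tamasi's point about "adjustment" versus true correction to an arbitrary gamma.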
TR: The GeForce 6800 Ultra’s pixel shader performance is way up from your previous-generation GPU.
TR: Are the NV40 pixel shaders derived from NV30-series shaders, or are they a clean-sheet design?
Tamasi: It’s a clean-sheet design. About the only thing they have in common is you could draw a block diagram and some of the blocks might look similar, but the code is all new.
TR: One of the GeForce 6800’s more important new features is Shader Model 3.0. Can you tell us briefly about Shader Model 3.0? How it will benefit gamers?
Tamasi: A couple of ways. There’s two big hunks of Shader Model 3, vertex and pixel shading.
On the vertex side, what Shader Model 3 brings is really three things: a much richer programming model, longer programs, and more interesting flow control. So from a developer’s perspective, they can do a lot of interesting things in Shader Model 3 that they either couldn’t do before in Shader Model 2 at all, or can now do much more efficiently. So, for example, complex character animation. When you’re skinning a character in Shader Model 3, you can actually branch and skip over pieces of code that would be unused, which is a nice performance win, whereas in Shader Model 2 you’d have to execute that code anyway.
There are some new features in vertex shader 3.0. There’s a thing called vertex texture fetch, which allows applications to actually access texture memory from vertex processing. That can be used for a lot of things, including real displacement mapping, where you access a height field and then displace vertices in the vertex shader.
One of the, probably, most overlooked but maybe most interesting features is one called geometry instancing, which essentially allows developers to batch up what previously would have been lots of small transactions, lots of small models, into very large indices of models and transmit those efficiently across the bus and into graphics. This particularly helps applications that do what we call lots of “little dude rendering.” Real-time strategy games are a great example of this, where you might have hundreds of relatively low-polygon-count models running around. Previously, you’d have to basically make a draw call for each one of those models, and that can be really inefficient. You know, it can load your CPU down, and you can have poor graphics utilization. Using geometry instancing, you can basically batch all that up into many times fewer draw calls, typically tens and sometimes hundreds of times fewer, which will reduce your CPU utilization, allow your frame rates to improve, and improve your efficiency with your graphics processor.
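The draw-call arithmetic behind instancing can be sketched in a few lines of Python. This is a toy model, not a Direct3D example: the scene contents and the mesh names are invented, and the two functions just count the calls an app would issue.

```python
# Hypothetical RTS frame: many low-poly "little dudes" of two types.
units = [("soldier", (float(x), 0.0)) for x in range(300)] + \
        [("tank", (float(x), 5.0)) for x in range(50)]

def draw_naive(units):
    # One draw call per unit: 350 calls, heavy CPU overhead.
    return len(units)

def draw_instanced(units):
    # Batch per mesh: one call per model type, with the per-unit
    # transforms packed into a per-instance data stream.
    batches = {}
    for mesh, transform in units:
        batches.setdefault(mesh, []).append(transform)
    return len(batches)

print(draw_naive(units), draw_instanced(units))  # 350 2
```

Going from 350 calls to 2 is exactly the "tens to hundreds of times fewer" reduction Tamasi mentions; the GPU still renders every unit, but the CPU-side submission cost collapses.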
That’s on the vertex side. On the pixel side, it’s much the same. You have a much richer programming environment, so you have very, very long programs, many orders of magnitude more instructions than Shader Model 2 provides. You have a real flow control model, so you get support for loops and branches and a call stack, just like you get in a real programming environment, and of course for Shader Model 3, the required precision is FP32, so you don’t get any artifacts that might have been due to partial precision. You can still get access to partial precision, but now anything less than FP32 becomes partial precision. Essentially, the required precision for Shader Model 3 is FP32.
What do gamers get out of this? Well, they’re going to get titles or content that either looks better or runs faster or both.
TR: I’d like to clarify something about Pixel Shader 3.0 programs. Some of the literature mentions instruction length limits “greater than or equal to 512,” while others say the limit is 65,536 instructions. What’s the story?
Tamasi: The minimum number of slots is 512, but if you support looping, you can execute many more instructions than that. So it’s a combination of… basically, flow control is the big reason for that. Shader Model 3 allows you to do flow control, so you can do loops and branches, and Shader Model 2 does not. There is a new profile, which ATI kind of announced at GDC, which is their 2.0b profile, which basically supports what they claim to be 512 instructions, but there’s no flow control, no changes in precision, no loops, no branching: none of the new features, so to speak, of Shader Model 3, just 512 instructions in one pass. They basically don’t support loops or branching. Our hardware supports the full Shader Model 3 model, so you get 512 slots, so to speak, and with loops and branching you can execute 65,000 instructions.
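The slots-versus-executed-instructions distinction is simple arithmetic, sketched here in Python with hypothetical program sizes of our own choosing:

```python
# Shader Model 3 separates static program size from dynamic work.
STATIC_SLOT_LIMIT = 512   # minimum instruction slots an SM3 part provides
DYNAMIC_LIMIT = 65536     # instructions that may actually execute per pixel

# A loop body is stored once in the slots but can run many times,
# so a 500-slot program (hypothetical size) executes far more
# instructions than it occupies:
body_slots, iterations = 500, 131
executed = body_slots * iterations
assert body_slots <= STATIC_SLOT_LIMIT and executed <= DYNAMIC_LIMIT
print(executed)  # 65500 dynamic instructions from a 500-slot program
```

A 2.0b-style profile without flow control tops out at its static limit: 512 slots means at most 512 executed instructions per pass.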
TR: About dynamic flow control in real-time pixel shaders. Branching and conditionals seem to have the potential to produce some relatively costly pipeline stalls. What direction are you guys giving game developers about how to avoid these scenarios?
Tamasi: Basically, use them carefully. [Laughter.] You’re absolutely right. If you don’t use branching properly, it can be a performance loss, but there’s lots of scenarios where it can be a performance win. In fact, our mermaid demo uses branching quite effectively. The shader for the mermaid itself is actually one large shader, and it branches to determine whether it’s skin or the scale of what we call the fish-scale costume. We’ve been quite explicit about, you know, make sure you’re using branching to your application’s benefit. You’re right in that it’s not “free.” In fact, it’s not free on a CPU, either. It’s just that when you talk about a parallel pipeline like a graphics processor, executing a branch becomes a little bit trickier.
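Why a branch on a parallel pipeline is "a little bit trickier" can be modeled crudely in Python. The group size and costs below are invented, but the core idea holds for lockstep execution: when pixels in a group disagree, the hardware effectively pays for both sides of the branch.

```python
def simd_branch_cost(mask, then_cost, else_cost):
    """Cost of a branch for a group of pixels running in lockstep.

    mask[i] is True if pixel i takes the 'then' side. If the whole
    group agrees, only one side runs; if it diverges, both do.
    """
    if all(mask):
        return then_cost
    if not any(mask):
        return else_cost
    return then_cost + else_cost  # divergent: both paths execute

# Coherent region (say, all skin pixels): branching is a clear win,
# since the scale shader is skipped entirely.
print(simd_branch_cost([True] * 4, then_cost=40, else_cost=60))   # 40
# Boundary between skin and scales: both shaders run.
print(simd_branch_cost([True, False, True, False], 40, 60))       # 100
```

This is the "use them carefully" guidance in miniature: branches pay off when the condition is spatially coherent, like the mermaid's skin-versus-scales test, and cost extra where it flips pixel to pixel.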
Shader Model 3.0 in real-time apps
TR: One of your examples of a complex pixel shader at Editor’s Day was a skin shader for Gollum from Lord of the Rings with subsurface scattering. The presentation said that shader required 135 instructions, 14 texture accesses, and 259 FLOPS per pixel to compute.
Tamasi: Yeah, that was just the subsurface scatter component.
TR: What kind of shader lengths are viable for real-time applications with the GeForce 6 series? Can you give me a ballpark?
Tamasi: Hundreds of instructions. Frankly, it depends on the nature of the math, what you’re doing, how many texture accesses, that kind of thing, but to give you a feel of it, at that same Editor’s Day, the folks from Epic gave a demonstration of Unreal Engine 3, and they commented that most of their shaders are between 50 and 150 instructions long.
TR: I’m curious about this: Developers will probably be writing shaders in a high-level shading language like HLSL, which will then be compiled for the target hardware, if I understand correctly. What would a developer writing in HLSL do differently if his target were Shader Model 3.0 versus Shader Model 2?
Tamasi: Basically, they’ll write in HLSL, and really, there are two levels of compilation; that’s the right way to think about it. The API, DirectX, will do what I would call a kind of pre-compilation to whatever runtime target, whether it’s Shader Model 2.0, Shader Model 2.0b, or Shader Model 3.0. Then, once the API does that work, there’s actually a compiler in the driver. Anybody who builds hardware has a compiler in their driver which will take the API instruction set and turn it into essentially machine code for the hardware.
So from a developer’s perspective, they write in HLSL, and if they want to support Shader Model 3, they’ll write code that requires loops and branching and has long shaders, and the API will deal with that. If they want to target hardware that supports something less than Shader Model 3, they’ll have to write HLSL code with that in mind. And basically, there’s a profile for that that Microsoft provides. It was actually part of DirectX 9 initially. Shader Model 3 was actually in the API in DirectX 9. DirectX 9.0c will essentially enable it from a hardware perspective.
TR: Can you give us some quick examples of effects possible in real time with Shader Model 3.0 that aren’t possible with Shader Model 2.x?
Tamasi: There’s a lot of sophisticated shadowing and lighting algorithms that you can do that would be, not necessarily impossible, but just very impractical with Shader Model 2. For example, you can early exit in Shader Model 3 from a shader that might require execution of hundreds or potentially many hundreds of instructions in Shader Model 2, which might be impractical from a performance perspective. You can do true branching, which means, simply, you can do things that you can’t do in Shader Model 2.
One of the examples, from our own developers, is the physics demonstration that we gave at Editor’s Day that actually provides, with Shader Model 3, a feedback path between the pixel and the vertex processing. In that particular demonstration, what the developer did was displace a geometry field to create essentially a mountainous scene, and then they compute the physics for the particle system entirely in the graphics processor. They actually compute what we would call motion vectors in the pixel shader and they feed those motion vectors back into the vertex processor and use vertex texture fetch to read the motion data to move the particle system around. So it’s a completely GPU-driven particle system, for example.
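The feedback loop Tamasi describes can be caricatured in Python. This is a loose sketch of the idea only: the "pixel pass" stands in for a shader writing motion vectors into a floating-point texture, the "vertex pass" stands in for vertex texture fetch reading them back, and all names and constants are invented.

```python
# GPU-driven particle system, modeled as two ping-ponged passes.
GRAVITY = (0.0, -9.8)
DT = 0.016  # seconds per frame

def pixel_pass(velocities):
    """'Pixel shader': integrate forces, writing updated motion
    vectors into a floating-point texture (here, a plain list)."""
    return [(vx + GRAVITY[0] * DT, vy + GRAVITY[1] * DT)
            for vx, vy in velocities]

def vertex_pass(positions, motion_texture):
    """'Vertex shader': vertex texture fetch reads the motion
    vectors back and displaces each particle."""
    return [(px + vx * DT, py + vy * DT)
            for (px, py), (vx, vy) in zip(positions, motion_texture)]

positions = [(0.0, 10.0), (1.0, 12.0)]
velocities = [(0.0, 0.0), (0.5, 0.0)]

# One simulated frame: no CPU readback between the two passes.
velocities = pixel_pass(velocities)
positions = vertex_pass(positions, velocities)
```

The point of the demo is the data flow: both passes stay on the GPU, so the particle state never round-trips through the CPU.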
There’s a lot of things like that that are possible with Shader Model 3, but frankly, I think the biggest win for Shader Model 3, and if you’ve read what developers say or talked to them you’ll hear pretty much the same thing, is that Shader Model 3 fundamentally just makes it easier on developers. As far as I can tell, that’s the biggest win for everyone, because it gives them a real programming model that they’re used to. When’s the last time you wrote a C program that didn’t have a branch in it? So they get a real programming model. They don’t have to worry about instruction set limits and what I call “coding inside out.” They can just kind of write their shaders and not have to worry about, “Gee, is this 96 instructions?” or whatnot. And frankly, the feature set is complete enough that they can just kind of code away and get the effect that they want. And it can be done more simply and easily in Shader Model 3, so from a productivity perspective, they’re going to be much happier.
That, I think, in combination with the fact that NV4x does 64-bit floating-point framebuffer blending and texture filtering, has really made it a lot easier for developers to do high-quality shading content.
TR: What about some examples of shaders where FP32 precision produces correct results and FP24 produces visible artifacts?
Tamasi: You don’t have to listen to me, you can listen to the statements by Tim Sweeney. They’ve got a number of lighting algorithms that produce artifacts with FP24. In general, what you’re going to find is that the more complex the shader gets, the more complex the lighting model gets, the more likely you are to see precision issues with FP24. Typically, if you do shaders that actually manipulate depth values, then again you might see issues with FP24.
And I think lastly, the big issue is that there is no standard for FP24, quite honestly. There is a standard for FP32. It’s been around for about 20 years. It’s IEEE 754. People, when they write a particularly sophisticated program, kind of expect it to produce precision they’re familiar with, and single-precision floating point on CPUs has been FP32 for years. I think from that perspective it’s much more consistent. They don’t have to worry about special-casing things. They don’t have to worry about, “Gee, whose FP24 is it?” since there is no standard. If someone implemented FP24 this way, it might be different on someone else’s hardware, that kind of thing. But generally, the more complex the lighting algorithm, or if the shader actually manipulates depth, the more likely you are to run into precision issues with FP24.
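The precision gap can be sketched in Python by quantizing a value to different mantissa widths. This is a rough model of our own: ATI's FP24 format also has a narrower exponent range than IEEE 754 single precision, which this sketch ignores.

```python
import math

def quantize_mantissa(x, mantissa_bits):
    """Round x to a float keeping roughly `mantissa_bits` of mantissa.
    A crude model; it ignores exponent-range differences."""
    if x == 0.0:
        return 0.0
    m, e = math.frexp(x)                  # x = m * 2**e, 0.5 <= |m| < 1
    scale = 2.0 ** (mantissa_bits + 1)
    return math.ldexp(round(m * scale) / scale, e)

def fp24(x):
    return quantize_mantissa(x, 16)       # FP24-style: ~16-bit mantissa

def fp32(x):
    return quantize_mantissa(x, 23)       # IEEE 754 single: 23-bit mantissa

# A depth-like value: FP24 discards low-order bits that FP32 retains,
# which is where depth-manipulating shaders start to show artifacts.
z = 0.123456789
print(abs(fp24(z) - z) > abs(fp32(z) - z))  # True
```

In a long shader, each intermediate result is re-rounded this way, so the error compounds with shader complexity, matching Tamasi's observation that longer lighting shaders expose FP24 first.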
Far Cry and shader models
TR: We’ve seen the Far Cry screenshots you all released with Shader Model 3.0 effects.
Tamasi: Actually, those can be Shader Model 2 or Shader Model 3. That’s right.
TR: One of the effects we’re seeing is a “pseudo displacement mapping” effect, isn’t it?
Tamasi: Yeah. “Virtual displacement mapping,” “parallax mapping,” there’s been a number of terms for that.
TR: Any idea how many instructions long the shader program is that produces this effect?
Tamasi: That effect actually is reasonably inexpensive from a number of… I think it’s less than ten for that one particular piece of that effect. It’s actually less than ten shader instructions to do that.
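Classic parallax mapping really is only a handful of operations, which a short Python sketch can show. The scale and bias constants are typical textbook values, not Far Cry's; the height sample is taken as an input here, though in a shader it would be one texture fetch.

```python
def parallax_offset(uv, height, view_xy, scale=0.04, bias=-0.02):
    """Shift a texture coordinate along the tangent-space view
    direction, proportional to the sampled height. In shader terms
    this is roughly one texture fetch plus two MAD instructions."""
    h = height * scale + bias            # remap the sampled height
    return (uv[0] + h * view_xy[0],      # slide the UV toward the eye
            uv[1] + h * view_xy[1])

# A tall bump viewed at a glancing angle shifts the lookup point:
print(parallax_offset((0.5, 0.5), height=1.0, view_xy=(0.7, 0.7)))
```

The shifted coordinate is then used for the ordinary color and normal-map fetches, which is what creates the illusion of displaced geometry without adding any vertices.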
TR: Will we see a Shader Model 2.0 path for GeForce FX with this same effect in Far Cry?
Tamasi: Yeah, the images that you’ve seen from Far Cry, the current path, those are actually Shader Model 2.0, and anything that runs Shader Model 2.0 should be able to produce those images.
TR: Looking at some of your presentations, it appears each NV40 pixel shader unit, and I guess there are two in each pixel pipeline, can work a couple of different ways: it can perform a three-component vector operation and a single-component scalar op in one clock cycle, or it can perform a pair of two-component vector operations per clock. Do you have any examples of what type of graphics operations could take advantage of this capability?
Tamasi: Well, there’s a new rage, so to speak, in terms of shading effects, what we would call post-processing effects: glows and blurs and things of that nature, or other lens effects. Most of those effects tend to be two-dimensional, because you’re typically operating on the entire image, and therefore, if it’s two-dimensional, it just has XY coordinates. So, from a coordinate system perspective, those are two-component type operations, and those are all nice wins when you can do parallelized operations.
TR: Inside of the pixel pipeline, you’ve got two of the FP32 pixel shaders in each pixel pipe. Can both of them do parallel vector operations per clock?
Tamasi: Yep. The way to think about it is that you can dual (or more) issue instructions per shader unit, and then you can co-issue between them as well. So, in fact, you can have four, or in some cases more than four, instructions being issued on a single pixel pipeline: two independent instructions in shader unit one and another two independent instructions in shader unit two. We also have mini-ALUs in each of those shader units, which can also have instructions issued to them. We gave a shader example that actually had up to seven instructions being executed in parallel in one pass.
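A loose way to picture dual-issue and co-issue is as bin-packing independent operations into a per-clock component budget. The greedy packer below is our own analogy, not NV40's actual issue rules: it treats each pipe as two four-component units and packs ops by their component width.

```python
def pack_ops(ops):
    """Greedily pack independent shader ops (given as component
    widths) into per-clock issue slots; returns clocks needed.
    Toy model: two units x four components of budget per clock."""
    clocks = 0
    pending = sorted(ops, reverse=True)
    while pending:
        budget = 2 * 4          # two shader units, four components each
        taken = []
        for op in pending:
            if op <= budget:
                budget -= op
                taken.append(op)
        for op in taken:
            pending.remove(op)
        clocks += 1
    return clocks

# A vec3+scalar pair in each unit co-issues in a single clock:
print(pack_ops([3, 1, 3, 1]))       # 1
# Five independent vec3 ops cannot all pack, so they take longer:
print(pack_ops([3, 3, 3, 3, 3]))    # 3
```

The first case is the 3+1 split TR asked about; the 2+2 split packs the same way. Real scheduling also depends on register pressure and instruction dependencies, which this sketch ignores.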
TR would like to thank Tony Tamasi for his time and patience in answering our questions.