Impressive DirectX 12 Tech Demo Shows Performance Gain; Microsoft Promises Large Boost in CPU Muscle

During a livestreamed presentation at Build, Direct3D Development Lead Max McMullen unveiled Direct3D 12, which is the graphical beating heart of the upcoming DirectX 12 API.

McMullen explained in detail the advantages of the new graphics pipeline and how it's vastly superior to what developers are dealing with in DirectX 11.

In the slide above you can see today's most common graphics pipeline in yellow, made of a vertex shader, a rasterizer, a pixel shader and a blend state. That's what an application is using to execute a draw call and generate the pixels on the screen.

On the right Hardware State 1 represents shader code, or GPU instructions executed on the shader cores. Hardware State 2 is the combination of the rasterizer state and whatever control flow is necessary for linking the vertex shader to the rasterizer to the pixel shader. Hardware State 3 links the Pixel Shader with the Blend State. It takes the pixel shader output, splits it into blend functions and finally sends them to the render target.

When the application sets a vertex shader, the driver has to program both Hardware State 1 and Hardware State 2 in order to issue the corresponding GPU commands. If the application changes the rasterizer state, the driver needs to program Hardware State 2 again. Then, when the application changes the pixel shader, it affects all three states. Finally, the blend state affects Hardware States 1 and 3. This overlapping model causes a lot of redundancy and is far from ideal.

The ideal situation, instead, is that the application sets the pipeline state and the driver already knows what the application intended, enabling it to program all the hardware states only once. There's no control flow, no extra logic recording commands.
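To make the redundancy more concrete, here is a minimal sketch in Python. It is a toy model only: the mapping simply restates the description above, the function names are made up for illustration, and none of it is the actual Direct3D API or real driver code.

```python
# Toy model: which hardware states each API-level state touches, as described
# in the text. This is an illustration, not data taken from a real driver.
AFFECTED_HW_STATES = {
    "vertex_shader": {1, 2},
    "rasterizer_state": {2},
    "pixel_shader": {1, 2, 3},
    "blend_state": {1, 3},
}


def writes_with_separate_states(changes):
    """Count hardware-state programs when every API change is applied on its own."""
    return sum(len(AFFECTED_HW_STATES[change]) for change in changes)


def writes_with_pipeline_state(changes):
    """Count hardware-state programs when all changes arrive as one pipeline state."""
    touched = set()
    for change in changes:
        touched |= AFFECTED_HW_STATES[change]
    return len(touched)


changes = ["vertex_shader", "rasterizer_state", "pixel_shader", "blend_state"]
print(writes_with_separate_states(changes))  # 8
print(writes_with_pipeline_state(changes))   # 3
```

In this model, setting the four states one by one programs hardware state eight times, while handing them over together programs each of the three hardware states just once.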

The D3D team found out that for a typical modern game there are 200 to 400 complete pipeline states per frame. In a minute there are 400 to 1,000 unique pipeline states, so they decided to group them all together in Direct3D 12.

Instead of doing all the deferred tracking and finally resolving the state at draw time, when you set a pipeline state with D3D 12 the driver immediately knows which set of hardware states to program.

Another problem is resource binding. It existed even before D3D was implemented, and a lot of the DirectX systems have been built upon the assumption of its existence. The drivers use it to work behind the scenes to help an app emit rendering commands efficiently.

Binding is used to solve "resource hazards" like switching a resource between a render target and a texture. On top of that there's a lot of resource lifetime tracking going on behind the scenes that takes up CPU resources. Resource residency management also causes an application to use a lot of GPU memory.

Bind points allow the drivers and the OS to page things in and out of video memory as commands flow to the GPU, but this comes at the cost of a lot of resources. In addition to that, the runtime mirrors all the states for future reference and to communicate them to the middleware used.

Resource hazards basically represent any state in which extra work needs to be done in the GPU to ensure data is coherent. The most common example is the switch between render target and textures, but other cases include tiled resources or the use of compute. Direct3D 12 gets rid of the fine-grained work that does that kind of tracking all the time and replaces it with explicit application control, so the resource cost is paid only when the application actually wants to make a transition between a render target and a texture, or a similar case.

The mechanism is straightforward: the application declares a resource, a source usage for that resource and a target usage, and then a call tells the runtime and the driver about the transition. The application can even transition multiple resources in a single API call, allowing the driver to emit the optimal sequence of commands to make sure the data is coherent for both resources at once.

So basically instead of doing all the tracking all the time, D3D 12 does it only once per frame, or at whatever frequency the application does transitions.
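The idea of explicit transitions can be sketched in a few lines of Python. This is a hypothetical model, not the Direct3D 12 API: the `Resource` and `CommandRecorder` classes and the usage strings are invented for the example.

```python
from dataclasses import dataclass


@dataclass
class Resource:
    name: str
    usage: str  # hypothetical usage labels, e.g. "render_target" or "texture"


class CommandRecorder:
    """Toy recorder: the application states each transition explicitly."""

    def __init__(self):
        self.commands = []

    def transition(self, *requests):
        """Each request is a (resource, source_usage, target_usage) tuple."""
        # Check every request first so a bad one leaves nothing half-applied.
        for resource, source, _target in requests:
            if resource.usage != source:
                raise ValueError(
                    f"{resource.name} is in {resource.usage!r}, not {source!r}"
                )
        for resource, _source, target in requests:
            resource.usage = target
        # A single command covers the whole batch of transitions.
        batch = tuple((r.name, s, t) for r, s, t in requests)
        self.commands.append(("transition", batch))

    def draw(self, label):
        self.commands.append(("draw", label))


shadow_map = Resource("shadow_map", "render_target")
gbuffer = Resource("gbuffer", "render_target")

recorder = CommandRecorder()
recorder.draw("shadow pass")
recorder.transition(
    (shadow_map, "render_target", "texture"),
    (gbuffer, "render_target", "texture"),
)
recorder.draw("lighting pass")
```

Nothing is tracked between the two draws unless the application asks for it, and both resources change usage in one call, which is the batching described above.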

The same approach gives the application explicit control over resource lifetime and residency, letting it easily track once per frame which resources are no longer needed and either remove them or set them to be reused.

Those changes enable D3D 12 to get rid of implicit tracking of hazards, reference counting for all the resource types in the bind slots and all the conditional flow control to make that work behind the scenes. Instead the resources are used only when the rendering algorithm of the application calls for it. That also removes the need for state mirroring altogether, as the application can communicate the current state to the middleware directly.

The D3D team did some analysis and found out that typical applications generate the exact same sequence of commands and bindings from one frame to the next in most cases. If an application binds 16 resources to draw an object in a frame, in the next frame it's gonna use the CPU to regenerate the same exact set of 16 bindings for that same object.

Caching that set of bindings and just reusing them is much more efficient. In addition to that, in the current model, whenever there's a partial change the entire set of bindings has to be copied over to a different location and fully reset.

D3D 12 enables a solution for that called "Descriptor Heaps and Tables".

Not only does it allow caching and reusing static bindings that previously had to be generated by the CPU with every frame, but it can also stream changes to bindings dynamically without paying the CPU cost of copying them over and resetting them completely. It can even do both things at the same time.

A descriptor is simply a small chunk of data that describes a resource, including type, format data, mip count and a reference to the pixel data. DirectX 12 gives the application explicit control of when descriptors are created, copied and reused thanks to descriptor heaps, which are basically just giant arrays of descriptors.

The descriptor tables aren't API objects, but just a lightweight piece of state stored in the GPU shader cores themselves indicating which descriptors to use for any given draw call.

While low power GPUs can have only one descriptor table per shader stage, high powered GPUs can have multiple. So they can have one describe the high frequency elements that are changed with every draw, another for the static elements across the whole scene and others in between.

That way the application doesn't need to copy the entire set of descriptors from one draw to the next, but can just copy those that are actually changed. Below you can see the pipeline for Direct3D 12 with the descriptor heaps and tables on the right.
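As a rough illustration of how heaps and tables fit together, here is a small Python sketch. It is a toy model under stated assumptions: the class names, fields and file names are hypothetical, it ignores GPU timing entirely, and it is not how the real API is called.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Descriptor:
    resource_type: str
    fmt: str
    mip_count: int
    data_ref: str  # stands in for the reference to the pixel data


@dataclass(frozen=True)
class DescriptorTable:
    """Not an object with storage of its own: just a window into the heap."""
    start: int
    count: int


class DescriptorHeap:
    """Toy heap: one big array of descriptors, plus a counter of CPU writes."""

    def __init__(self, capacity):
        self.slots = [None] * capacity
        self.writes = 0

    def write(self, index, descriptor):
        self.slots[index] = descriptor
        self.writes += 1

    def resolve(self, table):
        return self.slots[table.start:table.start + table.count]


heap = DescriptorHeap(capacity=8)
scene_table = DescriptorTable(start=0, count=2)     # static for the whole scene
per_draw_table = DescriptorTable(start=2, count=1)  # changes with every draw

# Static descriptors are written once.
heap.write(0, Descriptor("texture2d", "rgba8", 10, "sky.dds"))
heap.write(1, Descriptor("texture2d", "rgba8", 1, "shadow_map"))

# Each draw rewrites only the descriptor that actually changes.
for material in ["brick.dds", "wood.dds", "metal.dds"]:
    heap.write(2, Descriptor("texture2d", "bc1", 8, material))
    bound = heap.resolve(scene_table) + heap.resolve(per_draw_table)

print(heap.writes)  # 5
```

Three draws cost five descriptor writes here, against nine if all three descriptors were regenerated for every draw.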

As mentioned above, the D3D team found out that typical applications and games normally send a nearly identical set of commands to the GPU from one frame to the next. The similarity is between 90 and 95%. Only 5 to 10% of those commands are deleted or added between frames.

The changes made to the pipeline state object (described above) and the addition of descriptor heaps and tables dramatically simplify the recording and playback of commands. Microsoft believes that with them it can build a way to reuse commands in D3D 12 that is both reliable and performant, contributing a lot to the efficiency of CPU usage. That feature is called "Bundles".

Bundles can be recorded once and reused multiple times in different frames or within the same frame, and can be assigned freely to any thread of the CPU. Furthermore, by setting a different set of bindings for the same bundle, it can be executed multiple times and actually look different on the screen.
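The record-once, replay-many idea can be sketched like this in Python. It is only a conceptual model: `Bundle`, `execute_bundle` and the command and texture names are hypothetical and do not correspond to real Direct3D 12 calls.

```python
class Bundle:
    """Toy bundle: a command sequence that is recorded once."""

    def __init__(self):
        self.commands = []
        self.record_calls = 0

    def record(self, command):
        self.commands.append(command)
        self.record_calls += 1


def execute_bundle(bundle, bindings, gpu_log):
    """Replay the recorded commands with whatever bindings are current."""
    for command in bundle.commands:
        gpu_log.append((command, bindings["texture"]))


# Record the commands for drawing a tree once.
tree = Bundle()
for command in ["set_pipeline_state", "set_vertex_buffer", "draw_indexed"]:
    tree.record(command)

# Replay the same bundle four times, each with a different binding.
gpu_log = []
textures = ["oak_bark", "pine_bark", "birch_bark", "palm_bark"]
for texture in textures:
    execute_bundle(tree, {"texture": texture}, gpu_log)

print(tree.record_calls + len(textures))  # 7
print(len(gpu_log))                       # 12
```

In this model the application makes seven calls (three to record, four to execute) while the GPU log still contains all twelve commands, and each replay looks different because the bound texture differs.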

The slide below shows the code to draw the same scene in DirectX 11 without bundles and in DirectX 12 with bundles. With the second solution the total number of CPU calls to the API is dramatically reduced, but the GPU is executing exactly the same sequence of commands, and that's much more efficient.

Below you can see the same scene rendered with and without the use of Bundles. The framerate is pretty much the same, but the CPU time spent to render that frame is much lower, dropping from 0.75 milliseconds to less than 0.1 milliseconds.

In addition to making the whole process more efficient, DirectX 12 can also make it more parallel, allowing all the features above and more to be executed on multiple threads (on different cores of the CPU) at the same time, using more of the CPU to feed commands to the GPU faster, also raising the efficiency of the GPU in turn. Developers can also decide what processes execute in parallel depending on what makes the most sense for their content.

The main weakness of DirectX 11 and previous APIs of the series was that most of that workload was done on the main application thread, while other threads were mainly inactive, and DirectX 12 solves that issue.
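The general shape of that parallel model can be sketched in Python: several threads each record their own list of commands, and the lists are then submitted in a fixed order. This is purely illustrative. The function names are made up, and Python threads will not show a real speedup for this kind of work, so the sketch only demonstrates the structure, not the performance.

```python
from concurrent.futures import ThreadPoolExecutor
import threading


def record_command_list(objects):
    """Record draw commands for one slice of the scene on the current thread."""
    thread_name = threading.current_thread().name
    return [("draw", obj, thread_name) for obj in objects]


def build_frame(scene, workers):
    """Split the scene across workers, record in parallel, submit in order."""
    slices = [scene[i::workers] for i in range(workers)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in slice order, whichever thread finishes first.
        command_lists = list(pool.map(record_command_list, slices))
    return [cmd for command_list in command_lists for cmd in command_list]


scene = [f"object_{i}" for i in range(8)]
frame = build_frame(scene, workers=4)
for command in frame:
    print(command)
```

Recording happens on whichever worker thread picks up a slice, but the order in which the finished lists reach the GPU stays under the application's control.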

At the bottom of the post you can see a 3DMark demo running in DirectX 11 first, and then in DirectX 12. At the top of the screen you can see the CPU usage of each thread, and the total CPU time per frame in milliseconds.

During the DirectX 11 demo the graph was also switched to show only the graphical workload, subtracting the application workload. This demonstrated that the application itself was being parallelized, but all the graphics workload feeding the GPU was happening exclusively on the first thread. That's obviously far from ideal.

When the demo switches to DirectX 12, using bundles, not only is it much faster, but everything is correctly parallelized and the load is evenly split across cores. The demo also shows the same process without bundles for the sake of comparison, and the difference is very visible, with bundles providing about twice as much CPU performance.

_______

Writer's take: It's relevant to mention that after the presentation some alleged that this would lead to multiplying the performance of the Xbox One's GPU or of the Xbox One in general by two. That's a very premature (and not very likely) theory.

What is seeing a 2x performance boost is the efficiency of the CPU, but it won't necessarily scale 1:1 with the GPU. First of all, no matter how fast you can feed commands to the GPU, you can't push it past its hardware limits. Secondly, the demo showcased below (the results of which were used to make those allegations) was created with a graphics benchmark and not with a game, meaning that it doesn't involve AI, advanced physics and all those computations that are normally loaded on the CPU and that already use the secondary threads.
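A back-of-the-envelope model shows why. The sketch below assumes, for simplicity, that CPU and GPU work overlap so the slower of the two sets the frame time, and that only the graphics portion of the CPU work gets twice as fast. All the millisecond figures are made-up examples, not measurements from the demo.

```python
def frame_time_ms(app_cpu_ms, graphics_cpu_ms, gpu_ms):
    """Simplified assumption: CPU and GPU overlap, the slower one sets the pace."""
    return max(app_cpu_ms + graphics_cpu_ms, gpu_ms)


def frame_speedup(app_cpu_ms, graphics_cpu_ms, gpu_ms, graphics_gain=2.0):
    """Frame-time speedup when only the graphics CPU work gets faster."""
    before = frame_time_ms(app_cpu_ms, graphics_cpu_ms, gpu_ms)
    after = frame_time_ms(app_cpu_ms, graphics_cpu_ms / graphics_gain, gpu_ms)
    return before / after


# Hypothetical numbers, in milliseconds per frame.
benchmark_like = frame_speedup(app_cpu_ms=0.0, graphics_cpu_ms=30.0, gpu_ms=10.0)
game_like = frame_speedup(app_cpu_ms=12.0, graphics_cpu_ms=8.0, gpu_ms=10.0)
gpu_bound = frame_speedup(app_cpu_ms=5.0, graphics_cpu_ms=5.0, gpu_ms=16.0)

print(benchmark_like)  # 2.0
print(game_like)       # 1.25
print(gpu_bound)       # 1.0
```

Under these assumptions a pure graphics benchmark can see the full 2x, a game with its own CPU load sees less, and a frame that is already limited by the GPU sees nothing at all.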

Thirdly, the demo ran on PC, and part of the advantage of D3D and DirectX 12 is to bring PC coding closer to console-like efficiency, which is already partly present on the Xbox One.

This isn't to say that there won't be an improvement, or that it won't be sizable. It will most probably be very relevant, but it's way too early to pull out numbers and percentages, as the variables in play are much more complex for full games and for the console than in a retooled 3DMark demo running on PC.