We recently had the opportunity to speak with Neil Trevett, who fills positions as both the Khronos Group’s President and Nvidia’s VP of Embedded Content. Consumers might not hear the Khronos name too often, but the organization is responsible for setting and updating a number of key standards: among them OpenGL, OpenGL ES, and most recently, OpenCL.
It was that last standard we wanted to talk about. In December, Khronos completed the first version of OpenCL, and all major players in the graphics market—Intel, Nvidia, and AMD—ratified it.
Once all of those companies release compliant drivers, developers will be able to write apps that tap into the parallel computing resources of any compliant GPU from any vendor. That’s a pretty major departure from older GPU compute application programming interfaces (APIs) like C for CUDA and Brook+, which are each tied to a particular vendor’s hardware (Nvidia for the former, AMD for the latter).
To break the ice, we asked Mr. Trevett to update us on what’s going on with OpenCL. Is Khronos doing anything new with the API? Here’s what he told us:
As you know, OpenCL 1.0 was released back at Siggraph Asia last year, so actually it’s only been around six months since the 1.0 specification was announced. You’ve probably seen the announcements that Apple made around their WWDC event. They’re beginning to explain how the Snow Leopard OS is going to use OpenCL to unleash the power of the GPU for a wide range of applications inside Snow Leopard.
At Nvidia, we are the first GPU company to ship beta OpenCL drivers. Actually, now we’re shipping fully conformant OpenCL drivers for our range of GPUs. So Nvidia is committed to timely shipment of . . . OpenCL implementations on our GPUs.
We are working, of course, on the next OpenCL specification. Because OpenCL is so new, we are in the mode of taking input from the developer community before we make any final decisions on what’s going to be included in the next generation of OpenCL and the precise timing. We’re not going to wait too long, but we do need to let the developer community kick the tires on OpenCL 1.0 before we head off with a next generation. That’s going to happen over the next few months. Siggraph is a good opportunity to get interaction with the developer community.
Will we see many OpenCL-enabled consumer apps from major application vendors?
Yeah, absolutely. I think it’s interesting; you can split the types of apps down to their individual categories. But I think as GPU compute becomes more widely available, I think over time you’re gonna see these historical categories begin to break down. I think you’re gonna see a very innovative ebb and flow between the different application categories, and see new types of applications emerge that weren’t possible before they could tap into the parallel computing inside GPUs.
So right now, these traditional parallel computing communities are coming to OpenCL. We had the high-end [high-performance computing]—the labs and engineering departments—doing large compute projects. They’re using OpenCL all the way down to consumer applications. The most obvious parallelization opportunity is of course with images and video. So I think you’ll see a wide range of imaging applications plugging into the parallel GPU. You can see the beginnings of that with things like Photoshop that have traditionally used CUDA. You can see a wide range of imaging applications tapping into OpenCL; video even more so—different transcoding, video enhancements, quality enhancements, even image-recognition types of applications. So your videos will be auto-metadata-tagged eventually with image recognition algorithms running on the GPU.
I think [this is] the first wave of making supercomputing performance available on every desktop and laptop, and it’s gonna take more than six months for the developer community to really get a feel of what’s possible. And I think it’s going to unleash a wave of innovation that we haven’t seen before.
What about the short term? We’ve recently seen video transcoders from Elemental and Cyberlink that use GPU computing through proprietary APIs. Are those apps going to be ported to OpenCL? Will we see other players join in?
I shouldn’t put words in vendors’ mouths. There are a lot of vendors using CUDA today. Some of them might stick with CUDA; a large number of them, I think, will move to OpenCL so they can tap into GPU compute across a broad range of platforms. From Nvidia’s point of view, we’re happy for them to use CUDA or OpenCL; we’re giving the choice to the application developers. It all taps down to the CUDA architecture running on our GPUs. So, it’s just a choice of different programming techniques that we can offer to the developer community.
I think having a standard API that is portable across multiple vendors’ silicon will grow the total market for applications that use GPU compute. I think it’s a necessary evolutionary step to making parallel computation just pervasively available everywhere. Of course, it’s gonna happen first on the desktop, but you might’ve noticed that OpenCL also has an embedded profile—OpenCL “ES” if you like—in the 1.0 specification. So, over the next few years, you’re gonna see OpenCL embedded profiles used alongside OpenGL ES. So it’s not just high-end servers and high-end desktops; it’s gonna be laptops, netbooks, and mobile devices over the next few years that tap into parallel computation.
So, we’ll see OpenCL in cell phones. Would that involve, say, the graphics portion of a device’s system-on-a-chip?
Yeah. It’s not here today; we’re preparing for the future here, but I think it is inevitable. You can look at the evolution of mobile graphics silicon. It is tracking the desktop silicon, so at some point in the not-too-distant future, the GPUs will be programmable enough to support CUDA or OpenCL programmability. And that’s going to enable another wave of innovation, having the power of a supercomputer in the palm of your hand in a device that has multiple sensors, such as video and still cameras, and will be always connected. [It] is going to enable new classes of applications that you haven’t seen before.
To sum up, OpenCL use may grow slowly at first, and initial applications might not necessarily be groundbreaking. As developers get acquainted with the API and Khronos keeps improving it, though, Trevett thinks we can look forward to exciting new things (like automatic metadata-tagging of videos) and a spread into the world of handheld devices.
That’s all well and good, but OpenCL isn’t the only API in town. We just mentioned C for CUDA and Brook+, and Microsoft is also cooking up DirectX 11 Compute Shader—a vendor-independent API that also promises GPU computing for all. At Computex in June, AMD and Nvidia both demonstrated an automatic, profile-based video transcoding feature in Windows 7 that used DirectX Compute Shader. Let’s find out what Khronos thinks about all of these APIs.
OpenCL vs. other APIs, multi-core CPUs
We didn’t beat around the bush. We asked Trevett how the different APIs for graphics processor computing—C for CUDA, Brook+, DirectX Compute Shader—are going to co-exist with OpenCL. Here’s how he responded:
That’s actually interesting. The graphics APIs have been roughing it out for over a decade now. . . . It’s actually not as hard as people think to move from one API to the other, but people do care quite a lot about the APIs that they use. I think it’s actually less of a big decision for the parallel programming community, and there are already multiple languages for programming the CPUs—C, C++, C#, Java, [etc]—and that’s fine. People have the choice to pick a language that best suits their particular situation and their technical requirements.
So, I think it’s actually not a problem. I actually think it’s a positive and healthy thing that there are multiple programming languages out there for people to choose from to tap into parallel programming. For some application developers, platform portability will be the key driver; others, with more [specific requirements], might choose to go with a vendor-specific language like C for CUDA. It doesn’t matter, actually, as long as they’re enabled to tap into parallel-compute goodness. That’s sort of what really matters at the end.
But the other interesting dynamic, though, and something that might factor into the choice that these individual developers might make—you’ve probably had this conversation with our CUDA team—is that OpenCL and C for CUDA are actually at very different levels. OpenCL is the typical Khronos API. Khronos likes to build the API as close as possible to the silicon. We call it the foundation-level API that everyone is going to need. Everyone who’s building silicon needs to at some point expose their silicon capability at the lowest and most fundamental, and in some ways the most powerful, level because we’ve given the developer pretty close access to the silicon capability—just high enough abstraction to enable portability across different vendors and silicon architectures. And that’s what OpenCL does. You have an API that you have control over the way stuff runs. It gives you that level of control.
Whereas C for CUDA, it takes all of that low-level decision making and automates it. So you just write a C program, and the C for CUDA architecture will figure out how to parallelize. Now, some developers will love that, because it’s much easier, and the system is doing a lot more figuring out for you. Other developers will hate that, and they will want to get down to bits and bytes and have a more instant level of control. But again, it’s all good, and as long as the developers are educated as to what are the various approaches that the different programming languages are taking, and are enabled to pick the one that best suits their needs, I think that’s a healthy thing.
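To make the "level of control" Trevett describes a little more concrete, here is a rough sketch of the execution model OpenCL exposes. It is plain Python that emulates the model serially, not actual OpenCL code, and the names `vector_add_kernel` and `run_ndrange` are our own illustrative inventions. The idea it shows: the developer writes a kernel that runs once per work-item, then explicitly chooses how many work-items to launch and how they are grouped.

```python
def vector_add_kernel(global_id, a, b, out):
    """Body executed once per work-item, much like an OpenCL kernel
    that reads its index with get_global_id(0)."""
    out[global_id] = a[global_id] + b[global_id]


def run_ndrange(kernel, global_size, local_size, *args):
    """Serially emulate a one-dimensional launch (hypothetical helper).

    A real GPU would run the work-items in parallel; here we simply loop.
    Returns the number of work-groups that were "launched".
    """
    # OpenCL 1.0 requires the global size to be a multiple of the
    # work-group (local) size, so we enforce the same rule here.
    if local_size <= 0 or global_size % local_size != 0:
        raise ValueError("local_size must be positive and divide global_size")
    num_groups = global_size // local_size
    for group_id in range(num_groups):
        for local_id in range(local_size):
            kernel(group_id * local_size + local_id, *args)
    return num_groups


a = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
b = [10.0] * 8
out = [0.0] * 8

# The developer, not the runtime, picks 8 work-items in groups of 4.
groups = run_ndrange(vector_add_kernel, 8, 4, a, b, out)
print(groups, out)
```

Choosing the global and work-group sizes by hand is the kind of low-level decision Trevett says some developers will want to keep and others will be glad to hand off.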
But, perhaps more importantly, how does OpenCL compare with DirectX 11 Compute? Trevett addressed the subject twice, noting the following at the beginning of our interview:
It’s interesting to compare and contrast DirectX Compute Shaders with OpenCL. The approach we’ve taken with OpenCL is that you don’t have to use OpenCL with OpenGL, [though] obviously [you can] if you were using compute in a visual application. But the advantage of having [OpenCL] as a standalone compute solution is that you can get portability across a lot more different types of silicon architectures, CPUs as well as GPUs. . . . OpenCL is a very robust compute solution rather than compute within the context of the graphics pipeline, which is more the approach that DX 11 Compute Shaders have taken.
When we pressed him for details later on, he added the following:
I think DirectX 11 Compute is still under NDA, so I don’t want to go into that yet. Other than the obvious thing we mentioned before, which is that OpenCL is a standalone, complete compute solution you can use for protein folding and particle analysis never touching the pixel, and you have the option of interopping it very closely with OpenGL, so you can use it for image processing and feeding into and feeding out of the OpenCL pipeline.
Versus the approach that DirectX 11 Compute takes, which is . . . “super shaders”, which are like general-purpose C shaders. But those shaders exist within the context of the DX graphics pipeline, so it’s intended to soup up your graphics applications but you’d probably find it more difficult to write, you know, a general-purpose animation package. There’s a difference in approach.
Finally, we were curious about OpenCL and GPU computing in general versus the CPU. Let’s imagine a system with four CPU cores and a relatively slow integrated GPU: for a task like video transcoding, would it be better to use the GPU through OpenCL or the CPU? Will consumers have to face that trade-off, needing to choose between the GPU and CPU to get the best performance in certain apps, or will it be so clear-cut that they’ll want to use the GPU every time?
It depends on a number of things. The high-order bit is that it depends on the application and the amount and type of parallel processing that’s available within an application. And imaging applications and video applications and other applications where you’re just dealing with large parallel data sets—not necessarily pixels, but for consumers, images and videos are the obvious big parallel data sets that people deal with every day—there’s a degree of parallelism there that is easily distributed over the hundreds of cores that you get in a GPU.
If you have a different type of application, where the parallelism is either not present, meaning there’s simply nothing happening in parallel, or the parallelism is a lot more difficult to extract—regardless of the API or programming language you’re using, it’s just hard to parallelize—then that application will have more affinity to running on a CPU.
Now over time, the two will begin to merge. We’re getting multi-core CPUs and the GPUs are getting more and more programmable. So over time, applications in the middle will have a growing choice. They could run essentially on either. So, again, we’re in the pretty early stages of this market developing, so I think the first wave of OpenCL applications, we’re probably gonna find applications that choose one or the other, probably. You will find some applications with not too much parallelism that will want to run on four-core or eight-core CPUs. Applications like imaging and video, it’s obvious that it’s gonna get a pretty big-time speedup running on hundreds of cores on a GPU.
So, the first roll of applications will make that hard choice at programming time. But as the silicon architectures get more advanced, and the APIs evolve and get more querying capabilities, so the application can tell dynamically what’s in the machine and what the machine’s already doing. I mean, if the GPU’s hard at work playing a video game and then the user wants to kick off video transcoding, some dynamic balancing decisions will be made. And over time, the APIs will begin to enable the application in real time to figure out where they can best run on a machine. And over time you will find applications that do dynamically decide where they’re gonna run and make best use of the resources as they are available in real time on a device. Most developers and APIs aren’t quite there yet, with maybe that level of dynamic load-balancing, but I think that’s the ideal that everyone will be working towards.
Here, Trevett’s answer was especially interesting in light of Nvidia’s latest PR campaign, which has involved talking down the importance of the CPU and hailing the GPU as a sort of computing panacea. Khronos and Trevett seem to be taking a more pragmatic view, hoping OpenCL can dynamically tap into the computing resources of any capable processor. With the line between CPU and GPU likely to blur only further in the future, that approach probably makes sense. (Just in case you forgot, Intel is just months away from releasing its first x86 CPUs with built-in graphics cores, and we expect to see the chipmaker launch Larrabee, an x86-derived GPU, next year.)
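The CPU-versus-GPU trade-off Trevett outlines can be illustrated with a back-of-the-envelope model. The sketch below is ours, not his: it uses Amdahl's law, which he did not mention, as a stand-in for "how much parallelism is available," and every number in it (core counts, relative per-core speed, how busy the GPU is) is a made-up assumption. The function `pick_device` is hypothetical and does not correspond to any OpenCL call.

```python
def amdahl_speedup(parallel_fraction, cores):
    """Ideal speedup when only parallel_fraction of the work scales across cores."""
    if not 0.0 <= parallel_fraction <= 1.0 or cores < 1:
        raise ValueError("need 0 <= parallel_fraction <= 1 and cores >= 1")
    return 1.0 / ((1.0 - parallel_fraction) + parallel_fraction / cores)


def pick_device(parallel_fraction, devices):
    """Return the name of the device with the lowest estimated run time.

    devices maps a name to (cores, per_core_speed, busy_fraction), where
    per_core_speed is relative to one CPU core and busy_fraction is the
    share of the device already taken by other work (all assumed values).
    """
    def estimated_time(spec):
        cores, per_core_speed, busy_fraction = spec
        effective_speed = per_core_speed * (1.0 - busy_fraction)
        if effective_speed <= 0.0:
            return float("inf")  # device is fully occupied or unusable
        return 1.0 / (amdahl_speedup(parallel_fraction, cores) * effective_speed)

    return min(devices, key=lambda name: estimated_time(devices[name]))


# Assumed hardware: a four-core CPU and a GPU with many slower cores.
idle = {"cpu": (4, 1.0, 0.0), "gpu": (240, 0.1, 0.0)}
gaming = {"cpu": (4, 1.0, 0.0), "gpu": (240, 0.1, 0.5)}  # GPU half busy

print(pick_device(0.99, idle))    # highly parallel job, e.g. transcoding
print(pick_device(0.50, idle))    # job with little parallelism
print(pick_device(0.99, gaming))  # same parallel job while the GPU is busy
```

Under these assumed numbers, the highly parallel job favors the idle GPU, the weakly parallel job favors the CPU, and a GPU that is already half occupied loses the parallel job back to the CPU. That last case is the kind of dynamic balancing decision Trevett expects applications to make at run time once the APIs expose enough information.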
With all that said, OpenCL looks to have a bright future ahead of it. Trevett suggested that DirectX Compute Shader is more limited, especially since Microsoft has tied it to Windows, so developers could flock mostly to Khronos’ API for their GPU compute needs. That would give us a wealth of general-purpose apps that can get a boost from Intel, Nvidia, and AMD GPUs and run across different operating systems. Down the line, developers should also be able to get their GPU-compute-enabled apps running on handhelds and cell phones. Exciting stuff. Now, all we have to do is wait for developers to make some cool things with these new tools.