
A look at hardware video transcoding on the PC

Performance and image quality with black boxes and OpenCL

Over the past few years, we've witnessed the meteoric rise of hardware-accelerated video transcoding. That is, we've seen a surge in the availability of graphics cards and processors that can decode and subsequently re-encode compressed video, whether using GPU shaders or dedicated transcoding logic. These days, every new GeForce, Radeon, A-series APU, and Core processor has some sort of hardware transcoding mojo. Vendors of video conversion software are working hard to support them all.

For everyday users and even enthusiasts, making sense of that jungle of disparate offerings can be tough. Because both CPUs and GPUs come with their own dedicated transcoding logic, some systems offer multiple paths to hardware acceleration. Users may find themselves having to choose between, say, the QuickSync logic in their Intel processor, the VCE logic in their shiny new Radeon, and good old software encoding. And there are lingering questions about image quality.

Which encoder is the fastest? Which one offers the best image quality? Is all conversion software created equal?

Those questions have nagged us for a long time, and we wanted to answer them. So, we whipped together an Ivy Bridge system and outfitted it with graphics cards featuring AMD's and Nvidia's latest dedicated transcoding hardware. We then tested all of that gear in a trio of major video conversion utilities: CyberLink's MediaEspresso, ArcSoft's MediaConverter, and a special build of Handbrake with an OpenCL-accelerated x264 encoder.

As part of our testing, we compared encoding times for the various hardware acceleration options in each program. We also examined the image quality of the output files, cranking out a flurry of screenshots and hunching over our screens to discern even minute visual differences. Finally, we took into account file sizes, to see if any of the encoders took shortcuts, and power consumption, to determine which solutions were the most energy efficient. Read on to see our findings.

A brief history of hardware video transcoders
Once upon a time, video encoding was the realm of the microprocessor. Improving performance meant adding more cores, ramping up clock speeds, optimizing for extra threads, and perhaps supporting some new instruction set extensions. Encoding speeds increased slowly, in a pretty linear fashion, with the arrival of new CPUs. That was the norm for many years.

Then, in 2008, a small software firm called Elemental Technologies opened Pandora's box. Using CUDA, Nvidia's general-purpose GPU programming interface, the firm developed a program that offloaded H.264 video encoding to GeForce graphics processors. Elemental's early benchmarks showed a high-end GeForce could speed up video encoding nearly threefold compared to a dual-core Intel CPU. The approach made a ton of sense, of course. GPUs are highly parallel by definition—much more so than CPUs—and video encoding is one of those tasks that benefits greatly from parallelization. GPUs can't do everything a CPU can do (and, sure enough, some of Elemental's encoding work still had to be run in software), but GPU offloading yields some very real performance gains.

The result of Elemental's efforts was Badaboom, a $29.99 app with a big, friendly interface to help users shrink their videos to fit on iPods, iPhones, and other mobile devices. Early, pre-release versions of Badaboom were clunky and unstable, but with the release of version 1.0 in October 2008, the software became a viable option. Badaboom didn't just use graphics hardware to encode video, either. It also tapped into the GPU's H.264 and MPEG-2 video decoding logic to enable hardware accelerated transcoding.

The one that started it all: Badaboom 1.0.

Barely a month after Badaboom's public release, AMD counterattacked with a GPU-accelerated encoder of its own design. The principle was the same: tap into graphics shaders using a GPU compute API, and use the chip's parallel processing resources to offload some of the video processing pipeline. Just as Badaboom supported only GeForces, AMD's Avivo Video Converter worked only on Radeons.

It didn't take very long for major software vendors to join in. In November 2009, CyberLink announced MediaShow Espresso (later to become MediaEspresso), a program similar to Badaboom that supported hardware acceleration with both AMD and Nvidia graphics hardware. At last, convergence had arrived. The problem was, running general-purpose code on AMD and Nvidia GPUs meant using a different programming toolkit for each. That meant different code paths and more work for CyberLink and other developers who wanted to support both vendors.

At that point, the OpenCL 1.0 specification was about a year old, and the web was abuzz with promises of vendor-agnostic, write-once, run-anywhere GPU computing. The hype would turn out to be somewhat unfounded, because OpenCL still requires programmers to optimize code for different hardware architectures, but the facts did little to dash people's hopes. Everyone eagerly awaited OpenCL-accelerated video encoders that would work on any supported graphics (or even non-graphics) hardware, regardless of vendor or make.

Then, something funny happened.

Over the next few years, Intel, AMD, and Nvidia all started to embed dedicated video encoding logic inside their chips. Intel's logic bore the name QuickSync, and it debuted inside Sandy Bridge CPUs in January 2011. AMD slapped a similar encoding block, dubbed Video Codec Engine (or VCE), into the Radeon HD 7000 series graphics processors earlier this year. Nvidia was the last to jump on the bandwagon with NVENC, yet another similar bit of hardware that premiered in the company's Kepler-based GeForce 600-series GPUs this spring.

QuickSync, VCE, and NVENC don't use graphics shaders. They're self-contained black boxes that occupy a discrete area on the silicon and serve only one purpose: to encode H.264 video. The upshot, of course, is high performance even on low-end hardware, since a huge shader array isn't required. But there are downsides. Developers can't program these black boxes using general-purpose languages or open APIs, so they don't have complete control over the video encoding pipeline. Therefore, there's no guarantee that the output of each black box will be the same, even when the same parameters are provided. On top of that, software vendors face the hurdle of having to support three different (and incompatible) types of video encoding hardware, usually in addition to legacy, shader-based implementations for older GeForces and Radeons. That means more code paths, more debugging, and more potential inconsistencies in output. CyberLink's MediaEspresso and ArcSoft's MediaConverter, which we tested for this article, each support a different mix of hardware encoders.

In other words, things went from complicated to... well, even more complicated.

What happened to those OpenCL-accelerated video encoders we were all daydreaming about? Well, they're still in the works, believe it or not. The folks behind the popular x264 software encoder have been quietly plugging away at an OpenCL-accelerated version of their lookahead pipeline. Lookahead only accounts for 10-25% of the total encoding time, according to x264 lead developer Jason Garrett-Glaser, but the process allows for nearly unlimited parallelism and is relatively easy to implement in OpenCL. Rewriting all of the x264 encoder in OpenCL, by contrast, would be "very hard." Garrett-Glaser says the accelerated lookahead can increase performance by up to 40% on AMD's new Trinity APUs and by a factor of two on the latest Radeon graphics cards.
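To put the lookahead numbers in perspective, here is a rough back-of-the-envelope sketch using Amdahl's law, which caps the overall gain from accelerating a single stage by that stage's share of total run time. The function names are ours, and the idea that the GPU makes lookahead effectively free is an idealizing assumption, not something the x264 developers have claimed.

```python
import math


def overall_speedup(share, stage_speedup):
    """Amdahl's law: whole-job speedup when one stage, taking `share`
    of the original run time, is made `stage_speedup` times faster."""
    if not 0.0 <= share < 1.0:
        raise ValueError("share must be in [0, 1)")
    if stage_speedup <= 0:
        raise ValueError("stage_speedup must be positive")
    return 1.0 / ((1.0 - share) + share / stage_speedup)


def share_needed(target_speedup):
    """Smallest run-time share the accelerated stage must occupy to reach
    `target_speedup` overall, assuming that stage becomes free."""
    if target_speedup < 1.0:
        raise ValueError("target_speedup must be at least 1")
    return 1.0 - 1.0 / target_speedup


# Ceiling on the whole-encode gain for the quoted 10-25% lookahead share,
# under the idealized assumption that lookahead costs nothing on the GPU.
for share in (0.10, 0.25):
    ceiling = overall_speedup(share, math.inf)
    print(f"lookahead share {share:.0%}: at most {ceiling:.2f}x overall")

# Share of run time lookahead would have to occupy for a 40% or 2x gain.
for target in (1.4, 2.0):
    print(f"{target:.1f}x overall needs a share of at least {share_needed(target):.0%}")
```

By this arithmetic, a 10-25% share limits the whole-encode gain to roughly 11-33%. If the 40% and twofold figures refer to overall encoding throughput, lookahead would have to occupy a larger slice of the run time in those cases, perhaps because of different encoder settings. That is our inference, not something Garrett-Glaser stated.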

A publicly available, OpenCL-enabled version of x264 should be out as early as next month. Luckily, we didn't have to wait—we secured a build of Handbrake that includes a pre-release, hardware-accelerated x264 encoder, and we've posted the results alongside our data from MediaEspresso and MediaConverter.