Over the past few years, we’ve witnessed the meteoric rise of hardware-accelerated video transcoding. That is, we’ve seen a surge in the availability of graphics cards and processors that can decode and subsequently re-encode compressed video, whether using GPU shaders or dedicated transcoding logic. These days, every new GeForce, Radeon, A-series APU, and Core processor has some sort of hardware transcoding mojo. Vendors of video conversion software are working hard to support them all.
For everyday users and even enthusiasts, making sense of that jungle of disparate offerings can be tough. Because both CPUs and GPUs come with their own dedicated transcoding logic, some systems offer multiple paths to hardware acceleration. Users may find themselves having to choose between, say, the QuickSync logic in their Intel processor, the VCE logic in their shiny new Radeon, and good old software encoding. And there are lingering questions about image quality.
Which encoder is the fastest? Which one offers the best image quality? Is all conversion software created equal?
Those questions have nagged us for a long time, and we wanted to answer them. So, we whipped together an Ivy Bridge system and outfitted it with graphics cards featuring AMD’s and Nvidia’s latest dedicated transcoding hardware. We then tested all of that gear in a trio of major video conversion utilities: CyberLink’s MediaEspresso, ArcSoft’s MediaConverter, and a special build of Handbrake with an OpenCL-accelerated x264 encoder.
As part of our testing, we compared encoding times for the various hardware acceleration options in each program. We also looked at the image quality of the output files, cranking out a flurry of screenshots and hunching over our screens to discern even minute visual differences. Finally, we took file sizes into account, to see if any of the encoders took shortcuts, and measured power consumption, to determine which solutions were the most energy efficient. Read on to see our findings.
A brief history of hardware video transcoders
Once upon a time, video encoding was the realm of the microprocessor. Improving performance meant adding more cores, ramping up clock speeds, optimizing for extra threads, and perhaps supporting some new instruction set extensions. Encoding speeds increased slowly, in a pretty linear fashion, with the arrival of new CPUs. That was the norm for many years.
Then, in 2008, a small software firm called Elemental Technologies opened Pandora’s box. Using CUDA, Nvidia’s general-purpose GPU programming interface, the firm developed a program that offloaded H.264 video encoding to GeForce graphics processors. Elemental’s early benchmarks showed a high-end GeForce could speed up video encoding nearly threefold compared to a dual-core Intel CPU. The approach made a ton of sense, of course. GPUs are highly parallel by definition—much more so than CPUs—and video encoding is one of those tasks that benefits greatly from parallelization. GPUs can’t do everything a CPU can do (and, sure enough, some of Elemental’s encoding work still had to be run in software), but GPU offloading yields some very real performance gains.
The result of Elemental’s efforts was Badaboom, a $29.99 app with a big, friendly interface to help users shrink their videos to fit on iPods, iPhones, and other mobile devices. Early, pre-release versions of Badaboom were clunky and unstable, but with the release of version 1.0 in October 2008, the software became a viable option. Badaboom didn’t just use graphics hardware to encode video, either. It also tapped into the GPU’s H.264 and MPEG-2 video decoding logic to enable hardware accelerated transcoding.
Barely a month after Badaboom’s public release, AMD counterattacked with a GPU-accelerated encoder of its own design. The principle was the same: tap into graphics shaders using a GPU compute API, and use the chip’s parallel processing resources to offload some of the video processing pipeline. Just as Badaboom supported only GeForces, AMD’s Avivo Video Converter worked only on Radeons.
It didn’t take very long for major software vendors to join in. In November 2009, CyberLink announced MediaShow Espresso (later to become MediaEspresso), a program similar to Badaboom that supported hardware acceleration with both AMD and Nvidia graphics hardware. At last, convergence had arrived. The problem was, running general-purpose code on AMD and Nvidia GPUs meant using a different programming toolkit for each. That meant different code paths and more work for CyberLink and other developers who wanted to support both vendors.
At that point, the OpenCL 1.0 specification was about a year old, and the web was abuzz with promises of vendor-agnostic, write-once, run-anywhere GPU computing. The hype would turn out to be somewhat unfounded, because OpenCL still requires programmers to optimize code for different hardware architectures, but the facts did little to dash people’s hopes. Everyone eagerly awaited OpenCL-accelerated video encoders that would work on any supported graphics (or even non-graphics) hardware, regardless of vendor or make.
Then, something funny happened.
Over the next few years, Intel, AMD, and Nvidia all started to embed dedicated video encoding logic inside their chips. Intel’s logic bore the name QuickSync, and it debuted inside Sandy Bridge CPUs in January 2011. AMD slapped a similar encoding block, dubbed Video Codec Engine (or VCE), into the Radeon HD 7000 series graphics processors earlier this year. Nvidia was the last to jump on the bandwagon with NVENC, yet another similar bit of hardware that premiered in the company’s Kepler-based GeForce 600-series GPUs this spring.
QuickSync, VCE, and NVENC don’t use graphics shaders. They’re self-contained black boxes that occupy a discrete area on the silicon and serve only one purpose: to encode H.264 video. The upshot, of course, is high performance even on low-end hardware, since a huge shader array isn’t required. But there are downsides. Developers can’t program these black boxes using general-purpose languages or open APIs, so they don’t have complete control over the video encoding pipeline. Therefore, there’s no guarantee that the output of each black box will be the same, even when the same parameters are provided. On top of that, software vendors face the hurdle of having to support three different (and incompatible) types of video encoding hardware, usually in addition to legacy, shader-based implementations for older GeForces and Radeons. That means more code paths, more debugging, and more potential inconsistencies in output. CyberLink’s MediaEspresso and ArcSoft’s MediaConverter, which we tested for this article, each support a different mix of hardware encoders.
In other words, things went from complicated to… well, even more complicated.
What happened to those OpenCL-accelerated video encoders we were all daydreaming about? Well, they’re still in the works, believe it or not. The folks behind the popular x264 software encoder have been quietly plugging away at an OpenCL-accelerated version of their lookahead pipeline. Lookahead only accounts for 10-25% of the total encoding time, according to x264 lead developer Jason Garrett-Glaser, but the process allows for nearly unlimited parallelism and is relatively easy to implement in OpenCL. Re-writing all of the x264 encoder in OpenCL, by contrast, would be “very hard.” Garrett-Glaser says the accelerated lookahead can increase performance by up to 40% on AMD’s new Trinity APUs and by a factor of two on the latest Radeon graphics cards.
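Those percentages are easier to interpret through Amdahl's law: when only part of a pipeline is accelerated, the serial remainder caps the overall gain. A quick sketch in Python (illustrative numbers, not x264 measurements):

```python
def amdahl_speedup(accelerated_fraction, local_speedup):
    """Overall speedup when only a fraction of the work is accelerated."""
    serial = 1.0 - accelerated_fraction
    return 1.0 / (serial + accelerated_fraction / local_speedup)

# If lookahead is 25% of the job and the GPU makes it effectively free,
# the whole encode can get at most about 1.33x faster.
print(round(amdahl_speedup(0.25, float("inf")), 2))  # → 1.33
```

By that math, offloading a 25% lookahead entirely buys at most about a 33% speedup, so gains beyond that bound, like the 2x figure quoted for the latest Radeons, presumably come from the offload doing more than just hiding lookahead time, such as freeing CPU threads to work ahead on other frames.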
A publicly available, OpenCL-enabled version of x264 should be out as early as next month. Luckily, we didn’t have to wait—we secured a build of Handbrake that includes a pre-release, hardware-accelerated x264 encoder, and we’ve posted the results alongside our data from MediaEspresso and MediaConverter.
Comparing the incomparable
Comparing performance and image quality across three video conversion applications using different hardware presents some inherent challenges. For starters, equalizing encoding settings is difficult, especially with user-friendly apps like MediaEspresso that obscure many of the more advanced encoding parameters. Then there’s the issue of the hardware itself. If hardware transcoders don’t produce identical output, then is their performance really directly comparable?
In the end, we decided to keep things simple. We grabbed a 1080p version of the Spiderman trailer, which weighed in at 177MB, and we picked a basic set of common settings for all the encoders to use. We chose to downsample the video to 720p, at a bitrate of 4000Kbps, with a constant frame rate of 24 FPS. The audio was converted to 128Kbps, 44.1kHz, stereo AAC. The idea was to replicate a common usage scenario: shrinking a high-def video to fit on a mobile device, like a modern smartphone or tablet. Those devices may have enough power to decode 1080p video, but their storage space is limited, and they usually lack the display resolution to render a full 1080p image. (Apple’s new iPad and Asus’ Transformer Pad Infinity are notable exceptions.) Our goal was to find out which application gave us the highest-quality output in the least amount of time.
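For reference, those targets pin down the output size almost exactly, since a constant-bitrate file is just bitrate times duration. Here's a back-of-the-envelope check in Python, using the trailer's two-minute, 34-second (154-second) running time, treating 1Kbps as 1,000 bits per second, and ignoring container overhead:

```python
def expected_size_mb(video_kbps, audio_kbps, seconds):
    """Approximate output file size from target bitrates and duration."""
    total_bits = (video_kbps + audio_kbps) * 1000 * seconds
    return total_bits / 8 / 1e6  # bits -> bytes -> megabytes

# 154-second trailer at 4000Kbps video plus 128Kbps audio
print(round(expected_size_mb(4000, 128, 154), 1))  # → 79.5
```

So an encoder that hits the targets squarely should produce a file of roughly 79.5MB, a healthy reduction from the 177MB source.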
We did come across one little kink, which was that MediaEspresso seems only to allow 720p output with letterboxing. In other words, it renders the 2.35:1 frame inside a taller, 16:9 frame with black bars. MediaConverter supports both native and letterboxed modes, while Handbrake has no letterboxing option that we can see. Since our build of Handbrake differs from the other encoders in that it uses OpenCL instead of hardware black boxes, we enabled letterboxing on MediaEspresso and MediaConverter and left Handbrake in native mode.
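For the curious, the letterboxing geometry works out as follows: a 2.35:1 picture scaled to fill a 1,280-pixel-wide frame stands about 544 pixels tall (encoders generally want even, if not mod-16, dimensions), leaving 88-pixel black bars above and below. A sketch of that arithmetic, using our own rounding rather than any one app's exact behavior:

```python
def letterbox(frame_w, frame_h, content_aspect):
    """Height of the active picture and of each black bar when a wide
    source is letterboxed inside a 16:9 frame."""
    content_h = round(frame_w / content_aspect)
    content_h -= content_h % 2          # encoders want even dimensions
    bar = (frame_h - content_h) // 2
    return content_h, bar

# 2.35:1 trailer inside a 1280x720 frame
print(letterbox(1280, 720, 2.35))  # → (544, 88)
```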
On the hardware side of things, we had our bases covered. The Core i7-3770K processor provided not just the latest iteration of QuickSync, but also Intel’s new HD Graphics 4000 IGP, whose shaders can be programmed using OpenCL. The GeForce GT 640 and Radeon HD 7750 gave us NVENC and VCE hardware blocks, respectively, in addition to support for shader-based encoding using OpenCL or other APIs. We were particularly interested to see how much of a speedup these $100 GPUs could provide over a fast CPU running on its own.
We should mention one last caveat before we go on, which is that the GeForce GT 640 has substantially less memory bandwidth than the Radeon HD 7750. We outlined the differences in our review. To make a long story short, the GeForce’s disadvantage is due to its use of slow DDR3 memory, and that slow RAM may have affected performance in our testing. As far as we saw, however, the GeForce wasn’t at a substantial disadvantage in any of our tests; it actually outperformed the Radeon by a good margin in one of them. We’d have loved to test another GeForce, but this is the only retail card with the NVENC encoding block south of $399 right now.
Our testing methods
As ever, we did our best to deliver clean benchmark numbers. Tests were run at least three times, and we reported the median results. Our test system was configured like so:
|Processor||Intel Core i7-3770K|
|Motherboard||Asus P8Z77-V LE Plus|
|North bridge||Intel Z77 Express|
|Memory size||4GB (2 DIMMs)|
|Memory type||Kingston HyperX KHX2133C9AD3X2K2/4GX DDR3 SDRAM at 1333MHz|
|Memory timings||9-9-9-24 1T|
|Chipset drivers||INF update 126.96.36.1999, Rapid Storage Technology 188.8.131.522|
|Audio||Integrated Realtek audio with 184.108.40.20602 drivers|
|Graphics||Intel HD Graphics 4000 (integrated) with 220.127.116.1161 drivers|
|||AMD Radeon HD 7750 with Catalyst 12.7 beta drivers|
|||Zotac GeForce GT 640 with GeForce 304.79 beta drivers|
|Hard drive||Samsung 830 Series 128GB|
|Power supply||Corsair HX750W 750W|
|OS||Windows 7 Ultimate x64 Edition, Service Pack 1|
Thanks to AMD, Asus, Corsair, Kingston, Intel, and Zotac for helping to outfit our test rigs with some of the finest hardware available.
Unless otherwise specified, image quality settings for the graphics cards were left at the control panel defaults. Vertical refresh sync (vsync) was disabled for all tests.
We used the following test applications:
- CyberLink MediaEspresso 6.5 (beta build 6.5.2811.44122)
- ArcSoft MediaConverter 7.5 (build 18.104.22.168)
- Handbrake, in two flavors: a pre-release build with the OpenCL-accelerated x264 encoder, and the 0.9.8 public release
We measured total system power consumption at the wall socket using a P3 Kill A Watt digital power meter. The monitor was plugged into a separate outlet, so its power draw was not part of our measurement. The cards were plugged into a motherboard on an open test bench.
The tests and methods we employ are generally publicly available and reproducible. If you have questions about our methods, hit our forums to talk with us about them.
CyberLink MediaEspresso 6.5
MediaEspresso is the more expensive of the two commercial video conversion applications we tested. The full version will set you back $39.99 from CyberLink’s website. It has a big, friendly interface with built-in profiles for smartphones, handheld media players, game consoles, and social networking sites. Using MediaEspresso is simply a matter of dragging and dropping a video into the main window, choosing a profile from the toolbar, and clicking OK. It’s also possible to set your own profiles, which we did, since we wanted to keep things consistent across our different encoders.
MediaEspresso 6.5 supports QuickSync and shader-based encoding on Nvidia and AMD GPUs, as well as the VCE block on 7000-series Radeons and Trinity APUs. The particular beta build we tested, numbered 6.5.2811.44122, also supports the NVENC block on Kepler-based Nvidia cards like the GeForce GT 640.
We measured performance using our stopwatch, timing how long it took for the encoding progress bar to disappear once we’d clicked the “OK” button. MediaEspresso reports its own encoding time, as well, so we also jotted down that figure. Along with performance, we recorded power consumption at the wall, file sizes, and actual bit rates as reported by Windows. Each configuration produced slightly different output, so those last three data points are important.
Note that we enabled both hardware decoding and encoding, and when the option was available, we selected “better quality” instead of “faster conversion.” Our preliminary testing showed that “faster conversion” increased the delta between our selected bitrate (4000Kbps) and the bitrate of the output file. We wanted to keep the output as consistent as possible across the board, so “better quality” won out.
Here are our results, with encoding times reported in seconds and the fastest configuration highlighted. Keep in mind that, since the output differs between the various solutions, encoding speed isn’t everything. Oh, and lower encoding times are better, obviously.
|Idle wattage||37 W||37 W||43 W||46 W|
|Peak wattage||86 W||78 W||86 W||95 W|
QuickSync wins this race, pulling off the lowest encoding time by a few seconds and the lowest power consumption. Strangely, though, the output file of our QuickSync config also had the lowest actual bitrate of the bunch: only 3828Kbps, a fair bit below our 4000Kbps target. NVENC, which had the second-lowest encoding time, also missed the mark, but by a smaller margin. In both cases, the resulting file sizes were lower than with our other two configs. (The unassisted CPU and the VCE-enabled Radeon both stuck more closely to our prescribed bitrate setting, so it’s no wonder that they produced bigger files.) The file size differences were relatively minor, though, and we’re still looking at a nice slimming down from the 177MB source file.
Let’s now take a quick look at image quality. We isolated two frames in our video: one showing fast camera panning and motion blur in a scene with high contrast, and another showing a relatively still scene with a high amount of detail. We took one screenshot of each frame from each output file, as well as from the source file. Since the screenshots from the source video were larger, we resized them to match the others using Photoshop with the default bicubic interpolation setting. These are two frames in a video that’s two minutes and 34 seconds long, so they quite literally don’t show the whole picture. They also don’t account for how things appear in motion. That said, the shots do give us some very useful clues about differences between the various implementations.
Click the buttons under each screenshot to toggle between the different solutions. You might have to wait a second or two for a new image to load after each click.
Predictably, the QuickSync and NVENC output files look the worst in our action scene. Blocky compression artifacts obscure detail, make smooth lines appear lumpy, and create a sort of shimmering around moving objects. VCE doesn’t fare much better; while it produces fewer artifacts, it jacks up the gamma and makes the picture look washed-out. The software encoder does the best job here by far. That said, color saturation is off across the board. Our source video has brighter reds and more vivid yellows than all of the output files.
The still scene shows more subtle differences, and it highlights another issue with the QuickSync output. Look at the frame of the actor’s glasses against the wall on the right. The frame should appear as a smooth line, but it’s oddly jagged in the QuickSync screenshot. We noticed similar pixelation in other scenes and on text throughout the trailer. In motion, the jaggies appeared to dance around objects. Clearly, there’s something wrong here. Perhaps the scaling from 1080p to 720p isn’t being done using the right interpolation method.
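Jaggies of this sort are the classic signature of point sampling, a.k.a. nearest-neighbor scaling, which preserves full-contrast edges but lets thin lines hit or miss sample positions from frame to frame, hence the "dancing." A toy illustration in Python on a one-dimensional row of pixels (our own sketch, not CyberLink's actual scaler):

```python
def nearest_downsample(row, factor):
    """Point sampling: just take every Nth pixel."""
    return row[::factor]

def box_downsample(row, factor):
    """Filtered scaling: average each group of N pixels."""
    return [sum(row[i:i + factor]) // factor
            for i in range(0, len(row), factor)]

# Two one-pixel-wide bright lines crossing a dark row
row = [0, 0, 255, 0, 0, 0, 255, 0]
print(nearest_downsample(row, 2))  # → [0, 255, 0, 255]
print(box_downsample(row, 2))      # → [0, 127, 0, 127]
```

Point sampling keeps whichever pixel happens to land on the sample grid at full intensity, so as a thin line moves, it pops between full-brightness and invisible; the filtered version spreads the line's energy, trading some sharpness for stability.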
Meanwhile, NVENC continues to display more artifacting—you can see a big green smudge above the intersection of the characters’ shoulders—and the VCE output remains washed out. The software encoder once again produces the best output of the bunch.
ArcSoft MediaConverter 7.5
MediaConverter is a little cheaper than MediaEspresso right now—ArcSoft has it on sale for $29.99—but it serves essentially the same purpose: hassle-free, consumer-friendly video conversion. Both programs let you drag and drop source files into the main window, and both programs have an array of pre-cooked presets for everything from iPads to YouTube. ArcSoft includes some basic video editing features, as well, which gives it a slight leg up over CyberLink’s solution.
The hardware acceleration methods supported by MediaConverter include Intel’s QuickSync and proprietary shader-based solutions from AMD and Nvidia (APP and CUDA, respectively). The build we used, version 18.104.22.168, can also tap into AMD’s VCE encoder block. NVENC support isn’t on the menu, though, so the program used our GeForce GT 640’s shaders to accelerate encoding.
Our testing was conducted in much the same way as with MediaEspresso. We tried to stick as close as possible to our chosen compression targets. Hardware acceleration was enabled through the drop-down menu at the bottom right of the main program window.
We ran into an odd problem when testing VCE transcoding with the Radeon HD 7750. Our first run stalled with the progress bar at 0%, and when we tried a second run, the system crashed. Once we rebooted, however, everything was peachy. We were able to reproduce this strange behavior after re-installing our Windows image, so perhaps it’s simply a bug in the MediaConverter build we used.
|Idle wattage||37 W||37 W||43 W||46 W|
|Peak wattage||83 W||81 W||89 W||91 W|
Again, QuickSync comes out on top, managing the lowest encoding time and the lowest power consumption. And again, it gives us an output file with a lower bitrate than we asked for. We’re seeing a greater spread between reported bitrates across the board, though. The software and CUDA-powered encoders are overachieving somewhat, and the hardware encoding blocks from AMD and Intel both come in under 4000Kbps.
Note that the VCE and CUDA implementations are both slower than plain software encoding. Of course, it's worth stressing that the Core i7-3770K is a relatively high-end processor (it'll set you back $309.99 at Newegg right now). The hardware encoders might have compared more favorably against a slower chip, and the CUDA solution would likely have benefited from a faster GeForce with more shaders. In other words, a different mix of hardware might have yielded substantially different results, with the CUDA encoder potentially faring better and the software implementation falling behind.
What about image quality? Do the black-box encoders also struggle here?
Well, QuickSync doesn’t give us the sharpest-looking image, but it seems MediaConverter does a much better job than MediaEspresso of reining in our various hardware solutions. The differences in image quality are more subtle, and there are no egregious failures like the jaggies around object edges we noticed with QuickSync in MediaEspresso. Those may have been an artifact of low-quality interpolation in CyberLink’s software.
We do see the same loss in color saturation across the board, though. Strange.
In our action scene, the software encoder yields the best results, followed from a distance by the CUDA encoder. Surprisingly, despite maintaining a higher bitrate than QuickSync, the AMD VCE encoder produces the worst results. There’s a lot of artifacting, and a number of details are lost amid blurry smudges (like the light trails just under Spiderman’s neck and the pattern on his chest). The differences are even subtler in our still scene. If you look closely at the side of the child’s face and the wall behind the characters, you can see QuickSync produces more artifacts than the other solutions. CUDA seems to do the best job of preserving the original video’s film grain. VCE and the software encoder both fall somewhere in between.
Here, looking at the videos in motion didn’t reveal any quality differences that the screenshots didn’t already highlight.
Handbrake
This conversion utility differs greatly from the other two apps we tested. It costs nothing, and though it does include a number of presets for mobile devices, its interface isn't really designed with the computer illiterate in mind. All kinds of encoding and conversion settings are at your disposal right there in the main window. To be honest, it took a little digging in the documentation to figure out what some of the settings do.
The publicly available version of Handbrake lacks hardware acceleration support entirely. The pre-release build we used, as we noted earlier, features a beta, OpenCL-accelerated version of the x264 encoder. During a presentation at AMD’s Fusion Developer Forum last month, x264 lead developer Jason Garrett-Glaser said he expected to release the source code “in a couple of months”—so, by mid-August or thereabouts. He added that a public build would likely be out before then.
One more thing: this build of Handbrake doesn’t have a setting to disable OpenCL acceleration. Fiddling with the x264 config file in the program directory didn’t help, either. In the end, we used the latest public release available from the Handbrake website (0.9.8) to test raw CPU performance. That may not have been the most scientific approach, but it was the only option available.
|Idle wattage||37 W||37 W||37 W||43 W||46 W|
|Peak wattage||88 W||87 W||87 W||114 W||113 W|
One would expect the OpenCL encoding process to work identically regardless of the hardware used, so it’s a little surprising to see bitrate and file size differences. (Yes, we tried re-testing and came up with the same results.) In any case, the Radeon achieves the quickest encoding time, with Intel’s HD 4000 integrated graphics coming in last—behind the unaccelerated build running in software mode, in fact.
We can probably chalk up that last result to insufficient optimization. Our own testing gives no indication that the HD 4000 has poor OpenCL performance in general; in LuxMark, we saw a mobile incarnation of the HD 4000 slightly outpace the integrated graphics inside AMD’s A10-4600M APU. So, perhaps this version of the x264 encoder just isn’t properly optimized to take advantage of the HD 4000. For what it’s worth, the x264 developers recorded fairly substantial performance gains running their accelerated encoder on an AMD A10 APU’s integrated graphics.
Aside from slight variations in artifact patterns, it’s hard to discern much of a difference between the different solutions here. That’s good news. It means the OpenCL acceleration doesn’t degrade image quality in a noticeable way, regardless of the hardware used.
Handbrake is the only one of our three test apps that doesn’t mess with color saturation, too. Really, I’d say it has the best output, hands down.
Conclusions
The unfortunate truth is that, right now, hardware-accelerated video transcoding on the PC is a mess.
Support for black-box encoders is spotty. We saw output quality at the same settings vary wildly depending on the conversion software used. Not only that, but none of the black-box encoders we used matched the quality level of unaccelerated software conversion. Sometimes, the differences were glaring, with the black boxes producing a ton more artifacts and adding ugly jaggies around hard object edges. The only upside, really, is the encoding speed. For some folks, maybe that’s all that matters. Maybe it’s simply about getting a big video down to a manageable file size in as little time as possible. If you’re going to be watching the output on a 4″ smartphone, perhaps that isn’t a bad approach. Artifacts may not be visible or noticeable on that small a display, making encoder quality very much a secondary concern.
It’s a shame, though. Four long years have passed since Elemental released Badaboom 1.0, and we’re still facing a heavily fragmented ecosystem with vast inconsistencies in performance and image quality.
There may be hope on the OpenCL front. As we've seen, the OpenCL-accelerated version of x264 can produce relatively consistent output on different hardware. However, only a portion of the encoding pipeline is accelerated, with much of the work still being done on the CPU. On our test rig, substituting a much quicker Radeon HD 7850 for the Radeon HD 7750 didn't substantially reduce encoding times; they were still just over 30 seconds. It's possible some optimization work remains to be done. After all, we were using a beta, and the x264 developers haven't released a public version of their OpenCL-accelerated software yet. Still, we're not completely sold on the effectiveness of OpenCL acceleration here.
For the time being, the best option for quick, high-quality video transcoding is unfortunately to buckle down, get yourself a fast CPU, and run the best software encoder you can find (which may be Handbrake).
If performance matters to you more than quality, then using QuickSync in MediaConverter might be a suitable option. Encoding times will be very short, and image quality, while poorer than with Handbrake, will be adequate, especially if you’ll be viewing the video on a smaller screen. Other hardware transcoders were slower than our CPU in MediaConverter, though, and we were generally unimpressed with the image quality of the hardware solutions in MediaEspresso.