AMD: 4000-series Radeons have ‘known performance issues’ with OpenCL

From the beginning, both AMD and Nvidia have been heavily backing the OpenCL framework, which promises vendor-agnostic GPU computing goodness for all. If the contents of a thread in AMD’s Developer Forums are any indication, however, owners of Radeon HD 4000-series graphics cards may end up with the short end of the stick when it comes to GPU computing.

In the thread, a developer who identifies himself as Matt Taylor wrote, “We’re developing using openCL, and have one dev machine with an NVIDIA GTX 260, and another with an ATI 4870. . . I’m sorry to say we are getting approximately 5x the performance from the NVIDIA card, than from the ATI.” Performance is so bad, Taylor adds, that his 2.4GHz Core 2 Quad processor outperforms the Radeon “by a factor of two.”

AMD OpenCL Compiler Engineer Micah Villmow responded an hour later with the following:

This is entirely dependent on how you coded the kernel and what OpenCL features you are using. There are known performance issues for HD4XXX series of cards on OpenCL and there is currently no plan to focus exclusively on improving performance for that family. The HD4XXX series was not designed for OpenCL whereas the HD5XXX series was. There will be performance improvements on this series because of improvements in the HD5XXX series, so it will get better, but it is not our focus.

Villmow later qualified that response by saying, “[the Radeon HD 4870] just has to be programmed differently than the 5XXX series to get performance because of the lack of proper hardware local support. It is possible to get good performance, just not with a direct port from Cuda [Nvidia’s GPU compute architecture].” He also stressed that AMD’s compiler stack will include more device-specific optimizations as it matures.
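
To illustrate what "programmed differently" can mean in practice, here is a minimal, purely illustrative OpenCL sketch (not taken from the forum thread): the first kernel stages data in __local memory, the CUDA-style pattern that assumes fast on-chip shared memory, while the second reads straight from __global memory, which can be the better starting point on a part whose local memory is only emulated.

    /* Illustrative 3-tap box filter, two ways. A work-group size of 256 is an
       assumption baked into the __local variant. */
    __kernel void box3_local(__global const float *in, __global float *out, int n)
    {
        __local float tile[258];                    /* 256 items + 2 halo cells */
        int gid = get_global_id(0);
        int lid = get_local_id(0);

        tile[lid + 1] = in[min(gid, n - 1)];        /* stage the tile in local memory */
        if (lid == 0)
            tile[0] = in[max(gid - 1, 0)];
        if (lid == (int)get_local_size(0) - 1)
            tile[lid + 2] = in[min(gid + 1, n - 1)];
        barrier(CLK_LOCAL_MEM_FENCE);

        if (gid < n)
            out[gid] = (tile[lid] + tile[lid + 1] + tile[lid + 2]) / 3.0f;
    }

    /* Same math with no __local traffic at all: often the saner port target
       where "local" memory is merely emulated in global memory. */
    __kernel void box3_global(__global const float *in, __global float *out, int n)
    {
        int gid = get_global_id(0);
        if (gid < n)
            out[gid] = (in[max(gid - 1, 0)] + in[gid] + in[min(gid + 1, n - 1)]) / 3.0f;
    }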

In any case, this example doesn’t bode well for pre-DirectX 11 Radeons in the coming wave of OpenCL applications.

We should of course point out that not all Nvidia cards are based on the same GT200 architecture as the GeForce GTX 260 that purportedly performed so well. G92-based offerings like the GeForce GTS 250, GeForce 9800 GT, and GeForce GTX 200M series make up a big chunk of Nvidia’s current lineup, and they’re all derived from the older G80 design. The G80 was Nvidia’s first DirectX 10 architecture, and it might have some of the same hardware limitations as the Radeon HD 4000 series when it comes to GPU computing. (Thanks to Expreview for the link.)

Comments closed
    • dustyjamessutton
    • 10 years ago

    As for OpenCL ending up any more successful than OpenGL in the gamer market, I’m not holding my breath. As long as DirectX performance is good, I’m happy. I do wish that OpenGL had been more successful, as I root for good cross-platform standards, but OpenGL didn’t have the finances or the marketing behind it to make it a good competitor to DirectX, and I suspect OpenCL will suffer the same fate.

      • reever
      • 10 years ago

      Or the fact that adding new features in OpenGL takes like 5 years

      • Manabu
      • 10 years ago

      Actually, OpenGL was successful in the image processing and rendering market… OpenGL 3.0 was designed with those users in mind, not games and DX10. So I don’t doubt OpenCL will gain traction in the same market.

    • Freon
    • 10 years ago

    I’m not holding my breath for this to really screw me. *hugs 4850*

    • moose17145
    • 10 years ago

    I fail to see how this matters. Any firm or person using this to do anything practical (i.e. things that actually make money, not just dicking around during their free time) will have a card that is designed to handle this anyway. Meaning they will already have a Radeon 5XXX series card or, more likely, a GeForce based on the GT200 architecture.

    Also, again, I don’t understand why it matters given how new this technology is. Everyone is acting like it needs to work on every generation of cards right from the get-go. Give it another couple of years for the technology to mature and grow, both on the hardware and the software front. By then no one will care whether it runs on a 5xxx-series card, let alone a 4xxx-series card.

      • _Sigma
      • 10 years ago

      If OpenCL is to be used in anything widespread (OS features, games, etc) it needs to be backwards compatible with, at the very least, what was until quite recently a high-end graphics card.

    • BooTs
    • 10 years ago

    Hey Cyril, nice xmas gift to the trolls and fanbois.

    • applejack
    • 10 years ago

    This article missed an important point, noted at the source (Expreview):

    “if you are using local memory, they are all currently emulated in global memory. This can cause a fairly large performance hit if the application is memory bound. On the ATI Radeon HD 5000-series, local memory is mapped to hardware local and thus is many times faster than the ATI Radeon HD 4000-series,” explained Mr. Villmow.
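
    For what it’s worth, an application can detect that situation at runtime: querying CL_DEVICE_LOCAL_MEM_TYPE returns CL_GLOBAL when “local” memory is only emulated. A minimal host-side sketch, assuming dev is a cl_device_id obtained elsewhere:

        #include <stdio.h>
        #include <CL/cl.h>

        /* Report whether a device's local memory is real on-chip storage
           (CL_LOCAL) or emulated in global memory (CL_GLOBAL). */
        static void report_local_mem(cl_device_id dev)
        {
            cl_device_local_mem_type type;
            cl_ulong size;

            clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_TYPE, sizeof(type), &type, NULL);
            clGetDeviceInfo(dev, CL_DEVICE_LOCAL_MEM_SIZE, sizeof(size), &size, NULL);

            printf("local mem: %s, %lu KB\n",
                   type == CL_LOCAL ? "dedicated hardware" : "emulated in global memory",
                   (unsigned long)(size / 1024));
        }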

    • Fighterpilot
    • 10 years ago

    #4 l[

      • Arveedeetoo
      • 10 years ago

      Take your snide comments to the guy that made that quote. Maybe you can dazzle him with your expertise in GPGPU code optimization. I’ve been around too long to be a “fanboi” of any sort so no thanks.

    • Krogoth
    • 10 years ago

    They both take different architectural approaches. It is hardly surprising that there is a huge performance difference with the same codepath.

    Nvidia GPGPUs have fewer “shader units,” but they are clocked much higher than their AMD counterparts. AMD takes the opposite route: a ton of “shader units” at lower clock speeds.

    Most GPGPU applications on the market are optimized for Nvidia architectures. I suspect AMD GPGPUs could be almost as fast if those applications were optimized for them. AMD hasn’t gone as far as Nvidia in encouraging software developers.

    • NarwhaleAu
    • 10 years ago

    1 line article summary: “It is possible to get good performance [from a 4xxx], just not with a direct port from Cuda”

      • odizzido
      • 10 years ago

      Obvious statement is obvious.

        • Shining Arcanine
        • 10 years ago

        Every time I see someone say “x y is x” where x is an adjective and y is a noun, it seems as if a puppy died.

          • Firestarter
          • 10 years ago

          Dead puppy is dead.

    • Mikael33
    • 10 years ago

    I don’t see the relevance of this if Nvidia’s G80 based cards are just as bad.

      • swaaye
      • 10 years ago

      Well, the HD4xxx cards coincided with the G92 / GT200 era. G80 should be compared to the not-even-mentioned RV670 and perhaps even R600.

      • Shining Arcanine
      • 10 years ago

      The GT200 is only supposed to be about 2 times faster than the G80. Having it be 5 times faster than the G80’s competition would imply that the G80’s competition is slower than the G80.

      • HighTech4US
      • 10 years ago

      > I don’t see the relevance of this if Nvidia’s G80 based cards are just as bad.

      Wow, let’s extrapolate that if AMD screwed up, nVidia must have also.

      nVidia had the vision to design the G80 with local memory. Something AMD thought was unnecessary has now bitten them and their users hard. If you want performance with AMD, toss your 4000 series and buy new hardware.

      GPU computing came from nVidia’s vision. AMD has been the follower not the leader in the GPU compute field.

      OpenCL is not a magic bullet that can fix poor design decisions. AMD’s screw up is theirs alone.

        • squngy
        • 10 years ago

        It wasn’t a screw up, it’s an oversight.

        They simply didn’t have GPGPU in mind when they designed the 4000 series. Nv on the other hand has been pushing it pretty hard for quite a while…

          • fellix
          • 10 years ago

          Actually, they had: the HD4000 architecture is compliant with OpenCL (and DirectCompute, for that matter) regarding hardware thread syncing, which the previous HD2000 and HD3000 generations lacked completely.
          The issue here is the “castrated” local data share (LDS) functionality, which is not compatible with the aforementioned APIs and is therefore emulated by the device driver in global (video) memory, which is MUCH slower in most cases.
          This is all fixed in Evergreen (HD5000): the HDAO effect in DiRT2 is implemented through DirectCompute, using LDS memory to sample the depth buffer data far faster than the conventional SSAO method. That way only one read/write cycle to video memory is performed per pass and tons of precious bandwidth are saved.

            • applejack
            • 10 years ago

            So technically DiRT2’s HDAO could have worked on many NVIDIA cards currently on the market; they are DirectCompute (4.x) compliant.
            Still, Codemasters / ATI wouldn’t let DX10 hardware use the DX11 features (the game launches in DX9 mode instead). Now it seems more likely that ATI and their HD4xxx series are the ones responsible for the paid-for features being “compatible” with DX11-class hardware alone.

            • fellix
            • 10 years ago

            HDAO in DiRT2 is implemented through the DC 5.0 API (part of DX11), which requires a minimum of 32KB of LDS memory. NV’s hardware has offered only 16KB since the G80. It’s up to Fermi to close that gap when it comes out.

            • applejack
            • 10 years ago

            But they could have implemented it through DC 4.x (which is also part of DX11) as well. After all, DC 4.x was made specifically for DX10.x-class hardware running in DX11 mode. HDAO is one of four DC 5.0 features available in DiRT2 for DX11 hardware only. Do you suggest that DC 4.x is not worth using at all?

            • fellix
            • 10 years ago

            DC 5.0 is the most complete and feature-rich specification; why bother with cut-down derivatives made because one IHV is lagging behind schedule?
            DiRT2 already supports two generations of DirectX APIs; a third one would complicate development and after-release support for questionable gains.
            ATI already has sizable DX11 market penetration, a virtual monopoly there, with DX10.1- and DX11-compatible SKUs accounting for several million units (DX10.1 already offered enough DX11 features, mostly for graphics speedups). Sadly, the GPGPU side of the R700 architecture is not up to par with its excellent graphics capabilities in such a small and affordable package.

            • Freon
            • 10 years ago

            “why bother with cut-down derivatives made because one IHV is lagging behind schedule”
            As if there are ten IHVs in the graphics market and one is insignificant?

            The rest of your point isn’t lost: they can’t spend infinite development time tweaking the game for every card out there, but the above statement is just silly.

            • applejack
            • 10 years ago

            You missed my point. See, both HD4xxx and GT200 can do DC (4.1 and 4.0 respectively), but I suspect GT200 is faster in this case, as it’s faster in OpenCL; otherwise AMD would have bothered with cut-down derivatives so their own HD4xxx customers could enjoy a few DX11 goodies, with supposedly better DC support (4.1) than nVIDIA’s (4.0).

            There are many more HD4xxx cards around than HD5xxx, that’s for sure.

            Skipping DC 4.x support in DiRT2 may be a result of:

            1. lack of time.
            2. cheap marketing for DX11 hardware.
            3. bad DC 4.x performance compared to the rival.

            I vote for all 3.

        • Mikael33
        • 10 years ago

        So it was nVidia’s vision, hmm? Sounds like someone’s been smoking something hallucinogenic.
        It’s quite simple: Nvidia’s real goal is GPGPU (and money), and Fermi makes that quite obvious, whereas ATI’s goal is games; they were just forced to include GPGPU because of Nvidia.

        • Flying Fox
        • 10 years ago

        The Stream Computing initiative was announced before CUDA, I believe.

        • OneArmedScissor
        • 10 years ago

        r[

          • SomeOtherGeek
          • 10 years ago

          Heh heh, yep, sure was. Loving my 4870 and will love it for a good long time. It actually plays everything under the sun, so it has plenty of performance! Need to re-examine your wants and needs.

        • Freon
        • 10 years ago

        “bitten them and their users hard”

        You are smoking some mean stuff and not sharing. You should work on your manners.

    • Asbestos
    • 10 years ago

    Ah, yes. The coming wave of OpenCL apps. Brace yourselves.

      • MadManOriginal
      • 10 years ago

      I’ve been bracing myself for well over a year and so far…nothing useful. The one thing I’d really want, and even then only a little because I’m not big into movies, is a video encoder. But so far every review of GPU video encoders I’ve seen shows less-than-ideal output. It’s fine for portable use, but I’d want to archive stuff for home use or later transcoding, and doing that in an inferior form just to save some time on a process I’d only do once doesn’t work for me.

      • WaltC
      • 10 years ago

      I imagine the coming wave of DX11-specific games will get us all wet before the coming wave of OpenCL applications does…;)

      I’m just happy that when I bought my 3D GPUs, I bought them because I wanted 3D GPUs and nothing else.

      • gtoulouzas
      • 10 years ago

      I’m hearing they will complement Duke Nukem Forever quite well.

      • willyolio
      • 10 years ago

      I think by the time the “coming wave” arrives, the HD5k series will already be old.

    • PRIME1
    • 10 years ago
      • TheEmrys
      • 10 years ago

      Has F@H moved to an openCL format? I must have missed that.

      • Arveedeetoo
      • 10 years ago

      Boy, such a quick retort. Read a bit further in that article and you come across this comment from the programmer who optimized the Milkyway@Home client for GPU computing:

      /[

        • MadManOriginal
        • 10 years ago

        It’s too bad Folding is far and away the most-used DC project. If there were others that were better written for ATi cards, maybe ATi cards would hold a bit more value like older NV ones have.

      • mesyn191
      • 10 years ago

      Doesn’t seem to be a Radeon 5xxx issue or even a 4xxx issue but a programming issue from the guys who are writing F@H.

      • Goty
      • 10 years ago

      F@H is not OpenCL and has not been recoded for any Radeon since the 3000 series.

      You’re a fail troll, Prime.

    • Scrotos
    • 10 years ago

    So if you want to target your OpenCL app for 4xxx series Radeons, you just need to have an optimized path for that specific type of card.

    I assume you’d also optimize for the G80 versus GT200 versus Radeon 5xxx series, too, for maximal performance. I mean, don’t compilers already try to pop out optimal code for different CPU architectures (Intel versus AMD) and generations (P4 versus Core2)?

    It seems like this is being blown out of proportion. Yes, the 4xxx series will be slower if you slap a quick port of CUDA code on it; that’s what the AMD bloke said. Seems like a no-brainer to me. Your code is slow, so profile it, find out what’s causing the slowdown, and try to optimize around those cases.
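
    For the “profile it” part, OpenCL already provides the plumbing: create the command queue with CL_QUEUE_PROFILING_ENABLE and read the event timestamps. A rough sketch, assuming queue and kernel are already set up elsewhere:

        #include <CL/cl.h>

        /* Time one launch of `kernel` over `global_size` work-items; the queue
           must have been created with CL_QUEUE_PROFILING_ENABLE. */
        static double kernel_time_ms(cl_command_queue queue, cl_kernel kernel,
                                     size_t global_size)
        {
            cl_event evt;
            cl_ulong start, end;

            clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                                   0, NULL, &evt);
            clWaitForEvents(1, &evt);
            clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                                    sizeof(start), &start, NULL);
            clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                                    sizeof(end), &end, NULL);
            clReleaseEvent(evt);

            return (end - start) / 1.0e6;    /* nanoseconds to milliseconds */
        }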

      • ET3D
      • 10 years ago

      NVIDIA has the advantage of having pushed CUDA for a while now, so developers know the exact limitations and how to write code that works well on NVIDIA hardware. It’d be a while before they manage to get better performance out of ATI cards. Hopefully AMD will help in this effort.

      • sotti
      • 10 years ago

      We don’t need that kind of rational analysis around here.

      If you can’t see how the sky is falling, no soup for you.

      • HighTech4US
      • 10 years ago

      > So if you want to target your OpenCL app for 4xxx series Radeons, you just need to have an optimized path for that specific type of card.

      No, the 4000 series will always underperform, as AMD was not forward-looking in its design. You can’t optimize software for missing hardware and expect performance to magically improve.

      > I assume you’d also optimize for the G80 versus GT200

      You assume wrong.

        • Waco
        • 10 years ago

        I sure as hell would if I wanted optimal performance…

        • sschaem
        • 10 years ago

        Totally wrong, guys. You can implement algorithms that run faster on a CPU than when implemented with CUDA on the fastest Nvidia card.
        Why, you wonder? Because if you break the hardware architecture’s rules on the GPU, performance falls off a cliff.

        But what matters for deciding the fate of the 4K series on OpenCL, and what’s missing from the article, is which algorithm performs so poorly on a 4000-series card.

      • Manabu
      • 10 years ago

      The GPU is not a miraculous chip that runs everything faster than a CPU. Many algorithms can’t be optimized for GPUs, because GPUs are very limited. Sometimes you really need a local shared cache for fast communication between threads, and there is no clever algorithm that can hide the latency. It is not always possible to optimize.

      Some programs will fare badly on less complex GPU cores, like RV770 compared to GT200, or HD5xxx compared to Fermi. ATI’s architectures always seem to be more or less one step behind Nvidia’s in GPGPU. That is understandable. We will have to live with it.

    • lilbuddhaman
    • 10 years ago

    What will the performance difference be between a 5XXX and a …

      • Game_boy
      • 10 years ago

      Whatever you want it to be, when you optimise your code.

      • PRIME1
      • 10 years ago

      The point of OpenCL is that you don’t have to optimize for one vendor.

        • willmore
        • 10 years ago

        Right, sure, just like you don’t have to optimize code for AMD or Intel CPUs.

          • Shining Arcanine
          • 10 years ago

          You actually do not need to optimize it for either vendor’s CPUs. You just need to optimize it for the instructions in use. Which architectures have which instructions is mostly irrelevant.

            • lycium
            • 10 years ago

            In practice one does optimise, and with AMD vs. NV it’s SIMD vs. scalar; that can have reasonably complex algorithmic implications.

            • Manabu
            • 10 years ago

            Just look at a high-performance app with hand-coded assembly. They optimize for each architecture, because each one has a different pipeline length, unaligned-load penalty, cache-line-split penalty, etc.

            Architecture improvements generally eliminate bottlenecks and thus the need for workarounds, like those needed for RV770 compared to GT200, and that will be needed for HD 5xxx compared to Fermi. When no workaround is possible, or the workarounds are much slower, you can call an architecture slower. I think that is the case for ATI GPUs. It is not always possible to “optimize”.

            Some x264 commits:

            “Initial Nehalem CPU optimizations” (http://x264.nl/x264/changelog.txt)

            • ew
            • 10 years ago

            Applications that are optimized that way are the exception, not the rule.

            • Manabu
            • 10 years ago

            The initial argument was whether one needs to optimize for each CPU beyond OpenCL. The answer is that, if speed is very important, you can gain a lot (sometimes more than double the speed) by targeting one specific architecture instead of another in the x86 CPU field. The programs that need to be optimized that heavily are few, though.

            In fact I could go further. OpenCL is at the same level as C, and clever hand-written assembly code is often more than 2 times faster than compiler-generated assembly, even for a language as low-level as C. In programs like x264 they were expecting a 4~5x overall speed boost on ARM just from rewriting some of the most-used functions in NEON, so the local gains are even greater.

            C vs. ARM/x86 assembly is like OpenCL vs. CAL/IL for ATI GPUs. If you really need more performance, you should get your hands dirty at the low level.

            • Shining Arcanine
            • 10 years ago

            There is a difference between CPUs and GPUs. I believe my post said CPUs.

            • Manabu
            • 10 years ago

            Yes, you did. The different x86 and ARM architectures are CPUs, so I answered you with CPU examples.

            I only included a bit about OpenCL because it is what this news post is all about, and actually what your post is about. I was showing that there is not as much difference as you think between the need for special optimizations on CPUs and on GPUs. In both cases you can gain a lot by “going dirty & low”.

            • Shining Arcanine
            • 10 years ago

            Stream processing is the exception. The majority of stuff running on CPUs is not stream processed and the stuff that is stream processed should be moved to the GPU.

            Have fun optimizing something like the traveling salesman problem (just to name one example) in that way. No amount of architecture-specific optimization will give you a significant speed-up. Another good example is GUI processing, which relies on the MVC design pattern and cannot really be optimized for a specific architecture. Try making a web browser, or even more simply an XML parser, and you won’t get any speed-up from targeting a specific architecture.

            Believe me, I have tried doing assembly optimizations of various things and found that it only helps in a limited number of special cases, all of which could be considered stream processing, which again should be done on the GPU. The only real speed-up you might be able to get is from calibrating data structures like B-trees so each node fits in a cache line, but that is another exception, because there you are modifying constants and not any actual code.

            About 1 to 2 years ago, when I was thinking about things that benefited from stream processing, it was somewhat interesting to observe that some basic examples that benefit from stream processing would benefit more from a programmer being cleverer with the mathematics. E.g., the summation of the numbers from 1 to n is better done with an O(1) formula (i.e. n(n+1)/2) than with an actual loop; see the little sketch at the end of this comment.

            It is easy to make a stupid algorithm that will benefit from stream processing. It is hard to make a smart algorithm that benefits from stream processing. The stupid algorithm has to have a small slow-down over the smart algorithm for the application in question to actually benefit from stream processing, which is not always possible to get and is difficult to get when it is possible to get.
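
            To make that n(n+1)/2 remark concrete, a throwaway C comparison (the sum_loop / sum_formula helpers are hypothetical, just for illustration):

                #include <stdio.h>

                /* O(n): the "stupid" but streamable version. */
                static unsigned long long sum_loop(unsigned long long n)
                {
                    unsigned long long s = 0;
                    for (unsigned long long i = 1; i <= n; i++)
                        s += i;
                    return s;
                }

                /* O(1): the closed form; no amount of GPU power helps the loop beat this. */
                static unsigned long long sum_formula(unsigned long long n)
                {
                    return n * (n + 1) / 2;
                }

                int main(void)
                {
                    printf("%llu %llu\n", sum_loop(1000), sum_formula(1000));  /* both 500500 */
                    return 0;
                }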

            • Manabu
            • 10 years ago

            I didn’t see this answer before. It is very informative.

            But I don’t know whether every stream-processing task is better done on the GPU. The x264 developers failed to make CUDA run faster than their x86 assembly. That could be due to their incompetence at CUDA programming, or because the current GPU architecture is inadequate for the task. It may change with Fermi and future, more complex GPU architectures, but I don’t know.

            For example, on the GPU they could not perform a mathematical equivalent of exhaustive motion search, and the GPU ended up being slower. The mathematical equivalent, which was about 3 times faster on the CPU, was many times slower than the dumb method on the GPU because of latency issues, etc.

        • willardjuice
        • 10 years ago

        That is not true at all. The point of OpenCL is that it will work with all vendors, not necessarily optimally.

          • khands
          • 10 years ago

          It did work; however, in this case the code was pre-optimized for CUDA-capable hardware, i.e. Nvidia.
