A closer look at Folding@home on the GPU

MUCH HAS BEEN MADE of ATI’s recently announced stream computing initiative, which aims to exploit graphics hardware for more general purpose computing tasks. On paper, stream computing has incredible potential. ATI claims the 48 pixel shaders in its top-of-the-line Radeon offer roughly 375 gigaflops of processing power with 64 GB/s of memory bandwidth. But gobs of horsepower isn’t the only thing that matters—you need to be able to apply that power to the road if you want to go anywhere.
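
For what it's worth, ATI's 375-gigaflop figure squares with a simple peak-throughput estimate. The sketch below is just that back-of-the-envelope math; the 650MHz core clock and the 12 floating-point operations per shader per clock are assumptions on our part, not figures from ATI's announcement.

```python
# Back-of-the-envelope check of ATI's ~375 gigaflop claim for its top Radeon.
# Assumed (not from the article): a 650MHz core clock and 12 floating-point
# operations per pixel shader per clock.
pixel_shaders = 48
core_clock_hz = 650e6            # assumed
flops_per_shader_per_clock = 12  # assumed

peak_gflops = pixel_shaders * core_clock_hz * flops_per_shader_per_clock / 1e9
print(f"Theoretical peak: {peak_gflops:.0f} GFLOPS")  # ~374 GFLOPS
```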

Stanford’s Folding@home project is already putting the Radeon’s pixel processing horsepower to use with a beta GPU client that performs protein folding calculations on the graphics processor. According to Stanford, the GPU client runs between 20 and 40 times faster on newer Radeons than it does on a CPU, a claim that no doubt sends folding enthusiasts’ hearts aflutter.

Such an increase in folding performance is certainly tantalizing. We decided to give the GPU client a spin to see what we could find out about it.


The GPU client’s optional eye candy is a nice perk

For our little experiment, we ran Folding@home’s CPU and GPU clients for several days on the same system. We used the latest beta command line clients in both instances, which worked out to version 5.04 for the CPU and 5.05 for the GPU. Our test system consisted of an Abit Fatal1ty AN8 32X motherboard with a dual-core Opteron 180 2.4GHz processor, 2GB of DDR400 Corsair XMS PRO memory, and a Radeon X1900 XTX graphics card with ATI’s Catalyst 6.5 drivers (the only revision that Stanford has extensively tested with the GPU client).

With our test system configured, we pitted the CPU and GPU clients against each other in a virtual race. A single CPU client was set to run on CPU 0, leaving the GPU client with the remainder of the system’s resources. According to Stanford’s GPU folding FAQ, at least 25% of a system’s CPU resources should be made available to the GPU client. With our test system crunching the CPU client on only one core, the second core, or 50% of the system’s CPU resources, was available to the GPU client. Unfortunately, we were only able to run the system with a single Radeon X1900 XTX graphics card, since Stanford’s GPU folding client doesn’t currently support CrossFire configurations.

Five days after releasing our test system into the wild, we checked on each client’s scores, with surprising results. Conveniently, both clients had recently completed work units when we tallied the totals.

                  Total points   24-hour average   Work units
GPU client        2,640          377               8
CPU client        899            128               6
CPU client x 2*   1,798          256               12

* Extrapolated from the single CPU client's output

Over five days, our Radeon X1900 XTX crunched eight work units for a total of 2,640 points. During the same period, our single Opteron 180 core chewed its way through six smaller work units for a score of 899—just about one third the point production of the Radeon. However, had we been running the CPU client on both of our system’s cores, the point output should have been closer to 1,800, putting the Radeon ahead by less than 50%.
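
Those ratios fall straight out of the totals above; here's the quick arithmetic for anyone checking our math.

```python
# Point ratios from the five-day totals in the table above.
gpu_points = 2640
cpu_points_one_core = 899
cpu_points_two_cores = 1798  # extrapolated from the single-core result

print(cpu_points_one_core / gpu_points)       # ~0.34: one Opteron core manages about a third of the GPU's points
print(gpu_points / cpu_points_two_cores - 1)  # ~0.47: the Radeon leads two cores by a little under 50%
```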

Either way, that’s a far cry from a 20- to 40-fold increase. That’s not entirely surprising, though. Stanford’s own GPU folding FAQ explains that the points awarded for GPU work units aren’t directly comparable to those awarded to the CPU client:

We will continue to award points in the same method as we’ve always used in Folding@Home. To award points for a WU, the WU is run on a benchmark machine. The points are currently awarded as 110 points/day as timed on the benchmark machine. We will continue with this method of calibrating points by adding an ATI X1900XT GPU to the new benchmark machine (otherwise, without a GPU, we could not benchmark GPU WU’s on the benchmark machine!). Since Core_10 GPU WU’s cannot be processed on the CPU alone, we must assign a new set of points for GPU WUs, and we are setting that to 440 points/day to reflect the added resources that GPU donors give to FAH. In cases where we need to use CPU time in addition to the GPU (as in the current GPU port), we will give extra points to compensate donors for the additional resources used. Right now, GPU WU’s are set to 660 points/day. As we go through the beta process, we will examine the issue of points for WUs, as we understand the significance of this in compensating donor contributions.
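
In other words, a work unit's point value is just its assigned daily rate multiplied by how long the benchmark machine takes to finish it. The sketch below illustrates the scheme with hypothetical benchmark runtimes; only the per-day rates come from Stanford's FAQ.

```python
# Minimal illustration of Stanford's points scheme: per-WU points are the assigned
# points-per-day rate multiplied by the WU's runtime on the benchmark machine.
# The 110 and 660 points/day rates come from the FAQ above; the runtimes are hypothetical.
def wu_points(points_per_day, benchmark_days):
    return points_per_day * benchmark_days

print(wu_points(110, 2.0))  # a CPU WU the benchmark machine finishes in two days: 220 points
print(wu_points(660, 0.5))  # a GPU WU it finishes in half a day: 330 points
```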

So point totals don’t necessarily reflect the relative folding power of the Radeon X1900 XTX. CPU and GPU clients draw from different pools of work units, and points are based on the performance of a benchmark system rather than how many calculations have actually been completed. The GPU client may be doing between 20 and 40 times more work, but the points Stanford awards don’t reflect that. It will take more than a farm full of Radeons to dominate the Folding@home leaderboard.

Not content to stop our investigation at point totals, we fired up our watt meter to see just how much juice our test system consumed with several folding configurations. The system’s power consumption was measured at idle with and without Cool’n’Quiet clock throttling, and then with various Folding@home client combinations.

System power consumption
Idle with Cool’n’Quiet         98.6W
Idle without Cool’n’Quiet     113.0W
CPU client x1                 160.2W
CPU client x2                 185.6W
GPU client                    195.6W
CPU client + GPU client       228.0W

Clearly, the GPU client is much more power-hungry running on a Radeon X1900 XTX than the CPU client is with an Opteron 180. Even when compared with two cores running at full steam, the GPU client still pulls an extra 10W at the wall socket. Still, given our point totals, the GPU client appears to be the more power-efficient of the two.

Using an extrapolated point total for two CPU clients running in parallel, which is a pretty realistic estimate given how Folding@home loads the CPU, we’d expect to generate around 1,798 points while pulling 185.6W, good for close to 9.7 points per watt. The GPU client, on the other hand, generated 2,640 points while pulling 195.6W, yielding close to 13.5 points per watt.

Interestingly, with our test system running one CPU and one GPU client, we generated a total of 3,539 points pulling 228W, or 15.5 points per watt.
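
Here's how those points-per-watt figures shake out when computed from the five-day point totals and the wall-socket power draws above.

```python
# Points per watt for each folding configuration, using the five-day point totals
# and the measured wall-socket power draw reported above.
configs = {
    "CPU client x2 (extrapolated)": (1798, 185.6),
    "GPU client":                   (2640, 195.6),
    "CPU client + GPU client":      (3539, 228.0),
}

for name, (points, watts) in configs.items():
    print(f"{name}: {points / watts:.1f} points per watt")
# Prints roughly 9.7, 13.5, and 15.5 points per watt, respectively.
```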

Unfortunately, the scoring scheme for Stanford’s GPU folding client doesn’t reflect the apparent processing power advantage of graphics processors like the Radeon X1900 XTX. The use of a benchmark system is consistent with how points are awarded with the CPU client. Still, if a GPU really is doing vastly more folding work than a CPU, perhaps the points system should weight GPU time more heavily. 

Comments closed

    • shank15217
    • 13 years ago

    The CPU still plays a big part for the GPU client; it acts as the I/O device to the GPU and possibly something more, since maybe some calculations cannot be done on current GPUs. I think the performance would really suffer with both GPU and CPU clients working in tandem if there were only a single-core CPU. I wouldn’t mind a direct comparison of the GPU client’s performance on different CPUs.

    • BaldApe
    • 13 years ago

    So, they don’t want the people without Radeon X1900 GPUs to feel bad, so they mucked up the point system?

    Edit: removed some of my comment so as not to appear to be a FNT.

      • shank15217
      • 13 years ago

      That’s just a really bad comparison. Don’t compare people with GPUs. Affirmative action my a**

        • bthylafh
        • 13 years ago

        Heh, I smell a FNT.

          • BaldApe
          • 13 years ago

          What’s an FNT?

        • BaldApe
        • 13 years ago

        Fair enough. What I’m talking about is lowering the bar so people won’t feel bad or be left out, rather than keeping it real.

    • sigher
    • 13 years ago

    So, a whole article and no information on how much it actually speeds things up? Perhaps you should have waited until you figured out a way to measure true performance before releasing the article.
    Still, the wattage thing was interesting and informative.

      • alphaGulp
      • 13 years ago

      What exactly do you expect someone who isn’t working directly on the Folding@Home code to be able to do? It is basically a black box to anyone else.

      Regardless, the article answers some big questions, such as
      – How GPU work units are measured (very different from what I would have expected)
      – What impact you can expect your GPU to have on your points
      – Whether the performance / watt is in the GPU’s favor or not
      – …

      Where else have you seen any of this answered?

      Still, it would be awesome if techreport could find a way to interview the Folding@Home team and get more direct info.

        • sigher
        • 13 years ago

        So isn’t that a bit suspicious? If F@H is such a black box, and you install that on your systems?
        There’s one born every minute, I guess.

          • alphaGulp
          • 13 years ago

          Every single piece of software you buy that isn’t open source is a black box. You do have the option of reverse-engineering from the binary, but that is very difficult to work with.

          Even with source code, when it is as complex as this stuff is likely to be, you basically need to talk to the developers to find out what’s going on and why. Pity the fool who needs to figure out a program with little or no documentation and no help / guidance from the original developer(s).

    • alphaGulp
    • 13 years ago

    I’m super disappointed that the points aren’t calculated based on the work done.

    I realize some tasks may be better suited to the CPU than the GPU. However, since by their very nature these tasks are highly parallelizable, I am really not convinced that only a small subset of the work can be done by the GPU. I would imagine that the parts of the work that are not parallelizable would end up taking 25% or more of the CPU’s cycles.

    At this point, it may be that work units need to be modified to use the GPU? If so, that might explain why only certain types of WUs can be used by the GPU.

    Before I get heat for not focusing on the point of Folding@Home (helping research, blah blah), let’s not forget the pure fun in setting up a machine that will be a beast for folding.

    If Stanford values the actual ‘work’ being done, they should reward it regardless of its origin. The practical reason for them to do so is that more people would buy X1900 GPUs. From a folder’s point of view, why skew your purchase decisions towards these GPUs if they don’t accomplish much more than a second core might?

    ** edit: typo **

    • maxxcool
    • 13 years ago

    Does anyone have any real data on “what” it actually accelerates? It would seem to me that this will only help with 32-bit floating-point arithmetic, not good old integer math.

    I think this is a lot of fluff, really. Sure, it’s faster, but only in certain circumstances, it seems. I would like to know when it will really help, since the Core 2 Duo seems to be able to keep a respectable pace.

    • Bookrat
    • 13 years ago

    An article about folding, but no mention anywhere of team 2630? For shame…

      • Dissonance
      • 13 years ago

      Almost as bad as not reading TFA.

      “For more info on Folding, please see our folding page and consider joining Team TR!”

        • drfish
        • 13 years ago

        FWIW, I didn’t notice it until I read the article a 2nd time. A plug at the top would have been nice, especially with the potential for this article to get dugg…

        • Bookrat
        • 13 years ago

        I read T whole FA, but missed the little italicized comment after the conclusion. As it happens, the italics actually served to…

          • indeego
          • 13 years ago

          I know where you are coming from, TR needs a print article button so we can find specifics <.<

      • indeego
      • 13 years ago

      Your post contains no advertisement of https://techreport.com

    • tygrus
    • 13 years ago

    The GPU version uses too many low-level hardware features to run using the DirectX API alone. They may change the point value depending on supply and demand. The GPU version can’t solve everything, so they still need the CPU users onside. The reference machine and calculations have changed over time. Those using CPUs with SSE2/AltiVec gained a lot more points than with it switched off.
    Some references to 20x speed-ups refer to portions of the code, not the total execution.
    The algorithm for the GPU is different and uses more FLOPs per WU.
    The system for using the GPU relies on the CPU frequently checking whether the GPU needs data read or sent, which wastes both CPU cycles and GPU cycles.
    The system is still a bit beta and will develop as GPUs and APIs progress.

    • murfn
    • 13 years ago

    If the GPU client works strictly within the DirectX API, it should be a simple matter to get it to run on the reference rasterizer. The binary would have to ask for it, but that should be simple for ATI to add in. This would make for a much better GPU vs. CPU comparison.

    • indeego
    • 13 years ago

    Nevermind, broke the page <.<

    • Bill Clo
    • 13 years ago

    Points are good, but let’s not forget why most of us are here: the science and the possibility of improving people’s health. Simply put, the GPU client is going to make it possible to work on proteins that simply couldn’t be worked on before. It appears to be able to generate much more useful science than the CPU client. The results of such work may help you or a loved one in the future, after all…

    As for points, which are not my primary motivation for participating, yes, the GPU client should generate more points than the CPU client. It requires more expensive hardware to run. But 10-20x the points of a CPU client seems excessive. After all, when Gromacs came into widespread usage, they said it did 10x the work of the Tinker core. Yet the points didn’t increase (certain bonus work units notwithstanding). Don’t forget that not all protein projects can/will use the GPU client.

    However, since the GPU work units will only run on certain expensive video cards, and will probably require a dedicated CPU core to keep it working at full speed, I can see a point increase as a good idea. The new 330 PPD per unit seems to be a reasonable compromise. If the work units get larger, as we all know they will, the points will increase.

    I seem to recall we heard the same sort of complaints when SSE2 optimizations came out and Folders with older machines that couldn’t use the new optimizations were upset. They didn’t get as many points, and they started to fall behind. This is nothing new, IMHO.

    Get whatever hardware you can afford, and Fold on.

    • LiamC
    • 13 years ago

    Hold your horses.

    The GPU client only works on a select subset of work units. GPUs cannot do general work units at all.

    Now compare this with, say, project 1495, a select subset of general work units. I have a 2.4GHz X2 (an overclocked 3800+). It gets between 348 and 374 PPD.

    On the same project, my C2D (E6600 @ 2.93GHz, 1,333MHz FSB) gets between 1,219 and 1,258 PPD! GPUs don’t look that impressive on certain select work units.

    Don’t get me wrong, GPUs have a place, but it comes down to the problem and the algorithm. Some problems will suit them, others won’t. SSE4 may make CPUs look better. R600 may make GPUs look better.

    It will really depend on the type of problem to be solved. Still, I’m thinking of upgrading to an X1900/1950. Should buy shares in ATi; Oh, wait…

    • drfish
    • 13 years ago

    Are those numbers from 2D or 3D clockspeeds?

    • Entroper
    • 13 years ago

    IMO they should be comparing GPU work to the same work done on CPUs, apples to apples. If GPUs are doing 20-40x as much work, doesn’t Stanford want to recruit as many GPUs as possible? So give folders an incentive, give them all the points!

      • FubbHead
      • 13 years ago

      Not until AMD/ATI’s PR department pays them for it… 🙂

    • Shintai
    • 13 years ago

    So looking at the numbers, an OCed C2D can basically beat a single GPU folding in terms of points?

    How much effort did it take on the CPU core(s) to have the GPU client running?

    I have a mere X1900GT, and I don’t think it can make more points than my OCed C2D.

      • Krogoth
      • 13 years ago

      The X1900GT is faster than a C2D even when OCed, but the problem is that the point system is different for GPU WUs and is pretty darn conservative ATM.

        • Shintai
        • 13 years ago

        You remind me of a parrot, more or less telling me what I said 😛

      • BaldApe
      • 13 years ago

      If the points are all that matter to you, then of course you are correct. I suspect they will change their point system to better reflect the work that is being accomplished once GPUs that support this kind of processing are more pervasive. Hopefully.

      Ps: Caaa Caaawww

    • adisor19
    • 13 years ago

    Does this run on my ATi VGA Wonder?

    Just kidding 😉

    Adi

    • Proesterchen
    • 13 years ago

    Thanks for this quickie eval of the GPU client, and for the interesting (and better than I expected) power numbers.

    Makes me wonder, if that’s just a result of the client not using all the graphics card’s resources to their max, whether that’s an application/programming limitation or a hardware one.

    • d0g_p00p
    • 13 years ago

    Rock on TR!!

      • stmok
      • 13 years ago

      Agreed! Finally, someone does some tests and makes a comparison!

      Thanks Geoff!

      • flip-mode
      • 13 years ago

      Agreed. A lot of the comments made here have been fairly unappreciative. That’s a shame. It’s OK to make suggestions and all, but to embed them in mocking statements is just rude.

      Thanks Geoff.
