ATI dives into stream computing

I SPENT LAST WEEK taking in information like a sponge, and this week, it’s still slowly seeping out. My IDF wrap covered Intel’s future directions in processor technology, which had a lot to do with highly parallel floating-point processing. Many of my conversations with folks other than Intel last week touched on this same theme—whether it was about ATI’s stream computing initiative, Nvidia’s plans to counter, or the battle over gaming physics. Read on for my take on ATI’s foray into what it calls stream computing.

ATI’s stream computing kickoff
Last Friday, ATI invited a number of journalists and analysts to a short but information-packed event devoted to its new stream computing initiative. ATI is using the phrase “stream computing” for the class of applications more commonly grouped under the GPGPU label, an acronym for general-purpose computing on a graphics processing unit. CEO Dave Orton explained that ATI chose the term stream computing because the class of computing problems the GPU handles well are primarily about data flow, a characteristic that separates them from the types of computation at which CPUs have traditionally excelled.

Orton identified a number of specific areas where ATI sees opportunities for GPUs to accelerate computation, including medical research, analysis of video and audio data for security applications (such as facial recognition), financial analysis, seismic modeling for oil and gas exploration, media search applications, physics simulations in video games, and media encoding.


CEO Dave Orton explains ATI’s Stream computing initiative

In these areas, he said, the GPU has the potential to be “orders of magnitude” faster than CPUs due to its nature as a highly parallel floating-point processor. Orton pegged the floating-point power of today’s top Radeon GPUs with 48 pixel shader processors at about 375 gigaflops, with 64 GB/s of memory bandwidth. The next generation, he said, could potentially have 96 shader processors and will exceed half a teraflop of computing power.
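Those figures pass a quick sanity check. Assuming the roughly 650 MHz core clock of the top Radeon X1900 parts (my assumption; Orton quoted only the aggregate number), the math works out to about a dozen floating-point operations per shader per clock:

```python
# Back-of-the-envelope check on Orton's ~375 gigaflop figure.
# The 650 MHz core clock is an assumption (Radeon X1900 XTX territory);
# only the aggregate throughput number was quoted at the event.
shaders = 48
clock_hz = 650e6
total_flops = 375e9

per_shader_per_clock = total_flops / (shaders * clock_hz)
print(round(per_shader_per_clock, 1))  # 12.0 flops per shader per clock
```

That dozen-flops-per-clock figure is plausible for a shader unit issuing multiple vector multiply-add operations per cycle, which is why the aggregate number dwarfs what a CPU of the day could sustain.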

Orton was quick to emphasize that ATI is not looking to compete directly with CPUs, just to find and address a set of problems that map especially well to the GPU. He described the CPU-GPU relationship as complementary and symbiotic. He also made it clear that the day’s events were not part of a new product launch. ATI is just inaugurating a new direction in seeking out this business, he said, and showcasing some actual applications where the GPU has been fruitfully applied.

Much of the rest of the event was devoted to speakers who had actual stream computing applications to discuss or demo.

Folding@Radeon


Stanford’s Vijay Pande talks Folding on a GPU

First among them was Vijay Pande of Stanford University, Professor of Chemistry and Director of the Folding@Home project. TR readers should be very much familiar with Folding, since we field one of the top ten Folding teams in the world. Pande was there to talk about the new beta Folding client that uses the GPU. Currently, it only runs on newer Radeons, where it shows big performance increases—between 20 and 40 times the speed of a CPU. Pande said the client is presently achieving around 100 gigaflops per GPU. To give some perspective, he then demonstrated the graphical versions of the CPU and GPU clients side by side, and the GPU version showed constant motion, while the CPU one chunked along at a few frames per second.

This particular implementation of stream computing has now gone live. The FAH project released the first beta of the client to the public earlier this week.

I talked with Pande about the possibility of a Folding client for Nvidia GPUs, and he had some interesting things to say. The Folding team has obviously been working with Nvidia, as well as ATI. In fact, Pande said Nvidia has their code and is running it internally. At present, though, ATI’s GPUs are about eight times as fast as Nvidia’s. He was hopeful Nvidia could close that gap, but noted that even a 4X gap is pretty large—and ATI is getting faster all of the time.

The bottom line for Pande and his colleagues, of course, is how Folding on a GPU can further research about diseases like Parkinson’s and Alzheimer’s. Pande characterized the move to GPU Folding as one that opens new possibilities.

PeakStream makes GPU-based HPC accessible
Next up was Michael Mullany, VP of Marketing for PeakStream. This brand-new company has built a set of software tools to serve the high-performance computing (HPC) market, which is where big, high-margin players like oil and gas companies, automakers, and aerospace firms reside. PeakStream believes GPUs can bring strong outright performance, solid performance per watt, and good performance per square foot of space in the data center.

To capitalize on that opportunity, PeakStream’s software platform plugs into standard development tools like gcc and the Intel compilers to allow applications nearly transparent access to GPU computational power. PeakStream’s profiler determines whether the code being executed is a good fit for a particular type of processor, and their virtual machine provides a layer of abstraction from the execution hardware. Code that’s been profiled and fed into the VM may wind up being executed on an x86 processor, Sony’s Cell processor, or a GPU, depending on its needs.
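PeakStream didn't walk through its API at the event, so what follows is only a rough sketch of the profile-then-dispatch idea in Python. Every name below is hypothetical; this is not PeakStream's actual interface, just an illustration of the abstraction Mullany described:

```python
# Hypothetical profile-and-dispatch sketch. None of these names come from
# PeakStream's real API; they only illustrate the virtual-machine idea.

def profile(kernel):
    """Crude classifier: data-parallel kernels are flagged for the GPU."""
    return "gpu" if getattr(kernel, "data_parallel", False) else "x86"

def execute(backend, kernel, data):
    # A real VM would JIT-compile the kernel for the chosen backend
    # (x86, Cell, or GPU); this sketch just runs it on the host.
    return [kernel(x) for x in data]

def dispatch(kernel, data):
    return execute(profile(kernel), kernel, data)

def scale(x):
    return 2.0 * x
scale.data_parallel = True  # the annotation the toy profiler looks for

print(dispatch(scale, [1.0, 2.0, 3.0]))  # [2.0, 4.0, 6.0]
```

The appeal of this design is that the application never names the hardware; the same code can land on a CPU today and a GPU tomorrow.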

Mullany showed a demo that PeakStream developed while working with Hess, a large U.S. oil and gas producer, on a seismic analysis algorithm. This algorithm analyzes the echoes created by controlled explosions on the surface of the earth in order to determine the shape of the rock layers and other features beneath the ground. The analysis ran about 15 times as fast with the GPU as it did on the CPU alone, which Mullany explained would allow for new levels of resolution or new types of analysis.

Mullany said PeakStream is working with customers on a range of applications, from financial firms wishing to price derivatives to academics simulating fluid dynamics. In one instance, he said, PeakStream stepped into a project where a defense contractor was using a GPU to do signal processing in a mobile application. With its software, PeakStream was able to deliver a five-fold performance improvement.

That example perhaps best illustrates the potential value of PeakStream’s product. ATI has documented some of the workings of its GPUs for developers to use, but doesn’t really provide a robust set of tools that will allow developers to write programs in high-level languages and then compile them for the GPU. Partners like PeakStream will be very important if ATI is to make its stream computing push a success.

Microsoft pledges support
Speaking of important partners, Microsoft sent a rep to the event, as well. Chas Boyd, an Architect in Microsoft’s Graphics Platform Unit, spoke briefly about Microsoft’s support for non-traditional uses of GPUs in Windows. Boyd showed off a Windows Vista image editor that handles image processing operations on the GPU rather than the CPU, making photo editing a much quicker task. He also talked about using GPUs to handle graphical problems in a non-graphical way.

You’ll note that the demo scene above has lots of dense grass in it. This kind of detailed vegetation can cause problems for renderers, because determining which blade of grass is in front of the others is notoriously difficult. Boyd said that by using a prefix sort algorithm running on the GPU, this app is able to determine quickly and correctly the proper polygon depths and render the image correctly. The result is higher image quality, but it comes by using the GPU as a general-purpose processor.
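Boyd didn't detail the algorithm, but a sort built on a prefix sum, the primitive he named, maps naturally onto data-parallel hardware, since the scan can be split across shader units. Here's a CPU-side sketch of the idea with depths quantized into buckets (my illustration, not the code from Microsoft's demo):

```python
# Counting sort on quantized depths, built on a prefix sum -- the kind of
# data-parallel-friendly sort Boyd alluded to. CPU-side sketch only.

def prefix_sum(counts):
    """Exclusive scan: out[i] = counts[0] + ... + counts[i-1]."""
    out, running = [], 0
    for c in counts:
        out.append(running)
        running += c
    return out

def sort_by_depth(fragments, buckets=256):
    """fragments: (depth in [0, 1), payload) pairs; returns front-to-back."""
    counts = [0] * buckets
    for depth, _ in fragments:
        counts[int(depth * buckets)] += 1
    offsets = prefix_sum(counts)          # starting index of each bucket
    out = [None] * len(fragments)
    for depth, payload in fragments:
        b = int(depth * buckets)
        out[offsets[b]] = (depth, payload)
        offsets[b] += 1
    return out

frags = [(0.9, "far blade"), (0.1, "near blade"), (0.5, "mid blade")]
print(sort_by_depth(frags))   # nearest blade of grass first
```

With the fragments in depth order, each blade of grass can be composited correctly, which is the image-quality win Boyd demonstrated.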

Boyd said more parts of Microsoft are becoming engaged with GPUs as these sorts of uses expand. The entire Vista Aero user interface now runs on a GPU, and he noted that physics interactions, particles, fluids, and the like are being mapped successfully to GPUs using DirectX. Over time, he claimed, Microsoft will be evolving the DirectX API to facilitate such things—from DirectX 10 forward.

Havok FX physics go fully interactive
Boyd’s talk of physics being successfully mapped to the GPU using DirectX was surely a reference to Havok’s GPU-based physics engine, Havok FX. Jeff Yates, Havok’s VP of Product Management, followed Boyd on stage with a demo of that physics engine. Havok has shown demos of basic rigid-body physics acceleration running on GPUs in the past, but Yates also showed off a nice demo of cloth or fabric, which tends to require more computing power.

Then he produced a real surprise: Havok FX with “gameplay physics”—that is, physics interactions that affect gameplay rather than just being eye candy—running on the GPU. I wasn’t even aware they had truly interactive GPU-based physics in the works, but here was a working demo.


Brick War shows Havok FX’s gameplay physics in action

The demo game, Brick War, is based on a simple premise. Each side has a castle made out of Lego-like snap-together bricks, and the goal is to knock down all of the soldiers in the other guy’s castle by hurling cannonballs into it.

The game includes 13,500 objects, with full rigid-body dynamics for each. Havok had the demo running on a dual-GPU system, with graphics being handled by one GPU and physics by the other.

As the player fired cannonballs into his opponent’s castle, the bricks broke apart and portions of the structure crumbled to the ground realistically. Yates pointed out that the GPU-based physics simulation in Brick War is fully interactive, with the collision detection driving the rest of the rigid-body dynamics and also driving sound in the game.
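To make that feedback loop concrete, with collision detection feeding the rigid-body dynamics, here's a toy physics step in Python. It's a minimal one-dimensional illustration, not Havok's solver:

```python
# Toy rigid-body step: gravity integrates velocity, and a detected
# collision applies an impulse that drives the rest of the dynamics.
# A minimal 1-D illustration, not Havok FX's actual solver.

GRAVITY = -9.8
DT = 1.0 / 60.0       # fixed 60 Hz physics step
RESTITUTION = 0.4     # fraction of velocity kept after a bounce

def step(body):
    body["vy"] += GRAVITY * DT
    body["y"] += body["vy"] * DT
    if body["y"] < 0.0 and body["vy"] < 0.0:    # collision with the ground
        body["y"] = 0.0
        body["vy"] = -RESTITUTION * body["vy"]  # impulse response

brick = {"y": 2.0, "vy": 0.0}
for _ in range(600):   # ten simulated seconds
    step(brick)
print(brick["y"] < 0.1 and abs(brick["vy"]) < 0.5)  # True: brick at rest
```

Scale that loop up to 13,500 bricks, each with full three-dimensional dynamics, and it's clear why the demo dedicates an entire GPU to the simulation.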

Havok seems to have made quite a bit of progress on Havok FX in the past few months. According to Yates, the product is approaching beta and will soon be in the hands of game developers. When that happens, he said, game developers will need to change the way they think about physics, because the previous limits will be gone.

Yates’ was the last of the formal presentations, and a quick Q&A session followed.

Conclusions
I came away from the ATI event most impressed with the quality and relative maturity of the applications shown by the presenters. Each of them emphasized in his own way that the GPU’s much higher performance in stream computing applications opens up new possibilities for his field, and each one had a demonstration to back it up. Obviously, it’s very early in the game, but ATI has identified an opportunity here and taken the first few steps to make the most of it. As they join up with AMD, the prospects for technology sharing between the two companies look bright.

ATI still faces quite a few hurdles in meeting the needs of non-graphics markets with its GPUs, though. Today’s GPUs, for instance, don’t fully support IEEE-compliant floating-point datatypes, so getting the same results users have come to expect from CPUs may sometimes be difficult or impossible. ATI also hasn’t provided the full range of tools that developers might want—things like BLAS libraries or even GPU compilers for common high-level languages—and so will have to rely on partners like PeakStream to make those things happen. I’m just guessing here, but I’d bet a software provider that focuses on oil and gas companies doesn’t license those tools for peanuts. If stream computing is to live up to its potential, ATI will eventually have to make some of these programming tools more accessible to the public, as it has done in graphics.
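The precision concern is easy to demonstrate without a GPU. Rounding every intermediate result to single precision, roughly what a GPU of this generation does on top of its IEEE deviations, produces a different answer than a double-precision CPU run:

```python
# Accumulating the same series in single vs. double precision.
# GPUs of this era compute in (not-quite-IEEE) single precision, so
# results can drift from what a double-precision CPU run produces.
import struct

def f32(x):
    """Round a Python float (binary64) to IEEE single precision."""
    return struct.unpack("f", struct.pack("f", x))[0]

single = 0.0
double = 0.0
for i in range(1, 10_001):
    single = f32(single + f32(1.0 / i))  # every step rounded to 32 bits
    double += 1.0 / i                    # full double-precision accumulation

print(single == double)      # False: the two precisions disagree
print(abs(double - single))  # small, but nonzero
```

For graphics, discrepancies like this are invisible; for a scientific code that expects bit-identical CPU results, they're exactly the hurdle described above.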

One other interesting footnote. On the eve of ATI’s stream computing event, Nvidia’s PR types arranged a phone conference for me with Andy Keane, one of Nvidia’s GPGPU honchos. (Hard to believe, I know, but Nvidia was acting aggressively.) The purpose of the phone call was apparently just to plant a marker in the ground signaling Nvidia’s intention to do big things in stream computing, as well. Keane talked opaquely about how the current approach to GPGPU is flawed, because people are trying to twist a device into doing something for which it wasn’t designed. They’re using languages like OpenGL and Cg in unintended ways. Very soon, he claimed, Nvidia will be talking about new technology that will change the way people program the GPU, something that is “beyond the current approach.”

That was apparently all he really wanted to say on the subject, but I stepped through several of the possibilities with him, from providing better low-level documentation on the GPU’s internals to providing BLAS libraries and the like. Keane wasn’t willing to divulge exactly what Nvidia is planning, but if I had to guess, I’d say they are working on a new compiler, perhaps a JIT compiler, that will translate programs from high-level languages into code that can run on Nvidia GPUs. If so, and if they deliver it soon, ATI’s apparent lead in this field could evaporate.

For now, though, ATI is playing nice and simply letting its partners speak for it. Based on what those partners have said, the Radeon X1000 series seems better suited to non-graphics applications than Nvidia’s GeForce 7 series for a range of technical reasons, from finer threading granularity to more register space on the chip. I expect we won’t hear too much more from Nvidia on this front until after its next generation of GPUs arrives.

Comments closed
    • murfn
    • 16 years ago

    This is a question of practical computing. On the one hand you have to maintain a certain minimum precision for physics, AI, etc., and on the other there is no point attempting to achieve a higher precision than the minimum you have set. Hence, a constant precision. For multiplayer you obviously have to have the same precision across machines. And games with physics usually have multiplayer aspects.

    Any excess computational resources can be used to maximise the graphics framerate, as it has always been. Because the frame rates of physics and graphics are not synced, they cannot share the same rendering context on the GPU. Because context switching and proper scheduling on a GPU are currently difficult, GPU physics has so far been marketed as two distinct cards. And not necessarily because the higher end cards cannot handle both physics and graphics.

    • Stranger
    • 16 years ago

    Not necessarily. The only reason you’d need a fixed time step is if you had, say, multiple clients that all needed to come up with the same results, i.e. an online game where all the clients need to be on the same page. Otherwise I doubt the engine would care if the results are slightly worse than a higher-precision, smaller-timeslice estimation. It would be possible to use variable-timeslice physics. The only downside I could see to this would be how unpredictable the calculations could be, making it extremely hard to debug.

    http://en.wikipedia.org/wiki/Numerical_methods is a not-so-complicated example. http://en.wikipedia.org/wiki/Numerical_ordinary_differential_equations shows how you could just vary h depending on performance requirements. I’d imagine there would obviously be some bottom range of h where the predictions would become unpredictable.

    • sigher
    • 16 years ago

    I have both an ATI graphics card as well as an Nvidia one; the reason I mention ATI in these comments is probably because it is an article about ATI, don’t you think?
    I’m just disappointed when I hear long speeches that get me hopeful, only to find little or nothing comes of them, or what does come is slow as molasses, and I hope that when we all emote such, the companies might try harder.
    Also, I’d like to point out that ATI had a link to Cyberlink’s H.264 decoder for ATI on their site, and that you had to pay for it just as with Nvidia. If it has since been implemented into the driver, they might have made that more clear, but compliments to them nonetheless.

    • Shintai
    • 16 years ago

    H.264 is part of the driver, unlike the $$$ you’re gonna pay with nVidia. As for the different drivers etc., I guess you would say nVidia is different? No…

    In my eyes ATi and nVidia are just as useless with all their hot air. But right now I prefer ATI since they don’t mess up my video output like nVidia does.

    • sigher
    • 16 years ago

    I think it was Nvidia’s developer site that actually already had a photoshop filter long ago that used the GPU to be very speedy, as a concept model.

    Or was it some other developer site? anyway it was shown as a concept already long ago

    • sigher
    • 16 years ago

    I think ATI is all talk and not very active. They hold speeches and make press statements, but meanwhile their RenderMonkey program hasn’t been updated for years (since 2004, I understand), their site has dropped links to an H.264 accelerator and only talks about it, their HydraVision is meant for old cards, etcetera etcetera.
    Then they refuse to give developers the GPU information they need, and they don’t even have a forum; it all does not present an active commitment to me.
    The F@H client only works with driver 6.5, which is pretty old, or 6.10, which is beta, all because there’s no consistent output across drivers for applications.
    Now they reportedly bring better support in the future starting with 6.10, but they’ve been talking about this alternative GPU use for many months, and you’d expect they would have something already available.

    They do of course show a commitment to hot air, both in the press and in hardware, if you forgive me joking.
    Nevertheless I have an ATI card because they’ve got good speed and enough tricks to keep them competitive.

    • BaldApe
    • 16 years ago

    Exactly. Before PCI-E this sort of thing wasn’t possible. Now you have full duplex (forgive me if this isn’t the right term) signaling, so you can write data to main memory. There is a lot of latency, though, so it’s not much use for realtime applications like games. This so-called stream computing is another matter, though.

    It will be interesting to see how much more capable the DX10 GPUs will be for this sort of processing. The unified shader units and new features in SM 4 should have even more processing capabilities. Nvidia’s G80 GPUs will be able to compete with ATI; their current architecture isn’t as useful as ATI’s for this sort of thing.

    • BaldApe
    • 16 years ago

    Retarded?
    “I still see there being a separate video card for handling the memory access intensive stuff to prevent the main core from slowing.”

    It’s possible you could see high-end GPU/CPU products on the same package… maybe sharing a large amount of high-speed cache like you see on the 360… But when you consider that vram is moving to GDDR4, and AMD is just moving to DDR2… I just don’t see how they could keep a high-end GPU fed without drastically changing the on-die memory controller to support faster RAM…

    The more I think about it, the more convinced I am that if there are ever CPU/GPU combo parts they will be low end, and even then it makes little sense. Maybe ATI will help AMD develop a co-processor, but this will likely be for servers and workstations rather than mainstream desktop systems.

    Sorry, off subject…

    • Chryx
    • 16 years ago

    For one thing, doing it via overlay means it’s totally nondestructive.

    secondly, Ageia are chucking much less data around.

    • sigher
    • 16 years ago

    The truth of the matter is that GPUs are limited in what calculations they can do, at least on current cards; even the Folding@Home client reportedly can only work on certain calculations due to the limitations.
    And they already have separate parts on the CPU die, the FPU units and the SSE/SSE2/3DNow! units.
    But perhaps they can port some of the research from ATI to their cores, who knows. At this time AMD seems to have chosen the route of adding a HyperTransport port to motherboards and expanding the HyperTransport capabilities, and is thinking of add-on cards that communicate with the CPU in a very direct and high-speed way.

    • sigher
    • 16 years ago

    Did that for the 100 people worldwide that own an ageia card, as opposed to the millions upon millions who own a GPU.

    • murfn
    • 16 years ago

    You still have a problem with the video card having to render to its local video memory and then transferring the buffer to the system memory. I was wondering whether ATI’s cards have an architectural advantage over Nvidia in streaming applications because they sneakily added their version of TurboCache to the circuitry of their flagship GPUs.

    • Shintai
    • 16 years ago

    You can skip frames with screen output. Nothing happens; it’s an end product. You can’t with, say, physics, AI or any other, since they have to match the realtime chronological development of the game. Else you would quickly end up with a disastrous game where everything behaves unexpectedly.

    • DrDillyBar
    • 16 years ago

    Indeed. AGP suffered from poorly implemented realtime transfers for data to and from the video card. PCIe (uh) has far more bandwidth than can be currently used, and we’re not talking a’boot gaming here.

    • murfn
    • 16 years ago

    Copying back and forth from main memory to onboard memory is done by the DMA (direct memory access) manager on the card. It does not involve the CPU. And IMO PCIe can handle bandwidth in two directions at once, so there would be no time penalty there. Also, with TurboCache I believe video cards can render directly onto main memory.

    • Chryx
    • 16 years ago

    Things can be copied from video memory back to main memory, ideally asynchronously, as surely writing directly back to main memory would bottleneck the process to PCI-E or main memory speeds?

    • ripfire
    • 16 years ago

    Uh. You do realize that using pixel shaders only writes to video memory and NOT back to the main memory where it is important. Otherwise, how else would you save the applied filter to disk?

    The whole point of Stream Computing is the ability to write back to main memory.

    • Chryx
    • 16 years ago

    photoshop filters don’t really need the general purpose part of the equation..

    Coreimage on Macs is an OS side framework to do image processing on the gpu, and it’s not ‘GP’ in the slightest, it’s just using the pixel shaders to do what they do.

    • Bensam123
    • 16 years ago

    “According to Yates, the product is approaching beta and will soon be in the hands of game developers. When that happens, he said, game developers will need to change the way they think about physics, because the previous limits will be gone.”

    Wonder if he knows about PhysX and how they already did that…

    • ew
    • 16 years ago

    What is a ‘game progress framerate’?

    Do you mean to say that the physics simulation is using a constant time step? That is certainly unnecessary. The time step will be set so that the error in the simulation never gets too big.

    • ripfire
    • 16 years ago

    Wow. This is awesome. The first thing I can think of is GP-GPU Enhanced Photoshop Filters.

    Although, I didn’t see a lot of talk about media encoding capabilities.

    • BabelHuber
    • 16 years ago

    Just use your old card as a physics accelerator when you buy a new video card. It’s as easy as that – I plan to buy a X1900XT, and perhaps I can use it for this purpose later on.

    • Cybert
    • 16 years ago

    Well I’m still banned from forums despite $128+ donations. Anyway, I have studied architectures from PDP8 through the 68K and up to IA64. x86 is the running joke of the computer industry. I will take bets on the fall of x86. 5 years.

    • Cybert
    • 16 years ago

    x86 will fall. In 5 years or less. Mark these words.

    • Shintai
    • 16 years ago

    Not the screenframerate. But game progress framerate yes.

    • ew
    • 16 years ago

    Physics simulations don’t require a constant frame rate.

    • Willard
    • 16 years ago

    Nice article. Thank you.

    • murfn
    • 16 years ago

    I think the collision part of game-play physics is still done on the CPU. The GPU does all the motion calculations and the CPU collates the results and deals with collisions.

    Unlike graphics, physics requires (my guess) a constant frame rate. With two separate cards it is easier to manage. For a one-card solution you would need a scheduler that caps graphics GPU processing time. This may be possible with the GPU context switching ability that Vista will enforce on GPUs.

    • liquidsquid
    • 16 years ago

    THIS is likely the reasoning behind AMD & ATI as well as Intel & nVidia. With multi-core processors being the future, they can strip out the graphics-specific portions of the GPU and then drop it on the same processor die for highly parallel computing. Imagine 2 CPU cores and 2 GPU cores on one die, and what performance that would provide! Nobody said that all the cores have to be the same on a die.

    I still see there being a separate video card for handling the memory access intensive stuff to prevent the main core from slowing.

    • Jigar
    • 16 years ago

    Cool…. Seems like DX 10 and Phys…. will shine

    • WaltC
    • 16 years ago

    Interesting and informative article. Makes a convincing counter-proposition to all the silly fluff emerging lately about “real-time ray tracing,” the main purpose of which is to promote cpus, and which we seem to hear every time a new cpu architecture is released…;) I prefer not to confuse my cpus with my gpus, and would prefer they remain separate and complementary. I don’t think there’s much danger of gpus replacing cpus, or vice-versa.

    • Dposcorp
    • 16 years ago

    Nicely done, Scott.
    Interesting info, to say the least.
    I enjoyed reading it.

    • IntelMole
    • 16 years ago

    DSPs as a rule will have a completely different ISA.

    So does this, but at least you have the x86 to manage the VM in PeakStream. Besides which, this is just throwing a packet of instructions at the GPU and saying “go calculate”; it’s not doing a lot of the program flow / branch prediction / caching etc. that you might find on a DSP.

    • blastdoor
    • 16 years ago

    Great article.

    I hope the next generation of GPUs allows both graphics and game-play physics to be run at good FPS on the same card. There’s no way I’m buying two graphics cards.

    • Freon
    • 16 years ago

    For some reason I think getting an ASM instruction set, register info, etc. isn’t in the cards.

    The question of libraries and compilers is a big one.

    And a lot of this sounds like new ways to spin what is essentially a DSP. At least the bonus is the hardware exists regardless. Might as well use it. But it will only get used if it is convenient for programmers, which gets back to this huge compiler/processor data sheet question.

    • harmisajedi
    • 16 years ago

    fascinating write-up.

    one wonders exactly what the full potential of torrenza is when the know-how & product lines brought by ati to the fold are thrown into the mix…

    • Stranger
    • 16 years ago

    Nvidia’s comments are interesting. If I were to wildly guess what he was talking about, I’d bet it has to do with the rumors that each of the four vector components is broken into its own “stream processor” on the G80.

    • Damage
    • 16 years ago

    Ugh. Fixed.

    • Stranger
    • 16 years ago

    Minor fix

    Orton pegged the floating-point power of today’s top Radeon GPUs with 48 pixel shader processors at about 375 gigaflops, with 64 MB/s of memory bandwidth. The next generation, he said, could potentially have 96 shader processors and will exceed half a teraflop of computing power.
