Floating-point units in server-grade CPUs

Discussion of all forms of processors, from AMD to Intel to VIA.

Moderators: Flying Fox, morphine

Re: Floating-point units in server-grade CPUs

Posted on Thu Nov 11, 2010 7:53 pm

SecretSquirrel wrote:Interesting that you just proved our point and interpreted it the wrong way.

Shining Arcanine wrote:As for Android, I recently ran Sunspider on a Google Nexus One, which did not take much longer to complete than Sunspider in Google Chromium on my laptop's Intel Core T2400 processor. I think that the difference was 5 seconds versus 2 seconds, which is negligible. As mentioned earlier, Javascript relies solely on floating point operations, so it would seem that the "emulation" is not as slow as you would think it is


So what you are saying is that 3 seconds doesn't make much difference, and you are correct. What you missed is that your "emulated" version on the Nexus One took 2.5x as long to run as the version on your desktop. Not a big deal when we are talking about less than ten seconds. But what about something that takes an hour to run on your desktop? Perhaps a POV-Ray render or recoding that movie rip. Now it will take two and a half hours. That's a big difference. I happen to work in "The Real World (tm)" supporting some folks who actually make chips for a living, and I can tell you that if Intel or AMD decided to drop the FPU and a lot of what we do took 2.5x longer, that manufacturer would not have another processor in our company. Despite what you may claim, a large category of problems do not lend themselves well to massive parallelization due to data dependencies in the calculation, algorithmic limitations, IO requirements, etc.

--SS


Please name the problems you cite. If they are large enough that they are taking a noticeable amount of time, then I am certain that you will find a way to parallelize them. Google Chromium is an excellent example of this, where putting each page into its own separate process parallelized webpage rendering in a tabbed web browser, which was slow under the single renderer-thread approach Firefox took. I doubt that everything you run is one massive problem that cannot be broken into separate threads, and if it can be, you can likely put it into a SIMD programming model. Regardless, everyone, everywhere agrees that the single-threaded programming model is a dead end in terms of performance. Any business that cannot parallelize its critical software applications will be killed by those that can, in which case the strength of a single processing unit does not matter so long as you have a sufficiently large number of them.

wibeasley wrote:
Shining Arcanine wrote:I think you missed the bottomline, which is that floating point performance is not important in CPUs to the point where people should be arguing over how well AMD's floating point units in their new CPUs perform. That is why I asked why people care about it in the first place and it is also why I explained why the units are unnecessary. The performance of unnecessary units is not really an area that merits people's attention.
Here are two more GPGPU people who believe that FPUs aren't unnecessary.
Could this obviate the need for extensive concurrency training for software developers? Can they simply offload parallel computation to the GPU, which, unlike the CPU, has the potential to linearly scale performance the more cores it has? Can you just “fire it and forget it,” as Sanford Russell, general manager of CUDA and GPU Computing at Nvidia, puts it? Sorry, no.

“The goal is not to offload the CPU. Use CPUs for the things they’re best at and GPUs for the things they’re best at,” said (Mark) Murphy. An example is a magnetic resonance imaging reconstruction program that he found worked best on the six-core Westmere CPU. “The per-task working set just happened to be 2MB, and the CPU had 12MB of cache per socket and six cores," said Murphy. "So if you have a problem with 2MB or less per task, then it maps very beautifully to the L3 cache. Two L3 caches can actually supply data at a higher rate than the GPU can."
http://www.sdtimes.com/content/article.aspx?ArticleID=34842&page=2


I think you are ignoring the point being that if I can make a decent argument for them being unnecessary, then their actual performance is not really something that should be a concern for people.
Disclaimer: I over-analyze everything, so try not to be offended if I over-analyze something you wrote.
Shining Arcanine
Gerbil Jedi
 
Posts: 1717
Joined: Wed Jun 11, 2003 11:30 am

Re: Floating-point units in server-grade CPUs

Posted on Thu Nov 11, 2010 8:04 pm

Shining Arcanine wrote:... so it would seem that the "emulation" is not as slow as you would think it is



Shining Arcanine wrote:I was running it on an actual Nexus One, not an emulated one. The Android emulator took about 10.5 minutes to do Sunspider on my laptop.


I'm sorry, you stated that, because it only took a few seconds longer, "emulation" is not as slow as you would think. As this was a discussion about FPUs, my assumption was that you took those results to mean that FP emulation on the Nexus One was not that much slower than running with a real FPU on your desktop machine. Please enlighten me about your above quoted statement regarding "emulation".

--SS
SecretSquirrel
Gerbil Jedi
Gold subscriber
 
 
Posts: 1738
Joined: Tue Jan 01, 2002 7:00 pm
Location: The Colony, TX (Dallas suburb)

Re: Floating-point units in server-grade CPUs

Posted on Thu Nov 11, 2010 8:06 pm

just brew it! wrote:
Glorious wrote:
Shining Arcanine wrote:The situation today with GPUs is much different than the situation in the past when you had accessory chips whose only purpose was to make the CPU better. While those were like the male angler-fish, the GPU is not.

Interesting. Here on our planet, in the real world, the GPU is indeed an "Accessory chip" that makes "the CPU better."

You know, seeing how a CPU can do everything a GPU can do, albeit slower, and how a GPU is useless without a CPU.

Yup. A GPU is quite literally a specialized, highly parallelized FPU... one that just happens to have a couple of video outputs hanging off of it.


In computer hardware, floating point units are logic units that take data inputs and an operation input and produce a data output according to those inputs, with a mapping from inputs to outputs that conforms to the IEEE 754 standard. If your statements are correct in saying that GPUs are floating point units, then block diagrams of GPUs contradict your statements by failing to adhere to the definition of a floating point unit. Here is a block diagram for a recent GPU:

http://techreport.com/r.x/nvidia-fermi/fermi-block.png

Since what you say contradicts the definition of a floating point unit, what do you consider a floating point unit to be?
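As an aside for readers: the "mapping from inputs to outputs" that IEEE 754 defines is visible directly in a float's bit layout. A minimal Python sketch (the helper name is my own, for illustration):

```python
import struct

def ieee754_fields(x: float) -> tuple:
    """Decode a single-precision float into its IEEE 754 fields:
    sign (1 bit), biased exponent (8 bits), mantissa (23 bits)."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF

# -1.0 is 1.0 x 2^0 negated: sign=1, exponent=127 (the bias), mantissa=0
print(ieee754_fields(-1.0))  # (1, 127, 0)
```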

By the way, as a side note, page 106 of Nvidia's CUDA programming guide states that integer types are supported, which means that you can do integer operations on Nvidia's GPUs:

http://developer.download.nvidia.com/co ... de_3.1.pdf

SecretSquirrel wrote:
Shining Arcanine wrote:... so it would seem that the "emulation" is not as slow as you would think it is



Shining Arcanine wrote:I was running it on an actual Nexus One, not an emulated one. The Android emulator took about 10.5 minutes to do Sunspider on my laptop.


I'm sorry, you stated that, because it only took a few seconds longer, "emulation" is not as slow as you would think. As this was a discussion about FPUs, my assumption was that you took those results to mean that FP emulation on the Nexus One was not that much slower than running with a real FPU on your desktop machine. Please enlighten me about your above quoted statement regarding "emulation".

--SS


Emulation is usually used in reference to simulating a full machine. Emulating instructions usually means doing things that are logically equivalent to the instructions without actually using them. When you said emulation, I thought you were referring to running Sunspider on an actual Android emulator, which is something I have done. When I realized your misuse of terminology, I edited my post to compensate for it.
Disclaimer: I over-analyze everything, so try not to be offended if I over-analyze something you wrote.
Shining Arcanine
Gerbil Jedi
 
Posts: 1717
Joined: Wed Jun 11, 2003 11:30 am

Re: Floating-point units in server-grade CPUs

Posted on Thu Nov 11, 2010 11:35 pm

Thread: Fail.
Buub
Maximum Gerbil
Silver subscriber
 
 
Posts: 4214
Joined: Sat Nov 09, 2002 11:59 pm
Location: Seattle, WA

Re: Floating-point units in server-grade CPUs

Posted on Fri Nov 12, 2010 1:22 am

Shining Arcanine wrote:
wibeasley wrote:Here are two more GPGPU people who believe that FPUs aren't unnecessary.
...
I think you are ignoring the point being that if I can make a decent argument for them being unnecessary, then their actual performance is not really something that should be a concern for people.
That's an unlikely 'if'. I don't think anyone in this thread but yourself is convinced by your arguments. Those two quotes are by two additional people who don't think FPUs are unnecessary.
wibeasley
Gerbil Elite
Gold subscriber
 
 
Posts: 952
Joined: Sat Mar 29, 2008 3:19 pm
Location: Norman OK

Re: Floating-point units in server-grade CPUs

Posted on Fri Nov 12, 2010 6:51 am

I think considering "server-grade CPUs" in isolation is fundamentally flawed. Sure, there are chips that only get used in servers... Itanium, POWER, SPARC, etc., but these all seem to be losing ground to cheapo x86-based chips. This isn't happening because one architecture is better than another, and it's not happening because one has an FPU and one hasn't. It's because x86 is cheap!

Why are x86 chips cheap? Because they are mass-produced desktop CPUs that go through a slightly different QA process as they come off the production line. They're produced in the hundreds of millions, something that can't be said for POWER, SPARC, etc.

Being ubiquitous also has a knock on effect of making software for x86 cheaper too.

So x86 server CPUs need a half-decent FPU so the same design can be used as a desktop processor, not because they need one for server workloads. Sure, AMD/Intel could disable the FPUs in their Opterons/Xeons, but what would be the point? It's not as if the FPU represents a large proportion of the chip (in terms of transistors or heat), which is mostly cache.

Are traditional FPUs starting to fade into insignificance? Certainly looks that way but given how we're only just starting to look at replacing the BIOS I can't see it happening any time soon.

EDIT:
As for Android, I recently ran Sunspider on a Google Nexus One, which did not take much longer to complete than Sunspider in Google Chromium on my laptop's Intel Core T2400 processor. I think that the difference was 5 seconds versus 2 seconds, which is negligible. As mentioned earlier, Javascript relies solely on floating point operations, so it would seem that the "emulation" is not as slow as you would think it is


I'm no expert on ARM, but doesn't the Snapdragon chip in the Nexus actually have an FPU? According to this page:
http://www.arm.com/products/processors/technologies/vector-floating-point.php
an IEEE 754 FPU is optional for v7 and up.

Here's a copy and paste from the interwebs:
cpuinfo for a Nexus One:
Processor : ARMv7 Processor rev 2 (v7l)
BogoMIPS : 162.54
Features : swp half thumb fastmult vfp edsp thumbee neon
CPU implementer : 0x51
CPU architecture: 7
CPU variant : 0x0
CPU part : 0x00f
CPU revision : 2
Hardware : mahimahi
Revision : 0081
Serial : 0000000000000000
ARM Floating Point architecture (VFP) provides hardware support for floating point operations in half-, single- and double-precision floating point arithmetic. It is fully IEEE 754 compliant with full software library support.
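For what it's worth, the Features line above is easy to check programmatically; a small sketch (the flag set tested for is illustrative, not exhaustive):

```python
# Sample of the cpuinfo dump quoted above
CPUINFO = """Processor : ARMv7 Processor rev 2 (v7l)
Features : swp half thumb fastmult vfp edsp thumbee neon
"""

def has_hw_float(cpuinfo: str) -> bool:
    """Return True if the Features line advertises a hardware FP unit
    (VFP or NEON)."""
    for line in cpuinfo.splitlines():
        if line.startswith("Features"):
            flags = set(line.split(":", 1)[1].split())
            return bool(flags & {"vfp", "vfpv3", "vfpv4", "neon"})
    return False

print(has_hw_float(CPUINFO))  # True -- the Nexus One advertises vfp and neon
```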
Fernando!
Your mother ate my dog!
cheesyking
Minister of Gerbil Affairs
 
Posts: 2285
Joined: Sun Jan 25, 2004 7:52 am
Location: That London (or so I'm told)

Re: Floating-point units in server-grade CPUs

Posted on Fri Nov 12, 2010 10:12 am

SA wrote:If they are large enough that they are taking a noticeable amount of time, then I am certain that you will find a way to parallelize them


Dude, when you say things like that, we have to wonder if you even understand what the word "parallelize" means. Not every problem is parallelizable. The classic analogy is that bearing a child takes nine months, no matter how many women you have. It is not impossible, or even all that unlikely, to have a bunch of dependent calculations. Thus not every problem can be split up and run concurrently.

SA wrote:Google Chromium is an excellent example of this, where putting each page into its own separate process parallelized webpages rendering in a tabbed web browser, which was slow with the single renderer thread approach Firefox took.


That's a "problem" that's obviously parallelizable because it is composed of independent tasks. Each webpage doesn't have any dependency on any other webpage. That's not finding concurrency within a problem; that's just having a bunch of different problems to begin with.

And if it were strictly a performance question, they would have just threaded it; the primary reason behind the process-per-window/tab concept was the security model that comes with processes.

SA wrote: I doubt that everything you run is one massive problem that cannot be broken into separate threads and if it can, you can likely put it into a SIMD programming model.


You're the one speaking in absurd absolutes here, not us. NO ONE has claimed that nothing is parallelizable. They are just claiming that some things aren't, and you can't handwave that away.

SA wrote:Regardless, everyone, everywhere agrees that the single threaded programming model is a dead-end in terms of performance.


The performance increases have slowed down, this is true, but single-thread performance still matters and will continue to matter. If it DIDN'T matter, you'd see ICs with 16 in-order cores taking the world by storm. I don't see that, do you?

This ridiculous faith you have in the notion that every problem can be parallelized is just, well, absurd. It plainly isn't true. It's also more complicated than that, because even if most of your problem is parallelizable, there is still a hard limit to how much performance you can gain by throwing parallel execution at it. Guess what the limit is? Oh, right: the amount of time your program spends in the parts that aren't parallelizable. You only asymptotically approach that limit by adding more and more parallel execution! In other words, even in a world with "free" parallelization hardware (instantaneously fast Teslas for everyone!), single-threaded performance will always matter. In fact, such a world would make single-threaded performance the DETERMINING factor!

It will still matter! It will always matter!

What I am referring to is known as Amdahl's law. Ubergerbil wrote a great post about this some years back.

viewtopic.php?f=2&t=44090&hilit=amdahl
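For readers who don't want to dig through that thread, Amdahl's law fits in one line; a quick sketch with made-up numbers:

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: overall speedup when a fraction p of the work is
    parallelized across n workers and the remaining (1 - p) stays serial."""
    return 1.0 / ((1.0 - p) + p / n)

# Even with 95% of the work parallelized, a million cores can't
# push the speedup past 1 / 0.05 = 20x:
print(amdahl_speedup(0.95, 1_000_000))  # just under 20.0
```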

SA wrote: Any business that cannot parallelize its critical software applications will be killed by those that can, in which case, the strength of a single processing unit does not matter so long as you have a sufficiently large number of them.


That may be true if performance is extremely important to your product and you're leaving possible concurrency on the table that your competitors are picking up, but it's not true if your product is designed to deal with problems that inherently cannot be parallelized well.

And, again, if the strength of a single "processing unit" didn't matter, why don't we see ICs with umpteen in-order cores dominating the market?

SA wrote:I think you are ignoring the point being that if I can make a decent argument for them being unnecessary, then their actual performance is not really something that should be a concern for people.


If anyone is "ignoring" your point, that's because your "point" is a fantasy. You're not making a decent argument that they're unnecessary, you're just waving your hand and saying they are.

It's like starting a mathematical proof with an a priori definition for division by zero and then "proving" a whole host of mathematical concepts. Yes, you can do some pretty groundbreaking things once you do that (1 can now equal 2, AWESOME!). It's just that, well, you know, we're not really impressed. Saying we should just ignore your first statement and concentrate on your later work because it's so incredible is missing the point.

SA wrote:In computer hardware, floating point units are logical units that take data inputs and a input and produce a data output according to those inputs, with a mapping from inputs to outputs that corresponds to the IEEE754 standard.


Not that I fully understand what the heck you even mean, but IEEE 754 is a bit more than just "how do I perform operations on floats of like precision." There are subtle but incredibly important matters like "how do I do operations between floats of differing precisions" and "how do I handle exceptions." There are rounding modes, FMAs, subnormals, lions, tigers and bears! Not-so-incidentally, those kinds of things are actually the complex parts of the standard that take up the majority of its text.

SA wrote:If your statements are correct in saying that GPUs are floating point units, then block diagrams of GPUs contradict your statements by failing to adhere to the definition of a floating point unit. Here is a block diagram for a recent GPU:


Here's what Scott prefaced that diagram with:

Scott Wasson wrote: Images like the one below may not mean much divorced from context


He's only more right when they are used in the WRONG context.

SA wrote:Since what you say contradicts the definition of a floating point unit, what do you consider a floating point unit to be?


Logic that is intended to deal with Floats?

SA wrote:By the way, as a side note, page 106 of Nvidia's CUDA programming guide states that integer types are supported, which means that you can do integer operations on Nvidia's GPUs:


Do all of them handle them natively, though, or just Fermi? Because the fact that a programming framework can use them doesn't exactly mean a whole lot by itself, you know?

And, in respect to Fermi, it's perhaps more of a vector processor than a straight FPU, which JBI covered by saying "specialized" and "highly parallel." So, what do you think you are showing?

SA wrote:Emulation is usually used in reference to simulating a full machine.


News to me. When people are talking about FPUs and embedded processors, they're usually talking about software emulation, kernel emulation, or how the processor can emulate having an FPU through microcode that just uses its ALU. In all cases, you're not simulating the "full machine," and in software emulation, you're not even simulating instructions at all.

SA wrote:When I realized your misuse of terminology, I edited my post to compensate for it.


:roll: Just because he uses a word in a context you're unfamiliar with doesn't mean he's wrong. It's just your raging absolutism leading you into silliness again.

Just because you think you have a really cool, nicely defined, easily understood box doesn't mean you can suddenly stuff the entire world into it. And your box sucks anyway. Stop telling us what you *think* you've learned in class and actually pay more attention. This isn't just the real world versus the academy, because you regularly get the theory wrong too.
Glorious
Darth Gerbil
Gold subscriber
 
 
Posts: 7886
Joined: Tue Aug 27, 2002 6:35 pm

Re: Floating-point units in server-grade CPUs

Posted on Fri Nov 12, 2010 11:51 am

wibeasley wrote:
Shining Arcanine wrote:
wibeasley wrote:Here are two more GPGPU people who believe that FPUs aren't unnecessary.
...
I think you are ignoring the point being that if I can make a decent argument for them being unnecessary, then their actual performance is not really something that should be a concern for people.
That's an unlikely 'if'. I don't think anyone in this thread but yourself is convinced by your arguments. Those two quotes are by two additional people who don't think FPUs are unnecessary.


<x> implies <y> and <x> is false. Is <y> true or false? If you do not know the answer to this, then you do not understand the logic involved.

By the way, hardware can never compensate for programmers who are incapable of parallelizing code. If having hardware floating point units in a CPU is the difference between a program performing and a program failing to perform, then the program is not written properly. The limiting factor should always be the parallelism of the underlying hardware. That is the way forward.

cheesyking wrote:I think considering "server-grade CPUs" in isolation is fundamentally flawed. Sure there are chips that only get used in servers... Itanium, Power, Sparc etc but these all seem to be loosing ground to cheapo x86 based chips. This isn't happening because one architecture is better than another, it's not happening because one has an FPU and one hasn't. It's because x86 is cheap!


Morphine chose the thread title. Originally, this was in a thread about Bulldozer. I asked why floating point performance in CPUs was important, and he replied that it mattered for servers. He later split the thread and gave it a new title based on that.

cheesyking wrote:
As for Android, I recently ran Sunspider on a Google Nexus One, which did not take much longer to complete than Sunspider in Google Chromium on my laptop's Intel Core T2400 processor. I think that the difference was 5 seconds versus 2 seconds, which is negligible. As mentioned earlier, Javascript relies solely on floating point operations, so it would seem that the "emulation" is not as slow as you would think it is


I'm no expert on ARM, but doesn't the Snapdragon chip in the Nexus actually have an FPU? According to this page:
http://www.arm.com/products/processors/technologies/vector-floating-point.php
an IEEE 754 FPU is optional for v7 and up.


TheEmrys made what appeared to me to be the broad generalization that Android devices lacked floating point units, so I stated what the implications of that would be based on my experience with the Nexus One.
Disclaimer: I over-analyze everything, so try not to be offended if I over-analyze something you wrote.
Shining Arcanine
Gerbil Jedi
 
Posts: 1717
Joined: Wed Jun 11, 2003 11:30 am

Re: Floating-point units in server-grade CPUs

Posted on Fri Nov 12, 2010 12:23 pm

Shining Arcanine wrote:I think you are ignoring the point being that if I can make a decent argument for them being unnecessary, then their actual performance is not really something that should be a concern for people.
wibeasley wrote:That's an unlikely 'if'. I don't think anyone in this thread but yourself is convinced by your arguments. Those two quotes are by two additional people who don't think FPUs are unnecessary.
<x> implies <y> and <x> is false. Is <y> true or false? If you do not know the answer to this, then you do not understand the logic involved.
<x>: SA "can make a decent argument for them being unnecessary"
<y>: "their actual performance is not really something that should be a concern for people"
I agree that <x> is false (or at least lacks support), and so we don't have evidence for claiming if <y> is true or false.
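Spelling out the truth table makes the point concrete; a tiny sketch:

```python
def implies(x: bool, y: bool) -> bool:
    """Material implication: x -> y is false only when x is true and y is false."""
    return (not x) or y

# With a false premise the implication holds for either value of y,
# so knowing "x implies y" and "x is false" tells us nothing about y:
print(implies(False, True), implies(False, False))  # True True
```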

Furthermore, everyone in this thread appears to be concerned with their performance (with a variety of their own convincing if x, then y arguments).
wibeasley
Gerbil Elite
Gold subscriber
 
 
Posts: 952
Joined: Sat Mar 29, 2008 3:19 pm
Location: Norman OK

Re: Floating-point units in server-grade CPUs

Posted on Fri Nov 12, 2010 12:26 pm

Shining Arcanine wrote:hardware can never compensate for programmers who are incapable of parallelizing code. If having hardware floating point units in a CPU is the difference between a program performing and a program failing to perform, then the program is not written properly. The limiting factor should always be the parallelism of the underlying hardware.
What about Markov chain Monte Carlo? It's a tool that, almost by itself, made multivariate Bayesian inference tractable for real-world problems. MCMC is similar to a random walk, in the sense that the current iteration is dependent on the previous iteration. There are tricks to parallelize it to a limited degree (such as running 4 to 8 concurrent chains), but most MCMCs need 10,000+ iterations within each chain to converge to a stable estimate. And Bayesian statistics is becoming a huge part of statistics.
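To make the serial dependence concrete, here is a toy random-walk Metropolis chain targeting a standard normal (step size and iteration count are arbitrary choices for illustration). Each state is computed from the previous one, which is exactly what blocks parallelizing a single chain:

```python
import math
import random

def metropolis_chain(n_iter: int, seed: int = 42) -> list:
    """Random-walk Metropolis sampler for a standard normal target.
    The loop is inherently serial: iteration t needs the state from t-1."""
    rng = random.Random(seed)
    x, chain = 0.0, []
    for _ in range(n_iter):
        proposal = x + rng.uniform(-1.0, 1.0)  # step from the CURRENT state
        # Accept with probability min(1, pi(proposal) / pi(x)), pi ~ N(0, 1)
        if rng.random() < math.exp(min(0.0, (x * x - proposal * proposal) / 2.0)):
            x = proposal
        chain.append(x)
    return chain

samples = metropolis_chain(10_000)
print(sum(samples) / len(samples))  # sample mean, near 0 once the chain mixes
```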

I'd be thrilled if you were a programmer who wasn't incapable of parallelizing MCMC code so that it could saturate hundreds of cores.
wibeasley
Gerbil Elite
Gold subscriber
 
 
Posts: 952
Joined: Sat Mar 29, 2008 3:19 pm
Location: Norman OK

Re: Floating-point units in server-grade CPUs

Posted on Fri Nov 12, 2010 7:36 pm

wibeasley wrote:I'd be thrilled if you were a programmer who wasn't incapable of parallelizing MCMC code so that it could saturate hundreds of cores.


You know what, I bet he is. He does not understand why a hardware FPU is remotely useful to anyone, and can't write a working "for" loop, but I feel confident that SA is a virtuoso at writing efficient, massively parallel code by hand.
Saber Cherry
Gerbil XP
 
Posts: 303
Joined: Fri Mar 14, 2008 3:41 am
Location: Crystal Tokyo

Re: Floating-point units in server-grade CPUs

Posted on Fri Nov 12, 2010 7:58 pm

Saber Cherry wrote:but I feel confident that SA is a virtuoso at writing efficient, massively parallel code by hand.

Writing I'm not sure, but talking about it, he is a master.
There is a fixed amount of intelligence on the planet, and the population keeps growing :(
morphine
Grand Admiral Gerbil
Silver subscriber
 
 
Posts: 10092
Joined: Fri Dec 27, 2002 8:51 pm
Location: Portugal (that's next to Spain)

Re: Floating-point units in server-grade CPUs

Posted on Fri Nov 12, 2010 10:58 pm

wibeasley wrote:
Shining Arcanine wrote:I think you are ignoring the point being that if I can make a decent argument for them being unnecessary, then their actual performance is not really something that should be a concern for people.
wibeasley wrote:That's an unlikely 'if'. I don't think anyone in this thread but yourself is convinced by your arguments. Those two quotes are by two additional people who don't think FPUs are unnecessary.
<x> implies <y> and <x> is false. Is <y> true or false? If you do not know the answer to this, then you do not understand the logic involved.
<x>: SA "can make a decent argument for them being unnecessary"
<y>: "their actual performance is not really something that should be a concern for people"
I agree that <x> is false (or at least lacks support), and so we don't have evidence for claiming if <y> is true or false.

Furthermore, everyone in this thread appears to be concerned with their performance (with a variety of their own convincing if x, then y arguments).


With whom do you agree? I certainly do not think <x> is false in that case.

Furthermore, how does a hardware floating point unit differ from a placebo? Everyone can be convinced that it helps, but does it really help outside of synthetic benchmarks?

wibeasley wrote:
Shining Arcanine wrote:hardware can never compensate for programmers are incapable of parallelizing code. If having hardware floating point units in a CPU is the difference between a program performing and a program failing to perform, then the program is not written properly. The limiting factor should always be the parallelism of the underlying hardware.
What about Markov chain Monte Carlo? It's a tool that, almost by itself, made multivariate Bayeisan inference tractable for real world problems. MCMC is similar to a random walk, in the sense that the current iteration is dependent on the previous iteration. There are tricks to parallelize it to a limited degree (such as running 4 to 8 concurrent chains), but most MCMCs need 10,000+ iterations within each chain to converge to a stable estimate. And Bayesian statistics is becoming a huge part of statistics.

I'd be thrilled if you were a programmer who wasn't incapable of parallelizing MCMC code so that it could saturate hundreds of cores.


As far as I know, Monte Carlo integration can be parallelized on a GPU. A parallel pseudo-random number generator using the Mersenne Twister with the Box-Muller transform exists on Nvidia's website:

http://developer.download.nvidia.com/co ... mples.html

While I have not read the paper to verify their claims, I assume that their claim to have created a parallel Mersenne Twister algorithm is valid. If it is not valid, then I know that it is possible to produce n O(1) pseudo-random number generators using the xor method suggested in Numerical Recipes. The only issue would be calculating the constants used in each generator, but it should be possible to produce a list of tens of thousands (e.g. K = 32768) of sets of valid constants for the xor method, have the program initialize as many as it needs (up to the limit K), and then call each in parallel as a source of random numbers. By that method, pseudo-random number generation is parallelized. Keep in mind that each set of constants will need to be put through the Diehard tests to make certain that it is a good selection, although that only needs to be done once, so it is O(1) when you have the end result in a table.
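For illustration, the "xor method" here is the xorshift family of generators; a 32-bit sketch using one published full-period shift triple, (13, 17, 5). The "table of constants" idea amounts to precomputing many such validated triples, one per parallel stream:

```python
def xorshift32(seed: int):
    """Generator yielding a 32-bit xorshift stream (Marsaglia's (13, 17, 5)
    triple). Any nonzero seed gives a period of 2**32 - 1."""
    state = seed & 0xFFFFFFFF
    assert state != 0, "xorshift state must be nonzero"
    while True:
        state ^= (state << 13) & 0xFFFFFFFF
        state ^= state >> 17
        state ^= (state << 5) & 0xFFFFFFFF
        yield state

gen = xorshift32(1)
first_thousand = [next(gen) for _ in range(1000)]
print(len(set(first_thousand)))  # 1000 -- no repeats this early in the period
```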

With a parallel pseudo-random number generator, you could then scale each number produced to the interval of integration, use that as an input to the integrand, and then obtain the output. All of that should be doable in parallel in O(1) time. You could then sum the result in O(log(n)) steps. Since you will need to repeat this process across millions of numbers, the total computational complexity of the calculation is O(m*log(n)), where m is the number of Monte Carlo iterations and n is the number of threads executing concurrently on the GPU. This process can be run in parallel on millions of cores simultaneously.
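A sketch of that scheme, with serial Python standing in for the GPU threads (function names are mine): draw uniforms, scale them to the interval, evaluate the integrand at each point, then reduce with a pairwise tree sum, which is the part that would take O(log n) steps if each level ran in parallel:

```python
import random

def tree_sum(values) -> float:
    """Pairwise (tree) reduction; with one thread per pair, each while-loop
    level would run in parallel, giving O(log n) sequential steps."""
    values = list(values)
    while len(values) > 1:
        if len(values) % 2:
            values.append(0.0)  # pad to an even count
        values = [values[i] + values[i + 1] for i in range(0, len(values), 2)]
    return values[0]

def mc_integrate(f, a: float, b: float, n: int, seed: int = 0) -> float:
    """Monte Carlo estimate of the integral of f over [a, b]."""
    rng = random.Random(seed)
    # Scaling the draws and evaluating f are embarrassingly parallel steps
    samples = [f(a + (b - a) * rng.random()) for _ in range(n)]
    return (b - a) * tree_sum(samples) / n

# Integral of x^2 over [0, 1] is 1/3; the estimate lands close to 0.3333
print(mc_integrate(lambda x: x * x, 0.0, 1.0, 100_000))
```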

Considering that there is enough parallelism to exploit in the Monte Carlo integration that you likely will be unable to exhaust it, Amdahl's law slowdowns will occur exclusively in other aspects of the computation. Since the Monte Carlo integration should dominate the rest of the calculation by orders of magnitude, parallelizing the Monte Carlo integration would count as having parallelized the entire calculation in practice. If the Monte Carlo integration does not dominate the computation, then please pardon my ignorance. While I know a decent amount about Monte Carlo integration, my knowledge of Markov chains is poor. My numerical analysis class is ahead of schedule, so I will ask my professor if he would cover them on a day we have nothing to do. I will leave it to you to tell me whether or not the above description counts as having parallelized Markov chain Monte Carlo, but having done Monte Carlo integration firsthand, I suspect that it does.

A Markov Chain Monte Carlo calculation using a parallel Monte Carlo integration without hardware floating point units will outperform a Markov Chain Monte Carlo calculation using a single threaded Monte Carlo integration with a hardware floating point unit. Having it both be parallel and use hardware floating point units is better than either case, but the improvement from the hardware floating point unit is bounded by a constant factor and that in itself should not excite anyone, which was the reason why I asked why people cared about CPU floating point performance in the first place. It does not seem like something that should excite people.

By the way, how do you know about MCMC calculations? I would never have expected to encounter anyone here familiar with the subject.

morphine wrote:
Saber Cherry wrote:but I feel confident that SA is a virtuoso at writing efficient, massively parallel code by hand.

Writing I'm not sure, but talking about it, he is a master.


It is far easier to talk about something than it is to do it. I described a table of sets of constants for the xor method and then proceeded to use it as if it already existed. In practice, producing that table requires a significant amount of computation. Each set of constants represents one pseudo-random number generator, and it would take a significant amount of time to produce on the order of 10^5 pseudo-random number generators with good statistical randomness. A supercomputer would likely need to be employed to do the calculations necessary to produce the table, and even then, it could take months.
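For illustration, a single generator from such a table could look like this (a sketch under the assumption that the "xor method" refers to Marsaglia-style xorshift, where one triple of shift constants defines one generator; the names are mine):

```python
def xorshift32(seed, shifts=(13, 17, 5)):
    """Marsaglia-style 32-bit xorshift generator.

    Each valid triple of shift constants defines a different
    generator; vetting ~10^5 such triples for statistical quality
    is the expensive table-building step described above.
    """
    mask = 0xFFFFFFFF
    a, b, c = shifts
    x = seed & mask
    if x == 0:
        raise ValueError("seed must be nonzero")
    while True:
        x ^= (x << a) & mask   # left shift, truncated to 32 bits
        x ^= x >> b            # right shift needs no mask
        x ^= (x << c) & mask
        yield x

gen = xorshift32(1)
first = next(gen)   # 270369 with the classic (13, 17, 5) triple
```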
Disclaimer: I over-analyze everything, so try not to be offended if I over-analyze something you wrote.
Shining Arcanine
Gerbil Jedi
 
Posts: 1717
Joined: Wed Jun 11, 2003 11:30 am

Re: Floating-point units in server-grade CPUs

Posted on Fri Nov 12, 2010 11:27 pm

Shining Arcanine wrote:Having it both be parallel and use hardware floating point units is better than either case, but the improvement from the hardware floating point unit is bounded by a constant factor and that in itself should not excite anyone, which was the reason why I asked why people cared about CPU floating point performance in the first place. It does not seem like something that should excite people.

By the way, how do you know about MCMC calculations? I would never have expected to encounter anyone here familiar with the subject.

These two points right here are 2 out of 3 reasons why SA is everybody's favorite poster. (The third, of course, would be if he also said that O(n) and O(n^2) describe the same class of algorithms, thereby getting the theory wrong too.)

So, Shining Arcanine, even though you won't read this[1], let me ask you two questions:
1) If you have an assignment to write a program and run it on a data set, and you have two days left, does it make a difference to you if you have two separate solutions, one of which will take 20 hours to run, and the second of which does 3 times more work per datum, thereby requiring 60 hours? Do you prefer one of those solutions over the other? If your initial solution is the 60-hour one, and then you realize that you can cut the work down and run the 20-hour one, are you excited?
2) Do you really imagine that, in a forum focused on computers, with a heavy emphasis on clean data and technical computing, you are the only person to have started college? Or is it just that you think that you will hear about all the important concepts in math and science in 4 years, and that only people who specialize in a field remember any of it once they get a job?

[1]: For beating you up too badly in an argument over Java's performance compared to other languages -- and let me remind you that, even by your argument, Java was only slower by a constant factor? :lol:
Core i7 920, 3x2GB Corsair DDR3 1600, 80GB X25-M, 1TB WD Caviar Black, MSI X58 Pro-E, Radeon 4890, Cooler Master iGreen 600, Antec P183, opticals
SNM
Emperor Gerbilius I
 
Posts: 6206
Joined: Fri Dec 30, 2005 10:37 am

Re: Floating-point units in server-grade CPUs

Posted on Sat Nov 13, 2010 12:48 am

SA continues to shrilly argue What Should Be against What Is.
Buub
Maximum Gerbil
Silver subscriber
 
 
Posts: 4214
Joined: Sat Nov 09, 2002 11:59 pm
Location: Seattle, WA

Re: Floating-point units in server-grade CPUs

Posted on Sat Nov 13, 2010 1:08 am

SA wrote:By the way, how do you know about MCMC calculations? I would never have expected to encounter anyone here familiar with the subject.
There are at least four other regular forum members who identify themselves as statisticians (and I regret not keeping track of their names - PM me if you're reading this). My main interest is computational statistics and I've written a book chapter about simulation methods (to be published by the APA next year). MCMC occupies the second half of the chapter. It's a fun topic.
SA wrote:While I have not read the paper to verify their claims, I assume that their claim to have created a parallel version of the Mersenne Twister algorithm is valid...
It's also fun to hear your creative ideas about parallelizing MCMC. I genuinely hope I'm wrong, but I think there are two big differences between it and any univariate RNG, like the MT. First, MCMC is usually directed only at multivariate (posterior) distributions. For univariate distributions, systematic (i.e., nonrandom) integration is much more efficient than MCMC. And for problems with a handful of dimensions, there are several good simulation techniques that lack dependence between the steps (such as "Independent Metropolis-Hastings" and "Rejection/Acceptance Sampling"). But for typical multi-level models in statistics, every participant contributes several dimensions to the joint posterior distribution. And most experiments have 50+ participants/cases, so you need an MCMC to lessen the burden of the dimensionality. I think the parallel MT behaves similarly to an Independent MH; if you think I'm overlooking something, I'll take that Nvidia article more seriously and read it.

Second, MCMCs can require thousands of steps just to reach the target/stationary distribution. They have to wander around a lot to get there. I'm sure the MT begins its first step already at a stationary distribution (otherwise, you'd have to discard the "burn-in" steps). You use MCMC when you have a really vague idea of the shape of the joint/multivariate target distribution. If you already have a good idea of the target distribution, then something like the Independent MH is a much more efficient method (because none of the steps are correlated, so in a sense you're getting a bigger sample size).
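To make that serial dependence and burn-in concrete, here is a minimal 1-D random-walk Metropolis-Hastings sampler (a sketch of my own; the names and tuning values are arbitrary, and real MCMC targets are multivariate):

```python
import math
import random

def metropolis_hastings(log_target, x0, n_steps, step=1.0, seed=0):
    """Random-walk Metropolis-Hastings sampler (1-D for brevity).

    Note the inherent serial dependence: the proposal at step t is
    centered on the state accepted at step t-1, which is what makes
    the chain hard to spread across many threads.
    """
    rng = random.Random(seed)
    x = x0
    chain = []
    for _ in range(n_steps):
        proposal = x + rng.gauss(0.0, step)
        # Accept with probability min(1, target(proposal) / target(x)).
        if rng.random() < math.exp(min(0.0, log_target(proposal) - log_target(x))):
            x = proposal
        chain.append(x)
    return chain

# Target: standard normal (up to a constant), started far from the
# mode so the chain must wander in; the early steps are discarded.
chain = metropolis_hastings(lambda x: -0.5 * x * x, x0=10.0, n_steps=20000)
samples = chain[5000:]   # drop the burn-in portion
```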
SA wrote:A Markov Chain Monte Carlo calculation using a parallel Monte Carlo integration without hardware floating point units will outperform a Markov Chain Monte Carlo calculation using a single threaded Monte Carlo integration with a hardware floating point unit.
Assuming you have hundreds of cores, I agree that wall time could be quicker, but I imagine the power efficiency would be awful.

Markov chains involve moving from one state to the next - that's an inherently dependent process. Maybe it's not impossible to parallelize in the future, but it's silly to blame that difficulty on incompetent programmers. Again, I'd love to be proven wrong about this. It would make my day to read an article on how to parallelize a multivariate MCMC beyond a few cores.
wibeasley
Gerbil Elite
Gold subscriber
 
 
Posts: 952
Joined: Sat Mar 29, 2008 3:19 pm
Location: Norman OK

Re: Floating-point units in server-grade CPUs

Posted on Sat Nov 13, 2010 3:16 am

Meadows wrote:Shining Arcanine, I think it's absolutely cool how you pulled off your [Defense of the Troll] (1 hour cooldown) ability, much like Prime1, Krogoth, et al, because you just completely ignored Glorious' comment (here, I'll help, click this link).

As per the above skill's description, the reason you did that is that he's completely proven you wrong, and if you were to acknowledge that, the thread would already be dead by now. Instead, you choose to continue in ignorance and your own dark entertainment without realising that the battle's already over.

It was fun until you did that.


Bump for irritating but continued relevance. In fact, here's another comment by Glorious that's very well made, and that went completely ignored again.


Shining Arcanine wrote:
wibeasley wrote:<x>: SA "can make a decent argument for them being unnecessary"
<y>: "their actual performance is not really something that should be a concern for people"
I agree that <x> is false (or at least lacks support), and so we don't have evidence for claiming if <y> is true or false.

Furthermore, everyone in this thread appears to be concerned with their performance (with a variety of their own convincing if x, then y arguments).


With whom do you agree? I certainly do not think <x> is false in that case.


Of course you don't, well d'uh, it's your claim after all. Had you paid attention, however, you would've noticed long ago that nobody here agrees with you on just about any topic you bring up.

wibeasley wrote:
SA wrote:By the way, how do you know about MCMC calculations? I would never have expected to encounter anyone here familiar with the subject.
There are at least four other regular forum members who identify themselves as statisticians (and I regret not keeping track of their names - PM me if you're reading this). My main interest is computational statistics and I've written a book chapter about simulation methods (to be published by the APA next year). MCMC occupies the second half of the chapter. It's a fun topic.
SA wrote:While I have not read the paper to verify their claims, I assume that their claim to have created a parallel version of the Mersenne Twister algorithm is valid...
It's also fun to hear your creative ideas about parallelizing MCMC. I genuinely hope I'm wrong, but I think there are two big differences between it and any univariate RNG, like the MT. First, MCMC is usually directed only at multivariate (posterior) distributions. For univariate distributions, systematic (i.e., nonrandom) integration is much more efficient than MCMC. And for problems with a handful of dimensions, there are several good simulation techniques that lack dependence between the steps (such as "Independent Metropolis-Hastings" and "Rejection/Acceptance Sampling"). But for typical multi-level models in statistics, every participant contributes several dimensions to the joint posterior distribution. And most experiments have 50+ participants/cases, so you need an MCMC to lessen the burden of the dimensionality. I think the parallel MT behaves similarly to an Independent MH; if you think I'm overlooking something, I'll take that Nvidia article more seriously and read it.

Second, MCMCs can require thousands of steps just to reach the target/stationary distribution. They have to wander around a lot to get there. I'm sure the MT begins its first step already at a stationary distribution (otherwise, you'd have to discard the "burn-in" steps). You use MCMC when you have a really vague idea of the shape of the joint/multivariate target distribution. If you already have a good idea of the target distribution, then something like the Independent MH is a much more efficient method (because none of the steps are correlated, so in a sense you're getting a bigger sample size).


Wow, I didn't understand a single sentence in that one! Awesome :lol:
(Yes, my presence in this thread is caused not by professional reasons but largely by the failure of Arcanine, who's hardly Shining and continues to attract good laughs.)
Meadows
Grand Gerbil Poohbah
Silver subscriber
 
 
Posts: 3190
Joined: Mon Oct 08, 2007 1:10 pm
Location: Location: Location

Re: Floating-point units in server-grade CPUs

Posted on Sat Nov 13, 2010 10:52 am

wibeasley wrote:
SA wrote:By the way, how do you know about MCMC calculations? I would never have expected to encounter anyone here familiar with the subject.
There are at least four other regular forum members who identify themselves as statisticians (and I regret not keeping track of their names - PM me if you're reading this). My main interest is computational statistics and I've written a book chapter about simulation methods (to be published by the APA next year). MCMC occupies the second half of the chapter. It's a fun topic.


Has the book been published? I would be interested in reading it in about a month when I have time.

wibeasley wrote:
SA wrote:While I have not read the paper to verify their claims, I assume that their claim to have created a parallel version of the Mersenne Twister algorithm is valid...
It's also fun to hear your creative ideas about parallelizing MCMC. I genuinely hope I'm wrong, but I think there are two big differences between it and any univariate RNG, like the MT. First, MCMC is usually directed only at multivariate (posterior) distributions. For univariate distributions, systematic (i.e., nonrandom) integration is much more efficient than MCMC. And for problems with a handful of dimensions, there are several good simulation techniques that lack dependence between the steps (such as "Independent Metropolis-Hastings" and "Rejection/Acceptance Sampling"). But for typical multi-level models in statistics, every participant contributes several dimensions to the joint posterior distribution. And most experiments have 50+ participants/cases, so you need an MCMC to lessen the burden of the dimensionality. I think the parallel MT behaves similarly to an Independent MH; if you think I'm overlooking something, I'll take that Nvidia article more seriously and read it.

Second, MCMCs can require thousands of steps just to reach the target/stationary distribution. They have to wander around a lot to get there. I'm sure the MT begins its first step already at a stationary distribution (otherwise, you'd have to discard the "burn-in" steps). You use MCMC when you have a really vague idea of the shape of the joint/multivariate target distribution. If you already have a good idea of the target distribution, then something like the Independent MH is a much more efficient method (because none of the steps are correlated, so in a sense you're getting a bigger sample size).
SA wrote:A Markov Chain Monte Carlo calculation using a parallel Monte Carlo integration without hardware floating point units will outperform a Markov Chain Monte Carlo calculation using a single threaded Monte Carlo integration with a hardware floating point unit.
Assuming you have hundreds of cores, I agree that wall time could be quicker, but I imagine the power efficiency would be awful.


This is moving outside of my field. While I have nearly finished my undergraduate degree and one of my two majors is "Applied Mathematics and Statistics", the statistics component is minimal. I lack familiarity with your terminology and I have no idea what the acronym MH means. I assume that a posterior distribution refers to the distribution obtained after applying information by means of Bayes' Theorem. With that said, I cannot say much more in the scope of MCMC until I know at least what MH means, although I can still comment on problems that require Monte Carlo integration in general.

I think that as long as you are doing Monte Carlo integration, it is feasible to parallelize that portion of the computation. The Runge-Kutta method can do numerical integration faster, but I imagine that Monte Carlo integration is used for its property that you can arbitrarily stop and continue the calculation until you obtain convergent results. With the Runge-Kutta method, you will need to anticipate the required step size beforehand, which can make it computationally more expensive than Monte Carlo. If the step size is too small, you did much more computation than you had to do, and if the step size is too large, you need to start from scratch, guessing how much smaller the step size must be. You could probably address those issues by using some kind of genetic algorithm to determine the appropriate step size. Assuming that you can do that, a small machine doing Runge-Kutta could compete with a very large cluster doing Monte Carlo. However, the small machine will not become any faster at what it does as time passes, while the cluster will as its size increases.
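To make the step-size trade-off concrete, here is a classical fixed-step RK4 sketch (my own illustration; in practice, adaptive step-size control is the usual way the guessing problem is handled):

```python
def rk4(f, t0, y0, t1, n_steps):
    """Classical fourth-order Runge-Kutta with a fixed step size.

    If n_steps is too large you waste work; if it is too small the
    error is too big and the run must be redone with a finer step -
    the trade-off described above.
    """
    h = (t1 - t0) / n_steps
    t, y = t0, y0
    for _ in range(n_steps):
        k1 = f(t, y)
        k2 = f(t + h / 2, y + h * k1 / 2)
        k3 = f(t + h / 2, y + h * k2 / 2)
        k4 = f(t + h, y + h * k3)
        y += (h / 6) * (k1 + 2 * k2 + 2 * k3 + k4)
        t += h
    return y

# y' = y with y(0) = 1 integrated to t = 1 should approximate e.
result = rk4(lambda t, y: y, 0.0, 1.0, 1.0, 100)
```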

I agree that the power efficiency of using Monte Carlo in a cluster is awful in comparison to the power efficiency of using Runge-Kutta in a single machine. The reality of these computations is that parallelizing them requires the use of algorithms that are slow in the single-threaded case, but become better as the parallelism increases, such that the calculation will run faster on a sufficiently large cluster. That produces a large discontinuity in power consumption, but the GPU might be able to lessen that. GPUs are essentially clusters on a chip with all of the non-compute parts stripped out, which allows their energy efficiency to be roughly an order of magnitude greater than that of a cluster.

wibeasley wrote:Markov chains involve moving from one state to the next - that's an inherently dependent process. Maybe it's not impossible to parallelize in the future, but it's silly to blame that difficulty on incompetent programmers. Again, I'd love to be proven wrong about this. It would make my day to read an article on how to parallelize a multivariate MCMC beyond a few cores.


Do you see any difference between MCMC calculations and the calculation of the pixels on a screen when playing a video game? Ignoring the fact that you have real time constraints and can do more or less depending on the performance of the GPU, each frame must be calculated before the succeeding frame. The entire calculation is parallelized by the ability to parallelize computation of pixels within a frame. The dependence of the process really does not matter as far as its ability to be parallelized is concerned because there is enough parallelism in each frame that in practice the dependent nature of the calculation does not usually form a bottleneck.
Disclaimer: I over-analyze everything, so try not to be offended if I over-analyze something you wrote.
Shining Arcanine
Gerbil Jedi
 
Posts: 1717
Joined: Wed Jun 11, 2003 11:30 am

Re: Floating-point units in server-grade CPUs

Posted on Sat Nov 13, 2010 12:28 pm

In some ways, I like how you are fearless about proposing ideas in a topic that's new to you. If you want to focus on MCMCs, let's start a new thread in Developer's Den. I brought up MCMC as an example of an important numerical procedure that is not easily parallelizable to hundreds of threads. Therefore, I don't agree with you that we can eliminate the FPU and blame incompetent "programmers [who] are incapable of parallelizing code".
SA wrote:Would someone enlighten me as to why people care about floating point performance?
If neither of us is able to find a vein of articles describing the problems and potential solutions to parallelizing an MCMC, are you willing to say there's a possibility that it's not a programmer's fault? And if so, can you understand why some computational programmers care about floating point performance and maybe are even excited about the upcoming AVX extensions?

If you do start a new MCMC thread: (a) the 'MH' refers to 'Metropolis-Hastings'; it's the oldest and most general MCMC sampler. (b) I don't see any helpful connections between Runge-Kutta and MCMC in this context. As I remember, there's no stochastic component to RK. Tell me if I'm wrong, and I'll get interested. (c) My impression is that if an MCMC were like rendering, then we'd only find the marginals of a multivariate distribution (instead of the joint distribution). However, if rendering were like an MCMC, then rendering the tree (that's on the left side of the screen) would depend on how the rock (on the right side of the screen) was drawn during the previous frame. And because these elements are stochastically drawn, the uncertainty prevents the execution from skipping ahead steps/frames. And these are typically continuous variables, so please don't suggest branch prediction will solve all the problems.
wibeasley
Gerbil Elite
Gold subscriber
 
 
Posts: 952
Joined: Sat Mar 29, 2008 3:19 pm
Location: Norman OK

Re: Floating-point units in server-grade CPUs

Posted on Sat Nov 13, 2010 1:35 pm

wibeasley wrote:In some ways, I like how you are fearless about proposing ideas in a topic that's new to you. If you want to focus on MCMCs, let's start a new thread in Developer's Den. I brought up MCMC as an example of an important numerical procedure that is not easily parallelizable to hundreds of threads. Therefore, I don't agree with you that we can eliminate the FPU and blame incompetent "programmers [who] are incapable of parallelizing code".
SA wrote:Would someone enlighten me as to why people care about floating point performance?
If neither of us is able to find a vein of articles describing the problems and potential solutions to parallelizing an MCMC, are you willing to say there's a possibility that it's not a programmer's fault? And if so, can you understand why some computational programmers care about floating point performance and maybe are even excited about the upcoming AVX extensions?

If you do start a new MCMC thread: (a) the 'MH' refers to 'Metropolis-Hastings'; it's the oldest and most general MCMC sampler. (b) I don't see any helpful connections between Runge-Kutta and MCMC in this context. As I remember, there's no stochastic component to RK. Tell me if I'm wrong, and I'll get interested. (c) My impression is that if an MCMC were like rendering, then we'd only find the marginals of a multivariate distribution (instead of the joint distribution). However, if rendering were like an MCMC, then rendering the tree (that's on the left side of the screen) would depend on how the rock (on the right side of the screen) was drawn during the previous frame. And because these elements are stochastically drawn, the uncertainty prevents the execution from skipping ahead steps/frames. And these are typically continuous variables, so please don't suggest branch prediction will solve all the problems.


My comment's target audience did not include people who do scientific computation, and I think I posted about that earlier in the thread. I would need to see an example of what is actually being calculated to say more about it, but I think I will refrain from further comment on this specific problem until your book is available. With that said, my belief is that people outside of my comment's target audience are in the minority here.
Disclaimer: I over-analyze everything, so try not to be offended if I over-analyze something you wrote.
Shining Arcanine
Gerbil Jedi
 
Posts: 1717
Joined: Wed Jun 11, 2003 11:30 am

Re: Floating-point units in server-grade CPUs

Posted on Sat Nov 13, 2010 5:25 pm

Really you can believe whatever you want, because it doesn't need to be based in fact. Then everybody wins... :lol:
tfp
Grand Gerbil Poohbah
 
Posts: 3076
Joined: Wed Sep 24, 2003 11:09 am

Re: Floating-point units in server-grade CPUs

Posted on Sat Nov 13, 2010 9:52 pm

There are no revolutionary concepts in the upcoming chapter. It's an intro/review intended for behavioral statisticians. With your math & programming interests, you'd get a lot more from something like Robert & Casella's two books, if your library has them.

I suspect there are still more people outside your target audience who benefit from an FPU/SSE/AVX. But I'm happy if we can agree that at least some people benefit from them. Just please don't convince AMD & Intel to drop those execution units.
wibeasley
Gerbil Elite
Gold subscriber
 
 
Posts: 952
Joined: Sat Mar 29, 2008 3:19 pm
Location: Norman OK

Re: Floating-point units in server-grade CPUs

Posted on Sat Nov 13, 2010 10:44 pm

wibeasley wrote:There are no revolutionary concepts in the upcoming chapter. It's an intro/review intended for behavioral statisticians. With your math & programming interests, you'd get a lot more from something like Robert & Casella's two books, if your library has them.


Thanks for the link. I will see if the books are available at my university's library.

wibeasley wrote:I suspect there are still more people outside your target audience who benefit from an FPU/SSE/AVX. But I'm happy if we can agree that at least some people benefit from them. Just please don't convince AMD & Intel to drop those execution units.


There is a theoretical benefit, but it is not significant for the vast majority of computer users. Most computer users do word processing, web browsing and email, none of which is limited by CPU floating point performance. Many other things are not limited by floating point performance either, including the applications commonly run on the servers that businesses use.

I doubt that AMD and Intel could remove floating point instructions from their CPUs entirely because of all of the legacy code that they must support. Removing floating point hardware would likely require them to make a derivative instruction set architecture, and it would likely be taken as an opportunity to start from scratch with an ISA that is more consistent. At the same time, the overwhelming majority of systems in which computer processors are used do not rely on strong floating point performance. AMD and Intel appear to be moving in two different directions on this. AMD halved the number of floating point units per core in their new Bulldozer chip, and their strategy is to push developers to use the GPU component of their upcoming Fusion products for floating point intensive calculations.

Intel, on the other hand, has no significant investment in GPUs. While it has integrated GPUs into some of its processors, the integrated GPUs in Sandy Bridge are incapable of doing GPGPU calculations and appear to be intended more to keep people from buying GPGPU-capable graphics processors than anything else. The transcode engine in Sandy Bridge also appears to serve that purpose. With that in mind, I expect to see Intel push CPU floating point performance as far as it can go until they lose their process technology advantage, at which point they will be forced to adopt a different approach to processor design.

Moving forward, the industry will transition to GPGPU computing for intensive calculations. This trend can be observed in the largest supercomputers in the world, in tiny smartphones and everything in between them:

http://news.cnet.com/8301-13924_3-20021232-64.html
http://armdevices.net/2010/11/13/arm-ma ... d-devices/
http://www.nvidia.com/object/cuda_apps_flash_new.html
Disclaimer: I over-analyze everything, so try not to be offended if I over-analyze something you wrote.
Shining Arcanine
Gerbil Jedi
 
Posts: 1717
Joined: Wed Jun 11, 2003 11:30 am

Re: Floating-point units in server-grade CPUs

Posted on Sun Nov 14, 2010 12:47 am

Shining Arcanine wrote:There is a theoretical benefit, but it is not significant for the vast majority of computer users.

Yeah, because nobody watches video or uses Flash nowadays, or has any sound to go with that video :roll:

Really, all your Jedi-handwaving is becoming ridiculous at this point. SSE and all its successor friends were created precisely because the need arose for such operations to be fast, or at the very least fast enough.
  • Game servers
  • VoIP servers
  • Video streaming servers
  • Web servers - statistics collecting, especially
  • Database servers, quite often
  • Rendering servers

You don't like FP instructions. Fine. You don't use FP instructions. Whatever. Meanwhile, the rest of the world will keep on using them. And you can throw a tantrum in as many ways as you see fit.
There is a fixed amount of intelligence on the planet, and the population keeps growing :(
morphine
Grand Admiral Gerbil
Silver subscriber
 
 
Posts: 10092
Joined: Fri Dec 27, 2002 8:51 pm
Location: Portugal (that's next to Spain)

Re: Floating-point units in server-grade CPUs

Posted on Sun Nov 14, 2010 12:58 am

You know that GPUs just have a ton of floating point units that run in parallel right?

Now, I don't expect you to actually do this, but look at the pictures: the second one down in TR's write-up shows what each CUDA unit is. You'll see that each contains both an FP unit and an Int unit.
http://www.techreport.com/articles.x/18332

So really your complete argument is pointless, because they are still doing FP calculations, and if you can't tell from past history, AMD and Intel pull everything into the CPU as the die gets smaller. At some point all of those CUDA/AMD/Intel graphics units will be generally accessible by software and they will be on the processor.

I don't understand why you aren't talking co-processor vs CPU instead of "no one uses FP!!!" Everyone does now, just because of Aero. Intel/AMD will be and are integrating a bunch of simple FP/vector units onto the CPUs just to handle graphics, which is FP work. If Intel has its way, the failed attempt at the GPU a year or so back will have pieces of it end up on the processor. It will then do graphics well enough, and those computation units are x86, so I'm sure Intel's libraries will use them at some point. AMD and Nvidia are slowly moving their graphics chips to being more and more general in their workload. AMD is putting graphics on the CPU this next year. Nvidia is the odd man out because, other than ARM, they aren't making CPUs.

FP work is needed and useful, people don't want to buy co-processors, and Intel/AMD are integrating as much as possible whenever they can into their CPUs. GPGPU cards will continue to live in a niche market - get used to it.
tfp
Grand Gerbil Poohbah
 
Posts: 3076
Joined: Wed Sep 24, 2003 11:09 am

Re: Floating-point units in server-grade CPUs

Posted on Sun Nov 14, 2010 1:39 am

morphine wrote:
Shining Arcanine wrote:There is a theoretical benefit, but it is not significant for the vast majority of computer users.

Yeah, because nobody watches video or uses Flash nowadays, or has any sound to go with that video :roll:

Really, all your Jedi-handwaving is becoming ridiculous at this point. SSE and all its successor friends were created precisely because the need arose for such operations to be fast, or at the very least fast enough.
  • Game servers
  • VoIP servers
  • Video streaming servers
  • Web servers - statistics collecting, especially
  • Database servers, quite often
  • Rendering servers

You don't like FP instructions. Fine. You don't use FP instructions. Whatever. Meanwhile, the rest of the world will keep on using them. And you can throw a tantrum in as many ways as you see fit.


If you look hard enough, it would be difficult to find a program that does not use floating point instructions. Operating systems use floating point instructions in their CPU schedulers. The wisdom of using them is not in question; whether the usage of floating point instructions is significant is.

Web servers and databases do not make any significant usage of floating point instructions. Video streaming servers are essentially file servers and need no floating point instructions to stream data. I doubt game servers and VoIP servers do either. Game servers likely do not have a heavy reliance on floating point performance.

VoIP servers do digital signal processing, which is usually floating point intensive, but that is not widely used. Rendering is also quite possibly floating point intensive, but it is even less widely used than VoIP. As far as I know, render servers all run RenderMan. It was designed before GPGPU existed, so it originally ran on CPUs. Since it has enormous computing power requirements, I would expect Pixar to be porting it to GPUs.

By the way, Adobe has GPU acceleration working in Flash on Windows and Mac OS X and video is GPU accelerated nearly everywhere. Since people here practically worship the ownership of a recent discrete GPU, it is more than likely that any video that you watch in your web browser is being GPU accelerated.

tfp wrote:You know that GPUs just have a ton of floating point units that run in parallel right?

Now, I don't expect you to actually do this, but look at the pictures: the second one down in TR's write-up shows what each CUDA unit is. You'll see that each contains both an FP unit and an Int unit.
http://www.techreport.com/articles.x/18332

So really your complete argument is pointless, because they are still doing FP calculations, and if you can't tell from past history, AMD and Intel pull everything into the CPU as the die gets smaller. At some point all of those CUDA/AMD/Intel graphics units will be generally accessible by software and they will be on the processor.

I don't understand why you aren't talking co-processor vs CPU instead of "no one uses FP!!!" Everyone does now, just because of Aero. Intel/AMD will be and are integrating a bunch of simple FP/vector units onto the CPUs just to handle graphics, which is FP work. If Intel has its way, the failed attempt at the GPU a year or so back will have pieces of it end up on the processor. It will then do graphics well enough, and those computation units are x86, so I'm sure Intel's libraries will use them at some point. AMD and Nvidia are slowly moving their graphics chips to being more and more general in their workload. AMD is putting graphics on the CPU this next year. Nvidia is the odd man out because, other than ARM, they aren't making CPUs.

FP work is needed and useful, people don't want to buy co-processors, and Intel/AMD are integrating as much as possible into their CPUs whenever they can. GPGPU cards will continue to live in a niche market; get used to it.


What is your point? Floating point calculations are the principal thing GPUs are used to do, so it only makes sense to have hardware floating point units on a GPU. I never disputed that.

By the way, in 5 years, it will likely be the case that every smart phone in the world will have a GPGPU processor. How is that a niche market?

http://armdevices.net/2010/11/13/arm-ma ... d-devices/
Disclaimer: I over-analyze everything, so try not to be offended if I over-analyze something you wrote.
Shining Arcanine
Gerbil Jedi
 
Posts: 1717
Joined: Wed Jun 11, 2003 11:30 am

Re: Floating-point units in server-grade CPUs

Postposted on Sun Nov 14, 2010 4:15 am

Shining Arcanine wrote:Web servers and databases do not make any significant usage of floating point instructions.

It shows how you have ZERO real-world experience. How am I going to calculate statistics in my web app? CUDA, I suppose? What about the Bayesian filtering in mail servers? Integer math, too? I really wonder if you stopped to actually *think* about what you write.
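For the record, even a toy Bayesian spam filter is floating point through and through. Here is a minimal sketch in Python (the per-word probabilities are invented purely for illustration): multiplying many small probabilities underflows, so real filters work in log space, which means `log` and `exp` on every message.

```python
import math

# Hypothetical per-word probabilities, (P(word|spam), P(word|ham)).
# These values are made up for illustration only.
WORD_PROBS = {
    "viagra": (0.90, 0.01),
    "meeting": (0.05, 0.60),
    "free": (0.70, 0.20),
}

def spam_score(words, prior_spam=0.5):
    # Accumulate in log space to avoid floating-point underflow when
    # many small word probabilities are multiplied together.
    log_spam = math.log(prior_spam)
    log_ham = math.log(1.0 - prior_spam)
    for w in words:
        if w in WORD_PROBS:
            p_spam, p_ham = WORD_PROBS[w]
            log_spam += math.log(p_spam)
            log_ham += math.log(p_ham)
    # Convert the log-odds back into a posterior probability of spam.
    return 1.0 / (1.0 + math.exp(log_ham - log_spam))

print(spam_score(["viagra", "free"]))  # high, close to 1.0
print(spam_score(["meeting"]))         # low, close to 0.0
```

Every line of that scoring loop is floating-point math, and it runs on the mail server, not on some GPU in the corner.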

Shining Arcanine wrote: Video streaming servers are essentially file servers and need no floating point instructions to stream data.

... right up to the point where they transcode a file. Note the "streaming" part, not just "file serving".

Shining Arcanine wrote:I doubt game servers and VoIP servers do either [...] Game servers likely do not have a heavy reliance on floating point performance.

I "doubt" you know anything about the real world at this point. VoIP codecs use FP. Ergo, servers need quite a bit of that. And game servers? Are you seriously telling me that a Team Fortress 2 server does all of its calculations using integer math? Heh...

Shining Arcanine wrote:VoIP servers do digital signal processing, which is usually floating point intensive, but that is not widely used.

Nope, not widely, it's just a booming market of which everyone's trying to get a piece of. Oh, but it hasn't reached your basement lab yet, so it doesn't exist, right? :roll:

Shining Arcanine wrote: Rendering is also quite possibly floating point intensive, but it is even less widely used than VoIP.

Suuuure... just a few hundred thousand (millions?) boxes in the render farms out there. But you haven't got one of those in the lab either, I guess.

Shining Arcanine wrote:As far as I know, render servers all run RenderMan.

As far as I know, you're completely out of touch with... everything :lol:

So let's sum this whole thing up: you don't know jack about any of the stuff you're writing about, being completely out of touch with anything that's actually done in the Real World (tm). Your comments on VoIP not being widely used and all rendering being done via Renderman (lol) are particularly amusing. You think GPGPU is awesome and standard floating-point math sucks. You've tried to apply the bits of theory that you know to every possible thing you can see (that you don't know the first thing about). When presented with opposing evidence, you do so much Jedi-handwaving that your arms are about to fall off.
There is a fixed amount of intelligence on the planet, and the population keeps growing :(
morphine
Grand Admiral Gerbil
Silver subscriber
 
 
Posts: 10092
Joined: Fri Dec 27, 2002 8:51 pm
Location: Portugal (that's next to Spain)

Re: Floating-point units in server-grade CPUs

Postposted on Sun Nov 14, 2010 9:42 am

Shining Arcanine wrote:What is your point? Floating point calculations are the principal thing GPUs are used to do, so it only makes sense to have hardware floating point units on a GPU. I never disputed that.

By the way, in 5 years, it will likely be the case that every smart phone in the world will have a GPGPU processor. How is that a niche market?

http://armdevices.net/2010/11/13/arm-ma ... d-devices/



The GPU is for graphics acceleration, nothing more, and at some point it will be on-chip as well, if it isn't already. The ARM chips still have a general FP unit for normal work that is REQUIRED for performance reasons, but I know you haven't done any embedded work, so we'll just stay away from that.

I suppose DSPs will be replaced by GPGPU in embedded as well, right?
tfp
Grand Gerbil Poohbah
 
Posts: 3076
Joined: Wed Sep 24, 2003 11:09 am

Re: Floating-point units in server-grade CPUs

Postposted on Sun Nov 14, 2010 10:02 am

morphine wrote:So let's sum this whole thing up: you don't know jack about any of the stuff you're writing about, being completely out of touch with anything that's actually done in the Real World (tm). Your comments on VoIP not being widely used and all rendering being done via Renderman (lol) are particularly amusing. You think GPGPU is awesome and standard floating-point math sucks. You've tried to apply the bits of theory that you know to every possible thing you can see (that you don't know the first thing about). When presented with opposing evidence, you do so much Jedi-handwaving that your arms are about to fall off.


If you want to summarize, then let us talk about your knowledge. How many computer programs have you written? How many bug reports have you filed for other people's programs? How many patches have you produced for those bug reports?

The things you think a computer program does are likely not even 10% of the things that it actually does behind the scenes, all of which you take for granted. Instead of talking about other people's knowledge, you should examine your own. The ability to put a bunch of hardware components together and install Windows makes you a technician, not an expert.

tfp wrote:
Shining Arcanine wrote:What is your point? Floating point calculations are the principal thing GPUs are used to do, so it only makes sense to have hardware floating point units on a GPU. I never disputed that.

By the way, in 5 years, it will likely be the case that every smart phone in the world will have a GPGPU processor. How is that a niche market?

http://armdevices.net/2010/11/13/arm-ma ... d-devices/



The GPU is for graphics acceleration, nothing more, and at some point it will be on-chip as well, if it isn't already. The ARM chips still have a general FP unit for normal work that is REQUIRED for performance reasons, but I know you haven't done any embedded work, so we'll just stay away from that.

I suppose DSPs will be replaced by GPGPU in embedded as well, right?


Graphics acceleration is a special case of something known as stream processing, which is a superset of vector processing. Anything that strongly benefits from vector processing (e.g. SSE) will benefit from being run on a GPGPU. Digital signal processing is one of these things.
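To illustrate, here is a toy FIR filter in Python (the tap values are arbitrary). The inner dot product is exactly the kind of kernel that maps onto SSE lanes or GPU threads, because every output sample is computed independently of the others:

```python
def fir_filter(signal, taps):
    """Finite impulse response filter: each output sample is the dot
    product of the tap weights with a sliding window of the input.
    Each output is independent, so a SIMD unit or GPU can compute many
    of them in parallel; this loop just does them one at a time."""
    n = len(taps)
    out = []
    for i in range(len(signal) - n + 1):
        # This inner product is the vectorizable kernel.
        acc = 0.0
        for j in range(n):
            acc += taps[j] * signal[i + j]
        out.append(acc)
    return out

# A trivial 3-tap smoothing filter as a DSP example.
print(fir_filter([1.0, 2.0, 3.0, 4.0], [0.5, 0.25, 0.25]))  # → [1.75, 2.75]
```

Whether that kernel runs on an x86 FPU, an SSE unit, or a GPU is an implementation detail; the math is the same stream of floating point multiply-adds either way.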
Disclaimer: I over-analyze everything, so try not to be offended if I over-analyze something you wrote.
Shining Arcanine
Gerbil Jedi
 
Posts: 1717
Joined: Wed Jun 11, 2003 11:30 am

Re: Floating-point units in server-grade CPUs

Postposted on Sun Nov 14, 2010 11:47 am

Shining Arcanine wrote:
morphine wrote:So let's sum this whole thing up: you don't know jack about any of the stuff you're writing about [...]


If you want to summarize, then let us talk about your knowledge. Blah blah yadda, yak yak yak.

Whoa, whoa, whoa. Hold on there, little buddy. You're the one who has yet to prove anything, so it's far from your turn to require anything of others.
Meadows
Grand Gerbil Poohbah
Silver subscriber
 
 
Posts: 3190
Joined: Mon Oct 08, 2007 1:10 pm
Location: Location: Location
