Personal computing discussed

Moderators: renee, farmpuma, just brew it!

 
cass
Minister of Gerbil Affairs
Posts: 2269
Joined: Mon Feb 10, 2003 9:12 am
Contact:

Thu Oct 26, 2006 4:40 pm

Damage wrote:
notfred, I think your basic formula isn't terrible, but it does have a weakness in that it doesn't really address locality, memory access patterns, and resource sharing when running similar WU types across multiple cores.


Assuming that one folding process is "glued" to each core, why would the type of WU each core is working on affect the others?

I mean, the amount of data transferred in and out while folding is pitifully small, so I assume (and have always observed) that memory system speed has little to no impact on folding speed.

I don't think the folding processes can share any data, lest the results be skewed, because I was under the impression that each run of even the same WU produces slightly different results. Maybe some of the operations performed would be queued or stored in the shared cache and somehow shared, but that would assume that like WUs were started and continued at the same pace, and I don't think that's likely, with system tasks being run on different cores at different times.

I think main memory, the disk subsystem, the instruction cache, and the L2 cache are all that's shared in a C2D system, right? I assume, and thoroughly believe, that folding does not stress these systems.

I guess there could be some skewing due to instruction cache sharing, but if you compared 4 cores all running Gromacs against 4 cores running a Tinker, a Gromacs, a Double Gromacs, and, say, an Amber, you could see if there was enough sharing to make a difference.

I think the biggest problem is going to be if there is not enough memory, and the folding processes start using diskspace for memory.

Guess I'm just curious as to what tests would rule inter-core effects in or out.

HT... didn't Stanford warn against running HT? More forum diving.
 
Damage
Gerbil Jedi
Posts: 1787
Joined: Wed Dec 26, 2001 7:00 pm
Location: Lee's Summit, Missouri, USA
Contact:

Fri Oct 27, 2006 12:48 pm

Hey, notfred, how does this thing scale to eight cores? Does it load them all up?

Note: this is not an idle question. Hardware is here, man.
Scott Wasson - "Damage"
 
drfish
Gerbil Elder
Posts: 5546
Joined: Wed Jan 22, 2003 7:53 pm
Location: Zeeland, MI

Fri Oct 27, 2006 1:00 pm

This is sweet. My little project died horribly when my free time dried up, and I haven't been trying too hard to get back to it, because it looks like it would be a waste of time if TR is going to show off some cool stuff... :D
 
iamajai
Gerbil Team Leader
Posts: 241
Joined: Thu Jul 07, 2005 2:02 pm
Location: Vancouver, BC

Fri Oct 27, 2006 1:12 pm

Cass:

I noticed on my e6300 that when I'm folding two 364-point bonus workunits simultaneously, I get almost the same output as folding just one of those workunits. In other words, the two workunits slow each other down to half speed. I tested this by pausing one of the folding services and seeing the time/step drop to almost half.

Other workunits running at the same time have little to no effect. I have 1GB of RAM, so I don't believe swapping to the HD is what's changing the performance.

So I do see some influence on folding speed depending on the workunit mix between the two cores, sometimes much more dramatic than others.

As to the cause, I can only guess it may be the decrease in L2 cache per core when both cores are running together. Although I doubt the folding code uses more than 1MB... the data still needs to be transferred between L2 and main memory regardless...
 
notfred
Maximum Gerbil
Topic Author
Posts: 4610
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Fri Oct 27, 2006 1:20 pm

OMG Damage, you can't be serious!

I was beginning to think about things when I saw that 16-way rig on the front page. The short answer is no, this will not scale that big right now.

Long answer:
The kernel is currently set for a max of 8 CPUs, but that's just a quick switch to change, so no issues there. I think Linux supports up to 255 CPUs, so I don't think we'll see too many systems break that.

The real issue is the actual benchmark: it only has 4 WUs to benchmark with. I suppose I could just run more copies of the same WUs to load up the other processors - that would also address your concerns over WU interactions. Let me have a look at doing that - it's not going to be an instant thing - maybe some time over the weekend.
 
Damage
Gerbil Jedi
Posts: 1787
Joined: Wed Dec 26, 2001 7:00 pm
Location: Lee's Summit, Missouri, USA
Contact:

Fri Oct 27, 2006 1:28 pm

notfred wrote:
I suppose I could just run more copies of the same WUs to load up the other processors - that would also address your concerns over WU interactions. Let me have a look at doing that - it's not going to be an instant thing - maybe some time over the weekend.


That makes sense. Perhaps just load up the other four cores by mirroring the load on the first four? I dunno. Would be nice to have something comparable to the current numbers, but it's not entirely necessary. Sorry to work you like this. :)
Scott Wasson - "Damage"
 
cass
Minister of Gerbil Affairs
Posts: 2269
Joined: Mon Feb 10, 2003 9:12 am
Contact:

Fri Oct 27, 2006 2:27 pm

iamajai wrote:
Cass:

I noticed on my e6300 that when I'm folding two 364-point bonus workunits simultaneously, I get almost the same output as folding just one of those workunits. In other words, the two workunits slow each other down to half speed. I tested this by pausing one of the folding services and seeing the time/step drop to almost half.


Project number? Run, clone, gen?
Your OS?

Let's see it, and I'll find one to run through one of my three systems.
 
Ragnar Dan
Gerbil Elder
Posts: 5380
Joined: Sun Jan 20, 2002 7:00 pm

Fri Oct 27, 2006 2:59 pm

On the idea of the WU mix affecting the production rate of multi-core machines, I think it's possible that there is memory bus contention on some systems with certain WUs, especially if you don't have a separate stick or more of RAM available for each core.

I regularly see memory usage above 100MB on certain WUs, so crunching through that data won't be helped by a large L2 cache, but I would still be surprised if what the other processor core was running made that big a difference. Are you sure, iamajai, that nothing else is causing the difference? Check Task Manager (and select the appropriate columns to see how much memory is being used), and make sure the folding clients are getting the CPUs' time.
 
Tarx
Gerbil Elite
Posts: 757
Joined: Tue Mar 22, 2005 3:36 pm
Location: Canada

Fri Oct 27, 2006 3:11 pm

iamajai wrote:
Cass:
I noticed on my e6300 that when I'm folding two 364-point bonus workunits simultaneously, I get almost the same output as folding just one of those workunits. In other words, the two workunits slow each other down to half speed. I tested this by pausing one of the folding services and seeing the time/step drop to almost half.

Unlike almost all other projects, the 364-point (bonus) projects are VERY cache dependent. With the shared cache of the C2D, that is likely the reason for that result. An old 2GHz Pentium M (with lousy FP and nothing special in SSE/SSE2) but with 2MB of cache did about 700 PPD! No other project (bonus or otherwise) managed even half that.
Keep folding!
 
cass
Minister of Gerbil Affairs
Posts: 2269
Joined: Mon Feb 10, 2003 9:12 am
Contact:

Fri Oct 27, 2006 3:41 pm

I just went down and shut down one folding instance on my machine, which was running a proj 2414 and a proj 2126. I shut down the 2126. The 2414 was taking 28 minutes, plus or minus a few seconds, per 1% with both running, and took 28 minutes with just the one running.

The particular machine only has 512MB of memory, and it is DDR400 running at 333. The chipset is a VIA 890 Pro, and it is running integrated video with 64MB of shared memory. The 2126 was taking about 3,700K of memory and the 2414 showed about 84,000K in Task Manager. The machine had ACAD 2000 running, plus BobCAD, MCAM9, and whatever system stuff too... it showed only about a 400MB page file. The machine is running WinXP Home.

XP Home shows some funny results in Task Manager: when I shut down one folding instance, both processor graphs came down, even though usage stayed at 50% for the one remaining process.

Tarx wrote:
Unlike almost all other projects, the 364-point (bonus) projects are VERY cache dependent. With the shared cache of the C2D, that is likely the reason for that result. An old 2GHz Pentium M (with lousy FP and nothing special in SSE/SSE2) but with 2MB of cache did about 700 PPD! No other project (bonus or otherwise) managed even half that.


Maybe, but every bone in my body is screaming "no" at me, because I have run these WUs on processors with 256K of cache and they didn't run much, if any, slower than on the same core with 1MB of cache.
 
farmpuma
Minister of Gerbil Affairs
Posts: 2810
Joined: Mon Mar 22, 2004 12:33 am
Location: Soybean field, IN, USA, Earth .. just a bit south of John .. err .... Fart Wayne, Indiana
Contact:

Fri Oct 27, 2006 6:08 pm

I've had somewhat similar results, although it's a bit apples versus oranges.

My socket A Sempron @ 2GHz and my slightly underclocked S754 @ 1.6GHz both run the 600-point WUs at about 190+ PPD.

The Sempron ran a 364-point WU at about the same speed, but the S754 is making 300 PPD! 256K L2 versus 512K L2, with memory usage at 101MB.
.* * M-51 * *. .The Whirlpool Galaxy. .TV muted, X dual flying birds & a mantra, "Specter be Gone"
 
iamajai
Gerbil Team Leader
Posts: 241
Joined: Thu Jul 07, 2005 2:02 pm
Location: Vancouver, BC

Sat Oct 28, 2006 1:31 pm

I saw the WUs affect each other while running two project 1495 units at the same time... can't remember the exact run/clone/gen.

I've had that happen twice, and both times saw the same effect on production. I checked, and there were no other processes taking up CPU; they were the only things running. I'm running XP Home, btw...

I should note that I have not seen as drastic an effect while running a mix of other work units, and oftentimes there is no effect at all. Project 1495 just seems to be an exception to the rule.

Cheers.
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Sat Oct 28, 2006 4:26 pm

cass wrote:
XP Home shows some funny results in Task Manager: when I shut down one folding instance, both processor graphs came down, even though usage stayed at 50% for the one remaining process.

That's because by default Windows does not "pin" a process to a single core. A given process will randomly bounce between the two cores depending on which core Windows happens to think is more "idle" at the instant it decides to schedule a timeslice for the process.
Nostalgia isn't what it used to be.
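As a footnote to the scheduling point above: "pinning" is just a bitmask of allowed cores. A minimal sketch, where the helper name is my own invention (on XP you would apply the mask via Task Manager's "Set Affinity..." or the Win32 SetProcessAffinityMask call; on Linux, sched_setaffinity):

```python
def affinity_mask(cores):
    """Build a CPU affinity bitmask from a list of core indices.

    Bit n set means the process may run on core n, so a process
    pinned to core 0 of a dual-core gets mask 0b01, core 1 gets 0b10.
    """
    return sum(1 << c for c in sorted(set(cores)))

# One folding client per core of a dual-core system:
mask_core0 = affinity_mask([0])     # 0b01
mask_core1 = affinity_mask([1])     # 0b10
both_cores = affinity_mask([0, 1])  # 0b11 -- the Windows default (no pinning)
```

With the default `0b11` mask, each client can be scheduled on either core at each timeslice, which is exactly why both Task Manager graphs move together.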
 
cass
Minister of Gerbil Affairs
Posts: 2269
Joined: Mon Feb 10, 2003 9:12 am
Contact:

Sat Oct 28, 2006 9:16 pm

just brew it! wrote:
cass wrote:
XP Home shows some funny results in Task Manager: when I shut down one folding instance, both processor graphs came down, even though usage stayed at 50% for the one remaining process.

That's because by default Windows does not "pin" a process to a single core. A given process will randomly bounce between the two cores depending on which core Windows happens to think is more "idle" at the instant it decides to schedule a timeslice for the process.


Does xp pro behave differently, or do you have to set the processor affinity to stop it?

*edit: OK... tried XP Pro, with the same results.
 
eitje
Gerbil Elite
Posts: 661
Joined: Fri Mar 07, 2003 11:28 am

Sun Oct 29, 2006 5:57 pm

some feedback from an infrequent poster. :)

I just tried booting your CD on an EPIA SP13000 (hehe!). It loads the kernel, loads the initrd with 4 lines of "."s, then generates a screenful of this error:

unknown interrupt or fault at EIP 00000060 c01002b1 000002b0


Once it prints that out, the system reboots (it took 5 cycles to capture that error exactly). It's the same error every time it boots.

The same error occurs using an EPIA MS10000.

Testing with a P4 in an SB75S boots fine; however, it gets stuck at "debug, Sending discover". I imagine this must be DHCP, since I have DHCP turned off on my home router.

I know you said you have the keyboard disabled, but if there is a way, I'd recommend allowing manual input of IP information. :)

I enabled DHCP briefly, and it finally got up & running. I'll post my final results & system specs in drfish's performance thread!

edit: the error appears to be related to - haha - ACPI being used in your new version of the boot CD. ;D What would be the effect of offering ACPI & non-ACPI versions of the CD? Would there be significant performance differences?
edit2: attempted acpi=off in the append line of isolinux.cfg, without luck. I don't really want to go to the effort of downloading your source & rebuilding from there, but I will if I have to. :)
edit3: went ahead and rebuilt the kernel from source - the same error comes up. Probably did something wrong in the rebuild. :( Do you still have the old, non-ACPI version of the image somewhere? If so, please provide it so I can test my little EPIAs. :)
edit4: I'm also seeing some info about C3s and the i686 flavor of the kernel. Might be something to look at too. I know you've got Scott's stuff that's more important here, though. :)
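For reference, the kind of isolinux.cfg edit attempted in edit2 looks roughly like this; the label and file names here are generic placeholders, not necessarily what notfred's CD actually uses:

```
default linux
label linux
  kernel vmlinuz
  append initrd=initrd.img acpi=off
```

Kernel parameters like acpi=off go on the APPEND line, after any initrd= entry, so they are passed straight through to the kernel at boot.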
Your ideas intrigue me; I would like to purchase stock in your company.
 
notfred
Maximum Gerbil
Topic Author
Posts: 4610
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Tue Oct 31, 2006 10:34 pm

OK Damage, the new version is up, and it should support up to 255 CPUs if you have enough memory in the system. If there are more than 4 CPUs, it will just add extra copies of the benchmarking WUs to load up those extra CPUs before it starts the real benchmarking.
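notfred's scheme (dummy loaders on the spare cores, timed benchmarks on four) can be sketched roughly like this. The function names and the trivial busy-work are my own stand-ins, not the benchmark's actual code:

```python
import multiprocessing as mp

BENCH_WUS = 4  # the benchmark always times exactly four WUs

def plan_workers(n_cpus):
    """Return (dummy, benchmarked) worker counts for a machine with n_cpus.

    Every CPU beyond the four benchmarked ones gets a dummy copy of a
    benchmark WU, so cache and memory contention match real usage.
    """
    return max(0, n_cpus - BENCH_WUS), BENCH_WUS

def busy_work(steps):
    # Stand-in for a folding WU: pure CPU spin.
    x = 0.0
    for i in range(steps):
        x += i * 0.5
    return x

if __name__ == "__main__":
    dummies, benched = plan_workers(mp.cpu_count())
    # Dummy loaders: started to occupy the spare cores, never timed.
    loaders = [mp.Process(target=busy_work, args=(10**6,)) for _ in range(dummies)]
    for p in loaders:
        p.start()
    # ... time the `benched` real WUs here, then join the loaders ...
    for p in loaders:
        p.join()
```

On an 8-core box this plans 4 loaders plus the 4 timed WUs, which matches the "Loading processor 1..4" output Damage posted later in the thread.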
 
ecurb
Gerbil In Training
Posts: 1
Joined: Thu Nov 02, 2006 6:11 pm

Thu Nov 02, 2006 6:29 pm

cass wrote:
I don't think the folding processes can share any data, lest the results be skewed, because I was under the impression that each run of even the same WU produces slightly different results. Maybe some of the operations performed would be queued or stored in the shared cache and somehow shared, but that would assume that like WUs were started and continued at the same pace, and I don't think that's likely, with system tasks being run on different cores at different times.
Nothing is shared between cores or clients. Each is an independent process.

I think main memory, the disk subsystem, the instruction cache, and the L2 cache are all that's shared in a C2D system, right? I assume, and thoroughly believe, that folding does not stress these systems.


This is commonly true but not universally so. Most WUs depend almost exclusively on the speed of your floating point hardware (some SSE and some straight x86 code). There are exceptions, though, as iamajai has observed. The most notable were the QMD WUs, which depended almost exclusively on the bandwidth between cache and main RAM.

I guess there could be some skewing due to instruction cache sharing, but if you compared 4 cores all running Gromacs against 4 cores running a Tinker, a Gromacs, a Double Gromacs, and, say, an Amber, you could see if there was enough sharing to make a difference.
As mentioned above, there is no sharing. Each client loads its own copy of the core. There would be some skewing due to the I/O during checkpointing, but that's small enough that I wouldn't worry about it.

Do you have WUs representing all of the cores? Double Gromacs seems to be missing and perhaps others.

HT... didn't Stanford warn against running HT? More forum diving.
Yes, they did. A future enhancement might allow running one client per physical CPU rather than one per virtual CPU, but this is probably very tricky without knowledge of how Windows numbers the processors and without the ability to preset affinity. It's probably not worth the effort.

Nevertheless, it's an excellent program.
 
notfred
Maximum Gerbil
Topic Author
Posts: 4610
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Fri Nov 03, 2006 9:47 am

The benchmark does the following cores:
Core_65
Core_78
Core_82
Core_a0

Looking at the FAHWiki, I'm missing Double Gromacs and GB Gromacs. Technically I'm also missing QMD (I don't believe those are handed out any more) and the GPU core (which requires Windows, and the benchmark uses Linux).

In terms of ignoring HT CPUs, it's possible and not too hard. The benchmark runs under Linux, and the Linux scheduler is HT-aware, so I just need to ignore the HT CPUs and the scheduler will do the work for me in terms of CPU affinity and avoiding HT processors. However, it is not a priority at the moment - especially with HT declining in importance.
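A rough sketch of the "ignore HT siblings" idea, assuming the usual /proc/cpuinfo fields (physical id / core id); the parsing helper and the sample stanza are mine, not the benchmark's actual code:

```python
def physical_cpus(cpuinfo_text):
    """Keep one logical processor per (physical id, core id) pair.

    On an HT system each physical core appears twice in /proc/cpuinfo
    with the same (physical id, core id); we keep the first sibling.
    """
    keep, seen, current = [], set(), {}
    for line in cpuinfo_text.splitlines() + [""]:
        if ":" in line:
            key, _, val = line.partition(":")
            current[key.strip()] = val.strip()
        elif current:  # a blank line ends one processor's stanza
            ident = (current.get("physical id"), current.get("core id"))
            if ident not in seen:
                seen.add(ident)
                keep.append(int(current["processor"]))
            current = {}
    return keep

# Hypothetical P4-with-HT cpuinfo: 2 logical CPUs, 1 physical core.
SAMPLE = """\
processor : 0
physical id : 0
core id : 0

processor : 1
physical id : 0
core id : 0
"""
```

Running the benchmark only on the processors this returns would leave the HT siblings idle, which is roughly what "ignore HT CPUs" amounts to.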
 
Damage
Gerbil Jedi
Posts: 1787
Joined: Wed Dec 26, 2001 7:00 pm
Location: Lee's Summit, Missouri, USA
Contact:

Wed Nov 15, 2006 1:15 am

Hmm. Looks like there's a problem with more than four cores. It sees eight, but it only loads up four. Here's the output:

Processor Detection
Processor 0 is an Genuine Intel(R) CPU @ 2.66GHz
Processor 1 is an Genuine Intel(R) CPU @ 2.66GHz
Processor 2 is an Genuine Intel(R) CPU @ 2.66GHz
Processor 3 is an Genuine Intel(R) CPU @ 2.66GHz
Processor 4 is an Genuine Intel(R) CPU @ 2.66GHz
Processor 5 is an Genuine Intel(R) CPU @ 2.66GHz
Processor 6 is an Genuine Intel(R) CPU @ 2.66GHz
Processor 7 is an Genuine Intel(R) CPU @ 2.66GHz
Found 8 processors

Progress
Loading processor 1
Loading processor 2
Loading processor 3
Loading processor 4
Starting benchmark of Tinker WU
Starting benchmark of Amber WU
Starting benchmark of Bonus Gromacs WU
Starting benchmark of Gromacs3.3 WU

----

Doh? I think it may need a tweak. Let me know if you need me to test anything.
Scott Wasson - "Damage"
 
notfred
Maximum Gerbil
Topic Author
Posts: 4610
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Wed Nov 15, 2006 9:45 am

That's actually correct - "Loading" means running a dummy copy of the benchmark WUs to load that processor up. It will do that until it has 4 processors left, and then it will run the real benchmarks on those 4 remaining ones.
 
Damage
Gerbil Jedi
Posts: 1787
Joined: Wed Dec 26, 2001 7:00 pm
Location: Lee's Summit, Missouri, USA
Contact:

Wed Nov 15, 2006 12:24 pm

Ah, OK. Gotcha. Thanks. :oops:
Scott Wasson - "Damage"
 
Flying Fox
Gerbil God
Posts: 25690
Joined: Mon May 24, 2004 2:19 am
Contact:

Wed Nov 15, 2006 12:58 pm

I am confused too. :oops:

Does that mean it is running 8 instances of the benchmark or 4?
The Model M is not for the faint of heart. You either like them or hate them.

Gerbils unite! Fold for UnitedGerbilNation, team 2630.
 
Usacomp2k3
Gerbil God
Posts: 23043
Joined: Thu Apr 01, 2004 4:53 pm
Location: Orlando, FL
Contact:

Wed Nov 15, 2006 1:26 pm

Flying Fox wrote:
I am confused too. :oops:

Does that mean it is running 8 instances of the benchmark or 4?

It is running 8 instances of folding, but only benchmarking 4 of them.
 
Flying Fox
Gerbil God
Posts: 25690
Joined: Mon May 24, 2004 2:19 am
Contact:

Wed Nov 15, 2006 1:33 pm

Usacomp2k3 wrote:
Flying Fox wrote:
I am confused too. :oops:

Does that mean it is running 8 instances of the benchmark or 4?

It is running 8 instances of folding, but only benchmarking 4 of them.
Is that what we want? Seems odd to me we are not stressing all 8 cores? :roll:
The Model M is not for the faint of heart. You either like them or hate them.

Gerbils unite! Fold for UnitedGerbilNation, team 2630.
 
Usacomp2k3
Gerbil God
Posts: 23043
Joined: Thu Apr 01, 2004 4:53 pm
Location: Orlando, FL
Contact:

Wed Nov 15, 2006 1:42 pm

Flying Fox wrote:
Usacomp2k3 wrote:
Flying Fox wrote:
I am confused too. :oops:

Does that mean it is running 8 instances of the benchmark or 4?

It is running 8 instances of folding, but only benchmarking 4 of them.
Is that what we want? Seems odd to me we are not stressing all 8 cores? :roll:

All 8 of the cores are being 'stressed' in that they are each running an instance of folding. However, benchmark data is only being collected on 4 of them, while the others are just folding for fun (or just to make sure shared cache and the like are allocated as they would be under real usage).

*disclaimer. This is to the best of my knowledge. Obviously notfred will have to verify/correct my statements.
 
notfred
Maximum Gerbil
Topic Author
Posts: 4610
Joined: Tue Aug 10, 2004 10:10 am
Location: Ottawa, Canada

Wed Nov 15, 2006 4:10 pm

Usacomp2k3 has it right. In fact, the folding instances run on those spare cores are just copies of the benchmark folding WUs, but we don't collect the data on how long they run.
 
koinkoin
Gerbil In Training
Posts: 3
Joined: Thu Dec 28, 2006 6:29 am

Thu Dec 28, 2006 6:36 am

Will you release a version of the Benchmark Folding CD with a proxy setting?
I wanted to test some servers at my company, but they all need an HTTP proxy setting to work, so the Benchmark CD doesn't work :(
It would be nice to have a custom Benchmark CD which can be configured to use an HTTP proxy.
I tried to modify the source by using export http_proxy ... to get wget to use the proxy, etc., but I can't compile it; I have problems with the glibc compilation on Debian Etch:

awk -f scripts/gen-sorted.awk \
               -v subdirs='csu assert ctype locale intl catgets math setjmp signal stdlib stdio-common libio malloc string wcsmbs time dirent grp pwd posix io termios resource misc socket sysvipc gmon gnulib iconv iconvdata wctype manual shadow po argp crypt nss localedata timezone rt conform debug  dlfcn elf' \
               -v srcpfx='' \
               nptl/sysdeps/pthread/Subdirs sysdeps/unix/inet/Subdirs sysdeps/unix/Subdirs assert/Depend intl/Depend catgets/Depend stdlib/Depend stdio-common/Depend libio/Depend malloc/Depend string/Depend wcsmbs/Depend time/Depend posix/Depend iconvdata/Depend nss/Depend localedata/Depend rt/Depend debug/Depend > /root/benchmark/glibc/sysd-sorted-tmp
awk: extra ] at source line 19 source file scripts/gen-sorted.awk
 context is
          >>>  sub(/\/[^/] <<< +$/, "", subdir);
awk: nonterminated character class \/[^
 source line number 20 source file scripts/gen-sorted.awk
make[1]: Leaving directory `/root/benchmark/glibc-2.4'
make[1]: Entering directory `/root/benchmark/glibc-2.4'
awk -f scripts/gen-sorted.awk \
               -v subdirs='csu assert ctype locale intl catgets math setjmp signal stdlib stdio-common libio malloc string wcsmbs time dirent grp pwd posix io termios resource misc socket sysvipc gmon gnulib iconv iconvdata wctype manual shadow po argp crypt nss localedata timezone rt conform debug  dlfcn elf' \
               -v srcpfx='' \
               nptl/sysdeps/pthread/Subdirs sysdeps/unix/inet/Subdirs sysdeps/unix/Subdirs assert/Depend intl/Depend catgets/Depend stdlib/Depend stdio-common/Depend libio/Depend malloc/Depend string/Depend wcsmbs/Depend time/Depend posix/Depend iconvdata/Depend nss/Depend localedata/Depend rt/Depend debug/Depend > /root/benchmark/glibc/sysd-sorted-tmp
awk: extra ] at source line 19 source file scripts/gen-sorted.awk
 context is
          >>>  sub(/\/[^/] <<< +$/, "", subdir);
awk: nonterminated character class \/[^
 source line number 20 source file scripts/gen-sorted.awk
make[1]: *** No rule to make target `/root/benchmark/glibc/Versions.all', needed by `/root/benchmark/glibc/abi-versions.h'. Stop.
make[1]: Leaving directory `/root/benchmark/glibc-2.4'
make: *** [all] Error 2
cp: cannot stat `libc.so.6': No such file or directory
cp: cannot stat `nptl/libpthread.so.0': No such file or directory
cp: cannot stat `math/libm.so.6': No such file or directory
cp: cannot stat `elf/ld-linux.so.2': No such file or directory
cp: cannot stat `nss/libnss_files.so.2': No such file or directory
cp: cannot stat `resolv/libnss_dns.so.2': No such file or directory
cp: cannot stat `resolv/libresolv.so.2': No such file or directory
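For what it's worth, the environment-variable approach koinkoin describes is the standard convention wget follows, and Python's urllib reads the same variable; a tiny sketch (the proxy host here is a made-up placeholder, not a real server):

```python
import os
import urllib.request

# wget, curl, and urllib all honor the http_proxy environment variable;
# on the boot CD you would export this before any wget call runs.
os.environ["http_proxy"] = "http://proxy.example.com:3128"  # placeholder host

# getproxies() reports the proxy map wget-style tools would use.
proxies = urllib.request.getproxies()
print(proxies["http"])
```

So the CD change koinkoin wants is just to set that variable (from a boot parameter, say) before the scripts fetch anything.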


 
drfish
Gerbil Elder
Posts: 5546
Joined: Wed Jan 22, 2003 7:53 pm
Location: Zeeland, MI

Thu Dec 28, 2006 7:26 am

Welcome, Francophoner! I can't help you with your question, but I wanted to be the first to say hi! :)
Last edited by drfish on Thu Dec 28, 2006 10:38 am, edited 1 time in total.
 
Dposcorp
Minister of Gerbil Affairs
Posts: 2771
Joined: Thu Dec 27, 2001 7:00 pm
Location: Detroit, Michigan

Thu Dec 28, 2006 7:57 am

Yes, Francophoner, hello and be welcome.
 
koinkoin
Gerbil In Training
Posts: 3
Joined: Thu Dec 28, 2006 6:29 am

Thu Dec 28, 2006 10:33 am

Thank you Dposcorp, drfish :)

I hope my English isn't too bad, and that notfred will be able to help me :P
