TR Forums

Wed Jun 25, 2003 9:55 pm

If you can (and want to, incidentally), run it on a SDRAM system of similar CPU spec.... I reckon it'd be a fair bit slower, but that's just my instinct talking...

OK... nothing like an invitation to truly "geek out"... My "similar" SDR system was actually a 1700+ overclocked to 1800+ speed; that shouldn't make a difference. I tried it with both registered and unregistered SDR DIMMs, and also ran the test on a few other systems as well just for grins.

I used a modified version of the program without the output loop, so only the time of the prime calculation was being tested, not the speed of the printf() function.

The program was run with an upper bound of 100000000.

Results, from slowest to fastest:

K6-III+ 500MHz, PC-100 SDR - 33.7 secs
Athlon TBird 800MHz, PC-100 SDR - 15.1 secs
Athlon XP 1.53GHz (1800+ rating), PC-133 registered SDR - 13.7 secs
Athlon XP 1.53GHz (1800+ rating), PC-133 SDR - 12.3 secs
Athlon TBird 1.1GHz, PC-2100 registered DDR - 11.1 secs
Athlon XP 1.53GHz (1800+ rating), PC-2100 DDR - 8.4 secs

So yeah, I'd say it looks like it is generally limited by memory performance, except on the 500MHz K6-III+ (which just generally sucks at this test).

This was a mix of Win2K and Linux systems... but I used the g++ compiler on the Windows side as well, to eliminate the compiler as a variable.

liquidsquid · Wed Jun 25, 2003 9:57 pm

IntelMole, I really don't know what you are asking. Actually I have never programmed VB within any office products, so I don't know it's limitations there with array declaration. Can you give me some examples of what you are trying to do?

I simply used a list box because it made for a handy output. If you wanted to save to a file it would be essentially like this:

 dim F as long 'At the top of the routine

  F = freeFile
  open "results.txt" for output as #F

  for i = 2 to upperBound
   if a(I) then print #f, a(I)
  next i

  close #f

Keep in mind I wrote this without cheking it.

Boolean types in VB are still 32-bit, which sucks. Typically I will use them for clarity. There is however a Byte type, which boolean operations work fine on. That will cut requirements in one quarter on the one change alone. My goal with code was to simply translate, not improve.

Yes, writing results as found to be true in the loop would save some speed. Having a bit array to hold True/False data may be less memory hoggish, but much slower as determining a bit state vs. a byte state is slower due to masking then testing rather than just testing. Then agan that is much fewer memory fetches as a sequence of 32 bits could be saved right in a register if the compiler is good enough to recognize this and keep it there.

The limit of items in listBoxes is something like 2^16 values in Win98, but under NT it is 2^32 values, I think. Piles of results anyway.

Anyhow the limit of the VB routine is for testing primes up to the 32-bit value of long positivly, or 2^31. Anything larger and VB wont handle it without getting fancy. By this I mean using currency types, which are 64-bit integers, but hidden to look like fixed-point floats.

Maybe if I can get off of the toilet for a few minutes I will write a few versions and benchmark them based on the original.

-LS

IntelMole · Thu Jun 26, 2003 4:50 am

Sos mate... here's the code I'm using at the moment, it should work since I've just taken the old code from above, and used the array instead...

Sub CalcPrime2()

'Prime Calculator using array of primes

PrimeTest = Range("A2")
PrimeDiv = 0
flag = 0

'Get list of primes from Cells
While Cells(ArrayNumber, 3) > 0

PrimeArray(ArrayNumber) = Cells(ArrayNumber, 3)
ArrayNumber = ArrayNumber + 1

Wend

'Increments the number to be tested
'Gets number of factors of number-> flag

Do
PrimeTest = PrimeTest + 1

PrimeDivision:
flag = 0
ArrayRef = 1
While ArrayRef <= (PrimeTest) / 2
DivValue = PrimeTest / PrimeArray(ArrayRef)
If DivValue = 1 Then
flag = flag + 1
    ElseIf DivValue = Int(DivValue) Then
            flag = flag + 1
                ElseIf DivValue = PrimeTest Then
                        flag = flag + 1
    End If
ArrayRef = ArrayRef + 1
Wend

If flag = 1 Then GoTo PrimeConfirm
If flag <> 1 Then PrimeTest = PrimeTest + 1
GoTo PrimeDivision

'Puts the confirmed prime in the Range, resets flag
PrimeConfirm:
Range("A2") = PrimeTest
Cells(ArrayRef, 3) = PrimeTest
flag = 0
Loop Until 1 > 1 + 1 'loops indef.
End Sub

I just get errors like "400"

, or "Expected array" (with highlight of the array reference), so I'm pretty sure that it's to do with the array syntax...

That 400 message... I haven't a clue :lol:

,
IntelMole

muyuubyou · Thu Jun 26, 2003 5:47 am

I wrote some nice remarks on the subject yesterday, but I spilled my tea over it and I couldn't send it

When I get back home I'll send you some stuff that will make your proggie overkill

IntelMole · Thu Jun 26, 2003 7:55 am

Tsk tsk you should know better muyuubyou :-P

<said while eating food himself>

lol, cheers matey,
IntelMole

muyuubyou · Thu Jun 26, 2003 2:19 pm

This is a deep topic if you really want to make it near-perfect.

As justBrewIt said, you only have to check up to sqrt(n) to get factors. That's a pretty basic improvement.

There are non-prime numbers not containing x in their factor list, where x is any number. Just take any prime number, like 7. 7*7 = 49. 49 is not prime and it's only factor is 7. 13*13*13 has only 1 prime factor: 13 , and so on.

If you want to make it real high-level. Include Miller (rather hard) or Fermat primality test (aka Fermat's little theorem): if p is prime and a is coprime to p, then a^(p-1) - 1 is divisible by p. Thus, a^(p-1) mod p = 1

PGP uses this, using 2,3,5,7 . Of course those are coprimes to anything since they're primes.

The check goes like this:
2^(p-1) mod p = 1 ?
3^(p-1) mod p = 1 ?
5^(p-1) mod p = 1 ?
7^(p-1) mod p = 1 ?
--------------------------
If it fails for any of the above, it's not prime. If it doesn't it's probably prime and you have to check it. Most of the numbers are not prime and get filtered by this quick check, so it will improve your algorithm noticeably.

Try p=7
2^6 mod 7 = 1 ? y
3^6 mod 7 = 1 ? y
and so on (if all the filters are passed, you have to try another test, or just factoring)

Try p=15
2^14 mod 15 = 1 ? n
not prime

Miller test is harder to implement.

liquidsquid · Thu Jun 26, 2003 2:49 pm

The only problem is the size of the numbers you will be dealing with in this method. 7^n will get very large very quickly, so the largest 32-bit value you can check (assuming signed math) is 7^11, or testing up to a value of 12. Another method must be found to determine the outcome of 7^12 mod 13 without trying to hold the value of 7^12 before finding the remainder of the division. Going to double-precision will get you further, but not much as the value increases exponentially.

The non-efficient method may be slower, but it DOES test all the way up to the maximum that a long can hold.

Edit... I am not good enough at math to make routines that will find values well outside of the range of longs using integer math for speed.

-LS

Thu Jun 26, 2003 3:24 pm

FWIW, many C++ compilers support 64-bit integers (__int64 in Visual C++, and long long in g++).

muyuubyou · Thu Jun 26, 2003 3:29 pm

liquidsquid wrote:
The only problem is the size of the numbers you will be dealing with in this method. 7^n will get very large very quickly, so the largest 32-bit value you can check (assuming signed math) is 7^11, or testing up to a value of 12. Another method must be found to determine the outcome of 7^12 mod 13 without trying to hold the value of 7^12 before finding the remainder of the division. Going to double-precision will get you further, but not much as the value increases exponentially.

The non-efficient method may be slower, but it DOES test all the way up to the maximum that a long can hold.

Edit... I am not good enough at math to make routines that will find values well outside of the range of longs using integer math for speed.

-LS

You can overcome that problem with some modulo-math tweaks, and an arbitrary precision library. Powers are much faster to compute with arbitrary precission than factors. Too bad I can't find my original code

muyuubyou · Thu Jun 26, 2003 3:51 pm

I found this: http://www.perfsci.com/free/giantint/

There you have really interesting stuff.

If you want to get theoretical, check here:
http://www.cse.iitk.ac.in/news/primality_v3.pdf

There are recent, and very important, discoveries in the last one.

moog · Thu Jun 26, 2003 4:34 pm

Edit: Yes, eliminating the even flags would cut the array size in half. It only eliminates extra work on the very first sweep (the one to mark all of the even numbers) though.

2x bandwidth. In addition to eliminating the even #'s sweep, we can also skip even numbers in the loops (i++ --> i+=2, the j loop would run 1/2 as many times) would also eliminate an if (!TestBit(a,n)) for the even numbers and possible branch mispredictions which are bad on deeply pipelined CPUs (like this P4).

Edit: Actually the j loop would run 1/4 as many times.

Thu Jun 26, 2003 5:00 pm

2x bandwidth. In addition to eliminating the even #'s sweep, we can also skip even numbers in the loops (i++ --> i+=2, the j loop would run 1/2 as many times) would also eliminate an if (!TestBit(a,n)) for the even numbers and possible branch mispredictions which are bad on deeply pipelined CPUs (like this P4).

Edit: Actually the j loop would run 1/4 as many times.

Not true. The j loop runs only when we find a new prime, to mark all numbers that have the new prime as a factor. Since all even numbers are knocked out on the first pass, we never enter the j loop for even numbers greater than 2.

For similar reasons, the memory bandwidth isn't halved (even though the array size is). Once the stride length of the j loop exceeds the size of a cache line, you're going to transfer one cache line to/from memory for each flag you set regardless.

Your comment about branch mispredicts makes sense; since the branches are semi-random, they will be hard to predict and the BPU will likely get it wrong a large percentage of the time. But the mispredict penalty occurs in the i (outer) loop, so it will probably only have a second-order effect on overall execution speed.

moog · Thu Jun 26, 2003 6:27 pm

For similar reasons, the memory bandwidth isn't halved (even though the array size is). Once the stride length of the j loop exceeds the size of a cache line, you're going to transfer one cache line to/from memory for each flag you set regardless.

If the array is halved, we only need to transfer half as much memory per i loop iteration.

Not true. The j loop runs only when we find a new prime, to mark all numbers that have the new prime as a factor. Since all even numbers are knocked out on the first pass, we never enter the j loop for even numbers greater than 2.

True, all even #'s are eliminated. However, we can do a bit better. Right now it's j += i, on every even iteration it's reflagging an even number. Off the top of my head, I'd say you have to make it something like (j = i*2+i; j <= n; j += i*2). Further packing by removing the even number flags would make it slightly different. I misstated in the last post, we can cut 1/2 the j iterations.

But the mispredict penalty occurs in the i (outer) loop, so it will probably only have a second-order effect on overall execution speed.

Yes.

liquidsquid · Fri Jun 27, 2003 11:27 am

just brew it! wrote:
FWIW, many C++ compilers support 64-bit integers (__int64 in Visual C++, and long long in g++).

Ok, so what then is the largest value we could test? 2^63 = 9223372036854775808 (assuming signed math), then find max for 7^max that is within this number, or use log(9223372036854775808) / log(7) = 22.44, or the largest test is 22. Again extended integers will similarly be limited to a point, and will get so large it will be too slow to operate with.

-LS

moog · Wed Jul 02, 2003 1:04 pm

Hi, I had some free time so I did what I suggested earlier: cut out the even # flags, cut out even # j iterations. Here's the modified code:

//  Sieve of Eratosthenes - packed bit flag version

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// Define a useful type
typedef unsigned char byte;

// Macro SetBit(a, n) sets the nth bit of byte array a
#define SetBit(a, n) ((a)[(n) >> 3] |= (1 << ((n) & 0x7)))

// Macro TestBit(a, n) tests the nth bit of byte array a
#define TestBit(a, n) (((a)[(n) >> 3] & (1 << ((n) & 0x7))) != 0)


int main(int argc, char *argv[]) {
   // Make sure user specified upper bound
   if (argc > 1) {
      // Get upper bound from command line
      int n = atoi(argv[1]);
      if (n > 0) {
         int alen = (n-3) / 16 + 1;
         byte *a = new byte[alen];
         for (int i = 0; i < alen; i++)
            a[i] = 0;
         int sqrtn = (int)sqrt(n);
         int lim = (n-3) / 2;
         for (int i = 3, j = 0; i <= sqrtn; i += 2, j++) {
            if (!TestBit(a, j)) { // i (= j*2 + 3) is prime
               for (int k = i+j; k <= lim; k += i)
                  SetBit(a, k);
            }
         }
         printf("2\n");
         for (int i = 3, j = 0; i <= n; i += 2, j++) {
            if (!TestBit(a, j))
               printf("%d\n", i);
         }
         delete [] a;
      }
   }
   else {
      printf("please specify highest desired prime\n");
   }
   return 0;
}

As far as explaining the code, bit flag 0 represents 3, flag 1 for 5, flag 2 for 7, etc. Since there's no flag for 2, I had to hardcode it to be printed to make the output the same as jbi's original version.

Here's the benchmark:
Config: dual 486, 512MB, linux, g++ 3.2 with the -O3 switch, gprof 2.13.90.0.2, sieve ran with n = 10000000

Here's what gprof gives me:

Flat profile:

Each sample counts as 0.00195312 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 70.52      2.83     2.83        3   942.06   942.06  sieve1(int)
 29.48      4.01     1.18        3   393.88   393.88  sieve2(int)

sieve1 - jbi's packed flag version
sieve2 - further packed version
Timings include mem alloc/dealloc but not printing the #'s. Anyway I always take gprof with a grain of salt, so I'm just going to say that this version, gives a speed boost of 2-3X while giving a memory saving of 2X. Yeah! Anyway, this result is not really a surprise as eliminating the even #'s reduces your processing by 1/2.

Maybe there are ways of making this faster still.
And if I were to enter the maximum possible unsigned value, n = 2^32, it would still take ~134MB of memory in this version... not good. (anyway, there is a way to still use a sieve using a fixed size array)
[/code]

IntelMole · Wed Jul 02, 2003 4:39 pm

Oooooh hardcore tweakers thread

As it goes, I've managed to get through my ANSI C book nearly... and what did I come across today?

Bit-fields and bit-twiddling :lol:

That's a fairly impressive tweak there, cutting down the IO by that much...

And to think that I started this all off because a friend wanted to play with prime numbers... he didn't even want this macro I made... :-D

How's that for appreciation :-D

,
IntelMole

Wed Jul 02, 2003 5:05 pm

That's a fairly impressive tweak there, cutting down the IO by that much...

I wonder how many times faster this latest version is than the original VB macro?

IntelMole · Wed Jul 02, 2003 5:15 pm

lol A fair few orders of magnitude :-D

But in a sense, it's a different program... although I suppose even the original VB macro would get stuck at some number...
IntelMole

moog · Wed Jul 02, 2003 5:28 pm

That's a fairly impressive tweak there, cutting down the IO by that much...

Thanks!

Just imagine how fast it might get if ppl from the comp.lang newsgroups or Carmack did a version!

I wonder how many times faster this latest version is than the original VB macro?

Actually you know, a lot of students here write their simulations in matlab. With loops instead of matrices! (which is worse for those who don't know matlab)
General student rule: what takes two days in badly written C will take a month in matlab.

IntelMole · Wed Jul 02, 2003 5:34 pm

moog wrote:
That's a fairly impressive tweak there, cutting down the IO by that much...

Thanks! Just imagine how fast it might get if ppl from the comp.lang newsgroups or Carmack did a version!

Well, maybe not all that much faster, I certainly can't really think of any tweaks that would improve it that much... even looking at register variables lol

Maybe we should submit this to the x86-64 compiling thread that was on another forum... If you've no idea what I'm talking about, there was a group looking to compile and/or optimise source for x86-64 ... maybe they could think of something.

Hell, I'm sure Carmack could think of something though ... Quantum computing/simulation or something I wonder :-D

...

Maybe he'd make the computer levitate </sarcasm>,
IntelMole

Wed Jul 02, 2003 5:48 pm

But in a sense, it's a different program... although I suppose even the original VB macro would get stuck at some number...

Quite true. The original program was not limited by the need for a flag bit for each potential prime. Better for testing individual numbers for "primeness", but much slower for generating a list of primes starting from 2.

IntelMole · Wed Jul 02, 2003 6:01 pm

I doubt it, but does a "flag" type exist? Or could the type be defined?

If it does, I suppose that would also speed the program a little, by cutting out the need for bit-twiddling...

It seems to me a little weird that after all these years of programming languages, and people using things like:

[i]condition[/i] ? [i]Yes[/i] : [i]No[/i]

or

if ([i]condition[/i])
    [i]yes[/i];
else [i]No[/i]

that we don't have a bit operator like this.

Just an idle thought though...
IntelMole

Wed Jul 02, 2003 6:06 pm

Not sure quite what you're getting at. A boolean flag really is an abstraction of the simplest unit of information inside the computer, namely a single bit. Not sure how you could simplify it much further...

I suppose the compiler could allow you to declare arrays of bits directly.

Or you could implement your own "bit array" object, using C++ classes. But there'd probably be a little more overhead than just doing the raw bit twiddling.

IntelMole · Wed Jul 02, 2003 6:13 pm

Ah... I forgot that one...

That's the One I meant then :-D

Would it not then be easier to create an array of booleans then, rather than bit-twiddle all the way? :-D

IntelMole

Wed Jul 02, 2003 6:24 pm

Damn, this is getting confused. I didn't mean that current C++ compilers allow you to declare arrays of bits; I was just speculating that it would be a useful feature to have.

You could write a C++ wrapper class that would emulate an array of bits. But internally, the class would still need to do the bit twiddling. The upside is that the class would be reusable, for any application that required an array of booleans.

moog · Wed Jul 02, 2003 6:31 pm

Yeah, it would be easier, but with bit-twiddling you're the absolute boss, there's less room for the compiler to play around. You're almost at the assembler lvl.
Mayb jbi knows better, but I think bit-twiddling was the best/easiest thing to do for this case. Personally I just find bit-twiddling and low lvl issues fun.

And yeah you can do bit fields in C, like

struct EmptyIE {
unsigned int start_ptr_1 : 13, stop_ptr_1 : 11;
//unsigned int : 8;
};

EmptyIE tst = { 1, 2 };

13 least significant bits for start_ptr_1 = 1, next 11 bits for stop_ptr_1 = 2. The compiler has the discretion to use as many bytes to accomodate the bit field (I'm guessing here a 32 bit int). The compiler should then generate appropriate code when you use the bit fields. Your own bit crunching will be just as effective anyhow.
I don't know about compiler support for this, but it worked on borland 5.0.

IntelMole · Wed Jul 02, 2003 6:38 pm

Yeah I was considering a structure command for speeding up the program, but I reckon it'd be a bit counter-productive, since the array is sort of two arrays in one, the way you operate it... i.e. operating on the array element number, then changing the element's value... making a structure a bit redundant...

Damn, this is getting confused. I didn't mean that current C++ compilers allow you to declare arrays of bits; I was just speculating that it would be a useful feature to have.

Yeah, that's what I was thinking :-D

AFAIK though you can't specify variable types... AFAIK, about 80% into the C book :-D

IntelMole

moog · Wed Jul 02, 2003 6:39 pm

Would it not then be easier to create an array of booleans then, rather than bit-twiddle all the way?

It would be easier. While you don't need extra bit ops, it doesn't necessarily make it faster. The very first jbi version used an array of boolean. He noted his packed version even with the bit ops was slightly faster than the first.

IntelMole · Wed Jul 02, 2003 6:57 pm

moog wrote:
Would it not then be easier to create an array of booleans then, rather than bit-twiddle all the way?

It would be easier. While you don't need extra bit ops, it doesn't necessarily make it faster. The very first jbi version used an array of boolean. He noted his packed version even with the bit ops was slightly faster than the first.

Okay, hard techie quiz for people: I'm guessing that's because of the more stop-start nature of a boolean array causing higher latency access to the RAM... right?

I can't really see any other reason for it...
IntelMole

Wed Jul 02, 2003 7:41 pm

The native C++ boolean array was slower because C++ compilers don't normally pack boolean arrays as tightly as we did with the bit twiddling. The compiler will allocate a byte for each element, wasting 7 bits out of each byte. So for the sieve algorithm, you basically get clobbered on memory bandwidth.

The bit twiddling approach is about as fast as it gets for this sort of thing. A C++ wrapper class to implement a true bit (boolean) array would be more "elegant", but almost certainly slower than the raw bit twiddling. (It would probably be faster than using the native bool type for stuff like this though, since you'd still win on the memory bandwidth.)

TR Forums

Prime Number Calculator

Who is online