 
mikewinddale
Gerbil First Class
Topic Author
Posts: 164
Joined: Sat Jan 07, 2017 2:22 am

Benchmarking 48 cores, 96 threads

Tue Mar 26, 2019 10:40 pm

So I've thought about buying a ThreadRipper, but in the meantime, I thought I'd try out the Google Compute Engine, which lets you operate a cloud VM from Google's servers. (Recently, Google announced that the record for calculating digits of pi had been broken using the Google Compute Engine.)

So I created an Ubuntu VM with 48 Xeon cores and 96 threads, got a working GUI with VNC and RDP, and benchmarked it in Geekbench 4. For anyone curious, here's how a Google 48-core Xeon VM compares to a Ryzen 7 2700X.

And here's a screenshot of a Google + Ubuntu + Xfce + tightvnc + xrdp VM, running Chrome. I've since switched to LXDE, since it seems to be smoother and I also like the interface more.

There's no 3D acceleration, so resolution and such are a little hit and miss. I thought I was stuck at 1024x768 because tightvncserver -geometry 1600x900 -depth 32 kept failing. But later, I discovered that it just couldn't handle 32 bit color. tightvncserver -geometry 1920x1080 -depth 24 is working just fine. So now I'm running LXDE at 1920x1080 with 24 bit color.
 
Waco
Gold subscriber
Minister of Gerbil Affairs
Posts: 2815
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Benchmarking 48 cores, 96 threads

Wed Mar 27, 2019 9:30 am

It looks pretty gimped from a memory-bandwidth aspect. It clearly hampered scaling up to that many cores/threads as well.
Desktop: Z170A | 6700K @ 4.4 | 32 GB | Radeon VII | XSPC RX360 | Heatkiller R3 | Samsung 4K 40" | 1 TB NVME + 2 TB SATA + LSI 9207-8i (128x8) SSD
NAS: 1950X | Designare EX | 32 GB ECC | 7x8 TB RAIDZ2 | 8x2 TB RAID10 | FreeNAS | ZFS | LSI SAS
 
MileageMayVary
Gerbil XP
Posts: 327
Joined: Thu Dec 10, 2015 9:18 am
Location: Baltimore

Re: Benchmarking 48 cores, 96 threads

Wed Mar 27, 2019 2:39 pm

The Ryzen doesn't lose nearly as badly as the simple core count ratio would make you think.

8 cores @ 3.7 GHz vs 48 cores @ 2.0 GHz and the 48 is 2.5 times as fast overall with 3.25 times the cycles?
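As a rough check of that arithmetic, here is a minimal Python sketch. It assumes the nominal clocks quoted above (3.7 GHz and 2.0 GHz) and treats cores x GHz as a crude proxy for throughput, ignoring IPC, boost clocks, and SMT. The 2.5x figure is taken from the post, and the variable names are only illustrative.

```python
# Crude "aggregate cycles" comparison; ignores IPC, turbo and SMT.
ryzen_cores, ryzen_ghz = 8, 3.7    # Ryzen 7 2700X, nominal clock from the post
xeon_cores, xeon_ghz = 48, 2.0     # Google Compute Engine VM, as reported

cycle_ratio = (xeon_cores * xeon_ghz) / (ryzen_cores * ryzen_ghz)
observed_speedup = 2.5             # overall multi-core ratio quoted in the post

# Fraction of the "ideal" cycle advantage that actually shows up in the score.
scaling_efficiency = observed_speedup / cycle_ratio

print(f"cycle ratio:        {cycle_ratio:.2f}x")
print(f"observed speedup:   {observed_speedup:.2f}x")
print(f"scaling efficiency: {scaling_efficiency:.0%}")
```

Under those assumptions the VM delivers roughly three quarters of what the raw cycle count would suggest.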

Curious where scaling out plateaus (of course this would be different for every application).
Main rig: Ryzen 1600, R9 290@1100MHz, 16GB@2933MHz, 1080-1440-1080 Ultrasharps.
 
Waco
Gold subscriber
Minister of Gerbil Affairs
Posts: 2815
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Benchmarking 48 cores, 96 threads

Wed Mar 27, 2019 4:01 pm

MileageMayVary wrote:
The Ryzen doesn't lose nearly as badly as the simple core count ratio would make you think.

8 cores @ 3.7 GHz vs 48 cores @ 2.0 GHz and the 48 is 2.5 times as fast overall with 3.25 times the cycles?

Curious where scaling out plateaus (of course this would be different for every application).

...but the memory bandwidth is almost identical between them. It shouldn't be, but clearly there's something funky with the test or the Google system.
 
Redocbew
Gold subscriber
Gerbil Jedi
Posts: 1878
Joined: Sat Mar 15, 2014 11:44 am

Re: Benchmarking 48 cores, 96 threads

Wed Mar 27, 2019 4:21 pm

I haven't done it, but I wouldn't expect testing a virtual server to give exactly the same results as testing a physical machine. There have got to be some artifacts caused by the partitioning of physical resources and the overhead of the virtual machine itself.
Do not meddle in the affairs of archers, for they are subtle and you won't hear them coming.
 
Waco
Gold subscriber
Minister of Gerbil Affairs
Posts: 2815
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Benchmarking 48 cores, 96 threads

Wed Mar 27, 2019 4:45 pm

Redocbew wrote:
I haven't done it, but I wouldn't expect testing a virtual server to give exactly the same results as testing a physical machine. There have got to be some artifacts caused by the partitioning of physical resources and the overhead of the virtual machine itself.

It can be close if configured to be that way. Hypervisor overhead isn't as terrible as it used to be.
 
mikewinddale
Gerbil First Class
Topic Author
Posts: 164
Joined: Sat Jan 07, 2017 2:22 am

Re: Benchmarking 48 cores, 96 threads

Wed Mar 27, 2019 6:12 pm

I just ran another benchmark, this one from 7zip.

I am in my office, so my computer here is a 4-core Core i7-7700 @ 3.60 GHz with Windows 10.

Instructions:
(1) Windows: https://www.7-cpu.com/utils.html --> download 7bench1400.7z --> extract --> execute "7zr64.exe b" in cmd.exe
(2) Linux: sudo apt-get install p7zip-full p7zip-rar --> execute "7z b" in the terminal
(3) Details on command line parameters at https://sevenzip.osdn.jp/chm/cmdline/commands/bench.htm.

For both, I specified a dictionary size of up to 27, meaning I added "-md27" as a parameter to both. Thus, "7zr64.exe b -md27" and "7z b -md27".

Results are:

Windows, 4 core / 8 thread Core i7-7700 @ 3.60 GHz
7-Zip (r) [64] 9.22 beta : Igor Pavlov : Public domain : 2011-04-18

RAM size:  16247 MB,  # CPU hardware threads:   8
RAM usage:  6181 MB,  # Benchmark threads:      8

Dict        Compressing          |        Decompressing
      Speed Usage    R/U Rating  |    Speed Usage    R/U Rating
       KB/s     %   MIPS   MIPS  |     KB/s     %   MIPS   MIPS

22:   23669   582   3954  23026  |   266240   788   3049  24011
23:   23741   608   3980  24189  |   264192   796   3035  24170
24:   23133   630   3947  24873  |   260476   793   3047  24161
25:   22176   677   3742  25320  |   253447   785   3038  23833
26:   21392   704   3704  26068  |   252550   796   3023  24062
27:   17734   709   3265  23153  |   242395   771   3034  23388
----------------------------------------------------------------
Avr:          652   3765  24438               788   3037  23938
Tot:          720   3401  24188


Linux, Google Compute Engine, 48 core / 96 thread Xeon @ 2.00 GHz
7-Zip [64] 16.02 : Copyright (c) 1999-2016 Igor Pavlov : 2016-05-21
p7zip Version 16.02 (locale=C.UTF-8,Utf16=on,HugeFiles=on,64 bits,96 CPUs Intel(R) Xeon(R) CPU @ 2.00GHz (50653),ASM,AES-NI)
           Intel(R) Xeon(R) CPU @ 2.00GHz (50653)
CPU Freq:  2653  2660  2660  2658  2659  2658  2659  2657  2658
RAM size:   86984 MB,  # CPU hardware threads:  96
RAM usage:  77244 MB,  # Benchmark threads:     96
                       Compressing  |                  Decompressing
Dict     Speed Usage    R/U Rating  |      Speed Usage    R/U Rating
         KiB/s     %   MIPS   MIPS  |      KiB/s     %   MIPS   MIPS
22:     103181  5928   1693 100375  |    1926004  8959   1834 164237
23:     114241  6557   1775 116398  |    1777923  8548   1801 153839
24:     115402  6559   1892 124080  |    1864691  8858   1849 163656
25:     108231  6656   1858 123574  |    1762957  8539   1837 156877
26:      93732  6396   1787 114218  |    1427710  7623   1689 128778
27:      87494  6279   1820 114233  |    1290569  7262   1625 117948
----------------------------------  | ------------------------------
Avr:            6396   1804 115480  |             8298   1773 147556
Tot:            7347   1788 131518


So let's see if those numbers are in the right ballpark.

R/U MIPS is a rating normalized for one CPU thread, at 100% utilization.
Rating MIPS is the rating for multi-threading.

Compression: let's look at Avr.
Core gets 3,765 / 24,438. Xeon gets 1,804 / 115,480.
Thus, Core is 2.09 times faster in single-thread. Considering that we are comparing a 3.60 GHz Core to a 2.00 GHz Xeon, being twice as fast sounds about right.
The Xeon is 4.73 times faster in multi-threading. It has 12 times the cores but half the clock speed, so let's say it should have been about 6 times faster. But it's only 4.73 times faster.

Decompression:
Core gets 3,037 / 23,938. Xeon gets 1,773 / 147,556.
Thus, Core is 1.71 times faster in single-thread. Since the Core has about twice the clock speed, this seems about right. In fact, 3.6 GHz / 2.0 GHz is 1.8, so being 1.71 times faster is almost exactly what we'd expect.
The Xeon is 6.16 times faster in multi-threading. This almost perfectly matches up with my expectation that having 12 times the cores at half the speed should have been 6 times faster.

So it seems like decompression is scaling almost perfectly with clock speed and cores. It's compression where the Xeon under-performs.
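For anyone who wants to re-run the ratios, here is a small Python sketch. The figures are the Avr rows from the two 7-Zip runs above. The "naive expectation" assumes throughput scales with core count and nominal clock only, which is a simplification, and it uses the exact 3.6 / 2.0 clock ratio rather than rounding to "half", so it comes out a little above 6.

```python
# Avr-row figures copied from the two 7-Zip runs above: (R/U MIPS, Rating MIPS).
core = {"compress": (3765, 24438), "decompress": (3037, 23938)}
xeon = {"compress": (1804, 115480), "decompress": (1773, 147556)}

core_cores, core_ghz = 4, 3.6
xeon_cores, xeon_ghz = 48, 2.0

# Naive expectation: multi-thread speedup = core-count ratio / clock ratio.
expected_multi = (xeon_cores / core_cores) / (core_ghz / xeon_ghz)

results = {}
for test in ("compress", "decompress"):
    core_ru, core_rating = core[test]
    xeon_ru, xeon_rating = xeon[test]
    single = core_ru / xeon_ru          # Core i7 advantage per thread
    multi = xeon_rating / core_rating   # Xeon advantage across all threads
    results[test] = (single, multi)
    print(f"{test:10s} single-thread: Core {single:.2f}x | "
          f"multi-thread: Xeon {multi:.2f}x "
          f"(naive expectation {expected_multi:.2f}x)")
```

Decompression lands close to the naive expectation, while compression falls well short of it, which matches the conclusion above.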

According to the 7-zip author,
Compression speed and rating strongly depend on memory (RAM) latency.

Decompression speed and rating strongly depend on the integer performance of the CPU. For example, the Intel Pentium 4 has big branch misprediction penalty (which is an effect of its long pipeline) and pretty slow multiply and shift operations. So, the Pentium 4 has pretty low decompressing ratings.


So yes, it seems like the Google Compute Engine's memory is possibly under-performing.
 
mikewinddale
Gerbil First Class
Topic Author
Posts: 164
Joined: Sat Jan 07, 2017 2:22 am

Re: Benchmarking 48 cores, 96 threads

Wed Mar 27, 2019 6:17 pm

Also, if the Google Compute + 48 core Xeon is only 6 times faster than my 4 core Core i7, then it's only equivalent to a 24 core Core i7. So I might just have to buy my own 32 core ThreadRipper someday. (Or whatever the equivalent of a 32 core ThreadRipper will be in the future.)
 
Waco
Gold subscriber
Minister of Gerbil Affairs
Posts: 2815
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Benchmarking 48 cores, 96 threads

Thu Mar 28, 2019 10:00 am

mikewinddale wrote:
Also, if the Google Compute + 48 core Xeon is only 6 times faster than my 4 core Core i7, then it's only equivalent to a 24 core Core i7. So I might just have to buy my own 32 core ThreadRipper someday. (Or whatever the equivalent of a 32 core ThreadRipper will be in the future.)

Lots of cores at low clock rates aren't always a win for many reasons. There's a good reason those 5 GHz SKUs exist in Intel's Xeon lineup.
 
mikewinddale
Gerbil First Class
Topic Author
Posts: 164
Joined: Sat Jan 07, 2017 2:22 am

Re: Benchmarking 48 cores, 96 threads

Thu Mar 28, 2019 8:57 pm

Okay, now I am astounded. I just ran what's called a "synthetic control" on both the Google Compute Xeon 48 core / 96 thread and my 8 core Ryzen 7 2700X.

The synthetic control method is one I actually use in real life, so this is a realistic workload. In fact, I was interested in the Google Compute Engine precisely to speed up synthetic control runs.

In order to obtain p-values for statistical hypothesis testing, you re-run the synthetic control method on every control group, so if you have, say, 1 treatment group and 10 control groups, you perform the same method 11 times. So it's extremely parallelizable. That's why I wanted to use a 48 core machine.
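To illustrate why this parallelizes so well, here is a minimal Python sketch of the placebo idea. It is not what synth_runner does internally: `placebo_effect` is a hypothetical stand-in that returns made-up numbers instead of fitting anything, and the p-value convention (share of units, treated included, with an effect at least as large as the treated one) is just one common choice. A thread pool keeps the sketch short; real CPU-bound fits would want separate processes.

```python
from concurrent.futures import ThreadPoolExecutor

def placebo_effect(unit_id):
    """Hypothetical stand-in for one synthetic-control fit.

    A real fit would treat `unit_id` as if it were the treated unit and
    return its estimated effect. Here the numbers are invented.
    """
    toy_overrides = {3: 9.0}                 # pretend unit 3 shows a big effect
    return toy_overrides.get(unit_id, float(unit_id % 7))

def permutation_p_value(treated_id, control_ids, workers=4):
    units = [treated_id] + list(control_ids)
    # Every fit is independent of the others, so they can run side by side.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        effects = dict(zip(units, pool.map(placebo_effect, units)))
    treated = abs(effects[treated_id])
    as_extreme = sum(1 for e in effects.values() if abs(e) >= treated)
    return as_extreme / len(units)

# 1 treated unit plus 37 controls, as in the dataset described in this post.
controls = [u for u in range(1, 39) if u != 3]
p = permutation_p_value(3, controls)
print(f"units: {1 + len(controls)}, p-value: {p:.4f}")
```

With 38 units there are at most 38 independent fits, which is why adding threads beyond 38 cannot help.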

Anyway, here are some benchmarks. Because my dataset has 1 treatment and 37 controls, the maximum number of threads is 38.
48 core Xeon, using 38 threads: 619.11 seconds
8 core Ryzen 7 2700X, using 8 threads: 608.17 seconds
8 core Ryzen 7 2700X, using 16 threads: 499.29 seconds
4 core Core i7-7700, using 4 threads: 1011.97 seconds
4 core Core i7-7700, using 8 threads: 763.14 seconds

So in the end, the 8 core Ryzen performed in 499.29 seconds what took the 48 core Xeon 619.11 seconds. I don't know how to explain that. Sure, the Ryzen has double the clock speed, but it has 1/6 the cores. I would have expected the Xeon to be 3 times faster.

So now I don't have any reason to use the Google Compute Engine at all. Not only is the 48 core Xeon not fast enough to make up for the cost (in time and effort) of remote computing, but it's actually slower! And I can only assume that a 16 or 32 core ThreadRipper would perform like a scaled Ryzen. (Again, the synthetic control operation scales almost linearly with additional threads, because you're literally performing the same operation multiple times, just with different treatment and control groups.)
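One way to read the timings above is to estimate how long a single fit takes on each machine. The Python sketch below assumes the 38 fits are roughly equal in length and are scheduled in waves of one fit per thread, with no other overhead. Those are simplifying assumptions, not measured facts, and the function name is only illustrative.

```python
import math

JOBS = 38  # 1 treated unit + 37 controls, per the post

# (label, worker threads, measured wall-clock seconds) from the results above
runs = [
    ("Xeon VM, 38 threads", 38, 619.11),
    ("Ryzen 7 2700X, 8 threads", 8, 608.17),
    ("Ryzen 7 2700X, 16 threads", 16, 499.29),
    ("Core i7-7700, 4 threads", 4, 1011.97),
    ("Core i7-7700, 8 threads", 8, 763.14),
]

def implied_job_seconds(threads, wall_seconds, jobs=JOBS):
    """Wall time divided by the number of scheduling waves needed."""
    waves = math.ceil(jobs / threads)
    return waves, wall_seconds / waves

summary = {}
for label, threads, wall in runs:
    waves, per_job = implied_job_seconds(threads, wall)
    summary[label] = (waves, per_job)
    print(f"{label:28s} waves={waves:2d}  ~{per_job:6.1f} s per fit")
```

Under these assumptions a single fit takes about 619 seconds on the Xeon VM but only about 100 to 170 seconds on the desktops. That gap is much larger than the clock-speed difference alone would explain, so something other than core count seems to be holding the VM back.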

Here is my code, using Stata 15/SE.
* Report execution time
set rmsg on, permanently

* Install synth
ssc install synth, replace all

* Install synth_runner
cap ado uninstall synth_runner //in-case already installed
net install synth_runner, from(https://raw.github.com/bquistorff/synth_runner/master/) replace

* Install parallel
net install parallel, from(https://raw.github.com/gvegayon/parallel/stable/) replace
mata mata mlib index

* Use the synth_smoking example dataset that comes with synth_runner
sysuse synth_smoking
tsset state year

/*
To test synth_runner, use the example smoking synthetic control regression used in the synth_runner
help file, but add the "nested allopt" parameters to increase the precision and execution time.
And add the "parallel" parameter to utilize multithreading.
And remove the "gen_vars" parameter because there's no need to generate output.

Thus, the synth_runner help file gives this as an example:
synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989)

I use this instead:
synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) parallel nested allopt
*/

* maximum of 38 threads (1 treatment + 37 controls), with nested allopt
parallel setclusters 38
synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989)  parallel nested allopt

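* 8 threads, with nested allopt
parallel setclusters 8
synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) parallel nested allopt

* 16 threads, with nested allopt
parallel setclusters 16
synth_runner cigsale beer(1984(1)1988) lnincome(1972(1)1988) retprice age15to24 cigsale(1988) cigsale(1980) cigsale(1975), trunit(3) trperiod(1989) parallel nested allopt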

parallel clean
Last edited by mikewinddale on Fri Mar 29, 2019 12:07 pm, edited 1 time in total.
 
Waco
Gold subscriber
Minister of Gerbil Affairs
Posts: 2815
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Benchmarking 48 cores, 96 threads

Thu Mar 28, 2019 9:37 pm

I still think there's something funky going on with your memory allocation on Google Compute Engine. That Xeon machine should have ~4-6 times as much memory bandwidth as your Ryzen desktop.
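For a sense of where a figure like that comes from, here is a back-of-the-envelope Python sketch of theoretical peak DDR4 bandwidth. The memory configurations are assumptions, not anything stated in this thread: dual-channel DDR4-2933 for the Ryzen desktop, and six channels of DDR4-2666 per socket for a Skylake-generation Xeon host. Whether the VM actually spans one or two sockets is unknown.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bytes_per_transfer=8):
    """Theoretical DDR4 peak: channels x MT/s x 8 bytes (64-bit channel)."""
    return channels * mega_transfers_per_s * bytes_per_transfer / 1000

# Assumed configurations (not stated in the thread).
ryzen = peak_bandwidth_gbs(channels=2, mega_transfers_per_s=2933)
xeon_one_socket = peak_bandwidth_gbs(channels=6, mega_transfers_per_s=2666)
xeon_two_sockets = 2 * xeon_one_socket

print(f"Ryzen desktop:     {ryzen:6.1f} GB/s")
print(f"Xeon, one socket:  {xeon_one_socket:6.1f} GB/s ({xeon_one_socket / ryzen:.2f}x)")
print(f"Xeon, two sockets: {xeon_two_sockets:6.1f} GB/s ({xeon_two_sockets / ryzen:.2f}x)")
```

These are paper numbers only. Real sustained bandwidth is lower on both machines, and a VM may see even less depending on how the host's memory is shared.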
