 
jmc2
Gerbil Team Leader
Topic Author
Posts: 227
Joined: Mon Aug 22, 2011 8:30 am

16/0 True cores 1% faster then 16/32threads. Threadripper

Wed Jun 20, 2018 2:51 pm

Did a bunch of speed/voltage encode testing with a 10-minute DVD .mpg file
(to x264 .mp4 @ VerySlow preset, TMPGEnc Video mastering 6) and found that

16 cores (SMT off) were consistently 2 seconds faster than the full 16-core/32-thread setting.
Encode time 3 minutes 20 seconds. Gained 1% in fps.

Got 90 hours to encode. should take 30 hours to do it and every fps helps.
(only 10-11 cores used, no point going for that 32 core threadripper) :(
So wish I had the new enhanced Boost 2(?) support... well, got what I got.
86% faster than my 3930 (6 cores/12 threads with HT) from 2012... That one runs at 90+% load,
so again about 10-11 threads used. Not so far apart.
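
The numbers in this post can be checked with a few lines of arithmetic. This is only a sketch: it assumes the 3 min 20 s figure is the reference encode time for the 10-minute clip, and that "90 hours" means 90 hours of source video encoding at the same speed. The function names are made up for illustration.

```python
def percent_gain(saved_seconds, encode_seconds):
    """Time saved as a percentage of the reference encode time."""
    return 100.0 * saved_seconds / encode_seconds


def hours_needed(backlog_hours, clip_seconds, encode_seconds):
    """Scale the timing of one test clip up to the whole backlog."""
    return backlog_hours * encode_seconds / clip_seconds


clip_seconds = 10 * 60        # 10-minute test clip
encode_seconds = 3 * 60 + 20  # 3 min 20 s = 200 s

print(percent_gain(2, encode_seconds))                 # 1.0 (percent)
print(hours_needed(90, clip_seconds, encode_seconds))  # 30.0 (hours)
```

So 2 seconds out of 200 is the 1%, and encoding at 3x real time is where the 30-hour estimate comes from.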

Crashes at 4GHz but got 8 hours of encoding @ 3.95GHz done yesterday.
Air cooled Noctua. Temps around 65 C with an ambient of 86 F (30 C) (Florida).

Always hope for that "4GHz" but can never get there (with good temps).
There is a high speed fan on the cooler and the case to keep temps down.
Good thing is that you KNOW when the computer has finished its job.

jmc
Last edited by jmc2 on Wed Jun 20, 2018 3:08 pm, edited 1 time in total.
 
chuckula
Gold subscriber
Gerbil Jedi
Posts: 1887
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: 16/0 True cores 1% faster then 16/32threads. Threadripper

Wed Jun 20, 2018 2:58 pm

(only 10-11 cores used, no point going for that 32 core threadripper) :(


That's when you start thinking about running two encoding jobs in parallel (if that's possible to do) and then figure out if it's actually a benefit to overall performance by using more cores or if it's actually harming performance by oversubscribing the system. It might also be a fun way to test stability at 3.95GHz since you weren't pegging all the cores before.
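
One way to run that test, as a rough sketch: launch the jobs at the same moment, time the whole batch, and compare minutes of video encoded per minute of wall time against a single-job run. The helper names are hypothetical, and the command lines are placeholders, since it is not known whether the encoder in question can be started from a command line at all.

```python
import subprocess
import time


def run_concurrently(commands):
    """Start every command at once, wait for all of them,
    and return (wall_seconds, exit_codes)."""
    start = time.perf_counter()
    procs = [subprocess.Popen(cmd) for cmd in commands]
    codes = [p.wait() for p in procs]
    return time.perf_counter() - start, codes


def aggregate_speed(video_minutes, wall_seconds):
    """Minutes of source video encoded per minute of wall time.
    Higher is better; compare 1 job against 2 or 3 concurrent jobs."""
    return video_minutes * 60.0 / wall_seconds
```

If two concurrent 10-minute clips finish in less than twice the single-clip time, running in parallel is a net win; if the batch takes longer than that, the system is oversubscribed.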
4770K @ 4.7 GHz; 32GB DDR3-2133; GTX-1080 sold and back to hipster IGP!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
techguy
Gerbil XP
Posts: 353
Joined: Tue Aug 10, 2010 9:12 am

Re: 16/0 True cores 1% faster then 16/32threads. Threadripper

Wed Jun 20, 2018 3:11 pm

There are 2 levers that can be used to extract parallelism from encoding workloads (neither of which are featured in your test case):
1) high bitrates (think Blu-ray quality or higher)
2) high resolution (think 1080p or even 4k)

I regularly peg my 10c/20t 7900X @ 100% with Handbrake, transcoding from m2ts to mkv while maintaining bitrate and source resolution, using a 2-pass process and VBR.
Last edited by techguy on Wed Jun 20, 2018 3:15 pm, edited 1 time in total.
 
Redocbew
Gold subscriber
Gerbil Jedi
Posts: 1759
Joined: Sat Mar 15, 2014 11:44 am

Re: 16/0 True cores 1% faster then 16/32threads. Threadripper

Wed Jun 20, 2018 3:13 pm

You may find that it's not so easy to find an "optimal" number of threads for encoding (and probably for many other tasks as well). Optimal in terms of "this is the best this application can do" and optimal in terms of "this is the best my leet CPU can do" can often be different things.
Do not meddle in the affairs of archers, for they are subtle and you won't hear them coming.
 
Chrispy_
Maximum Gerbil
Posts: 4481
Joined: Fri Apr 09, 2004 3:49 pm
Location: Europe, most frequently London.

Re: 16/0 True cores 1% faster then 16/32threads. Threadripper

Wed Jun 20, 2018 6:42 pm

Just chiming in to say that your software is definitely to blame here.

My experience is with V-Ray (raytracing) on a render farm, where we use a management tool called Deadline to chop up single jobs and feed portions of them to multiple machines. Intel Hyperthreading adds around 40% more performance in our scenario, and Ryzen 7 1700 nodes (the most power-efficient by a country mile, and that's what matters when you have dozens of them running 24/7) definitely outperform hyperthreaded Broadwell-E, so all threads MUST be fully utilised.

If you're talking about 90 hours of encoding, and underutilised CPU, split the job into two clips and run them concurrently, then spend five minutes manually rejoining them once complete. AMD's SMT isn't quite at Intel's level, but you should see a 25-30% improvement, which is potentially 25+ hours shaved off the job.
Congratulations, you've noticed that this year's signature is based on outdated internet memes; CLICK HERE NOW to experience this unforgettable phenomenon. This sentence is just filler and as irrelevant as my signature.
 
Waco
Gold subscriber
Minister of Gerbil Affairs
Posts: 2581
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: 16/0 True cores 1% faster then 16/32threads. Threadripper

Wed Jun 20, 2018 7:35 pm

The benefits of HT/SMT are totally dependent upon software scaling and the actual workload you're running. If you don't have pipeline bubbles and you're fully subscribing the various FPU/ALU/memory resources...you're not going to gain anything.
Desktop: Z170A Gaming Pro Carbon | 6700K @ 4.4 | 16 GB | GTX Titan Xm | XSPC RX360 | Heatkiller R3 | Samsung 4K 40" | 2048 + 240 + LSI 9207-8i (128x8) SSD
NAS: 1950X | Designare EX | 32 GB ECC | 7x8 TB RAIDZ2 | 8x2 TB RAID10 | FreeNAS | ZFS | LSI SAS
 
dragontamer5788
Gerbil First Class
Posts: 177
Joined: Mon May 06, 2013 8:39 am

Re: 16/0 True cores 1% faster then 16/32threads. Threadripper

Fri Jun 22, 2018 10:46 am

Waco wrote:
The benefits of HT/SMT are totally dependent upon software scaling and the actual workload you're running. If you don't have pipeline bubbles and you're fully subscribing the various FPU/ALU/memory resources...you're not going to gain anything.


Heck, it's even harmful! Because if you've got a 100% full pipeline and everything, then cutting the effective L1 cache in half will harm a lot of workloads. 32 threads means your L1 and L2 caches have to hold less data per thread and work with main memory more.

HyperThreading / SMT making things slower is a known issue. It generally makes "inefficient" code faster (CPU-bound), but "highly efficient" (cache-bound) code slower. Unfortunately, there's no easy way to tell if your workload is CPU-bound or cache-bound. You can use something like the perf tools on Linux, but I'm not aware of any easy access to low-level performance counters on Windows.

EDIT: Oh, AMD? So you can use AMD's CodeXL to profile programs. So that's a benefit, but you gotta learn low-level stuff to understand all the knobs and buttons. Intel's VTune costs a lot of money (nearly $1000ish) but is considered one of the best performance tools on the market.
Last edited by dragontamer5788 on Fri Jun 22, 2018 10:50 am, edited 1 time in total.
 
Waco
Gold subscriber
Minister of Gerbil Affairs
Posts: 2581
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: 16/0 True cores 1% faster then 16/32threads. Threadripper

Fri Jun 22, 2018 10:48 am

If you're doing well enough that HT hurts you, you're also likely smart enough to only engage the first thread on each core. :)
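
For anyone who wants to try that without turning SMT off in the BIOS, here is a small sketch that builds an affinity mask covering only the first logical processor of each core. It assumes the two SMT siblings of a core are numbered next to each other (0 and 1 on core 0, 2 and 3 on core 1, and so on), which is common on Windows but not guaranteed, and often not the case on Linux, so check your own topology first. The function names are made up for illustration.

```python
def first_thread_cpus(physical_cores, threads_per_core=2):
    """Logical CPU numbers of the first SMT thread on each core,
    ASSUMING siblings are numbered adjacently."""
    return [core * threads_per_core for core in range(physical_cores)]


def first_thread_mask(physical_cores, threads_per_core=2):
    """Same selection expressed as an affinity bitmask."""
    mask = 0
    for cpu in first_thread_cpus(physical_cores, threads_per_core):
        mask |= 1 << cpu
    return mask


print(hex(first_thread_mask(16)))  # 0x55555555 for a 16-core/32-thread part
```

On Windows the hex value can be passed to `start /affinity`; on Linux the CPU list can be handed to `os.sched_setaffinity`, once the sibling numbering has been verified.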
Desktop: Z170A Gaming Pro Carbon | 6700K @ 4.4 | 16 GB | GTX Titan Xm | XSPC RX360 | Heatkiller R3 | Samsung 4K 40" | 2048 + 240 + LSI 9207-8i (128x8) SSD
NAS: 1950X | Designare EX | 32 GB ECC | 7x8 TB RAIDZ2 | 8x2 TB RAID10 | FreeNAS | ZFS | LSI SAS
 
chuckula
Gold subscriber
Gerbil Jedi
Posts: 1887
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: 16/0 True cores 1% faster then 16/32threads. Threadripper

Fri Jun 22, 2018 11:25 am

All the talk about HT on or off is interesting but since the original poster doesn't seem to be the software developer I don't think he can rewrite the encoding application to fix whatever bottlenecks it has.
4770K @ 4.7 GHz; 32GB DDR3-2133; GTX-1080 sold and back to hipster IGP!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
Bauxite
Gerbil Elite
Posts: 770
Joined: Sat Jan 28, 2006 12:10 pm
Location: electrolytic redox smelting plant

Re: 16/0 True cores 1% faster then 16/32threads. Threadripper

Fri Jun 22, 2018 11:39 am

Nothing new about this on x86 since, what, 2001? Not all code likes HT/SMT/etc.

There are also various security implications to various implementations, some well documented...some not yet ;)
2018: at 120 Zen cores and counting, so pretty much done with intel on the desktop.
E5 2696v4 22c44t 2.2~3.7Ghz - The last great gleam of the pre-nerf HEDT era.
E5 1680v2 8c16t 4.5Ghz - "Yes Virginia, there were unlocked xeons" /weep for them.
 
Aranarth
Graphmaster Gerbil
Posts: 1153
Joined: Tue Jan 17, 2006 6:56 am
Location: Big Rapids, Mich. (Est Time Zone)

Re: 16/0 True cores 1% faster then 16/32threads. Threadripper

Fri Jun 22, 2018 2:01 pm

The biggest question is does his software (TMPGEnc Video mastering 6) like that many threads and hyperthreading or not?

That's a question for the developer and their forums.
Main machine: Core I7 -2600K @ 4.0Ghz / 16 gig ram / Radeon RX 580 8gb / 500gb toshiba ssd / 5tb hd
Old machine: Core 2 quad Q6600 @ 3ghz / 8 gig ram / Radeon 7870 / 240 gb PNY ssd / 1tb HD
 
dragontamer5788
Gerbil First Class
Posts: 177
Joined: Mon May 06, 2013 8:39 am

Re: 16/0 True cores 1% faster then 16/32threads. Threadripper

Fri Jun 22, 2018 2:11 pm

jmc2 wrote:
Got 90 hours to encode. should take 30 hours to do it and every fps helps.
(only 10-11 cores used, no point going for that 32 core threadripper) :(


Do 2 or 3 at a time. Sure, code only scales to 10 cores. But just run the code 3 times at the same time to get good scaling across 32 cores.

Just split your files into 3 groups, and then run those three groups simultaneously, with 3 instances of your encoder.
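
A quick sketch of one way to do the split so the three instances finish at roughly the same time: sort the files longest first and always hand the next file to the group with the least total running time. The file names and durations below are invented for the example, and `split_into_groups` is just an illustrative helper.

```python
def split_into_groups(durations, groups=3):
    """Greedy longest-first split.
    durations: dict of file name -> running time in minutes.
    Returns (list of file lists, list of total minutes per group)."""
    bins = [[] for _ in range(groups)]
    totals = [0.0] * groups
    by_length = sorted(durations.items(), key=lambda kv: kv[1], reverse=True)
    for name, minutes in by_length:
        lightest = totals.index(min(totals))  # group with the least work so far
        bins[lightest].append(name)
        totals[lightest] += minutes
    return bins, totals


# Hypothetical file list, for illustration only
files = {"a.mpg": 60, "b.mpg": 50, "c.mpg": 40,
         "d.mpg": 30, "e.mpg": 20, "f.mpg": 10}
bins, totals = split_into_groups(files, groups=3)
print(bins)    # [['a.mpg', 'f.mpg'], ['b.mpg', 'e.mpg'], ['c.mpg', 'd.mpg']]
print(totals)  # [70.0, 70.0, 70.0]
```

Each group then goes to its own encoder instance. The greedy split is not always perfectly balanced, but it is usually close enough for a batch like this.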

Aranarth wrote:
The biggest question is does his software (TMPGEnc Video mastering 6) like that many threads and hyperthreading or not?

That's a question for the developer and their forums.


Nah, you can figure this stuff out and strategize without even going to developer tools. The original poster has already done a good job figuring out whether hyperthreading / SMT is good or bad (bad in this case. Yeah, it happens. Good job figuring that out though; it always takes a bit of testing).

So the only thing left is to figure out the optimal number of threads and the optimal number of simultaneous processes for his 30+ hour job.
 
jmc2
Gerbil Team Leader
Topic Author
Posts: 227
Joined: Mon Aug 22, 2011 8:30 am

Re: 16/0 True cores 1% faster then 16/32threads. Threadripper

Sat Jun 23, 2018 9:39 am

chuckula wrote:
That's when you start thinking about running two encoding jobs in parallel (if that's possible to do) and then figure out if it's actually a benefit to overall performance by using more cores or if it's actually harming performance by oversubscribing the system. It might also be a fun way to test stability at 3.95GHz since you weren't pegging all the cores before.


Thanks for mentioning that (to all).
There (was/is?) a limit of only one instance of AC3 encoding allowed (Dolby).
But I have heard that AC3 is now an "open" standard.
So maybe I could now run two encodings at once.
Will have to try that and see if the AC3 encoding still works.

If I find that I can run two encodes at once then I'll be set.
Probably be turning on all the Threads then.

May have to finally remove that silicone pad and put
in the top-end heat paste... Grizzly Kryonaut rated=12 vs pad=6.
60s C is ok, 70s C is not.

-----------------
@techguy
1)high bitrates (think Blu-ray quality or higher)
-----So right there! Core use skyrockets.
Because of the heavy load, I avoid HD material
unless that is the sole source.
So thankful that I find dvd quality "good enough".

----------Here is some interesting info I got------------(uses FFmpeg I believe)
As an FYI, the mpeg2 decoder will use up to 8 threads, but on SD and even some HD material it's probably effectively fewer. However, the mpeg2 decoder will decode SD on your system at 250-400 FPS.

Looking at the x264 encoder threading logic, it will use up to 3 threads for SD material in the first pass. It's basically height in pixels / 128. The comments in the code say that more threads reduce quality in pass 1.

Pass 2 threading uses max 32 threads for encoding on HD, and about 15 threads for SD. At one point we experimented with letting you set the thread counts, but it turned out that x264 does an optimal job automatically.
------------------------------------------------------------------------------------------------
That explains my "Pass 2" (x264) being 5 times faster than "Pass 1".
(With VideoReDo) So wish I could use it.
But deinterlacing does not work with X264 and 2:3 pulldown video.
(H264 works but so slow)
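
The quoted rule of thumb lines up with that 5x figure. A quick sketch, assuming "height in pixels / 128" means integer division and taking the "about 15 threads" for SD in pass 2 at face value; this only illustrates the quoted description, it is not taken from the x264 source, and the function name is made up.

```python
def pass1_threads(height_px, rows_per_thread=128):
    """Pass-1 thread count per the quoted rule: height / 128,
    rounded down (an assumption), with a floor of 1."""
    return max(1, height_px // rows_per_thread)


sd_height = 480       # NTSC DVD frame height
pass2_sd_threads = 15  # "about 15 threads for SD" from the quote

p1 = pass1_threads(sd_height)
print(p1)                     # 3
print(pass2_sd_threads / p1)  # 5.0
```

15 threads against 3 is a 5:1 ratio, which matches pass 2 running about 5 times faster than pass 1 if speed scales roughly with thread count.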

Thanks again to all for the ideas!
jmc2
