
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Task Scheduling, PPID, forking, NUMA

Thu Jun 27, 2019 11:28 am

This is a pretty wacky question -- let me just be upfront about that.

On my Ubuntu Threadripper 2990wx system, I'm running a bunch of Monte Carlo simulations using the mcparallel() function in R. The way this works is that it forks off child processes, each of which does its thing, and then returns an answer. Every child process is working on a different (randomly generated) data set. All of the data sets are essentially the same size. So I've got 64 child processes all doing essentially the same thing at the same time, but on different (equally sized) data sets.

The problem I'm encountering is that sometimes (not always) the task scheduler dumps all threads onto the 16 cores that have direct memory access, leaving the 16 without direct access completely idle. This happens even when I set the affinity of the processes to specific cores, including the 16 cores without direct memory access.
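
For concreteness, what I'm doing is roughly this -- a simplified sketch, where run_sim() stands in for my actual simulation code and isn't a real function:

library(parallel)

# Fork one child per data set; each child runs its simulation and returns an answer.
# mc.affinity restricts a child to the given logical CPUs (R uses 1-based ids).
jobs <- lapply(1:64, function(i) mcparallel(run_sim(i), mc.affinity = i))

# Block until all children report back, then gather the results.
results <- mccollect(jobs)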

So far I have not been able to figure out a way to stop this from happening. The thing I'm thinking about investigating next is whether it matters that all of these child processes share the same parent -- if they had different parents, would the task scheduler stop doing this?

Now, you might think -- maybe the task scheduler is doing the right thing -- maybe those goofy cores that are disconnected from RAM are useless in this scenario. I can't prove it, but I don't think that's right. This seems like a degenerate situation to me -- I think the task scheduler is confused. But I admit I haven't carefully examined that.

Anyway.... any thoughts?
1. iMac 27" (2020) i7 10700k; AMD Radeon Pro 5500 XT 8 GB; 64 GB RAM; 500 GB internal SSD + external box of SSDs
2. ThreadRipper 2990wx; Ubuntu; Headless; 64GB RAM
3. MacBook Pro (2017); Core i7-7820HQ; 16 GB RAM
 
Waco
Maximum Gerbil
Posts: 4850
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Task Scheduling, PPID, forking, NUMA

Thu Jun 27, 2019 5:30 pm

Do you have the option to change the memory interleaving settings in the BIOS/UEFI? If your workload is truly not memory bandwidth limited, it might be worth only having a single NUMA domain exposed to the scheduler.
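
(A quick way to see what the scheduler currently sees is numactl -H, a.k.a. numactl --hardware, which lists the exposed nodes and the memory attached to each.)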
Victory requires no explanation. Defeat allows none.
 
chuckula
Minister of Gerbil Affairs
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Task Scheduling, PPID, forking, NUMA

Thu Jun 27, 2019 6:09 pm

blastdoor wrote:
The problem I'm encountering is that sometimes (not always) the task scheduler dumps all threads onto the 16 cores that have direct memory access, leaving the 16 without direct access completely idle. This happens even when I set the affinity of the processes to specific cores, including the 16 cores without direct memory access.


Do you have quantitative performance metrics showing this is a real bottleneck? It might be that the threads are I/O hungry and just want more data before resuming, while the I/O-ready threads happen to run on the dies with memory access by default. Given the... interesting design choices AMD made in those parts, I'd frankly be more concerned if the dies that do have memory access were sitting idle outside of a toy in-cache workload.

As for forking from a certain parent, that should not matter unless you are using a particular cgroup setup in which the child processes inherit different cgroups from parents that are themselves in different cgroups. Remember: all processes trace their descent back to PID 1.
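
If you want to rule that out, you can check which cgroups a given process actually lands in:

cat /proc/<pid>/cgroup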

More on cgroups: https://en.m.wikipedia.org/wiki/Cgroups
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
Duct Tape Dude
Gerbil Elite
Posts: 721
Joined: Thu May 02, 2013 12:37 pm

Re: Task Scheduling, PPID, forking, NUMA

Thu Jun 27, 2019 6:14 pm

From the docs:
mcparallel evaluates the expr expression in parallel to the current R process. Everything is shared read-only (or in fact copy-on-write) between the parallel process and the current process, i.e. no side-effects of the expression affect the main process. The result of the parallel execution can be collected using the mccollect function.

The "no side-effects" part is technically true until a point: since you're running as a readonly forked subprocess, the results of the each subprocess probably need to be serialized and reported back to the main process (quite CPU-intensive for large answers). Do these processes always start on different cores before being rescheduled to the same NUMA node?

Also, how many simulations are "a bunch"? If it's under ~10k per run, I'd consider splitting each out into an external job of some sort and having 64 processes to handle jobs (and maybe one more to orchestrate them all), so you get a completely isolated execution environment per job, each with its own NUMA node access, etc. In other words: use a brand-new process instead of a child process, so there's zero parent/child communication.
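
A rough sketch of that idea in R, assuming a hypothetical sim_job.R script that runs one replication for a given job id and writes its result to disk:

# Launch each job as its own Rscript bound to a NUMA node via numactl, so
# nothing is inherited from a common R parent. --cpunodebind only, since two
# of the 2990WX's nodes have no local memory to bind to.
for (node in 0:3) {
  for (slot in 1:16) {
    system2("numactl",
            args = c("--cpunodebind", node, "Rscript", "sim_job.R",
                     node * 16 + slot),
            wait = FALSE)   # don't block; let all 64 jobs run concurrently
  }
}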

I'm assuming that Node.js child processes are structured and behave similarly. One caveat from Node.js: in my tests, returning a bunch of data from many child processes to the main thread was actually faster via TCP (and especially UDP, though the kernel silently drops packets if the ring buffers fill) than via standard pipes from child to parent, due to the synchronous serialization pipes require. (If your application takes far longer to calculate an answer than to return it, this is irrelevant.)
 
chuckula
Minister of Gerbil Affairs
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Task Scheduling, PPID, forking, NUMA

Thu Jun 27, 2019 6:47 pm

If it is IO (either memory or piping to a parent process if that occurs) then an excellent tool to use for profiling is the Linux perf utility: https://perf.wiki.kernel.org/index.php/Main_Page

It's quite powerful (and maybe too complicated) but it can probably help you track down your problems.
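
For example, with <pid> being one of the child processes:

perf stat -d -p <pid>     (cache/memory-related counters for a running process)
perf record -g -p <pid>   (sample call stacks; inspect afterwards with perf report)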
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
chuckula
Minister of Gerbil Affairs
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Task Scheduling, PPID, forking, NUMA

Thu Jun 27, 2019 6:53 pm

Oh, and I have no idea if this works with AMD or if they have their own version, but intel-cmt-cat has a very powerful memory bandwidth monitor: https://github.com/intel/intel-cmt-cat
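
If it does support those chips, the pqos tool in that package is the entry point; if memory serves, monitoring local memory bandwidth per core looks something like pqos -m "mbl:0-15", but double-check that syntax against its docs.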
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Task Scheduling, PPID, forking, NUMA

Thu Jun 27, 2019 10:29 pm

All other issues aside, if the scheduler is ignoring explicit affinity settings, that's a serious bug. Are you sure you're really setting the affinity?
Nostalgia isn't what it used to be.
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Task Scheduling, PPID, forking, NUMA

Fri Jun 28, 2019 5:31 am

just brew it! wrote:
All other issues aside, if the scheduler is ignoring explicit affinity settings, that's a serious bug. Are you sure you're really setting the affinity?

Good question — I’m not 100% sure. I’m using an option within the R mcparallel function to set affinity, and I can’t vouch for its efficacy.
1. iMac 27" (2020) i7 10700k; AMD Radeon Pro 5500 XT 8 GB; 64 GB RAM; 500 GB internal SSD + external box of SSDs
2. ThreadRipper 2990wx; Ubuntu; Headless; 64GB RAM
3. MacBook Pro (2017); Core i7-7820HQ; 16 GB RAM
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Task Scheduling, PPID, forking, NUMA

Fri Jun 28, 2019 5:36 am

chuckula wrote:
If it is IO (either memory or piping to a parent process if that occurs) then an excellent tool to use for profiling is the Linux perf utility: https://perf.wiki.kernel.org/index.php/Main_Page

It's quite powerful (and maybe too complicated) but it can probably help you track down your problems.

Thanks! I’ll give it a try
1. iMac 27" (2020) i7 10700k; AMD Radeon Pro 5500 XT 8 GB; 64 GB RAM; 500 GB internal SSD + external box of SSDs
2. ThreadRipper 2990wx; Ubuntu; Headless; 64GB RAM
3. MacBook Pro (2017); Core i7-7820HQ; 16 GB RAM
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Task Scheduling, PPID, forking, NUMA

Fri Jun 28, 2019 5:48 am

Duct Tape Dude wrote:
From the docs:
mcparallel evaluates the expr expression in parallel to the current R process. Everything is shared read-only (or in fact copy-on-write) between the parallel process and the current process, i.e. no side-effects of the expression affect the main process. The result of the parallel execution can be collected using the mccollect function.

The "no side-effects" part is technically true until a point: since you're running as a readonly forked subprocess, the results of the each subprocess probably need to be serialized and reported back to the main process (quite CPU-intensive for large answers). Do these processes always start on different cores before being rescheduled to the same NUMA node?

Also, how many simulations are "a bunch"? If it's under ~10k per run, I'd consider splitting each out into an external job of some sort and having 64 processes to handle jobs (and maybe one more to orchestrate them all), so you get a completely isolated execution environment per job, each with its own NUMA node access, etc. In other words: use a brand-new process instead of a child process, so there's zero parent/child communication.

I'm assuming that Node.js child processes are structured and behave similarly. One caveat from Node.js: in my tests, returning a bunch of data from many child processes to the main thread was actually faster via TCP (and especially UDP, though the kernel silently drops packets if the ring buffers fill) than via standard pipes from child to parent, due to the synchronous serialization pipes require. (If your application takes far longer to calculate an answer than to return it, this is irrelevant.)


They do start on different cores before being rescheduled.

I think you might be right that the best option is to have totally separate processes, so that I can explicitly control NUMA access. The first variation on that idea I'll try is to start four separate instances of R, each bound to a node, and then run 16 child processes from each one using mcparallel. If that doesn't work, I'll dump mcparallel altogether, as you're suggesting.
1. iMac 27" (2020) i7 10700k; AMD Radeon Pro 5500 XT 8 GB; 64 GB RAM; 500 GB internal SSD + external box of SSDs
2. ThreadRipper 2990wx; Ubuntu; Headless; 64GB RAM
3. MacBook Pro (2017); Core i7-7820HQ; 16 GB RAM
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Task Scheduling, PPID, forking, NUMA

Fri Jun 28, 2019 9:55 am

blastdoor wrote:
just brew it! wrote:
All other issues aside, if the scheduler is ignoring explicit affinity settings, that's a serious bug. Are you sure you're really setting the affinity?

Good question — I’m not 100% sure. I’m using an option within the R mcparallel function to set affinity, and I can’t vouch for its efficacy.

As a troubleshooting aid, you can check and/or modify the current affinity mask of individual threads with the "taskset" command.
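
For example:

taskset -cp <pid>          (show the list of CPUs the process may run on)
taskset -cp 0-15 <pid>     (restrict it to CPUs 0-15)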

You can find the IDs of the individual threads by doing a "ps -AT" (the second column of the output, SPID, is the ID of the thread).

The "htop" tool is also useful when dealing with threaded applications. I suggest you install it if you haven't already done so.
Nostalgia isn't what it used to be.
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 8:16 am

just brew it! wrote:
blastdoor wrote:
just brew it! wrote:
All other issues aside, if the scheduler is ignoring explicit affinity settings, that's a serious bug. Are you sure you're really setting the affinity?

Good question — I’m not 100% sure. I’m using an option within the R mcparallel function to set affinity, and I can’t vouch for its efficacy.

As a troubleshooting aid, you can check and/or modify the current affinity mask of individual threads with the "taskset" command.

You can find the PIDs of the individual threads by doing a "ps -AT" (second column of output is the PID of the thread).

The "htop" tool is also useful when dealing with threaded applications. I suggest you install it if you haven't already done so.


I'm trying a variation on some folks' suggestions. I started two separate R instances:
numactl -m 0 R
numactl -m 2 R

Within the first instance, I'm using mcparallel() to fork 32 child processes pinned to logical cores 1-16 and 33-48. Within the second instance, it's logical cores 17-32 and 49-64.

So far this appears to be working, in the sense that the child processes really are pinned to one core (thanks for the suggestion!) and I appear to have 100% overall utilization.

Oh, and I also rewrote how I'm dispatching child processes. Previously, I was creating 64 processes and dispatching 1000 Monte Carlo replications to each process (one replication takes in the ballpark of 1 to 2 minutes to run). Now I'm doing a load-balancing thing, where I dispatch just one replication to each process; as soon as a process finishes, I dispatch another replication to that core. So if some cores complete jobs faster, they end up getting more jobs. I'm keeping track of how many jobs each core completes -- I'm curious to see what that looks like when it's all done.
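
The dispatch loop looks roughly like this -- a simplified sketch, where run_one_rep() stands in for a single replication and the core list is for one of the two R instances:

library(parallel)

cores   <- c(1:16, 33:48)   # 1-based logical CPU ids handled by this instance
total   <- 32000            # this instance's share of the 64,000 replications
started <- 0
done    <- setNames(integer(length(cores)), cores)   # jobs completed per core

jobs    <- list()           # live jobs, keyed by child pid
core_of <- list()           # child pid -> the core it is pinned to

launch <- function(cpu) {
  job <- mcparallel(run_one_rep(), mc.affinity = cpu)
  jobs[[as.character(job$pid)]]    <<- job
  core_of[[as.character(job$pid)]] <<- cpu
  started <<- started + 1
}

for (cpu in cores) launch(cpu)    # prime one replication per core

while (length(jobs) > 0) {
  res <- mccollect(jobs, wait = FALSE, timeout = 1)   # poll for finished children
  for (pid in names(res)) {
    cpu <- core_of[[pid]]
    done[as.character(cpu)] <- done[as.character(cpu)] + 1
    jobs[[pid]]    <- NULL          # (the real code also keeps res[[pid]])
    core_of[[pid]] <- NULL
    if (started < total) launch(cpu)                  # keep that core busy
  }
}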
1. iMac 27" (2020) i7 10700k; AMD Radeon Pro 5500 XT 8 GB; 64 GB RAM; 500 GB internal SSD + external box of SSDs
2. ThreadRipper 2990wx; Ubuntu; Headless; 64GB RAM
3. MacBook Pro (2017); Core i7-7820HQ; 16 GB RAM
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 8:23 am

Ugh -- I spoke too soon.

I'm using glances to monitor things, and I found some documentation that you can see info about core affinity for the top process by pressing 'e'.

For a while, it was fine -- the top process was pinned to a single core. But just now, the pin broke and the top process is associated with 16 cores.

....and now I just checked again, and it's back to a single core.

Hmm. Each process lasts about a minute before dying, so if it breaks free of its affinity pin, it doesn't have time to cause too much havoc. But why are processes breaking free of their affinity pin? How can that happen?
1. iMac 27" (2020) i7 10700k; AMD Radeon Pro 5500 XT 8 GB; 64 GB RAM; 500 GB internal SSD + external box of SSDs
2. ThreadRipper 2990wx; Ubuntu; Headless; 64GB RAM
3. MacBook Pro (2017); Core i7-7820HQ; 16 GB RAM
 
chuckula
Minister of Gerbil Affairs
Posts: 2109
Joined: Wed Jan 23, 2008 9:18 pm
Location: Probably where I don't belong.

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 8:40 am

blastdoor wrote:
Ugh -- I spoke too soon.

I'm using glances to monitor things, and I found some documentation that you can see info about core affinity for the top process by pressing 'e'.

For a while, it was fine -- the top process was pinned to a single core. But just now, the pin broke and the top process is associated with 16 cores.

....and now I just checked again, and it's back to a single core.

Hmm. Each process lasts about a minute before dying, so if it breaks free of its affinity pin, it doesn't have time to cause too much havoc. But why are processes breaking free of their affinity pin? How can that happen?


I wouldn't panic too much about a process migrating cores if all the cores are busy.

I keyed in on one of the changes you just made: Making each forked process run just one Monte Carlo instead of a batch of 1000. Assuming each simulation takes a non-trivial amount of time (and by non-trivial I mean more than say... 10 milliseconds or so) then the overall cost of forking a process for each simulation will be negligible. I'm speculating that you were previously encountering a situation in which the simulations were not necessarily completing at the same speed and you ended up with the "only 16 cores in use" situation as the final sets of simulations finished up and the scheduler just put them on the dies with direct RAM access (which is probably a good idea).
4770K @ 4.7 GHz; 32GB DDR3-2133; Officially RX-560... that's right AMD you shills!; 512GB 840 Pro (2x); Fractal Define XL-R2; NZXT Kraken-X60
--Many thanks to the TR Forum for advice in getting it built.
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 8:58 am

Have you confirmed that it is a single process/thread that starts out with single-core affinity (then subsequently loses it), or is it possible that some of them simply aren't getting their affinity set properly to begin with? If they only run for about a minute apiece, you're going to need to watch the actual process/thread IDs to determine which case it is.

This really sounds like a buggy framework that simply isn't setting the core affinity properly.
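
One quick way to watch them, assuming the children all show up under the R command name:

ps -o pid,psr,args -C R     (PSR is the CPU each process is currently running on)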
Nostalgia isn't what it used to be.
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 9:50 am

chuckula wrote:
blastdoor wrote:
Ugh -- I spoke too soon.

I'm using glances to monitor things, and I found some documentation that you can see info about core affinity for the top process by pressing 'e'.

For a while, it was fine -- the top process was pinned to a single core. But just now, the pin broke and the top process is associated with 16 cores.

....and now I just checked again, and it's back to a single core.

Hmm. Each process lasts about a minute before dying, so if it breaks free of its affinity pin, it doesn't have time to cause too much havoc. But why are processes breaking free of their affinity pin? How can that happen?


I wouldn't panic too much about a process migrating cores if all the cores are busy.

I keyed in on one of the changes you just made: Making each forked process run just one Monte Carlo instead of a batch of 1000. Assuming each simulation takes a non-trivial amount of time (and by non-trivial I mean more than say... 10 milliseconds or so) then the overall cost of forking a process for each simulation will be negligible. I'm speculating that you were previously encountering a situation in which the simulations were not necessarily completing at the same speed and you ended up with the "only 16 cores in use" situation as the final sets of simulations finished up and the scheduler just put them on the dies with direct RAM access (which is probably a good idea).


Not a bad guess, but no -- there was a big list of processes sitting at very low CPU utilization, and they'd cycle. In addition, I looked at the temperatures of the four dies -- two were much hotter. So I'm very confident that it really was dumping everything onto the two memory-connected dies.

Previously, each Monte Carlo replication was taking 1-2 minutes to run. Now with the new approach, they are taking less than a minute to run. So I'm cautiously optimistic this is working better now...
1. iMac 27" (2020) i7 10700k; AMD Radeon Pro 5500 XT 8 GB; 64 GB RAM; 500 GB internal SSD + external box of SSDs
2. ThreadRipper 2990wx; Ubuntu; Headless; 64GB RAM
3. MacBook Pro (2017); Core i7-7820HQ; 16 GB RAM
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 9:56 am

just brew it! wrote:
Have you confirmed that it is a single process/thread that starts out with single-core affinity (then subsequently loses it), or is it possible that some of them simply aren't getting their affinity set properly to begin with? If they only run for about a minute apiece, you're going to need to watch the actual process/thread IDs to determine which case it is.

This really sounds like a buggy framework that simply isn't setting the core affinity properly.


That's a good question -- I haven't completely confirmed that yet.

It could very well be a buggy framework, but if so, it's a bug that interacts differently with my Threadripper 2990wx than with Xeons. I had the opportunity to run on an AWS server, also with Ubuntu Linux. I set up R and installed the packages on both systems (mine and AWS) myself, following standard procedures, and I see no hint of this issue on the Xeon AWS machines.
1. iMac 27" (2020) i7 10700k; AMD Radeon Pro 5500 XT 8 GB; 64 GB RAM; 500 GB internal SSD + external box of SSDs
2. ThreadRipper 2990wx; Ubuntu; Headless; 64GB RAM
3. MacBook Pro (2017); Core i7-7820HQ; 16 GB RAM
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 10:15 am

Are the other systems also NUMA?
Nostalgia isn't what it used to be.
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 10:24 am

just brew it! wrote:
Are the other systems also NUMA?


I had used the c5.18xlarge instance type. Amazon says that's a Xeon Platinum, Skylake system with 72 logical cores. I don't know for sure, but I presume that means a two socket system.
1. iMac 27" (2020) i7 10700k; AMD Radeon Pro 5500 XT 8 GB; 64 GB RAM; 500 GB internal SSD + external box of SSDs
2. ThreadRipper 2990wx; Ubuntu; Headless; 64GB RAM
3. MacBook Pro (2017); Core i7-7820HQ; 16 GB RAM
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 10:26 am

blastdoor wrote:
just brew it! wrote:
Are the other systems also NUMA?


I had used the c5.18xlarge instance type. Amazon says that's a Xeon Platinum, Skylake system with 72 logical cores. I don't know for sure, but I presume that means a two socket system.

Yeah, I would assume so.
Nostalgia isn't what it used to be.
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 11:43 am

Uh oh... I think the issue here might be that I'm dumb.

I had totally forgotten that many months ago I had installed numad! I just noticed it pop up in glances. I've killed it and now I'm going to watch and see if that does the trick.
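
(For anyone else who hits this: assuming it was installed as a service, sudo systemctl stop numad followed by sudo systemctl disable numad should keep it dead across reboots.)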
1. iMac 27" (2020) i7 10700k; AMD Radeon Pro 5500 XT 8 GB; 64 GB RAM; 500 GB internal SSD + external box of SSDs
2. ThreadRipper 2990wx; Ubuntu; Headless; 64GB RAM
3. MacBook Pro (2017); Core i7-7820HQ; 16 GB RAM
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 11:54 am

LOL... yeah, a background service whose job it is to mess around with CPU affinity behind your back in an attempt to automatically "fine tune" things will certainly cause confusion if you've forgotten it is there and are trying to manage the affinity manually!

"Pick up gun, aim at foot, pull trigger!"

I'll bet that was it.
Nostalgia isn't what it used to be.
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 12:57 pm

just brew it! wrote:
LOL... yeah, a background service whose job it is to mess around with CPU affinity behind your back in an attempt to automatically "fine tune" things will certainly cause confusion if you've forgotten it is there and are trying to manage the affinity manually!

"Pick up gun, aim at foot, pull trigger!"

I'll bet that was it.


Yup -- it sure looks like this was it.

Well, I guess the upside is that I learned a thing or two along the way. But sheesh.... I am dumb.
1. iMac 27" (2020) i7 10700k; AMD Radeon Pro 5500 XT 8 GB; 64 GB RAM; 500 GB internal SSD + external box of SSDs
2. ThreadRipper 2990wx; Ubuntu; Headless; 64GB RAM
3. MacBook Pro (2017); Core i7-7820HQ; 16 GB RAM
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 1:58 pm

We all do "dumb" stuff from time to time. Optimal scheduling on NUMA can be tricky, and it's easy to accidentally outsmart yourself! :lol:

Unrelated to NUMA but still in the "outsmart yourself" vein, I've seen people make a righteous mess of things by (ab)using the posix_fadvise() system call, believing that they could do a better job of managing the disk I/O caches than the default cache management algorithms.

Edit: And the fact that you learned something along the way means that overall it was probably a net win!
Nostalgia isn't what it used to be.
 
anotherengineer
Gerbil Jedi
Posts: 1688
Joined: Fri Sep 25, 2009 1:53 pm
Location: Northern, ON Canada, Yes I know, Up in the sticks

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 2:38 pm

Life doesn't change after marriage, it changes after children!
 
just brew it!
Administrator
Posts: 54500
Joined: Tue Aug 20, 2002 10:51 pm
Location: Somewhere, having a beer

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 2:41 pm

anotherengineer wrote:

Yeah, was wondering how long it would be before someone posted that... :roll: :lol:
Nostalgia isn't what it used to be.
 
anotherengineer
Gerbil Jedi
Posts: 1688
Joined: Fri Sep 25, 2009 1:53 pm
Location: Northern, ON Canada, Yes I know, Up in the sticks

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 2:42 pm

just brew it! wrote:
anotherengineer wrote:

Yeah, was wondering how long it would be before someone posted that... :roll: :lol:


you're welcome :)
Life doesn't change after marriage, it changes after children!
 
blastdoor
Gerbil Elite
Topic Author
Posts: 846
Joined: Fri Apr 08, 2005 4:12 pm
Location: Real America

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 4:30 pm

So this is maybe kind of interesting... I think I may be able to observe variation in performance across CCXs. Here's the NUMA topology reported by numactl -H:

available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
node 0 size: 32125 MB
node 0 free: 22334 MB
node 1 cpus: 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47
node 1 size: 0 MB
node 1 free: 0 MB
node 2 cpus: 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31
node 2 size: 32225 MB
node 2 free: 21901 MB
node 3 cpus: 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63
node 3 size: 0 MB
node 3 free: 0 MB

As described previously, I launched two instances of R, one bound to node 0 and one bound to node 2.

For each R instance, I spin off one child process per Monte Carlo replication, 64,000 replications in all. That is, I spin off 32 processes and wait for them to complete; as they complete, I spin off a new process to replace the one that finished. I use a table to keep track of how many jobs are completed by each logical core.

When it was all done, I plotted jobs completed by core and noticed an interesting pattern -- there are four distinct clumps for each memory node. Each clump consists of 8 logical cores (4 physical cores). The average number of jobs completed in each clump:

clump 1: 2376
clump 2: 2275
clump 3: 2162
clump 4: 1187

Or, normalizing to clump 1:
clump 1: 1
clump 2: 0.96
clump 3: 0.91
clump 4: 0.50

So I'm wondering -- are we seeing differences across CCXs here, with the best performance coming from the CCXs closest to RAM and the worst from the one furthest away? And how about that drop-off between clumps 3 and 4?

I want to repeat this, and also compare it to runs in which I use fewer logical cores, to get a better sense of how performance scales. I'll also compare it to a scenario in which I don't set any affinity and just let the scheduler handle things.
1. iMac 27" (2020) i7 10700k; AMD Radeon Pro 5500 XT 8 GB; 64 GB RAM; 500 GB internal SSD + external box of SSDs
2. ThreadRipper 2990wx; Ubuntu; Headless; 64GB RAM
3. MacBook Pro (2017); Core i7-7820HQ; 16 GB RAM
 
Waco
Maximum Gerbil
Posts: 4850
Joined: Tue Jan 20, 2009 4:14 pm
Location: Los Alamos, NM

Re: Task Scheduling, PPID, forking, NUMA

Sat Jun 29, 2019 11:46 pm

That's interesting -- I would have expected two "clumps" at higher speeds and two a bit slower if they were memory constrained. Three and one is... odd.
Victory requires no explanation. Defeat allows none.
 
Captain Ned
Global Moderator
Posts: 28704
Joined: Wed Jan 16, 2002 7:00 pm
Location: Vermont, USA

Re: Task Scheduling, PPID, forking, NUMA

Sun Jun 30, 2019 10:34 am

NUMA? What does Dirk Pitt's crew have to do with this?
What we have today is way too much pluribus and not enough unum.
