The problem I'm encountering is that sometimes (not always) the task scheduler dumps all threads onto the 16 cores that have direct memory access, leaving the 16 without direct access completely idle. This happens even when I set the affinity of the processes to specific cores, including the 16 cores without direct memory access.
mcparallel evaluates the expr expression in parallel to the current R process. Everything is shared read-only (or in fact copy-on-write) between the parallel process and the current process, i.e. no side-effects of the expression affect the main process. The result of the parallel execution can be collected using the mccollect function.
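For reference, a minimal sketch of that pattern (the workload and job count here are placeholders, not the actual simulation code):

    library(parallel)

    # Fork one child per job; children share the parent's memory copy-on-write
    # (Linux/macOS only -- mcparallel relies on fork()).
    jobs <- lapply(1:4, function(i) mcparallel(mean(rnorm(1e6))))

    # Wait for all children and gather their serialized results.
    results <- mccollect(jobs)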
just brew it! wrote:All other issues aside, if the scheduler is ignoring explicit affinity settings, that's a serious bug. Are you sure you're really setting the affinity?
chuckula wrote:If it is IO (either memory or piping to a parent process if that occurs) then an excellent tool to use for profiling is the Linux perf utility: https://perf.wiki.kernel.org/index.php/Main_Page
It's quite powerful (and maybe too complicated) but it can probably help you track down your problems.
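A hedged example of attaching perf to a single R worker (assumes Linux with the perf tool installed; the long-running workload is a placeholder):

    library(parallel)

    # Fork a worker, then sample it with perf for 10 seconds.
    job <- mcparallel(Sys.sleep(600))
    system(sprintf("perf stat -p %d -- sleep 10", job$pid))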
Duct Tape Dude wrote:From the docs: mcparallel evaluates the expr expression in parallel to the current R process. Everything is shared read-only (or in fact copy-on-write) between the parallel process and the current process, i.e. no side-effects of the expression affect the main process. The result of the parallel execution can be collected using the mccollect function.
The "no side-effects" part is technically true until a point: since you're running as a readonly forked subprocess, the results of the each subprocess probably need to be serialized and reported back to the main process (quite CPU-intensive for large answers). Do these processes always start on different cores before being rescheduled to the same NUMA node?
Also, how many simulations are "a bunch"? If it's under ~10k per run, I'd consider splitting each out into an external job of some sort, with 64 processes to handle jobs and maybe one more to orchestrate them all, so you get a completely isolated execution environment per job, each with its own NUMA node access, etc. I.e., use a brand-new process instead of child processes so there's zero parent/child communication (see the sketch after this post).
I'm assuming that Node.js child processes are structured and behave similarly. As a caveat, in Node.js my tests returning a bunch of data from many child processes were actually faster via TCP (and especially UDP, though the kernel silently drops packets if the ring buffers fill) to the main thread than via standard child-to-parent pipes, due to the synchronous serialization pipes require. (If your application takes far longer to compute an answer than to return it, this is irrelevant.)
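A rough sketch of the fully isolated variant described above, using R's PSOCK clusters so each worker is a brand-new process rather than a fork (the simulate() function and the counts are placeholders):

    library(parallel)

    simulate <- function(i) mean(rnorm(1e6))  # placeholder for one Monte Carlo run

    # PSOCK workers are fresh R processes (no fork), so there is no shared
    # parent state; arguments and results travel over a local socket instead.
    cl <- makeCluster(64)
    results <- parLapply(cl, 1:10000, simulate)
    stopCluster(cl)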
blastdoor wrote:just brew it! wrote:All other issues aside, if the scheduler is ignoring explicit affinity settings, that's a serious bug. Are you sure you're really setting the affinity?
Good question — I’m not 100% sure. I’m using an option within the R mcparallel function to set affinity, and I can’t vouch for its efficacy.
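The option in question is presumably mcparallel's mc.affinity argument; a minimal sketch of that usage (the core number is purely illustrative):

    library(parallel)

    # mc.affinity takes a vector of CPU numbers the child is allowed to run on
    # (here: pin this one child to a single core).
    job <- mcparallel(mean(rnorm(1e6)), mc.affinity = 5)
    result <- mccollect(job)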
just brew it! wrote:blastdoor wrote:just brew it! wrote:All other issues aside, if the scheduler is ignoring explicit affinity settings, that's a serious bug. Are you sure you're really setting the affinity?
Good question — I’m not 100% sure. I’m using an option within the R mcparallel function to set affinity, and I can’t vouch for its efficacy.
As a troubleshooting aid, you can check and/or modify the current affinity mask of individual threads with the "taskset" command.
You can find the IDs of the individual threads by doing a "ps -AT" (the second column, SPID, is the thread's ID).
The "htop" tool is also useful when dealing with threaded applications. I suggest you install it if you haven't already done so.
blastdoor wrote:Ugh -- I spoke too soon.
I'm using glances to monitor things, and I found documentation saying that you can see core-affinity info for the top process by pressing 'e'.
For a while, it was fine -- the top process was pinned to a single core. But just now, the pin broke and the top process is associated with 16 cores.
....and now I just checked again, and it's back to a single core.
Hmm. Each process lasts about a minute before dying, so if it breaks free of its affinity pin, it doesn't have time to cause too much havoc. But why are processes breaking free of their affinity pin? How can that happen?
chuckula wrote:blastdoor wrote:Ugh -- I spoke too soon.
I'm using glances to monitor things, and I found documentation saying that you can see core-affinity info for the top process by pressing 'e'.
For a while, it was fine -- the top process was pinned to a single core. But just now, the pin broke and the top process is associated with 16 cores.
....and now I just checked again, and it's back to a single core.
Hmm. Each process lasts about a minute before dying, so if it breaks free of its affinity pin, it doesn't have time to cause too much havoc. But why are processes breaking free of their affinity pin? How can that happen?
I wouldn't panic too much about a process migrating cores if all the cores are busy.
I keyed in on one of the changes you just made: making each forked process run just one Monte Carlo simulation instead of a batch of 1000. Assuming each simulation takes a non-trivial amount of time (by non-trivial I mean more than, say, 10 milliseconds), the overall cost of forking a process for each simulation will be negligible. My speculation: previously the simulations weren't all completing at the same speed, and you ended up in the "only 16 cores in use" situation as the final sets of simulations finished up and the scheduler put them on the dies with direct RAM access (which is probably a good idea).
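If you want to put a rough number on that per-fork cost, a crude sketch (timings vary widely by machine and workload):

    library(parallel)

    # Crude estimate of per-fork overhead: time 100 trivial forks.
    t <- system.time({
      jobs <- lapply(1:100, function(i) mcparallel(i + 1))
      mccollect(jobs)
    })
    print(t[["elapsed"]] / 100)  # rough seconds of overhead per fork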
just brew it! wrote:Have you confirmed that it is a single process/thread that starts out with single-core affinity (then subsequently loses it), or is it possible that some of them simply aren't getting their affinity set properly to begin with? If they only run for about a minute apiece, you're going to need to watch the actual process/thread IDs to determine which case it is.
This really sounds like a buggy framework that simply isn't setting the core affinity properly.
just brew it! wrote:Are the other systems also NUMA?
blastdoor wrote:just brew it! wrote:Are the other systems also NUMA?
I had used the c5.18xlarge instance type. Amazon says that's a Xeon Platinum (Skylake) system with 72 logical cores. I don't know for sure, but I presume that means a two-socket system.
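One way to confirm the topology from a shell, scripted here via R's system() (assumes lscpu is available and the numactl package is installed):

    # Both commands report how many NUMA nodes exist and which CPUs map to each.
    system("lscpu | grep -i numa")
    system("numactl --hardware")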
just brew it! wrote:LOL... yeah, a background service whose job it is to mess around with CPU affinity behind your back in an attempt to automatically "fine tune" things will certainly cause confusion if you've forgotten it is there and are trying to manage the affinity manually!
"Pick up gun, aim at foot, pull trigger!"
I'll bet that was it.
anotherengineer wrote: [image]
just brew it! wrote:anotherengineer wrote: [image]
Yeah, was wondering how long it would be before someone posted that...