Elsewhere in this forum, there was a very good parallel-programming / multithreaded programming topic going on. It seems like a large number of forum members have expertise in this field. I hope to get this topic restarted here.
My experience with parallel programming is fairly limited. I bought myself a Threadripper 1950X earlier this year and have begun to explore it (NUMA issues and all). I've also tried to explore OpenCL with the AMD R9 290X, but that development environment is poor... and SIMD doesn't seem to fit most code in general. (Heck, it's hard enough to figure out how to use multiple cores of a CPU!)
My current hobby project is writing an AI for an obscure game. In it, I've got a large number of relations that need to be inner-joined together, as well as a large number of board positions to explore (kinda like chess: each move creates a branch in the search tree, and it makes sense to explore all branches in parallel, if possible). And since CPUs have a limited form of SIMD available through SSE and AVX, I've been trying to integrate SIMD into my program. Alas, my best effort at SIMD on the CPU so far is to just use the SSE / AVX registers as 128-bit or 256-bit bitmasks and run AND / OR / NOT operations on them. Rather primitive to say the least, but working on 128 bits at a time is the ideal for Threadripper, since its first-generation Zen cores execute 256-bit AVX instructions as two 128-bit halves.
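To make the bitmask idea concrete, here is a minimal sketch of a 128-bit mask held in a single SSE register. The `Mask128` type and the `mask_*` helper names are made up for illustration, and it assumes an x86-64 compiler (SSE2 is baseline there). It is not meant as the one right way to do it, just the AND / OR / NOT pattern described above.

```cpp
#include <cassert>
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics (baseline on x86-64)

// Hypothetical wrapper: 128 board squares / relation rows in one XMM register.
struct Mask128 {
    __m128i bits;
};

inline Mask128 make_mask(std::uint64_t lo, std::uint64_t hi) {
    // _mm_set_epi64x takes the HIGH 64 bits first, then the low 64 bits.
    return { _mm_set_epi64x(static_cast<long long>(hi), static_cast<long long>(lo)) };
}

inline Mask128 mask_and(Mask128 a, Mask128 b) { return { _mm_and_si128(a.bits, b.bits) }; }
inline Mask128 mask_or(Mask128 a, Mask128 b)  { return { _mm_or_si128(a.bits, b.bits) }; }

// a AND (NOT b). Note that _mm_andnot_si128 negates its FIRST argument.
inline Mask128 mask_and_not(Mask128 a, Mask128 b) { return { _mm_andnot_si128(b.bits, a.bits) }; }

inline std::uint64_t mask_lo(Mask128 a) {
    return static_cast<std::uint64_t>(_mm_cvtsi128_si64(a.bits));
}

inline std::uint64_t mask_hi(Mask128 a) {
    // Copy the high 64 bits into the low lane, then extract the low lane.
    return static_cast<std::uint64_t>(_mm_cvtsi128_si64(_mm_unpackhi_epi64(a.bits, a.bits)));
}

inline bool mask_empty(Mask128 a) {
    // Compare all 16 bytes against zero; every byte equal gives a 0xFFFF movemask.
    const __m128i eq = _mm_cmpeq_epi8(a.bits, _mm_setzero_si128());
    return _mm_movemask_epi8(eq) == 0xFFFF;
}

inline bool mask_test(Mask128 a, unsigned idx) {
    assert(idx < 128);
    const std::uint64_t half = idx < 64 ? mask_lo(a) : mask_hi(a);
    return ((half >> (idx % 64)) & 1u) != 0;
}
```

Something like `mask_empty(mask_and(attacked, my_pieces))` then answers "is anything of mine attacked?" for all 128 positions at once, where `attacked` and `my_pieces` are placeholder masks.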
My current subtask for this program is to parallelize the inner-join operator. My use of relational theory will revolve around dozens, maybe hundreds, of inner-joins going on at once. Because this is such a costly operator, it seems natural to try to apply parallel programming to it. During my research, I found an interesting inner-join algorithm called the Leapfrog Triejoin, which seems like it can be parallelized easily. The Leapfrog Triejoin treats the inner-join like a sequential search, and sequential searches can be easily parallelized (e.g. if you have 32 threads, you have each thread sequentially scan 1/32nd of the array).
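As a rough sketch of the "each thread scans its own slice" idea, the code below intersects two sorted, duplicate-free key arrays by splitting the first one into per-thread slices. This is only the single-attribute leapfrog step (seek forward with a binary search, emit on a match), not the full Leapfrog Triejoin over tries, and the names `intersect_slice` and `parallel_intersect` are invented for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Intersect the slice a[begin, end) with all of b.
// Assumption: both inputs are sorted ascending and contain no duplicates.
static std::vector<int> intersect_slice(const std::vector<int>& a,
                                        std::size_t begin, std::size_t end,
                                        const std::vector<int>& b) {
    std::vector<int> out;
    if (begin >= end) return out;
    auto ai = a.begin() + begin;
    const auto a_end = a.begin() + end;
    // Initial seek: skip everything in b below this slice's first key.
    auto bi = std::lower_bound(b.begin(), b.end(), *ai);
    while (ai != a_end && bi != b.end()) {
        if (*ai < *bi) {
            ai = std::lower_bound(ai, a_end, *bi);    // leapfrog a up to b's key
        } else if (*bi < *ai) {
            bi = std::lower_bound(bi, b.end(), *ai);  // leapfrog b up to a's key
        } else {
            out.push_back(*ai);                       // keys match: emit and advance both
            ++ai;
            ++bi;
        }
    }
    return out;
}

std::vector<int> parallel_intersect(const std::vector<int>& a,
                                    const std::vector<int>& b,
                                    unsigned num_threads) {
    assert(num_threads > 0);
    std::vector<std::vector<int>> partial(num_threads);  // one output buffer per thread
    std::vector<std::thread> workers;
    const std::size_t chunk = (a.size() + num_threads - 1) / num_threads;
    for (unsigned t = 0; t < num_threads; ++t) {
        const std::size_t begin = std::min(a.size(), t * chunk);
        const std::size_t end = std::min(a.size(), begin + chunk);
        workers.emplace_back([&a, &b, &partial, t, begin, end] {
            partial[t] = intersect_slice(a, begin, end, b);
        });
    }
    for (auto& w : workers) w.join();
    // Slices cover ascending, disjoint key ranges, so concatenating keeps the result sorted.
    std::vector<int> result;
    for (const auto& p : partial) result.insert(result.end(), p.begin(), p.end());
    return result;
}
```

Each thread writes only to its own `partial[t]`, so no locking is needed. With skewed data, equal-sized slices may not balance the work well, so a real version would probably want smaller chunks handed out dynamically.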
Ironically, I can more easily imagine this working with OpenCL, rather than with typical CPU programming. I guess the grass is always greener on the other side. Now that I'm trapped on the CPU, I end up discovering a good algorithm to use OpenCL for...
------------
EDIT: Oh yeah, one problem I have right now is that std::vector isn't NUMA-aware. Since I'm running on Windows (not Linux), I can't just rely upon a "first touch" policy to place the std::vector's memory on the correct NUMA node.
I did a brief search for NUMA-aware C++ containers. There's some stuff in Boost, but nothing for a data structure like std::vector. My knowledge of C++ allocators is pretty bad, but in the worst case, I might just have to write a custom allocator myself.
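For what it's worth, a minimal sketch of such an allocator might look like the following. `NumaAllocator` is a made-up name. On Windows it asks `VirtualAllocExNuma` for memory with a preferred node; on other platforms this sketch simply falls back to `malloc` and ignores the node, purely so that the code compiles everywhere. Treat it as a starting point under those assumptions rather than a tested solution.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <new>
#include <vector>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical allocator that requests memory on a preferred NUMA node.
template <class T>
struct NumaAllocator {
    using value_type = T;

    unsigned node = 0;  // preferred NUMA node for every allocation

    NumaAllocator() = default;
    explicit NumaAllocator(unsigned n) : node(n) {}
    template <class U>
    NumaAllocator(const NumaAllocator<U>& other) : node(other.node) {}

    T* allocate(std::size_t n) {
        if (n > static_cast<std::size_t>(-1) / sizeof(T)) throw std::bad_alloc();
        const std::size_t bytes = n ? n * sizeof(T) : 1;
#ifdef _WIN32
        // Reserve and commit pages, preferring physical memory on `node`.
        void* p = VirtualAllocExNuma(GetCurrentProcess(), nullptr, bytes,
                                     MEM_RESERVE | MEM_COMMIT, PAGE_READWRITE, node);
#else
        // Assumption for this sketch: no NUMA placement off Windows.
        void* p = std::malloc(bytes);
#endif
        if (!p) throw std::bad_alloc();
        return static_cast<T*>(p);
    }

    void deallocate(T* p, std::size_t) noexcept {
        assert(p != nullptr);
#ifdef _WIN32
        VirtualFree(p, 0, MEM_RELEASE);
#else
        std::free(p);
#endif
    }
};

// Allocators compare equal only if they target the same node.
template <class T, class U>
bool operator==(const NumaAllocator<T>& a, const NumaAllocator<U>& b) { return a.node == b.node; }
template <class T, class U>
bool operator!=(const NumaAllocator<T>& a, const NumaAllocator<U>& b) { return !(a == b); }
```

Usage would be along the lines of `std::vector<int, NumaAllocator<int>> v(NumaAllocator<int>(1));` to prefer node 1. One caveat: `VirtualAllocExNuma` works in whole pages with 64 KB address granularity, so this only makes sense for large vectors; many small ones would waste a lot of address space.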