Personal computing discussed

Moderators: SecretSquirrel, just brew it!

 
dragontamer5788
Gerbil First Class
Topic Author
Posts: 158
Joined: Mon May 06, 2013 8:39 am

Multithreaded / Parallelism topic

Tue Oct 16, 2018 12:44 pm

Elsewhere in this forum, there was a very good parallel-programming / multithreaded programming topic going on. It seems like a large number of forum members have expertise in this field. I hope to get this topic restarted here.

My experiences with parallel programming are relatively low. I've bought myself a Threadripper 1950x earlier this year, and have begun to explore it (NUMA issues and all). I've also tried to explore OpenCL with the AMD R9 290x, but that development environment is poor... and SIMD doesn't seem to fit most code in general. (Heck, its hard enough to figure out how to use multiple cores of a CPU!).

My current hobby project is writing an AI for an obscure game. In it, I've got a large number of relations that need to be inner-joined together, as well as a large number of board positions to explore (kinda like chess: each move creates a branch in the search tree. And it makes sense to explore all branches in parallel, if possible). And since CPUs have a limited form of SIMD available through AVX, I've been trying to integrate SIMD into my program. Alas, my best efforts for SIMD on the CPU are to just use the SSE / AVX registers as 128-bit or 256-bit bitmasks... and using AND / OR / NOT operations on those. Rather primitive to say the least, but working on 128-bits at a time is the ideal for Threadripper.

My current subtask for this program is to parallelize the inner-join operator. My use of relational theory will revolve around dozens, maybe hundreds of inner-joins going on at once. Because this is such a costly operator, it seems natural to try to apply parallel programming to it. Across my research, I found an interesting inner-join algorithm called the Leapfrog Triejoin, which seems like it can be parallelized easily. The Leapfrog Triejoin visualizes the inner-join like a sequential search, and sequential searches can be easily parallelized (Ex: If you have 32-threads, you have each thread sequentially scan 1/32nd of the array).

Ironically, I can more easily imagine this working with OpenCL, rather than with typical CPU programming. I guess the grass is always greener on the other side. Now that I'm trapped on the CPU, I end up discovering a good algorithm to use OpenCL for...

------------

EDIT: Oh yeah, one problem I have right now is that std::vector is unaware of NUMA allocations. Since I'm running on Windows (not Linux), I can't just rely upon a "first touch" policy to set the std::vector to the correct NUMA node.

I did a brief search for NUMA-aware C++ containers. There's some stuff in Boost but nothing for a data-structure like std::vector. My C++ Allocator knowledge is pretty bad, but in the worst case, I might have to just write a C++ Allocator.
 
NTMBK
Gerbil XP
Posts: 304
Joined: Sat Dec 21, 2013 11:21 am

Re: Multithreaded / Parallelism topic

Tue Oct 16, 2018 3:13 pm

I'd recommend taking a look at Intel's TBB libraries. They have some good high level algorithms for parallel reduction, along with some good thread-safe data structures.
 
dragontamer5788
Gerbil First Class
Topic Author
Posts: 158
Joined: Mon May 06, 2013 8:39 am

Re: Multithreaded / Parallelism topic

Tue Oct 16, 2018 3:48 pm

NTMBK wrote:
I'd recommend taking a look at Intel's TBB libraries. They have some good high level algorithms for parallel reduction, along with some good thread-safe data structures.


Oh yeah, I'm currently deciding between TBB and Microsoft PPL. From what I can tell, they are both very well designed libraries. TBB seems more active, but PPL is built-into Visual Studio and has more examples using the lambda [] syntax of C++ through its documentation.

PPL seems to have some degree of NUMA-awareness through its "Scheduler Groups", but I haven't really played with it yet.
 
dragontamer5788
Gerbil First Class
Topic Author
Posts: 158
Joined: Mon May 06, 2013 8:39 am

Re: Multithreaded / Parallelism topic

Thu Oct 18, 2018 10:07 pm

https://homes.cs.washington.edu/~chushu ... 5_join.pdf

So this paper implements two things:

1. A cluster-based multi-join algorithm with a single communication round. This is unfortunately an "externally parallel" algorithm, designed for MPI or some other message-passing infrastructure on a cluster of 64-servers... so its not relevant to me (although it may be relevant to others??)

2. An implementation of a highly-efficient (but non-parallel) multiway join.

The #2 is what I'm focusing on, because of all of the multiway joins I've discovered, its the one that makes the most sense. Its based on the Leapfrog Triejoin paper I posted earlier, except its clearly more cache-friendly and uses less memory: It uses sorted arrays instead of building a B-Tree, and doesn't seem to be any slower asymptotically. The coolest thing about Leapfrog Triejoin (and "Tributary Join", that this newer paper calls the array-based version) is that its worst-case optimal: a 3-way join of (A(x,y) inner-join B(y,z) inner-join C(x,z)) will take at most O(n^(3/2)) time, even without any optimizations.

While any traditional database would take O(n^2) time with 2-at-a-time joins. ( (A-join-B) THEN join-C is the traditional approach. Tributary join does A,B, AND C at the same time). This extends out arbitrarily, you can perform an 8-way join or a 20-way join in a single Tributary join step.

----------------

Anyway, since I'm using constraint programming with lots of these "cyclical" joins, I think I finally found the inner-join algorithm that best works for my use case. The algorithm is relatively simple (at least, as simple as a simultaneous 3+-way join-merges can get), and IMO, seems to hold huge potential to be parallelized on a SMP / cohesive-memory computer (like Threadripper).

Who is online

Users browsing this forum: No registered users and 2 guests