
I have seen a few papers on parallel/GPU processing of trees, but after briefly looking through them I wasn't able to grasp what they did. The closest thing to a helpful explanation I found was this diagram from Parallelization: Binary Tree Traversal:

[diagram: parallel binary tree traversal]

But I'm having a difficult time following the paper.

I'm wondering if one could outline an algorithm for parallel processing of a tree. I can imagine this being possible, and the existence of papers on it suggests it is, but I can't really think of what you would do to make it happen.

If it's any help: specifically, I'm wondering how to traverse a B+tree to find matches.

Update

Here is another diagram (from here) which seems to shed some light, but I'm having difficulty understanding it.

[diagram: parallel tree traversal]

Lance
  • The paper you link seems to be talking about normal, recursively defined trees. Do you have a link to something talking about trees on a GPU? – Caleth Apr 25 '18 at 15:01
  • @Caleth There are a few papers that mention GPU tree traversals, but can't tell if they are good. "Recently, irregular algorithms such as ... kd-tree traversals have been implemented on GPUs" (https://docs.lib.purdue.edu/cgi/viewcontent.cgi?article=1450&context=ecetr) "For example, in our transformed Barnes-Hut kernel we load a partial node that only contains the position vector of the current node and its type" in section "Transforming CPU traversals for the GPU". – Lance Apr 25 '18 at 15:39
  • @Caleth I think [this](http://on-demand.gputechconf.com/gtc/2014/presentations/S4668-transformations-gpu-execution-tree-traversals.pdf) is a presentation for the paper. They have concepts _autoropes_, _lockstepping_, and _warps_, which adds to it. – Lance Apr 25 '18 at 15:43
  • [This one](http://dicl.unist.ac.kr/publications/icpp2016.pdf) says "We develop **data parallel** a tree traversal algorithm - _Parallel Scanning and Backtracking_ (PSB) that processes multiple branches of a tree node in parallel ... To the best of our knowledge, this is the first work that parallelizes kNN query processing on the n-ary tree structured index for the GPU." – Lance Apr 25 '18 at 15:47
  • ...But reading section "Parallel Scan and Backtrack for kNN Query" I don't see anything about GPU vectors and whatnot to actually make it possible. Not sure where I'm missing stuff. – Lance Apr 25 '18 at 15:53
  • Almost there, but not quite. "I.e., the derivation of the additional structural information of dTree is executed on the GPU, while the CPU task is to run the outer loop over qTree nodes, and to prepare the information for the GPU." http://www.adms-conf.org/shnaiderman_adms12.pdf – Lance Apr 25 '18 at 22:20
  • Hmm, I think perhaps these papers are referencing CUDA which has the concept of threads. https://cs.nyu.edu/courses/spring12/CSCI-GA.3033-012/lecture5.pdf Still don't get it lol. – Lance Apr 25 '18 at 23:20
  • I think you would need to be really careful doing anything with trees on a GPU if you want any kind of performance. Since many (most?, all?) GPUs are lousy at handling branches, and branches are kind of the defining feature of trees. – JonasH Jun 30 '23 at 09:39

3 Answers


It's very simple.

When you would recurse down the left child and then the right child, instead start a task that recurses down the left, and continue down the right yourself. When you are done with the right, you may have to wait on the left, but in a balanced tree that wait won't be long.

When you are at depth N in your tree, you have the potential for 2^N cores working.

That specific diagram spawns tasks only at specific depths, probably because it isn't queueing tasks. It uses knowledge of the structure of the tree to avoid overloading the scheduler.

If you have a task queue and a thread pool, the implementation can leverage those to not have to worry about the quantity of tasks spawned.
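That depth-cutoff idea can be sketched like this (an illustrative sketch, not the diagram's exact scheme; `std::async` here stands in for a proper task queue, and the `Node` type and cutoff constant are assumptions):

```cpp
#include <future>

struct Node {
    int data;
    Node* left = nullptr;
    Node* right = nullptr;
};

// Spawn a task for the left subtree only while above the cutoff depth,
// so the scheduler isn't flooded; below it, fall back to plain recursion.
template <typename Action>
void TraverseCutoff(Node* n, Action action, int depth, int maxParDepth) {
    if (!n) return;
    action(n->data);  // pre-order visit
    if (depth < maxParDepth) {
        auto fut = std::async(std::launch::async, [=] {
            TraverseCutoff(n->left, action, depth + 1, maxParDepth);
        });
        TraverseCutoff(n->right, action, depth + 1, maxParDepth);
        fut.wait();  // join the left subtree before returning
    } else {
        TraverseCutoff(n->left, action, depth + 1, maxParDepth);
        TraverseCutoff(n->right, action, depth + 1, maxParDepth);
    }
}
```

Note the action must be thread-safe, since subtrees above the cutoff are visited concurrently.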

Say you have

class Node 
{
    int data; // or w/e

    Node * left;
    Node * right;

    template <typename Action>
    void SerialTraverse(Action action)
    {
        action(data); // Pre-order traversal

        if (left) left->SerialTraverse(action);
          // Traverse the left
        if (right) right->SerialTraverse(action);
          // Traverse the right
    }
};

namespace library 
{
    template<typename Task, typename Class, typename Arg>
    std::future<void> enqueue_task(Task task, Class * obj, Arg arg);
    // on some other thread, call "obj->task(arg);"
    // returns a handle upon which we can wait
}

Then you can change SerialTraverse to

    template <typename Action>
    void ParallelTraverse(Action action)
    {
        action(data);
        std::future<void> fut;
        if (left) fut = library::enqueue_task(&Node::ParallelTraverse<Action>, left, action);
          // start traversing left on another thread

        if (right) right->ParallelTraverse(action);
          // traverse right on this thread, in parallel to traversing left

        if (fut.valid()) fut.wait();
          // wait for the left traversal to end
    }
Caleth
  • To paraphrase Caleth's answer, to force individual threads (spawned tasks) to descend a particular branch (say, one of the colors in the diagram), simply pass in "a list of turns (left, right) to be taken from the starting point" as an input to each task. – rwong Apr 25 '18 at 10:48
  • @rwong or just a reference to whichever node it would start from – Caleth Apr 25 '18 at 10:54
  • Not really following. So taking WebGL as an example, what is the input array of data, like `[nodeid, lnodeid, rnodeid, nodeid, lnodeid, ...]` or something, where the block size is 3, and you return something in the array like the turn to go on next? That's what I gather (still not quite following) from @rwong comment. Ah so you do two full-depth branches at the same time, then move onto the next pair of branches? (Thinking of a B+tree instead of a binary tree like the diagram suggests). – Lance Apr 25 '18 at 13:03
  • @LancePollard I've added an example of transforming a serial tree algorithm to an equivalent parallel algorithm. The important point is that you *start* both actions on the left and right children without waiting for the other to be done – Caleth Apr 25 '18 at 13:20
  • @Caleth thank you, I can kind of see how that would be parallel using threads, wondering though how this would be done on the GPU with limited data. Reflected that in the question. Thank you. – Lance Apr 25 '18 at 13:42
  • I guess b/c GPU can't use recursion, whereas your example uses recursion. – Lance Apr 25 '18 at 13:48

You might think it's better to have one GPU call per level of the tree, but I think it's simpler to put it all in one call, do the steps in a loop, and get your cores back into operation via the number of simultaneous whole-tree accesses.

Here's a setup for a key store in a tree, on the GPU.

Use a 32-bit uint texture with this format: [NEXT FREE POSITION] followed by repeating triples of [NODE MATCHING STRING][PARALLEL POINTER][SERIAL POINTER][NODE MATCHING STRING][PARALLEL POINTER][SERIAL POINTER]... The next free position points to the end, so another op can be tacked on there.

The parallel pointer points to the leg beside the one you're up to, and the serial pointer points deeper into the tree. If the parallel pointer is ever null, you are at the end with no match, and you abort. When you are at the last level, you put the leaf contents in and complete successfully.

You hurt your GPU quite a bit by hopping around the texture randomly, but there isn't much you can do about it, because it's all spatially divided parallel-wise and you get away with log-series levels.

Here's an example of the reading code, done in DirectCompute.

//its only really short.

[numthreads(32, 32, 1)]  //put your 200k parallel accesses here. :)
void cs_zc_read_read(uint3 DTid : SV_DispatchThreadID )
{
  //get the first link pointer with a kickstart array, cause the start reuses 
  //too much and it makes too many parallel nodes.
 uint link=buffer1[buffer0[(DTid.x+DTid.y*RES)*ZCKEY32+0].f].f; 
 uint i = 0;
 while(i<ZCKEY32)
 {
  uint match32=buffer0[(DTid.x+DTid.y*RES)*ZCKEY32+i].f;
  uint match232;
  uint lnk[2];
  match232=buffer2[link];  
  lnk[0]=buffer2[link+2]; 
  lnk[1]=buffer2[link+1];
  if(match32==match232)
  {
   link=lnk[0];   //this steps in series.
   i++;
   if(i==ZCKEY32)
   {
    //success, its an exact match.
    link=lnk[0]; //get the routing position 
    //get two routings.
    BufferOut[(DTid.x+DTid.y*RES)*2+0].f=buffer2[link+1].f;
    BufferOut[(DTid.x+DTid.y*RES)*2+1].f=buffer2[link+2].f;
   }
  }
  else
  {
   link=lnk[1];   //this steps in parallel
   if(link==0){i=100000;} //abort, its a no match
  }
 }
}

Don't worry about parallelizing the write, because it's a pain. =p But if you want it to perform even better, you can collect multiple values using ternary codes and get more flexibility. You have to multiply the cost by the amount you're collecting in parallel, but it doesn't matter if it's just more iteration in the single call, because once you've saturated your cores, more parallel threads don't gain you any more speed.


You can probably put CPU code for a binary tree access in the shader, converting it across pretty much exactly as-is. Then you can do one query per core, so if you have 2048 processors, you can do 2048 queries in the same time as one.