
I have a relatively small algorithm that takes up ~60% of the total run-time of my scientific code (57 lines of 3600), so I would like to find a way to optimize it and make the code order-independent so that I can apply a cilk_for parallel structure.

Here's what it does, verbally: I have an std::vector of pointers to custom objects called Segment (vector<Segment*> newSegment). Each Segment contains a std::vector of integers (mesh indices). In this function, I would like to find any Segment that overlaps with any other, with overlap being defined as the member indices overlapping on the number line. If two Segments do overlap, I would like to join them together (insert A.indices into B.indices) and delete one of them (delete A).

Ex. 1: A.indices={1,2,3}, B.indices={4,5,6}: no overlap, so do nothing.

Ex. 2: A.indices={1,2,4}, B.indices={3,5,6}: they overlap, so A is deleted and B.indices={1,2,3,4,5,6}.

The overlaps are sparse, but existent.
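
In other words, two segments overlap exactly when their index ranges intersect. A minimal sketch of that test (assuming each segment's smallest and largest index are at hand; the helper name is made up), which should be equivalent to the dist <= L1 + L2 check in the code below:

#include <algorithm>

// Closed index ranges [aMin, aMax] and [bMin, bMax] intersect iff the later
// start does not lie past the earlier end.
bool rangesOverlap(int aMin, int aMax, int bMin, int bMax) {
    return std::max(aMin, bMin) <= std::min(aMax, bMax);
}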

Here's the current code:

main algorithm:

//make sure segments don't overlap
for (unsigned i = 0; i < newSegment.size(); ++i) {
    if (newSegment[i]->size() == 0) continue;
    for (unsigned j = i + 1; j < newSegment.size(); ++j) {
        if (newSegment[i]->size() == 0) continue;
        if (newSegment[j]->size() == 0) continue;
        int i1 = newSegment[i]->begin();
        int i2 = static_cast<int>(newSegment[i]->end());
        int j1 = newSegment[j]->begin();
        int j2 = static_cast<int>(newSegment[j]->end());
        int L1 = std::abs(i1 - i2);
        int L2 = std::abs(j1 - j2);
        int dist = std::max({i1, i2, j1, j2}) - std::min({i1, i2, j1, j2});

        //if overlap, fold segments together
        //copy indices from shorter segment to taller segment
        if (dist <= L1 + L2) {
            unsigned more, less;
            if (newSegment[i]->slope == newSegment[j]->slope) {
                if (value_max[i] > value_max[j]) {
                    more = i;
                    less = j;
                } else {
                    more = j;
                    less = i;
                }
            } else if (newSegment[i]->size() == 1) {
                more = j; less = i;
            } else if (newSegment[j]->size() == 1) {
                more = i; less = j;
            } else assert(1 == 0);
            while (!newSegment[less]->indices.empty()) {
                unsigned index = newSegment[less]->indices.back();
                newSegment[less]->indices.pop_back();
                newSegment[more]->indices.push_back(index);
            }
        }
    }

}//end overlap check

//delete empty segments
vector<unsigned> delList;
for (unsigned i = 0; i < newSegment.size(); ++i) {
    if (newSegment[i]->size() == 0) {                            //delete empty
        delList.push_back(i);
        continue;
    }
}
while (delList.size() > 0) {
    unsigned index = delList.back();
    delete newSegment.at(index);
    newSegment.erase(newSegment.begin() + index);
    delList.pop_back();
}

Relevant Segment object class definition and member functions:

class Segment{

    public:
    Segment();
    ~Segment();

    unsigned size();
    int begin();
    unsigned end();
    std::vector<int> indices;
    double slope;
};

int Segment::begin() {
    if (!std::is_sorted(indices.begin(), indices.end())) std::sort(indices.begin(), indices.end());
    if (indices.size() == 0) return -1;
    return indices[0];
}

unsigned Segment::end() {
    if (!std::is_sorted(indices.begin(), indices.end())) std::sort(indices.begin(), indices.end());
    return indices.back();
}

unsigned Segment::size() {
    unsigned indSize = indices.size();
    if (indSize == 1) {
        if (indices[0] == -1) return 0;
    }   
    return indSize;
}

Ideas:

  1. Since I don't care about the order of the Segment objects, could they be kept in an unordered container?
  2. In my algorithm, I find overlap by looking at the first and last indices of each segment. I do an std::is_sorted (and then maybe a std::sort) when I fetch the indices because the list can change when more indices are inserted. Maybe I could put the indices in a std::set rather than std::vector to save the explicit sort-checking/sorting?
  3. I'm pretty sure that editing the indices as I go is what makes the algorithm order-dependent. Perhaps I could break the code into the following organization, using the concept of an undirected graph, to make it order-independent (see the sketch after this list):

    • edge discovery (without modifying indices)
    • join clusters of connected nodes (Segment objects that overlap) using a graph traversal
    • delete empty Segment objects
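
A minimal sketch of idea 3, assuming the Segment class shown above (the UnionFind struct and the function name are made up; it also ignores the slope/value_max tie-breaking of the original merge and simply pours every cluster into one surviving segment):

#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

struct UnionFind {
    std::vector<std::size_t> parent;
    explicit UnionFind(std::size_t n) : parent(n) { std::iota(parent.begin(), parent.end(), std::size_t{0}); }
    std::size_t find(std::size_t x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    void unite(std::size_t a, std::size_t b) { parent[find(a)] = find(b); }
};

void mergeOverlapClusters(std::vector<Segment*>& segs) {
    UnionFind uf(segs.size());

    // 1) edge discovery: never changes which indices belong to which segment
    for (std::size_t i = 0; i < segs.size(); ++i) {
        for (std::size_t j = i + 1; j < segs.size(); ++j) {
            if (segs[i]->size() == 0 || segs[j]->size() == 0) continue;
            int lo = std::max(segs[i]->begin(), segs[j]->begin());
            int hi = std::min(static_cast<int>(segs[i]->end()), static_cast<int>(segs[j]->end()));
            if (lo <= hi) uf.unite(j, i);   // same overlap criterion as dist <= L1 + L2
        }
    }

    // 2) join each cluster of connected segments: move indices into the cluster root
    for (std::size_t i = 0; i < segs.size(); ++i) {
        std::size_t r = uf.find(i);
        if (r == i) continue;
        segs[r]->indices.insert(segs[r]->indices.end(),
                                segs[i]->indices.begin(), segs[i]->indices.end());
        segs[i]->indices.clear();
    }
    // 3) delete the now-empty segments exactly as in the original code
}

Because edge discovery never modifies the segments, it is the natural place to apply cilk_for, although the unite calls would then still need synchronization (or per-thread edge lists combined afterwards).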

Questions

  1. Are any of the ideas above worthwhile, or is their effect on performance negligible?
  2. How else can I optimize it?
  3. How (if not the above) can I make the algorithm order-independent?
Stershic
  • Is is_sorted() implemented by checking an internal flag (that gets set by sort() and unset by mutators) or does it compare elements to each other until it finds ones that are out of order? In fact, I don't see any mutators on Segment, so is that meant to be an immutable class? (if so, you could try ensuring sorted-ness once on construction) – Ixrec Nov 08 '15 at 16:46
  • @Ixrec `std::is_sorted` ( [link](http://www.cplusplus.com/reference/algorithm/is_sorted/) ). I don't have any mutator methods per se, I just `push_back` directly to the public `indices` vector (e.g. `newSegment[j]->indices.push_back(i)`). I `push_back` to `indices` in only two other places in the code, so I could `std::sort` after each one, as well as sort when I combine them in this algorithm, and then delete the `is_sorted`/`sort` when accessing. – Stershic Nov 08 '15 at 17:17
  • The merging part (of the original algorithm) sounds like what a [disjoint set (union-find) algorithm (en.wikipedia.org)](https://en.wikipedia.org/wiki/Disjoint-set_data_structure) would do. – rwong Nov 08 '15 at 21:36

2 Answers


The is_sorted() function is probably expensive, and so you should avoid it. Why not sort everything in one go right at the beginning before entering the loops?

The best way to optimize your code is to invent a new algorithm that avoids the nested loops over the N segments, because those have a complexity of O(N^2) (see "big-Oh notation"). See Bart van Ingen Schenau's comment below on how to achieve this.
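
Something like the following, sketched under the assumptions that every segment is non-empty and that the slope/value_max bookkeeping of the original merge is not needed (the function name is made up):

#include <algorithm>
#include <cstddef>
#include <vector>

void mergeAdjacentOverlaps(std::vector<Segment*>& segs) {
    // sort each segment's indices once up front, then sort the segments by first index
    for (Segment* s : segs)
        std::sort(s->indices.begin(), s->indices.end());
    std::sort(segs.begin(), segs.end(),
              [](Segment* a, Segment* b) { return a->indices.front() < b->indices.front(); });

    // one linear pass: a segment can only overlap its immediate successor
    for (std::size_t i = 0; i + 1 < segs.size(); ++i) {
        Segment* cur  = segs[i];
        Segment* next = segs[i + 1];
        if (cur->indices.empty()) continue;
        if (cur->indices.back() >= next->indices.front()) {   // ranges touch or overlap
            std::size_t added = cur->indices.size();
            next->indices.insert(next->indices.end(),
                                 cur->indices.begin(), cur->indices.end());
            // both halves are already sorted, so no full re-sort is needed
            std::inplace_merge(next->indices.begin(),
                               next->indices.end() - added,
                               next->indices.end());
            cur->indices.clear();   // cur is now empty and gets deleted later
        }
    }
}

After this pass, the empty segments can be deleted exactly as in the question's final loop.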

Mike Nakis
  • The second `if (newSegment[i]->size() == 0) continue;` statement may, in fact, be needed since the `indices` can be removed if `i` overlaps with an earlier `j`. What's the performance consequence of sorting everything before entering the loops and after insertion VS making `indices` a `std::set` instead? – Stershic Nov 08 '15 at 18:01
  • I think the overlap problem is pretty clear regardless of the rest of my code. I think the pairwise search for overlap is unavoidably O(N^2) (right?), so I want to make it as quick and parallel as possible. – Stershic Nov 08 '15 at 18:02
  • @Stershic: The search for overlap isn't necessarily O(N^2). If you can change the order of the segments, you could order them by starting element. Then you would only have to look at adjacent segments to detect an overlap (you have an overlap if `Segment[i].end() > Segment[i+1].begin()`). – Bart van Ingen Schenau Nov 08 '15 at 19:50
  • Exactly, what @Stershic said. And in order to sort by starting element, you need to sort the elements in each segment first. – Mike Nakis Nov 08 '15 at 20:23
  • (okay, I removed my suggestion to get rid of the first `if (newSegment[i]->size() == 0) continue;`.) – Mike Nakis Nov 08 '15 at 20:26
  • As for using sets, they theoretically have a O(1) access time just as arrays do, but in actuality they are burdened by a non-negligible constant factor, which is not accounted for by big-oh notation. In any case, I do not really see how sets could be of any use to you here. – Mike Nakis Nov 08 '15 at 20:33
  • I would recommend sorting the indices within each segment once in the beginning, and then either re-sorting after a merge, or, perhaps even better yet, merging the segments in such a way that no re-sorting is necessary. (It is easy to do when you know that both arrays are already sorted, see http://programmers.stackexchange.com/q/267406/41811; there is a sketch of this after these comments.) – Mike Nakis Nov 08 '15 at 20:34
  • Thanks @MikeNakis and BartvanIngenSchenau for helping me remove the nested loop. Any tips on making the loop order-independent to parallelize? I expect I need to separate the overlap identification from the merging (different loops), but I still can't think of how to make the merging parallel-friendly. – Stershic Nov 08 '15 at 23:26
  • Well, if you find a group of segments that need to be merged into one, and you know that there is no preceding segment that the group needs to be merged into, nor any succeeding segment that will need to be merged into the group, then this is pretty much a standalone group of segments, so you can offload the merging to a separate thread. – Mike Nakis Nov 08 '15 at 23:41
  • But be prepared to be disappointed: even if you are careful enough to not fall into any of the common performance killing pitfalls of multi-threading, (like launching your own thread instead of using a thread from a threadpool,) dealing with threads tends to have huge overheads, so unless you are working with huge data sets, multi-threading is more likely to give you a degradation, rather than an improvement, of performance. – Mike Nakis Nov 08 '15 at 23:42
  • Also, merging sorted arrays of integers is a highly memory-bound operation, meaning that it does not involve any calculations in-between memory accesses, but no matter how many cores your CPU has, it still has only one address bus and only one data bus, so even if you achieve the perfect parallelization, you might still see no improvement, because your bottleneck is likely to be in the memory accesses. – Mike Nakis Nov 08 '15 at 23:47
  • @MikeNakis: C++ sets have O(log N) access, as do all tree structures. Finding an element in a vector is O(1) by index, O(N) by value in an unsorted vector but O(log N) for a sorted vector. – MSalters Nov 30 '15 at 16:44
  • @MSalters it has been a while since I dealt with this question & answer, so perhaps I have forgotten some context, but I fail to see why you are mentioning this. (And it surprises me to hear that a set in C++ is O(log N), I would expect it to be a hashset and be O(1).) – Mike Nakis Nov 30 '15 at 17:34
  • @MSalters oh, I see now. – Mike Nakis Nov 30 '15 at 17:43
  • @MSalters so, I guess I had `unordered_set` in mind. Thanks for pointing this out. I cannot go back and change a comment. – Mike Nakis Nov 30 '15 at 17:46
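
A minimal sketch of the merge-without-re-sorting step mentioned above (the helper name is made up; both index vectors are assumed to be sorted already):

#include <algorithm>
#include <iterator>
#include <vector>

// Merge src's sorted indices into dst's sorted indices without a full re-sort.
void mergeSortedIndices(std::vector<int>& dst, std::vector<int>& src) {
    std::vector<int> merged;
    merged.reserve(dst.size() + src.size());
    std::merge(dst.begin(), dst.end(), src.begin(), src.end(), std::back_inserter(merged));
    dst.swap(merged);
    src.clear();
}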

I came to the same algorithm as @BartVanIngenSchenau in his comment above: basically, sort the set of segments by the minimum element of each segment. Then two adjacent segments overlap if and only if Segment[i].max >= Segment[i+1].min.

But I would like to add that sorting the indices inside each segment looks unnecessary: just keep the min and the max element and update them when merging segments (writing + for the merge of two segments): (segment1+segment2).min = min(segment1.min, segment2.min) and (segment1+segment2).max = max(segment1.max, segment2.max). Moreover, if the segments are sorted by min element, you have (Segment[i]+Segment[i+1]).min = Segment[i].min (but this last point may be premature optimization).
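
A minimal sketch of that bookkeeping (the struct and function names are made up; the indices themselves are left unsorted):

#include <algorithm>
#include <vector>

// Hypothetical variant of Segment that caches its smallest and largest mesh
// index, so the overlap test never needs the indices to be sorted.
struct CachedSegment {
    std::vector<int> indices;
    int min_index;
    int max_index;
};

// Merge b into a: append the indices and update the cached bounds.
void merge(CachedSegment& a, CachedSegment& b) {
    a.indices.insert(a.indices.end(), b.indices.begin(), b.indices.end());
    a.min_index = std::min(a.min_index, b.min_index);
    a.max_index = std::max(a.max_index, b.max_index);
    b.indices.clear();
}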

For cache locality, the best option for merging might be a layout similar to the following:

ptr_to_2nd_segment
n_elt_of_1st_segment,
min_elt_of_1st_segment,
[
[other_elts_of_1st_segment,]
max_elt_of_1st_segment,]

ptr_to_3rd_segment
n_elt_of_2nd_segment,
min_elt_of_2nd_segment,
[
[other_elts_of_2nd_segment,]
max_elt_of_2nd_segment,]

...

Merging two segments in this layout would be quite simple: just update the pointer to the next segment, add the element counts, shift the second segment's elements, and swap the max elements if necessary. That would leave some junk after each merge (8 bytes on a 32-bit architecture and 16 bytes on a 64-bit architecture). Whether you can tolerate such junk is application dependent (moreover, you could do a kind of garbage collection between two iterations of the algorithm).
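
As a rough sketch, the fixed part of each record in that layout might look like the following (names are made up; the remaining elements live in the buffer right after the header rather than as a struct member):

#include <cstdint>

struct PackedSegmentHeader {
    PackedSegmentHeader* next;   // ptr_to_next_segment
    std::int32_t count;          // n_elt_of_this_segment
    std::int32_t min_element;    // min_elt_of_this_segment
    // count - 1 further elements (the last one is the max) follow contiguously
};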

For parallelizing: once the set of segments is sorted by min element, you can divide it into n parts and do the merges independently, then merge only at the border of each part. But as @MikeNakis says in his comments, merging is quite memory bound, so it might not parallelize well.
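
A very rough sketch of that chunked scheme (all names are made up; `pass` is assumed to be a sequential adjacent-merge over segs[first, last), for example a sub-range version of the single-pass merge sketched in the other answer):

#include <algorithm>
#include <cstddef>
#include <functional>
#include <thread>
#include <vector>

void parallelMerge(std::vector<Segment*>& segs, unsigned nParts,
                   const std::function<void(std::vector<Segment*>&, std::size_t, std::size_t)>& pass) {
    if (nParts == 0) nParts = 1;
    std::vector<std::thread> workers;
    std::size_t chunk = (segs.size() + nParts - 1) / nParts;
    for (unsigned p = 0; p < nParts; ++p) {
        std::size_t first = p * chunk;
        std::size_t last  = std::min<std::size_t>(segs.size(), first + chunk);
        if (first >= last) break;
        workers.emplace_back([&segs, &pass, first, last] { pass(segs, first, last); });
    }
    for (std::thread& t : workers) t.join();

    // keep the non-empty segments (in order) at the front, then one more
    // sequential pass over them catches overlaps that straddle the chunk borders
    auto firstEmpty = std::stable_partition(segs.begin(), segs.end(),
                                            [](Segment* s) { return !s->indices.empty(); });
    pass(segs, 0, static_cast<std::size_t>(firstEmpty - segs.begin()));
}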

Xavier Combelle