83

This is a rather conceptual question, but I was hoping I could get some good advice on this. A lot of the programming I do is with (NumPy) arrays; I often have to match items in two or more arrays that are of different sizes and the first thing I go to is a for-loop or even worse, a nested for-loop. I want to avoid for-loops as much as possible, because they are slow (at least in Python).

I know that for a lot of things with NumPy there are pre-defined commands that I just need to research, but do you (as more experienced programmers) have a general thought process that comes to mind when you have to iterate something?

So I often have something like this, which is awful and I want to avoid it:

import numpy as np

small_array = np.array(["one", "two"])
big_array = np.array(["one", "two", "three", "one"])

for i in range(len(small_array)):
    for p in range(len(big_array)):
        if small_array[i] == big_array[p]:
            print("This item is matched:", small_array[i])

I know there are multiple different ways to achieve this in particular, but I am interested in a general method of thinking, if it exists.
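
For reference, the matching above can be written without any explicit loop (a sketch using NumPy's set routine np.isin, called np.in1d in older NumPy versions):

import numpy as np

small_array = np.array(["one", "two"])
big_array = np.array(["one", "two", "three", "one"])

# Boolean mask: True where an element of big_array also appears in small_array.
mask = np.isin(big_array, small_array)
print(big_array[mask])  # ['one' 'two' 'one']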

gnat
turnip
  • You're looking for *functional programming*: lambda expressions, higher-order functions, generator expressions, etc. Google those. – Kilian Foth Aug 26 '14 at 12:03
  • `I want to avoid for-loops as much as possible because they are slow (at least in Python).` Sounds like you're solving the wrong problem here. If you need to iterate over something, you need to iterate over something; you'll take a similar performance hit no matter which Python construct you use. If your code is slow it's not because you have `for` loops; it's because you're doing unnecessary work or doing work on the Python side that could be done on the C side. In your example you're doing extra work; you could've done it with one loop instead of two. – Doval Aug 26 '14 at 12:06
  • @Doval Unfortunately not -- *in NumPy*. A Python for loop performing elementwise addition can easily be several times(!) slower than the vectorized NumPy operator (which is not only written in C but uses SSE instructions and other tricks) for realistic array sizes. –  Aug 26 '14 at 12:10
  • @delnan Right; I addressed that with "doing work on the Python side that could be done on the C side". But he says for loops are "slow (at least in Python)" as if they were somehow slower than other Python code. – Doval Aug 26 '14 at 12:12
  • @Doval From the context I assumed OP was speaking absolutely (compared to a C loop, it *is* slow). –  Aug 26 '14 at 12:20
  • If your arrays are slow you may want to think about using other data structures, like a binary tree. http://www.laurentluce.com/posts/binary-search-tree-library-in-python/ – Pieter B Aug 26 '14 at 12:43
  • Seems like these comments could be answers. – JeffO Aug 26 '14 at 12:55
  • I agree with the above. If you're concerned about performance, then look at data structures for a quicker method of iterating over data. There are TONS of data structures to learn and implement. You might even find that your current thought process for solving problems is causing your application to be slow. There's a lot to consider here, and without further information about your ideals and programming style we can't really do anything other than recommend different styles of problem solving (e.g. data structures). – Cameron McKay Aug 26 '14 at 13:14
  • Some of the comments above seem to misunderstand the question. When programming in NumPy, you get best results if you can [*vectorize* your computation](http://docs.scipy.org/doc/numpy/user/whatisnumpy.html) — that is, replace explicit loops in Python with whole-array operations in NumPy. This is conceptually very different from ordinary programming in Python, and takes time to learn. So I think it's reasonable for the OP to ask for advice on how to go about learning to do it. – Gareth Rees Aug 26 '14 at 14:29
  • The way I read that link on vectorizing data is that it mainly hides the looping and makes it more efficient, but it still takes place. Correct me if I'm wrong, but will numpy arrays transform O(n^2) operations into O(n log n) operations? If you don't handle complexity you're just delaying your bottleneck. – Pieter B Aug 26 '14 at 14:48
  • @PieterB: Yes, that's right. "Vectorization" is not the same as "choosing the best algorithm". They are two separate sources of difficulty in coming up with efficient implementations and so it's best to think about them one at a time. – Gareth Rees Aug 26 '14 at 14:53
  • It's worth noting that (unlike in MATLAB and other languages), `numpy`'s `vectorize` function does not actually provide any performance benefit. The vectorized operations *provided by `numpy`*, on the other hand, are implemented in C and are very quick. The thing in question here is "implemented in C" vs "implemented in Python", not "loops vs no loops". – Patrick Collins Aug 26 '14 at 19:14
  • I don't know anything about Python specifically, but conceptually, the very first thing that came to my mind here is set-based operations. One of my favorite things when working in database code is taking even the most complex loops and converting them into single set-based operations. It ALWAYS GREATLY increases the performance. Sounds like what @GarethRees is talking about in his comment. – eidylon Aug 26 '14 at 23:40
  • @eidylon You are right; even when the worst case is still O(n^2), there is a lot you can do for the best and average cases, and actual performance will increase a lot. The two techniques in set-based operations make maximum use of available memory, and performance comes to something like O(n log n) or O(n) in many cases. Big gain here. – InformedA Aug 27 '14 at 03:39
  • As a non-pythonist, I have this idea, but I don't know whether numpy or python has support for these things. For each item in `small_array`, create a tuple of (constant 1, item index in `small_array`, item value in `small_array`). Likewise, for each item in `big_array`, create a tuple of (constant 2, item index in `big_array`, item value in `big_array`). Since the tuples are of the same type, they can be concatenated. Group by the third column - the item values, in this case strings. Grouping is similar to sort but is not necessarily ordered (for example, sort by hash value). – rwong Aug 27 '14 at 07:27
  • (Continued.) After grouping, read out each group, and sort by the actual string values. When consecutive runs of identical string values are found, read out the first column (which says whether it is in `small_array` or `big_array`) and the second column (which gives the item's index in the corresponding array). Posting as a comment because I don't know if such operations are possible in python or numpy. – rwong Aug 27 '14 at 07:30
  • One intermediate step you can take that might help you get there is to start working with [list-comprehensions](http://www.secnetix.de/olli/Python/list_comprehensions.hawk). For example your loop above can be turned into this: `[x for x in big_array if x in small_array]` assuming big_array and small_array are lists. – JimmyJames Dec 13 '16 at 15:54
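
To make the set-based idea from these comments concrete, here is a sketch with plain Python lists; a set makes each membership test O(1) on average, whereas `x in small_array` on a list scans the whole list:

small_array = ["one", "two"]
big_array = ["one", "two", "three", "one"]

small_set = set(small_array)                        # built once: O(M)
matched = [x for x in big_array if x in small_set]  # each test is O(1) on average
print(matched)  # ['one', 'two', 'one']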

3 Answers

94

This is a common conceptual difficulty when learning to use NumPy effectively. Normally, data processing in Python is best expressed in terms of iterators, to keep memory usage low, to maximize opportunities for parallelism with the I/O system, and to provide for reuse and combination of parts of algorithms.

But NumPy turns all that inside out: the best approach is to express the algorithm as a sequence of whole-array operations, to minimize the amount of time spent in the slow Python interpreter and maximize the amount of time spent in fast compiled NumPy routines.
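
For instance, here is a minimal sketch of the same computation done both ways:

import numpy as np

x = np.arange(1_000_000, dtype=np.int64)

# Python-level loop: every iteration runs through the interpreter.
total = 0
for v in x:
    total += v * 2

# Whole-array operation: a single call into compiled NumPy code.
total_vectorized = np.sum(x * 2)

assert total == total_vectorized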

Here's the general approach I take:

  1. Keep the original version of the function (which you are confident is correct) so that you can test it against your improved versions both for correctness and speed.

  2. Work from the inside out: that is, start with the innermost loop and see if it can be vectorized; when you've done that, move out one level and continue (see the sketch after this list).

  3. Spend lots of time reading the NumPy documentation. There are a lot of functions and operations in there and they are not always brilliantly named, so it's worth getting to know them. In particular, if you find yourself thinking, "if only there were a function that did such-and-such," then it's well worth spending ten minutes looking for it. It's usually in there somewhere.
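
Here is a sketch of steps 1 and 2 on a made-up example (the function names are hypothetical):

import numpy as np

def total_absdiff_loops(a, b):
    """Step 1: the original nested-loop version, kept as the reference."""
    result = 0
    for i in range(len(a)):
        for j in range(len(b)):
            result += abs(a[i] - b[j])
    return result

def total_absdiff_inner(a, b):
    """Step 2, first pass: the innermost loop becomes a whole-array operation."""
    result = 0
    for i in range(len(a)):
        result += np.sum(np.abs(a[i] - b))  # one NumPy call per outer iteration
    return result

def total_absdiff_full(a, b):
    """Step 2, second pass: broadcasting removes the outer loop as well."""
    return np.sum(np.abs(a[:, None] - b))

a, b = np.arange(50), np.arange(30)
assert total_absdiff_loops(a, b) == total_absdiff_inner(a, b) == total_absdiff_full(a, b)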

There's no substitute for practice, so I'm going to give you some example problems. The goal for each problem is to rewrite the function so that it is fully vectorized: that is, so that it consists of a sequence of NumPy operations on whole arrays, with no native Python loops (no for or while statements, no iterators or comprehensions).

Problem 1

def sumproducts(x, y):
    """Return the sum of x[i] * y[j] for all pairs of indices i, j.

    >>> sumproducts(np.arange(3000), np.arange(3000))
    20236502250000

    """
    result = 0
    for i in range(len(x)):
        for j in range(len(y)):
            result += x[i] * y[j]
    return result

Problem 2

def countlower(x, y):
    """Return the number of pairs i, j such that x[i] < y[j].

    >>> countlower(np.arange(0, 200, 2), np.arange(40, 140))
    4500

    """
    result = 0
    for i in range(len(x)):
        for j in range(len(y)):
            if x[i] < y[j]:
                result += 1
    return result

Problem 3

def cleanup(x, missing=-1, value=0):
    """Return an array that's the same as x, except that where x ==
    missing, it has value instead.

    >>> cleanup(np.arange(-3, 3), value=10)
    ... # doctest: +NORMALIZE_WHITESPACE
    array([-3, -2, 10, 0, 1, 2])

    """
    result = []
    for i in range(len(x)):
        if x[i] == missing:
            result.append(value)
        else:
            result.append(x[i])
    return np.array(result)

Spoilers below. You'll get much the best results if you have a go yourself before looking at my solutions!

Answer 1

np.sum(x) * np.sum(y)

Answer 2

np.sum(np.searchsorted(np.sort(x), y))

Answer 3

np.where(x == missing, value, x)
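
As a quick sanity check, all three one-liners reproduce the doctest values from the problems above (in Answer 2, np.searchsorted returns, for each y[j], how many sorted x values fall before it):

import numpy as np

x = y = np.arange(3000, dtype=np.int64)
assert np.sum(x) * np.sum(y) == 20236502250000

x, y = np.arange(0, 200, 2), np.arange(40, 140)
assert np.sum(np.searchsorted(np.sort(x), y)) == 4500

x = np.arange(-3, 3)
assert list(np.where(x == -1, 10, x)) == [-3, -2, 10, 0, 1, 2]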

Gareth Rees
8

To make things faster you have to read up on data structures and use the appropriate ones.

For non-trivial sizes of the small and big arrays (say, small = 100 elements and big = 10,000 elements), one way to do it is to sort the small array, then iterate over the big array and use a binary search to find matching elements in the small array.

This brings the worst-case time complexity to O(N log N) (and for a small small array and a very big big array it is closer to O(N)), whereas your nested-loop solution is O(N^2).

However, which data structures are most efficient depends heavily on the actual problem.
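
A minimal sketch of this idea using NumPy's built-in binary search (np.searchsorted), reusing the arrays from the question:

import numpy as np

small_array = np.sort(np.array(["one", "two"]))   # sort the small array once: O(M log M)
big_array = np.array(["one", "two", "three", "one"])

pos = np.searchsorted(small_array, big_array)     # binary search per element: O(N log M)
pos = np.clip(pos, 0, len(small_array) - 1)       # keep insertion points in range
print(big_array[small_array[pos] == big_array])   # ['one' 'two' 'one']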

Pieter B
-3

You can use a dictionary to optimize performance significantly. Building the lookup table costs O(N) once; every lookup after that is O(1) on average, instead of scanning an array for each item.

Here is another example:

# Build the lookup table once: airport abbreviation -> (height, width).
# (The [1:-1] slice drops the first and last characters of the abbreviation,
# presumably surrounding quote marks.)
locations = {}
for i in range(len(airports)):
    locations[airports["abb"][i][1:-1]] = (airports["height"][i], airports["width"][i])

# Each row then needs only a single O(1) dictionary lookup.
for i in range(len(uniqueData)):
    h, w = locations[uniqueData["dept_apt"][i]]
    uniqueData["dept_apt_height"][i] = h
    uniqueData["dept_apt_width"][i] = w