58

I have, for example, this table

+--------+--------+
| fruit  | weight |
+--------+--------+
| apple  |   4    |
| orange |   2    |
| lemon  |   1    |
+--------+--------+

I need to return a random fruit, but apple should be picked four times as frequently as lemon and twice as frequently as orange.

In the more general case, each item should be picked f(weight) times as frequently.

What is a good general algorithm to implement this behavior?

Or maybe there are some ready-made gems for Ruby? :)

PS
I've implemented the current algorithm in Ruby: https://github.com/fl00r/pickup

edited by Deduplicator
asked by fl00r

5 Answers

59

The conceptually simplest solution would be to create a list where each element occurs as many times as its weight, so

fruits = [apple, apple, apple, apple, orange, orange, lemon]

Then use whatever functions you have at your disposal to pick a random element from that list (e.g. generate a random index within the proper range). This is of course not very memory efficient and requires integer weights.
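As a sketch in Python (not part of the original answer; the names are illustrative):

```python
import random

weights = {"apple": 4, "orange": 2, "lemon": 1}

# Repeat each fruit as many times as its (integer) weight,
# then pick uniformly from the expanded list.
expanded = [fruit for fruit, w in weights.items() for _ in range(w)]
picked = random.choice(expanded)
```

Since `expanded` contains seven entries, four of which are apples, a uniform pick from it yields an apple with probability 4/7.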


Another, slightly more complicated approach would look like this:

  1. Calculate the cumulative sums of weights:

    intervals = [4, 6, 7]
    

where a value below 4 represents an apple, a value from 4 to below 6 an orange, and from 6 to below 7 a lemon.

  2. Generate a random number n in the range of 0 to sum(weights).
  3. Go through the list of cumulative sums from start to finish to find the first item whose cumulative sum is above n. The corresponding fruit is your result.

This approach requires more complicated code than the first, but less memory and computation and supports floating-point weights.
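In Python, the three steps might look like this (a sketch, not part of the original answer):

```python
import random

fruits = ["apple", "orange", "lemon"]
weights = [4, 2, 1]

# Step 1: cumulative sums of the weights.
intervals = []
total = 0
for w in weights:
    total += w
    intervals.append(total)  # [4, 6, 7]

# Step 2: random number n in [0, sum(weights)).
n = random.random() * total

# Step 3: first fruit whose cumulative sum is above n.
chosen = next(fruit for fruit, csum in zip(fruits, intervals) if csum > n)
```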

For either algorithm, the setup-step can be done once for an arbitrary number of random selections.

edited by lennon310
answered by Benjamin Kloster
    the interval solution seems nice – Jalayn May 29 '12 at 09:16
  • 1
    This was my first thought :). But what if I have got table with 100 fruits and weight could be around 10k? It will be very large array and this will be not as efficient as I want. This is about first solution. Second solution looks good – fl00r May 29 '12 at 09:25
  • 1
    I've implemented this algorithm in Ruby https://github.com/fl00r/pickup – fl00r May 29 '12 at 14:26
  • 3
    alias method is **the defacto way to handle this** I'm honestly astonished at the number of posts that repeat the same code over and over again, all while *ignoring alias method*. [*for god sake you get constant time performance!*](https://en.wikipedia.org/wiki/Alias_method) – Krupip Nov 07 '18 at 14:21
  • 1
    I know this is been 8 years old already, at the time of writing, but shouldn't it be "the **first** item whose cumulative sum is above `n`" instead of "the last item whose cumulative sum is above n"? If it were the latter, the last item in the list will always be selected. – Sean Francis N. Ballais Dec 31 '20 at 07:45
32

Here's an algorithm (in C#) that selects a random weighted element from any sequence, iterating through it only once:

private static readonly Random rng = new Random(); // shared generator (Random.Next is an instance method)

public static T Random<T>(this IEnumerable<T> enumerable, Func<T, int> weightFunc)
{
    int totalWeight = 0; // stores the sum of weights of all elements seen so far
    T selected = default(T); // currently selected element
    foreach (var data in enumerable)
    {
        int weight = weightFunc(data); // weight of current element
        int r = rng.Next(totalWeight + weight); // random value in [0, totalWeight + weight)
        if (r >= totalWeight) // probability of this is weight/(totalWeight + weight)
            selected = data; // i.e. the probability of discarding the last selected element and selecting the current one instead
        totalWeight += weight; // increase the weight sum
    }

    return selected; // when iteration ends, selected is some element of the sequence
}

This is based on the following reasoning: let's select the first element of the sequence as the "current result"; then, on each iteration, either keep it or discard it and choose the current element instead. The probability that any given element is selected in the end is the product of all the probabilities that it isn't discarded in subsequent steps, times the probability that it is selected in the first place. If you do the math, you'll see that this product simplifies to (weight of element)/(sum of all weights), which is exactly what we need!

Since this method only iterates over the input sequence once, it works even with obscenely large sequences, provided that the sum of weights fits into an int (or you can choose a bigger type for this counter).
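The same one-pass idea can be rendered in Python (a sketch; `random.random()` stands in for the C# random call):

```python
import random

def weighted_pick(items, weight_func):
    total = 0
    selected = None
    for item in items:
        w = weight_func(item)
        total += w
        # Keep this item with probability w / total (total now includes w);
        # over the whole pass this works out to w / sum(all weights).
        if total > 0 and random.random() * total < w:
            selected = item
    return selected
```

Usage: `weighted_pick(["apple", "orange", "lemon"], {"apple": 4, "orange": 2, "lemon": 1}.get)`.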

edited by Ghost4Man
answered by Nevermind
  • 2
    I would benchmark this before assuming it's better just because it iterates once. Generating just as many random values isn't exactly fast either. – Jean-Bernard Pellerin Jun 20 '13 at 22:59
  • 1
    @Jean-Bernard Pellerin I did, and it is actually faster on large lists. Unless you use cryptographically strong random generator (-8 – Nevermind Jun 21 '13 at 07:28
  • Should be the accepted answer imo. I like this better than the "interval" and "repeated entry" approach. – Vivin Paliath Jul 21 '15 at 19:33
  • 2
    I just wanted to say I've come back to this thread 3 or 4 times in the last couple years to use this method. This method has repeatedly succeeded in providing the answers I need quickly enough for my purposes. I wish I could upvote this answer every time I came back to use it. – Jim Yarbro Feb 15 '16 at 08:23
  • 1
    Nice solution if you really only have to choose once. Otherwise, doing the prep-work for the solution in the first answer once is far more efficient. – Deduplicator Jul 20 '18 at 10:10
  • +1 Just a word of caution to anyone using this: weightFunc should return a positive number as the above method won't work with negative weights. – Patrick Klug Jul 29 '19 at 12:35
22

Already present answers are good and I'll expand on them a bit.

As Benjamin suggested cumulative sums are typically used in this kind of problem:

+------------------------+
| fruit  | weight | csum |
+------------------------+
| apple  |   4    |   4  |
| orange |   2    |   6  |
| lemon  |   1    |   7  |
+------------------------+

To find an item in this structure, you can use something like Nevermind's piece of code, or this piece of C# that I usually use:

double r = random.NextDouble() * totalSum; // random is a shared System.Random instance
for (int i = 0; i < fruit.Count; i++)
{
    if (csum[i] > r)
        return fruit[i];
}

Now to the interesting part: how efficient is this approach, and what is the most efficient solution? My piece of code requires O(n) memory and runs in O(n) time. I don't think it can be done with less than O(n) space, but the time complexity can be much lower, O(log n) in fact. The trick is to use binary search instead of a regular for loop.

double r = random.NextDouble() * totalSum;
int lowGuess = 0;
int highGuess = fruit.Count - 1;

while (highGuess >= lowGuess)
{
    int guess = (lowGuess + highGuess) / 2;
    if ( csum[guess] < r)
        lowGuess = guess + 1;
    else if ( csum[guess] - weight[guess] > r)
        highGuess = guess - 1;
    else
        return fruit[guess];
}

There is also the matter of updating weights. In the worst case, updating the weight of one element forces an update of the cumulative sums of all elements, pushing the update complexity to O(n). That too can be cut down to O(log n) using a binary indexed tree.
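To illustrate that last point (a sketch in Python, not from the original answer): a Fenwick / binary indexed tree over the weights supports both O(log n) sampling and O(log n) weight updates:

```python
import random

class WeightedSampler:
    """Fenwick (binary indexed) tree over the weights."""

    def __init__(self, weights):
        self.n = len(weights)
        self.tree = [0.0] * (self.n + 1)
        for i, w in enumerate(weights):
            self.update(i, w)

    def update(self, i, delta):
        # Add delta to the weight of element i in O(log n).
        i += 1
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def total(self):
        # Sum of all weights.
        s, i = 0.0, self.n
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def sample(self):
        # Descend the implicit tree to find the first index whose
        # cumulative sum exceeds r, in O(log n).
        r = random.random() * self.total()
        idx, bit = 0, 1
        while bit * 2 <= self.n:
            bit *= 2
        while bit:
            nxt = idx + bit
            if nxt <= self.n and self.tree[nxt] <= r:
                r -= self.tree[nxt]
                idx = nxt
            bit //= 2
        return idx  # 0-based index of the selected element
```

For the fruit table, `WeightedSampler([4, 2, 1])` samples indices 0..2 in proportion 4:2:1, and `update(0, -4)` later removes apple from contention without rebuilding anything.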

answered by Emperor Orionii
  • Good point about binary search – fl00r May 29 '12 at 14:28
  • Nevermind's answer doesn't need extra space, so it's O(1), but adds runtime complexity by repeatedly generating random numbers and evaluating the weight function (which, depending on the underlying problem, could be costly). – Benjamin Kloster May 29 '12 at 15:05
  • 1
    What you claim to be "more readable version" of my code is actually not. Your code needs to know total sum of weights, and cumulative sums, in advance; mine doesn't. – Nevermind May 29 '12 at 17:58
  • @Benjamin Kloster My code only calls weight function once per element - you can't do any better than that. You're right about random numbers, though. – Nevermind May 29 '12 at 17:59
  • @Nevermind: You only call it once per call to the pick-function, so if the user calls it twice, the weight function is called again for each element. Of course you could cache it, but then you're not O(1) for space complexity anymore. – Benjamin Kloster May 29 '12 at 19:29
  • @BenjaminKloster everyday pseudo-random number generators are O(1) and they aren't no more complex than math functions such as square root. Computation heavy RNGs are rarely used outside of cryptography. Even implementations of metaheuristic algorthms (and your problem resembles to roulette wheel selection used in genetic algorithms) use quick RNGs over slow but "more random" RNGs. – Emperor Orionii May 29 '12 at 19:43
  • @Nevermind readability is subjective, I'll edit out that part. Actually that whole sentence doesn't make sense :) – Emperor Orionii May 29 '12 at 19:57
  • @Benjamin Kloster The same would happen with any other code that doesn't cache weights. Except, I guess, a method that returns *several* random weighted elements in a single pass - which is perfectly possible. – Nevermind May 30 '12 at 04:28
  • @Nevermind: Just to be clear: I'm not arguing that your algorithm isn't very efficient. But Emperor stated in the answer above that it can't be done with less than O(n) space (although, on second reading, he might refer specifically to his algorithm, not the problem in general). Your algorithm can, but only if you don't cache the weights. All other algorithms given here cache the weights more or less implicitly because they have to build some data structure prior to finding the solution. – Benjamin Kloster May 30 '12 at 11:49
  • @Benjamin Kloster Yes, but any other algorithm still has to calculate weight before "caching", so it doesn't improve efficiency at all. Except for binary search, I guess, when averaged over many picks from the same list... in this case, you have a point (-8 – Nevermind May 31 '12 at 07:04
10

This is a simple Python implementation:

from random import random

def select(container, weights):
    total_weight = float(sum(weights))
    rel_weight = [w / total_weight for w in weights]

    # Probability for each element
    probs = [sum(rel_weight[:i + 1]) for i in range(len(rel_weight))]

    slot = random()
    for (i, element) in enumerate(container):
        if slot <= probs[i]:
            break

    return element

and

population = ['apple','orange','lemon']
weights = [4, 2, 1]

print(select(population, weights))

In genetic algorithms this select procedure is called Fitness proportionate selection or Roulette Wheel Selection since:

  • a proportion of the wheel is assigned to each of the possible selections based on their weight value. This can be achieved by dividing the weight of a selection by the total weight of all the selections, thereby normalizing them to 1.
  • then a random selection is made similar to how the roulette wheel is rotated.

[Image: Roulette wheel selection]

Typical algorithms have O(N) or O(log N) complexity but you can also do O(1) (e.g. Roulette-wheel selection via stochastic acceptance).

answered by manlio
  • Do you know what the original source for this image is? I want to use it for a paper but need to make sure of attribution. – Malcolm MacLeod May 10 '17 at 09:57
  • @MalcolmMacLeod Sorry, it's used in a lot of GA papers / sites but I don't know who is the author. – manlio May 18 '17 at 15:24
  • Python's [random.choices](https://docs.python.org/3/library/random.html#random.choices) function actually supports weights, if somebody's looking for a Python solution now, and so does [numpy.random.choice](https://numpy.org/doc/stable/reference/random/generated/numpy.random.choice.html). – Sharat Apr 01 '21 at 06:56
0

This gist does exactly what you are asking for.

public static Random random = new Random(DateTime.Now.Millisecond);

public int chooseWithChance(params int[] args)
{
    /*
     * This method takes a list of chances and randomly chooses
     * one index, weighted by its chance of being chosen.
     * e.g.
     *   chooseWithChance(1,99) will most probably (99%) return 1
     *   chooseWithChance(99,1) will most probably (99%) return 0
     *   chooseWithChance(0,100) will always return 1.
     *   chooseWithChance(100,0) will always return 0.
     *   chooseWithChance(67,0) will always return 0.
     */
    int argCount = args.Length;
    int sumOfChances = 0;

    for (int i = 0; i < argCount; i++) {
        sumOfChances += args[i];
    }

    double randomDouble = random.NextDouble() * sumOfChances;

    // Walk backwards, peeling weights off the cumulative sum until
    // randomDouble falls outside the remaining range.
    while (sumOfChances > randomDouble)
    {
        sumOfChances -= args[argCount - 1];
        argCount--;
    }

    return argCount;
}

You can use it like this:

string[] fruits = new string[] { "apple", "orange", "lemon" };
int chosenOne = chooseWithChance(98, 1, 1);
Console.WriteLine(fruits[chosenOne]);

The above code will most probably (98%) return 0, which is the index of 'apple' in the given array.

Also, this code tests the method provided above:

Console.WriteLine("Start...");
int flipCount = 100;
int headCount = 0;
int tailsCount = 0;

for (int i=0; i< flipCount; i++) {
    if (chooseWithChance(50,50) == 0)
        headCount++;
    else
        tailsCount++;
}

Console.WriteLine("Head count:"+ headCount);
Console.WriteLine("Tails count:"+ tailsCount);

It gives output something like this:

Start...
Head count:52
Tails count:48
  • 2
    Programmers is [about](http://programmers.stackexchange.com/tour) conceptual questions and answers are expected to explain things. Throwing code dumps instead of explanation is like copying code from IDE to whiteboard: it may look familiar and even sometimes be understandable, but it feels weird... just weird. [Whiteboard doesn't have compiler](http://stackoverflow.com/q/5508110/839601) – gnat Jul 03 '15 at 07:45
  • You are right, I was focused on code so I forgot to tell how it works. I will add an explanation about how it works. – ramazan polat Jul 06 '15 at 06:23