-3

I have 3 IPs and every IP has a weight, and I want to return the IPs according to their weights using the random function. For example, if we have 3 IPs:

  • X with weight 3
  • Y with weight 3
  • and Z with weight 3

I want to return X in 33.3% of cases, Y in 33.3% of cases, and Z in 33.3% of cases, depending on the random function in C.

I have tried this code:

double r = rand() / (double)RAND_MAX;
double denom = 3 + 3 + 3;
if (r < 3 / denom) {
    // choose X
} else if (r < (3 + 3) / denom) {
    // choose Y
} else {
    // choose Z
}

I repeat the function 1000 times and I get: X chosen 495 times, Y 189 times, and Z 316 times. But what I want is X:333, Y:333, Z:334. How can the weighted random be more accurate?

  • 4
    random will always be random..., unless you go with a true round robin – ratchet freak Sep 01 '13 at 15:23
  • 5
    Do you understand about variance in random distributions? What you've described is an example of a multinomial distribution, so the variance in each of your variables will be n * (0.33) * (0.66), where n is the number of trials. Your results will always jitter around the mean values of 0.33. – Charles E. Grant Sep 01 '13 at 16:18
  • For what it's worth, I just ran a variant of that algorithm using FreeBSD's rand() routine, and while it didn't look particularly balanced by eye, it never approached the skew in the results that centosuser posted. – Aidan Cully Sep 01 '13 at 17:25
  • why random and not round-robin (return the 3 IPs in repeated order)? – Philipp Sep 01 '13 at 18:14
  • 2
    -1 for the contradictory title – Dave Hillier Sep 01 '13 at 19:03
  • Why are you converting to double? What platform are you using (i.e. how big is `double`.) I ran your code and I get `x:317 y:354 z:329`, which is very reasonable. What happens when you add `srand(time(NULL))`? – Gort the Robot Sep 01 '13 at 20:43
  • @StevenBurnap some versions of rand() have poor randomness. They're ok, but if you are expecting an even distribution of the low bits, you will be disappointed. See http://boallen.com/random-numbers.html –  Sep 01 '13 at 21:00
  • Yes, I'm wondering if the conversion to double is throwing things off. – Gort the Robot Sep 01 '13 at 21:07
  • 3
    Random means having a ["lack of pattern or predictability in events"](http://en.wikipedia.org/wiki/Randomness). Your definition of random seems a little different. Given a trillion tries you'd expect a random distribution to be approximately 1/3 1/3 1/3, but only approximately. **If with only 1000 tries you got 333, 333 and 334 I would strongly suspect that the random function was broken** – Richard Tingle Sep 02 '13 at 12:11
  • rand() % 3. Random values will never have a perfectly uniform distribution over a finite sample (like 1000). – Jeff-Inventor ChromeOS Aug 14 '14 at 18:20

3 Answers

4

(This part of the answer was written against revision 2 of the question and is still quite applicable to that problem.)

The standard approach to doing this is to populate an array with the appropriate weights:

[x, x, x, x, x, x, y, y, y, y, z, z]
 1  2  3  4  5  6  1  2  3  4  1  2

and then pick one at random.

Please note that random is random. It is very unlikely that you will ever see a perfect distribution.

If you want to guarantee the distribution, you need to (mostly) do away with the randomness.

Instead, shuffle the above array. For example, one shuffling would give you:

[x, y, z, y, x, x, y, x, z, y, x, x]

And then loop over the array again and again returning the current index. It isn't random anymore, but it will guarantee the distribution remains what you are looking for always. This is just a slightly more 'random' version of a round robin distribution.

If this is attempting to load balance something, round robin is a better approach, and you might want to consider an evenly distributed round robin:

[x, y, x, y, x, z]

This has the same distribution as above and tries to keep the picks evenly spaced so as not to saturate any one resource.


As an aside, on the question of rand being poor: you may be dealing with an older standard library. From the rand(3) man page:

The versions of rand() and srand() in the Linux C Library use the
same random number generator as random(3) and srandom(3), so the
lower-order bits should be as random as the higher-order bits.
However, on older rand() implementations, and on current
implementations on different systems, the lower-order bits are much
less random than the higher-order bits.  Do not use this function in
applications intended to be portable when good randomness is needed.
(Use random(3) instead.)

This can be demonstrated using the following C program, which packs the low byte, low 4 bits, and low 2 bits of consecutive rand() calls into output streams. I happen to have a more modern system and don't have access to anything that would demonstrate this low-order weakness, but you can test it on your own system.

#include <stdlib.h>
#include <stdio.h>

int main (int argc, char** argv) {
    int i, r;
    int a8, a4, a2;

    srand(0);
    FILE *fp0xff = fopen("low8.out", "w");
    FILE *fp0x0f = fopen("low4.out", "w");
    FILE *fp0x03 = fopen("low2.out", "w");
    for(i = 0; i < 10000; i++) {
        /* 1 of 4 */
        r = rand();
        a8 = r & 0xff;
        a4 = r & 0x0f;
        a2 = r & 0x03;

        fprintf(fp0xff, "%c",a8);

        /* 2 of 4 */
        r = rand();
        a8 = r & 0xff;
        a4 = (a4 << 4) | (r & 0x0f);
        a2 = (a2 << 2) | (r & 0x03);

        fprintf(fp0xff, "%c",a8);
        fprintf(fp0x0f, "%c",a4);

        /* 3 of 4 */
        r = rand();
        a8 = r & 0xff;
        a4 = r & 0x0f;
        a2 = (a2 << 2) | (r & 0x03);

        fprintf(fp0xff, "%c",a8);

        /* 4 of 4 */
        r = rand();
        a8 = r & 0xff;
        a4 = (a4 << 4) | (r & 0x0f);
        a2 = (a2 << 2) | (r & 0x03);

        fprintf(fp0xff, "%c",a8);
        fprintf(fp0x0f, "%c",a4);
        fprintf(fp0x03, "%c",a2);
    }
    fclose(fp0xff);
    fclose(fp0x0f);
    fclose(fp0x03);
    return 0;
}

The output of this program can then be examined using ent, which runs a number of tests of randomness against a stream of bytes.

Running ent against low2.out on my system produces:

Entropy = 7.982762 bits per byte.

Optimum compression would reduce the size
of this 10000 byte file by 0 percent.

Chi square distribution for 10000 samples is 240.46, and randomly
would exceed this value 73.46 percent of the times.

Arithmetic mean value of data bytes is 127.6318 (127.5 = random).
Monte Carlo value for Pi is 3.150060024 (error 0.27 percent).
Serial correlation coefficient is -0.002842 (totally uncorrelated = 0.0).

That is, frankly, a reasonably good random stream.

The reason I am testing the low two bits is that you are doing things with % 9, which is sensitive to the quality of the low-order bits. You may find that your random number generator is old and that you need to implement your own (if that degree of randomness is something you need).

One approach would be, well, to implement your own. The Mersenne twister is quite well regarded and you can find implementations for a wide range of languages quite easily.

The other thing to do would be to shift the random number you get right by 8 bits, discarding any low-order weakness, and use the remaining bits.

You can get an idea of the poor quality of rand in some generators from Random.org.

For example, rand() called by PHP on Windows produces the following:

(image: a bitmap of successive rand() values, from Random.org, showing obvious visual patterns)

which is quite certainly not good randomness.


There's a bit more to this: the math can produce an uneven distribution even when the generator itself is uniform.

The code:

double r = rand() / (double)RAND_MAX;

rand() returns a number in the range [0, RAND_MAX], which is [0, 32767] on many implementations. This range isn't evenly divisible by 9 (or 3), so some values will be oversampled. Assume we are dealing with % 100: the values in [0, 67] will be produced slightly more often than the values in [68, 99]. This introduces a bias into the random numbers, which is problematic.

But the code above isn't using % n; it uses floating point instead. Now you have 2.0000001 problems.

You've got code that is approximately:

int dst = (rand() * 1.0 / RAND_MAX) * 99;

This can be described as hilariously non-uniform. There is only one input that can produce the maximum value because of truncation.

(32765 * 1.0 / 32767) * 99 = 98
(32766 * 1.0 / 32767) * 99 = 98
(32767 * 1.0 / 32767) * 99 = 99

No matter how you try to fix this with rand, you will run into the pigeonhole principle, which states that if you try to put n items into m containers where n > m, at least one container will hold more than one item. This seems obvious, but it's exactly what you run into when mapping rand into another range.

If you really only want a range of [0, 2], there are appropriate tools that can give you the uniform distribution you are after.

The following code (from the video linked below) uses the Mersenne twister to produce a uniform distribution:

#include <iostream>
#include <random>

int main() {
    std::random_device rd;
    std::mt19937 mt(rd());
    std::uniform_int_distribution<int> dist(0, 99);
    for (int i = 0; i < 16; ++i) {
        std::cout << dist(mt) << " ";
    }
    std::cout << std::endl;
}

Note that this is using a deterministic engine with a random seed. It is not cryptographically secure, but it does produce high-quality random data quickly. The key is that you are using the distribution correctly. You could use random_device for all your data, but that consumes entropy from your system that you probably don't need to spend there (see running out of entropy on ServerFault). The twister is good enough for random load balancing.

A really good watch on this is the video rand() Considered Harmful, which addresses many issues with rand. I've borrowed much of this section from that video.

Particular points from the video:

  • Non-uniform distribution (3:56 .. 5:50 )
  • (src * 1.0 / RAND_MAX) * 99 is hilariously non-uniform (6:12 .. 7:29)
  • 2.0000001 problems (10:40)
  • Hello, Random World (12:58 .. 15:30)

If you really want things to be perfectly balanced ((333, 333, 334) events for each of three choices 1000 times), this isn't random. Use a round robin instead.

  • round robin is good when the weights are equal, but when the weights are different there isn't any better solution than using random; the problem is that it is not accurate –  Sep 01 '13 at 15:57
  • 2
    @centosuser, you haven't really understood MichaelT's solution. The example he gives doesn't have equal weights. He's using the weights listed in your original post before your edit. – Charles E. Grant Sep 01 '13 at 16:26
  • 1
    @centosuser using any weights, you can shuffle the weighted array and loop over that. It doesn't matter what the weights are. It is also possible, that you've got a poor random number generator, and I went into that area with this latest edit. –  Sep 01 '13 at 16:42
  • @CharlesE.Grant Its also possible, that he's dealing with a poor random number generator. There are still quite a few of them out there. Combined with looking effectively at `%9` he's often looking at the low bits which are often less than ideal. There are some approaches to addressing this (like shifting the low bits off or using a different random function). –  Sep 01 '13 at 16:44
0

I would usually do something like this:

int r = rand();
if (r % 3 == 0) {
    // choose X
} else if (r % 3 == 1) {
    // choose Y
} else {
    // choose Z
}

It's probably not perfect (I think RAND_MAX might make a Y slightly more likely than Z, or something like that), but if the random number generator works properly, you'd expect to see a good distribution of X, Y and Z. Some deviation from a perfectly uniform distribution is still expected at the sample size you're using now.

Jasmijn
0

If you want to truly guarantee that you'll get exactly the distribution you want, you can create a pool of values that you pull from:

#include <cstdlib>
#include <ctime>
#include <iostream>

int main() {
    std::srand(static_cast<unsigned>(std::time(nullptr)));

    int xpool = 333;
    int ypool = 333;
    int zpool = 334;

    int x = 0;
    int y = 0;
    int z = 0;

    for (unsigned int trial = 0; trial < 1000; trial++) {

        int r = std::rand() % (xpool + ypool + zpool);

        if (r < xpool) {
            x++;
            xpool--;
        }
        else if (r < xpool + ypool) {
            y++;
            ypool--;
        }
        else {
            z++;
            zpool--;
        }
    }
    std::cout << "x: " << x << " y:" << y << " z:" << z << std::endl;
    return 0;
}

This will guarantee that you always have an exact distribution, but the order will be as random as rand() can make it. This works well if the number of choices is low, and you know how many trials you will have ahead of time.

It depends on what you want. If you want a good weighted distribution over infinitely many trials, use @MichaelT's answer. If you want to guarantee an exact distribution and know the number of trials, then do something like this.

Gort the Robot