Archive for the ‘random numbers’ Category
In a recent blog post, Daniel Lemire explicitly demonstrated that vectorising random number generators using SIMD instructions could give useful speed-ups. This reminded me of the one of the first times I played with the Julia language where I learned that Julia’s random number generator used a SIMD-accelerated implementation of Mersenne Twister called dSFMT to generate random numbers much faster than MATLAB’s Mersenne Twister implementation.
Just recently, I learned that MATLAB now has its own SIMD accelerated version of Mersenne Twister which can be activated like so:
seed=1; rng(seed,'simdTwister')
This new Mersenne Twister implementation gives different random variates to the original implementation (which I demonstrated is the same as Numpy’s implementation in an older post) as you might expect
>> rng(1,'Twister') >> rand(1,5) ans = 0.4170 0.7203 0.0001 0.3023 0.1468 >> rng(1,'simdTwister') >> rand(1,5) ans = 0.1194 0.9124 0.5032 0.8713 0.5324
So it’s clearly a different algorithm and, on CPUs that support the relevant instructions, it’s about twice as fast! Using my very unscientific test code:
format compact
number_of_randoms = 10000000
disp('Timing standard twister')
rng(1,'Twister')
tic;rand(1,number_of_randoms);toc
disp('Timing SIMD twister')
rng(1,'simdTwister')
tic;rand(1,number_of_randoms);toc
gives the following results for a typical run on my Dell XPS 15 9560 which supports AVX instructions
number_of_randoms =
    10000000
Timing standard twister
Elapsed time is 0.101307 seconds.
Timing SIMD twister
Elapsed time is 0.057441 seconds
The MATLAB documentation does not tell us which algorithm their implementation is based on but it seems to be different from Julia’s. In Julia, if we set the seed to 1 as we did for MATLAB and ask for 5 random numbers, we get something different from MATLAB:
julia> using Random
julia> Random.seed!(1);
julia> rand(1,5)
1×5 Array{Float64,2}:
 0.236033  0.346517  0.312707  0.00790928  0.48861
The performance of MATLAB’s new generator is on-par with Julia’s although I’ll repeat that these timing tests are far far from rigorous.
julia> Random.seed!(1); julia> @time rand(1,10000000); 0.052981 seconds (6 allocations: 76.294 MiB, 29.40% gc time)
Something that became clear from my recent comparison of Numpy’s Mersenne Twister implementation with MATLAB’s is that there is something funky going on with seed 0 in MATLAB. A discussion in the comments thread helped uncover what was going on. In short, seed 0 gives exactly the same random numbers as seed 5489 in MATLAB (unless you use their deprecated rand(‘twister’,0) syntax).
This is a potential problem for anyone who performs lots of simulations that make use of random numbers such as monte-carlo simulations. One common work-flow is to run the same program hundreds of times where only the seed differs between runs. This is probably good enough to ensure that each simulation uses a random number stream that is statistically independent from all of the others — There is a risk that some streams will overlap but the probability is low and most people are content to live with that risk.
The practical upshot of this is that if you intend on sticking with Mersenne Twister for your MATLAB monte-carlo simulations, it might be wise to avoid seed 0. Alternatively, move to a random number generator that guarantees non-overlapping, independent streams – something that any implementation of Mersenne Twister cannot do.
Here’s a demo run in MATLAB 2014a on Windows 7.
>> format long >> rng(0) >> rand(1,5)' ans = 0.814723686393179 0.905791937075619 0.126986816293506 0.913375856139019 0.632359246225410 >> rng(5489) >> rand(1,5)' ans = 0.814723686393179 0.905791937075619 0.126986816293506 0.913375856139019 0.632359246225410
When porting code between MATLAB and Python, it is sometimes useful to produce the exact same set of random numbers for testing purposes. Both Python and MATLAB currently use the Mersenne Twister generator by default so one assumes this should be easy…and it is…provided you use the generator in Numpy and avoid the seed 0.
Generate some random numbers in MATLAB
Here, we generate the first 5 numbers for 3 different seeds in MATLAB. Our aim is to reproduce these in Python.
>> format long >> rng(0) >> rand(1,5)' ans = 0.814723686393179 0.905791937075619 0.126986816293506 0.913375856139019 0.632359246225410 >> rng(1) >> rand(1,5)' ans = 0.417022004702574 0.720324493442158 0.000114374817345 0.302332572631840 0.146755890817113 >> rng(2) >> rand(1,5)' ans = 0.435994902142004 0.025926231827891 0.549662477878709 0.435322392618277 0.420367802087489
Python’s default random module
According to the documentation,Python’s random module uses the Mersenne Twister algorithm but the implementation seems to be different from MATLAB’s since the results are different. Here’s the output from a fresh ipython session:
In [1]: import random In [2]: random.seed(0) In [3]: [random.random() for _ in range(5)] Out[3]: [0.8444218515250481, 0.7579544029403025, 0.420571580830845, 0.25891675029296335, 0.5112747213686085] In [4]: random.seed(1) In [5]: [random.random() for _ in range(5)] Out[5]: [0.13436424411240122, 0.8474337369372327, 0.763774618976614, 0.2550690257394217, 0.49543508709194095] In [6]: random.seed(2) In [7]: [random.random() for _ in range(5)] Out[7]: [0.9560342718892494, 0.9478274870593494, 0.05655136772680869, 0.08487199515892163, 0.8354988781294496]
The Numpy random module
Numpy’s random module, on the other hand, seems to use an identical implementation to MATLAB for seeds other than 0. In the below, notice that for seeds 1 and 2, the results are identical to MATLAB’s. For a seed of zero, they are different.
In [1]: import numpy as np
In [2]: np.set_printoptions(suppress=True)
In [3]: np.set_printoptions(precision=15)
In [4]: np.random.seed(0)
In [5]: np.random.random((5,1))
Out[5]: 
array([[ 0.548813503927325],
       [ 0.715189366372419],
       [ 0.602763376071644],
       [ 0.544883182996897],
       [ 0.423654799338905]])
In [6]: np.random.seed(1)
In [7]: np.random.random((5,1))
Out[7]: 
array([[ 0.417022004702574],
       [ 0.720324493442158],
       [ 0.000114374817345],
       [ 0.30233257263184 ],
       [ 0.146755890817113]])
In [8]: np.random.seed(2)
In [9]: np.random.random((5,1))
Out[9]: 
array([[ 0.435994902142004],
       [ 0.025926231827891],
       [ 0.549662477878709],
       [ 0.435322392618277],
       [ 0.420367802087489]])
Checking a lot more seeds
Although the above interactive experiments look convincing, I wanted to check a few more seeds. All seeds from 0 to 1 million would be a good start so I wrote a MATLAB script that generated 10 random numbers for each seed from 0 to 1 million and saved the results as a .mat file.
A subsequent Python script loads the .mat file and ensures that numpy generates the same set of numbers for each seed. It outputs every seed for which Python and MATLAB differ.
On my mac, I opened a bash prompt and ran the two scripts as follows
matlab -nodisplay -nodesktop -r "generate_matlab_randoms" python python_randoms.py
The output was
MATLAB file contains 1000001 seeds and 10 samples per seed Random numbers for seed 0 differ between MATLAB and Numpy
System details
- Late 2013 Macbook Air
- MATLAB 2014a
- Python 2.7.7
- Numpy 1.8.1
I occasionally get emails from researchers saying something like this
‘My MATLAB code takes a week to run and the cleaner/cat/my husband keeps switching off my machine before it’s completed — could you help me make the code go faster please so that I can get my results in between these events’
While I am more than happy to try to optimise the code in question, what these users really need is some sort of checkpointing scheme. Checkpointing is also important for users of high performance computing systems that limit the length of each individual job.
The solution – Checkpointing (or ‘Assume that your job will frequently be killed’)
The basic idea behind checkpointing is to periodically save your program’s state so that, if it is interrupted, it can start again where it left off rather than from the beginning. In order to demonstrate some of the principals involved, I’m going to need some code that’s sufficiently simple that it doesn’t cloud what I want to discuss. Let’s add up some numbers using a for-loop.
%addup.m
%This is not the recommended way to sum integers in MATLAB -- we only use it here to keep things simple
%This version does NOT use checkpointing
mysum=0;
for count=1:100
    mysum = mysum + count;
    pause(1);           %Let's pretend that this is a complicated calculation
    fprintf('Completed iteration %d \n',count);
end
fprintf('The sum is %f \n',mysum);
Using a for-loop to perform an addition like this is something that I’d never usually suggest in MATLAB but I’m using it here because it is so simple that it won’t get in the way of understanding the checkpointing code.
If you run this program in MATLAB, it will take about 100 seconds thanks to that pause statement which is acting as a proxy for some real work. Try interrupting it by pressing CTRL-C and then restart it. As you might expect, it will always start from the beginning:
>> addup Completed iteration 1 Completed iteration 2 Completed iteration 3 Operation terminated by user during addup (line 6) >> addup Completed iteration 1 Completed iteration 2 Completed iteration 3 Operation terminated by user during addup (line 6)
This is no big deal when your calculation only takes 100 seconds but is going to be a major problem when the calculation represented by that pause statement becomes something like an hour rather than a second.
Let’s now look at a version of the above that makes use of checkpointing.
%addup_checkpoint.m
if exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it
    fprintf('Checkpoint file found - Loading\n');
    load('checkpoint.mat')
else %otherwise, start from the beginning
    fprintf('No checkpoint file found - starting from beginning\n');
    mysum=0;
    countmin=1;
end
for count = countmin:100
    mysum = mysum + count;
    pause(1);           %Let's pretend that this is a complicated calculation
    %save checkpoint
    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration
    fprintf('Saving checkpoint\n');
    save('checkpoint.mat');
    fprintf('Completed iteration %d \n',count);
end
fprintf('The sum is %f \n',mysum);
Before you run the above code, the checkpoint file checkpoint.mat does not exist and so the calculation starts from the beginning. After every iteration, a checkpoint file is created which contains every variable in the MATLAB workspace. If the program is restarted, it will find the checkpoint file and continue where it left off. Our code now deals with interruptions a lot more gracefully.
>> addup_checkpoint No checkpoint file found - starting from beginning Saving checkpoint Completed iteration 1 Saving checkpoint Completed iteration 2 Saving checkpoint Completed iteration 3 Operation terminated by user during addup_checkpoint (line 16) >> addup_checkpoint Checkpoint file found - Loading Saving checkpoint Completed iteration 4 Saving checkpoint Completed iteration 5 Saving checkpoint Completed iteration 6 Operation terminated by user during addup_checkpoint (line 16)
Note that we’ve had to change the program logic slightly. Our original loop counter was
for count = 1:100
In the check-pointed example, however, we’ve had to introduce the variable countmin
for count = countmin:100
This allows us to start the loop from whatever value of countmin was in our last checkpoint file. Such minor modifications are often necessary when converting code to use checkpointing and you should carefully check that the introduction of checkpointing does not introduce bugs in your code.
Don’t checkpoint too often
The creation of even a small checkpoint file is a time consuming process. Consider our original addup code but without the pause command.
%addup_nopause.m
%This version does NOT use checkpointing
mysum=0;
for count=1:100
    mysum = mysum + count;
    fprintf('Completed iteration %d \n',count);
end
fprintf('The sum is %f \n',mysum);
On my machine, this code takes 0.0046 seconds to execute. Compare this to the checkpointed version, again with the pause statement removed.
%addup_checkpoint_nopause.m
if exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it
    fprintf('Checkpoint file found - Loading\n');
    load('checkpoint.mat')
else %otherwise, start from the beginning
    fprintf('No checkpoint file found - starting from beginning\n');
    mysum=0;
    countmin=1;
end
for count = countmin:100
    mysum = mysum + count;
    %save checkpoint
    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration
    fprintf('Saving checkpoint\n');
    save('checkpoint.mat');
    fprintf('Completed iteration %d \n',count);
end
fprintf('The sum is %f \n',mysum);
This checkpointed version takes 0.85 seconds to execute on the same machine — Over 180 times slower than the original! The problem is that the time it takes to checkpoint is long compared to the calculation time.
If we make a modification so that we only checkpoint every 25 iterations, code execution time comes down to 0.05 seconds:
%Checkpoint every 25 iterations
if exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it
    fprintf('Checkpoint file found - Loading\n');
    load('checkpoint.mat')
else %otherwise, start from the beginning
    fprintf('No checkpoint file found - starting from beginning\n');
    mysum=0;
    countmin=1;
end
for count = countmin:100
    mysum = mysum + count;
    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration
    if mod(count,25)==0
        %save checkpoint   
        fprintf('Saving checkpoint\n');
        save('checkpoint.mat');
    end
    fprintf('Completed iteration %d \n',count);
end
fprintf('The sum is %f \n',mysum);
Of course, the issue now is that we might lose more work if our program is interrupted between checkpoints. Additionally, in this particular case, the mod command used to decide whether or not to checkpoint is more expensive than simply performing the calculation but hopefully that isn’t going to be the case when working with real world calculations.
In practice, we have to work out a balance such that we checkpoint often enough so that we don’t stand to lose too much work but not so often that our program runs too slowly.
Checkpointing code that involves random numbers
Extra care needs to be taken when running code that involves random numbers. Consider a modification of our checkpointed adding program that creates a sum of random numbers.
%addup_checkpoint_rand.m
%Adding random numbers the slow way, in order to demo checkpointing
%This version has a bug
if exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it
    fprintf('Checkpoint file found - Loading\n');
    load('checkpoint.mat')
else %otherwise, start from the beginning
    fprintf('No checkpoint file found - starting from beginning\n');
    mysum=0;
    countmin=1;
    rng(0);     %Seed the random number generator for reproducible results
end
for count = countmin:100
    mysum = mysum + rand();
    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration
    pause(1); %pretend this is a complicated calculation
    %save checkpoint
    fprintf('Saving checkpoint\n');
    save('checkpoint.mat');
    fprintf('Completed iteration %d \n',count);
end
fprintf('The sum is %f \n',mysum);
In the above, we set the seed of the random number generator to 0 at the beginning of the calculation. This ensures that we always get the same set of random numbers and allows us to get reproducible results. As such, the sum should always come out to be 52.799447 to the number of decimal places used in the program.
The above code has a subtle bug that you won’t find if your testing is confined to interrupting using CTRL-C and then restarting in an interactive session of MATLAB. Proceed that way, and you’ll get exactly the sum you’ll expect : 52.799447. If, on the other hand, you test your code by doing the following
- Run for a few iterations
- Interrupt with CTRL-C
- Restart MATLAB
- Run the code again, ensuring that it starts from the checkpoint
You’ll get a different result. This is not what we want!
The root cause of this problem is that we are not saving the state of the random number generator in our checkpoint file. Thus, when we restart MATLAB, all information concerning this state is lost. If we don’t restart MATLAB between interruptions, the state of the random number generator is safely tucked away behind the scenes.
Assume, for example, that you stop the calculation running after the third iteration. The random numbers you’d have consumed would be (to 4 d.p.)
0.8147
0.9058
0.1270
Your checkpoint file will contain the variables mysum, count and countmin but will contain nothing about the state of the random number generator. In English, this state is something like ‘The next random number will be the 4th one in the sequence defined by a starting seed of 0.’
When we restart MATLAB, the default seed is 0 so we’ll be using the right sequence (since we explicitly set it to be 0 in our code) but we’ll be starting right from the beginning again. That is, the 4th,5th and 6th iterations of the summation will contain the first 3 numbers in the stream, thus double counting them, and so our checkpointing procedure will alter the results of the calculation.
In order to fix this, we need to additionally save the state of the random number generator when we save a checkpoint and also make correct use of this on restarting. Here’s the code
%addup_checkpoint_rand_correct.m
%Adding random numbers the slow way, in order to demo checkpointing
if exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it
    fprintf('Checkpoint file found - Loading\n');
    load('checkpoint.mat')
    %use the saved RNG state
    stream = RandStream.getGlobalStream;
    stream.State = savedState;
else % otherwise, start from the beginning
    fprintf('No checkpoint file found - starting from beginning\n');
    mysum=0;
    countmin=1;
    rng(0);     %Seed the random number generator for reproducible results
end
for count = countmin:100
    mysum = mysum + rand();
    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration
    pause(1); %pretend this is a complicated calculation
    %save the state of the random number genertor
    stream = RandStream.getGlobalStream;
    savedState = stream.State;
    %save checkpoint
    fprintf('Saving checkpoint\n');
    save('checkpoint.mat');
    fprintf('Completed iteration %d \n',count);
end
fprintf('The sum is %f \n',mysum);
Ensuring that the checkpoint save completes
Events that terminate our code can occur extremely quickly — a powercut for example. There is a risk that the machine was switched off while our check-point file was being written. How can we ensure that the file is complete?
The solution, which I found on the MATLAB checkpointing page of the Liverpool University Condor Pool site is to first write a temporary file and then rename it. That is, instead of
save('checkpoint.mat')/pre>
we do
if strcmp(computer,'PCWIN64') || strcmp(computer,'PCWIN')
            %We are running on a windows machine
            system( 'move /y checkpoint_tmp.mat checkpoint.mat' );
else
            %We are running on Linux or Mac
            system( 'mv checkpoint_tmp.mat checkpoint.mat' );
end
As the author of that page explains ‘The operating system should guarantee that the move command is “atomic” (in the sense that it is indivisible i.e. it succeeds completely or not at all) so that there is no danger of receiving a corrupt “half-written” checkpoint file from the job.’
Only checkpoint what is necessary
So far, we’ve been saving the entire MATLAB workspace in our checkpoint files and this hasn’t been a problem since our workspace hasn’t contained much. In general, however, the workspace might contain all manner of intermediate variables that we simply don’t need in order to restart where we left off. Saving the stuff that we might not need can be expensive.
For the sake of illustration, let’s skip 100 million random numbers before adding one to our sum. For reasons only known to ourselves, we store these numbers in an intermediate variable which we never do anything with. This array isn’t particularly large at 763 Megabytes but its existence slows down our checkpointing somewhat. The correct result of this variation of the calculation is 41.251376 if we set the starting seed to 0; something we can use to test our new checkpoint strategy.
Here’s the code
% A demo of how slow checkpointing can be if you include large intermediate variables
if exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it
    fprintf('Checkpoint file found - Loading\n');
    load('checkpoint.mat')
    %use the saved RNG state
    stream = RandStream.getGlobalStream;
    stream.State = savedState;
else %otherwise, start from the beginning
    fprintf('No checkpoint file found - starting from beginning\n');
    mysum=0;
    countmin=1;
    rng(0);     %Seed the random number generator for reproducible results
end
for count = countmin:100
    %Create and store 100 million random numbers for no particular reason
    randoms = rand(10000);
    mysum = mysum + rand();
    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration
    fprintf('Completed iteration %d \n',count);
    
    if mod(count,25)==0
        %save the state of the random number generator
        stream = RandStream.getGlobalStream;
        savedState = stream.State;
        %save and time checkpoint
        tic
        save('checkpoint_tmp.mat');
        if strcmp(computer,'PCWIN64') || strcmp(computer,'PCWIN')
            %We are running on a windows machine
            system( 'move /y checkpoint_tmp.mat checkpoint.mat' );
        else
            %We are running on Linux or Mac
            system( 'mv checkpoint_tmp.mat checkpoint.mat' );
        end
        timing = toc;
        fprintf('Checkpoint save took %f seconds\n',timing);
    end
    
end
fprintf('The sum is %f \n',mysum);
On my Windows 7 Desktop, each checkpoint save takes around 17 seconds:
Completed iteration 25 
        1 file(s) moved. 
Checkpoint save took 17.269897 seconds
It is not necessary to include that huge random matrix in a checkpoint file. If we are explicit in what we require, we can reduce the time taken to checkpoint significantly. Here, we change
save('checkpoint_tmp.mat');
to
save('checkpoint_tmp.mat','mysum','countmin','savedState');
This has a dramatic effect on check-pointing time:
Completed iteration 25 
        1 file(s) moved. 
Checkpoint save took 0.033576 seconds
Here’s the final piece of code that uses everything discussed in this article
%Final checkpointing demo
if exist( 'checkpoint.mat','file' ) % If a checkpoint file exists, load it
    fprintf('Checkpoint file found - Loading\n');
    load('checkpoint.mat')
    %use the saved RNG state
    stream = RandStream.getGlobalStream;
    stream.State = savedState;
else %otherwise, start from the beginning
    fprintf('No checkpoint file found - starting from beginning\n');
    mysum=0;
    countmin=1;
    rng(0);     %Seed the random number generator for reproducible results
end
for count = countmin:100
    %Create and store 100 million random numbers for no particular reason
    randoms = rand(10000);
    mysum = mysum + rand();
    countmin = count+1;  %If we load this checkpoint, we want to start on the next iteration
    fprintf('Completed iteration %d \n',count);
    
    if mod(count,25)==0 %checkpoint every 25th iteration
        %save the state of the random number generator
        stream = RandStream.getGlobalStream;
        savedState = stream.State;
        %save and time checkpoint
        tic
        %only save the variables that are strictly necessary
        save('checkpoint_tmp.mat','mysum','countmin','savedState');
        %Ensure that the save completed
        if strcmp(computer,'PCWIN64') || strcmp(computer,'PCWIN')
            %We are running on a windows machine
            system( 'move /y checkpoint_tmp.mat checkpoint.mat' );
        else
            %We are running on Linux or Mac
            system( 'mv checkpoint_tmp.mat checkpoint.mat' );
        end
        timing = toc;
        fprintf('Checkpoint save took %f seconds\n',timing);
    end
    
end
fprintf('The sum is %f \n',mysum);
Parallel checkpointing
If your code includes parallel regions using constructs such as parfor or spmd, you might have to do more work to checkpoint correctly. I haven’t considered any of the potential issues that may arise in such code in this article
Checkpointing checklist
Here’s a reminder of everything you need to consider
- Test to ensure that the introduction of checkpointing doesn’t alter results
- Don’t checkpoint too often
- Take care when checkpointing code that involves random numbers – you need to explicitly save the state of the random number generator.
- Take measures to ensure that the checkpoint save is completed
- Only checkpoint what is necessary
- Code that includes parallel regions might require extra care
Ever since I took a look at GPU accelerating simple Monte Carlo Simulations using MATLAB, I’ve been disappointed with the performance of its GPU random number generator. In MATLAB 2012a, for example, it’s not much faster than the CPU implementation on my GPU hardware. Consider the following code
function gpuRandTest2012a(n)
mydev=gpuDevice();
disp('CPU - Mersenne Twister');
tic
CPU = rand(n);
toc
sg = parallel.gpu.RandStream('mrg32k3a','Seed',1);
parallel.gpu.RandStream.setGlobalStream(sg);
disp('GPU - mrg32k3a');
tic
Rg = parallel.gpu.GPUArray.rand(n);
wait(mydev);
toc
Running this on MATLAB 2012a on my laptop gives me the following typical times (If you try this out yourself, the first run will always be slower for various reasons I’ll not go into here)
>> gpuRandTest2012a(10000) CPU - Mersenne Twister Elapsed time is 1.330505 seconds. GPU - mrg32k3a Elapsed time is 1.006842 seconds.
Running the same code on MATLAB 2012b, however, gives a very pleasant surprise with typical run times looking like this
CPU - Mersenne Twister Elapsed time is 1.590764 seconds. GPU - mrg32k3a Elapsed time is 0.185686 seconds.
So, generation of random numbers using the GPU is now over 7 times faster than CPU generation on my laptop hardware–a significant improvment on the previous implementation.
New generators in 2012b
The MATLAB developers went a little further in 2012b though. Not only have they significantly improved performance of the mrg32k3a combined multiple recursive generator, they have also implemented two new GPU random number generators based on the Random123 library. Here are the timings for the generation of 100 million random numbers in MATLAB 2012b
- Get the code – gpuRandTest2012b.m
CPU - Mersenne Twister Elapsed time is 1.370252 seconds. GPU - mrg32k3a Elapsed time is 0.186152 seconds. GPU - Threefry4x64-20 Elapsed time is 0.145144 seconds. GPU - Philox4x32-10 Elapsed time is 0.129030 seconds.
Bear in mind that I am running this on the relatively weak GPU of my laptop! If anyone runs it on something stronger, I’d love to hear of your results.
- Laptop model: Dell XPS L702X
- CPU: Intel Core i7-2630QM @2Ghz software overclockable to 2.9Ghz. 4 physical cores but total 8 virtual cores due to Hyperthreading.
- GPU: GeForce GT 555M with 144 CUDA Cores. Graphics clock: 590Mhz. Processor Clock:1180 Mhz. 3072 Mb DDR3 Memeory
- RAM: 8 Gb
- OS: Windows 7 Home Premium 64 bit.
- MATLAB: 2012a/2012b
In a recent article, Matt Asher considered the feasibility of doing statistical computations in JavaScript. In particular, he showed that the generation of 10 million normal variates can be as fast in Javascript as it is in R provided you use Google’s Chrome for the web browser. From this, one might infer that using javascript to do your Monte Carlo simulations could be a good idea.
It is worth bearing in mind, however, that we are not comparing like for like here.
The default random number generator for R uses the Mersenne Twister algorithm which is of very high quality, has a huge period and is well suited for Monte Carlo simulations. It is also the default algorithm for modern versions of MATLAB and is available in many other high quality mathematical products such as Mathematica, The NAG library, Julia and Numpy.
The algorithm used for Javascript’s math.random() function depends upon your web-browser. A little googling uncovered a document that gives details on some implementations. According to this document, Internet Explorer and Firefox both use 48 bit Linear Congruential Generator (LCG)-style generators but use different methods to set the seed. Safari on Mac OS X uses a 31 bit LCG generator and Version 8 of Chrome on Windows uses 2 calls to rand() in msvcrt.dll. So, for V8 Chrome on Windows, Math.random() is a floating point number consisting of the second rand() value, concatenated with the first rand() value, divided by 2^30.
The points I want to make here are:-
- Javascript’s math.random() uses different algorithms between browsers.
- These algorithms have relatively small periods. For example, a 48-bit LCG has a period of 2^48 compared to 2^19937-1 for Mersenne Twister.
- They have poor statistical properties. For example, the 48bit LCG implemented in Java’s java.util.Random function fails 21 of the BigCrush tests. I haven’t found any test results for JavaScript implementations but expect them to be at least as bad. I understand that Mersenne Twister fails 2 of the BigCrush tests but these are not considered to be an issue by many people.
- You can’t manually set the seed for math.random() so reproducibility is impossible.
Pop quiz: What does the following line of MATLAB code do?
rand('state',10)
If you said ‘It changes the seed of the random number generator to 10’ you get half a point.
‘Only half a point!?’ I hear you say accusingly ‘but it says so in my book [for example, 1-3], why not a full point?’
You only get a full point if you’d said something like ‘It changes the seed of the random number generator to 10 and it also changes the random number generator from the high quality, default Mersenne Twister generator to a lower quality legacy random number generator.‘
OK, how about this one?
rand('seed',10)
This behaves in a very similar manner– it changes both the seed and the type of the underlying generator. However, the random number generator it switches to this time is an even older one that was introduced as far back as MATLAB version 4. It is not very good at all by modern standards!
A closer look
Open up a fresh copy of a recent version of MATLAB and ask it about the random number generator it’s using
>> RandStream.getGlobalStream
ans =
mt19937ar random stream (current global stream)
             Seed: 0
  NormalTransform: Ziggurat
mt1993ar refers to a particular variant of the Mersenne Twister algorithm— an industry strength random number generator that’s used in many software packages and simulations. It’s been the default generator in MATLAB since 2007a. Change the seed using the modern (since 2011a), recommended syntax and ask again:
>> rng(10)
>> RandStream.getGlobalStream
ans =
mt19937ar random stream (current global stream)
             Seed: 10
  NormalTransform: Ziggurat
This is behaving exactly as you’d expect, you ask it to change the seed and it changes the seed…nothing more, nothing less. Now, let’s use the older syntax
>> rand('state',10)
>> RandStream.getGlobalStream
ans =
legacy random stream (current global stream)
  RAND algorithm: V5 (Subtract-with-Borrow), RANDN algorithm: V5 (Ziggurat)
The random number generator has completely changed! We are no longer using the Mersenne Twister algorithm, we are now using a ‘subtract with borrow’ [see reference 4 for implementation details] generator which has been shown to have several undesirable issues [5-7].
Let’s do it again but this time using the even older ‘seed’ version:
>> rand('seed',10)
>> RandStream.getGlobalStream
ans =
legacy random stream (current global stream)
  RAND algorithm: V4 (Congruential), RANDN algorithm: V5 (Ziggurat)
Now, this random number generator is ancient by computing standards. It also has a relatively tiny period of only 2 billion or so. For details see [4]
Why this matters
Now, all of this is well documented so you may wonder why I am making such a big issue out of it. Here are my reasons
- I often get sent MATLAB code for the purposes of code-review and optimisation. I see the old seeding syntax a LOT and the program’s authors are often blissfully unaware of the consequnces.
- The old syntax looks like all it should do is change the seed. It doesn’t! Before 2007a, however, it did!
- The old syntax is written in dozens of books because it was once the default, correct syntax to use.
- Many users don’t read the relevent section of the MATLAB documentation because they have no idea that there is a potential issue. They read a book or tutorial..it says to use rand(‘state’,10) so they do.
- MATLAB doesn’t use the old generators by default any more because they are not very good [4-7]!
- Using these old generators may adversely affect the quality of your simulation.
The bottom line
Don’t do either of these to change the seed of the default generator to 10:
rand('state',10)
rand('seed',10)
Do this instead:
rng(10)
Only if you completely understand and accept the consequences of the older syntax should you use it.
References
1. ‘MATLAB – A practical introduction to programming and problem solving’, 2009,Stormy Attaway
2. MATLAB Guide (Second Edition), 2005, Desmond Higham and Nicholas Higham
3. Essential MATLAB for Engineers and Scientists (Fourth Edition), 2009, Hahn and Valentine
4. Numerical Computing with MATLAB, 2004, Cleve Moler (available online)
5. Why does the random number generator in MATLAB fail a particular test of randomness? The Mathworks, retreived 26th September 2012
6. A strong nonrandom pattern in Matlab default random number generator, 2006, Petr Savicky, retreived 26th September 2012
7. Learning Random Numbers: A Matlab Anomaly, 2008, Petr Savicky and Marko Robnik-Šikonja, Applied Artificial Intelligence, Vol22 issue 3, pp 254-265
Other posts on random numbers in MATLAB
- Parallel Random Numbers in MATLAB #1 – An introduction to random numbers and seeding in MATLAB
- Probability of overlapping subsequences (How likely is it to get overlapping random sequences using two different seeds for the Mersenne Twister algorithm)
A bit of background to this post
I work in the IT department of the University of Manchester and we are currently developing a Condor Pool which is basically a method of linking together hundreds of desktop machines to produce a high-throughput computing resource. A MATLAB user recently submitted some jobs to our pool and complained that all of them gave identical results which is stupid because his code used MATLAB’s rand command to mix things up a bit.
I was asked if I knew why this should happen to which I replied ‘yes.’ I was then asked to advise the user how to fix the problem and I did so. The next request was for me to write some recommendations and tutorials on how users should use random numbers in MATLAB (and Mathematica and possibly Python while I was at it) along with our Condor Pool and I went uncharacteristically quiet for a while.
It turned out that I had a lot to learn about random numbers. This is the first of a series of (probably 2) posts that will start off by telling you what I knew and move on to what I have learned. It’s as much a vehicle for getting the concepts straight in my head as it is a tutorial.
Ask MATLAB for 10 Random Numbers
Before we go on, I’d like you to try something for me. You have to start on a system that doesn’t have MATLAB running at all so if MATLAB is running then close it before proceeding. Now, open up MATLAB and before you do anything else issue the following command
rand(10,1)
As many of you will know, the rand command produces random numbers from the uniform distribution between 0 and 1 and the command above is asking for 10 such numbers. You may reasonably expect that the 10 random numbers that you get will be different from the 10 random numbers that I get; after all, they are random right? Well, I got the following numbers when running the above command on MATLAB 2009b running on Linux.
ans = 0.8147 0.9058 0.1270 0.9134 0.6324 0.0975 0.2785 0.5469 0.9575 0.9649
Look familiar?
Now I’ve done this experiment with a number of people over the last few weeks and the responses can be roughly split into two different camps as follows:
1. Oh yeah, I know all about that – nothing to worry about. It’s pretty obvious why it happens isn’t it?
2. It’s a bug. How can the numbers be random if MATLAB always returns the same set?
What does random mean anyway?
If you are new to the computer generation of random numbers then there is something that you need to understand and that is that, strictly speaking, these numbers (like all software generated ‘random’ numbers) are not ‘truly’ random. Instead they are pseudorandom – my personal working definition of which is “A sequence of numbers generated by some deterministic algorithm in such a way that they have the same statistical properties of ‘true’ random numbers”. In other words, they are not random they just appear to be but the appearance is good enough most of the time.
Pseudorandom numbers are generated from deterministic algorithms with names like Mersenne Twister, L’Ecuyer’s mrg32k3a [1] and Blum Blum Schub whereas ‘true’ random numbers come from physical processes such as radioactive decay or atmospheric noise (the website www.random.org uses atmospheric noise for example).
For many applications, the distinction between ‘truly random’ and ‘pseudorandom’ doesn’t really matter since pseudorandom numbers are ‘random enough’ for most purposes. What does ‘random enough’ mean you might ask? Well as far as I am concerned it means that the random number generator in question has passed a set of well defined tests for randomness – something like Marsaglia’s Diehard tests or, better still, L’Ecuyer and Simard’s TestU01 suite will do nicely for example.
The generation of random numbers is a complicated topic and I don’t know enough about it to do it real justice but I know a man who does. So, if you want to know more about the theory behind random numbers then I suggest that you read Pierre L’Ecuyer’s paper simply called ‘Random Numbers’ (pdf file).
Back to MATLAB…
Always the same seed
So, which of my two groups are correct? Is there a bug in MATLAB’s random number generator or not?
There is nothing wrong with MATLAB’s random number generator at all. The reason why the command rand(10,1) will always return the same 10 numbers if executed on startup is because MATLAB always uses the same seed for its pseudorandom number generator (which at the time of writing is a Mersenne Twister) unless you tell it to do otherwise.
Without going into details, a seed is (usually) an integer that determines the internal state of a random number generator. So, if you initialize a random number generator with the same seed then you’ll always get the same sequence of numbers and that’s what we are seeing in the example above.
Sometimes, this behaviour isn’t what we want. For example, say I am doing a Monte Carlo simulation and I want to run it several times to verify my results. I’m going to want a different sequence of random numbers each time or the whole exercise is going to be pointless.
One way to do this is to initialize the random number generator with a different seed at startup and a common way of achieving this is via the system clock. The following comes straight out of the current MATLAB documentation for example
RandStream.setDefaultStream(RandStream('mt19937ar','seed',sum(100*clock)));
If you are using MATLAB 2011a or above then you can use the following, much simpler syntax to do the same thing
rng shuffle
Do this once per MATLAB session and you should be good to go (there is usually no point in doing it more than once per session by the way….your numbers won’t be any ‘more random’ if you so. In fact, there is a chance that they will become less so!).
Condor and ‘random’ random seeds
Sometimes the system clock approach isn’t good enough either. For example, at my workplace, Manchester University, we have a Condor Pool of hundreds of desktop machines which is perfect for people doing Monte Carlo simulations. Say a single simulation takes 5 hours and it needs to be run 100 times in order to get good results. On one machine that’s going to take about 3 weeks but on our Condor Pool it can take just 5 hours since all 100 simulations run at the same time but on different machines.
If you don’t think about random seeding at all then you end up with 100 identical sets of results using MATLAB on Condor for the reasons I’ve explained above. Of course you know all about this so you switch to using the clock seeding method, try again and….get 100 identical sets of results[2].
The reason for this is that the time on all 100 machines is synchronized using internet time servers. So, when you start up 100 simultaneous jobs they’ll all have the same timestamp and, therefore, have the same random seed.
It seems that what we need to do is to guarantee (as far as possible) that every single one of our condor jobs gets a unique seed in order to provide a unique random number stream and one way to do this would be to incorporate the condor process ID into the seed generation in some way and there are many ways one could do this. Here, however, I’m going to take a different route.
On Linux machines it is possible to obtain small numbers of random numbers using the special files /dev/random and /dev/urandom which are interfaces to the kernel’s random number generator. According to the documentation ‘The random number generator gathers environmental noise from device drivers and other sources into an entropy pool. The generator also keeps an estimate of the number of bit of the noise in the entropy pool. From this entropy pool random numbers are created.’
This kernel generator isn’t suitable for simulation purposes but it will do just fine for generating an initial seed for MATLAB’s pseudorandom number generator. Here’re the MATLAB commands
[status seed] = system('od /dev/urandom --read-bytes=4 -tu | awk ''{print $2}''');
seed=str2double(seed);
RandStream.setDefaultStream(RandStream('mt19937ar','seed',seed));
In MATLAB 2011a or above you can change this to
[status seed] = system('od /dev/urandom --read-bytes=4 -tu | awk ''{print $2}''');
seed=str2double(seed);
rng(seed);
Put this at the beginning of the MATLAB script that defines your condor job and you should be good to go. Don’t do it more than once per MATLAB session though – you won’t gain anything!
The end or the end of the beginning?
If you asked me the question ‘How do I generate a random seed for a pseudorandom number generator?’ then I think that the above solution answers it quite well. If, however, you asked me ‘What is the best way to generate multiple independent random number streams that would be good for thousands of monte-carlo simulations?‘ then we need to rethink somewhat for the following reasons.
Seed collisions: The Mersenne twister algorithm currently used as the default random number generator for MATLAB uses a 32bit integer seed which means that it can take on 2^32 or 4,294,967,296 different values – which seems a lot! However, by considering a generalisation of the birthday problem it can be seen that if you select such a seed at random then you have a 50% chance choosing two identical seeds after only 65,536 runs. In other words, if you perform 65,536 simulations then there is a 50% chance that two such simulations will produce identical results.
Bad seeds: I have read about (but never experienced) the possibility of ‘bad seeds’; that is some seeds that produce some very strange, non-random results – for the first few thousand iterates at least. This has led to some people advising that you should ‘warm-up’ your random number generator by asking for, and throwing away, a few thousand random numbers before you start using them. Does anyone know of any such bad seeds?
Overlapping or correlated sequences: All pseudorandom number generators are periodic (at least, all the ones that I know about are) – which means that after N iterations the sequence repeats itself. If your generator is good then N is usually large enough that you don’t need to worry about this. The Mersenne Twister used in MATLAB, for instance, has a huge period of (2^19937 – 1)/2 (half of the standard 32bit implementation).
The point I want to make is that you don’t get several different streams of random numbers, you get just one, albeit a very big one. Now, when you choose a seed you are essentially choosing a random point in this stream and there is no guarantee how far apart these two points are. They could be separated by a distance of trillions of points or they could be right next to each other – we simply do not know – and this leads to the possibility of overlapping sequences.
Now, one could argue that the possibility of overlap is very small in a generator such as the Mersenne Twister and I do not know of any situation where it has occurred in practice but that doesn’t mean that we shouldn’t worry about it. If your work is based on the assumption that all of your simulations have used independent, uncorrelated random number streams then there is a possibility that your assumptions could be wrong which means that your conclusions could be wrong. Unlikely maybe, but still no way to do science.
Next Time
Next time I’ll be looking at methods for generating guaranteed independent random number streams using MATLAB’s in-built functions as well as methods taken from the NAG Toolbox for MATLAB. I’ll also be including explicit examples that use this stuff in Condor.
Ask the audience
I assume that some of you will be in the business of performing Monte-Carlo simulations and so you’ll probably know much more about all of this than me. I have some questions
- Has anyone come across any ‘bad seeds’ when dealing with MATLAB’s Mersenne Twister implementation?
- Has anyone come across overlapping sequences when using MATLAB’s Mersenne Twister implementation?
- How do YOU set up your random number generator(s).
I’m going to change my comment policy for this particular post in that I am not going to allow (most) anonymous comments. This means that you will have to give me your email address (which, of course, I will not publish) which I will use once to verify that it really was you that sent the comment.
Notes and References
[1] L’Ecuyer P (1999) Good parameter sets for combined multiple recursive random number generators Operations Research 47:1 159–164
[2] More usually you’ll get several different groups of results. For example you might get 3 sets of results, A B C, and get 30 sets of A, 50 sets of B and 20 sets of C. This is due to the fact that all 100 jobs won’t hit the pool at precisely the same instant.
[3] Much of this stuff has already been discussed by The Mathworks and there is an excellent set of articles over at Loren Shure’s blog – Loren onThe Art of MATLAB.
