Heads or tails. C++ edition

John Edmund Kerrich is known as the man who flipped a coin 10,000 times to test the law of large numbers in practice. It's not reliably known how long that tedious experiment took him, but circumstances helped him one way or another: he was in German captivity, and, however ironic that may sound, he had a lot of free time. Also, John had an assistant, Eric Christensen, who probably doubled the speed of the experiment.

This story made me think about the heavy price in time and patience that some pre-computer-era experiments demanded. With this article I want to pay tribute to persistent fellows like Kerrich, and show how far progress has taken us. And how you can coolly mix all that with C++, optimization, multithreading, and a pinch of somewhat insane programming.

What I want

In Kerrich's experiment heads and tails came up 5067 and 4933 times respectively. Simple arithmetic shows that's 50.67% versus 49.33%. In other words, the results approach the theoretical 50%, although they still show noticeable deviations. I would say very visible deviations. Of course, it all depends on how greedy you are and on the precision you want to achieve. For some people even 50.67% may seem close enough to 50%, and the experiment exhaustive and completely finished.

Looking at those results, I wondered how far one could push the experiment, even if not by fiddling with a real coin. I was more interested in the point when we really dangerously approach 50%, and how much time it would take. The question is, of course, more provocative or, at best, naive than serious, because it contains glaring holes:

- what exactly counts as "dangerously approaching" 50%? any precision threshold here is arbitrary;
- the observed ratio doesn't converge monotonically: it can touch 50% early, drift away, and come back;
- the time it takes depends entirely on the hardware and software doing the flipping.

Given that, I decided to postpone setting strict bounds and conditions and simply probe this elusive probability - out of curiosity and fun - making the experiment rules as I go.

Primary idea

The first thing I wanted was to flip the coin as many times as the software and hardware would allow, but without going completely nuts in the process. For example, flip the coin as many times as fits into a 32-bit unsigned integer. Even better, 64-bit. Brute-force iterate over the entire range of numbers, toss the coin during the loop, record the result, measure the elapsed time.

I'll run the experiment in the following environment: Windows, x64, Visual Studio 2022, C++20.

Let's go.

Carefully preparing for the experiment

We'll start from the big picture.

First, I needed to write a toy framework on top of which I could flexibly run all the calculations:

- a function that measures and reports how long an arbitrary piece of work takes (measure);
- a function that repeats an arbitrary piece of work a given number of times (spin);
- the useful work itself (doWork).

I ideally saw all three kinds of work nested like a matryoshka of the following form:

measure( spin(doWork) )
// careful, pseudocode ahead! proper syntax and semantics are not respected here

In other words: we want to measure and display (measure) a repeatedly executed (spin) chunk of work (doWork). As a bonus, the approach should be totally generic, so I could cannibalize this code for my other day-to-day needs if necessary.

Measuring time

I want a measure function that takes a callback of interest as a parameter, runs it, and measures its execution time. In the most basic form I imagine it like this:

using namespace std::chrono;
using Clock = high_resolution_clock;

template<typename Func>
void measure(Func&& f)
{
    auto start = Clock::now();

    f();

    auto dt = Clock::now() - start;
    std::cout << dt << std::endl;
}

Simple as that: start the timer, call the function, record the execution time, print it out. A call might look like this:

void doWork() {
    std::cout << "FFFUUU " << std::endl;
}

measure(doWork);

As a result, the console will print:

FFFUUU
36100ns

It really works, but the current version of measure has problems:

- it can only run callbacks that take no arguments and return nothing;
- the raw duration printout is crude and hard to read.

First, let's deal with the first problem. It can be solved by seasoning measure with a pretty good pinch of all the most modern (and maybe not-so-modern) C++ tricks:

template<typename Func, typename... Args>
auto measure(Func&& f, Args&&... args) -> decltype(f(std::forward<Args>(args)...))
{
    auto start = Clock::now();

    if constexpr (std::is_void_v<decltype(f(std::forward<Args>(args)...))>) {
        f(std::forward<Args>(args)...);
        auto dt = Clock::now() - start;
        std::cout << dt << std::endl;
    } else {
        auto res = f(std::forward<Args>(args)...);
        auto dt = Clock::now() - start;
        std::cout << dt << std::endl;
        return res;
    }
}

void echo(std::string_view s) {
    std::cout << s << std::endl;
}

int sum(int a, int b) {
    return a + b;
}

int main()
{
    measure(std::bind(echo, "hey!"));
    std::cout << std::endl;

    int ab = measure(std::bind(sum, 12, 2000));
    std::cout << ab << std::endl;
    return 0;
}

Output to the screen:

hey!
151100ns

200ns
2012

Variadic templates, decltype, and constexpr merged in a single ecstasy on a tiny patch of code so that measure turned into a many-faced joker: it can return a value or not; it can take one set of arguments or another.

In fact, we made measure into something like a Python decorator. You can also wrap the repeating piece of code in a lambda:

template<typename Func, typename... Args>
auto measure(Func&& f, Args&&... args) -> decltype(f(std::forward<Args>(args)...))
{
    auto start = Clock::now();

    auto summarize = [&start]() {
        auto dt = Clock::now() - start;
        std::cout << dt << std::endl;
    };

    if constexpr (std::is_void_v<decltype(f(std::forward<Args>(args)...))>) {
        f(std::forward<Args>(args)...);
        summarize();
    } else {
        auto res = f(std::forward<Args>(args)...);
        summarize();
        return res;
    }
}

I think I’ll stick with this version.


Now, about the problem of printing time. The STL doesn’t provide a unified solution for smart duration output. If your duration<> is seconds, it’ll just print the integer number of seconds, and the fractional part will be lost — meaning 1.5 seconds will turn into 1s. If your duration<> is milliseconds, it’ll show ms, and so on. See for yourself what the standard library can do:

using namespace std::chrono_literals;

auto d = 200'000'000'000ns;

std::cout << d << std::endl;
std::cout << duration_cast<microseconds>(d) << std::endl;
std::cout << duration_cast<milliseconds>(d) << std::endl;
std::cout << duration_cast<seconds>(d) << std::endl;
std::cout << duration_cast<minutes>(d) << std::endl;

Output to the screen:

200000000000ns
200000000us
200000ms
200s
3min

Note: The kind of time output you see above is available only starting with C++20. Back in C++17, std::cout << d wouldn’t compile at all — you’d have to do it like this: std::cout << d.count(). And the output would lack suffixes like ms, s, min, etc. — just a plain number.
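For reference, a C++17-friendly fallback might look like this, appending the suffix by hand:

std::cout << duration_cast<milliseconds>(d).count() << "ms" << std::endl; // e.g. 200000ms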

I started wondering how I would actually want to see measurement times. How to visually represent 200,000,000,000 nanoseconds in a sensible way? At first, I leaned toward casting duration sequentially into hours, minutes, seconds, milliseconds, etc., until we got a result greater than zero. Then 200,000,000,000 nanoseconds would turn into 3min. That’s not bad, and the fractional part might not matter much at that scale… or would it?

I hesitated, then decided that if I break the duration into all its components down to nanoseconds — for example, 2min 36s 649ms 551us 558ns — I would get ultimate clarity: readability, precision, and no ambiguity. So I implemented exactly that option:

template <typename Dur>
void prettyPrintDuration(Dur dur)
{
    auto h = duration_cast<hours>(dur);
    if (h.count()) { std::cout << h << " "; dur -= h; }

    auto m = duration_cast<minutes>(dur);
    if (m.count()) { std::cout << m << " "; dur -= m; }

    auto s = duration_cast<seconds>(dur);
    if (s.count()) { std::cout << s << " "; dur -= s; }

    auto ms = duration_cast<milliseconds>(dur);
    if (ms.count()) { std::cout << ms << " "; dur -= ms; }

    auto us = duration_cast<microseconds>(dur);
    if (us.count()) { std::cout << us << " "; dur -= us; }

    auto ns = duration_cast<nanoseconds>(dur);
    if (ns.count()) { std::cout << ns << " "; dur -= ns; }

    std::cout << std::endl;
}

Testing:

using namespace std::chrono_literals;

auto d = 987'654'321'978ns;
prettyPrintDuration(d);

We get:

16min 27s 654ms 321us 978ns

Some might say it’s too long. I’d say I’ll look at this time as deeply as I need to. For our current task, it’s more than sufficient.

Controlled spinning

Another general-purpose function we’ll need: spin. It should perform the same work a specified number of times.

In its simplest form, it could look like this:

template<typename Func>
void spin(Func&& f, size_t count)
{
    for (size_t i = 0; i<count; ++i) {
        f();
    }
}

It’s very simple. But since my idea was to iterate over the entire range of an unsigned integer type, I wanted to delegate all the range-bound calculations to spin as well. At this stage, I don’t yet know whether I’ll be using uint8_t, uint16_t, uint32_t, or uint64_t as the type. I know for sure that uint8_t will be useful for quick tests — 256 iterations will likely fit entirely on the screen and complete almost instantly for any reasonably heavy workload.

Okay, it’s enough to pass the desired type as a template argument:

template<std::unsigned_integral T, typename Func>
void spin(Func&& f)
{
    for (T i = 0; i < std::numeric_limits<T>::max(); ++i) {
        f();
    }
}

But that’s a trap! If you run spin<uint8_t>(...), it will loop from 0 to 254 — one iteration short of the full range [0; 255]. Let’s fix it by changing < to <=:

template<std::unsigned_integral T, typename Func>
void spin(Func&& f)
{
    for (T i = 0; i <= std::numeric_limits<T>::max(); ++i) {
        f();
    }
}

But that’s a trap! Congratulations — you’re in an infinite loop, because any number is always <= its maximum. That’s how it turned out there’s no clean one-liner to iterate over the entire range. Sure, if you give the variable i the type uintmax_t, it will work as intended for 8, 16, and (possibly) 32 bits. But when we get to 64 bits — which uintmax_t is on many platforms — we face the same trap again.

In the end, I came to this awkward compromise:

template<std::unsigned_integral T, typename Func>
void spin(Func&& f)
{
    for (T i = 0; ; ++i) {
        f();
        if (i == std::numeric_limits<T>::max()) break;
    }
}

Carefully probe the counter after each useful work execution, exiting the loop just a moment before disaster strikes.
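For completeness, the same full-range traversal can also be written as a do/while; whether it's any cleaner is a matter of taste. A sketch:

template<std::unsigned_integral T, typename Func>
void spin(Func&& f)
{
    T i = 0;
    do {
        f();
    } while (i++ != std::numeric_limits<T>::max()); // compares the old value, then increments; the final wrap to 0 is harmless
}

Here the comparison sees the pre-increment value, so the loop body runs exactly once for every value in [0; max], and the unsigned wraparound after the last iteration is well-defined and irrelevant.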


So now we have everything we need to run work and take measurements:

void doWork() {
    std::cout << "F";
}

int main()
{
    measure([](){ spin<uint8_t>(doWork); });
    return 0;
}

And we get 256 ($2^8$) glorious Fs printed to the screen:

FFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFFF6ms 342us 500ns

That’s the end of our preparatory digression from the main topic of the article — let’s start flipping the coin!

Flipping the coin

What we need first and foremost is a way to generate a random boolean, or a zero and one. The canonical C++ way to generate random numbers is:

std::mt19937 gen(std::random_device{}());
std::uniform_int_distribution uni(0, 1);

Everything else for the coin flipping is just a matter of coding without much thought:

using BigInt = uint8_t;
std::mt19937 gen(std::random_device{}());
std::uniform_int_distribution uni(0, 1);
BigInt heads = 0;
BigInt tails = 0;

void doWork() {
    int res = uni(gen);
    if (res == 1) {
        ++heads;
    } else if (res == 0) {
        ++tails;
    } else {
        assert(false);
    }
}

int main()
{
    measure([](){ spin<BigInt>(doWork); });

    const double headsPercent = heads / static_cast<double>(std::numeric_limits<BigInt>::max()) * 100.0;
    const double tailsPercent = tails / static_cast<double>(std::numeric_limits<BigInt>::max()) * 100.0;

    std::cout << "Heads: " << +heads << ", " << headsPercent << "%" << std::endl;
    std::cout << "Tails: " << +tails << ", " << tailsPercent << "%" << std::endl;

    return 0;
}

Note: I wrote +heads and +tails when printing to the screen. This isn't accidental: it's needed for uint8_t, which std::cout otherwise treats as a char type, producing ASCII gibberish instead of numbers. The unary + promotes the value to int, so it's printed as a number.
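A two-line illustration of the effect (on typical platforms, where uint8_t is an alias for unsigned char):

uint8_t v = 65;
std::cout << v << " " << +v << std::endl; // prints: A 65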

As our "big number" — BigInt — we’ll take uint8_t for testing. We flip the coin with our measure/spin machinery, count the percentages, and get:

2us 700ns
Heads: 122, 47.8431%
Tails: 134, 52.549%

Hooray, we’ve run our first coin-flipping experiment, and it took just 2.7 microseconds! Yes, 256 is not Kerrich’s 10,000, so the percentages scatter quite a bit. But overall, it looks fine. At least that’s what I thought.

However, one must stay vigilant — while tinkering with the code, at some point I accidentally got this perfect result where heads and tails matched exactly:

2us 700ns
Heads: 128, 50.1961%
Tails: 128, 50.1961%

…and I realized my calculations were flawed. As you can see, the sum of probabilities 50.1961% and 50.1961% ends up slightly over 100%.

The cause was dividing by std::numeric_limits<BigInt>::max(). In fact, we should have divided by one number larger. But where to get such a number if this is the maximum? In the end, I decided not to overcomplicate things — just cast that number to double, add one, and then divide. Fortunately, double is flexible enough, though it doesn’t guarantee exact representation of large numbers, so we must remember this solution is approximate. But at the early stage, it’s good enough:

const double headsPercent = heads / (static_cast<double>(std::numeric_limits<BigInt>::max()) + 1.0) * 100.0;
const double tailsPercent = tails / (static_cast<double>(std::numeric_limits<BigInt>::max()) + 1.0) * 100.0;

Cumbersome, but hey — this is C++. Still, after about a dozen tries I hit 128/128 again:

2us 304ns
Heads: 128, 50%
Tails: 128, 50%

And now the percentages are calculated correctly.

By the way, a question for mathematicians: what’s the probability of getting such a probability? :) In a short span of time, I got exactly 128 heads versus 128 tails twice! At larger numbers, such a match is probably extremely unlikely, and I haven’t seen it anywhere beyond 256 coin flips.
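For the curious, this one actually has a closed-form answer. The probability of exactly 128 heads in 256 fair flips is $\binom{256}{128} \cdot 2^{-256} \approx \frac{1}{\sqrt{128\pi}} \approx 0.0498$ (the approximation comes from Stirling's formula), i.e., roughly one run in twenty. A quick back-of-the-envelope estimate then puts seeing it twice within a dozen runs at around a 12% event: lucky, but hardly miraculous.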

Now let’s break the scientist’s record by flipping the coin more than 10,000 times. The uint16_t type allows us to run 65,536 iterations — just what we need. Let’s change BigInt to uint16_t and we get:

569us 400ns
Heads: 32685, 49.8734%
Tails: 32851, 50.1266%

Well — our 50.1266% already has less error than Kerrich’s 50.67%.

But there’s an interesting point. To bring it up, I have to admit I omitted another old experiment.

Half a century before John Edmund Kerrich flipped a coin 10,000 times, another scientist, Karl Pearson, did the same with his students. They ran even more iterations — a crazy 24,000 flips, with results of 12,012 to 11,988, or 50.05% versus 49.95%.

My generator with 65,536 coin flips performs nearly three times as many flips, yet the percentage more often stays around \[50.1%; 50.2%\], which is "worse" than Pearson's results, although sometimes it approaches his results and sometimes slightly undershoots them. From this, we could cautiously conclude that 24,000 or 65,000 coin flips don't differ significantly in terms of statistical results.

Either Pearson lied.

Or my generator lies.

Or I’m lying.

Isolating the experiment

As you might guess, the next plan is to push further. We’ll use 32-bit and 64-bit ranges for iteration and see the results. I suspect some keen readers already sense something is off, but at this stage I felt completely fine and suspected nothing. More on that later…

I didn’t want to change the BigInt type, recompile the program, and rerun the experiment every time. Automation, remember? Besides, the idea struck me that it would be nice to see a summary table with a progression — experiments run consecutively for 8-, 16-, 32-, and 64-bit ranges, with results calculated and logged.

But for such a multi-step experiment, our code wasn't ready at all. What were we missing?

- an Experiment class that encapsulates the generator, the counters, and the coin-flip logic, instead of a pile of globals;
- a way to run the same templated code for each type in a list: uint8_t, uint16_t, uint32_t, uint64_t.

Alright, let’s start with something simpler and declare the long-overdue Experiment class:

template<typename BigInt>
struct Experiment
{
    std::mt19937 gen {std::random_device{}()};
    std::uniform_int_distribution<int> uni {0, 1};
    BigInt heads = 0;
    BigInt tails = 0;

    void tossCoin() {
        int res = uni(gen);
        if (res == 1) {
            ++heads;
        }
        else if (res == 0) {
            ++tails;
        }
        else {
            assert(false);
        }
    }
};

Well, that was too simple. So let’s move on to the tricky part.

Iterating over types — that’s something in the realm of templates, metaprogramming, right? Looks like it. Let’s think step-by-step: we want something like

for (type : Types) {
    f<type>();
}
// careful, pseudocode ahead! proper syntax and semantics are not respected here

where Types is a list of types, and f is a templated function. Only this is pseudocode.

But how do we get actual C++? There are variadic templates, whose unpacking can essentially "generate" code as a virtual list, similar to the good old macros. That is, the task can be reduced to something like this:

template<typename ...Types, typename Func>
void forEachType(Types..., Func f)
{
    (f<Types>(), ...);
}
// careful, pseudocode ahead! proper syntax and semantics are not respected here

So you don’t have to rack your brain, here’s a hint: the expression (f<Types>(), ...); should theoretically expand into

(f<uint8_t>(), f<uint16_t>(), f<uint32_t>(), f<uint64_t>());

That is, into a comma operator that lists the expression for each type and executes each in sequence. Just don’t forget — we’re still in pseudocode, and as it stands it won’t compile!

But our last pseudocode wasn’t far from the truth, and if we file it down a bit into proper C++, it will work:

template<typename... Ts, typename F>
void forEachType(F&& f, Ts...)
{
    (f.template operator()<Ts> (), ...);
}

int main()
{
    forEachType([]<typename T>() {
        std::cout << +std::numeric_limits<T>::max() << std::endl;
    }, uint8_t{}, uint16_t{}, uint32_t{}, uint64_t{});
}

Just don’t ask me anything about template operator(), please — I barely understand its necessity myself, but it only works with it. (For the curious: since f’s type is a template parameter, operator() is a dependent name, and the template keyword tells the parser that the following < opens a template argument list rather than being a less-than.)

Output to the screen:

255
65535
4294967295
18446744073709551615

Let me say right away — I don’t like this implementation for several reasons:

- to pass the types we have to instantiate a dummy value of each one (uint8_t{}, uint16_t{}, and so on), which is wasteful and simply impossible for types that aren't default-constructible;
- the type list isn't a single entity: it can't be named, stored, or passed around as one object.

Both problems can be almost solved by a trick — hiding the type list behind an additional entity:

template<typename... Ts>
struct TypeList {};

Now the variadic template is hidden behind the TypeList object, and from the outside — it’s a single unified object, which can be moved forward into forEachType. Here’s how it works:

template<typename... Ts>
struct TypeList {};

template<typename... Ts, typename F>
void forEachType(TypeList<Ts...>, F&& f)
{
    (f.template operator()<Ts> (), ...);
}

int main()
{
    forEachType(TypeList<uint8_t, uint16_t, uint32_t, uint64_t>{}, []<typename T>() {
        std::cout << +std::numeric_limits<T>::max() << std::endl;
    });
}

Yes, we’re still creating a fake object — TypeList<>{} — but there’s only one, so I consider it a good compromise. We ended up with another neat trick worth keeping for other projects.

First results

Now we can execute our plan and run the experiment with iterations of different sizes:

8 bits
8us 200ns
Heads: 120, 46.875%
Tails: 136, 53.125%

16 bits
571us 600ns
Heads: 32735, 49.9496%
Tails: 32801, 50.0504%

32 bits
35s 839ms 225us 500ns
Heads: 2147474281, 49.9998%
Tails: 2147493015, 50.0002%

64 bits

Well, 4 billion flips from a 32-bit domain after 35 seconds already give a pretty good result: 50.0002% versus 49.9998%. That’s significantly better than the experiments of Messrs. Pearson and Kerrich.

You might think I forgot to copy-paste part of the text and accidentally skipped the 64 bits result, but no. Those who sensed something fishy back a chapter ago — rejoice, you were right: running the full 64-bit range can take an enormous amount of time.

I didn’t give up and decided to wait a bit longer — I let the program run overnight. In the morning, I found the console still stuck on the same unchanged screen, and I decided to wait a bit longer. Only when that didn’t help, and by evening the screen still showed nothing below the 64 bits line — only then did I start suspecting something.

Let’s look at how many numbers fit into different bit widths:

 8 bit: 256
16 bit: 65536
32 bit: 4294967296
64 bit: 18446744073709551616

If you look closely, you’ll see that each count is the square of the previous one, and 18,446,744,073,709,551,616 is nothing but 4,294,967,296 × 4,294,967,296. In other words, if flipping a coin in the 32-bit domain took me 35 seconds, then the 64-bit domain would take about 4,294,967,296 times longer…

35 * 4 294 967 296 seconds =
150 323 855 360 seconds =
2 505 397 589.33 minutes =
41 756 626.48 hours =
1 739 859.44 days =
4 766.74 years

Oh indeed.

The date of Jesus’ second coming is closer to us than the date of experiment completion.

In short, traversing the full 64-bit range is evidently unrealistic. Even if we imagine flipping the coin in the 32-bit domain takes just 1 millisecond, the 64-bit domain would still take about

4 294 967 296 (four billion) milliseconds =
4 294 967.296 (four million) seconds =
71 582.79 minutes =
1 193.05 hours =
49.7 days

Alright — a month and a half! Not four thousand years, of course, but getting a 32-bit run down to 1 millisecond seems unrealistic anyway.


At this point it became clear the current plan had to change, because a 32-bit range is too small for us, and a 64-bit range is unimaginably huge, and however many times we flip the coin in the end, it will be more than 4,294,967,296 (four billion) but far less than 18,446,744,073,709,551,616 (eighteen quintillion).

Keeping these numbers in mind, I adjusted the course of the experiment:

- abandon the idea of exhausting the full 64-bit range;
- instead, squeeze the maximum number of coin flips per unit of time out of the code;
- then flip as many coins as a reasonable amount of wall-clock time allows.

Optimizations

The plan is simple — we’ll try to squeeze the maximum speed out of coin flipping. The more coins we flip in a unit of time, the better.

To iteratively improve performance, we need objective measurements. If a code tweak produces a faster result (while preserving correct algorithm functionality), we keep it and move on.

A key question remains: how to get objective measurements. I’ll tell you right away — I won’t bother making the benchmark perfectly sterile. There will always be people saying "you measured wrong": that I didn't pin threads to cores, didn't disable turbo boost, didn't average over a thousand runs, and so on.

So here are the measurement conditions: the same ordinary developer machine in its ordinary state, the same optimized build settings, and the same workload for every run. What matters for this exercise is the relative speedup between code versions, not laboratory-grade absolute numbers.

Another important point: to make measurements comparable, we must define the number of coin flips we will measure. Before optimizations, it was hard to guess a suitable number. One could think that 4,294,967,296 is a good candidate since it took 35 seconds — long enough to be meaningful. But if our optimizations are too effective, that number could drop to mere milliseconds, at which point measurement differences would fall into statistical noise.

I’ll fudge this retroactively, because — as you understand — the article is written after the experiment is finished, and I know what awaits at the end. So I’ll settle on 68,719,476,736 iterations as my benchmark. That’s 4,294,967,296 × 16, i.e. sixteen runs of a 32-bit number range.

Why not choose a nice round number, you ask? — After all, if we no longer need all those $2^{32}$, $2^{64}$, why not do 10 billion, 50 billion flips, etc.?

The first answer: I forgot and didn’t think about it! — even after scrapping the original idea I just kept multiplying 4,294,967,296 by bigger and bigger values. Brain deformation, nothing else.

The second answer: we’ll need it later. During further optimizations, we’ll end up with a restriction: the number of flips must be a multiple of 64, and later this requirement will increase, but it will always be tied to powers of two. With my brutal 68,719,476,736 (a power of two, $2^{36}$) this condition is automatically satisfied.

Let’s lock in what we have

Let’s lock the main pre-optimization code so you have a fairly complete picture:

using BigInt = uint64_t;

template<typename Func>
void spin(BigInt n, Func&& f)
{
    for (BigInt i = 0; i < n; ++i) {
        f();
    }
}

struct Experiment
{
    std::mt19937 gen {std::random_device{}()};
    std::uniform_int_distribution<int> uni {0, 1};
    BigInt heads = 0;
    BigInt tails = 0;

    void tossCoin() {
        int res = uni(gen);
        if (res == 1) {
            ++heads;
        }
        else if (res == 0) {
            ++tails;
        }
        else {
            assert(false);
        }
    }
};

int main()
{
    constexpr BigInt N = 68'719'476'736ull;

    Experiment experiment;
    measure([&experiment, N]() {
        spin(N,
            std::bind(&Experiment::tossCoin, &experiment)
        );
    });
}

Execution time of this code:

Time:  20min 41s 897ms 383us 600ns
Heads: 34 359 699 204, 49.9999%
Tails: 34 359 777 532, 50.0001%

A painfully long 20 minutes and 42 seconds to flip the coin 68 billion times. That’s our starting point.

Notice the console output — I made a small improvement to display large numbers in a formatted style: 34 359 699 204 instead of 34359699204. Adding spaces to a string isn’t hard, and readability for numbers of this magnitude improves dramatically.

How to optimize

During my research, I tested quite a lot of optimization hypotheses — some successful, some dead ends. Below is a brief summary of the successful ones. The chronology will be a bit chaotic, but that reflects my real journey, jumping from one insight to another.

Dissecting randomness

First, I decided to tackle these guys:

std::mt19937 gen {std::random_device{}()};
std::uniform_int_distribution<int> uni {0, 1};

They were too suspicious, and there were too many of them.

First, I tried switching to a classic C approach: srand + rand. That was a disaster — it was slow.

Then I experimented with other distribution classes. Some gave completely incorrect results, not converging to 50/50. Others worked and were even slightly faster than std::uniform_int_distribution, for example std::bernoulli_distribution. Plus, that distribution naturally produces bool. But the speed gain was minimal.

That made me wonder: do I even need a distribution? What if I just throw it away? Essentially, all I need is a zero or a one. The generator itself — std::mt19937 and its friends — can generate random numbers directly via operator(). The catch is that we’ll get some random number (its type depends on the generator class), and we need to convert it into the range \[0; 1\].

I naively reasoned that if I just take the zero-th bit of that number, it should be sufficiently random. It turned out not all generators handle this well, but generators like std::minstd_rand and std::mt19937 continued producing distributions tending toward 50/50, so I confidently threw away std::uniform_int_distribution and changed the coin flip code to this:

-int res = uni(gen);
+int res = gen() & 1;

Results:

Time:  10min 50s 93ms 886us 900ns
Heads: 34 359 738 368, 50%
Tails: 34 359 738 368, 50%

Whoa! We just doubled the program speed out of nowhere. That was quite unexpected — I didn’t think I’d get such a huge boost so quickly and easily.

Some might say I sacrificed correct randomness for optimization. Honestly, I don’t care about academic “correctness” of randomness as long as the program produces a reasonable picture: as the number of coin flips grows, the ratio of heads to tails approaches 50/50. After all, when Kerrich flipped his coin, his distribution wasn’t perfect either — the coin wasn’t perfect, his hand wasn’t perfect — and presumably, a person’s individual tossing style could theoretically influence the outcome.

We record a ×1.9 speedup from removing the uniform distribution.

Grouping bits

I already mentioned that gen() generates a random integer result, which we then reduce to a single bit. But why waste all those bits when we could use them all? std::mt19937 produces a 32-bit number per call, and using all of it would be equivalent to 32 coin flips at once.

The only problem with this idea — our Experiment class flips one coin at a time, and the outer function spin calls Experiment::tossCoin N times. If we want to stay in this paradigm, we need to get creative and call the generator once every 32 iterations, taking cached results for the others. My implementation was awkward:

struct Experiment
{
    Experiment()
    {
        regenBits();
    }

    using Generator = std::mt19937;
    using BitsType = Generator::result_type;

    BigInt heads = 0;
    BigInt tails = 0;

    Generator gen {std::random_device{}()};
    BitsType bits{};
    int bitsCounter = 0;

    void tossCoin() {
        if (yieldBit()) {
            ++heads;
        } else {
            ++tails;
        }
    }

    bool yieldBit() {
        if (bitsCounter >= std::numeric_limits<BitsType>::digits) {
            regenBits();
        }

        return (bits >> bitsCounter++) & 1;
    }

    void regenBits() {
        bitsCounter = 0;
        bits = gen();
    }
};

But I didn’t overcomplicate things — I just measured it as is:

Time:  7min 8s 600ms 933us
Heads: 34 359 585 551, 49.9998%
Tails: 34 359 891 185, 50.0002%

Another three minutes shaved off, giving us a speedup of ×1.5. That’s a very good result. What’s especially satisfying is that this code still has room for optimization in how we “spin” the coin-flipping function.

Restructuring the spin

It makes sense to move spin inside the Experiment class, giving it more freedom and flexibility. This should eliminate the awkward caching and the outdated need to flip a coin exactly once per call — since we can now flip 32 coins “pseudo-parallel.”

We also need an efficient way to “split” the number into bits and count zeros and ones. In practice, we can reduce this to counting either ones or zeros, and get the other half by subtracting from 32. For now, we’ll manually count using bit operations:

template <std::unsigned_integral T>
unsigned countOnes(T val) {
    unsigned n = 0;
    for (int i = 0; i < std::numeric_limits<T>::digits; ++i) {
        n += val & 1;
        val >>= 1;
    }
    return n;
}

Someone might notice that this algorithm could be optimized by breaking out of the loop as soon as val reaches zero. For example, like this:

while (val) {
    n += val & 1;
    val >>= 1;
}

And that makes sense algorithmically and logically — but on my machine, it runs twice as slow. I’d bet that the gigachad for loop benefits from SIMD and loop unrolling, while the while loop in machine code just keeps checking val against zero, and branch mispredictions keep tripping up the CPU again and again.

In its new form, with additional powers, the Experiment class now looks like this:

struct Experiment
{
    using Generator = std::mt19937;
    using BitsType = Generator::result_type;

    BigInt n = 0;
    BigInt heads = 0;
    BigInt tails = 0;

    Generator gen {std::random_device{}()};
    static constexpr size_t BITS_COUNT = std::numeric_limits<BitsType>::digits;

    Experiment(BigInt n_)
        : n(n_) {
    }

    void spin() {
        for (BigInt i = 0; i < n; i += BITS_COUNT) {
            BitsType bits = gen();
            heads += countOnes(bits);
        }

        tails = n - heads;
    }
};

Here’s how it’s called in the benchmark:

Experiment experiment(n);
measure(std::bind(&Experiment::spin, &experiment));

Note the BITS_COUNT, which implicitly requires that the number of coin flips be a multiple of 32 bits.

Measuring the time:

Time:  19s 826ms 601us
Heads: 34 359 613 380, 49.9998%
Tails: 34 359 863 356, 50.0002%

At first I thought I was imagining it, but no — instead of minutes of waiting, the same number of coin flips now took just twenty seconds! That’s an enormous speedup of ×21.6 — something phenomenal.

Yes, part of the boost came from the inefficiency of the previous implementation, but I was still very happy with such results.

Counting bits properly

In fact, it was obvious to me from the start that counting one-bits in a number could be done more efficiently — for example, by using intrinsics or platform-dependent solutions. But as a true C++ enthusiast, I chose to enjoy the benefits of C++20 and its std::popcount function (from the <bit> header), which returns the number of one-bits in any unsigned integer.

-heads += countOnes(bits);
+heads += std::popcount(bits);

Time:  5s 460ms 958us 600ns
Heads: 34 359 743 765, 50%
Tails: 34 359 732 971, 50%

×3.63. And that’s no surprise: the call to std::popcount() unfolds into a single machine instruction — POPCNT — and no solution can really compete with that level of efficiency.

Expanding the generator

I started to take a closer look at the whole lineup of generators provided by the STL. Here they are:

std::minstd_rand0
std::minstd_rand
std::mt19937
std::mt19937_64
std::ranlux24_base
std::ranlux48_base
std::ranlux24
std::ranlux48
std::knuth_b
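Rather than digging through documentation, we can let the compiler report how wide each generator's output actually is — conveniently reusing our forEachType trick, plus C++20's std::bit_width from the <bit> header. A quick sketch:

forEachType(TypeList<std::minstd_rand, std::mt19937, std::mt19937_64, std::ranlux48>{}, []<typename G>() {
    std::cout << std::bit_width(G::max()) << " significant bits" << std::endl;
});

For this particular list I'd expect 31, 32, 64 and 48 respectively, since bit_width(G::max()) reports how many bits the largest possible output occupies.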

It turned out that some generators produce 64-bit numbers, which is a potential ticket to a new twofold optimization! Without much thought, we make a one-line replacement:

-using Generator = std::mt19937;
+using Generator = std::mt19937_64;

and we get an improved result:

Time:  2s 554ms 51us 200ns
Heads: 34 359 573 276, 49.9998%
Tails: 34 359 903 460, 50.0002%

Now we simultaneously flip 64 coins instead of 32, and naturally get a corresponding boost: ×2.14. Also note that the number of flips must now be a multiple of 64.

For the sake of clarity, let’s explicitly state this invariant so we’re sure to catch it if it’s ever violated:

Experiment(BigInt n_)
    : n(n_) {
    if (n % BITS_COUNT != 0) {
        std::cerr << "Error: n must be a multiple of " << BITS_COUNT << ", got " << n << std::endl;
        std::abort();
    }
}

Automatic Multithreading

In C++17, the language gained an update to the <algorithm> header: functions like std::for_each, std::sort, std::find, etc., gained the ability to run in parallel mode. You just need to pass std::execution::par_unseq as the first argument to the algorithm (other policies make little sense for our task):

std::sort(std::execution::par_unseq, v.begin(), v.end());

And the vector will magically sort itself using all your cores. Well, not exactly — the STL decides on its own how to parallelize, how many threads to use, and whether to parallelize at all — you’re only giving it a recommendation via the execution policy.

But if you’re excited and think you can now just turn all your loops and standard algorithm calls into parallel ones, keep in mind: synchronization and data race prevention are still your responsibility. You must be acutely aware of this. A simple example:

std::vector<int> v(1000, 1);
int sum = 0;

std::for_each(std::execution::par_unseq, v.begin(), v.end(),
    [&](int x) {
        sum += x; // data race!
    }
);

Imagine that the STL spun up 1,000 threads (we don’t actually know how many) that all started incrementing sum at the same time. This variable is not atomic, so you’ve got trouble — the result is unlikely to equal 1,000.
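For the record, the race is trivially fixable: either make the accumulator atomic, or reach for the algorithm that was designed for reductions in the first place. A sketch (headers: <atomic>, <numeric>, <execution>):

std::vector<int> v(1000, 1);
std::atomic<int> sum{0};

std::for_each(std::execution::par, v.begin(), v.end(),
    [&](int x) {
        sum.fetch_add(x, std::memory_order_relaxed); // atomic increment, no race
    }
);

// or, simpler and usually faster:
int total = std::reduce(std::execution::par_unseq, v.begin(), v.end());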

In general, I decided that such parallel algorithms are the nicest and most casual entry point for introducing multithreading into my code, so the first thing I did was start experimenting with these algorithms.

At first, I tried to dive headfirst into this parallel world by writing a naive std::for_each while ignoring data races. Why? Just to quickly feel the execution speed of the algorithm without synchronization primitives or atomics, so I could know what speed I could expect in the limit (spoiler: it’s not as simple as I thought, and it doesn’t work that way). Code:

void spin() {
    BigInt steps = static_cast<size_t>(n / BITS_COUNT);
    std::vector<BigInt> indices(steps);
    std::iota(indices.begin(), indices.end(), 0);

    std::for_each(std::execution::par_unseq,
        indices.begin(), indices.end(),
        [this](size_t idx) {
            BitsType bits = gen();        // BUG: data race on random generator
            heads += std::popcount(bits); // BUG: non-atomic access from several threads
        }
    );

    tails = n - heads;
}

The results of its execution:

Time:  11s 199ms 665us 700ns
Heads: 6 069 669 426, 8.83253%
Tails: 62 649 807 310, 91.1675%

We immediately notice an obvious inaccuracy in the count — a total divergence from 50/50. I chalk this up to thread-unsafe modification of heads. What disappointed me even more: I got 11 seconds, which is four times worse than single-threaded execution. But how could that be?

Purely out of curiosity, I continued exploring parallel algorithms and decided I could rewrite my std::for_each using std::transform + std::reduce to eliminate the race condition for heads, and remove the race for the generator by giving each thread its own copy:

void spin() {
    BigInt steps = static_cast<size_t>(n / BITS_COUNT);
    std::vector<BigInt> res(steps);

    std::transform(std::execution::par_unseq,
        res.begin(), res.end(),
        res.begin(),
        [this](BitsType) {
            thread_local Generator gen{std::random_device{}()};
            BitsType bits = gen();
            return std::popcount(bits);
        }
    );

    heads = std::reduce(std::execution::par_unseq, res.begin(), res.end());
    tails = n - heads;
}

Notice how cleverly we’ve injected thread_local here — in this case, it’s almost the only way to achieve the desired result, since we don’t control the threads directly.

Execution results:

Time:  2s 250ms 10us 700ns
Heads: 34 359 867 682, 50.0002%
Tails: 34 359 609 054, 49.9998%

That is — execution time barely improved compared to our single-threaded version (2.25 versus 2.55 seconds). But it also didn’t get worse. And it should have improved dramatically.

What else could be improved in the paradigm of parallel STL algorithms was unclear to me. Moreover, STL algorithms do things internally that we can’t really control, so I decided to switch to manual thread management in order to enable precise optimizations.

Manual Multithreading

Disappointed with parallel STL algorithms, we move to the task of splitting the coin tosses across manually created threads. Fortunately, our task is isolated and simple, and doesn’t require shared resources — except for the generator, of course. But we’ve already cloned the generator for different threads in the previous chapter and saw that randomness still works fine. Some might say this is outright wrong, that there should ideally be one generator for the whole task; I would again answer that as long as randomness behaves correctly, I consider it working.

Briefly, the algorithm is:

- ask std::thread::hardware_concurrency() how many threads the machine comfortably runs;
- split the total number of generator calls into equal chunks, one per thread;
- let every thread flip coins with its own local generator, counting heads into a local variable;
- sum the per-thread results at the end.

Here’s the implementation:

void spin() {
    const BigInt steps = n / BITS_COUNT;
    const size_t threadsCount = std::max(std::thread::hardware_concurrency(), 1u);
    const BigInt chunkSize = steps / threadsCount;

    const auto thread = [this, chunkSize]() -> BigInt {
        Generator gen {std::random_device{}()};
        BigInt localHeads = 0;
        for (BigInt i = 0; i < chunkSize; ++i) {
            localHeads += std::popcount(gen());
        }
        return localHeads;
    };

    std::vector<std::future<BigInt>> threadHeads(threadsCount - 1);
    for (auto& f : threadHeads) {
        f = std::async(std::launch::async, thread); // force a real thread: the default policy is allowed to defer execution
    }

    heads += thread();
    for (auto&& fut : threadHeads) {
        heads += fut.get();
    }

    tails = n - heads;
}

Some interesting points:

- we launch threadsCount - 1 asynchronous tasks and run the last chunk on the current thread; no point letting it idle while everyone else works;
- each task creates its own generator and accumulates into a local counter, so the threads share no mutable state at all;
- chunkSize = steps / threadsCount silently drops a remainder of up to threadsCount - 1 steps; on my 16 logical cores the power-of-two flip count divides evenly, so we get away with it.

Execution time:

Time:  126ms 479us 900ns
Heads: 12 884 816 423, 49.9997%
Tails: 12 884 987 353, 50.0003%

Here it is — the payoff from multithreading. We register yet another huge speedup of x20.2!

It’s worth noting that the improvement heavily depends on the machine. I have 16 logical cores. Also, the local random generator plays a major role. At first glance, having 16 copies of std::mt19937_64 might seem wasteful — especially since each such generator weighs about 4.9KB of memory! Yes, objects of this class are that large. But if the generator was shared, we wouldn’t squeeze as much out of multithreading — even if we used a shared generator without locks (or pretended it was thread-safe), threads would constantly invalidate each other’s cache lines where the generator resides. I believe that’s why the implementation with std::for_each from the previous chapter took 11 seconds, while std::transform + std::reduce gave 2 seconds. The latter algorithm had no shared data, so threads didn’t “spill” each other’s cache.

I should also add that besides std::async, I experimented with std::thread combined with std::promise + std::future and with std::packaged_task. But the performance was about the same, and the boilerplate was heavier, so we stick to std::async as the most interface-friendly solution for our task.
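For illustration, the same worker wired through std::packaged_task looks roughly like this (a sketch; <future> and <thread> headers assumed) — note how much more plumbing it takes to obtain the same future:

std::packaged_task<BigInt()> task(thread);    // wrap the callable
std::future<BigInt> fut = task.get_future();  // grab the future before launching
std::thread t(std::move(task));               // run it on a dedicated thread

heads += fut.get();                           // blocks until the task finishes
t.join();                                     // and don't forget to join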

Custom Generator

From back when I was learning to write shaders, I remembered how often I had to whip up a pseudo-random generator out of primitive math operations: multiplication, shifting, addition. And it worked fine, because the randomness felt random enough — and what else did you need?

I wondered how realistic it would be to replace the heavy but high-quality std::mt19937_64 with a hand-written version. After some time, I ended up with this candidate replacement:

struct LCG64 {
    using result_type = BigInt;
    uint64_t state;

    LCG64(uint64_t seed = 1) : state(seed) {}

    uint64_t operator()() {
        state = state * 6364136223846793005 + 1; // the multiplier is Knuth's MMIX LCG constant
        return state;
    }
};

Don’t let its simplicity fool you! I tried generating a number of values with it and displaying them in binary form. For display purposes, a neat trick is to use std::bitset:

LCG64 gen{std::random_device{}()};
uint64_t bits = gen();
std::cout << std::bitset<64>(bits) << std::endl;

Here are several 64-bit numbers produced by the generator, in binary:

0101100010010111001110101000000011111000110110011000010111110100
0100001110101001110010100110001100001110101101001001011111100101
1101001000111110110110110010101010010001011000100100111001000010
0000101001110111000110111010100001110111100001000111111110011011
1111111000100110100000000010011100000110110011110101001101000000
1011110101100110111111101011110110011110000101110000100111100110
1110010000001000101010001000100101000010110100111101011101101111
0011001110001110001011000000110000101110101101111110111110000100
1000100000110000011111111010111100001101111110111001011000110101
0100100000010001000010010000110001101101100101101011001001010010
1001010010100101110001111001110100100111101011100000011001101011
1101110011111010111011000100111111001111000011010011010111010000
0001001010101111010001011111100100000101000101001010010110010001
1011100010110000100011100011000101001111111010100111011000100101

I don’t know about you, but to me the distribution of zeros and ones looks completely random, without patterns, and seems close to 50/50. In any case, this is easy to verify. We just need to change one line in our code:

-using Generator = std::mt19937_64;
+using Generator = LCG64;

and we run the benchmark:

68 719 476 736 rounds
Time:  93ms 529us 800ns
Heads: 34 359 798 049, 50.0001%
Tails: 34 359 678 687, 49.9999%

And nothing broke. Judging by the result — 50.0001% / 49.9999% — our homemade generator works as intended. Moreover, it runs faster than the previous one, giving us another x1.4 boost. I had expected a slightly bigger gain, but apparently, std::mt19937_64 is already quite efficient at its job.

Even more hardcore

I was fairly certain that using std::thread::hardware_concurrency() threads for the task was the most comfortable mode, and anything beyond that would just degrade performance. I found out I was wrong purely by accident, deciding out of boredom and desperation to double the number of threads. Then quadruple it… then octuple it… my eyebrows rose in proportion, and I finally settled at std::thread::hardware_concurrency() * 64 as the sweet spot.
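The change itself is a one-liner in spin:

-const size_t threadsCount = std::max(std::thread::hardware_concurrency(), 1u);
+const size_t threadsCount = std::max(std::thread::hardware_concurrency(), 1u) * 64;

And the results at that setting: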

Time:  75ms 91us 300ns
Heads: 34 359 767 603, 50%
Tails: 34 359 709 133, 50%

And that’s another nice x1.25 speed boost.

How can this even work? On a machine with 8 physical cores and 16 logical cores, the best performance comes with 1024 threads! Strange. What about constant context switches, cache contention, and so on? I can suggest a few possible factors:

- with far more threads than cores, the scheduler always has a ready thread to swap in the moment one stalls on memory or gets preempted by a background process;
- smaller chunks even out load imbalance, so no core sits idle at the end waiting for one slowpoke thread to finish;
- for a workload this uniform, the context-switch overhead is apparently cheaper than the stalls it hides.

These are just my guesses about where the truth might lie. Crowd-sourced insights would be welcome.

SIMD

It might seem that SIMD won’t help in our task — SIMD instructions are inefficient in a “horizontal” scenario, i.e., when working within a single SIMD object. SIMD is designed for extremely efficient vector-to-vector operations. Moreover, we’re basically bottlenecked on literally a single machine instruction — POPCNT — which is very hard to beat. The hot path of our algorithm is right here:

    for (BigInt i = 0; i < chunkSize; ++i) {
        BitsType bits = gen();
->      localHeads += std::popcount(bits);  // hot
    }

But I still decided to try my luck. This time I turned to the tempter — ChatGPT — because I had no desire to dive into the brutal SIMD syntax myself, and I didn’t know what variety of instructions might theoretically suit the task.

My request was concise (because that’s how you have to deal with “iron beasts”): I have a handful of uint64_t numbers, and I want to perform a massive SIMD attack on them to count all their one bits — returning the result as the total sum of set bits across all numbers.

First, I was pleased to learn that AVX-512 literally has a SIMD popcount — the instruction _mm512_popcnt_epi64. ChatGPT even suggested a wrapper function I could conveniently use:

uint64_t avx512_popcount_sum(const uint64_t* data, size_t n) {
    __m512i acc = _mm512_setzero_si512();
    size_t i = 0;
    for (; i + 8 <= n; i += 8) {
        __m512i v = _mm512_loadu_si512(&data[i]);
        __m512i pc = _mm512_popcnt_epi64(v);
        acc = _mm512_add_epi64(acc, pc);
    }
    // horizontal reduce
    uint64_t result = _mm512_reduce_add_epi64(acc);
    // tail
    for (; i < n; i++)
        result += std::popcount(data[i]);
    return result;
}

The code wasn’t even particularly scary or long, as often happens with SIMD.

I was very eager to enable AVX-512 in my Visual Studio project, so I mindlessly rebuilt the old version of the program (still without SIMD instructions and without the suggested function) and ran it. Right from the start, I got a crash with the message “illegal instruction.” This meant two things:

- my CPU doesn't support AVX-512, so any AVX-512 instruction kills the process on the spot;
- once the compiler is allowed to use AVX-512, it sprinkles those instructions even into code that seemingly has nothing to vectorize, like this innocent percentage math:

const double headsPercent = static_cast<double>(experiment.heads) / n * 100.0;
const double tailsPercent = static_cast<double>(experiment.tails) / n * 100.0;

// And where exactly did the compiler plan to use AVX-512 here? I'm not digging into the assembly, sorry

Of course, I was a bit disappointed by this turn of events, because this elite SIMD could actually perform eight POPCNT in parallel. Sure, it probably wouldn’t have been a clean x8 speedup, but there would have been something.

So in the end I was left alone with AVX2 — at least that one’s available, I know it, I’ve used it. The problem is, it doesn’t even have anything like _mm512_popcnt_epi64. What it does have is someone’s algorithm showing how, with duct tape and the instructions we do get, you can shuffle the bits to death until, miraculously, out pops the number of set bits across all four 64-bit integers fed in. Which means not only are we limited to, at best, a 4× boost, but we can’t even really count on that, since the solution is guaranteed to be suboptimal from a SIMD standpoint. Add in the overhead of moving numbers into and out of SIMD registers, and you’re looking at something close to “break-even.” Still, if you don’t test, you don’t know. So I decided to give it a try.

First of all, here’s what ChatGPT brought me for dinner:

uint64_t avx2Popcount(const uint64_t* data, size_t n) {
    static const __m256i lut = _mm256_setr_epi8(
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4,
        0,1,1,2,1,2,2,3,1,2,2,3,2,3,3,4
    );

    __m256i total = _mm256_setzero_si256();

    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        __m256i v = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(&data[i]));

        __m256i lo = _mm256_and_si256(v, _mm256_set1_epi8(0x0F));
        __m256i hi = _mm256_and_si256(_mm256_srli_epi16(v, 4), _mm256_set1_epi8(0x0F));

        __m256i cnt = _mm256_add_epi8(
            _mm256_shuffle_epi8(lut, lo),
            _mm256_shuffle_epi8(lut, hi)
        );

        total = _mm256_add_epi64(total, _mm256_sad_epu8(cnt, _mm256_setzero_si256()));
    }

    // Reduce total (4×64 → 1×64)
    __m128i low128  = _mm256_castsi256_si128(total);
    __m128i high128 = _mm256_extracti128_si256(total, 1);
    __m128i sum128  = _mm_add_epi64(low128, high128);
    uint64_t result = _mm_cvtsi128_si64(sum128) + static_cast<uint64_t>(_mm_extract_epi64(sum128, 1));

    // Tail
    for (; i < n; i++)
        result += std::popcount(data[i]);

    return result;
}

Disgusting.

But we still have to prepare our code just to get the privilege of using this divine function. We need a way to generate a certain number of uint64_t values to feed into it. The question is — how many? Truth is, nobody can give you a definitive answer here, you’ll have to tune and tweak this number experimentally. So let’s declare a global constant for that:

static constexpr size_t SIMD_BATCH = 4;

Let’s keep it at 4 for now. That’s basically just one round of __m256i, and it’s probably not enough, but at least it’s a start. We’ll have plenty of time to tweak that constant once everything else is in place.

Let’s recall how each thread is currently generating and processing numbers:

Generator gen {std::random_device{}()};
BigInt localHeads = 0;
for (BigInt i = 0; i < chunkSize; ++i) {
    localHeads += std::popcount(gen());
}

So right now we generate the next number and immediately count its bits. Now we need to generate SIMD_BATCH numbers at once and then pass them into avx2Popcount. That means we’ll have to allocate a bit of memory for an array of numbers and do a little loop black magic:

Generator gen{ std::random_device{}() };
BigInt localHeads = 0;
BitsType bits[SIMD_BATCH];

for (BigInt i = 0; i < chunkSize / SIMD_BATCH; ++i) {
    for (size_t j = 0; j < SIMD_BATCH; ++j) {
        bits[j] = gen();
    }
    localHeads += avx2Popcount(bits, SIMD_BATCH); // pass the batch and its size, matching the signature above
}

We also have to keep in mind that now the number of coin tosses must be divisible not just by 64, like before, but by SIMD_BATCH * 64, since that’s exactly how many bits (coin tosses) we generate per SIMD round. A non-divisible count is something we definitely don’t want, because handling leftover tails that don’t fit into SIMD means extra costs we don’t need. It’s much simpler to just make the iteration count fit the requirement. So let’s do that:

static constexpr size_t SIMD_BATCH_BITS = SIMD_BATCH * sizeof(BigInt) * CHAR_BIT;

...

Experiment(BigInt n_)
    : n(n_) {
    if (n % SIMD_BATCH_BITS != 0) {
        std::cerr << "Error: n must be a multiple of " << SIMD_BATCH_BITS << ", got " << n << std::endl;
        std::abort();
    }
}

And let’s not forget that SIMD_BATCH itself has to be a multiple of four, since we’re working with __m256. So somewhere we’ll drop in a nice little static_assert:

static_assert(SIMD_BATCH % 4 == 0, "shouldda be multiple of four!");

Now we can tune SIMD_BATCH without fear of ending up with gibberish — if something’s off, the compiler will yell at us and we’ll fix it. I settled on the sweet spot of 32 for my setup, and with that I got the following result:

Time:  60ms 906us 200ns
Heads: 34 359 807 145, 50.0001%
Tails: 34 359 669 591, 49.9999%

x1.23. Sure, it’s not four times faster, but at least I didn’t end up back at square one after rewriting the code.

Optimization Results

That wraps up this optimization saga. Now we can evaluate what we’ve done and lock in the results. Here’s a summary table of all the optimizations:

| Milestone | Time | Speedup |
| -------------------- | ----------------------------- | ------- |
| Measurement Start | 20min 41s 897ms 383us 600ns | - |
| Without distribution | 10min 50s 93ms 886us 900ns | x1.9 |
| Lazy generator | 7min 8s 600ms 933us | x1.5 |
| Restructuring | 19s 826ms 601us | x21.6 |
| std::popcount() | 5s 460ms 958us 600ns | x3.6 |
| 64-bit generator | 2s 554ms 51us 200ns | x2.1 |
| Multithreading | 126ms 479us 900ns | x20.2 |
| Custom Generator | 93ms 529us 800ns | x1.4 |
| More threads | 75ms 91us 300ns | x1.25 |
| SIMD | 60ms 906us 200ns | x1.23 |

Final Speedup: x20,235!

Honestly, there should’ve been a nice illustrative chart instead of a table, but the speedup was so massive that the chart, as soon as it started, just plummeted to the X-axis due to an absurdly large Y-range. Still, I tried to improvise and present to you my little hack — a chart with nested ranges:

In short, I don’t know any better way to show the scale of a twenty-thousand-fold speedup — the result is just too good :) Read the chart like this — as soon as it drops to the X-axis, move to the next chart, which zooms in on the Y-axis. If the labels are hard to read, the main chart covers a range of \[0; 1400\] seconds, the second-level chart covers \[0; 20\] seconds, and the third — \[0; 0.14\] seconds.

It’s also worth calculating the speed of our automated coin tosser. 68,719,476,736 tosses in about 61 milliseconds — which equals:

68 719 476 736 / 61ms
1 128 283 766 / ms
1 128 283 / us
1 128 / ns

Well, 1k/ns isn’t exactly 300k/ns, but still — not bad.

What else could speed this up?

Oh boy, I tried a bunch of other things:

But none of this changed execution speed. At this point, I think my optimization powers are exhausted. Still, I have plans for the future:

If any of this happens, I’ll write a separate article about it.

Big and Small Numbers

If you thought the article was logically wrapping up here, you’re sorely mistaken.

Now, armed with the fastest coin toss generator I could possibly make — a thousand tosses per nanosecond — we can make one last leap and turn our attention to the distribution of toss results.

We know the time to reach a perfect distribution tends to infinity. So our only real goal is to capture how close we can get to an ideal distribution in a reasonable time.

For this task, we need to represent very small floating-point numbers with high precision — the more decimal places we can accurately calculate, the better. And here lies a problem with double.

What’s wrong with it? It’s a floating-point number, and even if it were 128-bit or 256-bit, we can’t guarantee which numbers it will represent exactly and which it won’t. You can check this easily — I once wrote an IEEE-754 calculator that shows even something like 0.1 can’t be represented exactly as a floating-point number. A float will store something like 0.10000000149011611938, a 64-bit double0.10000000000000000555, and so on.

To compute precisely, we need fixed-point numbers, often called decimals. The idea is simple: under the hood, they store an integer — the number of fractional units. A typical example is money: under the hood we store everything in kopeks, and outwardly we treat them as rubles and their fractions. Money always has two decimal places. That’s fixed-point for you. C++ STL doesn’t have this by default.
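A two-line illustration of the money example (hypothetical bookkeeping, just to show the idea; std::setw and std::setfill come from <iomanip>):

int64_t kopeks = 12345; // stored: an exact integer count of kopeks
std::cout << kopeks / 100 << "." << std::setw(2) << std::setfill('0') << kopeks % 100 << std::endl; // shown: 123.45 (rubles)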

We’re going to implement a decimal type tailored to our very narrow task.

Note: earlier we used percentages to represent distributions. Now we’re ditching that and using 0.5 instead of 50%. This saves unnecessary conversions and unifies the idea of a small number.

For our decimal type, it makes sense to use a uint64_t under the hood. Since all numbers will be less than one, we can dedicate all digits to the fractional part. To fully exploit uint64_t, we can refer to the constant std::numeric_limits<uint64_t>::digits10, which gives the maximum number of decimal digits we can store without overflow. For uint64_t this is 19. Indeed, the maximum number for uint64_t is 18,446,744,073,709,551,615 — 20 digits, but not every number with 20 digits fits in 64 bits. So 19 is safe, and even 9,999,999,999,999,999,999 fits without breaking anything. That means we can have 19 decimal places of precision. I think that’s enough. If not, we can always simulate 128 bits.

Let’s implement the skeleton of the Decimal class:

class Decimal
{
public:
    Decimal(size_t digitsAfterComma_, uint64_t underlying_ = 0)
       : underlying(underlying_), digitsAfterComma(digitsAfterComma_)
    {
        if (digitsAfterComma > std::numeric_limits<uint64_t>::digits10) {
            std::cerr << "Error: you've exceeded limit of digits after fixed point" << std::endl;
            std::abort();
        }
        scale = 1;
        for (size_t i = 0; i < digitsAfterComma; ++i) {
            scale *= 10; // integer power of ten; std::pow would route us through floating point
        }
    }

    std::string toString() const {
        std::string digits = std::to_string(underlying);
        // pad with leading zeros up to full precision, so that 5 at two digits prints as 0.05, not 0.5
        if (digits.size() < digitsAfterComma) {
            digits.insert(0, digitsAfterComma - digits.size(), '0');
        }

        std::string res("0.");
        res += digits;
        while (res.size() > 3 && res.back() == '0') { // strip trailing zeros, but keep at least one digit
            res.pop_back();
        }
        return res;
    }

private:
    uint64_t underlying{};
    size_t scale{};
    size_t digitsAfterComma{};
};

The public interface is almost nonexistent for now, but we’ll keep adding to it as we progress and figure out exactly what we need. Notice that we’ve made it possible to create numbers of any precision, capping digitsAfterComma at 19 digits. We’ll need numbers of varying precision further on — but more on that when we get there.
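A quick sanity check of the skeleton (assuming the class above is in scope — note how the leading zero in 0.05 survives):

Decimal a(3, 125);
std::cout << a.toString() << std::endl; // 0.125

Decimal b(3, 50);
std::cout << b.toString() << std::endl; // 0.05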

Division Challenges

Let’s recall how we calculated the ratio of heads to tails:

const double headsPercent = static_cast<double>(experiment.heads) / n * 100.0;
const double tailsPercent = static_cast<double>(experiment.tails) / n * 100.0;

Let’s take our case with 68,719,476,736 tosses, which we’ve so painstakingly optimized. Imagine we got 34,359,743,765 heads and 34,359,732,971 tails. I punched the numbers into the Windows calculator — the ratio comes out as:

0.5000000785366864875 / 0.4999999214633135125.

By the way, these are numbers from the chapter “Counting Bits Properly,” and double there already choked and just showed 50%/50%. Or maybe the double stream formatting decided to be lazy. Either way, we want more precision.

So, to calculate the desired Decimal, we need to divide 34,359,743,765 by 68,719,476,736. Hmm. But how should we divide?

Cast to double and then somehow convert to Decimal? Pointless — we’d lose all precision and defeat the whole purpose of Decimal. Make a Decimal directly from these large numbers? They won’t fit, because in our Decimal all digits are devoted to the fractional part.

At some point, I realized mathematically it doesn’t matter whether you divide 34359743765 / 68719476736 or 0.34359743765 / 0.68719476736 — the result is the same! That means we can simply pull these numbers into our target Decimal as Decimal(19, 34'359'743'765), and nothing bad happens. All that remains is to divide one Decimal by another.

How do you properly divide Decimal numbers? Obviously, you can’t just divide the underlying variables — in our case that would give 34359743765 / 68719476736 = 0. You could cheat by casting both underlying values to double and dividing, but again, that kills the whole point of Decimal.

The correct way is a bit craftier: it’s integer division, but first you multiply the numerator by the scale, then perform integer division, and then divide the result by the scale again. That is:

343597437650000000000000000000 / 68719476736 = 5000000785366864874

5000000785366864874 / 10000000000000000000 = 0.5000000785366864874

Hmm. Yeah, the scale for numbers with 19 digits after the decimal point is indeed $10^{19}$, which, as we recall, is already pushing the limits of what uint64_t can hold. And multiplying it by another huge number is a no-go — instant overflow and no way to compute anything. Frustrating — we need to do an impossible multiplication just to divide it right back. Even worse, in reality, the result will still end up close to 0.5 anyway.

I paced around, scratched my head over this for a while. Sat down with pen and paper. And you know what? On paper, it worked perfectly, because I could do long division of any numbers by hand, unlike this dumb machine… Wait, OH SHI~ Long division! We’re going to simulate long division.

Long division is an interesting thing — it either stops at some point and gives a finite number, or it goes on forever, spitting out gems like 0.16666666…. Either way, the process can be represented as a digit generator — you can always ask for one more digit; and you either get it or you’re told it’s the end. And that’s just a wonderful concept.

Without thinking too long, I sketched such a generator. Keep in mind, it’s heavily optimized for our case — the dividend is always smaller than the divisor, and the digits we output are those after 0.:

class Divider
{
public:
    Divider(BigInt nom_, BigInt denom_)
        : nom(nom_)
        , denom(denom_)
    {}

    // yields the next digit after "0."; returns -1 once the division terminates
    int operator()() {
        if (nom == 0) {
            return -1; // nothing left to divide — the fraction is finite
        }

        // the classic long-division step: bring down a zero
        nom *= 10;
        if (nom < denom) {
            return 0;
        }

        // find the largest multiple of denom that still fits into nom
        BigInt acc = denom;
        while (acc <= nom) {
            acc += denom;
        }
        acc -= denom;

        int digit = static_cast<int>(acc / denom);
        nom -= acc; // the remainder carries over to the next step

        return digit;
    }

private:
    BigInt nom{};
    BigInt denom{};
};

I might have missed some edge cases, but overall it works. Here’s an example of how such a division generator operates:

// 1 / 12 = 0.083333...
{
    Divider d(1, 12);
    d(); // 0
    d(); // 8
    d(); // 3
    d(); // 3
    d(); // 3
    d(); // 3
}

// 1 / 2 = 0.5
{
    Divider d(1, 2);
    d(); // 5
    d(); // -1
    d(); // -1
    d(); // -1
}

Here’s how I plan to pull it all together and perform the division — we wrap everything in a single function:

Decimal calcPercent(BigInt nom, BigInt denom, size_t digitsAfterComma) {
    Decimal res(digitsAfterComma);
    Divider divider(nom, denom);

    size_t scale = res.getScale() / 10;

    while (scale != 0) {
        int digit = divider();
        if (digit < 0) {
            break;
        }
        res.addFracts(digit * scale);
        scale /= 10;
    }

    // rounding
    int digit = divider();
    if (digit >= 5) {
        res.addFracts(1);
    }

    return res;
}

We see that the Decimal class needs a bit of an interface upgrade: calcPercent() relies on a getScale() getter and an addFracts() method. Playing prophet here, I’ll also add that we’ll need an isHalf() method — it’ll come in handy later, you’ll see:

class Decimal
{
    ...
    size_t getScale() const {
        return scale;
    }

    void addFracts(size_t fracts) {
       underlying += fracts;
    }

    bool isHalf() const {
        return underlying == 5 * (scale/10);
    }
    ...
};

Now we’re ready. Didn’t think dividing numbers could be this complicated. But hey, it was fun.
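Before wiring it in, let’s sanity-check the whole pipeline against the numbers we punched into the calculator earlier (assuming the snippets above are assembled into one translation unit):

// 34'359'743'765 / 68'719'476'736 with 19 digits after the point;
// the 20th digit rounds the result up to ...875, matching the calculator
Decimal p = calcPercent(34'359'743'765ull, 68'719'476'736ull, 19);
std::cout << p.toString() << std::endl; // 0.5000000785366864875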

We replace the double-division with our new, precise method:

const Decimal headsPercent = calcPercent(experiment.heads, n, 19);
const Decimal tailsPercent = calcPercent(experiment.tails, n, 19);

and run our usual optimization test:

68 719 476 736 rounds
Time:  67ms 411us 500ns
Heads: 34 359 684 850, 0.4999992212106008083
Tails: 34 359 791 886, 0.5000007787893991917

Now we’re talking: this isn’t some measly 50.0001%, this is precision at your fingertips. Time to move to the next stage.

Mining Coins

Now, using Decimal, I want to get the most precise approach to 50/50 possible within a reasonable time. The idea: toss coins in batches, and every time the heads/tails ratio rounds to exactly 0.5 at the current precision, record the event and raise the bar by one more digit.

This becomes an endlessly running precision miner. It’ll tell us how long and how many coin tosses it takes to achieve increasingly precise distributions. When/if we mine precision to 19 digits, we’ll consider the experiment fully complete for technical reasons.

First, for convenience, we need to modify the Experiment class so its spin method can be called incrementally any number of times, with state accumulating across calls: the time measurements and the number of coin tosses already made must be preserved — there’s no point starting from scratch every time, since previous results and measurements remain valid and save us resources. Let’s make the necessary adjustments to the class:

struct Experiment
{
    ...

    BigInt heads = 0;
    BigInt tossed = 0;
    nanoseconds timePassed{};

    void spin(BigInt n) {
        auto start = Clock::now();

        ... // all the work and `heads` calculation

        // measurements accumulate across calls
        tossed += n;
        auto dt = Clock::now() - start;
        timePassed += dt;
    }

    BigInt getTails() const {
        return tossed - heads;
    }
};

Here’s the rough version of the miner:

constexpr size_t MAX_PRECISION = std::numeric_limits<uint64_t>::digits10;
constexpr BigInt SPIN_STEP = 1ull << 21;

Experiment ex;

const auto reportSuccess = [&ex](size_t precision) -> bool {
    Decimal heads = calcPercent(ex.heads, ex.tossed, precision);
    Decimal tails = calcPercent(ex.getTails(), ex.tossed, precision);

    if (heads.isHalf() || tails.isHalf()) {
        std::cout << "Tossed: " << prettifyBigInt(ex.tossed) << std::endl;
        std::cout << "Heads:  " << prettifyBigInt(ex.heads) << ", " << heads.toString() << std::endl;
        std::cout << "Tails:  " << prettifyBigInt(ex.getTails()) << ", " << tails.toString() << std::endl;
        std::cout << "Time:   " << prettyDuration(ex.timePassed) << std::endl << std::endl;
        return true;
    }

    return false;
};

for (size_t precision = 1; precision <= MAX_PRECISION; ++precision) {
    std::cout << "precision: " << precision << '\n';

    while (true) {
        ex.spin(SPIN_STEP);
        if (reportSuccess(precision)) {
            break;
        }
    }
}

First run:

precision: 1
Tossed: 2 097 152
Heads:  1 048 560, 0.5
Tails:  1 048 592, 0.5
Time:   3ms 275us 200ns

precision: 2
Tossed: 4 194 304
Heads:  2 097 912, 0.5
Tails:  2 096 392, 0.5
Time:   4ms 796us 600ns

precision: 3
Tossed: 6 291 456
Heads:  3 147 050, 0.5
Tails:  3 144 406, 0.5
Time:   6ms 403us 800ns

precision: 4
Tossed: 16 777 216
Heads:  8 388 366, 0.5
Tails:  8 388 850, 0.5
Time:   13ms 117us 700ns

precision: 5
Tossed: 20 971 520
Heads:  10 485 662, 0.5
Tails:  10 485 858, 0.5
Time:   16ms 578us 700ns

precision: 6
Tossed: 1 333 788 672
Heads:  666 894 992, 0.5
Tails:  666 893 680, 0.5
Time:   711ms 941us 100ns

...

We cautiously assume it works. But there are problems.

First, it feels underwhelming that whenever we break a record, our Decimal — having hit its precision limit — proudly declares “I am exactly 0.5!”, and that’s exactly what we print to the screen: always just 0.5. It’s unclear why we even bother showing it. At the same time, it’d be nice to see the actual number we got.

Here’s my trick: I’ll compute the division for the Decimal at the miner’s current precision, but in parallel I’ll compute the same division again at maximum possible precision. The decision of whether we’ve achieved the desired precision will be made based on the “rough” Decimal, while the displayed value will be the “fine-grained” Decimal. Sounds confusing, but the code will make it clearer:

Decimal headsRough = calcPercent(ex.heads, ex.tossed, precision);
Decimal tailsRough = calcPercent(ex.getTails(), ex.tossed, precision);

Decimal headsDetailed = calcPercent(ex.heads, ex.tossed, MAX_PRECISION);
Decimal tailsDetailed = calcPercent(ex.getTails(), ex.tossed, MAX_PRECISION);

if (headsRough.isHalf() || tailsRough.isHalf()) {
    // display `headsDetailed` and `tailsDetailed`
}

The second problem isn’t as obvious — I felt it more than I saw it. Look at the number of iterations needed for each new record — it’s different every time. I began to suspect that at least at low precisions, one round could easily “grab” several records at once, jumping, for example, from 0.51 to 0.5006.

The fix for this, even if it’s a hypothetical issue, is simple — before starting each experiment, check whether the experiment has already beaten the precision level we’re about to target:

for (size_t precision = 1; precision <= MAX_PRECISION; ++precision) {
    std::cout << "precision: " << precision << '\n';

+   if (reportSuccess(precision)) {
+       continue;
+   }

    while (true) {
        ex.spin(SPIN_STEP);
        if (reportSuccess(precision)) {
            break;
        }
    }
}

In fact, this problem might or might not show up depending on the chosen spin step, and here we’re approaching the last problem. That problem — choosing SPIN_STEP — is really hard. I set it to 1ull << 21 (that’s $2^{21}$), but that was the result of a tiresome search. $2^{20}$ gave a snail’s pace, and I didn’t have the patience to even reach precision: 1. $2^{22}$ was too fast and immediately hit huge values that took forever to compute.

This, by the way, shows that tuning the miner for optimal speed isn’t easy — it will strongly depend on the step you choose. I reasoned the step should be adaptive — small enough at the start, then increasing. I introduced a minimum step and a maximum step:

constexpr BigInt MIN_SPIN_STEP = SIMD_BATCH_BITS;
constexpr BigInt MAX_SPIN_STEP = 4'294'967'296;
BigInt step = MIN_SPIN_STEP;

Then I started calculating the current step dynamically on the fly, based on how many times the experiment had already tossed the coin:

step = experiment.tossed < step ? step : std::min(step * 2, MAX_SPIN_STEP);
experiment.spin(step);

That is, we double the step each time the total toss count catches up with it, until we hit the maximum. The maximum here is purely nominal — just in case — and I picked a tested, fairly large value: 4,294,967,296.
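Here’s a toy trace of the schedule this produces (a sketch — the MIN/MAX values are stand-ins, not the experiment’s exact constants):

#include <algorithm>
#include <cstdint>
#include <iostream>

int main()
{
    const uint64_t MIN_SPIN_STEP = 512;
    const uint64_t MAX_SPIN_STEP = 4'294'967'296ull;

    uint64_t step = MIN_SPIN_STEP;
    uint64_t tossed = 0;
    for (int i = 0; i < 6; ++i) {
        step = tossed < step ? step : std::min(step * 2, MAX_SPIN_STEP);
        tossed += step; // stand-in for experiment.spin(step)
        std::cout << "step " << step << ", tossed " << tossed << '\n';
    }
    // step 512, tossed 512
    // step 1024, tossed 1536
    // step 2048, tossed 3584
    // ...doubling each time the toss count catches up with the step
}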

The miner now works as intended. At least, as I want it to. Here are the first six captures:

precision: 1
Tossed: 33 552 384
 Heads: 15 724 557, 0.4686569216661325765
 Tails: 17 827 827, 0.5313430783338674235
  Time: 18ms 222us 700ns

precision: 2
Tossed: 268 433 408
 Heads: 133 163 234, 0.4960754884876326571
 Tails: 135 270 174, 0.5039245115123673429
  Time: 21ms 636us 200ns

precision: 3
Tossed: 2 147 481 600
 Heads: 1 072 710 984, 0.4995204540984192833
 Tails: 1 074 770 616, 0.5004795459015807167
  Time: 25ms 722us 500ns

precision: 4
Tossed: 21 474 834 432
 Heads: 10 736 419 663, 0.4999535478141562046
 Tails: 10 738 414 769, 0.5000464521858437954
  Time: 46ms 516us 900ns

precision: 5
Tossed: 227 633 264 640
 Heads: 113 815 517 221, 0.4999951013354671009
 Tails: 113 817 747 419, 0.5000048986645328991
  Time: 253ms 918us 100ns

precision: 6
Tossed: 1 026 497 181 696
 Heads: 513 248 105 522, 0.4999995272018192996
 Tails: 513 249 076 174, 0.5000004727981807004
  Time: 1s 38ms 463us 500ns

...

Works great — and most importantly, now the progress in precision is clearly visible through the growing strings of zeros and nines. Quite satisfying to watch.

Persistence

Once I had the miner tuned well enough to work reliably, I realized that somewhere past 10–12 digits after the decimal point (which take just a few minutes), the program runs into serious wall-clock limits. Presumably, reaching further precision could take hours or even days — in fact, I don’t know yet, because I haven’t tested it.

I decided I want to run it for several days to see what happens. But I don’t want to lose the data mined with such stubborn effort. If anything goes wrong — the program crashes or there’s a power outage — I want the results to be preserved.

So we need to save the data to a file. But we still want to print to the console, because that’s convenient. We want identical records in both places, meaning we either duplicate every output or create a system to automatically write to two streams at once. Obviously, the second option is preferable, so I set about writing a StreamOverdub class.

class StreambufOverdub : public std::streambuf {
public:
    StreambufOverdub(std::streambuf* sb1, std::streambuf* sb2)
        : sb1(sb1), sb2(sb2) {
    }

protected:
    // every character written goes to both underlying buffers
    int overflow(int c) override {
        if (c == EOF) return !EOF; // no real character to write — report success
        const int r1 = sb1->sputc(static_cast<char>(c));
        const int r2 = sb2->sputc(static_cast<char>(c));
        return (r1 == EOF || r2 == EOF) ? EOF : c;
    }

    // flush requests propagate to both buffers too
    int sync() override {
        int const r1 = sb1->pubsync();
        int const r2 = sb2->pubsync();
        return (r1 == 0 && r2 == 0) ? 0 : -1;
    }

private:
    std::streambuf* sb1;
    std::streambuf* sb2;
};

class StreamOverdub : public std::ostream {
public:
    StreamOverdub(std::ostream& o1, std::ostream& o2)
        : std::ostream(&tbuf)
        , tbuf(o1.rdbuf(), o2.rdbuf()) {
    }

private:
    StreambufOverdub tbuf;
};

As you can see, the whole trick is wrapping two std::streambuf objects in a proxy. Now, if you create a StreamOverdub like this:

std::ofstream file("mined coins.txt", std::ios::trunc);
StreamOverdub out(file, std::cout);

then we can simply replace every std::cout with out and get output to both console and file for the cost of a single output operation. Also, after each successful capture, it’s wise to call out.flush() so the file is definitely updated with fresh data.

Now no catastrophe can destroy our mined results, and we can confidently run the miner through the weekend.

What I Mined

So, here I present to you the results of two days of mining:

precision: 1
Tossed: 33 552 384
 Heads: 15 724 557, 0.4686569216661325765
 Tails: 17 827 827, 0.5313430783338674235
  Time: 18ms 222us 700ns

precision: 2
Tossed: 268 433 408
 Heads: 133 163 234, 0.4960754884876326571
 Tails: 135 270 174, 0.5039245115123673429
  Time: 21ms 636us 200ns

precision: 3
Tossed: 2 147 481 600
 Heads: 1 072 710 984, 0.4995204540984192833
 Tails: 1 074 770 616, 0.5004795459015807167
  Time: 25ms 722us 500ns

precision: 4
Tossed: 21 474 834 432
 Heads: 10 736 419 663, 0.4999535478141562046
 Tails: 10 738 414 769, 0.5000464521858437954
  Time: 46ms 516us 900ns

precision: 5
Tossed: 227 633 264 640
 Heads: 113 815 517 221, 0.4999951013354671009
 Tails: 113 817 747 419, 0.5000048986645328991
  Time: 253ms 918us 100ns

precision: 6
Tossed: 1 026 497 181 696
 Heads: 513 248 105 522, 0.4999995272018192996
 Tails: 513 249 076 174, 0.5000004727981807004
  Time: 1s 38ms 463us 500ns

precision: 7
Tossed: 10 647 223 924 736
 Heads: 5 323 611 436 183, 0.4999999505800757343
 Tails: 5 323 612 488 553, 0.5000000494199242657
  Time: 10s 355ms 188us 500ns

precision: 8
Tossed: 11 583 526 795 264
 Heads: 5 791 763 344 721, 0.4999999954322201748
 Tails: 5 791 763 450 543, 0.5000000045677798252
  Time: 11s 292ms 750us 400ns

precision: 9
Tossed: 11 785 390 258 176
 Heads: 5 892 695 123 667, 0.4999999995400237174
 Tails: 5 892 695 134 509, 0.5000000004599762826
  Time: 11s 483ms 665us

precision: 10
Tossed: 13 284 333 844 480
 Heads: 6 642 166 921 632, 0.4999999999542318036
 Tails: 6 642 166 922 848, 0.5000000000457681964
  Time: 12s 903ms 693us 300ns

precision: 11
Tossed: 41 145 786 693 632
 Heads: 20 572 893 346 782, 0.4999999999991736699
 Tails: 20 572 893 346 850, 0.5000000000008263301
  Time: 1min 36s 550ms 404us 400ns

precision: 12
Tossed: 166 129 335 007 232
 Heads: 83 064 667 503 543, 0.4999999999995605833
 Tails: 83 064 667 503 689, 0.5000000000004394167
  Time: 3min 50s 949ms 149us 600ns

precision: 13
Tossed: 6 560 038 558 627 840
 Heads: 3 280 019 279 313 828, 0.4999999999999859757
 Tails: 3 280 019 279 314 012, 0.5000000000000140243
  Time: 6h 22min 42s 628ms 678us 500ns

precision: 14
Tossed: 6 568 598 428 448 768
 Heads: 3 284 299 214 224 400, 0.5000000000000024358
 Tails: 3 284 299 214 224 368, 0.4999999999999975642
  Time: 6h 22min 51s 19ms 966us 300ns

Beautiful. Notably, achieving 12 digits of precision took just under four minutes, but the next result required a six-hour wait. Then suddenly, the 13th and 14th digits arrived almost together, about ten seconds apart. Oh wow. The rest of the weekend the machine just burned CPU for nothing :) Or maybe I should’ve waited a bit longer — who knows.

In fact, this isn’t my only long run — such results repeat again and again. And we still haven’t touched the limits of 64-bit numbers — reaching 19 digits of precision is still a moonshot for us. So let’s leave it there.

Conclusions

Wrapping up this long journey, we record the following results:

- the automated tosser runs at roughly a thousand coin tosses per nanosecond;
- the best capture is 14 digits of precision — 6 568 598 428 448 768 tosses and about six and a half hours of machine time;
- the full 19 digits that uint64_t allows remain unconquered.

I think the scholars of the past could be proud of today’s computational capabilities.

Modern-day scholars in the comments can tell you why my experiment isn’t rigorous, but I achieved my main goal — I had fun.

A constexpr solution to the problem

Actually, there’s a compile-time solution to this whole problem. All its code fits in these lines:

#include <iostream>

int main()
{
    std::cout << "precision: inf" << std::endl;
    std::cout << "Tossed: inf" << std::endl;
    std::cout << " Heads: inf, 0.5" << std::endl;
    std::cout << " Tails: inf, 0.5" << std::endl;
    std::cout << "  Time: 0" << std::endl;

    return 0;
}

It outputs:

precision: inf
Tossed: inf
 Heads: inf, 0.5
 Tails: inf, 0.5
  Time: 0

I consider this solution optimal.

UPD

I’m attaching the GitHub repository for anyone curious to dig in. If you have ideas to speed up the program, I’m looking forward to your Pull Request.


© Nikolai Shalakin. Translated by the author.