Friday, September 29, 2017

Floating point numbers made simple

Based on a great post by Fabien Sanglard over at http://fabiensanglard.net/floating_point_visually_explained/, I figured I should add some more information about floating point numbers. The information applies to any programming language and comes in handy in many weird cases. I recently helped somebody debug a memory corruption where his data was being overwritten by floating point numbers, and being able to recognize one on sight is what made that possible.

Basic floats (recap)

Fabien's basic explanation comes down to this: a floating point number lives in a "window" of numbers, and the encoding stores which window it is in, and at what offset within that window it sits. Each window runs from a given power of two to the next power of two - for example, from 4 to 8, or from 1024 to 2048. Every window contains exactly the same amount of numbers: 8388608 (that's 2^23), for a 32-bit float.
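
You can see the windows directly on your machine. Here's a minimal sketch using nextafterf from <math.h>, which returns the next representable float after its first argument: the spacing between neighbouring floats is constant within a window, and grows as the windows get bigger.

    #include <stdio.h>
    #include <math.h>

    int main() {
      /* nextafterf gives the next representable float, so subtracting
         shows the spacing between neighbours inside a window. */
      printf("%g\n", nextafterf(4.0f, 8.0f) - 4.0f);          // 4.76837e-07
      printf("%g\n", nextafterf(1024.0f, 2048.0f) - 1024.0f); // 0.00012207...
    }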

This has three important implications. First off, not all numbers can be represented. If your window is really big, there are not enough numbers to go around to represent everything in it accurately, so (for example) the number 16777217 is not representable as a float, no matter how hard you try. I picked this number specifically because it is 1 higher than 2^24, so it falls in the window from 16777216 to 33554432 - that's a 16.7 million number range, and there are only 8388608 numbers in a window, so only every other whole number in it gets to be representable accurately as a float.
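
You can verify this yourself - the sketch below stores 16777217 in a float, and the nearest representable neighbour comes out instead:

    #include <stdio.h>

    int main() {
      float f = 16777217.0f;  /* 2^24 + 1: falls between two representable floats */
      printf("%.1f\n", f);    /* prints 16777216.0, the nearest one */
    }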

Second, not all common numbers are in a window. The floating point standard defines five types of numbers:
  • Normal numbers that follow the rules as set out above - by far the most common numbers you will find
  • Zeroes - and yes, that's plural, because we have positive zero and negative zero
  • Infinities - again plural, because we have positive infinity and negative infinity
  • Not-a-number, the representation used when you've made a calculation that results in mathematical nonsense
  • Denormals - I will give a short intro below, but it boils down to not really being able to use them
Third, there is only a finite number of windows to go around. In a float there are 8 bits reserved for the window, which allows for 256 different positive windows (and 256 equivalent negative windows). Two of these windows, the first and the last one, are used for the special number encodings. That leaves 254 windows for normal numbers. These are balanced around the middle, so window #127 ends up mapping from 1 to 2, window #126 maps from 0.5 to 1 and window #128 maps from 2 to 4.
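
You can even ask the C library which window a number sits in: frexpf from <math.h> splits a float into a fraction and a power of two - essentially the offset and the window. A tiny sketch:

    #include <stdio.h>
    #include <math.h>

    int main() {
      int e;
      float m = frexpf(3.14159265f, &e);  /* m in [0.5, 1), pi = m * 2^e */
      printf("%f * 2^%d\n", m, e);        /* prints 0.785398 * 2^2 */
    }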

Doing a bit of math shows that window #1, the first one we can use for normal numbers, maps from 1.1754944e-38 to 2.3509887e-38. That first number is 2^-126, and is exactly what FLT_MIN represents - the smallest normal floating-point number. Given that all windows hold the same amount of numbers, we now also know that there are 8 million normal floating point numbers smaller than 2.3509887e-38! We can do the same for window #254 and see it represents numbers from 1.7014118e+38 to 3.4028237e+38 - and FLT_MAX, 3.4028235e+38, is the very last representable number in that window, just shy of 2^128. We also know that there are only 8 million numbers in this window - so they are really far apart!
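
No need to take my word for the window math, either - <float.h> ships both constants, so a two-line check will do:

    #include <stdio.h>
    #include <float.h>

    int main() {
      printf("%e\n", FLT_MIN);  /* 1.175494e-38, the start of window #1 */
      printf("%e\n", FLT_MAX);  /* 3.402823e+38, the last number in window #254 */
    }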

Assembling floats

Now that we know how the parts of a float work together, we can assemble one by hand. A float has the window identifier (exponent), the offset in the window (mantissa) and the sign bit (whether it's positive or negative). Quite bluntly, it's literally the three of these concatenated as bits - one bit for the sign, then 8 bits for the exponent, then 23 bits for the mantissa. So for example:

The number positive 1 gets a 0 bit for the sign, then 127 for the exponent (as 1 is the first number in window #127). Then we put in 0 for the mantissa (again, because it's the first number in the window) and we get 0x3F800000.
Let's try a harder one - pi. Pi is a positive number (so 0 for the sign), with 128 for the exponent (as it's in the window from 2 to 4). It's slightly over halfway into that window, so we calculate (pi - 2) / (4 - 2) * 8388608 and, rounding to the nearest whole number, get 4788187 (by just copy/pasting the calculation into google - try it yourself!). Assembling this gets us

- 0x40000000 for the exponent (the value 128 (0x80) shifted left by 23 positions)
- 0x00490FDB for the mantissa (4788187, just converted to hexadecimal)

Assembled that becomes 0x40490FDB. Now let's see what the computer thinks of it:

    #include <stdio.h>
    int values[] = {
      0x3F800000,   // We expect 1.0
      0x40490FDB    // We expect pi
    };
    int main() {
      float* f = (float*)values;  // reinterpret the ints' bits as floats (see below)
      printf("%f %f\n", f[0], f[1]);
    }

    $ ./test
    1.000000 3.141593


Looks to be pretty accurate. Try it yourself, but turn off optimization - your compiler is normally allowed to assume you're not doing these kinds of hacks (this is the strict aliasing rule), so the optimizer will break this code.
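
If you want a version that survives the optimizer, memcpy is the well-defined way to reinterpret bits in C - compilers recognize the pattern and compile it down to a plain load. A sketch:

    #include <stdio.h>
    #include <string.h>

    int main() {
      unsigned int bits[] = { 0x3F800000, 0x40490FDB };
      float f[2];
      memcpy(f, bits, sizeof(f));     /* well-defined, unlike the pointer cast */
      printf("%f %f\n", f[0], f[1]);  /* 1.000000 3.141593 */
    }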

Special numbers

So there are four kinds of special numbers. For all of these we have to handle special corner cases when doing math, because they don't behave like normal numbers. The first two types are pretty simple ones:

- Positive zero is the first number in positive window #0.
- Negative zero is the first number in negative window #0.
- Positive infinity is the first number in positive window #255.
- Negative infinity is the first number in negative window #255.

We treat them specially so that anything times zero is a zero, and anything times infinity is infinity. That raises the question - what's infinity minus infinity? Zero times infinity? Infinity divided by infinity? Or zero divided by zero?

All of these are of course nonsense questions - they're more of a mathematical oddity you run into when doing derivatives and integrations. Your computer cannot reply to such a question with anything other than a number though, so a number it will be - but it will be the "Not-a-number" reply. That's your computer telling you "this was nonsense and I cannot give you a numerical answer". These are encoded similarly to infinity, except that the mantissa is not 0. It doesn't really matter which number in window #255 it is - as long as the mantissa is not 0, it's a NaN.
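
A small sketch makes all of this concrete - it produces the special values and dumps their bit patterns (bits_of is a little helper written just for this example):

    #include <stdio.h>
    #include <string.h>

    /* Helper for this sketch: look at a float's bit pattern via memcpy. */
    unsigned int bits_of(float f) {
      unsigned int u;
      memcpy(&u, &f, sizeof(u));
      return u;
    }

    int main() {
      float zero = 0.0f;  /* a variable, so the compiler accepts dividing by it */
      printf("%f 0x%08X\n",  1.0f / zero, bits_of( 1.0f / zero)); /* inf,  0x7F800000 */
      printf("%f 0x%08X\n", -1.0f / zero, bits_of(-1.0f / zero)); /* -inf, 0xFF800000 */
      printf("%f 0x%08X\n", -0.0f,        bits_of(-0.0f));        /* -0,   0x80000000 */
      printf("%f 0x%08X\n", zero / zero,  bits_of(zero / zero));  /* nan: window #255
                                                         with a nonzero mantissa */
    }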

Denormals

This leaves us with the fifth type of number. A denormal is a number that's not normal, but only *just* outside of the normal range. A long time ago somebody looked at the number representations and noticed that there are 8 million numbers in window #0 that we're not using at all! What if we used those for a special extra window of numbers from 0 to FLT_MIN, they thought. Then you could encode those numbers as well...

In practice though, this is not used often at all, and only comes up in very weird corner cases. In those corner cases it's not even that useful, because the precision suffers: there are only 8 million numbers between 0 and FLT_MIN, so unlike normal numbers (where dividing by two just moves you to the next window down), every division by 2 costs you a bit of precision instead. This led computer chip makers to make your computer really fast at anything that's not a denormal number - and, implicitly, really slow when it is, because denormals require special handling. Some chip makers then allowed you to "turn off" denormal handling, so that any denormal is simply treated as zero - also called DaZ, Denormals Are Zero. And some chip makers now flat out don't implement denormal numbers any more, because nobody wants them and people turn them off anyway.
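
Here's a little sketch showing the precision drain (assuming your machine has denormal support enabled; with DaZ turned on, the first print already shows zero):

    #include <stdio.h>
    #include <float.h>
    #include <math.h>

    int main() {
      float f = FLT_MIN / 2;  /* below window #1: a denormal */
      printf("%e %d\n", f, fpclassify(f) == FP_SUBNORMAL);  /* 5.877472e-39 1 */
      printf("%e\n", f / 8388608);  /* 23 more halvings use up all the bits: 0 */
    }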

So this is why denormal numbers exist, and why you really shouldn't care. Your computer may not support them at all, and if it does, it will slow down disproportionately when you use them, making them a feature you'll want to avoid.

Further reading

All of these bits of information come from a standard that describes floating point numbers. Nearly all computers nowadays implement IEEE-754 floating point numbers, and those are the numbers I described above. The standard also defines 64-bit and larger formats that follow the same basic idea, just with bigger and harder to read numbers: a 64-bit double, for example, uses 1 sign bit, 11 exponent bits (2046 normal windows, balanced around window #1023) and 52 mantissa bits.
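
Following the same recipe as before, 1.0 in a double is window #1023 at offset 0, which assembles to 0x3FF0000000000000. A quick sketch to check (assuming unsigned long long is 64 bits wide, which holds on all common platforms):

    #include <stdio.h>
    #include <string.h>

    int main() {
      unsigned long long bits = 0x3FF0000000000000ULL;  /* window #1023, offset 0 */
      double d;
      memcpy(&d, &bits, sizeof(d));
      printf("%f\n", d);  /* 1.000000 */
    }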

You can also take these insights and read 16-bit floating point numbers - they follow the same logic. Some 16-bit formats omit the NaN/infinity window to gain one more window of normal numbers.
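
As a closing exercise, here is a sketch of a decoder for the common IEEE binary16 ("half") layout - 1 sign bit, 5 exponent bits (bias 15) and 10 mantissa bits. The name half_to_float is just something I made up for this example:

    #include <stdio.h>
    #include <math.h>

    float half_to_float(unsigned short h) {
      int sign = (h >> 15) & 1;
      int exp  = (h >> 10) & 0x1F;  /* 5-bit window id, bias 15 */
      int mant = h & 0x3FF;         /* 10-bit offset in the window */
      float value;
      if (exp == 31)                /* the special window: infinity or NaN */
        value = mant ? NAN : INFINITY;
      else if (exp == 0)            /* window #0: zero or a denormal */
        value = ldexpf((float)mant, -24);
      else                          /* normal: put back the implicit leading 1 */
        value = ldexpf((float)(0x400 | mant), exp - 25);
      return sign ? -value : value;
    }

    int main() {
      printf("%f\n", half_to_float(0x3C00));  /* 1.000000 */
      printf("%f\n", half_to_float(0x4248));  /* 3.140625, the half closest to pi */
    }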

1 comment:

  1. This is a really nice and accessible article!

    As you noted, casting an int * to a float * like that can break, because it violates the strict aliasing rule, which is undefined behavior.

    In both C and C++ we don't have to violate the strict aliasing rule, nor do type punning via unions, which is another common way to do this. We can use memcpy, which is a blessed solution, and the compiler should recognize type punning via memcpy and optimize appropriately. I point to several references on this here: https://stackoverflow.com/a/31080901/1708801

    For C++, JF Bastien has a proposal to add bit_cast which would also solve this problem: http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0476r1.html and see the code on github: https://github.com/jfbastien/bit_cast/blob/master/bit_cast.h
