Floats are not inaccurate
  • 17th May 2024
  • 15 min read
  • Tags: code
  • Last updated on 17th May 2024

There are tons of sources on the internet about binary floating-point numbers, how they work, and how they can yield surprising results and lead to programming errors and bugs. The reason I’m writing yet another article about this is that many of them only tell you half of the story or contain wrong information. Keep in mind that “wrong” can also mean that the statements were simplified to a point where they are, IMHO, misleading; it doesn’t necessarily mean that the authors themselves didn’t know any better. Many sources even claim that floating-point numbers are not precise enough or should somehow be avoided at all costs because of “rounding errors”, which couldn’t be any further from the truth.

🔗About floats

When I say “floats” I mean representations that are based on scientific notation (like IEEE 754). This also includes doubles and other (even made up) precisions. There are other representations that are not based on scientific notation as well, but they are pretty rare and I don’t address them here.

🔗Floats don’t have to be base two

The basic form of a floating-point number looks like this:

$$ (-1)^\text{Sign}\ \cdot\ \text{Mantissa}\ \cdot\ \text{Base}^\text{Exponent}\ $$

So to pin down the value of a floating-point number, we have to know these four values (the sign is a single bit, the mantissa and base are positive integers, and the exponent is an integer that may be negative):

  • Sign
  • Mantissa
  • Base
  • Exponent

For practically all floating-point types the base is constant and therefore isn’t stored in the binary representation. For IEEE 754 floating-point numbers the base is two. The other three values, Sign, Mantissa and Exponent, are stored. For a 32-bit float (single precision) there is a single sign bit, 8 bits for the exponent and 23 bits for the mantissa.
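
To make that layout concrete, here is a minimal Rust sketch (standard library only) that extracts the three stored fields from an f32 via its raw bit pattern:

// Decompose an f32 into its stored sign, exponent and mantissa fields.
fn main() {
    let x: f32 = 0.1;
    let bits = x.to_bits(); // raw IEEE 754 bit pattern as a u32

    let sign = bits >> 31;              // 1 bit
    let exponent = (bits >> 23) & 0xFF; // 8 bits, stored with a bias of 127
    let mantissa = bits & 0x7F_FFFF;    // 23 bits, without the implicit leading 1

    // prints: sign = 0, exponent = 123, mantissa = 5033165
    println!("sign = {}, exponent = {}, mantissa = {}", sign, exponent, mantissa);
}

Note that the exponent field is stored with a bias and the mantissa field omits the implicit leading 1, which is why these raw fields look different from the mathematical form above.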

⚠️ Important
The fact that IEEE 754 uses binary floating-points (base two) is not implied by the fact that computers can only work with binary! Read it again!

This is something a lot of other resources just assume, don’t mention, or describe in a way that sounds as if one implied the other. It doesn’t! The base could be completely different, like 10, 3 or 7. There is of course a reason why the IEEE standard uses base two and not base ten, but it’s a practical and historical one.

The ideas and standards someone writes into a paper have to be implemented either in software or in hardware (or a combination of both), and if you work with binary logic and gates it’s way simpler to design and implement circuitry that calculates in base two than in a different base, like base ten. Add to that the fact that even machines like the Z1 back in 1938 implemented some form of floating-point numbers, and it’s clear why hardware manufacturers built their floating-point support with base two. In our modern world, where CPUs are no longer manually designed and laid out by hand, it would be possible to implement instructions for native base-ten floats in binary hardware, but it’s still not done because there wouldn’t be many benefits over current software implementations (decimal types).

So to recap: the base for IEEE 754 floats is two because that’s simpler to implement in hardware. The base for floating-point numbers on computers doesn’t have to be two. There are even types (decimal types) that use base ten, but they are implemented in software.

🔗Floats are not inaccurate

Many developers think, because of their experience and what they’ve been told, that floats are inaccurate. That’s wrong: floats are not inaccurate; your assumptions about how floats work are inaccurate. Let’s take the following code snippet:

// define x as a 32 bit float and store 0.1
let x : f32 = 0.1;

// print x to the console
println!("{}", x);

If you execute it, the console shows 0.1. Everything perfect, right? No. The problem already happened in the very first statement, and it happened at compile time(!). Binary floating-point numbers can’t store most decimals exactly. According to the source code we wanted to store $\frac{1}{10}$ in x, but that’s impossible, because a 32-bit float can’t represent 0.1 exactly. So what happens? During the compilation process the compiler encounters the literal 0.1, “finds out” that this can’t be exactly represented in a 32-bit float, and then looks for the closest value that can be represented in a 32-bit float. I won’t describe the exact process here, but the compiler has to find the values for Sign, Mantissa and Exponent. In our case it will likely find the following solution:

$$ \Large (-1)^{\overbrace{0}^{\text{Sign}}}\ \cdot\ \overbrace{13\text{‘}421\text{’}773}^{\text{Mantissa}} \cdot\ 2^{\overbrace{-27}^{\text{Exponent}}}\ $$

$$ = 0.100000001490116119384765625_{10}\ \small\text{(exactly!)} $$

As you can see, the value that is stored is not exactly 0.1 but something close, because it has to fit into the form $(-1)^S\cdot M\cdot 2^E$. If you wonder what the next smaller representable value in a 32-bit float is, just reduce the mantissa by one:

$$ (-1)^0\ \cdot\ 13\text{‘}421\text{’}772 \cdot\ 2^{-27}\ $$ $$ = 0.0999999940395355224609375_{10}\ \small\text{(exactly!)} $$
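
You don’t have to take my word for it; here is a small sketch (standard library only) that prints the exact value stored in the f32 and its next smaller neighbour:

// Print the exact decimal value stored in the float (and its neighbour).
fn main() {
    let x: f32 = 0.1;

    // Asking for 30 digits forces the exact stored value to be printed:
    println!("{:.30}", x); // 0.100000001490116119384765625000

    // Decrementing the bit pattern by one reduces the mantissa by one:
    let smaller = f32::from_bits(x.to_bits() - 1);
    println!("{:.30}", smaller); // 0.099999994039535522460937500000
}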

Another way to see why some values don’t fit in that representation is to do a binary conversion. If the result has too many digits after the point, which always happens when the result is periodic, it cannot be stored exactly.

$$ 0.1_{10} \overset{\text{to binary}}{\Longrightarrow} 0.0\overline{0011} $$ $$ = 0.00110011001100110011… $$

Note that this conversion error doesn’t happen for all values. There are values which can be represented exactly. For example, all integers (within a specific limit) can be represented exactly, as can fractional parts that are (sums of) negative powers of two, like 0.5, 0.25, 0.125 or 0.625.
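
Here is a quick sketch that shows this in action:

// These literals are stored exactly, so the comparisons involve no rounding.
fn main() {
    let a: f32 = 0.5;   // 2^-1
    let b: f32 = 0.625; // 2^-1 + 2^-3

    println!("{}", a + b == 1.125); // true: 1.125 = 2^0 + 2^-3 is also exact

    // Integers are exact too, up to 2^24 for f32:
    println!("{}", 16_777_215f32 + 1.0 == 16_777_216.0); // true
}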

A very (very) pedantic compiler could throw warnings (or even errors) every time you write a binary floating-point literal in base ten that can’t be represented exactly. But that would probably be pretty annoying, and that’s why the compiler just lets you write stuff like let x : f32 = 0.1; even though it can’t store it exactly.

All of these conversion problems also apply to non-compiled languages; the conversion just happens at a different point in time. It might happen when an intermediate code representation is generated (like in C# or Java), during the JIT compilation step (like in V8 JavaScript), or at runtime for some interpreted scripting languages.

Now that we know that just by writing something like let x : f32 = 0.1; we already stored a value which is not exactly 0.1, why does the program still print 0.1? Some people might think it has to do with rounding, and from the outside it certainly looks like regular rounding, but internally it’s “a bit” more involved. I won’t go into much detail, but you can google Grisu2, Grisu-Exact and Ryu + float if you want to go deeper.

As we saw, converting our base-ten input to a binary floating-point number is lossy. But getting back a human-readable base-ten number is never lossy in the sense that it never leads to periodic representations: because two (the base of IEEE floats) is a factor of ten (the base humans typically use), every binary floating-point number can be written as a base-ten number with a finite decimal expansion. There are always many (in theory even infinitely many) possible base-ten values that would lead to the same binary floating-point representation, so the computer “just picks” the shortest decimal number that converts back to the same binary floating-point number you are trying to display.
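
In Rust, this “shortest string that round-trips” behaviour is what the default formatter gives you; here is a small sketch to illustrate:

// The default formatter picks the shortest decimal string that parses
// back to exactly the same bit pattern.
fn main() {
    let x: f32 = 0.1;

    let shortest = format!("{}", x);
    println!("{}", shortest); // 0.1

    let round_tripped: f32 = shortest.parse().unwrap();
    println!("{}", round_tripped.to_bits() == x.to_bits()); // true
}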

🔗Why not use decimal types everywhere?

Based on what we’ve seen so far, one might be tempted to never use binary floats and always use decimals for everything. When you write base-ten literals, decimal types won’t introduce conversion errors, because the literal value you wrote and the target type are both base ten. But they don’t have infinite precision either, and they introduce errors for operations like division. For example, think about what happens inside a base-ten floating-point number when you compute $\frac{1}{3}$: the number $0.3333333$ will be stored, and when you then multiply it by 3 again you get $0.9999999$. Many decimal type implementations try to hide or mask that fact by using a higher precision internally for calculations and “rounding” before they display the number. Decimal implementations are also not hardware-accelerated and are therefore a lot slower than binary floats.
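
To see where that error comes from without relying on any particular decimal library, here is a toy sketch that emulates a seven-digit decimal type with a scaled integer (the precision and names are made up purely for illustration):

// Toy "decimal": store the value as an integer number of 10^-7 units.
fn main() {
    const SCALE: i64 = 10_000_000; // seven decimal digits after the point

    let third = SCALE / 3; // 3_333_333  -> represents 0.3333333
    let back = third * 3;  // 9_999_999  -> represents 0.9999999, not 1.0
    println!("0.{} * 3 = 0.{}", third, back);
}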

🔗Good and bad use-cases for floats

Floats can be used everywhere it doesn’t matter that you can’t store a 100% accurate base-ten representation. For example: positions and speeds in 3D games and animations, “analog” values like temperatures, the speed of a vehicle, geo positions with longitude and latitude, or a person’s weight or blood pressure. In fact, if you develop games there is no way around 32-bit floats, because GPUs are f32 number-crunching beasts. Modern 3D games wouldn’t be possible without all those fast f32 calculations.

You shouldn’t use binary floats if you need or expect exact base-ten calculations (addition, subtraction, multiplication; note that division quickly introduces errors even in decimal types), or for quantities that have a smallest unit that can’t be broken down, like money. If you need to handle money, just store the amount of cents as an integer and only divide by 100 in your display function.
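
Here is a minimal sketch of the cents-as-integers approach (the helper name is made up for illustration, and negative amounts are ignored for brevity):

// Store money as whole cents and only format it for display.
fn format_money(cents: i64) -> String {
    format!("{}.{:02}", cents / 100, cents % 100)
}

fn main() {
    let price_a: i64 = 1_999; // 19.99
    let price_b: i64 = 1;     // 0.01
    let total = price_a + price_b; // exact integer addition
    println!("{}", format_money(total)); // 20.00
}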

🔗Things other sources get wrong

Here is a collection of quotes from other sources that are wrong or could be misleading, along with corrections for those statements. In general the information on those sites is fine, but the devil is in the details. The fact that a site or resource is linked here doesn’t mean it’s bad, quite the opposite actually: they are all pretty solid material for learning about binary floats.

🔗0.30000000000000004.com

I think this site is a pretty good resource which provides a good overview of a lot of programming languages and how floats behave in them. Nonetheless, here are a few things the author(s) got wrong:

Computers can only natively store integers

That’s not quite right, because computers can’t even store integers natively. Computers can only store and process bits (binary digits) natively. This distinction is important, because computers (and, without context, even humans) don’t know what those bits represent. They don’t have to represent integers. Here is a single byte (eight bits), for example:

$$ \boxed{1}\boxed{1}\boxed{0}\boxed{1}\boxed{1}\boxed{1}\boxed{1}\boxed{1} $$

Now try to interpret what’s stored in this byte. The only thing we can say for certain is that the bits “11011111” are stored here. We can certainly try to interpret it as an integer, but it could be all sorts of other things as well.
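
Here is a quick Rust sketch of how the very same bit pattern changes its meaning depending on the type we read it as:

// One bit pattern, three different interpretations.
fn main() {
    let byte: u8 = 0b1101_1111;

    println!("{}", byte);         // 223  (unsigned integer)
    println!("{}", byte as i8);   // -33  (two's complement signed integer)
    println!("{}", byte as char); // 'ß'  (read as a Latin-1 code point)
}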

So you can certainly interpret everything a computer stores as an integer, but you could also interpret everything a computer stores as braille patterns, and that doesn’t mean that computers natively store braille patterns. They just store and process bits, and those bits don’t have any meaning up until the point where we assign meaning to them.

Back to the original topic, floats: the fact that computers can only store binary information means that we have to find a representation for a floating-point number that we can store in binary. That does NOT, however, imply that the base used by the floating-point number must be two, as discussed earlier.

When you perform math on these repeating decimals, you end up with leftovers which carry over when you convert the computer’s base-2 (binary) number into a more human-readable base-10 representation

As discussed earlier, this is also not true. Every float can be converted to a human-readable base-10 representation without any error. The biggest error happens when the programmer assumes that their decimal literal (without any other operations at all) is stored as an exact value, which in most cases it is not.

🔗Tom Scott at Computerphile

IMHO the video is a good summary, but there are a few important details it glosses over.

@00:00: “People expect computers to be entirely accurate and precise with numbers.”

And they are, if we exclude bugs and cosmic rays. This quote kind of primes the viewer to expect cases where computers are not accurate. But that’s not true for floating-point numbers: they are accurate; it’s programmers who have misconceptions about floats.

@03:42: “Base 2, on the other hand, binary…, computers…, they don’t do that.”

Tom makes it sound like it’s not possible for computers to accurately represent and calculate with base-ten floating-point numbers. This, however, is not true, and Tom knows it, because he later (very briefly) mentions decimal types (@08:07).

🔗Floating-Point Guide

In its Basic Answers section the site states:

Because internally, computers use a format (binary floating-point) that cannot accurately represent a number like 0.1, 0.2 or 0.3 at all.

This sentence sounds like the only option for computers to handle floating-point numbers is IEEE 754 binary floats, which is not true. Computers can handle decimals just fine and are able to calculate $0.1+0.2=0.3$ exactly, but you have to know what you are doing.

Grab a random scientific calculator like a “Casio fx-991ES”, “TI-30XIIS” or “HiPER Calc for Android” (not sponsored btw, but I’m a big fan) and show me the “floating-point errors”. The problem is that a lot of developers miss the fact that floats are calculation primitives. It’s like using a single byte (u8), then wondering why you can’t count up to 1’000 and then claiming that computers can’t count higher than 255.

So binary floating-point numbers can’t represent 0.1, but computers certainly can, and they do it all the time, for example with types like C#’s decimal. The Rust type BigRational (from the num-rational crate) can even handle arbitrary precision (at least until you run out of RAM 😂). To be fair, the site mentions decimal types a couple of lines later.
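
As a sketch of what that can look like in Rust, here is exact arithmetic with rationals; this assumes the num-rational and num-bigint crates as dependencies:

// Exact arithmetic with rationals: 1/10 + 2/10 really is 3/10.
use num_bigint::BigInt;
use num_rational::BigRational;

fn main() {
    let one_tenth = BigRational::new(BigInt::from(1), BigInt::from(10));
    let two_tenths = BigRational::new(BigInt::from(2), BigInt::from(10));
    let three_tenths = BigRational::new(BigInt::from(3), BigInt::from(10));

    println!("{}", one_tenth + two_tenths == three_tenths); // true
}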

So even if your programming language only exposes calculation primitives like 64-bit integers to you, this doesn’t mean that computers can’t handle numbers larger than 64-bit integers. They can, but you have to model and solve your problem in software and build new types on top of those primitive types. Most of the time some smart people have already solved those problems for the language you use, and you just have to include a library to get proper base-ten support.

When the code is compiled or interpreted, your “0.1” is already rounded to the nearest number in that format

Big bonus points for that. Most sites fail to mention that the “biggest” (and for most people probably the most unintuitive) error already happened long before the first operation.

🔗More about floats

IEEE 754 floats have even more interesting concepts, like NaNs, two zeros (plus and minus zero), plus and minus infinity, subnormal numbers, etc. In reality we have only scratched the surface so far. All those things are out of scope for this article, but I wanted to mention them so you can look them up if you are interested.

Here are a few links to sources I’d consider pretty solid