Safety in Numbers


Floating Point Explained

When the German engineer Konrad Zuse built one of the first programmable computers in 1941, he realized that computing machines would need to represent more than just whole numbers (0, 1, 2, etc.). Whole numbers are fine for things that can be counted, but computers also needed to represent things that are measured, as in scientific calculations. He therefore stored numbers in the machine in a manner similar to scientific notation, with a sign, an exponent, and a fraction. This method became known as “floating point.”

Floating point was expensive to implement, so in earlier and smaller computers it was done in software. Worse still, different companies used different representations and algorithms to perform floating-point arithmetic.

Eventually, advancing technology moved these representations and algorithms into hardware, but because each manufacturer kept its own format, floating-point results were not compatible between computers in native form. Even worse, different computers gave different answers to the same problem. So in 1985 the Institute of Electrical and Electronics Engineers (IEEE) developed and published a standard for floating point, IEEE 754, that has since been improved and internationalized.

Standard Floating Point Format

This diagram shows how floating-point numbers are represented in a computer in the standard floating-point format. Even though the number of bytes required may vary, the layout remains the same.

where k is the total space required for the floating-point value, S is the single-bit sign, E is the exponent of width e, and T is the mantissa or fraction of width t, so that k = 1 + e + t.


Common values for k are 32, 64, 80, and 128 bits.
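For example, in the standard 32-bit (binary32) format e = 8 and t = 23, and in the 64-bit (binary64) format e = 11 and t = 52.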

Though E is stored as an unsigned integer in the range 0 through 2^e - 1, the actual exponent represented is E - O, where O is an offset (the bias), so the exponents actually represented run from -O up to (2^e - 1) - O. O is normally chosen to be 2^(e-1) - 1.
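For the 32-bit format, for example, e = 8, so O = 2^7 - 1 = 127, and a stored exponent field of E = 130 represents an actual exponent of 130 - 127 = 3.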

T is a t-bit unsigned integer, so T/2^t represents a fraction between 0.0 and just under 1.0. There is also a “hidden bit” equal to 2^t added to T, so the significand actually represented is 1 + T/2^t, a value from 1.0 up to 2.0 minus one bit. The hidden bit need not be shown in the representation, because (for normal numbers) it must always be one.
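With t = 23, for instance, a stored fraction of T = 2^22 gives T/2^t = 0.5, so the significand actually represented is 1.5.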

Precision, p = t + 1 in the standard, is the number of significand bits, counting the hidden bit. The real value represented is:

(-1)^S * (1 + T * 2^(-t)) * 2^(E - O)
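As a concrete sketch (assuming the common 32-bit binary32 layout, where e = 8, t = 23, and O = 127), the following C program pulls S, E, and T out of a float and recomputes its value from the formula above; the example value 6.25 is arbitrary.

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>
    #include <math.h>

    int main(void) {
        float x = 6.25f;                 /* arbitrary value to decode */
        uint32_t bits;
        memcpy(&bits, &x, sizeof bits);  /* view the same 32 bits as an integer */

        unsigned S = bits >> 31;             /* sign: 1 bit           */
        unsigned E = (bits >> 23) & 0xFFu;   /* exponent: e = 8 bits  */
        unsigned T = bits & 0x7FFFFFu;       /* fraction: t = 23 bits */
        int O = 127;                         /* offset: 2^(e-1) - 1   */

        /* (-1)^S * (1 + T*2^(-t)) * 2^(E-O); applies to normal numbers,
           not to zeros, subnormals, infinities, or NaNs (E = 0 or 255) */
        double value = (S ? -1.0 : 1.0)
                     * (1.0 + (double)T / 8388608.0)   /* 2^23 = 8388608 */
                     * pow(2.0, (int)E - O);

        printf("S=%u E=%u T=0x%06X reconstructed=%g original=%g\n",
               S, E, T, value, (double)x);
        return 0;
    }

Running it should print S=0 E=129 T=0x480000, and the reconstructed value 6.25 matches the original.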

Numbers that cannot be represented exactly by such a finite sum of powers of two will carry a small rounding error.
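A quick C illustration: neither 0.1 nor 0.2 has an exact binary representation, so each is rounded to the nearest double and their sum is not exactly 0.3.

    #include <stdio.h>

    int main(void) {
        double a = 0.1, b = 0.2;

        /* 0.1 and 0.2 are each rounded to the nearest representable double,
           so their sum differs slightly from the double nearest to 0.3 */
        printf("0.1 + 0.2 = %.17g\n", a + b);                        /* 0.30000000000000004 */
        printf("equal to 0.3? %s\n", (a + b == 0.3) ? "yes" : "no"); /* no */
        return 0;
    }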

Connie Masters