Comparing Floating Point Numbersby Bruce Dawson
Posted on 20101230 11:25 kitlover 阅读(...) 评论(...) 编辑 收藏Comparing floating point numbers
Bruce Dawson
(源:http://www.cygnussoftware.com/papers/comparingfloats/comparingfloats.htm)
Comparing for equality
Floating point math is not exact. Simple values like 0.2 cannot be precisely represented using binary floating point numbers, and the limited precision of floatingpoint numbers means that slight changes in the order of operations can changethe result. Different compilers and CPU architectures store temporary resultsat different precisions, so results will differ depending on the details ofyour environment. If you do a calculation and then compare the results againstsome expected value it is highly unlikely that you will get exactly the result you intended.
In otherwords, if you do a calculation and then do this comparison:
if(result == expectedResult)
then it is unlikely that the comparisonwill be true. If the comparison is true then it is probably unstable – tinychanges in the input values, compiler, or CPU may change the result and makethe comparison be false.
Comparingwith epsilon – absolute error
Since floating point calculations involve a bit of uncertainty we can try to allowfor this by seeing if two numbers are ‘close’ to each other. If you decide –based on error analysis, testing, or a wild guess – that the result should always be within 0.00001 of the expected result then you can change your comparison to this:
if(fabs(result  expectedResult) < 0.00001)
The maximumerror value is typically called epsilon.
Absolute error calculations have their place, but they aren’t what ismost often used. When talking about experimental error it is more common tospecify the error as a percentage. Absolute error is used less often because ifyou know, say, that the error is 1.0 that tells you very little. If the resultis one million then an error of 1.0 is great. If the result is 0.1 then anerror of 1.0 is terrible.
With the fixed precision of floating point numbers in computers there are additional considerations with absolute error. If the absolute error is too small for the numbers being compared then the epsilon comparison may have no effect, because the finite precision of the floats may not be able to represent such small differences.
Let's say you do a calculation that has an expected answer of about 10,000. Because floating point math is imperfect you may not get an answer of exactly 10,000 you may be off by one or two in the least significant bits of your result. Ifyou're using 4byte floats and you're off by one in the least significant bitof your result then instead of 10,000 you'll get +10000.000977. So we have:
float expectedResult = 10000;
floatresult = +10000.000977; // The closest 4byte float to 10,000 without being 10,000
floatdiff = fabs(result  expectedResult);
diff is equal to 0.000977, which is 97.7 times larger than our epsilon. So, ourcomparison tells us that result and expectedResultare not nearly equal, even though they are adjacent floats! Using an epsilonvalue 0.00001 for float calculations in this range is meaningless – it’s the sameas doing a direct comparison, just more expensive.
Absolute error comparisons have value. If the range of the expectedResultis known then checking for absolute error is simple and effective. Just makesure that your absolute error value is larger than the minimum representable differencefor the range and type of float you’re dealing with.
Comparing with epsilon – relative error
An error of0.00001 is appropriate for numbers around one, too big for numbers around0.00001, and too small for numbers around 10,000. A more generic way ofcomparing two numbers – that works regardless of their range, is to check therelative error. Relative error is measured by comparing the error to theexpected result. One way of calculating it would be like this:
relativeError = fabs((result  expectedResult) /expectedResult);
If resultis 99.5, and expectedResult is 100, then the relativeerror is 0.005.
Sometimeswe don’t have an ‘expected’ result, we just have twonumbers that we want to compare to see if they are almost equal. We might writea function like this:
// Nonoptimal AlmostEqual function not recommended.
bool AlmostEqualRelative(float A, float B, floatmaxRelativeError)
{
if (A == B)
return true;
float relativeError = fabs((A  B) / B);
if (relativeError <=maxRelativeError)
return true;
return false;
}
The maxRelativeError parameter specifies what relative error weare willing to tolerate. If we want 99.999% accuracy then we should pass a maxRelativeError of 0.00001.
The initialcomparison for A == B may seem odd – if A == B then won’t relativeErrorbe zero? There is one case where this will not be true. If Aand B are both equal to zero then the relativeErrorcalculation will calculate 0.0 / 0.0. Zero divided by zero is undefined, andgives a NAN result. A NAN will never return trueon a <= comparison, so this function will return false if A and B are bothzero (on some platforms where NAN comparisons are not handled properly thisfunction might return true for zero, but it will then return true for all NANinputs as well, which makes this poor behavior to count on).
The troublewith this function is that AlmostEqualRelative(x1,x2, epsilon) may not give the result as AlmostEqualRelative(x2,x1, epsilon), because the second parameter is always used as the divisor. Animproved version of AlmostEqualRelative would alwaysdivide by the larger number. This function might look like this;
// Slightly better AlmostEqual function– still not recommended
bool AlmostEqualRelative2(float A, float B, floatmaxRelativeError)
{
if (A == B)
returntrue;
float relativeError;
if (fabs(B) > fabs(A))
relativeError = fabs((A  B) / B);
else
relativeError = fabs((A  B) / A);
if (relativeError <=maxRelativeError)
return true;
return false;
}
Even now ourfunction isn’t perfect. In general this function will behave poorly for numbersaround zero. The positive number closest to zero and the negative numberclosest to zero are extremely close to each other, yet this function will correctlycalculate that they have a huge relative error of 2.0. If you want to countnumbers near zero but of opposite sign as being equal then you need to add a maxAbsoluteError check also. The function would then returntrue if either the absoluteError or the relativeError were smaller than the maximums passed in. Atypical value for this backup maxAbsoluteError wouldbe very small – FLT_MAX or less, depending on whether the platform supports subnormals.
// Slightly better AlmostEqual function– still not recommended
bool AlmostEqualRelativeOrAbsolute(floatA, float B,
floatmaxRelativeError, float maxAbsoluteError)
{
if (fabs(A  B) < maxAbsoluteError)
return true;
float relativeError;
if (fabs(B) > fabs(A))
relativeError = fabs((A  B) / B);
else
relativeError = fabs((A  B) / A);
if (relativeError <=maxRelativeError)
return true;
return false;
}
Comparingusing integers
There is analternate technique for checking whether two floating point numbers are closeto each other. Recall that the problem with absolute error checks is that theydon’t take into consideration whether there are any values in the range beingchecked. That is, with an allowable absolute error of 0.00001 and an expectedResult of 10,000 we are saying that we will acceptany number in the range 9,999.99999 to 10,000.00001, without realizing thatwhen using 4byte floats there is only onerepresentable float in that range – 10,000. Wouldn’t it be handy if we couldeasily specify our error range in terms of how many floats we want in thatrange? That is, wouldn’t it be convenient if we could say “I think the answeris 10,000 but since floating point math is imperfect I’ll accept the 5 floatsabove and the 5 floats below that value.”
It turnsout there is an easy way to do this.
The IEEEfloat and double formats were designed so that the numbers are“lexicographically ordered”, which – in the words of IEEE architect William Kahan means “if two floatingpoint numbers in the sameformat are ordered ( say x < y ), then they areordered the same way when their bits are reinterpreted as SignMagnitudeintegers.”
This meansthat if we take two floats in memory, interpret their bit pattern as integers,and compare them, we can tell which is larger, without doing a floating pointcomparison. In the C/C++ language this comparison looks like this:
if(*(int*)&f1 < *(int*)&f2)
Thischarming syntax means take the address of f1, treat it as an integer pointer,and dereference it. All those pointer operations look expensive, but theybasically all cancel out and just mean ‘treat f1 as an integer’. Since we applythe same syntax to f2 the whole line means ‘compare f1 and f2, using theirinmemory representations interpreted as integers instead of floats’.
Kahansays that we can compare them if we interpret them as signmagnitude integers.That’s unfortunate because most processors these days use twoscomplementintegers. Effectively this means that the comparison only works if one or moreof the floats is positive. If both floats are negativethen the sense of the comparison is reversed – the result will be the oppositeof the equivalent float comparison. Later we will see that there is a handytechnique for dealing with this inconvenience.
Because thefloats are lexicographically ordered that means that if we increment therepresentation of a float as an integer then we move to the next float. Inother words, this line of code:
(*(int*)&f1)+= 1;
willincrement the underlying representation of a float and, subject to certainrestrictions, will give us the next float. For a positive number this means thenext larger float, for a negative number this means the next smaller float. Inboth cases it gives us the next float farther away from zero.
We canapply this logic in reverse also. If we subtract the integer representations oftwo floats then that will tell us how close they are. If the difference iszero, they are identical. If the difference is one, they are adjacent floats.In general, if the difference is n then there are n1 floats between them.
The chartbelow shows some floating point numbers and the integer stored in memory thatrepresents them. It can be seen in this chart that the five numbers near 2.0are represented by adjacent integers. This demonstrates the meaning ofsubtracting integer representations, and also shows that there are no floatsbetween 1.99999988 and 2.0.

Representation 

Float value 
Hexadecimal 
Decimal 
+1.99999976 
0x3FFFFFFE 
1073741822 
+1.99999988 
0x3FFFFFFF 
1073741823 
+2.00000000 
0x40000000 
1073741824 
+2.00000024 
0x40000001 
1073741825 
+2.00000048 
0x40000002 
1073741826 
With this knowledgeof the floating point format we can write this revised AlmostEqualimplementation:
// Initial AlmostEqualULPs version fast and simple, but
// some limitations.
bool AlmostEqualUlps(float A, float B, int maxUlps)
{
assert(sizeof(float) == sizeof(int));
if (A == B)
return true;
int intDiff = abs(*(int*)&A  *(int*)&B);
if (intDiff <= maxUlps)
return true;
return false;
}
It’s certainly a lot simpler, especially when you look atall the divides and calls to fabs() that it’s not doing!
The last parameter to this function is different from theprevious AlmostEqual. Instead of passing in maxRelativeError as a ratio we pass in the maximum error interms of Units in the Last Place.This specifies how big an error we are willing to accept in terms of the valueof the least significant digit of the floating point number’s representation. maxUlps can also be interpreted interms of how many representable floats we are willing to accept between A andB. This function will allow maxUlps1 floats between A and B.
If two numbers are identical except for a onebit differencein the last digit of their mantissa then this function will calculate intDiff as one.
If one number is the maximum number for a particularexponent – perhaps 1.99999988 – and the other number is the smallest number forthe next exponent – 2.0 – then this function will again calculate intDiff as one.
In both cases the two numbers are the closest floats thereare.
There is not a completely direct translation between maxRelativeError and maxUlps. Fora normal float number a maxUlps of 1 is equivalent toa maxRelativeError of between 1/8,000,000 and 1/16,000,000.The variance is because the accuracy of a float varies slightly depending onwhether it is near the top or bottom of its current exponent’s range. This canbe seen in the chart of numbers near 2.0 – the gap between numbers just above2.0 is twice as big as the gap between numbers just below 2.0.
Our AlmostEqualUlps functionstarts by checking whether A and B are equal – justlike AlmostEqualRelative did, but for a differentreason that will be discussed below.
Compiler issues
In our last version of AlmostEqualUlpswe use pointers and casting to tell the compiler to treat the inmemoryrepresentation of a float as an int. There are a couple of things that can gowrong with this. One risk is that int and float mightnot be the same size. A float should be 32 bits, but an intcan be almost any size. This is certainly something to be aware of, but everymodern compiler that I am aware of has 32bit ints.If your compiler has ints of a different size, find a32bit integral type and use it instead.
Another complication comes from aliasing optimizations.Strictly speaking the C/C++ standard says that the compiler can assume thatdifferent types do not overlap in memory (with a few exceptions such as charpointers). For instance, it is allowed to assume that a pointer to an int and a pointer to a float do not point to overlappingmemory. This opens up lots of worthwhile optimizations, but for code thatviolates this rule—which is quite common—it leads to undefined results. Inparticular, some versions of g++ default to very strict aliasing rules, anddon’t like the techniques used in AlmostEqualUlps.
Luckily g++ knows that there will be a problem, and it givesthis warning:
warning: dereferencing typepunned pointer willbreak strictaliasing rules
There are two possible solutions if you encounter thisproblem. Turn off the strict aliasing option using the fnostrictaliasingswitch, or use a union between a float and an int toimplement the reinterpretation of a float as an int. The documentation for fstrictaliasing gives more information.
Complications
Floating point math is never simple. AlmostEqualUlpsdoesn’t properly deal with all the peculiar types of floating point numbers.Whether it deals with them well enough depends on how you want to use it, butan improved version will often be needed.
IEEE floating point numbers fall into a few categories:
 Zeroes
 Subnormals
 Normal numbers
 Infinities
 NANs
Zeroes
AlmostEqual is designed to dealwith normal numbers, and it is there that it behaves its best. Its firstproblem is when dealing with zeroes. IEEE floats can have both positive andnegative zeroes. If you compare them as floats they are equal, but theirinteger representations are quite different – positive 0.0 is an integer zero, butnegative zero is 0x80000000! (in decimal this is 2147483648).The chart below shows the positive and negative floats closest to zero,together with their integer representations.

Representation 

Float value 
Hexadecimal 
Decimal 
+4.2038954e045 
0x00000003 
3 
+2.8025969e045 
0x00000002 
2 
+1.4012985e045 
0x00000001 
1 
+0.00000000 
0x00000000 
0 
0.00000000 
0x80000000 
2147483648 
1.4012985e045 
0x80000001 
2147483647 
2.8025969e045 
0x80000002 
2147483646 
4.2038954e045 
0x80000003 
2147483645 
In AlmostEqualUlps I deal withthis by checking for A and B being exactly equal, thus handling the case whereone input is positive zero and the other is negative zero. However this stillisn’t perfect. With this implementation positive zero and the smallest positivesubnormal will be calculated as being one ulp apart, and therefore will generally count as being equal.However negative zero and the smallest positive subnormal will be counted asbeing far apart, thus destroying the idea that positive and negative zero areidentical.
A more general way of handling negative numbers is to adjustthem so that they are lexicographically ordered as twoscomplement integersinstead of as signed magnitude integers. This is done by detecting negativenumbers and subtracting them from 0x80000000. This maps negative zero to aninteger zero representation – making it identical to positive zero – and itmakes it so that the smallest negative number is represented by negative one,and downwards from there. With this change the representations of our numbersaround zero look much more rational.
Remapping for twos complement 


Representation 

Float value 
Hexadecimal 
Decimal 
+4.2038954e045 
0x00000003 
3 
+2.8025969e045 
0x00000002 
2 
+1.4012985e045 
0x00000001 
1 
+0.00000000 
0x00000000 
0 
0.00000000 
0x00000000 
0 
1.4012985e045 
0xFFFFFFFF 
1 
2.8025969e045 
0xFFFFFFFE 
2 
4.2038954e045 
0xFFFFFFFD 
3 
Once we have made this adjustment we can no longer treat ournumbers as IEEE floats – the values of the negative numbers will bedramatically altered – but we can compare them as intsmore easily, in our new and convenient representation.
This technique has the additional advantage that now thedistance between numbers can be measured across the zero boundary.That is, the smallest subnormal positive number and the smallest subnormalnegative number will now compare as being very close – just a few ulps away.This is probably a good thing – it’s equivalent toadding an absolute error check to the relative error check. Code to implementthis technique looks like this:
// Usable AlmostEqual function
bool AlmostEqual2sComplement(float A, float B, int maxUlps)
{
// Make sure maxUlps is nonnegative and small enough that the
// default NAN won't compare as equalto anything.
assert(maxUlps > 0 && maxUlps < 4 * 1024 * 1024);
int aInt = *(int*)&A;
// Make aInt lexicographically ordered as a twoscomplement int
if (aInt < 0)
aInt = 0x80000000  aInt;
// Make bInt lexicographically ordered as a twoscomplement int
int bInt = *(int*)&B;
if (bInt < 0)
bInt = 0x80000000  bInt;
int intDiff = abs(aInt  bInt);
if (intDiff <= maxUlps)
return true;
return false;
}
Subnormals
The next potential issue is subnormals, also known as denormals.Subnormals are numbers that are so small that they cannot be normalized. Thislack of normalization means that they have less precision – the closer they getto zero, the less precision they have. This means that when comparing twosubnormals, an error of one ulp can imply asignificant relative error – as great as 100%. However the interpretation ofthe ulps error as a measure of the number of representable floats between thenumbers remains. Thus, this variation in the relativeErrorinterpretation is probably a good thing – yet another advantage to thistechnique of comparing floating point numbers.
Infinities
IEEE floating point numbers have a special representationfor infinities. These are used for overflows and for the result of divide byzeroes. The representation for infinities is adjacent to the representation forthe largest normal number. Thus, the AlmostEqualUlpsroutine will say that FLT_MAX and infinity are almost the same. This isreasonable in some sense – after all, there are no representable floats betweenthem – but horribly inaccurate in another sense – after all, no finite numberis ‘close’ to infinity.
If treating infinity as being ‘close’ to FLT_MAX is undesirablethen an extra check is needed.
NANs
IEEE floating point numbers have a series of representationsfor NANs. These representations – sharing an exponent with infinity but markedby their nonzero mantissa – are numerically adjacent to the infinities whencompared as ints. Therefore it is possible for aninfinite result, or a FLT_MAX result, to compare as being very close to a NAN. If your code produces NANresults then this could be very bad. However, two things protect against thisproblem. One is that most floating point code is designed to not produce NANs,and in fact most floating point code should treat NANs as an error by enablingfloating point divide by zero and floating point illegal operation exceptions.The other reason this should not be an issue is that usually only one NAN value is generated. On x87 compatible processors thisvalue is 0xFFC00000, which has a value separated by four million from thenearest infinity. This value is particularly well placed because another riskwith NAN comparisons is that they could wraparound. A NAN with a value of 0xFFFFFFFF couldcompare as being very close to zero. The translation of negative numbers usedby AlmostEqual2sComplement avoids this by moving the NANs where they can onlywrap around to each other, but the NAN value 0xFFC00000 also avoids thisproblem since it keeps the NAN value fourmillion ulps away from wrapping around.
One other complication is that comparisons involving NANsare always supposed to return false, but AlmostEqual2sComplement will say thattwo NANs are equal to each other if they have the same integer representation.If you rely on correct NAN comparisons youhave to add extra checks.
Limitations
maxUlpscannot be arbitrarily large. If maxUlps is fourmillion or greater then there is a risk of finding large negative floats equalto NANs. If maxUlps is sixteen million or greaterthen the largest positive floats will compare as equal to the largest negativefloats.
As a practical matter such large maxUlpsvalues should not be needed. A maxUlps of sixteenmillion means that numbers 100% larger and 50% smaller should count as equal. AmaxUlps of four million means that numbers 25% largerand 12.5% smaller should count as equal. If these large maxUlpsvalues are needed then separate checking for wraparound above infinity to NANsor numbers of the opposite sign will be needed. To prevent accidental usage ofhuge maxUlps values the comparison routines assertthat maxUlps is in a safe range.
AlmostEqual2sComplement is very reliant on the IEEE floatingpoint math format, and assumes twoscomplement integers of the same size. Theselimitations are the norm on the majority of machines, especially consumermachines, but there are machines out there that use different formats. For thisreason, and because the techniques used are tricky and nonobvious, it isimportant to encapsulate the behavior in a function where appropriatedocumentation, asserts, and conditional checks can be placed.
Summary
AlmostEqual2sComplement is an effective way of handlingfloating point comparisons. Its behavior does not map perfectly to AlmostEqualRelative,but in many ways its behavior is arguably superior. To summarize, AlmostEqual2sComplementhas these characteristics:
 Measures whether two floats are ‘close’ to each other, where close is defined by ulps, also interpreted as how many floats there are inbetween the numbers
 Treats infinity as being close to FLT_MAX
 Treats NANs as being four million ulps away from everything (assuming the default NAN value for x87), except other NANs
 Accepts greater relative error as numbers gradually underflow to subnormals
 Treats tiny negative numbers as being close to tiny positive numbers
If the special characteristics of AlmostEqual2sComplementare not desirable then they can selectively be checked for. A version with thenecessary checks, #ifdefed for easy control of thebehavior, is available here.
AlmostEqual2sComplement works best on machines that cantransfer values quickly between the floating point and integer units. Thisoften requires going through memory and can be quite slow. This function can beimplemented efficiently on machines with vector units that can do integer orfloating point operations on the same registers. This allows reinterpreting thevalues without going through memory.
The same techniques can be applied to doubles, mapping themto 64bit integers. Because doubles have a 53bit mantissa a one ulp error implies a relative error of between1/4,000,000,000,000,000 and 1/8,000,000,000,000,000.
References
IEEEStandard 754 Floating Point Numbers by SteveHollasch
LectureNotes on the Status of IEEE Standard 754 for Binary FloatingPoint Arithmeticby William Kahan