Atmel CAVR-4 User Manual

Basic data types
AVR® IAR C/C++ Compiler Reference Guide
* Depends on whether the --64bit_doubles option is used, see
 
The type long double uses the same precision as double.
32-bit floating-point format
The representation of a 32-bit floating-point number as an integer is:
The value of the number is:
(-1)^S * 2^(Exponent-127) * 1.Mantissa
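As an illustration of the formula above, the following C sketch splits a float into S, Exponent, and Mantissa and evaluates the expression for a normal number. The names float32_fields, float32_split, float32_value, and pow2 are hypothetical helpers, not part of the compiler or its libraries, and the code assumes that float uses the 32-bit format described here.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Fields of a 32-bit floating-point number, named as in the text. */
typedef struct {
    uint32_t sign;      /* S, bit 31 */
    uint32_t exponent;  /* Exponent, bits 30..23 */
    uint32_t mantissa;  /* Mantissa, bits 22..0 */
} float32_fields;

/* Split a float into its three fields (hypothetical helper). */
static float32_fields float32_split(float f)
{
    uint32_t bits;
    float32_fields out;

    assert(sizeof f == sizeof bits);   /* assumption: float is 32 bits wide */
    memcpy(&bits, &f, sizeof bits);    /* copy the bit pattern into an integer */

    out.sign     = bits >> 31;
    out.exponent = (bits >> 23) & 0xFFu;
    out.mantissa = bits & 0x7FFFFFu;
    return out;
}

/* Compute 2^e by repeated multiplication or division, avoiding <math.h>. */
static double pow2(int e)
{
    double r = 1.0;
    for (; e > 0; --e) r *= 2.0;
    for (; e < 0; ++e) r /= 2.0;
    return r;
}

/* Evaluate (-1)^S * 2^(Exponent-127) * 1.Mantissa for a normal number. */
static double float32_value(float32_fields p)
{
    double significand = 1.0 + (double)p.mantissa / 8388608.0;  /* 8388608 = 2^23 */
    double magnitude = significand * pow2((int)p.exponent - 127);
    return p.sign ? -magnitude : magnitude;
}
```

For example, -6.5 is -1.625 * 2^2, so float32_split(-6.5f) yields S = 1, Exponent = 129, and Mantissa = 0x500000, and float32_value returns -6.5 for those fields.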
The precision of the float operators (+, -, *, and /) is approximately 7 decimal digits.
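One way to see where the figure of about 7 decimal digits comes from is to look at the spacing between neighboring 32-bit values. The sketch below is only an illustration: float32_next_up is a hypothetical helper, it assumes a positive, finite float in the 32-bit format described here, and it shows the resolution of the representation rather than the behavior of the compiler's arithmetic routines.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Return the next representable float above a positive, finite value
   by incrementing its bit pattern (hypothetical helper). */
static float float32_next_up(float f)
{
    uint32_t bits;

    assert(sizeof f == sizeof bits);   /* assumption: float is 32 bits wide */
    memcpy(&bits, &f, sizeof bits);
    bits += 1u;                        /* next mantissa step, carrying into the exponent if needed */
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

The value following 1.0f is 1 + 2^-23, a relative step of about 1.2e-7. The value following 16777216.0f (2^24) is 16777218.0f, so 16777217 cannot be represented exactly.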
64-bit floating-point format
The representation of a 64-bit floating-point number as an integer is:
The value of the number is:
(-1)^S * 2^(Exponent-1023) * 1.Mantissa
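The same layout can be read in the other direction. The following C sketch assembles a 64-bit number from S, Exponent, and Mantissa. The function float64_pack is a hypothetical helper, and the code assumes that double is 64 bits wide (on this compiler that depends on the --64bit_doubles option) and that integers and floating-point values share the same byte order.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Assemble a 64-bit floating-point number from its three fields
   (hypothetical helper; assumes double is 64 bits wide). */
static double float64_pack(unsigned sign, unsigned exponent, uint64_t mantissa)
{
    uint64_t bits;
    double d;

    assert(sizeof d == sizeof bits);
    bits  = (uint64_t)(sign & 1u) << 63;            /* S, bit 63 */
    bits |= (uint64_t)(exponent & 0x7FFu) << 52;    /* Exponent, bits 62..52 */
    bits |= mantissa & (((uint64_t)1 << 52) - 1);   /* Mantissa, bits 51..0 */
    memcpy(&d, &bits, sizeof d);                    /* copy the bit pattern into a double */
    return d;
}
```

For example, float64_pack(0, 1023, 0) gives 1.0, because 2^(1023-1023) * 1.0 = 1, and float64_pack(1, 1024, (uint64_t)1 << 51) gives -3.0, because the mantissa contributes 0.5 and -(2^1 * 1.5) = -3.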
The precision of the float operators (+, -, *, and /) is approximately 15 decimal digits.
Special cases
The following applies to both 32-bit and 64-bit floating-point formats:
Zero is represented by zero mantissa and exponent. The sign bit signifies positive or 
negative zero.
Infinity is represented by setting the exponent to the highest value and the mantissa 
to zero. The sign bit signifies positive or negative infinity.
Not a number (NaN) is represented by setting the exponent to the highest positive
value and the mantissa to a non-zero value. The value of the sign bit is ignored.
Subnormal numbers are used for representing values smaller than what can be
represented by normal values. The drawback is that the precision decreases with
smaller values. The exponent is set to 0 to signify that the number is subnormal
(denormalized), even though the number is treated as if the exponent were 1. Unlike
normal numbers, subnormal numbers do not have an implicit 1 as the most
significant bit (MSB) of the mantissa.
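The special cases above can be told apart by inspecting the exponent and mantissa fields. The following C sketch does this for the 32-bit format. The enum and the functions float32_classify, float32_classify_float, and float32_subnormal_value are hypothetical names used for illustration only, and the subnormal value is computed as (-1)^S * 2^(1-127) * 0.Mantissa, following the description above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

typedef enum { F32_ZERO, F32_SUBNORMAL, F32_NORMAL, F32_INFINITY, F32_NAN } f32_class;

/* Classify a 32-bit pattern by its Exponent and Mantissa fields. */
static f32_class float32_classify(uint32_t bits)
{
    uint32_t exponent = (bits >> 23) & 0xFFu;
    uint32_t mantissa = bits & 0x7FFFFFu;

    if (exponent == 0xFFu)                 /* highest exponent value */
        return mantissa == 0 ? F32_INFINITY : F32_NAN;
    if (exponent == 0)                     /* zero or subnormal */
        return mantissa == 0 ? F32_ZERO : F32_SUBNORMAL;
    return F32_NORMAL;
}

/* Same classification, starting from a float (assumed to be 32 bits wide). */
static f32_class float32_classify_float(float f)
{
    uint32_t bits;

    assert(sizeof f == sizeof bits);
    memcpy(&bits, &f, sizeof bits);
    return float32_classify(bits);
}

/* Value of a subnormal pattern: no implicit 1, exponent treated as 1. */
static double float32_subnormal_value(uint32_t bits)
{
    double magnitude = (double)(bits & 0x7FFFFFu) / 8388608.0;  /* 0.Mantissa */
    int i;

    for (i = 0; i < 126; ++i)              /* scale by 2^(1-127) */
        magnitude /= 2.0;
    return (bits >> 31) ? -magnitude : magnitude;
}
```

For example, the pattern 0x7F800000 is classified as infinity, 0x7FC00000 as NaN, and 0x80000000 as (negative) zero. The smallest subnormal pattern, 0x00000001, has the value 2^-149. The value computation assumes a double wide enough to hold that result exactly, such as a 64-bit double.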
Bit layout of the 32-bit format: S (sign) in bit 31, Exponent in bits 30-23, Mantissa in bits 22-0.
Bit layout of the 64-bit format: S (sign) in bit 63, Exponent in bits 62-52, Mantissa in bits 51-0.