Atmel CAVR-4 User Manual

Basic data types

AVR® IAR C/C++ Compiler Reference Guide

* Depends on whether the --64bit_doubles option is used, see --64bit_doubles, page 201.

The type long double uses the same precision as double.
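Because the size of double depends on a compiler option, code that relies on a particular precision may want to check it. The following is a minimal sketch; the function name double_width_bits is hypothetical, and the expected results (presumably 32 bits without --64bit_doubles and 64 bits with it) are an assumption about the compiler described here, not something this sketch can guarantee.

```c
#include <limits.h>

/* Hypothetical helper: width of the double type, in bits, for the
   current build. Under the compiler described here this is assumed
   to depend on the --64bit_doubles option. */
static unsigned double_width_bits(void)
{
    return (unsigned)(sizeof(double) * CHAR_BIT);
}
```

Comparing sizeof(long double) with sizeof(double) in the same way is one way to confirm that both types share a format on a given build.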

32-bit floating-point format

The representation of a 32-bit floating-point number as an integer is:

The value of the number is:

(-1)^S * 2^(Exponent-127) * 1.Mantissa

where S is the sign bit.

The precision of the float operators (+, -, *, and /) is approximately 7 decimal digits.
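The field layout of the 32-bit format can be illustrated by splitting a float into its sign, exponent, and mantissa. This is a minimal sketch, assuming that float uses the 32-bit format described above; the type float32_fields and the function float32_decode are hypothetical names introduced for illustration.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical helper type: the three fields of a 32-bit float. */
typedef struct {
    uint32_t sign;      /* bit 31 */
    uint32_t exponent;  /* bits 30-23, biased by 127 */
    uint32_t mantissa;  /* bits 22-0, without the implicit leading 1 */
} float32_fields;

/* Assumes that float is the 32-bit format described above. */
static float32_fields float32_decode(float value)
{
    uint32_t bits;
    float32_fields f;

    /* memcpy reads the bit pattern without violating aliasing rules. */
    memcpy(&bits, &value, sizeof bits);

    f.sign     = bits >> 31;
    f.exponent = (bits >> 23) & 0xFFu;
    f.mantissa = bits & 0x7FFFFFu;
    return f;
}
```

For example, float32_decode(1.0f) gives sign 0, exponent 127, and mantissa 0, which matches (-1)^0 * 2^(127-127) * 1.0.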

64-bit floating-point format

The representation of a 64-bit floating-point number as an integer is:

The value of the number is:

(-1)^S * 2^(Exponent-1023) * 1.Mantissa

where S is the sign bit.

The precision of the float operators (+, -, *, and /) is approximately 15 decimal digits.
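The value formula for the 64-bit format can be checked by rebuilding a number from its three fields. This is a minimal sketch, assuming that double is the 64-bit format (which, with the compiler described here, depends on --64bit_doubles); the function float64_from_fields is a hypothetical name, and it only handles normal numbers.

```c
#include <math.h>
#include <stdint.h>

/* Hypothetical helper: rebuilds a normal 64-bit value from its fields
   using (-1)^S * 2^(Exponent-1023) * 1.Mantissa.
   Valid only for normal numbers (Exponent in the range 1..2046). */
static double float64_from_fields(unsigned sign, unsigned exponent,
                                  uint64_t mantissa)
{
    /* 1.Mantissa: the 52 stored bits follow an implicit leading 1.
       The divisor is 2^52. */
    double significand = 1.0 + (double)mantissa / 4503599627370496.0;

    /* Scale by 2^(Exponent-1023). */
    double magnitude = ldexp(significand, (int)exponent - 1023);

    return sign ? -magnitude : magnitude;
}
```

For example, float64_from_fields(0, 1023, 0) evaluates to 1.0, and a sign of 1 with exponent 1024 and only mantissa bit 50 set evaluates to -2.5.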

Special cases

The following applies to both 32-bit and 64-bit floating-point formats:

● Zero is represented by zero mantissa and exponent. The sign bit signifies positive or negative zero.

● Infinity is represented by setting the exponent to the highest value and the mantissa to zero. The sign bit signifies positive or negative infinity.

● Not a number (NaN) is represented by setting the exponent to the highest positive value and the mantissa to a non-zero value. The value of the sign bit is ignored.

● Subnormal numbers are used for representing values smaller than what can be represented by normal values. The drawback is that the precision decreases with smaller values. The exponent is set to 0 to signify that the number is denormalized, even though the number is treated as if the exponent were 1. Unlike normal numbers, denormalized numbers do not have an implicit 1 as the most significant bit (MSB) of the mantissa.
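The special cases above can be told apart by inspecting the exponent and mantissa fields. This is a minimal sketch for the 32-bit format, assuming that float uses that format; the names float32_class, float32_from_bits, and float32_classify are hypothetical and not part of any library.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical classification of a 32-bit float by its bit fields. */
typedef enum {
    F32_ZERO,
    F32_SUBNORMAL,
    F32_NORMAL,
    F32_INFINITY,
    F32_NAN
} float32_class;

/* Builds a float from a raw 32-bit pattern (useful for experiments). */
static float float32_from_bits(uint32_t bits)
{
    float value;
    memcpy(&value, &bits, sizeof value);
    return value;
}

static float32_class float32_classify(float value)
{
    uint32_t bits, exponent, mantissa;

    memcpy(&bits, &value, sizeof bits);
    exponent = (bits >> 23) & 0xFFu;   /* bits 30-23 */
    mantissa = bits & 0x7FFFFFu;       /* bits 22-0 */

    if (exponent == 0u)                /* lowest exponent value */
        return mantissa == 0u ? F32_ZERO : F32_SUBNORMAL;
    if (exponent == 0xFFu)             /* highest exponent value */
        return mantissa == 0u ? F32_INFINITY : F32_NAN;
    return F32_NORMAL;
}
```

The sign bit is deliberately not examined, so positive and negative zero both classify as F32_ZERO, and both infinities as F32_INFINITY. The same approach would apply to the 64-bit format with an 11-bit exponent (highest value 0x7FF) and a 52-bit mantissa.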

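Bit layout of the 32-bit format: S (bit 31), Exponent (bits 30-23), Mantissa (bits 22-0)

Bit layout of the 64-bit format: S (bit 63), Exponent (bits 62-52), Mantissa (bits 51-0)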