64 bit IEEE 754: Decimal ↗ Double Precision Floating Point Binary: 49.000 000 5

Convert the Number to 64 Bit Double Precision IEEE 754 Binary Floating Point Representation Standard, From a Base Ten Decimal System Number

Number 49.000 000 5(10) converted and written in 64 bit double precision IEEE 754 binary floating point representation (1 bit for sign, 11 bits for exponent, 52 bits for mantissa)

1. First, convert to binary (in base 2) the integer part: 49.
Divide the number repeatedly by 2.

Keep track of each remainder.

We stop when we get a quotient that is equal to zero.


  • division = quotient + remainder;
  • 49 ÷ 2 = 24 + 1;
  • 24 ÷ 2 = 12 + 0;
  • 12 ÷ 2 = 6 + 0;
  • 6 ÷ 2 = 3 + 0;
  • 3 ÷ 2 = 1 + 1;
  • 1 ÷ 2 = 0 + 1;

2. Construct the base 2 representation of the integer part of the number.

Take all the remainders starting from the bottom of the list constructed above.


49(10) =


11 0001(2)
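
The repeated division in steps 1 and 2 can be sketched in Python as follows. The function name `int_to_binary` is an assumption made for this illustration; it is not part of the original page.

```python
def int_to_binary(n: int) -> str:
    """Convert a non-negative integer to base 2 by repeated division by 2."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, remainder = divmod(n, 2)  # quotient and remainder of n / 2
        remainders.append(str(remainder))
    # The remainders are read from the bottom of the list to the top.
    return "".join(reversed(remainders))


integer_bits = int_to_binary(49)
print(integer_bits)
```

For 49 this yields the same six digits as the list above, 11 0001.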


3. Convert to binary (base 2) the fractional part: 0.000 000 5.

Multiply it repeatedly by 2.


Keep track of each integer part of the results.


Stop when we get a fractional part that is equal to zero.


  • #) multiplication = integer part + fractional part;
  • 1) 0.000 000 5 × 2 = 0 + 0.000 001;
  • 2) 0.000 001 × 2 = 0 + 0.000 002;
  • 3) 0.000 002 × 2 = 0 + 0.000 004;
  • 4) 0.000 004 × 2 = 0 + 0.000 008;
  • 5) 0.000 008 × 2 = 0 + 0.000 016;
  • 6) 0.000 016 × 2 = 0 + 0.000 032;
  • 7) 0.000 032 × 2 = 0 + 0.000 064;
  • 8) 0.000 064 × 2 = 0 + 0.000 128;
  • 9) 0.000 128 × 2 = 0 + 0.000 256;
  • 10) 0.000 256 × 2 = 0 + 0.000 512;
  • 11) 0.000 512 × 2 = 0 + 0.001 024;
  • 12) 0.001 024 × 2 = 0 + 0.002 048;
  • 13) 0.002 048 × 2 = 0 + 0.004 096;
  • 14) 0.004 096 × 2 = 0 + 0.008 192;
  • 15) 0.008 192 × 2 = 0 + 0.016 384;
  • 16) 0.016 384 × 2 = 0 + 0.032 768;
  • 17) 0.032 768 × 2 = 0 + 0.065 536;
  • 18) 0.065 536 × 2 = 0 + 0.131 072;
  • 19) 0.131 072 × 2 = 0 + 0.262 144;
  • 20) 0.262 144 × 2 = 0 + 0.524 288;
  • 21) 0.524 288 × 2 = 1 + 0.048 576;
  • 22) 0.048 576 × 2 = 0 + 0.097 152;
  • 23) 0.097 152 × 2 = 0 + 0.194 304;
  • 24) 0.194 304 × 2 = 0 + 0.388 608;
  • 25) 0.388 608 × 2 = 0 + 0.777 216;
  • 26) 0.777 216 × 2 = 1 + 0.554 432;
  • 27) 0.554 432 × 2 = 1 + 0.108 864;
  • 28) 0.108 864 × 2 = 0 + 0.217 728;
  • 29) 0.217 728 × 2 = 0 + 0.435 456;
  • 30) 0.435 456 × 2 = 0 + 0.870 912;
  • 31) 0.870 912 × 2 = 1 + 0.741 824;
  • 32) 0.741 824 × 2 = 1 + 0.483 648;
  • 33) 0.483 648 × 2 = 0 + 0.967 296;
  • 34) 0.967 296 × 2 = 1 + 0.934 592;
  • 35) 0.934 592 × 2 = 1 + 0.869 184;
  • 36) 0.869 184 × 2 = 1 + 0.738 368;
  • 37) 0.738 368 × 2 = 1 + 0.476 736;
  • 38) 0.476 736 × 2 = 0 + 0.953 472;
  • 39) 0.953 472 × 2 = 1 + 0.906 944;
  • 40) 0.906 944 × 2 = 1 + 0.813 888;
  • 41) 0.813 888 × 2 = 1 + 0.627 776;
  • 42) 0.627 776 × 2 = 1 + 0.255 552;
  • 43) 0.255 552 × 2 = 0 + 0.511 104;
  • 44) 0.511 104 × 2 = 1 + 0.022 208;
  • 45) 0.022 208 × 2 = 0 + 0.044 416;
  • 46) 0.044 416 × 2 = 0 + 0.088 832;
  • 47) 0.088 832 × 2 = 0 + 0.177 664;
  • 48) 0.177 664 × 2 = 0 + 0.355 328;
  • 49) 0.355 328 × 2 = 0 + 0.710 656;
  • 50) 0.710 656 × 2 = 1 + 0.421 312;
  • 51) 0.421 312 × 2 = 0 + 0.842 624;
  • 52) 0.842 624 × 2 = 1 + 0.685 248;
  • 53) 0.685 248 × 2 = 1 + 0.370 496;

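No fractional part came out equal to zero. However, the 53 iterations already exceed the 52 bit mantissa limit and at least one integer part is different from zero, so we stop here. The bits beyond this point are discarded, which means the binary value is only an approximation of the decimal one.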


4. Construct the base 2 representation of the fractional part of the number.

Take all the integer parts of the multiplying operations, starting from the top of the constructed list above:


0.000 000 5(10) =


0.0000 0000 0000 0000 0000 1000 0110 0011 0111 1011 1101 0000 0101 1(2)
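
Steps 3 and 4 can be reproduced with exact rational arithmetic, which avoids introducing extra rounding while the digits are being generated. This is a minimal sketch; the function name `frac_to_binary` and the `max_bits` parameter are assumptions chosen for the illustration.

```python
from fractions import Fraction


def frac_to_binary(frac: Fraction, max_bits: int = 53) -> str:
    """Return up to max_bits binary digits of a fraction in [0, 1)."""
    bits = []
    while frac != 0 and len(bits) < max_bits:
        frac *= 2
        bit = int(frac)        # integer part of the product: 0 or 1
        bits.append(str(bit))
        frac -= bit            # keep only the fractional part
    return "".join(bits)


# 0.000 000 5 written exactly as 5 / 10 000 000
fraction_bits = frac_to_binary(Fraction(5, 10_000_000))
print(fraction_bits)
```

The loop stops after 53 digits, matching the 53 iterations listed above.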


5. Positive number before normalization:

49.000 000 5(10) =


11 0001.0000 0000 0000 0000 0000 1000 0110 0011 0111 1011 1101 0000 0101 1(2)

6. Normalize the binary representation of the number.

Shift the binary point 5 positions to the left, so that only one nonzero digit remains to the left of it:


49.000 000 5(10) =


11 0001.0000 0000 0000 0000 0000 1000 0110 0011 0111 1011 1101 0000 0101 1(2) =


11 0001.0000 0000 0000 0000 0000 1000 0110 0011 0111 1011 1101 0000 0101 1(2) × 2^0 =


1.1000 1000 0000 0000 0000 0000 0100 0011 0001 1011 1101 1110 1000 0010 11(2) × 2^5
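
The normalization in step 6 amounts to moving every integer digit except the first one to the right of the point and counting how many were moved. The sketch below assumes a value of at least 1 (so the integer part starts with 1); the helper name `normalize` is hypothetical.

```python
def normalize(int_bits: str, frac_bits: str) -> tuple[str, int]:
    """Return (digits after the leading '1.', exponent) for a value >= 1."""
    # Assumption: int_bits starts with '1', as it does for 49.
    exponent = len(int_bits) - 1           # positions the point moves left
    after_point = int_bits[1:] + frac_bits
    return after_point, exponent


int_bits = "110001"
frac_bits = "0" * 20 + "10000110001101111011110100000101" + "1"
after_point, exponent = normalize(int_bits, frac_bits)
print(exponent, after_point)
```

Numbers smaller than 1 would need the point moved to the right instead (a negative exponent), which this sketch does not cover.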


7. So far, the following elements will feed into the 64 bit double precision IEEE 754 binary floating point representation:

Sign: 0 (a positive number)


Exponent (unadjusted): 5


Mantissa (not normalized):
1.1000 1000 0000 0000 0000 0000 0100 0011 0001 1011 1101 1110 1000 0010 11


8. Adjust the exponent.

Use the 11 bit excess/bias notation:


Exponent (adjusted) =


Exponent (unadjusted) + 2^(11-1) - 1 =


5 + 2^(11-1) - 1 =


(5 + 1 023)(10) =


1 028(10)


9. Convert the adjusted exponent from the decimal (base 10) to 11 bit binary.

Use the same technique of repeatedly dividing by 2:


  • division = quotient + remainder;
  • 1 028 ÷ 2 = 514 + 0;
  • 514 ÷ 2 = 257 + 0;
  • 257 ÷ 2 = 128 + 1;
  • 128 ÷ 2 = 64 + 0;
  • 64 ÷ 2 = 32 + 0;
  • 32 ÷ 2 = 16 + 0;
  • 16 ÷ 2 = 8 + 0;
  • 8 ÷ 2 = 4 + 0;
  • 4 ÷ 2 = 2 + 0;
  • 2 ÷ 2 = 1 + 0;
  • 1 ÷ 2 = 0 + 1;

10. Construct the base 2 representation of the adjusted exponent.

Take all the remainders starting from the bottom of the list constructed above.


Exponent (adjusted) =


1 028(10) =


100 0000 0100(2)
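
Steps 8 to 10 can be checked in a few lines. The variable names are assumptions for this illustration; `format(..., "011b")` is the standard Python way to print an integer as binary, zero-padded to 11 digits.

```python
EXPONENT_BITS = 11

bias = 2 ** (EXPONENT_BITS - 1) - 1            # 1023 for double precision
unadjusted_exponent = 5
adjusted_exponent = unadjusted_exponent + bias

# Zero-padded 11 bit binary form of the adjusted exponent
exponent_field = format(adjusted_exponent, "011b")
print(adjusted_exponent, exponent_field)
```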


11. Normalize the mantissa.

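a) Remove the leading (leftmost) bit, since it is always 1, together with the binary point, if present.


b) Adjust its length to 52 bits by removing the excess bits from the right. If any removed bit is 1, precision is lost. Strictly, the IEEE 754 default mode rounds to nearest rather than truncating; here the removed bits (0010 11) begin with 0, so both approaches give the same 52 bits.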


Mantissa (normalized) =


1. 1000 1000 0000 0000 0000 0000 0100 0011 0001 1011 1101 1110 1000 0010 11 =


1000 1000 0000 0000 0000 0000 0100 0011 0001 1011 1101 1110 1000
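
The cut to 52 bits in step 11 can be sketched as below. The check on the first dropped bit is a simplification: it only confirms that, for this number, plain truncation agrees with rounding to nearest. It is not a general rounding routine, and the variable names are assumptions.

```python
MANTISSA_BITS = 52

# 58 digits after the leading "1.", copied from the normalized number
after_point = (
    "1000" "1000" "0000" "0000" "0000" "0000" "0100"
    "0011" "0001" "1011" "1101" "1110" "1000" "0010" "11"
)

kept = after_point[:MANTISSA_BITS]
dropped = after_point[MANTISSA_BITS:]

# If the first dropped bit is 0, the discarded tail is worth less than half
# of the last kept bit, so rounding to nearest leaves `kept` unchanged.
truncation_matches_nearest = dropped.startswith("0")
print(kept, dropped, truncation_matches_nearest)
```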


12. The three elements that make up the number's 64 bit double precision IEEE 754 binary floating point representation:

Sign (1 bit) =
0 (a positive number)


Exponent (11 bits) =
100 0000 0100


Mantissa (52 bits) =
1000 1000 0000 0000 0000 0000 0100 0011 0001 1011 1101 1110 1000


The base ten decimal number 49.000 000 5 converted and written in 64 bit double precision IEEE 754 binary floating point representation:
0 - 100 0000 0100 - 1000 1000 0000 0000 0000 0000 0100 0011 0001 1011 1101 1110 1000
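
As a cross-check, the result can be compared with the bit pattern Python itself stores for the literal `49.0000005`. This sketch uses the standard `struct` module with the big-endian double format `">d"`; the variable names are assumptions for the illustration.

```python
import struct

raw = struct.pack(">d", 49.0000005)            # 8 bytes, big-endian double
bits = "".join(f"{byte:08b}" for byte in raw)  # 64 character bit string

sign = bits[0]
exponent = bits[1:12]
mantissa = bits[12:]
print(sign, exponent, mantissa)
print(raw.hex())
```

The three fields should agree with the sign, exponent, and mantissa derived by hand above.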
