Convert the base ten decimal number 0.120 000 000 000 000 01 to 64 bit double precision IEEE 754 binary floating point representation

Number 0.120 000 000 000 000 01(10) converted and written in 64 bit double precision IEEE 754 binary floating point representation (1 bit for sign, 11 bits for exponent, 52 bits for mantissa)

1. First, convert to binary (in base 2) the integer part: 0.
Divide the number repeatedly by 2.

Keep track of each remainder.

We stop when we get a quotient that is equal to zero.


  • division = quotient + remainder;
  • 0 ÷ 2 = 0 + 0;

2. Construct the base 2 representation of the integer part of the number.

Take all the remainders starting from the bottom of the list constructed above.


0(10) =


0(2)
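The repeated division in steps 1 and 2 can be sketched in Python as follows. The function name int_to_binary is an illustrative choice, not something defined on this page.

```python
def int_to_binary(n: int) -> str:
    """Convert a non-negative integer to a base 2 string by repeated division by 2."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, remainder = divmod(n, 2)  # quotient and remainder of the division by 2
        remainders.append(str(remainder))
    # Read the remainders starting from the bottom of the list (last one first).
    return "".join(reversed(remainders))


print(int_to_binary(0))  # 0
print(int_to_binary(6))  # 110
```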


3. Convert to binary (base 2) the fractional part: 0.120 000 000 000 000 01.

Multiply it repeatedly by 2.


Keep track of each integer part of the results.


Stop when we get a fractional part that is equal to zero.


  • #) multiplying = integer + fractional part;
  • 1) 0.120 000 000 000 000 01 × 2 = 0 + 0.240 000 000 000 000 02;
  • 2) 0.240 000 000 000 000 02 × 2 = 0 + 0.480 000 000 000 000 04;
  • 3) 0.480 000 000 000 000 04 × 2 = 0 + 0.960 000 000 000 000 08;
  • 4) 0.960 000 000 000 000 08 × 2 = 1 + 0.920 000 000 000 000 16;
  • 5) 0.920 000 000 000 000 16 × 2 = 1 + 0.840 000 000 000 000 32;
  • 6) 0.840 000 000 000 000 32 × 2 = 1 + 0.680 000 000 000 000 64;
  • 7) 0.680 000 000 000 000 64 × 2 = 1 + 0.360 000 000 000 001 28;
  • 8) 0.360 000 000 000 001 28 × 2 = 0 + 0.720 000 000 000 002 56;
  • 9) 0.720 000 000 000 002 56 × 2 = 1 + 0.440 000 000 000 005 12;
  • 10) 0.440 000 000 000 005 12 × 2 = 0 + 0.880 000 000 000 010 24;
  • 11) 0.880 000 000 000 010 24 × 2 = 1 + 0.760 000 000 000 020 48;
  • 12) 0.760 000 000 000 020 48 × 2 = 1 + 0.520 000 000 000 040 96;
  • 13) 0.520 000 000 000 040 96 × 2 = 1 + 0.040 000 000 000 081 92;
  • 14) 0.040 000 000 000 081 92 × 2 = 0 + 0.080 000 000 000 163 84;
  • 15) 0.080 000 000 000 163 84 × 2 = 0 + 0.160 000 000 000 327 68;
  • 16) 0.160 000 000 000 327 68 × 2 = 0 + 0.320 000 000 000 655 36;
  • 17) 0.320 000 000 000 655 36 × 2 = 0 + 0.640 000 000 001 310 72;
  • 18) 0.640 000 000 001 310 72 × 2 = 1 + 0.280 000 000 002 621 44;
  • 19) 0.280 000 000 002 621 44 × 2 = 0 + 0.560 000 000 005 242 88;
  • 20) 0.560 000 000 005 242 88 × 2 = 1 + 0.120 000 000 010 485 76;
  • 21) 0.120 000 000 010 485 76 × 2 = 0 + 0.240 000 000 020 971 52;
  • 22) 0.240 000 000 020 971 52 × 2 = 0 + 0.480 000 000 041 943 04;
  • 23) 0.480 000 000 041 943 04 × 2 = 0 + 0.960 000 000 083 886 08;
  • 24) 0.960 000 000 083 886 08 × 2 = 1 + 0.920 000 000 167 772 16;
  • 25) 0.920 000 000 167 772 16 × 2 = 1 + 0.840 000 000 335 544 32;
  • 26) 0.840 000 000 335 544 32 × 2 = 1 + 0.680 000 000 671 088 64;
  • 27) 0.680 000 000 671 088 64 × 2 = 1 + 0.360 000 001 342 177 28;
  • 28) 0.360 000 001 342 177 28 × 2 = 0 + 0.720 000 002 684 354 56;
  • 29) 0.720 000 002 684 354 56 × 2 = 1 + 0.440 000 005 368 709 12;
  • 30) 0.440 000 005 368 709 12 × 2 = 0 + 0.880 000 010 737 418 24;
  • 31) 0.880 000 010 737 418 24 × 2 = 1 + 0.760 000 021 474 836 48;
  • 32) 0.760 000 021 474 836 48 × 2 = 1 + 0.520 000 042 949 672 96;
  • 33) 0.520 000 042 949 672 96 × 2 = 1 + 0.040 000 085 899 345 92;
  • 34) 0.040 000 085 899 345 92 × 2 = 0 + 0.080 000 171 798 691 84;
  • 35) 0.080 000 171 798 691 84 × 2 = 0 + 0.160 000 343 597 383 68;
  • 36) 0.160 000 343 597 383 68 × 2 = 0 + 0.320 000 687 194 767 36;
  • 37) 0.320 000 687 194 767 36 × 2 = 0 + 0.640 001 374 389 534 72;
  • 38) 0.640 001 374 389 534 72 × 2 = 1 + 0.280 002 748 779 069 44;
  • 39) 0.280 002 748 779 069 44 × 2 = 0 + 0.560 005 497 558 138 88;
  • 40) 0.560 005 497 558 138 88 × 2 = 1 + 0.120 010 995 116 277 76;
  • 41) 0.120 010 995 116 277 76 × 2 = 0 + 0.240 021 990 232 555 52;
  • 42) 0.240 021 990 232 555 52 × 2 = 0 + 0.480 043 980 465 111 04;
  • 43) 0.480 043 980 465 111 04 × 2 = 0 + 0.960 087 960 930 222 08;
  • 44) 0.960 087 960 930 222 08 × 2 = 1 + 0.920 175 921 860 444 16;
  • 45) 0.920 175 921 860 444 16 × 2 = 1 + 0.840 351 843 720 888 32;
  • 46) 0.840 351 843 720 888 32 × 2 = 1 + 0.680 703 687 441 776 64;
  • 47) 0.680 703 687 441 776 64 × 2 = 1 + 0.361 407 374 883 553 28;
  • 48) 0.361 407 374 883 553 28 × 2 = 0 + 0.722 814 749 767 106 56;
  • 49) 0.722 814 749 767 106 56 × 2 = 1 + 0.445 629 499 534 213 12;
  • 50) 0.445 629 499 534 213 12 × 2 = 0 + 0.891 258 999 068 426 24;
  • 51) 0.891 258 999 068 426 24 × 2 = 1 + 0.782 517 998 136 852 48;
  • 52) 0.782 517 998 136 852 48 × 2 = 1 + 0.565 035 996 273 704 96;
  • 53) 0.565 035 996 273 704 96 × 2 = 1 + 0.130 071 992 547 409 92;
  • 54) 0.130 071 992 547 409 92 × 2 = 0 + 0.260 143 985 094 819 84;
  • 55) 0.260 143 985 094 819 84 × 2 = 0 + 0.520 287 970 189 639 68;
  • 56) 0.520 287 970 189 639 68 × 2 = 1 + 0.040 575 940 379 279 36;

We didn't get any fractional part that was equal to zero. But we had enough iterations (more than the 52 bit mantissa limit) and at least one integer part that was different from zero, so we stop here, at the cost of some precision.


4. Construct the base 2 representation of the fractional part of the number.

Take all the integer parts of the multiplying operations, starting from the top of the constructed list above:


0.120 000 000 000 000 01(10) =


0.0001 1110 1011 1000 0101 0001 1110 1011 1000 0101 0001 1110 1011 1001(2)
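As a rough sketch, steps 3 and 4 can be reproduced in Python. The function name fraction_to_bits and its max_bits parameter are assumptions made for this example. It uses fractions.Fraction so that the decimal input is held exactly instead of being rounded to a binary float first.

```python
from fractions import Fraction


def fraction_to_bits(value: str, max_bits: int) -> str:
    """Return up to max_bits binary digits of a decimal fraction in [0, 1)."""
    frac = Fraction(value)  # exact rational value of the decimal string
    bits = []
    for _ in range(max_bits):
        if frac == 0:
            break  # the binary expansion terminated exactly
        frac *= 2
        integer_part = int(frac)  # 0 or 1
        bits.append(str(integer_part))
        frac -= integer_part  # keep only the fractional part
    return "".join(bits)


bits = fraction_to_bits("0.12000000000000001", 56)
print(bits)
# 00011110101110000101000111101011100001010001111010111001
```

The 56 iterations match the number of multiplications listed above.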


5. Positive number before normalization:

0.120 000 000 000 000 01(10) =


0.0001 1110 1011 1000 0101 0001 1110 1011 1000 0101 0001 1110 1011 1001(2)

6. Normalize the binary representation of the number.

Shift the binary point 4 positions to the right, so that only one non zero digit remains to the left of it:


0.120 000 000 000 000 01(10) =


0.0001 1110 1011 1000 0101 0001 1110 1011 1000 0101 0001 1110 1011 1001(2) =


0.0001 1110 1011 1000 0101 0001 1110 1011 1000 0101 0001 1110 1011 1001(2) × 2^0 =


1.1110 1011 1000 0101 0001 1110 1011 1000 0101 0001 1110 1011 1001(2) × 2^-4
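The normalization in step 6 amounts to locating the first 1 in the fractional bits. A minimal sketch, assuming the value is below 1 and non zero (the helper name normalize is hypothetical):

```python
def normalize(fraction_bits: str):
    """Normalize 0.<fraction_bits> (base 2) to 1.<rest> x 2**exponent."""
    # str.index raises ValueError if there is no 1 at all (the value is zero).
    first_one = fraction_bits.index("1")  # 0-based position of the leading 1
    exponent = -(first_one + 1)           # positions the binary point moves right
    rest = fraction_bits[first_one + 1:]  # digits that follow the leading 1
    return exponent, rest


fraction_bits = "00011110101110000101000111101011100001010001111010111001"
exponent, rest = normalize(fraction_bits)
print(exponent)   # -4
print(len(rest))  # 52
```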


7. Up to this moment, there are the following elements that would feed into the 64 bit double precision IEEE 754 binary floating point representation:

Sign 0 (a positive number)


Exponent (unadjusted): -4


Mantissa (not normalized):
1.1110 1011 1000 0101 0001 1110 1011 1000 0101 0001 1110 1011 1001


8. Adjust the exponent.

Use the 11 bit excess/bias notation:


Exponent (adjusted) =


Exponent (unadjusted) + 2^(11-1) - 1 =


-4 + 2^(11-1) - 1 =


(-4 + 1 023)(10) =


1 019(10)
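The bias arithmetic of step 8 is short enough to check directly. The variable names below are illustrative only:

```python
EXPONENT_BITS = 11

# Excess/bias value for an 11 bit exponent field: 2^(11-1) - 1
bias = 2 ** (EXPONENT_BITS - 1) - 1

unadjusted_exponent = -4
adjusted_exponent = unadjusted_exponent + bias

print(bias)               # 1023
print(adjusted_exponent)  # 1019
```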


9. Convert the adjusted exponent from the decimal (base 10) to 11 bit binary.

Use the same technique of repeatedly dividing by 2:


  • division = quotient + remainder;
  • 1 019 ÷ 2 = 509 + 1;
  • 509 ÷ 2 = 254 + 1;
  • 254 ÷ 2 = 127 + 0;
  • 127 ÷ 2 = 63 + 1;
  • 63 ÷ 2 = 31 + 1;
  • 31 ÷ 2 = 15 + 1;
  • 15 ÷ 2 = 7 + 1;
  • 7 ÷ 2 = 3 + 1;
  • 3 ÷ 2 = 1 + 1;
  • 1 ÷ 2 = 0 + 1;

10. Construct the base 2 representation of the adjusted exponent.

Take all the remainders starting from the bottom of the list constructed above.


Exponent (adjusted) =


1019(10) =


011 1111 1011(2)
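As a cross-check of steps 9 and 10, Python's built-in format can produce the same 11 bit string without the manual divisions. This is a shortcut, not the repeated division method described above:

```python
adjusted_exponent = 1019

# "011b": binary digits, zero-padded on the left to a width of 11
exponent_bits = format(adjusted_exponent, "011b")

print(exponent_bits)  # 01111111011
```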


11. Normalize the mantissa.

a) Remove the leading (the leftmost) bit, since it's always 1, and the binary point, if present.


b) Adjust its length to 52 bits, only if necessary (not the case here).


Mantissa (normalized) =


1.1110 1011 1000 0101 0001 1110 1011 1000 0101 0001 1110 1011 1001 =


1110 1011 1000 0101 0001 1110 1011 1000 0101 0001 1110 1011 1001
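Step 11 b) can be sketched as a pad-or-truncate helper. The name fit_mantissa is hypothetical. Truncation mirrors what this walkthrough does. IEEE 754 hardware normally rounds to nearest, so for other inputs the last bit could differ from a simple cut.

```python
MANTISSA_BITS = 52


def fit_mantissa(bits_after_point: str) -> str:
    """Truncate, or pad with trailing zeros, so that exactly 52 bits remain."""
    return bits_after_point[:MANTISSA_BITS].ljust(MANTISSA_BITS, "0")


# The leading 1 and the binary point have already been removed (step 11 a).
mantissa = fit_mantissa("1110101110000101000111101011100001010001111010111001")
print(len(mantissa))          # 52
print(fit_mantissa("1")[:4])  # 1000
```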


12. The three elements that make up the number's 64 bit double precision IEEE 754 binary floating point representation:

Sign (1 bit) =
0 (a positive number)


Exponent (11 bits) =
011 1111 1011


Mantissa (52 bits) =
1110 1011 1000 0101 0001 1110 1011 1000 0101 0001 1110 1011 1001


The base ten decimal number 0.120 000 000 000 000 01 converted and written in 64 bit double precision IEEE 754 binary floating point representation:
0 - 011 1111 1011 - 1110 1011 1000 0101 0001 1110 1011 1000 0101 0001 1110 1011 1001
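The final bit pattern can be compared with what a Python interpreter stores for the same literal. This sketch uses the standard struct module and assumes the platform's float is an IEEE 754 double, which holds for common Python builds.

```python
import struct

sign = "0"
exponent = "01111111011"
mantissa = "1110101110000101000111101011100001010001111010111001"

bits = sign + exponent + mantissa  # 1 + 11 + 52 = 64 bits

# Pack the literal as a big-endian double, then reread the same 8 bytes
# as an unsigned 64 bit integer to expose the raw bit pattern.
(raw,) = struct.unpack(">Q", struct.pack(">d", 0.12000000000000001))

print(format(raw, "064b") == bits)  # True
print(hex(raw))                     # 0x3fbeb851eb851eb9
```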
