64 bit IEEE 754: Convert the base ten decimal number 0.358 335 to 64 bit double precision IEEE 754 binary floating point representation

Number 0.358 335(10) converted and written in 64 bit double precision IEEE 754 binary floating point representation (1 bit for sign, 11 bits for exponent, 52 bits for mantissa)

1. First, convert to binary (in base 2) the integer part: 0.
Divide the number repeatedly by 2.

Keep track of each remainder.

We stop when we get a quotient that is equal to zero.


  • division = quotient + remainder;
  • 0 ÷ 2 = 0 + 0;

2. Construct the base 2 representation of the integer part of the number.

Take all the remainders starting from the bottom of the list constructed above.


0(10) =


0(2)
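The repeated division in steps 1 and 2 can be sketched in Python as follows. This is a minimal illustration, not the converter's own code; the function name int_to_binary is an assumption made for this example.

```python
def int_to_binary(n: int) -> str:
    """Convert a non-negative integer to base 2 by repeated division by 2."""
    remainders = []
    while True:
        n, remainder = divmod(n, 2)   # one "division = quotient + remainder" step
        remainders.append(str(remainder))
        if n == 0:                    # stop when the quotient is zero
            break
    # Read the remainders from the bottom of the list (last one first).
    return "".join(reversed(remainders))


print(int_to_binary(0))   # the integer part of 0.358 335
```

The loop always runs at least once, which mirrors the single line 0 ÷ 2 = 0 + 0 above and yields the digit 0 for an integer part of zero.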


3. Convert to binary (base 2) the fractional part: 0.358 335.

Multiply it repeatedly by 2.


Keep track of each integer part of the results.


Stop when we get a fractional part that is equal to zero.


  • #) multiplying = integer + fractional part;
  • 1) 0.358 335 × 2 = 0 + 0.716 67;
  • 2) 0.716 67 × 2 = 1 + 0.433 34;
  • 3) 0.433 34 × 2 = 0 + 0.866 68;
  • 4) 0.866 68 × 2 = 1 + 0.733 36;
  • 5) 0.733 36 × 2 = 1 + 0.466 72;
  • 6) 0.466 72 × 2 = 0 + 0.933 44;
  • 7) 0.933 44 × 2 = 1 + 0.866 88;
  • 8) 0.866 88 × 2 = 1 + 0.733 76;
  • 9) 0.733 76 × 2 = 1 + 0.467 52;
  • 10) 0.467 52 × 2 = 0 + 0.935 04;
  • 11) 0.935 04 × 2 = 1 + 0.870 08;
  • 12) 0.870 08 × 2 = 1 + 0.740 16;
  • 13) 0.740 16 × 2 = 1 + 0.480 32;
  • 14) 0.480 32 × 2 = 0 + 0.960 64;
  • 15) 0.960 64 × 2 = 1 + 0.921 28;
  • 16) 0.921 28 × 2 = 1 + 0.842 56;
  • 17) 0.842 56 × 2 = 1 + 0.685 12;
  • 18) 0.685 12 × 2 = 1 + 0.370 24;
  • 19) 0.370 24 × 2 = 0 + 0.740 48;
  • 20) 0.740 48 × 2 = 1 + 0.480 96;
  • 21) 0.480 96 × 2 = 0 + 0.961 92;
  • 22) 0.961 92 × 2 = 1 + 0.923 84;
  • 23) 0.923 84 × 2 = 1 + 0.847 68;
  • 24) 0.847 68 × 2 = 1 + 0.695 36;
  • 25) 0.695 36 × 2 = 1 + 0.390 72;
  • 26) 0.390 72 × 2 = 0 + 0.781 44;
  • 27) 0.781 44 × 2 = 1 + 0.562 88;
  • 28) 0.562 88 × 2 = 1 + 0.125 76;
  • 29) 0.125 76 × 2 = 0 + 0.251 52;
  • 30) 0.251 52 × 2 = 0 + 0.503 04;
  • 31) 0.503 04 × 2 = 1 + 0.006 08;
  • 32) 0.006 08 × 2 = 0 + 0.012 16;
  • 33) 0.012 16 × 2 = 0 + 0.024 32;
  • 34) 0.024 32 × 2 = 0 + 0.048 64;
  • 35) 0.048 64 × 2 = 0 + 0.097 28;
  • 36) 0.097 28 × 2 = 0 + 0.194 56;
  • 37) 0.194 56 × 2 = 0 + 0.389 12;
  • 38) 0.389 12 × 2 = 0 + 0.778 24;
  • 39) 0.778 24 × 2 = 1 + 0.556 48;
  • 40) 0.556 48 × 2 = 1 + 0.112 96;
  • 41) 0.112 96 × 2 = 0 + 0.225 92;
  • 42) 0.225 92 × 2 = 0 + 0.451 84;
  • 43) 0.451 84 × 2 = 0 + 0.903 68;
  • 44) 0.903 68 × 2 = 1 + 0.807 36;
  • 45) 0.807 36 × 2 = 1 + 0.614 72;
  • 46) 0.614 72 × 2 = 1 + 0.229 44;
  • 47) 0.229 44 × 2 = 0 + 0.458 88;
  • 48) 0.458 88 × 2 = 0 + 0.917 76;
  • 49) 0.917 76 × 2 = 1 + 0.835 52;
  • 50) 0.835 52 × 2 = 1 + 0.671 04;
  • 51) 0.671 04 × 2 = 1 + 0.342 08;
  • 52) 0.342 08 × 2 = 0 + 0.684 16;
  • 53) 0.684 16 × 2 = 1 + 0.368 32;
  • 54) 0.368 32 × 2 = 0 + 0.736 64;

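We did not reach a fractional part equal to zero, so the binary expansion does not terminate here. After 54 iterations there are already enough bits to fill the 52 bit mantissa (the first 1 appears at position 2 and is followed by 52 more bits), so we stop. Every bit beyond this point is dropped (truncated, not rounded), which loses precision.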


4. Construct the base 2 representation of the fractional part of the number.

Take all the integer parts of the multiplying operations, starting from the top of the constructed list above:


0.358 335(10) =


0.0101 1011 1011 1011 1101 0111 1011 0010 0000 0011 0001 1100 1110 10(2)
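The repeated multiplication in steps 3 and 4 can be reproduced with a short Python sketch. It assumes exact rational arithmetic through the standard fractions module, so that the decimal digits match the table above; the function name fraction_to_bits and the 54 iteration limit are choices made for this example.

```python
from fractions import Fraction


def fraction_to_bits(x: Fraction, max_bits: int) -> str:
    """Binary digits of a fraction 0 <= x < 1 by repeated multiplication by 2."""
    bits = []
    for _ in range(max_bits):
        x *= 2
        integer_part = int(x)          # 0 or 1, because 0 <= x < 2 here
        bits.append(str(integer_part))
        x -= integer_part              # keep only the fractional part
        if x == 0:                     # the expansion terminated exactly
            break
    return "".join(bits)


# Fraction("0.358335") holds the decimal value exactly (no binary rounding).
bits = fraction_to_bits(Fraction("0.358335"), 54)
print(bits)
```

Using a plain float for x would not be exact, because 0.358 335 itself cannot be stored exactly in binary; that is why the sketch works with Fraction.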


5. Positive number before normalization:

0.358 335(10) =


0.0101 1011 1011 1011 1101 0111 1011 0010 0000 0011 0001 1100 1110 10(2)

6. Normalize the binary representation of the number.

Shift the binary point 2 positions to the right, so that exactly one nonzero digit remains to the left of it:


0.358 335(10) =


0.0101 1011 1011 1011 1101 0111 1011 0010 0000 0011 0001 1100 1110 10(2) =


0.0101 1011 1011 1011 1101 0111 1011 0010 0000 0011 0001 1100 1110 10(2) × 2^0 =


1.0110 1110 1110 1111 0101 1110 1100 1000 0000 1100 0111 0011 1010(2) × 2^(-2)
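The normalization in step 6 amounts to locating the first 1 in the fractional bits. A minimal Python sketch, assuming the 54 fractional bits from step 4 as input (the variable names are illustrative only):

```python
# Fractional bits of 0.358 335 from step 4 (the digits after "0.").
bits = "0101 1011 1011 1011 1101 0111 1011 0010 0000 0011 0001 1100 1110 10".replace(" ", "")

first_one = bits.index("1")   # zero-based position of the first 1 bit
shift = first_one + 1         # how many places the binary point moves right
exponent = -shift             # moving right by k places multiplies by 2^(-k)

# One nonzero digit to the left of the point, the rest to the right.
significand = bits[first_one] + "." + bits[first_one + 1:]

print(exponent)
print(significand)
```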


7. So far, these are the elements that feed into the 64 bit double precision IEEE 754 binary floating point representation:

Sign 0 (a positive number)


Exponent (unadjusted): -2


Mantissa (not normalized):
1.0110 1110 1110 1111 0101 1110 1100 1000 0000 1100 0111 0011 1010


8. Adjust the exponent.

Use the 11 bit excess/bias notation:


Exponent (adjusted) =


Exponent (unadjusted) + 2^(11-1) - 1 =


-2 + 2^(11-1) - 1 =


(-2 + 1 023)(10) =


1 021(10)


9. Convert the adjusted exponent from the decimal (base 10) to 11 bit binary.

Use the same technique of repeatedly dividing by 2:


  • division = quotient + remainder;
  • 1 021 ÷ 2 = 510 + 1;
  • 510 ÷ 2 = 255 + 0;
  • 255 ÷ 2 = 127 + 1;
  • 127 ÷ 2 = 63 + 1;
  • 63 ÷ 2 = 31 + 1;
  • 31 ÷ 2 = 15 + 1;
  • 15 ÷ 2 = 7 + 1;
  • 7 ÷ 2 = 3 + 1;
  • 3 ÷ 2 = 1 + 1;
  • 1 ÷ 2 = 0 + 1;

10. Construct the base 2 representation of the adjusted exponent.

Take all the remainders starting from the bottom of the list constructed above.


Exponent (adjusted) =


1 021(10) =


011 1111 1101(2)
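Steps 8 to 10 can be checked with a few lines of Python. This sketch assumes the unadjusted exponent -2 from step 7 and uses the built-in format function for the binary conversion instead of the manual division table; the constant and variable names are illustrative.

```python
EXPONENT_BITS = 11

bias = 2 ** (EXPONENT_BITS - 1) - 1       # 2^(11-1) - 1
adjusted = -2 + bias                      # unadjusted exponent plus bias

# "011b": binary digits, padded with leading zeros to a width of 11.
exponent_field = format(adjusted, "011b")

print(bias, adjusted, exponent_field)
```

The zero padding matters: the division table produces only ten digits for 1 021, and the field needs a leading 0 to reach 11 bits.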


11. Normalize the mantissa.

a) Remove the leading (leftmost) bit, since it is always 1, together with the binary point.


b) Adjust its length to 52 bits, only if necessary (not the case here).


Mantissa (normalized) =


1. 0110 1110 1110 1111 0101 1110 1100 1000 0000 1100 0111 0011 1010 =


0110 1110 1110 1111 0101 1110 1100 1000 0000 1100 0111 0011 1010
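Step 11 can be sketched as plain string handling. The example assumes the normalized value from step 6 as input and truncates (rather than rounds) to 52 bits, matching the procedure on this page; the names are illustrative.

```python
MANTISSA_BITS = 52

# Normalized value from step 6, written with its leading 1 and binary point.
significand = "1.0110 1110 1110 1111 0101 1110 1100 1000 0000 1100 0111 0011 1010".replace(" ", "")

# a) Drop the implicit leading 1 and the binary point.
fraction_bits = significand.split(".")[1]

# b) Fit to exactly 52 bits: cut off extra bits, or pad with zeros on the right.
mantissa = fraction_bits[:MANTISSA_BITS].ljust(MANTISSA_BITS, "0")

print(mantissa, len(mantissa))
```

Here the fraction already has exactly 52 bits, so neither the cut nor the padding changes anything.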


12. The three elements that make up the number's 64 bit double precision IEEE 754 binary floating point representation:

Sign (1 bit) =
0 (a positive number)


Exponent (11 bits) =
011 1111 1101


Mantissa (52 bits) =
0110 1110 1110 1111 0101 1110 1100 1000 0000 1100 0111 0011 1010


The base ten decimal number 0.358 335 converted and written in 64 bit double precision IEEE 754 binary floating point representation:
0 - 011 1111 1101 - 0110 1110 1110 1111 0101 1110 1100 1000 0000 1100 0111 0011 1010
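As a cross-check, the three fields can be joined into one 64 bit word and compared with the double that Python itself stores for 0.358335. This is a sketch using only the standard struct module; it assumes the platform uses IEEE 754 doubles with round to nearest, as CPython normally does.

```python
import struct

sign = "0"
exponent = "011 1111 1101".replace(" ", "")
mantissa = "0110 1110 1110 1111 0101 1110 1100 1000 0000 1100 0111 0011 1010".replace(" ", "")

# Result of the manual procedure above (extra bits truncated).
word = int(sign + exponent + mantissa, 2)
print(f"{word:016X}")        # 3FD6EEF5EC80C73A

# Bit pattern of the double that the interpreter stores for the same literal.
stored = struct.unpack(">Q", struct.pack(">d", 0.358335))[0]
print(f"{stored:016X}")

# Difference in units in the last place between the two patterns.
print(stored - word)
```

The two patterns are expected to differ by one unit in the last place. The procedure on this page drops the bits after position 54, while the part dropped (0.736 64 of the last bit, from line 54 of the table) is more than one half, so round to nearest moves the last mantissa bit up from 0 to 1.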
