-0.000 000 000 000 000 000 078 Converted to 64 Bit Double Precision IEEE 754 Binary Floating Point Representation Standard

Convert decimal -0.000 000 000 000 000 000 078(10) to 64 bit double precision IEEE 754 binary floating point representation standard (1 bit for sign, 11 bits for exponent, 52 bits for mantissa)

What are the steps to convert the decimal number -0.000 000 000 000 000 000 078(10) to 64 bit double precision IEEE 754 binary floating point representation (1 bit for sign, 11 bits for exponent, 52 bits for mantissa)?

1. Start with the positive version of the number:

|-0.000 000 000 000 000 000 078| = 0.000 000 000 000 000 000 078


2. First, convert the integer part, 0, to binary (base 2).
Divide the number repeatedly by 2.

Keep track of each remainder.

We stop when we get a quotient that is equal to zero.


  • division = quotient + remainder;
  • 0 ÷ 2 = 0 + 0;

3. Construct the base 2 representation of the integer part of the number.

Take all the remainders starting from the bottom of the list constructed above.

0(10) =


0(2)
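The repeated division in steps 2 and 3 can be sketched in a few lines of Python. The function name `int_to_binary` is only an illustrative choice, not part of any library; the sketch assumes a non-negative integer input.

```python
def int_to_binary(n: int) -> str:
    """Convert a non-negative integer to a base-2 string by repeated division."""
    if n == 0:
        return "0"
    remainders = []
    while n > 0:
        n, r = divmod(n, 2)          # quotient and remainder of n / 2
        remainders.append(str(r))
    # The last remainder is the most significant bit, so read the list backwards.
    return "".join(reversed(remainders))


print(int_to_binary(0))    # the integer part of this number
print(int_to_binary(31))   # a value with several remainders
```

For the integer part 0 the loop never runs, which matches the single division shown above.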


4. Convert the fractional part, 0.000 000 000 000 000 000 078, to binary (base 2).

Multiply it repeatedly by 2.


Keep track of each integer part of the results.


Stop when we get a fractional part that is equal to zero.


  • #) multiplying = integer + fractional part;
  • 1) 0.000 000 000 000 000 000 078 × 2 = 0 + 0.000 000 000 000 000 000 156;
  • 2) 0.000 000 000 000 000 000 156 × 2 = 0 + 0.000 000 000 000 000 000 312;
  • 3) 0.000 000 000 000 000 000 312 × 2 = 0 + 0.000 000 000 000 000 000 624;
  • 4) 0.000 000 000 000 000 000 624 × 2 = 0 + 0.000 000 000 000 000 001 248;
  • 5) 0.000 000 000 000 000 001 248 × 2 = 0 + 0.000 000 000 000 000 002 496;
  • 6) 0.000 000 000 000 000 002 496 × 2 = 0 + 0.000 000 000 000 000 004 992;
  • 7) 0.000 000 000 000 000 004 992 × 2 = 0 + 0.000 000 000 000 000 009 984;
  • 8) 0.000 000 000 000 000 009 984 × 2 = 0 + 0.000 000 000 000 000 019 968;
  • 9) 0.000 000 000 000 000 019 968 × 2 = 0 + 0.000 000 000 000 000 039 936;
  • 10) 0.000 000 000 000 000 039 936 × 2 = 0 + 0.000 000 000 000 000 079 872;
  • 11) 0.000 000 000 000 000 079 872 × 2 = 0 + 0.000 000 000 000 000 159 744;
  • 12) 0.000 000 000 000 000 159 744 × 2 = 0 + 0.000 000 000 000 000 319 488;
  • 13) 0.000 000 000 000 000 319 488 × 2 = 0 + 0.000 000 000 000 000 638 976;
  • 14) 0.000 000 000 000 000 638 976 × 2 = 0 + 0.000 000 000 000 001 277 952;
  • 15) 0.000 000 000 000 001 277 952 × 2 = 0 + 0.000 000 000 000 002 555 904;
  • 16) 0.000 000 000 000 002 555 904 × 2 = 0 + 0.000 000 000 000 005 111 808;
  • 17) 0.000 000 000 000 005 111 808 × 2 = 0 + 0.000 000 000 000 010 223 616;
  • 18) 0.000 000 000 000 010 223 616 × 2 = 0 + 0.000 000 000 000 020 447 232;
  • 19) 0.000 000 000 000 020 447 232 × 2 = 0 + 0.000 000 000 000 040 894 464;
  • 20) 0.000 000 000 000 040 894 464 × 2 = 0 + 0.000 000 000 000 081 788 928;
  • 21) 0.000 000 000 000 081 788 928 × 2 = 0 + 0.000 000 000 000 163 577 856;
  • 22) 0.000 000 000 000 163 577 856 × 2 = 0 + 0.000 000 000 000 327 155 712;
  • 23) 0.000 000 000 000 327 155 712 × 2 = 0 + 0.000 000 000 000 654 311 424;
  • 24) 0.000 000 000 000 654 311 424 × 2 = 0 + 0.000 000 000 001 308 622 848;
  • 25) 0.000 000 000 001 308 622 848 × 2 = 0 + 0.000 000 000 002 617 245 696;
  • 26) 0.000 000 000 002 617 245 696 × 2 = 0 + 0.000 000 000 005 234 491 392;
  • 27) 0.000 000 000 005 234 491 392 × 2 = 0 + 0.000 000 000 010 468 982 784;
  • 28) 0.000 000 000 010 468 982 784 × 2 = 0 + 0.000 000 000 020 937 965 568;
  • 29) 0.000 000 000 020 937 965 568 × 2 = 0 + 0.000 000 000 041 875 931 136;
  • 30) 0.000 000 000 041 875 931 136 × 2 = 0 + 0.000 000 000 083 751 862 272;
  • 31) 0.000 000 000 083 751 862 272 × 2 = 0 + 0.000 000 000 167 503 724 544;
  • 32) 0.000 000 000 167 503 724 544 × 2 = 0 + 0.000 000 000 335 007 449 088;
  • 33) 0.000 000 000 335 007 449 088 × 2 = 0 + 0.000 000 000 670 014 898 176;
  • 34) 0.000 000 000 670 014 898 176 × 2 = 0 + 0.000 000 001 340 029 796 352;
  • 35) 0.000 000 001 340 029 796 352 × 2 = 0 + 0.000 000 002 680 059 592 704;
  • 36) 0.000 000 002 680 059 592 704 × 2 = 0 + 0.000 000 005 360 119 185 408;
  • 37) 0.000 000 005 360 119 185 408 × 2 = 0 + 0.000 000 010 720 238 370 816;
  • 38) 0.000 000 010 720 238 370 816 × 2 = 0 + 0.000 000 021 440 476 741 632;
  • 39) 0.000 000 021 440 476 741 632 × 2 = 0 + 0.000 000 042 880 953 483 264;
  • 40) 0.000 000 042 880 953 483 264 × 2 = 0 + 0.000 000 085 761 906 966 528;
  • 41) 0.000 000 085 761 906 966 528 × 2 = 0 + 0.000 000 171 523 813 933 056;
  • 42) 0.000 000 171 523 813 933 056 × 2 = 0 + 0.000 000 343 047 627 866 112;
  • 43) 0.000 000 343 047 627 866 112 × 2 = 0 + 0.000 000 686 095 255 732 224;
  • 44) 0.000 000 686 095 255 732 224 × 2 = 0 + 0.000 001 372 190 511 464 448;
  • 45) 0.000 001 372 190 511 464 448 × 2 = 0 + 0.000 002 744 381 022 928 896;
  • 46) 0.000 002 744 381 022 928 896 × 2 = 0 + 0.000 005 488 762 045 857 792;
  • 47) 0.000 005 488 762 045 857 792 × 2 = 0 + 0.000 010 977 524 091 715 584;
  • 48) 0.000 010 977 524 091 715 584 × 2 = 0 + 0.000 021 955 048 183 431 168;
  • 49) 0.000 021 955 048 183 431 168 × 2 = 0 + 0.000 043 910 096 366 862 336;
  • 50) 0.000 043 910 096 366 862 336 × 2 = 0 + 0.000 087 820 192 733 724 672;
  • 51) 0.000 087 820 192 733 724 672 × 2 = 0 + 0.000 175 640 385 467 449 344;
  • 52) 0.000 175 640 385 467 449 344 × 2 = 0 + 0.000 351 280 770 934 898 688;
  • 53) 0.000 351 280 770 934 898 688 × 2 = 0 + 0.000 702 561 541 869 797 376;
  • 54) 0.000 702 561 541 869 797 376 × 2 = 0 + 0.001 405 123 083 739 594 752;
  • 55) 0.001 405 123 083 739 594 752 × 2 = 0 + 0.002 810 246 167 479 189 504;
  • 56) 0.002 810 246 167 479 189 504 × 2 = 0 + 0.005 620 492 334 958 379 008;
  • 57) 0.005 620 492 334 958 379 008 × 2 = 0 + 0.011 240 984 669 916 758 016;
  • 58) 0.011 240 984 669 916 758 016 × 2 = 0 + 0.022 481 969 339 833 516 032;
  • 59) 0.022 481 969 339 833 516 032 × 2 = 0 + 0.044 963 938 679 667 032 064;
  • 60) 0.044 963 938 679 667 032 064 × 2 = 0 + 0.089 927 877 359 334 064 128;
  • 61) 0.089 927 877 359 334 064 128 × 2 = 0 + 0.179 855 754 718 668 128 256;
  • 62) 0.179 855 754 718 668 128 256 × 2 = 0 + 0.359 711 509 437 336 256 512;
  • 63) 0.359 711 509 437 336 256 512 × 2 = 0 + 0.719 423 018 874 672 513 024;
  • 64) 0.719 423 018 874 672 513 024 × 2 = 1 + 0.438 846 037 749 345 026 048;
  • 65) 0.438 846 037 749 345 026 048 × 2 = 0 + 0.877 692 075 498 690 052 096;
  • 66) 0.877 692 075 498 690 052 096 × 2 = 1 + 0.755 384 150 997 380 104 192;
  • 67) 0.755 384 150 997 380 104 192 × 2 = 1 + 0.510 768 301 994 760 208 384;
  • 68) 0.510 768 301 994 760 208 384 × 2 = 1 + 0.021 536 603 989 520 416 768;
  • 69) 0.021 536 603 989 520 416 768 × 2 = 0 + 0.043 073 207 979 040 833 536;
  • 70) 0.043 073 207 979 040 833 536 × 2 = 0 + 0.086 146 415 958 081 667 072;
  • 71) 0.086 146 415 958 081 667 072 × 2 = 0 + 0.172 292 831 916 163 334 144;
  • 72) 0.172 292 831 916 163 334 144 × 2 = 0 + 0.344 585 663 832 326 668 288;
  • 73) 0.344 585 663 832 326 668 288 × 2 = 0 + 0.689 171 327 664 653 336 576;
  • 74) 0.689 171 327 664 653 336 576 × 2 = 1 + 0.378 342 655 329 306 673 152;
  • 75) 0.378 342 655 329 306 673 152 × 2 = 0 + 0.756 685 310 658 613 346 304;
  • 76) 0.756 685 310 658 613 346 304 × 2 = 1 + 0.513 370 621 317 226 692 608;
  • 77) 0.513 370 621 317 226 692 608 × 2 = 1 + 0.026 741 242 634 453 385 216;
  • 78) 0.026 741 242 634 453 385 216 × 2 = 0 + 0.053 482 485 268 906 770 432;
  • 79) 0.053 482 485 268 906 770 432 × 2 = 0 + 0.106 964 970 537 813 540 864;
  • 80) 0.106 964 970 537 813 540 864 × 2 = 0 + 0.213 929 941 075 627 081 728;
  • 81) 0.213 929 941 075 627 081 728 × 2 = 0 + 0.427 859 882 151 254 163 456;
  • 82) 0.427 859 882 151 254 163 456 × 2 = 0 + 0.855 719 764 302 508 326 912;
  • 83) 0.855 719 764 302 508 326 912 × 2 = 1 + 0.711 439 528 605 016 653 824;
  • 84) 0.711 439 528 605 016 653 824 × 2 = 1 + 0.422 879 057 210 033 307 648;
  • 85) 0.422 879 057 210 033 307 648 × 2 = 0 + 0.845 758 114 420 066 615 296;
  • 86) 0.845 758 114 420 066 615 296 × 2 = 1 + 0.691 516 228 840 133 230 592;
  • 87) 0.691 516 228 840 133 230 592 × 2 = 1 + 0.383 032 457 680 266 461 184;
  • 88) 0.383 032 457 680 266 461 184 × 2 = 0 + 0.766 064 915 360 532 922 368;
  • 89) 0.766 064 915 360 532 922 368 × 2 = 1 + 0.532 129 830 721 065 844 736;
  • 90) 0.532 129 830 721 065 844 736 × 2 = 1 + 0.064 259 661 442 131 689 472;
  • 91) 0.064 259 661 442 131 689 472 × 2 = 0 + 0.128 519 322 884 263 378 944;
  • 92) 0.128 519 322 884 263 378 944 × 2 = 0 + 0.257 038 645 768 526 757 888;
  • 93) 0.257 038 645 768 526 757 888 × 2 = 0 + 0.514 077 291 537 053 515 776;
  • 94) 0.514 077 291 537 053 515 776 × 2 = 1 + 0.028 154 583 074 107 031 552;
  • 95) 0.028 154 583 074 107 031 552 × 2 = 0 + 0.056 309 166 148 214 063 104;
  • 96) 0.056 309 166 148 214 063 104 × 2 = 0 + 0.112 618 332 296 428 126 208;
  • 97) 0.112 618 332 296 428 126 208 × 2 = 0 + 0.225 236 664 592 856 252 416;
  • 98) 0.225 236 664 592 856 252 416 × 2 = 0 + 0.450 473 329 185 712 504 832;
  • 99) 0.450 473 329 185 712 504 832 × 2 = 0 + 0.900 946 658 371 425 009 664;
  • 100) 0.900 946 658 371 425 009 664 × 2 = 1 + 0.801 893 316 742 850 019 328;
  • 101) 0.801 893 316 742 850 019 328 × 2 = 1 + 0.603 786 633 485 700 038 656;
  • 102) 0.603 786 633 485 700 038 656 × 2 = 1 + 0.207 573 266 971 400 077 312;
  • 103) 0.207 573 266 971 400 077 312 × 2 = 0 + 0.415 146 533 942 800 154 624;
  • 104) 0.415 146 533 942 800 154 624 × 2 = 0 + 0.830 293 067 885 600 309 248;
  • 105) 0.830 293 067 885 600 309 248 × 2 = 1 + 0.660 586 135 771 200 618 496;
  • 106) 0.660 586 135 771 200 618 496 × 2 = 1 + 0.321 172 271 542 401 236 992;
  • 107) 0.321 172 271 542 401 236 992 × 2 = 0 + 0.642 344 543 084 802 473 984;
  • 108) 0.642 344 543 084 802 473 984 × 2 = 1 + 0.284 689 086 169 604 947 968;
  • 109) 0.284 689 086 169 604 947 968 × 2 = 0 + 0.569 378 172 339 209 895 936;
  • 110) 0.569 378 172 339 209 895 936 × 2 = 1 + 0.138 756 344 678 419 791 872;
  • 111) 0.138 756 344 678 419 791 872 × 2 = 0 + 0.277 512 689 356 839 583 744;
  • 112) 0.277 512 689 356 839 583 744 × 2 = 0 + 0.555 025 378 713 679 167 488;
  • 113) 0.555 025 378 713 679 167 488 × 2 = 1 + 0.110 050 757 427 358 334 976;
  • 114) 0.110 050 757 427 358 334 976 × 2 = 0 + 0.220 101 514 854 716 669 952;
  • 115) 0.220 101 514 854 716 669 952 × 2 = 0 + 0.440 203 029 709 433 339 904;
  • 116) 0.440 203 029 709 433 339 904 × 2 = 0 + 0.880 406 059 418 866 679 808;

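None of the fractional parts was equal to zero. But we have enough iterations (52 mantissa bits after the first non-zero digit), so we stop here. Precision is lost: the remaining bits are dropped, so the converted number will be a very close approximation of the initial one, not an exact copy. Note that the bits are truncated, not rounded. Since the last fractional part (0.880 406...) is greater than 0.5, a conversion that rounds to nearest, as IEEE 754 does by default, would set the last mantissa bit to 1.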


5. Construct the base 2 representation of the fractional part of the number.

Take all the integer parts of the multiplying operations, starting from the top of the constructed list above:


0.000 000 000 000 000 000 078(10) =


0.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 0111 0000 0101 1000 0011 0110 1100 0100 0001 1100 1101 0100 1000(2)
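The repeated multiplication in steps 4 and 5 can be reproduced exactly with Python's `fractions.Fraction`, which avoids any rounding during the doubling. This is a minimal sketch: the helper name `fraction_bits` and its `max_significant` parameter are assumptions made for illustration. It stops once 53 digits have been produced counting from the first 1 (the leading bit plus 52 mantissa bits).

```python
from fractions import Fraction


def fraction_bits(x: Fraction, max_significant: int = 53) -> str:
    """Binary digits of 0 <= x < 1, stopping after max_significant digits
    counted from the first 1 (or earlier if the fractional part reaches 0)."""
    bits = []
    significant = 0
    while x != 0 and significant < max_significant:
        x *= 2
        bit = int(x)              # integer part of the doubled value: 0 or 1
        x -= bit                  # keep only the fractional part
        bits.append(str(bit))
        if bit == 1 or significant > 0:
            significant += 1      # start counting at the first 1
    return "".join(bits)


bits = fraction_bits(Fraction(78, 10**21))
print(len(bits))                  # number of multiplications performed
print(bits.index("1") + 1)        # position of the first 1
print(bits[63:])                  # leading 1 followed by the 52 mantissa bits
```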

6. Positive number before normalization:

0.000 000 000 000 000 000 078(10) =


0.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 0111 0000 0101 1000 0011 0110 1100 0100 0001 1100 1101 0100 1000(2)

7. Normalize the binary representation of the number.

Shift the decimal mark 64 positions to the right, so that only one non zero digit remains to the left of it:


0.000 000 000 000 000 000 078(10) =


0.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 0111 0000 0101 1000 0011 0110 1100 0100 0001 1100 1101 0100 1000(2) =


0.0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0000 0001 0111 0000 0101 1000 0011 0110 1100 0100 0001 1100 1101 0100 1000(2) × 2^0 =


1.0111 0000 0101 1000 0011 0110 1100 0100 0001 1100 1101 0100 1000(2) × 2^(-64)
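Normalization amounts to finding the power of two that brings the value into the range from 1 up to (but not including) 2. A minimal sketch, assuming a positive exact input; the function name `normalize` is hypothetical:

```python
from fractions import Fraction


def normalize(x: Fraction):
    """Return (m, e) with x == m * 2**e and 1 <= m < 2, for x > 0."""
    e = 0
    while x < 1:       # point moves right: exponent decreases
        x *= 2
        e -= 1
    while x >= 2:      # point moves left: exponent increases
        x /= 2
        e += 1
    return x, e


m, e = normalize(Fraction(78, 10**21))
print(e)           # -64
print(float(m))    # about 1.4388
```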


8. Up to this moment, there are the following elements that would feed into the 64 bit double precision IEEE 754 binary floating point representation:

Sign: 1 (a negative number)


Exponent (unadjusted): -64


Mantissa (not normalized):
1.0111 0000 0101 1000 0011 0110 1100 0100 0001 1100 1101 0100 1000


9. Adjust the exponent.

Use the 11 bit excess/bias notation:


Exponent (adjusted) =


Exponent (unadjusted) + 2^(11-1) - 1 =


-64 + 2^(11-1) - 1 =


(-64 + 1 023)(10) =


959(10)


10. Convert the adjusted exponent from the decimal (base 10) to 11 bit binary.

Use the same technique of repeatedly dividing by 2:


  • division = quotient + remainder;
  • 959 ÷ 2 = 479 + 1;
  • 479 ÷ 2 = 239 + 1;
  • 239 ÷ 2 = 119 + 1;
  • 119 ÷ 2 = 59 + 1;
  • 59 ÷ 2 = 29 + 1;
  • 29 ÷ 2 = 14 + 1;
  • 14 ÷ 2 = 7 + 0;
  • 7 ÷ 2 = 3 + 1;
  • 3 ÷ 2 = 1 + 1;
  • 1 ÷ 2 = 0 + 1;

11. Construct the base 2 representation of the adjusted exponent.

Take all the remainders starting from the bottom of the list constructed above.


Exponent (adjusted) =


959(10) =


011 1011 1111(2)
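Steps 9 to 11 can be checked with a short Python sketch. The constant and function names are illustrative assumptions; the range check reflects that the all-zeros and all-ones exponent patterns are reserved and do not encode normal numbers.

```python
EXPONENT_BITS = 11
BIAS = 2 ** (EXPONENT_BITS - 1) - 1      # 1023 for double precision


def biased_exponent_bits(unbiased: int) -> str:
    """Add the bias and format the result as an 11-bit binary string."""
    biased = unbiased + BIAS
    if not 0 < biased < 2 ** EXPONENT_BITS - 1:
        raise ValueError("exponent outside the normal range")
    return format(biased, f"0{EXPONENT_BITS}b")


print(biased_exponent_bits(-64))   # 01110111111
```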


12. Normalize the mantissa.

a) Remove the leading (leftmost) bit, since it's always 1, together with the point that follows it.


b) Adjust its length to 52 bits, only if necessary (not the case here).


Mantissa (normalized) =


1.0111 0000 0101 1000 0011 0110 1100 0100 0001 1100 1101 0100 1000 =


0111 0000 0101 1000 0011 0110 1100 0100 0001 1100 1101 0100 1000


13. The three elements that make up the number's 64 bit double precision IEEE 754 binary floating point representation:

Sign (1 bit) =
1 (a negative number)


Exponent (11 bits) =
011 1011 1111


Mantissa (52 bits) =
0111 0000 0101 1000 0011 0110 1100 0100 0001 1100 1101 0100 1000


Decimal number -0.000 000 000 000 000 000 078 converted to 64 bit double precision IEEE 754 binary floating point representation:

1 - 011 1011 1111 - 0111 0000 0101 1000 0011 0110 1100 0100 0001 1100 1101 0100 1000
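The result can be cross-checked against the bits a real machine stores. The sketch below uses Python's standard `struct` module and assumes that Python floats are IEEE 754 doubles, which holds on common platforms. The helper name `double_fields` is hypothetical. Python's decimal-to-float conversion rounds to nearest, while the manual steps drop the extra bits, so the last mantissa bit may differ.

```python
import struct


def double_fields(x: float):
    """Split a Python float into sign, exponent and mantissa bit strings."""
    (raw,) = struct.unpack(">Q", struct.pack(">d", x))   # the 64 raw bits as an integer
    bits = format(raw, "064b")
    return bits[0], bits[1:12], bits[12:]


sign, exponent, mantissa = double_fields(-7.8e-20)
print(sign, exponent, mantissa)

# Mantissa from the manual steps, where bits after the 52nd are truncated.
truncated = "0111000001011000001101101100010000011100110101001000"
print(mantissa[:51] == truncated[:51])   # compare the first 51 mantissa bits
print(mantissa[-1], truncated[-1])       # the last bit is where rounding shows up
```

The sign and exponent fields agree with the manual result. Any difference is confined to the final mantissa bit, where round-to-nearest and truncation part ways.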


How to convert numbers from the decimal system (base ten) to 64 bit double precision IEEE 754 binary floating point standard

Follow the steps below to convert a base 10 decimal number to 64 bit double precision IEEE 754 binary floating point:

  • 1. If the number to be converted is negative, start with its positive version.
  • 2. First convert the integer part. Divide the positive integer part repeatedly by 2, keeping track of each remainder, until we get a quotient that is equal to zero.
  • 3. Construct the base 2 representation of the positive integer part of the number, by taking all the remainders from the previous operations, starting from the bottom of the list constructed above. Thus, the last remainder of the divisions becomes the first symbol (the leftmost) of the base two number, while the first remainder becomes the last symbol (the rightmost).
  • 4. Then convert the fractional part. Multiply the number repeatedly by 2, until we get a fractional part that is equal to zero, keeping track of each integer part of the results.
  • 5. Construct the base 2 representation of the fractional part of the number, by taking all the integer parts of the multiplying operations, starting from the top of the list constructed above (they should appear in the binary representation, from left to right, in the order they have been calculated).
  • 6. Normalize the binary representation of the number, shifting the decimal mark (the decimal point) "n" positions either to the left, or to the right, so that only one non zero digit remains to the left of the decimal mark.
  • 7. Adjust the exponent in 11 bit excess/bias notation and then convert it from decimal (base 10) to 11 bit binary, by using the same technique of repeatedly dividing by 2, as shown above:
    Exponent (adjusted) = Exponent (unadjusted) + 2^(11-1) - 1
  • 8. Normalize the mantissa: remove the leading (leftmost) bit, since it's always '1' (and the point after it), then adjust its length to 52 bits, either by removing the excess bits from the right (losing precision) or by adding extra bits set to '0' on the right.
  • 9. Sign (it takes 1 bit) is either 1 for a negative or 0 for a positive number.
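The nine steps can be combined into one small function. This is a sketch under stated assumptions, not a complete IEEE 754 encoder: the name `decimal_to_double_bits` is hypothetical, the input must be a nonzero decimal string in the normal double range, and steps 2 to 5 are folded into exact rational arithmetic with `fractions.Fraction` instead of building the integer and fractional digit lists separately. Excess mantissa bits are truncated, as in the steps above, rather than rounded to nearest.

```python
from fractions import Fraction


def decimal_to_double_bits(text: str) -> str:
    """Return 'sign - exponent - mantissa' for a nonzero decimal string."""
    value = Fraction(text)
    sign = "1" if value < 0 else "0"       # step 9
    value = abs(value)                     # step 1
    if value == 0:
        raise ValueError("zero has no normalized form")

    # Step 6: shift until exactly one non-zero digit is left of the point.
    exponent = 0
    while value >= 2:
        value /= 2
        exponent += 1
    while value < 1:
        value *= 2
        exponent -= 1

    # Step 7: biased exponent, written in 11 bits.
    biased = exponent + 2 ** (11 - 1) - 1
    if not 0 < biased < 2047:
        raise ValueError("outside the normal double range")

    # Step 8: drop the leading 1, then produce 52 bits by repeated doubling.
    value -= 1
    mantissa = []
    for _ in range(52):
        value *= 2
        bit = int(value)
        value -= bit
        mantissa.append(str(bit))

    return f"{sign} - {format(biased, '011b')} - {''.join(mantissa)}"


print(decimal_to_double_bits("-31.640215"))
```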

Example: convert the negative number -31.640 215 from the decimal system (base ten) to 64 bit double precision IEEE 754 binary floating point:

  • 1. Start with the positive version of the number:

    |-31.640 215| = 31.640 215

  • 2. First convert the integer part, 31. Divide it repeatedly by 2, keeping track of each remainder, until we get a quotient that is equal to zero:
    • division = quotient + remainder;
    • 31 ÷ 2 = 15 + 1;
    • 15 ÷ 2 = 7 + 1;
    • 7 ÷ 2 = 3 + 1;
    • 3 ÷ 2 = 1 + 1;
    • 1 ÷ 2 = 0 + 1;
    • We have encountered a quotient that is ZERO => FULL STOP
  • 3. Construct the base 2 representation of the integer part of the number by taking all the remainders of the previous dividing operations, starting from the bottom of the list constructed above:

    31(10) = 1 1111(2)

  • 4. Then, convert the fractional part, 0.640 215. Multiply repeatedly by 2, keeping track of each integer part of the results, until we get a fractional part that is equal to zero:
    • #) multiplying = integer + fractional part;
    • 1) 0.640 215 × 2 = 1 + 0.280 43;
    • 2) 0.280 43 × 2 = 0 + 0.560 86;
    • 3) 0.560 86 × 2 = 1 + 0.121 72;
    • 4) 0.121 72 × 2 = 0 + 0.243 44;
    • 5) 0.243 44 × 2 = 0 + 0.486 88;
    • 6) 0.486 88 × 2 = 0 + 0.973 76;
    • 7) 0.973 76 × 2 = 1 + 0.947 52;
    • 8) 0.947 52 × 2 = 1 + 0.895 04;
    • 9) 0.895 04 × 2 = 1 + 0.790 08;
    • 10) 0.790 08 × 2 = 1 + 0.580 16;
    • 11) 0.580 16 × 2 = 1 + 0.160 32;
    • 12) 0.160 32 × 2 = 0 + 0.320 64;
    • 13) 0.320 64 × 2 = 0 + 0.641 28;
    • 14) 0.641 28 × 2 = 1 + 0.282 56;
    • 15) 0.282 56 × 2 = 0 + 0.565 12;
    • 16) 0.565 12 × 2 = 1 + 0.130 24;
    • 17) 0.130 24 × 2 = 0 + 0.260 48;
    • 18) 0.260 48 × 2 = 0 + 0.520 96;
    • 19) 0.520 96 × 2 = 1 + 0.041 92;
    • 20) 0.041 92 × 2 = 0 + 0.083 84;
    • 21) 0.083 84 × 2 = 0 + 0.167 68;
    • 22) 0.167 68 × 2 = 0 + 0.335 36;
    • 23) 0.335 36 × 2 = 0 + 0.670 72;
    • 24) 0.670 72 × 2 = 1 + 0.341 44;
    • 25) 0.341 44 × 2 = 0 + 0.682 88;
    • 26) 0.682 88 × 2 = 1 + 0.365 76;
    • 27) 0.365 76 × 2 = 0 + 0.731 52;
    • 28) 0.731 52 × 2 = 1 + 0.463 04;
    • 29) 0.463 04 × 2 = 0 + 0.926 08;
    • 30) 0.926 08 × 2 = 1 + 0.852 16;
    • 31) 0.852 16 × 2 = 1 + 0.704 32;
    • 32) 0.704 32 × 2 = 1 + 0.408 64;
    • 33) 0.408 64 × 2 = 0 + 0.817 28;
    • 34) 0.817 28 × 2 = 1 + 0.634 56;
    • 35) 0.634 56 × 2 = 1 + 0.269 12;
    • 36) 0.269 12 × 2 = 0 + 0.538 24;
    • 37) 0.538 24 × 2 = 1 + 0.076 48;
    • 38) 0.076 48 × 2 = 0 + 0.152 96;
    • 39) 0.152 96 × 2 = 0 + 0.305 92;
    • 40) 0.305 92 × 2 = 0 + 0.611 84;
    • 41) 0.611 84 × 2 = 1 + 0.223 68;
    • 42) 0.223 68 × 2 = 0 + 0.447 36;
    • 43) 0.447 36 × 2 = 0 + 0.894 72;
    • 44) 0.894 72 × 2 = 1 + 0.789 44;
    • 45) 0.789 44 × 2 = 1 + 0.578 88;
    • 46) 0.578 88 × 2 = 1 + 0.157 76;
    • 47) 0.157 76 × 2 = 0 + 0.315 52;
    • 48) 0.315 52 × 2 = 0 + 0.631 04;
    • 49) 0.631 04 × 2 = 1 + 0.262 08;
    • 50) 0.262 08 × 2 = 0 + 0.524 16;
    • 51) 0.524 16 × 2 = 1 + 0.048 32;
    • 52) 0.048 32 × 2 = 0 + 0.096 64;
    • 53) 0.096 64 × 2 = 0 + 0.193 28;
    • We didn't get any fractional part that was equal to zero. But we had enough iterations (over Mantissa limit = 52) and at least one integer part that was different from zero => FULL STOP (losing precision...).
  • 5. Construct the base 2 representation of the fractional part of the number, by taking all the integer parts of the previous multiplying operations, starting from the top of the constructed list above:

    0.640 215(10) = 0.1010 0011 1110 0101 0010 0001 0101 0111 0110 1000 1001 1100 1010 0(2)

  • 6. Summarizing - the positive number before normalization:

    31.640 215(10) = 1 1111.1010 0011 1110 0101 0010 0001 0101 0111 0110 1000 1001 1100 1010 0(2)

  • 7. Normalize the binary representation of the number, shifting the decimal mark 4 positions to the left so that only one non-zero digit stays to the left of the decimal mark:

    31.640 215(10) =
    1 1111.1010 0011 1110 0101 0010 0001 0101 0111 0110 1000 1001 1100 1010 0(2) =
    1 1111.1010 0011 1110 0101 0010 0001 0101 0111 0110 1000 1001 1100 1010 0(2) × 2^0 =
    1.1111 1010 0011 1110 0101 0010 0001 0101 0111 0110 1000 1001 1100 1010 0(2) × 2^4

  • 8. Up to this moment, there are the following elements that would feed into the 64 bit double precision IEEE 754 binary floating point representation:

    Sign: 1 (a negative number)

    Exponent (unadjusted): 4

    Mantissa (not-normalized): 1.1111 1010 0011 1110 0101 0010 0001 0101 0111 0110 1000 1001 1100 1010 0

  • 9. Adjust the exponent in 11 bit excess/bias notation and then convert it from decimal (base 10) to 11 bit binary (base 2), by using the same technique of repeatedly dividing it by 2, as shown above:

    Exponent (adjusted) = Exponent (unadjusted) + 2^(11-1) - 1 = (4 + 1023)(10) = 1027(10) =
    100 0000 0011(2)

  • 10. Normalize the mantissa: remove the leading (leftmost) bit, since it's always '1' (and the point after it), then adjust its length to 52 bits by removing the excess bits from the right (losing precision):

    Mantissa (not-normalized): 1.1111 1010 0011 1110 0101 0010 0001 0101 0111 0110 1000 1001 1100 1010 0

    Mantissa (normalized): 1111 1010 0011 1110 0101 0010 0001 0101 0111 0110 1000 1001 1100

  • Conclusion:

    Sign (1 bit) = 1 (a negative number)

    Exponent (11 bits) = 100 0000 0011

    Mantissa (52 bits) = 1111 1010 0011 1110 0101 0010 0001 0101 0111 0110 1000 1001 1100

  • Number -31.640 215, converted from decimal system (base 10) to 64 bit double precision IEEE 754 binary floating point =
    1 - 100 0000 0011 - 1111 1010 0011 1110 0101 0010 0001 0101 0111 0110 1000 1001 1100
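To see how close the stored value is to the original, the three fields can be decoded back into an exact rational number. A minimal sketch, assuming a normal (not subnormal, infinite or NaN) bit pattern; the function name `decode_double_bits` is hypothetical:

```python
from fractions import Fraction


def decode_double_bits(sign: str, exponent: str, mantissa: str) -> Fraction:
    """Exact value of a normal double given as three bit strings."""
    e = int(exponent, 2) - 1023                     # remove the bias
    m = 1 + Fraction(int(mantissa, 2), 2 ** 52)     # restore the hidden leading 1
    value = m * Fraction(2) ** e
    return -value if sign == "1" else value


approx = decode_double_bits(
    "1",
    "10000000011",
    "1111101000111110010100100001010101110110100010011100",
)
error = abs(approx - Fraction("-31.640215"))
print(float(approx))   # the value the bit pattern represents
print(float(error))    # distance from the original decimal number
```

Because the excess bits were removed rather than rounded, the error stays below one unit in the last place, which is 2^(-48) for this exponent.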