Questions about 64-bit stuff
Here's my question based on what I understand from the book I'm reading. Hopefully, someone can understand my rough idea:
1) Say we have 64 bits. Each bit is either 0 or 1, so 64 bits give only 64 spots to store 0s and 1s. But 10^380 is a huge number: written out in decimal as 10...000, it takes about 380 digits. How could a computer possibly store such a number? I'm totally lost here.
2)"uint64" data type means computer require 64 bits to store such a number. The maximum integer of this type it can store is 2^64 - 1. Comparing to "double", which also uses 64 bits to store a fractions. Yet the largest number it can store is 1.79x10^380. 10^380 is a very very large number, in comparison to 2^64. How could this be? I mean why don't we just throw away (literally throw away) "uint64" because it uses the same amount of memory like "double" and can store even larger numbers.
Unless I'm crazy here, or misunderstanding something. Someone please help explain.
Thanks.
Walter Roberson
on 27 Jul 2015
realmax() is 1.79769313486232e+308, i.e. roughly 10^308, not a number roughly 10^380.
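Walter's figure is easy to verify. As a quick sketch (shown in Python rather than MATLAB, since Python's built-in float uses the same IEEE 754 binary64 "double" format):

```python
import sys

# Python's float is an IEEE 754 binary64 ("double"), the same format
# as MATLAB's default numeric type, so this matches realmax().
print(sys.float_info.max)   # 1.7976931348623157e+308
```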
Accepted Answer
More Answers (3)
Image Analyst
on 27 Jul 2015
Huy Truong
on 27 Jul 2015
James Tursa
on 27 Jul 2015
I don't know what we can write that is not already in multiple online sources for this. The difference between floating point bit notation and integer bit notation is detailed in those places. Another link:
https://en.wikipedia.org/wiki/Floating_point
Steven Lord
on 27 Jul 2015
You might be interested in section 7 of the introduction chapter of Cleve's Numerical Computing with MATLAB.
The main assumption you're making that is not correct is that double precision numbers are equally spaced (in terms of absolute difference) throughout the range covered by double precision. This IS true for uint64 (the uniform spacing is exactly 1) but is NOT true for double: the spacing grows with the magnitude of the number.
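That non-uniform spacing can be sketched in Python (whose float is the same IEEE 754 double format as MATLAB's; math.ulp gives the gap to the next representable double):

```python
import math

# The spacing between adjacent doubles is relative, not absolute:
# it doubles each time the magnitude crosses a power of two.
print(math.ulp(1.0))       # 2^-52, about 2.22e-16 (this is eps)
print(math.ulp(2.0**30))   # 2^-22, about 2.38e-7
print(math.ulp(2.0**60))   # 256.0: most integers up here are skipped
```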
Muthu Annamalai
on 27 Jul 2015
@Huy Truong - you just answered the question, "What is the difference between floating point and fixed point numbers ?"
Dynamic range
The core difference is this:
- floating point classes (e.g. double and single) split their total number of bits into three groups: the main part encodes the digits (or fraction), a smaller part encodes the magnitude, and one bit encodes the sign.
- integer classes only encode the digits, and possibly the sign.
This means floating point numbers encode a value a bit like this:
X * ZZZZZZZZZZZZZZZ * 2^YYYYY
where the X is the sign bit, the Z's are the digit (fraction) bits, and the Y's are the exponent (multiplier) bits. The advantage of this layout is that it can encode a very large range of magnitudes (set by 2^YYYYY) with a constant number of significant digits (the number of Z bits). Note that it is not possible to represent every integer within that range!
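For a concrete look at those three bit groups, here is a small sketch in Python (whose float is the same 64-bit IEEE 754 double; the helper name double_bits is just made up for this example):

```python
import struct

def double_bits(x):
    """Split the 64 bits of a double into the three groups above:
    1 sign bit (X), 11 exponent bits (Y), 52 fraction bits (Z)."""
    (b,) = struct.unpack(">Q", struct.pack(">d", x))
    sign     = b >> 63
    exponent = (b >> 52) & 0x7FF       # biased by 1023
    fraction = b & ((1 << 52) - 1)
    return sign, exponent, fraction

# -6.0 = -1.5 * 2^2: sign 1, exponent 2 + 1023 = 1025,
# fraction 0.5 stored as 0.5 * 2^52 = 2^51
print(double_bits(-6.0))   # (1, 1025, 2251799813685248)
```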
An integer can be much simpler:
XZZZZZZZZZZZZZZZZZZZZ
Why do we not "throw away" the integer classes? Because they encode exact integer values all the way up to their limits (note there are more Z digits for the same number of bits), their memory use can be more efficient, and because many operations can be applied directly to the bits themselves, their arithmetic can be faster.
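A quick sketch of that exactness trade-off (in Python, standing in for the same 64-bit formats; a uint64 holds every integer up to 2^64 - 1 exactly, while a double runs out of consecutive integers at 2^53):

```python
# A double has 52 fraction bits (53 significant bits), so the first
# positive integer it cannot store exactly is 2^53 + 1.
n = 2**53
print(float(n) == n)             # True:  2^53 is representable
print(float(n + 1) == n + 1)     # False: 2^53 + 1 is not
print(float(n + 1) == float(n))  # True:  it rounds back down to 2^53
```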
Huy Truong
on 27 Jul 2015 (edited)
James Tursa
on 27 Jul 2015 (edited)
"... adding two double has a more complicated mechanism happening inside computer than adding two pure integers?"
Yes. When adding two doubles, the code has to check for special bit patterns first (NaN, inf, denormalized). If they are present, then special code to determine the result must be used. If normal bit patterns are present, then you need to handle the difference in exponents for the two numbers to get the mantissas to effectively "line-up" for the addition. Then you need to account for a possible difference in signs (one positive and the other negative). And the result might overflow into an inf pattern, or underflow into a denormalized pattern.
For adding two integer bit patterns, the bits are already "lined up", since there are no exponents to worry about, so a simple algorithm to add the bits works. And if 2's complement format is used (as is typical in modern computers), the exact same algorithm works for positive and negative operands. Overflow/underflow can be detected by examining the register overflow bit, and also depends on the signed/unsigned status of the operands. Overall this can be much less work than adding doubles (although the micro-code for adding doubles is still pretty fast).
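Both effects can be sketched in Python (Python floats are IEEE 754 doubles; an arbitrary-precision Python int plus a 64-bit mask stands in for what a uint64 hardware register would do):

```python
# Double addition: when the exponents differ by more than 52, the
# smaller mantissa is shifted entirely out of the 53-bit window
# during the "line-up" step, so its contribution is lost.
print(1.0 + 2.0**-53 == 1.0)   # True:  the small addend vanishes
print(1.0 + 2.0**-52 == 1.0)   # False: 2^-52 is one ulp at 1.0

# Integer addition: no alignment step, the bits add directly; masking
# to 64 bits mimics the wraparound of a uint64 hardware register.
a, b = 0xFFFF_FFFF_FFFF_FFFF, 1           # uint64 max and 1
print((a + b) & 0xFFFF_FFFF_FFFF_FFFF)    # 0: overflow wrapped around
```

(Note that MATLAB's own uint64 arithmetic saturates at intmax rather than wrapping; the wraparound here is the raw register behaviour James describes.)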
Walter Roberson
on 27 Jul 2015
What you are missing is that a 64 bit double cannot represent every number in the range up to 10^308. 64 bit doubles can only precisely represent some values in that range.
The smallest positive integer that a 64 bit double in IEEE 754 format cannot represent properly is 2^53 + 1.
Numbers represented in double are restricted to about 16 significant digits of accuracy. Once the values get above about 10^16, the distance between adjacent representable numbers becomes larger than 1.
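That threshold can be sketched in Python (same IEEE 754 double format as MATLAB's double):

```python
import math

x = 1e16
# Above about 10^16 the gap between adjacent doubles exceeds 1,
# so consecutive integers can no longer be distinguished.
print(math.ulp(x))   # 2.0: the next double after 1e16 is 1e16 + 2
print(x + 1 == x)    # True: adding 1 is lost entirely to rounding
```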