Why does two's-complement multiplication need to do sign extension?

In the book Computer Systems: A Programmer's Perspective (section 2.3.5), the method for calculating two's-complement multiplication is described as follows:
Signed multiplication in C is generally performed by truncating the 2w-bit product to w bits.
Truncating a two's-complement number to w bits is equivalent to first computing its value modulo 2^w and then converting from unsigned to two's-complement.
So, given operands with the same bit-level representation, why is unsigned multiplication different from two's-complement multiplication? Why does two's-complement multiplication need sign extension?
To get the same bit-level result for unsigned and two's-complement addition, we can convert the two's-complement arguments to unsigned, perform unsigned addition, and then convert the result back to two's-complement.
Since multiplication consists of repeated additions, why do the full-width results of unsigned and two's-complement multiplication differ?

Figure 2.27 demonstrates the example below:
+------------------+----------+---------+-------------+-----------------+
| Mode             | x        | y       | x · y       | Truncated x · y |
+------------------+----------+---------+-------------+-----------------+
| Unsigned         | 5 [101]  | 3 [011] | 15 [001111] | 7 [111]         |
| Two's complement | –3 [101] | 3 [011] | –9 [110111] | –1 [111]        |
+------------------+----------+---------+-------------+-----------------+
If you multiply 101 by 011, you get 1111 (which is equal to 001111). How did they get 110111 for the two's-complement case, then?
The catch is that to get a correct 6-bit two's-complement product, you need to multiply 6-bit two's-complement numbers. So you first sign-extend -3 and 3 to their 6-bit representations: -3 = 111101 and 3 = 000011. Multiplying those gives 111101 * 000011 = 10110111, and truncating to 6 bits yields the 110111 from the table above.
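Here is a minimal C sketch (my own illustration, not from the book) that mimics Figure 2.27: it multiplies the same 3-bit pattern as unsigned and as two's complement, showing that the truncated 3-bit products agree while the full 6-bit products differ because of sign extension.

#include <stdio.h>

/* Print the low `bits` bits of v, most significant first. */
static void print_bits(unsigned v, int bits) {
    for (int i = bits - 1; i >= 0; i--)
        putchar(((v >> i) & 1) ? '1' : '0');
    putchar('\n');
}

int main(void) {
    unsigned ux = 5, uy = 3;  /* the bits 101 and 011 read as unsigned */
    int sx = -3, sy = 3;      /* the same bits read as 3-bit two's complement */

    unsigned uprod = ux * uy;                     /* 15 = 001111 */
    unsigned sprod = (unsigned)(sx * sy) & 0x3F;  /* -9 mod 64 = 110111 */

    print_bits(uprod, 6);        /* 001111 */
    print_bits(sprod, 6);        /* 110111: the full products differ */
    print_bits(uprod & 0x7, 3);  /* 111 */
    print_bits(sprod & 0x7, 3);  /* 111: the truncated products agree */
    return 0;
}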

Related

How to select half precision (BFLOAT16 vs FLOAT16) for your trained model?

How would you decide which precision works best for your inference model? Both BF16 and FP16 take two bytes, but they use a different number of bits for the fraction and the exponent.
The ranges will differ, but I am trying to understand why one would choose one over the other.
Thank you.
|--------+------+----------+----------|
| Format | Bits | Exponent | Fraction |
|--------+------+----------+----------|
| FP32   |   32 |        8 |       23 |
| FP16   |   16 |        5 |       10 |
| BF16   |   16 |        8 |        7 |
|--------+------+----------+----------|
Range
bfloat16: ~1.18e-38 … ~3.40e38, with about 3 significant decimal digits.
float16: ~5.96e-8 (~6.10e-5 for normal numbers) … 65504, with about 4 significant decimal digits.
bfloat16 is generally easier to use, because it works as a drop-in replacement for float32. If your code doesn't create nan/inf numbers or turn a non-0 into a 0 with float32, then it shouldn't do it with bfloat16 either, roughly speaking. So, if your hardware supports it, I'd pick that.
Check out AMP if you choose float16.
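To make the drop-in relationship with float32 concrete: bfloat16 keeps the sign bit, the full 8-bit exponent, and the top 7 fraction bits of a float32, so a truncating conversion is just the high 16 bits. Here is a rough C sketch of the round trip (my own illustration; real hardware typically rounds to nearest rather than truncating):

#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* Truncating float32 -> bfloat16 conversion: keep the high 16 bits. */
static uint16_t float_to_bf16(float f) {
    uint32_t u;
    memcpy(&u, &f, sizeof u);
    return (uint16_t)(u >> 16);
}

/* bfloat16 -> float32: put the 16 bits back on top of zeros. */
static float bf16_to_float(uint16_t h) {
    uint32_t u = (uint32_t)h << 16;
    float f;
    memcpy(&f, &u, sizeof f);
    return f;
}

int main(void) {
    float x = 0.1f;
    float y = bf16_to_float(float_to_bf16(x));
    printf("%.9g -> %.9g\n", x, y);  /* shows the ~3-digit precision */
    return 0;
}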

What is the size of metadata in a postgres table?

There is a table in Postgres 9.4 with the following columns:
NAME           | TYPE                     | TYPE SIZE
---------------+--------------------------+----------
id             | integer                  | 4 bytes
timestamp      | timestamp with time zone | 8 bytes
num_seconds    | double precision         | 8 bytes
count          | integer                  | 4 bytes
total          | double precision         | 8 bytes
min            | double precision         | 8 bytes
max            | double precision         | 8 bytes
local_counter  | integer                  | 4 bytes
global_counter | integer                  | 4 bytes
discrete_value | integer                  | 4 bytes
Giving 60 bytes per row in total.
The size of the table (with TOAST) returned by pg_table_size(table) is 49 152 bytes.
Number of rows in the table: 97.
Since a table is split into pages of 8 kB, the table occupies 49 152 / 8 192 = 6 pages.
Each page and each row carries some metadata...
Looking at the pure data type sizes, we should expect something around 97 * 60 = 5 820 bytes of row data, and even adding approximately the same amount of metadata we do not land anywhere close to the 49 152 bytes returned by pg_table_size.
Does metadata really take ~9x the space of the pure data in Postgres?
A factor of 9 is clearly more wasted space ("bloat") than there should be:
Each page has a 24-byte header.
Each row has a 23-byte "tuple header".
There will be four bytes of padding between id and timestamp and between count and total for alignment reasons (you can avoid that by reordering the columns).
Moreover, each tuple has a "line pointer" of four bytes in the data page.
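A rough, hedged estimate (assuming the 8-byte alignment rules above) of the on-disk cost per row:

#include <stdio.h>

int main(void) {
    int data     = 60 + 8;  /* column data plus alignment padding         */
    int header   = 24;      /* 23-byte tuple header, padded for alignment */
    int tuple    = ((header + data + 7) / 8) * 8;  /* tuples are 8-byte aligned */
    int line_ptr = 4;       /* line pointer in the page's item table      */
    int per_row  = tuple + line_ptr;
    int rows     = 97;
    int total    = per_row * rows;

    printf("per row: %d bytes, total: %d bytes (~%d pages)\n",
           per_row, total, (total + 8191) / 8192);
    return 0;
}

That comes to about 100 bytes per row, or roughly 9 700 bytes (two pages), so per-row metadata alone cannot explain six pages; the remainder is most likely free space and dead tuples left behind by updates and deletes.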
To see exactly how the space in your table is used, install the pgstattuple extension:
CREATE EXTENSION pgstattuple;
and use the pgstattuple function on the table:
SELECT * FROM pgstattuple('tablename');
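In the output, the tuple_len, dead_tuple_len, and free_space columns show how the 49 152 bytes break down between live rows, dead rows, and empty space.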

Float stored as Real in database Sql Server

Why is float stored as real in sys.columns or INFORMATION_SCHEMA.COLUMNS when the precision is <= 24?
CREATE TABLE dummy
(
a FLOAT(24),
b FLOAT(25)
)
Checking the data types:
SELECT TABLE_NAME,
COLUMN_NAME,
DATA_TYPE,
NUMERIC_PRECISION
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'dummy'
Result:
+------------+-------------+-----------+-------------------+
| TABLE_NAME | COLUMN_NAME | DATA_TYPE | NUMERIC_PRECISION |
+------------+-------------+-----------+-------------------+
| dummy      | a           | real      | 24                |
| dummy      | b           | float     | 53                |
+------------+-------------+-----------+-------------------+
So why is float stored as real when the precision is less than or equal to 24? Is this documented somewhere?
From an MSDN article which discusses the difference between float and real in T-SQL:
The ISO synonym for real is float(24).
float [ (n) ]
Where n is the number of bits that are used to store the mantissa of the float number in scientific notation and, therefore, dictates the precision and storage size. If n is specified, it must be a value between 1 and 53. The default value of n is 53.
n value | Precision | Storage size
--------+-----------+-------------
1-24    | 7 digits  | 4 bytes
25-53   | 15 digits | 8 bytes
SQL Server treats n as one of two possible values. If 1<=n<=24, n is treated as 24. If 25<=n<=53, n is treated as 53.
As to why SQL Server labels it as real: I think that is just because real is the ISO synonym; under the hood it is still a float(24).
For more information, see:
https://msdn.microsoft.com/en-us/library/ms173773.aspx
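Under the hood, float(24) and float(53) map onto the same IEEE 754 types C calls float and double, so a small C sketch (illustrative only, not T-SQL) shows where the sizes and digit counts come from; note that C's FLT_DIG is a slightly conservative 6 versus the 7 digits quoted above:

#include <stdio.h>
#include <float.h>

int main(void) {
    /* float(1..24)  -> 4-byte single precision (what SQL Server calls real)
       float(25..53) -> 8-byte double precision                             */
    printf("single: %zu bytes, %d decimal digits\n", sizeof(float), FLT_DIG);
    printf("double: %zu bytes, %d decimal digits\n", sizeof(double), DBL_DIG);
    return 0;
}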

NSNumber how to get smallest common denominator? Like 3/8 for 0.375?

Say I have an NSNumber that is between 0 and 1 and can be represented as X/Y; how do I calculate the X and Y in this case? I don't want to compare:
if (number.doubleValue == 0.125)
{
    X = 1;
    Y = 8;
}
so I get 1/8 for 0.125
That's relatively straightforward. For example, 0.375 is equivalent to 0.375/1.
The first step is to multiply numerator and denominator by 10 until the numerator is an integral value (a), giving you 375/1000.
Then find the greatest common divisor and divide both numerator and denominator by it.
A (recursive) function for GCD is:
int gcd (int a, int b) {
    return (b == 0) ? a : gcd (b, a % b);
}
If you call that with 375 and 1000, it will spit out 125 so that, when you divide the numerator and denominator by that, you get 3/8.
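Putting the steps together, here is a small self-contained C sketch (my own illustration, subject to the caveats in (a) below) that scales by 10 until the numerator is integral and then reduces by the GCD:

#include <stdio.h>

static long gcd(long a, long b) {
    return (b == 0) ? a : gcd(b, a % b);
}

int main(void) {
    double value = 0.375;
    long den = 1;

    /* Scale by 10 until the value is integral (assumes this terminates
       within the range of long; see (a) below). */
    double scaled = value;
    while (scaled != (double)(long)scaled) {
        scaled *= 10;
        den *= 10;
    }
    long num = (long)scaled;

    long g = gcd(num, den);
    printf("%g = %ld/%ld\n", value, num / g, den / g);  /* 0.375 = 3/8 */
    return 0;
}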
(a) As pointed out in the comments, there may be problems with numbers that have more precision bits than your integer types (such as IEEE754 doubles with 32-bit integers). You can solve this by choosing integers with a larger range (longs, or a bignum library like MPIR) or choosing a "close-enough" strategy (consider it an integer when the fractional part is relatively insignificant compared to the integral part).
Another issue is that some numbers cannot be represented exactly in IEEE754, such as the infamous 0.1 and 0.3.
Unless a number can be represented as a sum of 2^-n values, where n is limited by the available precision (such as 0.375 being 1/4 + 1/8), the best you can hope for is an approximation.
As an example, consider the single-precision value 1/3 (you'll see why single precision below; I'm too lazy to do the whole 64 bits). This is stored as:
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
0 01111101 01010101010101010101010
In this example, the sign bit is 0, hence it's a positive number.
The exponent bits give 125 which, when you subtract the 127 bias, gives you -2. Hence the multiplier will be 2^-2, or 0.25.
The mantissa bits are a little trickier. They form the sum of an implicit 1 along with all the 2^-n values for the 1 bits, where n runs from 1 through 23 (left to right). So the mantissa is calculated thus:
s eeeeeeee mmmmmmmmmmmmmmmmmmmmmmm
0 01111101 01010101010101010101010
| | | | | | | | | | |
| | | | | | | | | | +-- 0.0000002384185791015625
| | | | | | | | | +---- 0.00000095367431640625
| | | | | | | | +------ 0.000003814697265625
| | | | | | | +-------- 0.0000152587890625
| | | | | | +---------- 0.00006103515625
| | | | | +------------ 0.000244140625
| | | | +-------------- 0.0009765625
| | | +---------------- 0.00390625
| | +------------------ 0.015625
| +-------------------- 0.0625
+---------------------- 0.25
Implicit 1
========================
1.3333332538604736328125
When you multiply that by 0.25 (see exponent earlier), you get:
0.333333313465118408203125
Now that's why they say you only get about 7 decimal digits of precision (15 for IEEE754 double precision).
Were you to pass that actual number through my algorithm above, you would not get 1/3, you would instead get:
5,592,405
---------- (or 0.333333313465118408203125)
16,777,216
But that's not a problem with the algorithm per se, more a limitation of the numbers you can represent.
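If you want to verify that fraction yourself, here is a hedged C sketch (my own, not part of the answer above) that recovers the exact stored fraction from the float's bits using frexp and ldexp:

#include <stdio.h>
#include <math.h>

static long long gcd(long long a, long long b) {
    return (b == 0) ? a : gcd(b, a % b);
}

int main(void) {
    float f = 1.0f / 3.0f;

    /* frexp gives f = m * 2^e with 0.5 <= m < 1; scaling m by 2^24 (the
       number of significand bits in a float) makes it an exact integer.
       Assumes 0 < f < 1 so the shift below stays positive. */
    int e;
    long long num = (long long)ldexp(frexp(f, &e), 24);
    long long den = 1LL << (24 - e);

    long long g = gcd(num, den);
    printf("%lld/%lld\n", num / g, den / g);  /* 5592405/16777216 */
    return 0;
}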
Thanks to Wolfram Alpha for helping out with the calculations. If you ever need to do any math that stresses out your calculator, that's one of the best tools for the job.
As an aside, you'll no doubt notice that the mantissa bits follow a certain pattern: 0101010101.... This is because 1/3 is an infinitely recurring binary value as well as an infinitely recurring decimal one. You would need an infinite number of 01 bits at the end to represent 1/3 exactly.
You can try this:
- (CGPoint)yourXAndYValuesWithANumber:(NSNumber *)number
{
    double x = 1.0;
    double y = x / number.doubleValue;  // y starts as 1/value, so value = x/y
    for (int i = 1; TRUE; i++)
    {
        // Stop at the first multiplier that makes the denominator integral
        if ((double)(long)(y * i) == y * i)
        // Alternatively floor(y * i), instead of (double)(long)(y * i)
        {
            x *= i;  // numerator
            y *= i;  // denominator
            break;
        }
    }
    /* Also alternatively:
       int coefficient = 1;
       while (floor(y * coefficient) != y * coefficient) coefficient++;
       x *= coefficient, y *= coefficient; */
    return CGPointMake(x, y);
}
This will not work if you have invalid input: X and Y have to exist and be valid natural numbers (1 to infinity). A good example that will break it is 1/pi. If you have limits, you can do some critical thinking to implement them.
The approach outlined by paxdiablo is spot-on.
I just wanted to provide an efficient GCD function (implemented iteratively):
int gcd (int a, int b) {
    int c;
    while (a != 0) {
        c = a;
        a = b % a;
        b = c;
    }
    return b;
}

Rounding off a list of numbers to a user-defined step while preserving their sum

I've been reading a lot of posts about rounding numbers, but I couldn't manage to do what I want:
I have a list of positive floats.
The unsigned integer roundOffStep to use is user-defined; I have no control over it.
I want to round as accurately as possible while preserving the sum of the numbers, or at least while keeping the new sum less than or equal to the original sum.
How would I do that? I am terrible with algorithms, so this is way too tricky for me.
Thanks.
EDIT : Adding a Test case :
FLOATS
29.20
18.25
14.60
8.76
2.19
sum = 73;
Let's say roundOffStep = 5;
ROUNDED FLOATS
30
15
15
10
0
sum = 70 < 73 OK
1. Round all numbers to the nearest multiple of roundOffStep normally.
2. If the new sum is lower than the original sum, you're done.
3. Otherwise, for each number, calculate rounded_number - original_number and sort the differences in decreasing order so you can find the numbers with the largest difference.
4. Pick the number that gives the largest difference rounded_number - original_number and subtract roundOffStep from that number.
5. Repeat step 4 (picking the next largest difference each time) until the new sum is less than the original.
This process should ensure that the rounded numbers are as close as possible to the originals without going over the original sum; a C sketch of it follows the worked example below.
Example, with roundOffStep = 5:
Original Numbers | Rounded | Difference
-----------------+---------+-----------
           29.20 |      30 |       0.80
           18.25 |      20 |       1.75
           14.60 |      15 |       0.40
            8.76 |      10 |       1.24
            2.19 |       0 |      -2.19
-----------------+---------+-----------
         Sum: 73 |      75 |
The sum is too large, so we pick the number giving the largest difference (18.25 which was rounded to 20) and subtract 5 to give 15. Now the sum is 70, so we're done.
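Here is a minimal C sketch of the procedure (my own illustration; instead of pre-sorting the differences, it simply re-scans for the current largest one on each pass, which yields the same result):

#include <stdio.h>
#include <math.h>

#define N 5

int main(void) {
    double values[N] = {29.20, 18.25, 14.60, 8.76, 2.19};
    int step = 5;  /* the user-defined roundOffStep */
    double rounded[N];
    double orig_sum = 0, new_sum = 0;

    /* Step 1: round each value to the nearest multiple of step. */
    for (int i = 0; i < N; i++) {
        rounded[i] = round(values[i] / step) * step;
        orig_sum += values[i];
        new_sum  += rounded[i];
    }

    /* Steps 2-5: while the rounded sum exceeds the original, lower the
       entry whose rounding gained the most by one step. */
    while (new_sum > orig_sum) {
        int worst = 0;
        for (int i = 1; i < N; i++)
            if (rounded[i] - values[i] > rounded[worst] - values[worst])
                worst = i;
        rounded[worst] -= step;
        new_sum -= step;
    }

    for (int i = 0; i < N; i++)
        printf("%6.2f -> %g\n", values[i], rounded[i]);
    printf("sum: %g -> %g\n", orig_sum, new_sum);  /* 73 -> 70 */
    return 0;
}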