GMP variable's bit size


How can I find out the size of a declared variable in GMP? In other words, how is the size of a GMP integer decided?
mpz_random(temp,1);
The manual says that this function gives "temp" a size of 1 limb (= 32 bits on my computer),
but the result is only a 9-digit number.
I don't think a 32-bit number holds only 9 digits,
so please help me find out the size of an integer variable in GMP.
Thanks in advance.

mpz_sizeinbase(num, 2) will give you the size in 'used' bits.

32 bits (4 bytes) really can store only 9 full decimal digits:
2^32 = 4 294 967 296
so there are only 9 full decimal digits here (the 10th digit only ranges from 0 to 4, so it is not a full digit).
You can check this with logarithms:
log_10(2^32)
Let's ask Google:
log base 10(2^32) = 9.63295986
Everything is correct.
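The same arithmetic can be checked in a few lines. This is only a sketch in Python (not GMP itself), assuming the 32-bit limb mentioned in the question; `bit_length()` plays the role that mpz_sizeinbase(num, 2) plays in GMP.

```python
import math

limb_bits = 32                      # limb size quoted in the question
largest = 2**limb_bits - 1          # largest value that fits in one limb

# Number of bits actually used, the same idea as mpz_sizeinbase(num, 2).
print(largest.bit_length())         # 32

# Decimal digits that are always guaranteed to fit in 32 bits.
full_digits = math.floor(limb_bits * math.log10(2))
print(full_digits)                  # 9

# The largest value has 10 digits, but the leading one never exceeds 4.
print(len(str(largest)))            # 10
```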

You can check the number of limbs in a debugger. A GMP integer has the internal field '_mp_size' which is the count of the limbs used to hold the current value of the variable (0 is a special case: it's represented with _mp_size = 0). Here's an example I ran in Visual C++ (see my article How to Install and Run GMP on Windows Using MPIR):
mpz_set_ui(temp, 1073741824); //2^30, (_mp_size = 1)
mpz_mul(temp,temp,temp); //2^60 (_mp_size = 2)
mpz_mul(temp,temp,temp); //2^120 (_mp_size = 4)
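If no debugger is at hand, the limb counts above can be predicted from the bit length. The helper below is a hypothetical Python sketch (the name limbs_needed is mine, not a GMP function) and assumes 32-bit limbs as in the question.

```python
def limbs_needed(value, limb_bits=32):
    """Count limbs of `limb_bits` bits needed for abs(value); 0 needs none."""
    bits = abs(value).bit_length()
    return (bits + limb_bits - 1) // limb_bits   # ceiling division

temp = 1073741824                   # 2^30
print(limbs_needed(temp))           # 1
temp *= temp                        # 2^60
print(limbs_needed(temp))           # 2
temp *= temp                        # 2^120
print(limbs_needed(temp))           # 4
```

With 64-bit limbs (limb_bits=64) the same values would need 1, 1 and 2 limbs.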

Related

Numeric value (9000000000000) out of range of int (-2147483648 - 2147483647) in splunk

Because I have very large data, I changed the value of maxresultrows
But when I use the dbxquery command, I get the following error. Is there a solution?

The solution is to set maxresultrows to a value in the specified range.
Also, be aware of this note about maxresultrows in limits.conf.spec:
This limit should not exceed 50000.

The int in question is a 32-bit signed integer (4 bytes), even on a 64-bit machine. One bit is reserved for the sign, so the range runs from -2^31 (= -2147483648) to 2^31-1 (= 2147483647). Storing 9000000000000 in it is like trying to write the number 12345 with only 4 digits.
Integer sizes do vary between languages and types, but many languages fix the plain int at 4 bytes.
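To make the range concrete, here is a small Python sketch of the check that fails. The helper name fits_in_int32 is made up for illustration; it is not part of Splunk.

```python
INT_MIN = -2**31
INT_MAX = 2**31 - 1

def fits_in_int32(value):
    """True if value can be stored in a 32-bit signed integer."""
    return INT_MIN <= value <= INT_MAX

print(INT_MIN, INT_MAX)              # -2147483648 2147483647
print(fits_in_int32(9000000000000))  # False
print(fits_in_int32(50000))          # True
```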

what's best way for awk to check arbitrary integer precision

from GNU gawk's page
https://www.gnu.org/software/gawk/manual/html_node/Checking-for-MPFR.html
they have a formula to check arbitrary precision
function adequate_math_precision(n) { return (1 != (1+(1/(2^(n-1))))) }
My question is : wouldn't it be more efficient by staying within integer math domain with a formula such as
( 2^abs(n) - 1 ) % 2 # note 2^(n-1) vs. 2^|n| - 1
Since any power of 2 must also be even, then subtracting 1 must always be odd, then its modulo (%) over 2 becomes indicator function for is_odd() for n >= 0, while the abs(n) handles the cases where it's negative.
Or does the modulo necessitate a cast to floating point, thus nullifying any gains?

Good question. Let's tackle it.
The proposed snippet aims at checking whether gawk was invoked with the -M option.
I'll attach some digression on that option at the bottom.
The argument n of the function is the floating point precision needed for whatever operation you'll have to perform. So, say your script is in a library somewhere and will get called, but you have no control over how. You'll run that function at the beginning of the script to throw an error promptly and bail out, warning that the end result would be wrong due to a lack of bits to store numbers.
Your code stays in the integer realm: a power of two of an integer is an integer. There is no need to use abs(n) here, because there is no point in specifying how many bits you'll need as a negative number in the first place.
Then you subtract one from an even integer. Now, unless n=0, in which case 2^0=1 and your code reads (1 - 1) % 2 = 0, your snippet will always return 1, because the remainder (%) of an odd number divided by two is 1.
Problem is: you are trying to calculate a potentially stupidly large number in a function that should check if you are able to do so in the first place.
Since any power of 2 must also be even, then subtracting 1 must always
be odd, then its modulo (%) over 2 becomes indicator function for
is_odd() for n >= 0, while the abs(n) handles the cases where it's
negative.
Except when n=0, as we discussed above, you are right. The snippet will tell you that any power of 2 is even, and any power of 2, minus 1, is odd. We were discussing another subject entirely, though.
Let's analyze the other function instead:
return (1 != (1+(1/(2^(n-1)))))
Remember that booleans in awk work like this: 0 is false and any non-zero value is true. Mathematically, 1+x is guaranteed to be != 1 when x is a very small positive number, typically the reciprocal of a large power of two (1/2^122 in the example page). In the digital world that's not the case. At some point the floating-point computation hits its precision floor, the sum is rounded back down to 1, and x effectively counts as 0. At that point the function returns 0 (false): 1 is equal to 1.
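The rounding behaviour is easy to see outside awk too. Below is a sketch that ports the check to Python, whose floats are IEEE doubles with a 53-bit significand, the same as gawk's numbers without -M. It only illustrates the rounding; it says nothing about how gawk is implemented.

```python
def adequate_math_precision(n):
    """Python float port of the gawk check: True if 1 + 2^-(n-1) differs from 1."""
    return 1 != (1 + (1 / (2 ** (n - 1))))

print(adequate_math_precision(32))   # True
print(adequate_math_precision(53))   # True, 1 + 2^-52 is still representable
print(adequate_math_precision(54))   # False, 1 + 2^-53 rounds back to 1
print(adequate_math_precision(123))  # False, the 2^122 case from the page
```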
A larger discussion on types and data representation
The page you link explains precision for gawk invoked with the -M option. This sounds like technoblahblah, let's decipher it.
At one point, your OS architecture has to decide how to store data, how to represent it in memory so that it can be accessed again and displayed. Terms like Integer, Float, Double, Unsigned Integer are examples of data representation. We here are addressing Integer representation: how is an integer stored in memory?
A 32-bit system will use 4 bytes to represent an integer, which in turn determines how large the integer can be. The 32 bits are read from most significant (MSB) to least significant (LSB) and, if the integer is signed, one bit represents the sign (typically the MSB, which halves the largest magnitude that can be stored).
If asked to compute a large number, a machine will try to fit it in the largest type available. If the end result is larger than that, you have an overflow and end up with a wrong result or an error. Many online challenges ask you to write code for arbitrarily long loops or large sums, then test it with inputs that will break the 64-bit barrier, to see if you have mastered proper types for indexes.
AWK is not a strongly typed language. Meaning, any variable can store data, regardless of the type. The data type can change and it is determined at runtime by the interpreter, so that the developer doesn't need to care. For instance:
$ awk 'BEGIN {a="this is text"; print a; a=2; print a; print a+3.0*2}'
-| this is text
-| 2
-| 8
In the example, a is first text, then an integer, and it can be added to a floating point number and printed as an integer without any special type handling.
The Arbitrary Precision Page presents the following snippet:
$ gawk -M 'BEGIN {
> s = 2.0
> for (i = 1; i <= 7; i++)
> s = s * (s - 1) + 1
> print s
> }'
-| 113423713055421845118910464
There is some math voodoo behind, we will skip that. Since s is interpreted as a floating point number, the end result is computed as floating point.
Try to input that number on Windows calculator as decimal, and it will fail. Although you can compute it as a binary. You'll need the programmer setting and to add up to 53 bits to be able to fit it as unsigned integer.
53 is a magic number here: with the -M option, gawk uses arbitrary precision for numbers. In other words, it decides how many bits are necessary, tracks them and breaks free of the native OS architecture. By default, gawk allocates 53 bits of precision for any arbitrary-precision floating-point number. Fun fact: the result printed by that snippet is wrong, and it would take up to 100 bits to compute it correctly.
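To see that the printed value really is off, the same recurrence can be run with exact integers. This is a Python sketch, not gawk: Python ints have arbitrary precision, while Python floats have the same 53-bit significand as gawk's default. The function name iterate is mine.

```python
def iterate(s, steps=7):
    """Apply s = s * (s - 1) + 1 the given number of times."""
    for _ in range(steps):
        s = s * (s - 1) + 1
    return s

exact = iterate(2)        # integer arithmetic, no rounding
rounded = iterate(2.0)    # floating point, rounded to 53 bits

print(exact)                  # 113423713055421844361000443
print(exact.bit_length())     # 87
print(int(rounded) == exact)  # False
```

The exact value needs 87 bits, which is why 53 bits cannot hold it and 100 bits can.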
To implement arbitrary large numbers handling, gawk relies on an external library called MPFR. Provided with an arbitrary large number, MPFR will handle the memory allocation and bit requisition to store it. However, the interface between gawk and MPFR is not perfect, and gawk can't always control the type that MPFR will use. In case of integers, that's not an issue. For floating point numbers, that will result in rounding errors.
This brings us back to the snippet at the beginning: if gawk was called with the -M option, numbers up to 2^53 can be stored as integers. Floating points will be smaller than that (you'll need to make the comma disappear somehow, or rather represent it spending some of the bits allocated for that number, just like the sign). Following the example of the page, and asking an arbitrary precision larger than 32, the snippet will return TRUE only if the -M option was passed, otherwise 1/2^(n-1) will be rounded down to be 0.

What is the difference in presentation between hexadecimal ASCII And hexadecimal number

I have two questions:
What is the difference in presentation between hexadecimal ASCII and a hexadecimal number?
I mean that when we say
var db 31H
how can we tell whether we mean the character a or the number 31H?
Why does this code behave like this?
1- a db 4 dup(41h)
2- b dw 2 dup(4141h)
I thought these two lines would behave the same way, but when I inspect the variables after the second line, there are 8 bytes and each one holds the number 41h.
But something must be wrong, because dw is 2 bytes and we are asking for 2 of them, so it should be 4 bytes, not 8 bytes.

The answer to the first question is simple: in a computer's memory, there is no ASCII, no numbers, no images ... there are just bits. 31H represents the string of bits 00110001; nothing more, nothing less.
It's only when you do something with those bits (display them to a screen, use them in a mathematical operation, etc) that you interpret it as meaning 1 (which it would in ASCII), or a (in some other character encoding), or 49 (as a decimal number), or a particular shade of blue in your colour palette.
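A quick way to convince yourself is to take the one byte and interpret it several ways. The sketch below uses Python rather than assembly, purely for illustration.

```python
raw = bytes([0x31])                 # the single byte 31H

print(format(raw[0], "08b"))        # 00110001  (the bits themselves)
print(raw[0])                       # 49        (read as a decimal number)
print(raw.decode("ascii"))          # 1         (read as an ASCII character)
```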

Redis int representation of a string is bigger when the string is more than 7 bytes but smaller otherwise

I'm trying to reduce Redis's objects size as much as I can and I've taken this whole week to experiment with it.
While testing different data representations I found out that an int representation of the string "hello" results in a smaller object. It may not look like much, but if you have a lot of data it can make a difference between using a few GB memory vs dozens of it.
Look at the following example (you can try it yourself if you want):
> SET test:1 "hello"
> debug object test:1
> Value at:0xb6c9f380 refcount:1 encoding:raw serializedlength:6 lru:9535350 lru_seconds_idle:7
In particular you should look at serializedlength which is 6 (bytes) in this case.
Now, look at the following int representation of it:
> SET test:2 "857715"
> debug object test:2
> Value at:0xb6c9f460 refcount:1 encoding:int serializedlength:5 lru:9535401 lru_seconds_idle:2
As you see, it results in a byte shorter object (note also encoding:int which I think is suggesting that ints get handled in a more efficient way).
With the string "hello w" (you'll see in a few moments why I didn't use "hello world" instead) we get an even bigger saving when it's represented as an int:
> SET test:3 "hello w"
> SET test:4 "857715023" <- Int representation. Notice that I inserted a "0", if I don't, it results in a bigger object and the encoding is set to "raw" instead (after all a space is not an int).
>
> debug object test:3
> Value at:0xb6c9f3a0 refcount:1 encoding:raw serializedlength:8 lru:9535788 lru_seconds_idle:6
> debug object test:4
> Value at:0xb6c9f380 refcount:1 encoding:int serializedlength:5 lru:9535809 lru_seconds_idle:5
It looks cool as long as you don't exceed a 7-byte string. Look at what happens with the int representation of "hello wo":
> SET test:5 "hello wo"
> SET test:6 "85771502315"
>
> debug object test:5
> Value at:0xb6c9f430 refcount:1 encoding:raw serializedlength:9 lru:9535907 lru_seconds_idle:9
> debug object test:6
> Value at:0xb6c9f470 refcount:1 encoding:raw serializedlength:12 lru:9535913 lru_seconds_idle:5
As you can see the int (12 bytes) is bigger than the string representation (9 bytes).
My question here is, what's going on behind the scenes when you represent a string as an int, that it is smaller until you reach 7 bytes?
Is there a way to increase this limit as you do with "list-max-ziplist-entries/list-max-ziplist-value" or a clever way to optimize this process so that it always (or nearly) results in a smaller object than a string?
UPDATE
I've experimented further with other tricks, and you can actually have ints smaller than the string regardless of its size, but that involves a little more work in terms of data structure modelling.
I've found out that if you split the int representation of a string into chunks of ~8 digits each, it ends up being smaller.
Take as an example the word "Hello World Hi Universe" and create both a string and int SET:
> HMSET test:7 "Hello" "World" "Hi" "Universe"
> HMSET test:8 "74111114" "221417113" "78" "2013821417184"
The results are as follows:
> debug object test:7
> Value at:0x7d12d600 refcount:1 encoding:ziplist serializedlength:40 lru:9567096 lru_seconds_idle:296
>
> debug object test:8
> Value at:0x7c17d240 refcount:1 encoding:ziplist serializedlength:37 lru:9567531 lru_seconds_idle:2
As you can see we got the int set smaller by 3 bytes.
The problem in this will be how to organize such a thing, but it shows that it's possible nonetheless.
Still, don't know where this limit is set. The ~700K persistent use of memory (even when you have no data inside) makes me think that there is a pre-defined "pool" dedicated to the optimization of int sets.
UPDATE2
I think I've found where this intset "pool" is defined in Redis source.
At line 81 in the file redis.h there is the define REDIS_SHARED_INTEGERS, set to 10000:
REDIS_SHARED_INTEGERS
I suspect it's the one defining the limit of an intset byte length.
I have to try to recompile it with an higher value and see if I can use a longer int value (it'll most probably allocate more memory if it's the one I think of).
UPDATE3
I want to thank Antirez for the reply! Didn't expect that.
As he made me notice, len != memory usage.
I got further in my experiment and saw that the objects get already slightly compressed (serialized). I may have missed something from the Redis documentation.
The confirmation comes from analyzing a Redis key with the command redis-memory-for-key key, which actually returns the memory usage and not the serialized length.
For example, let's take the "hello" string and int we used before, and see what's the result:
~ # redis-memory-for-key test:1
Key "test:1"
Bytes 101
Type string
~ #
~ # redis-memory-for-key test:2
Key "test:2"
Bytes 87
Type string
As you can notice the intset is smaller (87 bytes) than the string (101 bytes) anyway.
UPDATE4
Surprisingly a longer intset seems to affect its serializedlength but not memory usage..
This makes it possible to actually build a 2digit-char mapping that is still more memory efficient than a string, without even chunking it.
By 2digit-char mapping I mean that instead of mapping "hello" to "85121215" we map it to digits with a fixed length of 2 each, prefixing it with "0" if digit < 10 like "0805121215".
A custom script would then proceed by taking every two digit apart and converting them to their equivalent char:
08 05 12 12 15
\ | | | /
h e l l o
This is enough to avoid ambiguity (like "o" and "ae", which would otherwise both result in the digits "15").
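Such a custom script could look roughly like this. It is a minimal Python sketch that assumes lowercase a..z only, mapped to 01..26; the function names encode and decode are just placeholders.

```python
def encode(word):
    """Map a..z to 01..26, two digits per letter."""
    return "".join(f"{ord(ch) - ord('a') + 1:02d}" for ch in word)

def decode(digits):
    """Split into pairs and map each pair back to a letter."""
    pairs = (digits[i:i + 2] for i in range(0, len(digits), 2))
    return "".join(chr(int(p) + ord('a') - 1) for p in pairs)

print(encode("hello"))            # 0805121215
print(decode("0805121215"))       # hello
print(encode("o"), encode("ae"))  # 15 0105
```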
I'll show you this works by creating another set and therefore analyzing its memory usage like I did before:
> SET test:9 "0805070715"
Unix shell
----------
~ # redis-memory-for-key test:9
Key "test:9"
Bytes 87
Type string
You can see that we have a memory win here.
The same "hello" string compressed with Smaz for comparison:
>>> smaz.compress('hello')
'\x10\x98\x06'
// test:10 would be unfair as it results in a byte longer object
SET post:1 "\x10\x98\x06"
~ # redis-memory-for-key post:1
Key "post:1"
Bytes 99
Type string

My question here is, what's going on behind the scenes when you represent a
string as an int, that it is smaller until you reach 7 bytes?
Notice that the integer you supplied as test #6 is no longer actually encoded
as an integer, but as raw:
SET test:6 "85771502315"
Value at:0xb6c9f470 refcount:1 encoding:raw serializedlength:12 lru:9535913 lru_seconds_idle:
So we see that a "raw" value occupies one byte plus the length of its string representation. In memory
you get that plus the overhead of the value.
The integer encoding, I suspect, encodes a number as a 32-bit integer; then it will always
need five bytes: one to tell its type, and four to store those 32 bits.
As soon as you overflow the maximum integer representable in 32 bits, which is roughly either 2 billion or 4 billion depending on whether you use a sign or not, you need to revert to raw encoding.
So probably
2147483647 -> five bytes (TYPE_INT 0x7F 0xFF 0xFF 0xFF)
2147483649 -> eleven bytes (TYPE_RAW '2' '1' '4' '7' '4' '8' '3' '6' '4' '9')
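That guess can be checked against the numbers in the question. The function below is a hypothetical Python model of the guess (signed 32-bit integers, one extra byte for type or length); it is not taken from the Redis source.

```python
def guessed_serialized_length(value):
    """Hypothetical model of the guess above, not Redis's actual code."""
    try:
        number = int(value)
        is_canonical = str(number) == value   # rejects "007", " 5", etc.
    except ValueError:
        is_canonical = False
    if is_canonical and -2**31 <= number <= 2**31 - 1:
        return 1 + 4            # type byte + 32-bit integer
    return 1 + len(value)       # length byte + raw characters

for value in ["hello", "857715", "hello w", "857715023", "hello wo", "85771502315"]:
    print(value, guessed_serialized_length(value))
```

The six results (6, 5, 8, 5, 9, 12) match the serializedlength values reported for test:1 through test:6, which is consistent with the guess but does not prove it.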
Now, how can you squeeze a string representation PROVIDED THAT YOU ONLY USE AN ASCII SET?
You can get the string (140 characters):
When in the Course of human events it becomes necessary for one people
to dissolve the political bands which have connected them with another
and convert each character to a six-bit representation; basically its index in the string
"ABCDEFGHIJKLMNOPQRSTUVWXYZ01234 abcdefghijklmnopqrstuvwxyz56789."
which is the set of all the characters you can use.
You can now encode four such "text-only characters" in three "binary characters", a sort of "reverse base 64 encoding"; base64 encoding will get three binary characters and create a four-byte sequence of ASCII characters.
If we were to code it as groups of integers, we would save a few bytes - maybe get it down
to 130 bytes - at the cost of a larger overhead.
With this type of "reverse base64" encoding, we can get 140 character to 35 groups of four characters, which become a string of 35x3 = 105 binary characters, raw encoded to 106 bytes.
As long, I repeat, as you never use characters outside the range above. If you do, you can
enlarge the range to 128 characters and 7 bits, thus saving 12.5% instead of 25%; 140 characters will then become 123 bytes (140 x 7 / 8 = 122.5, rounded up), raw encoded to 124 bytes, and you save (141-124) = 17 bytes.
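Here is how the six-bit "reverse base64" packing might look. It is a Python sketch under the assumptions above (only the 64 characters listed are allowed); pack6 and unpack6 are names I made up, and the decoder is told the original length so it can drop padding bits.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZ01234 abcdefghijklmnopqrstuvwxyz56789."

def pack6(text):
    """Pack each character as its 6-bit index; four characters fill three bytes."""
    bits = 0
    for ch in text:
        bits = (bits << 6) | ALPHABET.index(ch)   # raises ValueError if not allowed
    nbits = 6 * len(text)
    pad = (-nbits) % 8                            # zero bits to reach a byte boundary
    return (bits << pad).to_bytes((nbits + pad) // 8, "big")

def unpack6(data, length):
    """Reverse pack6; `length` is the number of characters originally packed."""
    bits = int.from_bytes(data, "big") >> (8 * len(data) - 6 * length)
    return "".join(ALPHABET[(bits >> (6 * i)) & 63] for i in reversed(range(length)))

packed = pack6("Hello World.")
print(len(packed))                 # 9
print(unpack6(packed, 12))         # Hello World.
print(len(pack6("a" * 140)))       # 105
```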
Compression
If you have much longer strings, you can compress them (i.e., you use a function such as deflate() or gzencode() or gzcompress()). Done straight, the above string becomes 123 bytes. Easy to do.
Compressing many small strings: the Rube Goldberg approach
Since compression algorithms learn, and at the beginning they dare assume nothing, small strings will not compress greatly. They're "all beginning", so to speak. It is like an engine: while it is running cold, performance is inferior.
If you have a "corpus" of text these strings come from, you can use a time-consuming trick that "warms up" the compression engine and may double (or better) its performance.
Suppose you have two strings, COMMON and TARGET (the second one is the one you're interested in). If you z-compressed COMMON you would get, say, ZCMN. If you compressed TARGET you would get ZTRGT.
But as I said, since the gz compression algorithm is stream oriented, and it learns as it goes by, the compression ratio of the second half of any text (provided there aren't freakish statistical distribution changes between halves) is always appreciably higher than that of the first half.
So if you were to compress, say, COMMONTARGET, you'd get ZCMGHQI.
Notice that the first part of the string, as far as almost the end, is the same as before. Indeed if you compressed COMMONFOOBAR, you'd get something like ZCMQKL. And the second part is compressed better than before, even if we count the area of overlap as belonging entirely to the second string.
And this is the trick. Given a family of strings (TARGET, FOOBAR, CASTLE BRAVO), we compress not the strings, but the concatenation of those strings with a large prefix. Then we discard from the result the common compressed prefix. Thus TARGET is taken from the compression of COMMONTARGET (which is ZCMGHQI), and becomes GHQI instead of ZTRGT, with a 20% gain.
The decoder does the reverse: given GHQI, it first applies the common compressed prefix ZCM (which it must know); then it decodes the result, and finally discards the common uncompressed prefix, of which it need only know the length beforehand.
So the first sentence above (140 characters) becomes 123 when compressed by itself; if I take the rest of the Declaration and use it as a prefix, it compresses to 3355 bytes. This prefix plus my 140 bytes becomes 3409 bytes, of which 3352 are common, leaving 57 bytes.
At the cost of storing once the uncompressed prefix in the encoder, and the compressed prefix once in the decoder, and the whole thingamajig running five times as slow, I can now get those 140 bytes down to 57 instead of 123 - less than half of before.
This trick works great for small strings; for larger ones, the advantage isn't worth the pain. Also, different prefixes yield different results. The best prefixes are those that contain most of the sequences that are likely to appear in the string pool, ordered by increasing length.
Added bonus: the compressed prefix also doubles as a sort of weak encryption, as without that, you can't easily decode the compressed strings, even if you might be able to recover some pieces thereof.
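A related mechanism is built into zlib itself: a preset dictionary. This is not the concatenate-and-strip trick described above but a swapped-in technique with the same goal, priming the compressor with a shared prefix that both sides must know. Below is a Python sketch using the standard zlib module; the corpus and target strings are only examples, and the helper names are mine.

```python
import zlib

corpus = (b"When in the Course of human events it becomes necessary for one "
          b"people to dissolve the political bands which have connected them "
          b"with another")
target = b"it becomes necessary for one people to dissolve the political bands"

def compress_with_prefix(data, prefix):
    """Compress data with the compressor primed by a shared prefix."""
    comp = zlib.compressobj(level=9, zdict=prefix)
    return comp.compress(data) + comp.flush()

def decompress_with_prefix(blob, prefix):
    """Decompress; the decoder must be given the very same prefix."""
    decomp = zlib.decompressobj(zdict=prefix)
    return decomp.decompress(blob) + decomp.flush()

plain = zlib.compress(target, 9)
primed = compress_with_prefix(target, corpus)

print(len(target), len(plain), len(primed))
print(decompress_with_prefix(primed, corpus) == target)   # True
```

As with the trick above, the gain depends heavily on how well the prefix matches the strings being stored.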