What kind of hash algorithm is used for Hive's built-in HASH() Function - hive

What kind of hashing algorithm is used in the built-in HASH() function?
I'm ideally looking for a SHA512/SHA256 hash, similar to what the SHA() function offers within the linkedin datafu UDFs for Pig.

HASH function (as of Hive 0.11) uses algorithm similar to java.util.List#hashCode.
Its code looks like this:
int hashCode = 0; // Hive HASH uses 0 as the seed, List#hashCode uses 1. I don't know why.
for (Object item: items) {
hashCode = hashCode * 31 + (item == null ? 0 : item.hashCode());
}
Basically it's a classic hash algorithm as recommended in the book Effective Java.
To quote a great man (and a great book):
The value 31 was chosen because it is an odd prime. If it were even
and the multiplication overflowed, information would be lost, as
multiplication by 2 is equivalent to shifting. The advantage of using
a prime is less clear, but it is traditional. A nice property of 31 is
that the multiplication can be replaced by a shift and a subtraction
for better performance: 31 * i == (i << 5) - i. Modern VMs do this
sort of optimization automatically.
I digress. You can look at the HASH source here.
If you want to use SHAxxx in Hive then you can use Apache DigestUtils class and Hive built-in reflect function (I hope that'll work):
SELECT reflect('org.apache.commons.codec.digest.DigestUtils', 'sha256Hex', 'your_string')

As of Hive 2.1.0 there is a mask_hash function that will hash string values.
For Hive 2.x it uses md5 as the hashing algorithm. This was changed to sha256 for Hive 3.x

Related

Is there a way to wrap an integer value into an integer range [min,max] without division or modulo?

My situation is the same as in this question, but instead of using floats, I'm using integers. I feel like this difference might be significant because there might be some bit twiddling hacks that can be used to wrap the value into the range.
One use case for wrapping a value into a range is assigning a value to a bucket in a hash table, like this (range of [0,len(buckets)]):
bucket = buckets[hash(key) % len(buckets)]
If we are continuously adding values to a hash table, it might be worthwhile to optimize this operation.
Since modulo isn't actually needed, there is a shortcut to mapping a hash into a new range. See Lemire's FastRange.
// map a **full-width 32-bit** hash to [0,p) without division.
uint32_t fastrange32(uint32_t word, uint32_t p) {
return (uint32_t)(((uint64_t)word * (uint64_t)p) >> 32);
}
Simply take the high half of a 32x32 -> 64-bit multiply of word * p. If all your word values are smallish, all your results will be 0, so this is only good for word values that are fairly uniformly distributed over the full 32-bit range. Hash functions are normally fine.
I guess that is slightly biased, but that can be worked around see https://github.com/apple/swift/pull/39143
Often the number of buckets is kept as a power of 2. This allows unwanted bits to just get masked-off.
(e.g. If there are 16 buckets then bucket_id = hash & 0xF.)
The downside is that it requires the number of buckets to double each time the table is resized.

(bitcoin) Calculate hash from getwork function - how to do it?

when I call getwork on my bitcoind server, I get the following:
./bitcoind getwork
{
"midstate" : "695d56ae173bbd0fd5f51d8f7753438b940b7cdd61eb62039036acd1af5e51e3",
"data" : "000000013d9dcbbc2d120137c5b1cb1da96bd45b249fd1014ae2c2b400001511000000009726fba001940ebb5c04adc4450bdc0c20b50db44951d9ca22fc5e75d51d501f4deec2711a1d932f00000000000000800000000000000000000000000000000000000000000000000000000000000000000000000000000080020000",
"hash1" : "00000000000000000000000000000000000000000000000000000000000000000000008000000000000000000000000000000000000000000000000000010000",
"target" : "00000000000000000000000000000000000000000000002f931d000000000000"
}
This protocol does not seem to be documented. How do I compute the hash from this data. I think that this data is in little endian. So the first step is to convert everything to big endian? Once that is done, I calculate the sha256 of the data. The data can be divided in two chuncks of 64 bytes each. The hash of the first chuck is given by midstate and therefore does not have to be computed.
I must therefore hash the chunck #2 with sha256, using the midstate as the initial hash values. Once that is done, I end up with a hash of chunk 2, which is 32 bytes. I calculate the hash of this chunk one more time to get a final hash.
Then, do I convert everything to little endian and submit the work?
What is hash1 used for?
The hash calculation is documented at Block hashing algorithm.
Start there for the relatively simple basics. The basic data structures are documented in Protocol specification - Bitcoin Wiki. Note that the protocol definition (and the definition of work) more or less assumes that SHA-256 hashes are 256-bit little-endian values, rather than big-endian as the standard implies. See also
Getwork is more complicated and runs into more serious endian/byte ordering confusion.
First note that the getwork API is optimized to speed up the initial steps of mining.
The midstate and hash1 values are for these performance optimizations and can be ignored. Just look at the "data".
And when a standard sha256 implementation is used, only the first 80 bytes (160 hex characters) of the "data" are hashed.
Unfortunately, the JSON data presented in the getwork data structure has different endian characteristics than what is needed for hashing in the block example above.
They all say to go to the source for the answer, but the C++ source can be big and confusing. A simple alternative is the poold.py code. There is discussion of it here: New mining pool for testing. You only need to look at the first few lines of the "checkwork" routine, and the "bufreverse" and "bytereverse" functions, to get the byte ordering right. In the end it is just a matter of doing a reversal of the bytes in each 32-bit segment of the data. Yes - very odd. But endian issues are tricky and can end up that way....
Some other helpful information on the way "getwork" works can be found in discussions at:
Do I understand header hashing?
Stupid newbie question about the nonce
Note that finding the signal to noise in the original Bitcoin forum is getting very hard, and there is currently an Area51 proposal for a StackExchange site for Bitcoin and Crypto Currency in general. Come join us!
It sounds right, there is a script in javascript that do calculate the hash but I do not fully understand it so I don't know, maybe you understand it better if you look.
this.tryHash = function(midstate, half, data, hash1, target, nonce){
data[3] = nonce;
this.sha.reset();
var h0 = this.sha.update(midstate, data).state; // compute first hash
for (var i = 0; i < 8; i++) hash1[i] = h0[i]; // place it in the h1 holder
this.sha.reset(); // reset to initial state
var h = this.sha.update(hash1).state; // compute final hash
if (h[7] == 0) {
var ret = [];
for (var i = 0; i < half.length; i++)
ret.push(half[i]);
for (var i = 0; i < data.length; i++)
ret.push(data[i]);
return ret;
} else return null;
};
SOURCE: https://github.com/jwhitehorn/jsMiner/blob/4fcdd9042a69b309035dfe9c9ddf716119831a16/engine.js#L149-165
Frankly speaking
Bitcoin block hashing algorithm is not officially described by any source.
"
The hash calculation is documented at Block hashing algorithm.
"
should read
The hash calculation is "described" at Block hashing algorithm.
en.bitcoin.it/wiki/Block_hashing_algorithm
btw the example code in PHP comes with a bug (typo)
the example code in Python generates errors when run by Python3.3 for Windows XP 32
(missing support for string.decode)

What are the practical uses of modulus (%) in programming? [duplicate]

This question already has answers here:
Closed 12 years ago.
Possible Duplicate:
Recognizing when to use the mod operator
What are the practical uses of modulus? I know what modulo division is. The first scenario which comes to my mind is to use it to find odd and even numbers, and clock arithmetic. But where else I could use it?
The most common use I've found is for "wrapping round" your array indices.
For example, if you just want to cycle through an array repeatedly, you could use:
int a[10];
for (int i = 0; true; i = (i + 1) % 10)
{
// ... use a[i] ...
}
The modulo ensures that i stays in the [0, 10) range.
I usually use them in tight loops, when I have to do something every X loops as opposed to on every iteration..
Example:
int i;
for (i = 1; i <= 1000000; i++)
{
do_something(i);
if (i % 1000 == 0)
printf("%d processed\n", i);
}
One use for the modulus operation is when making a hash table. It's used to convert the value out of the hash function into an index into the array. (If the hash table size is a power of two, the modulus could be done with a bit-mask, but it's still a modulus operation.)
To print a number as string, you need the modulus to find the value of a digit.
string number_to_string(uint number) {
string result = "";
while (number != 0) {
result = cast(char)((number % 10) + '0') ~ result;
// ^^^^^^^^^^^
number /= 10;
}
return result;
}
For the control number of international bank account numbers, the mod97 technique.
Also in large batches to do something after n iterations. Here is an example for NHibernate:
ISession session = sessionFactory.openSession();
ITransaction tx = session.BeginTransaction();
for ( int i=0; i<100000; i++ ) {
Customer customer = new Customer(.....);
session.Save(customer);
if ( i % 20 == 0 ) { //20, same as the ADO batch size
//Flush a batch of inserts and release memory:
session.Flush();
session.Clear();
}
}
tx.Commit();
session.Close();
The usual implementation of buffered communications uses circular buffers, and you manage them with modulus arithmetic.
For languages that don't have bitwise operators, modulus can be used to get the lowest n bits of a number. For example, to get the lowest 8 bits of x:
x % 256
which is equivalent to:
x & 255
Cryptography. That alone would account for an obscene percentage of modulus (I exaggerate, but you get the point).
Try the Wikipedia page too:
Modular arithmetic is referenced in number theory, group theory, ring theory, knot theory, abstract algebra, cryptography, computer science, chemistry and the visual and musical arts.
In my experience, any sufficiently advanced algorithm is probably going to touch on one more of the above topics.
Well, there are many perspectives you can look at it. If you are looking at it as a mathematical operation then it's just a modulo division. Even we don't need this as whatever % do, we can achieve using subtraction as well, but every programming language implement it in very optimized way.
And modulu division is not limited to finding odd and even numbers or clock arithmetic. There are hundreds of algorithms which need this module operation, for example, cryptography algorithms, etc. So it's a general mathematical operation like other +, -, *, /, etc.
Except the mathematical perspective, different languages use this symbol for defining built-in data structures, like in Perl %hash is used to show that the programmer declared a hash. So it all varies based on the programing language design.
So still there are a lot of other perspectives which one can do add to the list of use of %.

Trie Implementation Question

I'm implementing a trie for predictive text entry in VB.NET - basically autocompletion as far as the use of the trie is concerned. I've made my trie a recursive data structure based on the generic dictionary class.
It's basically:
class WordTree Inherits Dictionary(of Char, WordTree)
Each letter in a word (all upper cased) is used as a key to a new WordTrie. A null character on a leaf indicates the termination of a word. To find a word starting with a prefix I walk the trie as far as my prefix goes then collect all children words.
My question is basically on the implementation of the trie itself. I'm using the dictionary hash function to branch my tree. I could use a list and do a linear search over the list, or do something else. What's the smooth move here? Is this a reasonable way to do my branching?
Thanks.
Update:
Just to clarify, I'm basically asking if the dictionary branching approach is obviously inferior to some other alternative. The application in which I'm using this data structure only uses upper case letters, so maybe the array approach is the best. I might use the same data structure for a more complex typeahead situation in the future (more characters). In that case, it sounds like the dictionary is the right approach - up to the point where I need to use something more complex in general.
If it's just the 26 letters, as a 26 entry array. Then lookup is by index. It probably uses less space than the Dictionary if the bucket-list is longer than 26.
If you are worried about space, you can use bitmap compression on the valid byte transitions, assuming the 26char limit.
class State // could be struct or whatever
{
int valid; // can handle 32 transitions -- each bit set is valid
vector<State> transitions;
State getNextState( int ch )
{
int index;
int mask = ( 1 << ( toupper( ch ) - 'A' )) -1;
int bitsToCount = valid & mask;
for( index = 0; bitsToCount ; bitsToCount >>= 1)
{
index += bitsToCount & 1;
}
transitions.at( index );
}
};
There are other ways to do the bit counting Here, the index into the vector is the number of set bits in the valid bitset. the other alternative is the direct indexed array of states;
class State
{
State transitions[ 26 ]; // use the char as the index.
State getNextState( int ch )
{
return transitions[ ch ];
}
};
A good data structure that's efficient in space and potentially gives sub-linear prefix lookups is the ternary search tree. Peter Kankowski has a fantastic article about it. He uses C, but it's straightforward code once you understand the data structure. As he mentioned, this is the structure ispell uses for spelling correction.
I have done this (a trie implementation) in C with 8 bit chars, and simply used the array version (as alluded to by the "26 chars" answer).
HOWEVER, I am guessing that you want full unicode support (since a .NET char is unicode, among other reasons). Assuming you have to have support for unicode, the hash/map/dictionary lookup is probably your best bet, as a 64K entry array in each node won't really work very well.
About the only hack up I could think of on this is to store entire strings (suffixes or possibly "in-fixes") on branches that do not yet split, depending on how sparse the tree, er, trie, is. That adds a lot of logic to detect the multi-char strings, though, and to split them up when an alternate path is introduced.
What is the read vs update pattern?
---- update jul 2013 ---
If .NET strings have a function like java to get the bytes for a string (as UTF-8), then having an array in each node to represent the current position's byte value is probably a good way to go. You could even make the arrays variable size, with first/last bounds indicators in each node, since MANY nodes will have only lower case ASCII letters anyway, or only upper case letters or the digits 0-9 in some cases.
I've found burst trie's to be very space efficient. I wrote my own burst trie in Scala that also re-uses some ideas that I found in GWT's trie implementation. I used it in Stripe's Capture the Flag contest on a problem that was multi-node with a small amount of RAM.

Is there a practical limit to the size of bit masks?

There's a common way to store multiple values in one variable, by using a bitmask. For example, if a user has read, write and execute privileges on an item, that can be converted to a single number by saying read = 4 (2^2), write = 2 (2^1), execute = 1 (2^0) and then add them together to get 7.
I use this technique in several web applications, where I'd usually store the variable into a field and give it a type of MEDIUMINT or whatever, depending on the number of different values.
What I'm interested in, is whether or not there is a practical limit to the number of values you can store like this? For example, if the number was over 64, you couldn't use (64 bit) integers any more. If this was the case, what would you use? How would it affect your program logic (ie: could you still use bitwise comparisons)?
I know that once you start getting really large sets of values, a different method would be the optimal solution, but I'm interested in the boundaries of this method.
Off the top of my head, I'd write a set_bit and get_bit function that could take an array of bytes and a bit offset in the array, and use some bit-twiddling to set/get the appropriate bit in the array. Something like this (in C, but hopefully you get the idea):
// sets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// result is 0 on success, non-zero on failure (offset out-of-bounds)
int set_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
// make sure offset is valid
if(offset < 0 || offset > (num_bytes<<3)-1) { return -1; }
//set the right bit
bytes[offset >> 3] |= (1 << (offset & 0x7));
return 0; //success
}
//gets the n-th bit in |bytes|. num_bytes is the number of bytes in the array
// returns (-1) on error, 0 if bit is "off", positive number if "on"
int get_bit(char* bytes, unsigned long num_bytes, unsigned long offset)
{
// make sure offset is valid
if(offset < 0 || offset > (num_bytes<<3)-1) { return -1; }
//get the right bit
return (bytes[offset >> 3] & (1 << (offset & 0x7));
}
I've used bit masks in filesystem code where the bit mask is many times bigger than a machine word. think of it like an "array of booleans";
(journalling masks in flash memory if you want to know)
many compilers know how to do this for you. Adda bit of OO code to have types that operate senibly and then your code starts looking like it's intent, not some bit-banging.
My 2 cents.
With a 64-bit integer, you can store values up to 2^64-1, 64 is only 2^6. So yes, there is a limit, but if you need more than 64-its worth of flags, I'd be very interested to know what they were all doing :)
How many states so you need to potentially think about? If you have 64 potential states, the number of combinations they can exist in is the full size of a 64-bit integer.
If you need to worry about 128 flags, then a pair of bit vectors would suffice (2^64 * 2).
Addition: in Programming Pearls, there is an extended discussion of using a bit array of length 10^7, implemented in integers (for holding used 800 numbers) - it's very fast, and very appropriate for the task described in that chapter.
Some languages ( I believe perl does, not sure ) permit bitwise arithmetic on strings. Giving you a much greater effective range. ( (strlen * 8bit chars ) combinations )
However, I wouldn't use a single value for superimposition of more than one /type/ of data. The basic r/w/x triplet of 3-bit ints would probably be the upper "practical" limit, not for space efficiency reasons, but for practical development reasons.
( Php uses this system to control its error-messages, and I have already found that its a bit over-the-top when you have to define values where php's constants are not resident and you have to generate the integer by hand, and to be honest, if chmod didn't support the 'ugo+rwx' style syntax I'd never want to use it because i can never remember the magic numbers )
The instant you have to crack open a constants table to debug code you know you've gone too far.
Old thread, but it's worth mentioning that there are cases requiring bloated bit masks, e.g., molecular fingerprints, which are often generated as 1024-bit arrays which we have packed in 32 bigint fields (SQL Server not supporting UInt32). Bit wise operations work fine - until your table starts to grow and you realize the sluggishness of separate function calls. The binary data type would work, were it not for T-SQL's ban on bitwise operators having two binary operands.
For example .NET uses array of integers as an internal storage for their BitArray class.
Practically there's no other way around.
That being said, in SQL you will need more than one column (or use the BLOBS) to store all the states.
You tagged this question SQL, so I think you need to consult with the documentation for your database to find the size of an integer. Then subtract one bit for the sign, just to be safe.
Edit: Your comment says you're using MySQL. The documentation for MySQL 5.0 Numeric Types states that the maximum size of a NUMERIC is 64 or 65 digits. That's 212 bits for 64 digits.
Remember that your language of choice has to be able to work with those digits, so you may be limited to a 64-bit integer anyway.