I want to calculate a hash for strings in Hive without writing any UDF, using only existing functions, so that I can use a similar approach to get a consistent hash in other languages. For example: are there any functions with which I could do something like adding characters or taking an XOR?
It depends on the version of Hive, cf. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF#LanguageManualUDF-Misc.Functions
select XYZ, hash(XYZ) from ABC
has been available for years and applies plain old java.lang.String.hashCode(), returning an INT (32-bit hash).
[Edit 2] Actually it's a bit more complex, since hash() accepts a list of arguments of any type (including primitive types that have no built-in hashing method), so a custom approach is used -- check ObjectInspectorUtils.hashCode() and ObjectInspectorUtils.getBucketHashCode() in the source code here (for V2.1).
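If you want to cross-check from another language, here is a minimal Java sketch; note the equivalence with String.hashCode() only holds for pure-ASCII strings, since Hive actually hashes the UTF-8 bytes of the string with the same multiply-by-31 scheme:

public class HiveHashSketch {
    public static void main(String[] args) {
        String s = "foobar";
        // Should match SELECT hash('foobar') in Hive -- for ASCII input only
        System.out.println(s.hashCode());
    }
}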
select XYZ, crc32(XYZ) from ABC
requires Hive 1.3 and applies a plain old Cyclic Redundancy Check (probably via java.util.zip.CRC32), returning a BIGINT (32-bit hash).
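Again, for cross-checking outside Hive, a minimal Java sketch, assuming the function feeds the UTF-8 bytes of the string into a standard CRC-32:

import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

public class Crc32Sketch {
    public static void main(String[] args) {
        CRC32 crc = new CRC32();
        crc.update("foobar".getBytes(StandardCharsets.UTF_8));
        // Unsigned 32-bit value carried in a long, matching Hive's BIGINT
        System.out.println(crc.getValue());
    }
}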
select XYZ, md5(XYZ), sha1(XYZ), sha2(XYZ,256), sha2(XYZ,512) from ABC
requires Hive 1.3 and applies strong, cryptographic hash functions, returning a STRING with the hexadecimal representation of the binary digest (128-, 160-, 256- and 512-bit hashes).
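These are easy to reproduce in most languages; here is a Java sketch, assuming Hive hex-encodes the digest of the UTF-8 bytes in lowercase:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class CryptoHashSketch {
    // Lowercase hex encoding of a digest, one "%02x" pair per byte
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) sb.append(String.format("%02x", b & 0xff));
        return sb.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] in = "foobar".getBytes(StandardCharsets.UTF_8);
        System.out.println(hex(MessageDigest.getInstance("MD5").digest(in)));
        System.out.println(hex(MessageDigest.getInstance("SHA-1").digest(in)));
        System.out.println(hex(MessageDigest.getInstance("SHA-256").digest(in)));
        System.out.println(hex(MessageDigest.getInstance("SHA-512").digest(in)));
    }
}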
[Edit 1] The answer to that post also includes a very good workaround for applying crypto hash functions with older versions of Hive, using Apache Commons static methods and reflect().
Related
We have two databases/warehouses on two different platforms: Microsoft SQL Server and Snowflake (a cloud data warehouse).
Across both, customers are identified via a unique AccountId (an integer) and a Uuid (32 characters).
For a particular use case, we need to take one of these unique values (say, the AccountId), pass it into a system function, and generate a unique 20-character identifier (it can't be longer or shorter).
This function needs to exist in both systems. (e.g. select sys.myfn(1234) returns the same in each)
I am aware that Snowflake has functions like sha1(): https://docs.snowflake.com/en/sql-reference/functions/sha1.html
Which are equivalent to HASHBYTES() in SQL Server: https://learn.microsoft.com/en-us/sql/t-sql/functions/hashbytes-transact-sql?view=sql-server-ver15
How do I take the output from either and truncate it down to 20 characters and maintain uniqueness?
A UUID is a 128-bit value (with a few bits reserved for version information). If you run that through a hash function, perform a base64 encoding of the hash, and then truncate to 20 characters, you still get 20 * 6 = 120 bits of range. The chance of collision is still in the life-of-the-universe ballpark.
(Note: If you choose to base64 encode the UUID directly, truncation may yield collisions for sequentially assigned UUIDs.)
The integer value can be similarly encoded with little chance of collision with the UUID based values.
If you can find equivalent, usable base64 encoding implementations on both platforms, I think you will be on your way to a solution.
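As a reference point, here is a sketch of the scheme in Java (class and method names are made up for illustration); both platforms would need to reproduce the same SHA-1 + base64 pipeline for the identifiers to match:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class ShortId {
    // Hash the key, base64-encode the 20-byte digest (28 chars with padding),
    // keep the first 20 characters -- roughly 120 bits of the hash survive.
    static String shortId(String key) throws Exception {
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(key.getBytes(StandardCharsets.UTF_8));
        return Base64.getEncoder().encodeToString(digest).substring(0, 20);
    }

    public static void main(String[] args) throws Exception {
        System.out.println(shortId("1234"));  // deterministic: same input, same 20 chars
    }
}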
A pretty simple question: which version of CityHash is hidden behind the HASH function of BigQuery? Is it always the latest (today v1.1), or rather a fixed version?
Now, a little bit of background. I plan on relying heavily on BigQuery to store large sets of data. From those data, I would first like to compute some hash value and store it (something like hashed_value = HASH(CONCAT(column_0, column_1))). So far so good.
Later on, I would like to retrieve rows with a given hash value with a query such as SELECT something FROM [mytable] WHERE hashed_value = HASH(CONCAT('12345', 'foobar')).
My concern here is that the CityHash webpage specifies that those functions are not intended to be backward compatible. So if BigQuery always relies on the latest version of CityHash, I will not be able to retrieve my data based on the hash value of some computed columns after the next CityHash update, and for my application my large database will essentially become useless.
If so, would it be possible to give access to a fixed (or backward-compatible) hash function, in addition to HASH? One of the SHA or MD families, for example, or even a fixed version of CityHash.
Thank you.
CityHash used in BigQuery is the version from
http://code.google.com/p/cityhash/
Looking at the history, it seems like the value can change over time. This might be a good question for:
https://groups.google.com/forum/?fromgroups#!forum/cityhash-discuss
BigQuery should support a consistent hash. We do have support for sha1, but right now the result is unusable because of encoding issues. You can, however, do SELECT TO_BASE64(SHA1(CONCAT('12345', 'foobar')))
Note that we will likely change SHA1 in the near future to automatically base64 encode the results. I've filed an internal bug to make this change.
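For anyone wanting to reproduce that value outside BigQuery, here is a Java sketch, assuming standard SHA-1 over the UTF-8 bytes of the concatenated string and standard (non-URL-safe) base64:

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.Base64;

public class BigQueryShaSketch {
    public static void main(String[] args) throws Exception {
        // Mirrors SELECT TO_BASE64(SHA1(CONCAT('12345', 'foobar')))
        byte[] digest = MessageDigest.getInstance("SHA-1")
                .digest(("12345" + "foobar").getBytes(StandardCharsets.UTF_8));
        System.out.println(Base64.getEncoder().encodeToString(digest));
    }
}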
Very specific issue here… and no, this isn’t homework (left that far, far behind). Basically I need to compute a checksum for code being written to an EPROM, and I’d like to write this function in an Ada program to practice my bit manipulation in the language.
A section of a firmware data file for an EPROM is being changed by me and that change requires a new valid checksum at the end so the resulting system will accept the changed code. This checksum starts out by doing a modulo 256 binary sum of all data it covers and then other higher-level operations are done to get the checksum which I won’t go into here.
So now how do I do binary addition on a mod type?
I assumed that if I use the "+" operator on a mod type it would be summed like an integer operation -- a result I don't want. I'm really stumped on this one. I don't really want to do a packed array and perform the bit carry if I don't have to, especially if that's considered "old hat". References I'm reading claim you need to use mod types to ensure more portable code when dealing with binary operations. I'd like to try that if it's possible. I'm trying to target multiple platforms with this program, so portability is what I'm looking for.
Can anyone suggest how I might perform binary addition on a mod type?
Any starting places in the language would be of great help.
Just use a modular type, for which the operators do unsigned arithmetic.
type Word is mod 2 ** 16;
for Word'Size use 16;
Addendum: For modular types, the predefined logical operators operate on a bit-by-bit basis. Moreover, "the binary adding operators + and - on modular types include a final reduction modulo the modulus if the result is outside the base range of the type." The function Update_Crc is an example.
Addendum: §3.5.4 Integer Types, ¶19 notes that for modular types, the results of the predefined operators are reduced modulo the modulus, including the binary adding operators + and -. Also, the shift functions in §B.2 The Package Interfaces are available for modular types. Taken together, the arithmetic, logical and shift capabilities are sufficient for most bitwise operations.
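For comparison, here is a rough Java analogue (not Ada) of the mod-256 running sum described in the question; Java has no modular types, so the wrap-around that Ada's "+" performs automatically on a mod 2 ** 8 type is emulated by masking after each addition:

public class Mod256Checksum {
    // Mask with 0xFF after every add to reduce modulo 256, matching what
    // Ada's "+" does implicitly on a "mod 2 ** 8" type.
    static int checksum(byte[] data) {
        int sum = 0;
        for (byte b : data) {
            sum = (sum + (b & 0xFF)) & 0xFF;
        }
        return sum;
    }

    public static void main(String[] args) {
        // 255 + 2 = 257, which is 1 modulo 256
        System.out.println(checksum(new byte[] { (byte) 0xFF, 0x02 }));
    }
}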
I would like to create unique string columns (32 characters in length) from a combination of columns with different data types in SQL Server 2005.
I found the solution elsewhere on Stack Overflow:
SELECT SUBSTRING(master.dbo.fn_varbintohexstr(HashBytes('MD5', 'HelloWorld')), 3, 32)
The answer thread is here
With HASHBYTES you can create SHA1 hashes, which are 20 bytes, and MD5 hashes, which are 16 bytes. There are various combination algorithms that can produce arbitrary-length material by repeated hash operations, like the PRF of TLS (see RFC 2246).
This should be enough to get you started. You need to define what '32 characters' means, since hash functions produce bytes, not characters. Also, you need to internalize that no algorithm can possibly produce fixed-length hashes without collisions (guaranteed 'unique'). Although at 32 bytes of length (assuming that by 'characters' you mean bytes) the theoretical collision probability reaches 50% only at about 4 x 10^38 hashed elements (see the birthday problem), that assumes a perfect distribution for your 32-byte output hash function, which you're not going to achieve.
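If it helps, here is the Java-side counterpart of that T-SQL, assuming "32 characters" means the hex encoding of the 16-byte MD5 digest (letter case may differ from fn_varbintohexstr's output):

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Md5Hex {
    public static void main(String[] args) throws Exception {
        // Same digest as HashBytes('MD5', 'HelloWorld') in SQL Server
        byte[] digest = MessageDigest.getInstance("MD5")
                .digest("HelloWorld".getBytes(StandardCharsets.UTF_8));
        StringBuilder sb = new StringBuilder();
        for (byte b : digest) sb.append(String.format("%02x", b & 0xff));
        System.out.println(sb);  // exactly 32 hex characters
    }
}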
SQL databases seem to be the cornerstone of most software. However, they seem optimized for textual data. In fact, when doing any queries involving numerical data, integers specifically, it seems inefficient that the numbers get converted to text and then back to native formats both ways between the application and the database. The same inefficiency seems to apply to BLOB data as well. My understanding is that even with something like LINQ to SQL, this two-way conversion is occurring in the background.
Are there general ways to bypass this overhead with SQL? Are there certain database management systems that handle this more efficiently than others (ie, with non-standard extensions/API's)?
Clarification: in the following select statement, the list of numbers after IN could more easily be passed as a raw array of ints, but there seems to be no way of achieving that level of optimization.
SELECT foo FROM bar WHERE baz IN (23, 34, 45, 9854004, ...)
Don't suppose. Measure.
Format conversion is not likely to be a measurable cost for database work, unless you are misusing the database as an arithmetic engine.
The IO cost for LOBs, especially for CLOBs with character conversion, can become significant; the remedy here, once you know that the simplest thing that might work actually has a noticeable performance impact, is to minimize the number of times you copy the LOB data. Use whatever SQL parameter binding style allows you to transfer the data directly between its point of creation or use and the database -- often this is binding the LOB to a stream or I/O channel.
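For example, with JDBC that means binding a stream instead of materializing the LOB in memory (the table and column names here are made up for illustration):

import java.io.InputStream;
import java.sql.Connection;
import java.sql.PreparedStatement;

class BlobStore {
    // The driver reads the stream directly; the bytes are never copied
    // through an intermediate String or byte[].
    static void storeBlob(Connection conn, InputStream in, long length) throws Exception {
        try (PreparedStatement ps =
                 conn.prepareStatement("INSERT INTO docs (body) VALUES (?)")) {
            ps.setBinaryStream(1, in, length);
            ps.executeUpdate();
        }
    }
}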
But don't do this until you have a way to measure the impact, and have measurements showing that this is your bottleneck.
Numerical data in a database is not stored as text. The details depend on the database, but it certainly doesn't have to be stored as text, and in practice it isn't.
BLOBs are stored exactly as you set them -- by definition, the DB has no way to interpret the information -- though I guess it could compress them if it found that useful. BLOBs are not translated into text.
Here's how Oracle stores numbers:
http://download.oracle.com/docs/cd/B28359_01/server.111/b28318/datatype.htm#i16209
Internal Numeric Format
Oracle Database stores numeric data in variable-length format. Each value is stored in scientific notation, with 1 byte used to store the exponent and up to 20 bytes to store the mantissa. The resulting value is limited to 38 digits of precision. Oracle Database does not store leading and trailing zeros. For example, the number 412 is stored in a format similar to 4.12 x 10^2, with 1 byte used to store the exponent (2) and 2 bytes used to store the three significant digits of the mantissa (4, 1, 2). Negative numbers include the sign in their length.
MySQL info here:
http://dev.mysql.com/doc/refman/5.0/en/numeric-types.html
Look at the table -- a TINYINT is represented in 1 byte (range -128 to 127), which would not be possible if it were stored as text.
EDIT: With the clarification -- I would say use the API in your language that looks something like this (here in Java with JDBC; note that JDBC parameters are 1-indexed):
PreparedStatement stmt = conn.prepareStatement("SELECT foo FROM bar WHERE baz IN (?, ?, ?)");
stmt.setInt(1, x);
stmt.setInt(2, y);
stmt.setInt(3, z);
ResultSet rs = stmt.executeQuery();
I don't believe that the underlying protocols use text for the transport of parameters.