Hive: hash function unique values

Following on from the discussion below, can someone confirm whether the hash function would always produce a unique integer value, say for millions of account numbers, with no value ever repeating?
Hive hash function resulting in 0, null and 1, why?

A hash function should not be assumed to be a unique ID/key generator; there is always a chance of collision (two inputs producing the same hash). See this link for a detailed explanation: http://preshing.com/20110504/hash-collision-probabilities/
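To put numbers on that, you can apply the standard birthday-problem approximation. A minimal Ruby sketch, assuming Hive's hash() maps into 32-bit int space (so d = 2**32):

```ruby
# Birthday-problem estimate: probability of at least one collision when
# hashing n values into a space of d possible outputs.
# P(collision) ~ 1 - exp(-n**2 / (2*d))
def collision_probability(n, d)
  1 - Math.exp(-(n.to_f**2) / (2 * d))
end

# With millions of account numbers, a collision is near-certain:
collision_probability(1_000_000, 2**32)  # effectively 1.0
# Even at 10,000 values the chance is already over 1%:
collision_probability(10_000, 2**32)     # roughly 0.0116
```

So for millions of account numbers, duplicates are not just possible but practically guaranteed.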


Is it possible in Redis to fetch a collection of ids in one query?

I store a list of article ids in Redis and use LRANGE to fetch that id list. I then want to fetch each article itself from Redis (hot articles are also cached there). Is there a way to query by a list of article ids, rather than doing this:
for (long id : ids) {
    redis.get(id);
}
which accesses Redis n times? Fetching all articles in one round trip would be best, something like:
redis.get(ids)
so Redis is accessed only once.
You can use MGET to get the values for a list of keys; it returns null for any key that is not found.
For example
MGET article_1 article_2 article_3 article_4 will return 4 articles in the same order
Output: Article1, Article2, Article3, Article4
For example, if article_3 is missing, it will still return 4 items, but the 3rd one will be null.
Output: Article1, Article2, null, Article4
MGET key [key ...] Available since 1.0.0.
Time complexity: O(N) where N is the number of keys to retrieve.
Returns the values of all specified keys. For every key that does not
hold a string value or does not exist, the special value nil is
returned. Because of this, the operation never fails.
https://redis.io/commands/mget
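In Ruby, with the redis-rb gem, the loop collapses to a single MGET call. The article_<id> key naming below is just the convention from the example above, and the gem/connection details are assumptions:

```ruby
ids = [1, 2, 3, 4]
keys = ids.map { |id| "article_#{id}" }
# => ["article_1", "article_2", "article_3", "article_4"]

# With a live connection (requires the 'redis' gem and a running server),
# this is one round trip; missing keys come back as nil:
#   require 'redis'
#   articles = Redis.new.mget(*keys)
```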

Is there a Postgres feature or built-in function that limits the display of uuids to only that needed to make them uniquely identifiable?

It would have to return the portion necessary to uniquely identify the row even if a select statement didn't return all rows, of course, to be of any use. And I'm not sure how it would work if the uuid column were not part of a pk/index and was repeated.
Does this exist?
I think you would have to decide what constitutes uniquely identifiable by assuming that a number of places from the right make it uniquely identifiable. I think this is folly but the way you would do that is something like this:
SELECT RIGHT(uuid_column_name::text, 7) as your_truncated_uuid FROM table_with_uuid_column;
That takes the 7 places from the right of the text value of the uuid column.
No, there is not. A UUID (at least the v4 variant) is a hex representation of a 128-bit value, 122 bits of which are random. It's not even guaranteed to be unique, though it very likely is.
You have a few options to implement this:
shave off characters and hope you don't introduce a collision. For instance, if you shorten d8366842-8c1d-4a31-a4c0-f1765b8ab108 to d8366842, you have 16**8 possible combinations, or 4,294,967,296. How likely is your dataset to have a collision with 4.2 billion (2**32) possibilities? Perhaps you can add 8c1d back in to make it 16**12, or 281,474,976,710,656 possibilities.
process each row looking for collisions, recursively increasing the window of characters until no collisions are found, or hash every possible permutation.
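The second option can be sketched in a few lines of Ruby: grow the suffix length until every truncated value in the set is distinct (shortest_unique_suffix is a hypothetical helper name, not an existing function):

```ruby
require 'securerandom'

# Grow the suffix window until all truncated values are distinct.
def shortest_unique_suffix(values)
  (1..values.map(&:length).max).each do |len|
    return len if values.map { |v| v[-len, len] }.uniq.length == values.length
  end
  nil  # only reachable if the set contains exact duplicates
end

uuids = Array.new(1_000) { SecureRandom.uuid }
shortest_unique_suffix(uuids)  # typically 6 or 7 hex characters for 1,000 UUIDs
```

Note that the answer is only valid for the rows you have right now; a new row can force the window to grow.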
That all said, another idea is to use ints and not uuids and then to use http://hashids.org/ which has a plugin for PostgreSQL. This is the method YouTube uses afaik.

Checksum() for minus

Why does the checksum() function return 0 for the minus sign?
select checksum('-') /* 0 */
select checksum('---') /* 0 */
select checksum('-+-') /* 67 */
select checksum('+') /* 67 */
From Wikipedia, Hash Functions (http://en.wikipedia.org/wiki/Hash_function):
A hash function is any function that can be used to map digital data of arbitrary size to digital data of fixed size. The values returned by a hash function are called hash values, hash codes, hash sums, or simply hashes.
Under this definition, you could have a function that returns zero for any input, but it would be the worst possible hash. So if you are asking why they didn't choose a better algorithm, I'm not sure you are asking the right people; you may want to ask Microsoft. We can infer from a comment made on MSDN that at least someone at Microsoft is aware that CHECKSUM is not an ideal hash function: there, they recommend HASHBYTES if you need a particularly good hash.
From MSDN, CHECKSUM(Transact-SQL): (https://msdn.microsoft.com/en-us/library/ms189788.aspx)
CHECKSUM satisfies the properties of a hash function: CHECKSUM applied over any two lists of expressions returns the same value if the corresponding elements of the two lists have the same type and are equal when compared using the equals (=) operator. For this definition, null values of a specified type are considered to compare as equal. If one of the values in the expression list changes, the CHECKSUM of the list also generally changes. However, there is a small chance that the CHECKSUM will not change. For this reason, we do not recommend using CHECKSUM to detect whether values have changed, unless your application can tolerate occasionally missing a change. Consider using HashBytes instead. When an MD5 hash algorithm is specified, the probability of HashBytes returning the same result for two different inputs is much lower than that of CHECKSUM.
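The weakness is easy to demonstrate with any simple checksum. The toy byte-sum below is illustrative only (it is not SQL Server's actual CHECKSUM algorithm), but it shows how effortlessly such functions collide:

```ruby
# A deliberately weak checksum: sum of byte values, mod 2**16.
def toy_checksum(s)
  s.bytes.sum % 65_536
end

toy_checksum("ab") == toy_checksum("ba")  # => true: reordering bytes goes unnoticed
```

Any checksum that folds arbitrary input into a small fixed range must collide somewhere; the only question is how easy the collisions are to hit.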

How to truncate column in order to create indexes?

I have the following table in postgresql:
database=# \d dic
Table "public.dic"
Column | Type | Modifiers
-------------+-------------------------+-----------
id | bigint |
stringvalue | character varying(2712) |
database=# create index idStringvalue on dic(id,stringvalue);
ERROR: index row size 2728 exceeds maximum 2712 for index "idstringvalue"
HINT: Values larger than 1/3 of a buffer page cannot be indexed.
Consider a function index of an MD5 hash of the value, or use full text indexing.
I don't know why this error occurs when the size of stringvalue is 2712.
I want to truncate all the stringvalues in dic which cause the above error, but I am not sure how to do so. Can someone please help me with this?
I am even fine with deleting the rows which cause this error. Is there some way I can do that?
Your column probably contains multibyte data: the varchar(2712) deals with that just fine, but it makes sense that the indexing code computes the byte length of the string, since memory use is what the index size limit is worried about.
Theoretically, you can't go wrong by dividing the limit by four, i.e. use an unbounded varchar for the column, and index the first 600 characters or so, e.g.:
create index on dic((left(stringvalue, 600)));
This does raise the question of whether you actually need to index anything this large, though, since the value of doing so primarily lies in sorting. Postgres (correctly) suggests that you use an md5 of the value (if you're only interested in strict equality) or full text search (if you're interested in fuzzy matching).
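Following that hint with the table from the question, the md5 approach looks roughly like this (the index name is arbitrary); note that lookups must then go through the same md5() expression for the index to be used:

```sql
-- Functional index on the hash instead of the raw 2712-char value:
CREATE INDEX dic_id_md5_idx ON dic (id, md5(stringvalue));

-- Equality lookups must use the same expression:
SELECT * FROM dic WHERE id = 42 AND md5(stringvalue) = md5('some value');
```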

ROR - Generate an alpha-numeric string for a DB ID

In our DB, every Person has an ID, which is the DB generated, auto-incremented integer. Now, we want to generate a more user-friendly alpha-numeric ID which can be publicly exposed. Something like the Passport number. We obviously don't want to expose the DB ID to the users. For the purpose of this question, I will call what we need to generate, the UID.
Note: The UID is not meant to replace the DB ID. You can think of the UID as a prettier version of the DB ID, which we can give out to the users.
I was wondering if this UID can be a function of the DB ID. That is, we should be able to re-generate the same UID for a given DB ID.
Obviously, the function will take a 'salt' or key, in addition to the DB ID.
The UID should not be sequential. That is, two neighboring DB IDs should generate visually different-looking UIDs.
It is not strictly required for the UID to be irreversible. That is, it is okay if somebody studies the UID for a few days and is able to reverse-engineer and find the DB ID. I don't think it will do us any harm.
The UID should contain only A-Z (uppercase only) and 0-9. Nothing else. And it should not contain characters which can be confused with other alphabets or digits, like 0 and O, l and 1 and so on. I guess Crockford's Base32 encoding takes care of this.
The UID should be of a fixed length (10 characters), regardless of the size of the DB ID. We could pad the UID with some constant string, to bring it to the required fixed length. The DB ID could grow to any size. So, the algorithm should not have any such input limitations.
I think the way to go about this is:
Step 1: Hashing.
I have read about the following hash functions:
SHA-1
MD5
Jenkin's
The hash returns a long string. I read here about something called XOR folding to bring the string down to a shorter length. But I couldn't find much info about that.
Step 2: Encoding.
I read about the following encoding methods:
Crockford Base 32 Encoding
Z-Base32
Base36
I am guessing that the output of the encoding will be the UID string that I am looking for.
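Steps 1 and 2 can be sketched in Ruby roughly like this. The salt handling, the 30-character alphabet (0/O, 1/I/L and U dropped, in the spirit of Crockford's Base32), and the fixed length of 10 are all assumptions for illustration, not a definitive implementation:

```ruby
require 'digest'

# 30 symbols, with look-alike characters (0/O, 1/I/L, U) removed.
ALPHABET = "23456789ABCDEFGHJKMNPQRSTVWXYZ".chars.freeze

# Step 1: hash the DB id together with a salt.
# Step 2: re-base the digest into the restricted alphabet, keeping a
# fixed number of symbols (this is the "folding" down to 10 chars).
def uid_for(db_id, salt, length = 10)
  n = Digest::SHA1.hexdigest("#{salt}:#{db_id}").to_i(16)
  length.times.map { c = ALPHABET[n % ALPHABET.size]; n /= ALPHABET.size; c }.join
end

uid_for(123, "my-secret-salt")  # deterministic, 10 chars, visually non-sequential
```

Because the output is a deterministic function of (salt, DB ID), you can always regenerate it, but collisions remain possible and must still be checked as described in Step 3.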
Step 3: Working around collisions.
To work around collisions, I was wondering if I could generate a random key at the time of UID generation and use this random key in the function.
I can store this random key in a column, so that we know what key was used to generate that particular UID.
Before inserting a newly generated UID into the table, I would check for uniqueness and if the check fails, I can generate a new random key and use it to generate a new UID. This step can be repeated till a unique UID is found for a particular DB ID.
I would love to get some expert advice on whether I am going along the correct lines and how I go about actually implementing this.
I am going to be implementing this in a Ruby On Rails app. So, please take that into consideration in your suggestions.
Thanks.
Update
The comments and answer made me re-think and question one of the requirements I had: the need for us to be able to regenerate the UID for a user after assigning it once. I guess I was just trying to be safe, in the case where we lose a user's UID, so that we would be able to get it back if it is a function of an existing property of the user. But we can get around that problem just by using backups, I guess.
So, if I remove that requirement, the UID then essentially becomes a totally random 10 character alphanumeric string. I am adding an answer containing my proposed plan of implementation. If somebody else comes with a better plan, I'll mark that as the answer.
As I mentioned in the update to the question, I think what we are going to do is:
Pre-generate a sufficiently large number of random and unique ten character alphanumeric strings. No hashing or encoding.
Store them in a table in a random order.
When creating a user, pick the first of these strings and assign it to the user.
Delete this picked ID from the pool of IDs after assigning it to a user.
When the pool reduces to a low number, replenish the pool with new strings, with uniqueness checks, obviously. This can be done in a Delayed Job, initiated by an observer.
The reason for pre-generating is that we are offloading all the expensive uniqueness checking to a one-time pre-generation operation.
When picking an ID from this pool for a new user, uniqueness is guaranteed. So, the operation of creating user (which is very frequent) becomes fast.
Would converting each digit of the DB ID to a character work for you? It takes the integer and generates a character string from it, and you could then append the user's initials or last name or whatever. Example:
user = { :id => 123456, :f_name => "Scott", :l_name => "Shea" }
(user[:id].to_s.split(//).map { |x| (x.to_i + 64).chr }).join.downcase + user[:l_name].downcase
#=> "abcdefshea"