In my application, I need to compare vcards and figure out if there are changes between them.
I don't want to keep a whole vcard in memory, since it could be quite large. What I would like to do is keep a hashed value of the vcard, but the hash has to be precise enough that it does not repeat itself for very close/similar vcards (e.g. ones that differ by a single character).
I have tried Objective-C's hash method for strings, but that doesn't work well, as it produces duplicates for similar vcards.
I am now thinking of hashing the vcards with SHA-256 and then comparing the digests (similar to how one would do password comparison).
Would that be a good idea? Any other ideas for how I can save a smaller version of a big string and then compare it with another one for changes?
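Roughly what I have in mind, sketched in Python with hashlib just to show the idea (the vCard strings below are made up):

import hashlib

def vcard_fingerprint(vcard_text):
    # Return a 32-byte SHA-256 digest of the vCard's UTF-8 text.
    return hashlib.sha256(vcard_text.encode("utf-8")).digest()

old_vcard = "BEGIN:VCARD\nVERSION:3.0\nFN:Jane Doe\nEND:VCARD"
new_vcard = "BEGIN:VCARD\nVERSION:3.0\nFN:Jane Roe\nEND:VCARD"

old_fp = vcard_fingerprint(old_vcard)   # store only these 32 bytes, not the whole vCard

# Later, when a fresh copy arrives, compare fingerprints instead of full text.
if vcard_fingerprint(new_vcard) != old_fp:
    print("vCard changed")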
While taking the course 6.006 Introduction to Algorithms, provided by MIT OCW, I came across the Rabin-Karp algorithm. The image below is from that course.
Can anyone help me understand why the first rs() == rt() check is required? If it's used, then shouldn't we also first check by brute force whether the strings are equal and then move ahead? Why are we not considering equality of the strings when the hashing is done from t[0] and then trying to find the other string matches?
In the image, rs() returns the hash value, and rs.skip[arg] removes the first character of the string, assuming that character is 'arg'.
Can anyone help me understand why the first rs() == rt() check is required?
I assume you mean the one right before the range loop. If the strings have the same length, the range loop will not run at all (empty range), so the check is necessary to cover that case. More generally, it handles a potential match at offset 0, which the loop body never checks.
If it's used, then shouldn't we also first check by brute force whether the strings are equal and then move ahead?
Not sure what you mean here. The posted code leaves a gap (the ...) for what happens after matching hashes are found. Let's not forget that at that point we must compare the actual strings to confirm we really found a match. And it's up to the (not shown) implementation whether to keep searching to the end or stop.
Why are we not considering equality of the strings when the hashing is done from t[0] and then trying to find the other string matches?
I really don't get this part. Note that the first two loops populate the rolling hashes for the two input strings. Then comes the check for a match at that starting point, and then the loop that slides the rolling hash of t forward one character at a time and compares it against the hash of s. The entire t is checked, from start to end.
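To make the flow concrete, here is a stripped-down sketch in Python. It is not the course's code: a simple modular polynomial hash stands in for the rs/rt objects and the skip/append steps are inlined, but the structure (two priming loops, the check before the range loop, confirm-on-hash-match) is the same.

def rabin_karp(s, t):
    # Return the first index where s occurs in t, or -1.
    n, m = len(s), len(t)
    if n > m:
        return -1
    base, mod = 256, 1_000_003            # small prime modulus, fine for a demo
    top = pow(base, n - 1, mod)           # weight of the character leaving the window

    rs = rt = 0
    for i in range(n):                    # the first two loops: prime both rolling hashes
        rs = (rs * base + ord(s[i])) % mod
        rt = (rt * base + ord(t[i])) % mod

    if rs == rt and t[:n] == s:           # the check before the range loop: match at offset 0
        return 0

    for i in range(n, m):                 # slide the window across the rest of t
        rt = (rt - ord(t[i - n]) * top) % mod       # "skip" the outgoing character
        rt = (rt * base + ord(t[i])) % mod          # append the incoming character
        if rs == rt and t[i - n + 1:i + 1] == s:    # hashes match -> confirm with a real comparison
            return i - n + 1
    return -1

print(rabin_karp("abc", "zzabczz"))   # 2
print(rabin_karp("abc", "abc"))       # 0 -- only the pre-loop check can report this one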
To be of any use, it would of course have to return the portion necessary to uniquely identify the row even if a SELECT statement didn't return all rows. And I'm not sure how it would work if the uuid column were not part of a PK/index and contained repeated values.
Does this exist?
I think you would have to decide what counts as uniquely identifiable by assuming that some number of characters from the right makes it so. I think this is folly, but the way you would do it is something like this:
SELECT RIGHT(uuid_column_name::text, 7) as your_truncated_uuid FROM table_with_uuid_column;
That takes the 7 rightmost characters of the text value of the uuid column.
No, there is not. A UUID is the hex representation of a 128-bit number, and in the v4 variant 122 of those bits are random. It's not even guaranteed to be unique, though it very likely is.
You have a few options to implement this:
shave off characters and hope you don't introduce a collision. For instance, if you shorten d8366842-8c1d-4a31-a4c0-f1765b8ab108 to d8366842, you have 16**8 possible combinations, or 4,294,967,296. How likely is your dataset to have a collision with 4.2 billion (2**32) possibilities? (Keep the birthday paradox in mind: collisions become likely once you have on the order of 2**16, roughly 65,000, rows.) Perhaps you can add 8c1d back in to make it 16**12, or 281,474,976,710,656, possibilities.
process and hash each row looking for collisions, iteratively widening the window of characters until no collisions are found, or check every possible truncation up front. A quick sketch of that check follows below.
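If you go the shave-off-characters route, the collision check is cheap to script up front. A rough sketch in Python, assuming the UUIDs have already been pulled out of the table as strings (in practice you would feed it the result of a SELECT on your uuid column):

import uuid

def shortest_unique_suffix(uuids, start=7):
    # Find the smallest suffix length (counted from the right, hyphens removed)
    # that is collision-free across all supplied values.
    for length in range(start, 33):                    # a hyphen-less UUID has 32 hex chars
        suffixes = {u.replace("-", "")[-length:] for u in uuids}
        if len(suffixes) == len(uuids):                # no collisions at this length
            return length
    return None

rows = [str(uuid.uuid4()) for _ in range(100_000)]     # stand-in for your table's uuid column
print(shortest_unique_suffix(rows))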
That all said, another idea is to use ints rather than uuids and then use http://hashids.org/, which has a plugin for PostgreSQL. This is the method YouTube uses, AFAIK.
A pretty simple question: which version of CityHash is hidden behind the HASH function of BigQuery? Is it always the latest (today v1.1), or rather a fixed version?
Now, a little bit of background. I plan on relying heavily on BigQuery to store large sets of data. As a first step, I would like to compute a hash value from those data and store it (something like hashed_value = HASH(CONCAT(column_0, column_1))). So far so good.
As a second step, I would like to retrieve rows with a given hash value with a query such as SELECT something FROM [mytable] WHERE hashed_value = HASH(CONCAT('12345', 'foobar')).
My concern is that the CityHash webpage specifies that those functions are not guaranteed to be backward compatible. So if BigQuery always relies on the latest version of CityHash, I will not be able to retrieve my data based on the hash value of computed columns after the next CityHash update, and for my application the large database will essentially become useless.
If so, would it be possible to give access to a fixed (or backward-compatible) hash function, in addition to HASH? One from the SHA or MD families, for example, or even a fixed version of CityHash.
Thank you.
The CityHash used in BigQuery is the version from
http://code.google.com/p/cityhash/
Looking at the history, it seems like the value can change over time. This might be a good question for:
https://groups.google.com/forum/?fromgroups#!forum/cityhash-discuss
BigQuery should support a consistent hash. We do have support for sha1, but right now the result is unusable because of encoding issues. You can, however, do SELECT TO_BASE64(SHA1(CONCAT('12345', 'foobar')))
Note that we will likely change SHA1 in the near future to automatically base64 encode the results. I've filed an internal bug to make this change.
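For what it's worth, that value can also be computed client-side, which lets you build the lookup key outside BigQuery. A small Python check of the same expression, assuming the strings are hashed as UTF-8 bytes and the result is standard Base64:

import base64
import hashlib

def lookup_key(*columns):
    # base64(sha1(concatenation)) -- intended to mirror TO_BASE64(SHA1(CONCAT(...)))
    digest = hashlib.sha1("".join(columns).encode("utf-8")).digest()
    return base64.b64encode(digest).decode("ascii")

print(lookup_key("12345", "foobar"))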
In our DB, every Person has an ID, which is a DB-generated, auto-incremented integer. Now we want to generate a more user-friendly alphanumeric ID that can be exposed publicly, something like a passport number. We obviously don't want to expose the DB ID to the users. For the purpose of this question, I will call what we need to generate the UID.
Note: The UID is not meant to replace the DB ID. You can think of the UID as a prettier version of the DB ID, which we can give out to the users.
I was wondering if this UID can be a function of the DB ID. That is, we should be able to re-generate the same UID for a given DB ID.
Obviously, the function will take a 'salt' or key, in addition to the DB ID.
The UID should not be sequential. That is, two neighboring DB IDs should generate visually different-looking UIDs.
It is not strictly required for the UID to be irreversible. That is, it is okay if somebody studies the UID for a few days and is able to reverse-engineer and find the DB ID. I don't think it will do us any harm.
The UID should contain only A-Z (uppercase) and 0-9, nothing else. It should also not contain characters that can easily be confused with one another, like 0 and O, or l and 1. I guess Crockford's Base32 encoding takes care of this.
The UID should have a fixed length (10 characters), regardless of the size of the DB ID. We could pad the UID with some constant string to bring it up to the required length. The DB ID could grow to any size, so the algorithm should not have any input-size limitations.
I think the way to go about this is:
Step 1: Hashing.
I have read about the following hash functions:
SHA-1
MD5
Jenkins
The hash returns a long string. I read here about something called XOR folding to bring the string down to a shorter length, but I couldn't find much info about it.
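From what I could piece together, XOR folding just means XOR-ing fixed-size chunks of the digest together until it fits the length you want. A rough Python illustration of my understanding (the 8-byte target is arbitrary):

import hashlib

def xor_fold(digest, target_len):
    # XOR consecutive target_len-byte chunks of the digest into one short value.
    out = bytearray(target_len)
    for i, b in enumerate(digest):
        out[i % target_len] ^= b
    return bytes(out)

full = hashlib.sha1(b"some-db-id-plus-salt").digest()   # 20 bytes
print(xor_fold(full, 8).hex())                          # folded down to 8 bytes / 16 hex chars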
Step 2: Encoding.
I read about the following encoding methods:
Crockford Base 32 Encoding
Z-Base32
Base36
I am guessing that the output of the encoding will be the UID string that I am looking for.
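To make Step 2 concrete, here is my rough sketch of Crockford Base32 in Python (the alphabet drops I, L, O and U; the padding to 10 characters comes from my fixed-length requirement above):

CROCKFORD = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"   # no I, L, O or U

def crockford_encode(number, width=10):
    # Encode a non-negative integer in Crockford Base32, left-padded with '0' to width.
    if number == 0:
        return "0" * width
    chars = []
    while number > 0:
        number, rem = divmod(number, 32)
        chars.append(CROCKFORD[rem])
    return "".join(reversed(chars)).rjust(width, "0")

print(crockford_encode(0xDEADBEEF42))   # a 40-bit value fits easily in 10 characters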
Step 3: Working around collisions.
To work around collisions, I was wondering if I could generate a random key at the time of UID generation and use this random key in the function.
I can store this random key in a column, so that we know what key was used to generate that particular UID.
Before inserting a newly generated UID into the table, I would check for uniqueness; if the check fails, I would generate a new random key and use it to generate a new UID. This step can be repeated until a unique UID is found for a particular DB ID.
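Putting the three steps together, this is roughly the shape I have in mind (a Python sketch for clarity; in the Rails app the 'taken' check would be a query against the users table, and all the names here are made up):

import hashlib
import os

ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"   # Crockford Base32, as in Step 2

def candidate_uid(db_id, salt, length=10):
    # Step 1 + 2: hash (db_id + salt), then encode 50 bits of the digest as 10 Base32 chars.
    digest = hashlib.sha1(str(db_id).encode() + salt).digest()
    value = int.from_bytes(digest[:7], "big")    # 56 bits; the top 6 are simply dropped
    chars = []
    for _ in range(length):
        value, rem = divmod(value, 32)
        chars.append(ALPHABET[rem])
    return "".join(reversed(chars))

def generate_uid(db_id, taken):
    # Step 3: retry with a fresh random salt until the UID is unused.
    while True:
        salt = os.urandom(8)                     # store this salt next to the UID
        uid = candidate_uid(db_id, salt)
        if uid not in taken:
            return uid, salt

print(generate_uid(42, set()))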
I would love to get some expert advice on whether I am going along the correct lines and how I go about actually implementing this.
I am going to be implementing this in a Ruby On Rails app. So, please take that into consideration in your suggestions.
Thanks.
Update
The comments and answer made me re-think and question one of my requirements: the need to be able to regenerate a user's UID after it has been assigned once. I guess I was just trying to be safe, in case we lose a user's UID; if it is a function of an existing property of the user, we would be able to get it back. But we can get around that problem just by using backups, I guess.
So, if I remove that requirement, the UID then essentially becomes a totally random 10 character alphanumeric string. I am adding an answer containing my proposed plan of implementation. If somebody else comes with a better plan, I'll mark that as the answer.
As I mentioned in the update to the question, I think what we are going to do is:
Pre-generate a sufficiently large number of random and unique ten character alphanumeric strings. No hashing or encoding.
Store them in a table in a random order.
When creating a user, pick the first of these strings and assign it to the user.
Delete this picked ID from the pool of IDs after assigning it to a user.
When the pool reduces to a low number, replenish the pool with new strings, with uniqueness checks, obviously. This can be done in a Delayed Job, initiated by an observer.
The reason for pre-generating is that we are offloading all the expensive uniqueness checking to a one-time pre-generation operation.
When picking an ID from this pool for a new user, uniqueness is guaranteed. So, the operation of creating user (which is very frequent) becomes fast.
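A sketch of the pre-generation step, in Python just to show the shape (in Rails, SecureRandom plus a uniqueness check against the table would do the same job):

import secrets

ALPHABET = "0123456789ABCDEFGHJKMNPQRSTVWXYZ"   # A-Z and 0-9 minus the confusable characters

def replenish_pool(existing, batch=10_000, length=10):
    # Generate `batch` random 10-character IDs that collide neither with each other
    # nor with IDs already handed out or still sitting in the pool.
    fresh = []
    while len(fresh) < batch:
        uid = "".join(secrets.choice(ALPHABET) for _ in range(length))
        if uid not in existing:
            existing.add(uid)
            fresh.append(uid)
    return fresh

pool = replenish_pool(set())
print(len(pool), pool[:3])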
Would db_id.chr work for you? It would take the integers and generate a character string from them. You could then append their initials or last name or whatever to it. Example:
user = { :id => 123456, :f_name => "Scott", :l_name => "Shea" }
(user[:id].to_s.split(//).map { |x| (x.to_i + 64).chr }).join.downcase + user[:l_name].downcase
#=> "abcdefshea"
For example, the commit list on GitHub shows only the first 10 characters of each hash, and this line from tornadoweb uses only 5:
return static_url_prefix + path + "?v=" + hashes[abs_path][:5]
Are the first 5 chars enough to make sure that the hashes of 2 different files won't collide?
Later edit: the tornadoweb example above uses an MD5 hash to generate a query string for static file caching.
In general, No.
In fact, even if a full MD5 hash were given, it wouldn't be enough to prevent malicious users from generating collisions: MD5 is broken. Even with a better hash function, five characters are not enough.
But sometimes you can get away with it.
I'm not sure exactly what the context of your specific example is. However, to answer your more general question: if there aren't bad guys actively trying to cause collisions, then using part of the hash is probably okay. In particular, given 5 hex characters (20 bits), you won't expect collisions before around 2^(20/2) = 2^10, i.e. about one thousand, values have been hashed. This is a consequence of the birthday paradox.
The previous paragraph assumes the hash function is essentially random. That is not an assumption anyone building a cryptographically secure system should make, but as long as no one is intentionally trying to create collisions, it's a reasonable heuristic.
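You can watch the ~2^10 estimate play out empirically. A quick Python simulation that hashes random values and stops at the first 5-hex-character prefix collision:

import hashlib
import os

def values_until_prefix_collision(prefix_len=5):
    # Hash random inputs until two share the same hex prefix; return how many it took.
    seen = set()
    count = 0
    while True:
        count += 1
        prefix = hashlib.md5(os.urandom(16)).hexdigest()[:prefix_len]
        if prefix in seen:
            return count
        seen.add(prefix)

trials = [values_until_prefix_collision() for _ in range(20)]
print(sum(trials) / len(trials))   # typically on the order of a thousand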