The below image is from : 6.006-Introduction to algorithms,
While doing the course 6.006-Introduction to algorithms, provided by MIT OCW, I came across the Rabin-Karp algorithm.
Can anyone help me understand as to why the first rs()==rt() required? If it’s used, then shouldn’t we also check first by brute force whether the strings are equal and then move ahead? Why is it that we are not considering equality of strings when hashing is done from t[0] and then trying to find other string matches?
In the image, rs() is for hash value, and rs.skip[arg] is to remove the first character of that string assuming it is ‘arg’
Can anyone help me understand as to why the first rs()==rt() required?
I assume you mean the one right before the range-loop. If the strings have the same length, then the range-loop will not run (empty range). The check is necessary to cover that case.
If it’s used, then shouldn’t we also check first by brute force whether the strings are equal and then move ahead?
Not sure what you mean here. The posted code leaves blank (with ...) after matching hashes are found. Let's not forget that at that point, we must compare strings to confirm we really found a match. And, it's up to the (not shown) implementation to continue searching until the end or not.
Why is it that we are not considering equality of strings when hashing is done from t[0] and then trying to find other string matches?
I really don't get this part. Note the first two loops are to populate the rolling hashes for the input strings. Then comes a check if we have a match at this point, and then the loop updating the rolling hashes pair wise, and then comparing them. The entire t is checked, from start to end.
Related
Could you explain how to convert from lz77 to huffman on the example in the below picture?
Easy:
In the first step your output is essentially 3 numbers:
prev index
number of characters to repeat
next character (be it ascii or unicode)
The algorithm demands that you specify a sliding window up front. That means you know how big (1) and (2) can be at most.
In other words, you know how many bits (1) and (2) will take up.
Since (3) is essentially also a character from a fixed length alphabet, you also know the bit-length of (3)
That means it's safe to simply concatenate them.
So, the output of the first algorithm can be thought of as outputting a bit-sequence, where every item in the sequence has a fixed length.
That's ideal for applying huffman.
Of course the specifics are not mentioned, and you can choose from a lot of options.
normalized huffman table
1 on left-branch vs 0 on left-branch
priorities when merging items of similar count
etc
So I can not readily explain the exact output values you are showing.
But I hope I can at least explain how to get from A to B.
You can't. The coding shown is, well, figurative. Not literal. The symbols A, B, and C are all coded to the single bit 0. Obviously that's not going to be very helpful on the decoding end.
It would have to return the portion necessary to uniquely identify the row even if a select statement didn't return all rows, of course, to be of any use. And I'm not sure how it would work if the uuid column were not part of a pk/index and was repeated.
Does this exist?
I think you would have to decide what constitutes uniquely identifiable by assuming that a number of places from the right make it uniquely identifiable. I think this is folly but the way you would do that is something like this:
SELECT RIGHT(uuid_column_name::text, 7) as your_truncated_uuid FROM table_with_uuid_column;
That takes the 7 places from the right of the text value of the uuid column.
No, there is not. A UUID is a hex representation of a 120 bit random number, at least the v4 variant. It's not even guaranteed to be unique though it likely is.
You have a few options to implement this:
shave off characters and hope you don't introduce a collision. For instance, if you make d8366842-8c1d-4a31-a4c0-f1765b8ab108 d8366842, you have 16**8 possible combinations, or 4,294,967,296. how likely is your dataset to have a collision with 4.2 billion (2**32) possibilities? Perhaps you can add 8c1d back in to make it 16**12 or 28,147,497,6710,656 possibilities.
process and hash each row looking for collisions and recursively increase the frame of characters until no collisions are found, or hash every possible permutation.
That all said, another idea is to use ints and not uuids and then to use http://hashids.org/ which has a plugin for PostgreSQL. This is the method YouTube uses afaik.
No keyboard patterns. i.e. keys that are adjacent vertically or horizontally on a keyboard. For example, 'ZXCVBN123' should be rejected.
No commonly used words and no words written backwards or disguised with special characters. For example 'Universe1' and 'Un1ver$e' should be rejected.
Well, first you need to define exactly what you want. What are keyboard patterns? Is 'jk' a keyboard pattern, or just 'jkl'? What's the shortest pattern there is? Is 'gy' a pattern? First you need to define what a pattern really.
Then you should make a list of all the available patterns (There aren't all that many. You have 36 starting points and 4 directions to go from each starting point). When you get a password, try to locate each of the patterns in it. Note that if you decide the shortest pattern is 3 letters long, you don't need to search for 4-letter patterns, all 4-letter patterns already contain 3-letter patterns.
As for words, that's easier, but first you need to make a list of all disallowed transformations ($->S, 1->i, etc...). Once you get a word, apply all the transformations and get yourself a 'normalized' word. Compare the normalized password against a dictionary of all legal words twice - the second time reverse the password.
You will probably need to do something a little more complicated than that, because you need to ignore numbers at the end of the word - sometimes. 1ncredible can be a substitute for 'incredible', although ncredible is not a word.
If you inspect the code of http://howsecureismypassword.net you can see that the password is compared to a large array of usual passwords.
On the page threre is a reference to the page http://xato.net/passwords/more-top-worst-passwords/ which lists the top 10.000 most common passwords.
One approach would be to download that list and check the users passwords against it or at least some top 100 of them.
In my application, I need to compare vcards and figure out if there are changes between them.
I don't want to keep in memory a whole vcard, since this could be quite massive. What I would like to do is to keep a hashed value of that vcard, but the hash value has to be very precise as to not repeat itself in cases of very close/similar vcards (e.g. where differences are by a character).
I have tried Objective-C's hash method for strings, but that doesn't really work well, as it will create duplicates in case of similar vcards.
I am now thinking of using SHA256 to encrypt the vcards and then compare the SHA's (similar to how one would do password comparison).
Would that be a good idea? Any other ideas of how I can save a smaller version of a big string and then be able to compare it with another one for changes?
For example commit list on GitHub shows only first 10, or this line from tornadoweb which uses only 5
return static_url_prefix + path + "?v=" + hashes[abs_path][:5]
Are only the first 5 chars enough to make sure that 2 different hashes for 2 different files won't collide?
LE: The example above from tornadoweb uses md5 hash for generating a query sting for static file caching.
In general, No.
In fact, even if a full MD5 hash were given, it wouldn't be enough to prevent malicious users from generating collisions---MD5 is broken. Even with a better hash function, five characters is not enough.
But sometimes you can get away with it.
I'm not sure exactly what the context of the specific example you provided is. However, to answer your more general question, if there aren't bad guys actively trying to cause collisions, than using part of the hash is probably okay. In particular, given 5 hex characters (20 bits), you won't expect collisions before around 2^(20/2) = 2^10 ~ one thousand values are hashed. This is a consequence of the the Birthday paradox.
The previous paragraph assumes the hash function is essentially random. This is not an assumption anyone trying to make a cryptographically secure system should make. But as long as no one is intentionally trying to create collisions, it's a reasonable heuristic.