I'm using hash on an NSString to get an integer to uniquely represent a URL and then store it in Core Data to unique the object.
Is that enough to make sure it'll be unique? The URL string is usually 50 to 80 chars.
If it's not I'll gladly accept any suggestion to make it better!
No, the hash is not enough to unique a URL. The purpose of a hash is to distribute objects, for example when computing a hash table index.
With the hash code you can do a fast comparison: if two objects have different hashes they're different, but if they have the same hash you still gotta do a full comparison (e.g. isEqualToString:), because two different strings can share a hash.
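As a rough illustration of that "hash first, compare second" idea, here is a minimal Python sketch (not Objective-C or Core Data). The `UrlRegistry` class and its `hash_fn` parameter are hypothetical names made up for the example; the point is only that the hash picks a bucket and string equality makes the final call.

```python
from collections import defaultdict


class UrlRegistry:
    """Uniques URLs: the hash narrows the search, string equality decides."""

    def __init__(self, hash_fn=hash):
        self._hash_fn = hash_fn
        self._buckets = defaultdict(list)  # hash value -> URLs sharing it

    def add(self, url):
        """Return True if the URL was new, False if it was already stored."""
        bucket = self._buckets[self._hash_fn(url)]
        if url in bucket:  # full string comparison, only within one bucket
            return False
        bucket.append(url)
        return True


# A deliberately weak hash (string length) forces collisions for the demo.
registry = UrlRegistry(hash_fn=len)
first = registry.add("http://example.com/a")
second = registry.add("http://example.com/b")  # same hash, different URL
repeat = registry.add("http://example.com/a")  # genuine duplicate
```

Even with every URL of the same length landing in one bucket, the two distinct URLs are both kept and only the real duplicate is rejected.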
Related
I have a table with an integer identity column as a surrogate key for two other columns (int and datetime). To keep the value of this key in sync across test and production environments, I had the idea to make a trigger that sets the surrogate key to some deterministic value instead of the auto-generated identity (in this case a hash of the natural key). The tradeoff, as far as I can tell, is that I introduce the risk of collisions (which can be offset by changing the surrogate column to bigint).
CREATE TRIGGER dbo.TRG_TestTable_SetID ON dbo.TestTable
INSTEAD OF INSERT
AS
BEGIN
    INSERT INTO dbo.TestTable (ID, IntKey, DateKey, MoreData)
    SELECT CONVERT(bigint, HASHBYTES('md5', CONVERT(binary(4), IntKey) + CONVERT(binary(8), DateKey))),
           IntKey, DateKey, MoreData
    FROM inserted
END
Is this a good solution from a design standpoint? Will it still perform better than using the natural composite key as the primary key?
Edit: The int in the natural key is a foreign key to another table, where it is the surrogate key for a guid and a varchar. So the "natural key" alternative on this table would be the rather ugly composite of guid, varchar, and datetime.
I have used similar techniques before for similar reasons and with good success. To get the deterministic qualities that you want, you might try coercing the composite natural key column values to strings, concatenating them, and then generating an MD5 hash from the result to use as your deterministic primary key.
Some considerations:
Case-sensitivity: Unless some of your business keys are meant by design to be case-sensitive, it is a good idea to establish a convention in your system to downcase or upcase letters first, as 'a' is not the same as 'A' to a hash function. This can help avoid issues if you are creating a key from data that may have been keyed in manually by users. For example, a user might key in item number 'itm009876' instead of 'ITM009876', and your various source systems may not be robust enough to conform the value before storing it.
String coercion: Make sure that you coerce values into strings in a way that makes sense and is very specific. For example, use ISO dates and date times plus time zone, or convert dates and date times to Unix timestamp integers before coercing to string.
String delimiter: Use a good string separator between the strings before concatenation, such as ';'. (E.g., the concatenation of A+BC should not be the same as AB+C.)
Store hash as binary: If possible, store the MD5 hash as a 16-byte binary value on the table, and use a HEX() function to display it in a human-readable format. Storing an MD5 hash as binary uses exactly half the space it would take to store a 32-character hexadecimal string, which has advantages for the performance of lookups and joins because it is both shorter and completely avoids any possible cycles wasted on special string comparison logic.
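The considerations above could be combined roughly as follows. This is only a sketch in Python rather than SQL, and `hash_key`, `DELIMITER`, and the choice of an item number plus a timestamp as the natural key are assumptions made up for the example.

```python
import hashlib
from datetime import datetime, timezone

DELIMITER = ";"  # assumed separator; it must never occur inside a key part


def hash_key(item_number: str, created_at: datetime) -> bytes:
    """Return a 16-byte MD5 digest for a (string, datetime) natural key."""
    parts = [
        item_number.strip().upper(),                      # case convention
        created_at.astimezone(timezone.utc).isoformat(),  # ISO 8601, in UTC
    ]
    return hashlib.md5(DELIMITER.join(parts).encode("utf-8")).digest()


when = datetime(2020, 1, 2, 3, 4, 5, tzinfo=timezone.utc)
key = hash_key("itm009876", when)  # raw bytes, suitable for a BINARY(16) column
key_hex = key.hex()                # 32 hex characters, for display only
```

Because the item number is upcased first, 'itm009876' and 'ITM009876' map to the same key, and the digest is kept as raw bytes with the hex form used only for display.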
Pros
May avoid accidental duplication of row data at times
May avoid unnecessary round trips to single authority that must generate or retrieve serial or UUID surrogate keys.
Single column keys are easier for end users to work with.
Single column keys are easier for downstream developers writing SQL, generating URLs, etc. to work with.
MD5 is old and well established, so it's very well supported as an SQL function by most DBMSs, and you can compute the hashes there too as needed without any third-party extensions.
With MD5, collisions are extremely rare. As in, it is more likely that your data center gets destroyed by a meteor than that you experience a collision, even with hundreds of billions of rows in a single table. There is quite a bit of robust discussion about this online if you Google for one popular methodology that employs hash keys: 'data vault hash keys'.
Cons
Collisions are of course still theoretically possible with MD5. Many organizations are still very hesitant about this. So if you must have more bytes on the hash space, and you can live with the potential performance hit during joins and index updates, you can always choose a longer SHA hash.
Generation is complicated. You must choose and document the algorithm for generating the composite key hashes very well and communicate well with other developers in the organization. Just make sure that everyone is doing it the same way.
Because of the non-sequential nature of hashes, they can be inefficient to query in some scenarios, such as in clustered-index tables. Be careful with this, as some DBMSs use a clustered index by default, or may not even offer any other option, as with MySQL's InnoDB. Heap tables are generally better; PostgreSQL uses them by default and Microsoft SQL Server supports them.
(Sorry for any typos and grammar mistakes. I am writing this on my mobile phone. I'll try to come back later and clean it up.)
I have to perform a full scan and get the PK result as well.
I know that the PK is not stored by default and I am pretty sure I did not store that in my persistence query.
I also know that what is stored is a hash of the key to avoid large keys.
I got that information from: AQL - How to show PK in a SELECT
Now, is there a way to reverse engineer the hash and get the PK?
There's no way to reverse engineer the digest into the original PK, unfortunately. Can you deduce it from the data that's in the bins? The default policy regarding key is to use the digest only, rather than send the PK, because that takes extra space that you may not intend to use.
I have to save the combination of lastname, firstname and birth-date of a person as a hash. This hash is later used to search for the same person with the exactly same properties.
My question is whether SHA-1 is a meaningful algorithm for this.
As far as I understand SHA-1, there is virtually no possibility that two different persons (with different attributes) will ever get the same hash-value. Is this right?
If you want to search for a person knowing only those credentials, you could store the SHA-1 in the database (or MD5 for speed, unless you have like a quadrillion people to sample).
The hash will be worthless, as it stores no information about the person, but it can work for searching a database. You just want to make sure that the three pieces of information match, so it would be safe to just concatenate them:
user.hash = SHA1(user.firstName + user.DOB + user.lastName)
And when you query, you could check if the two match:
hash = SHA1(query.firstName + query.DOB + query.lastName)
for user in database:
    if user.hash == hash:
        return user
I put query.DOB in the middle because the first and last name might collide, like if JohnDoe Bob was born on the same day as John DoeBob. I'm not aware of numeric names, so I think this will stop collisions like those ;)
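To see that effect concretely, here is a small Python sketch using the standard `hashlib` module. The helper `sha1_hex` and the sample date string are assumptions for illustration; the names are the ones from the example above.

```python
import hashlib


def sha1_hex(*parts: str) -> str:
    """SHA-1 of the parts concatenated with no separator."""
    return hashlib.sha1("".join(parts).encode("utf-8")).hexdigest()


dob = "1980-01-31"  # assumed ISO date string

# Names only: both people produce the same input, "JohnDoeBob".
names_only_a = sha1_hex("JohnDoe", "Bob")
names_only_b = sha1_hex("John", "DoeBob")

# DOB in the middle: "JohnDoe1980-01-31Bob" vs "John1980-01-31DoeBob".
with_dob_a = sha1_hex("JohnDoe", dob, "Bob")
with_dob_b = sha1_hex("John", dob, "DoeBob")
```

The first pair hashes identically because the hash function only ever sees the concatenated string; the second pair differs because the date splits the names at different positions.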
But if this is a big database, I'd try MD5. It's faster, and although a collision is theoretically possible, the chance is really small (in your case I'm confident one won't occur).
To put that into perspective, a collision is a 1 / 2^128 occurrence, which is:
1
---------------------------------------------------
340,282,366,920,938,463,463,374,607,431,768,211,456
And that's a little smaller than:
0.0000000000000000000000000000000000000293873 %
I'm pretty sure you're not going to get a collision ;)
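The 1 / 2^128 figure is the chance that one specific pair of inputs collides. For a whole table, the usual birthday approximation gives the chance that any two rows collide. The sketch below assumes an ideal, uniformly distributed hash and no deliberate attack; `collision_probability` is a made-up helper name, and ten billion people is just a generous sample size.

```python
import math


def collision_probability(n_items: int, hash_bits: int) -> float:
    """Birthday approximation: P(any collision) ~ 1 - exp(-n(n-1) / 2^(bits+1))."""
    exponent = -n_items * (n_items - 1) / 2 ** (hash_bits + 1)
    return -math.expm1(exponent)  # computes 1 - e^x accurately for tiny x


p_single_pair = 1 / 2 ** 128
p_ten_billion = collision_probability(10_000_000_000, 128)
```

Under those assumptions the table-wide probability is far larger than the single-pair figure, yet still below 1e-18 for ten billion people.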
Hash collisions are inevitable. However small the chance of a collision may be, you shouldn't rely on the hash alone if you really want 100% identification.
If you use hashing to speed up database searches, there is no need to use SHA256. Use whatever hash function your system has with the smallest size (MD5() for MySQL, or you might even try CRC32 if your database is not so big). Just make sure that when you query the table, you provide all the conditions you are searching by:
SELECT * from user WHERE hash="AABBCCDD" AND firstname="Pavel" AND surname="Sokolov"
Databases maintain a value called index cardinality. It's a measure of the uniqueness of the data in a given index. So you can index the fields you want together with the hash field, and the database will choose the most selective index for the query itself. Adding additional conditions doesn't affect performance negatively, because most databases can use only one index when selecting data from a table, and they will pick the one with the highest cardinality.
The database first selects all rows that match the index and then scans through them to discard rows that don't match the other conditions.
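Here is a minimal, runnable sketch of that pattern. It uses Python's built-in `sqlite3` with an in-memory database as a stand-in for MySQL, and `zlib.crc32` as the short hash; the table layout, the `name_hash` helper, and the sample rows are assumptions for the example.

```python
import sqlite3
import zlib


def name_hash(firstname: str, surname: str) -> int:
    """CRC32 of the joined name: a short, non-unique lookup key."""
    return zlib.crc32(f"{firstname};{surname}".encode("utf-8"))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user (hash INTEGER, firstname TEXT, surname TEXT)")
conn.execute("CREATE INDEX idx_user_hash ON user (hash)")

people = [("Pavel", "Sokolov"), ("Anna", "Ivanova")]
conn.executemany(
    "INSERT INTO user VALUES (?, ?, ?)",
    [(name_hash(f, s), f, s) for f, s in people],
)


def find_user(firstname: str, surname: str):
    """Narrow by the indexed hash, then confirm with the real columns."""
    return conn.execute(
        "SELECT firstname, surname FROM user "
        "WHERE hash = ? AND firstname = ? AND surname = ?",
        (name_hash(firstname, surname), firstname, surname),
    ).fetchone()


found = find_user("Pavel", "Sokolov")
missing = find_user("Pavel", "Ivanov")
```

Since the real columns are always part of the WHERE clause, a hash collision can only cost a little extra scanning; it can never return the wrong person.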
If you cannot use the method I described, well, I think the probability of an MD5 collision is very low in a database of people's names.
P.S. I hope you know that "the combination of lastname, firstname and birth-date of a person" is not enough to 100% identify a human? This combination will match for two different people sooner than the hashes will collide.
If you are concerned with collisions there is a good discussion here:
Understanding sha-1 collision weakness
If you have security concerns, I would consider SHA-256 instead.
We have a URL with cool names of things, for example:
domain.com/name-of-a-news-with-cool-keywords-4673612453
My question is about the last hash, the hash you usually use to get the ID of the news from your database.
Our application is already built in such a way that new article IDs are not incremental in the database; they are "random" INTs (this is because we use an encoder/decoder function to generate alphanumeric keys, like YouTube).
A friend of mine told me to change this to shorter numbers (that would imply changing a lot of things in the application and its internal logic).
The SEO question is: is it so important to have short numbers as a hash?
I mean ... is it really an SEO improvement to have
domain.com/name-of-a-news-with-cool-keywords-314
instead of
domain.com/name-of-a-news-with-cool-keywords-4673612453
?
How many articles do you have?
To uniquely represent them all like this, you're saying that you have over 1,000,000,000 articles all with exactly the same keywords.
Having these numbers might not affect SEO, but logically, I'd shorten them too. It's like generating a 100 character hash when building a database for 1,000 items.
In short, don't overkill. Keep it short.
I have a big MySQL InnoDB table (about 1 million records, growing by 300K weekly), let's say with blog posts. This table has an indexed url field.
When adding new records to it, I check for existing records with the same url. Here is what the query looks like:
SELECT COUNT(*) FROM `tablename` WHERE url='http://www.google.com/';
Currently the system produces about 10-20 queries per second, and this will increase. I'm thinking about improving performance by adding an additional field holding the MD5 hash of the URL.
SELECT COUNT(*) FROM `tablename` WHERE md5url=MD5('http://www.google.com/');
So it will be shorter and of constant length, which is better for an index than the URL field. What do you guys think? Does it make sense?
Another suggestion, from a friend of mine, is to use CRC32 instead of MD5, but I'm not sure how unique the CRC32 result will be. Let me know what you think about CRC32 for this role.
UPDATE: the URL column is unique for each row.
Create a non-clustered index on URL. That will let your SQL engine do all the optimization internally and will produce the best results!
If you create an index on a VARCHAR column, the engine builds its own lookup structure internally anyway (typically a B-tree rather than a hash), and using the index can improve performance by an order of magnitude or even more!
Also, something to keep in mind if you're only checking whether a URL exists, is that certain SQL products will produce faster results with a query like this:
IF NOT EXISTS(SELECT * FROM `tablename` WHERE url='')
-- return TRUE or do your logic here
I think CRC32 would actually be better for this role, as it's shorter and takes up less space in the database. If you're receiving that many queries, isn't the objective to save space anyway? If it does the job, I'd say go for it.
Although, since it's only 32 bits long, it's of course not as unique as MD5. You will have to decide whether you want uniqueness or want to save space.
I still think I'd choose CRC32.
My system generates roughly 4k queries per second, and I use CRC32 for links.
Using the built-in indexing is always best; otherwise you should volunteer to add to their codebase anyway ;)
When using a hash, create a two-column index on the hash and the URL. If you index only the first few characters of the URL, the query still does a complete match, but the index doesn't store more than those first few characters.
Something like this:
INDEX(CRC32_col, URL_col(5))
Either hash would work in that case. It's a trade-off of space vs speed.
Also, this query will be much faster:
SELECT * FROM table WHERE hash_col = 'hashvalue' AND url_col = 'urlvalue' LIMIT 1;
This will find the first match and stop. Much faster than finding many matches for the COUNT(*) calculation.
Ultimately the best choice is to make test cases for each variant and benchmark.
Don't most SQL engines use hash functions internally for text column searches?
If you're going to use hashed keys and you're concerned about collisions, use two different hash functions and concatenate the two hashed values.
But even if you do this, you should always store the original key value in the row as well.
If the tendency is for the result of that select statement to be rather high, an alternative solution would be to have a separate table which keeps track of the counts. Obviously there are high penalties for using that technique, but if this specific query is a common one and is too slow, this might be a solution.
There are obvious trade-offs involved in this solution, and you probably do not want to update this second table after every individual insertion of a new record, as that would slow down your insertions.
If you choose a hash you need to take collisions into account. Even with a large hash like MD5 you have to account for the birthday paradox (the basis of the birthday attack): the chance that any two rows collide grows much faster than the chance that one specific pair does. For a smaller hash like CRC-32 the collision probability will be quite large, and your WHERE has to specify both the hash and the full URL.
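To put rough numbers on that for the table size in the question, the sketch below estimates the expected number of row pairs that share a hash value. It assumes hash outputs are uniformly distributed, which is only an approximation for CRC-32 on real URLs, and `expected_colliding_pairs` is a made-up helper name.

```python
def expected_colliding_pairs(n_rows: int, hash_bits: int) -> float:
    """Expected number of row pairs sharing a hash, assuming uniform output."""
    pairs = n_rows * (n_rows - 1) // 2  # number of distinct row pairs
    return pairs / 2 ** hash_bits


crc32_pairs = expected_colliding_pairs(1_000_000, 32)
md5_pairs = expected_colliding_pairs(1_000_000, 128)
```

At one million rows this gives on the order of a hundred colliding pairs for a 32-bit hash and effectively zero for a 128-bit one, which is why the CRC-32 variant needs the full URL in the WHERE clause.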
But I gotta ask, is this the best way to spend your efforts? Is there nothing else left to optimize? You may well be doing premature optimization unless you have clear metrics and measurements indicating that this problem is the bottleneck of the system. After all, this kind of seek is what databases are optimized for (all of them), and by doing something like a hash you may actually decrease performance (e.g. your index may become fragmented because hashes have a different distribution than URLs).