I have to perform a full scan and get the PK result as well.
I know that the PK is not stored by default and I am pretty sure I did not store that in my persistence query.
I also know that what is stored is a hash of the key to avoid large keys.
I got that information from: AQL - How to show PK in a SELECT
Now, is there a way to reverse engineer the hash and get the PK?
There's no way to reverse engineer the digest into the original PK, unfortunately. Can you deduce it from the data that's in the bins? The default policy regarding key is to use the digest only, rather than send the PK, because that takes extra space that you may not intend to use.
Related
I have a use case where I have to search by key and in another use case, I have to search by value. Given this scenario, what's the best approach as scanning the entire cache can degrade performance (to filter by value).
Do a reverse store, i.e. store the value as the key and the key as the value, in the same logical table?
Use a different database and store Value, Key as K | V. I have seen a few posts suggesting that using multiple Redis databases is a bad idea and deprecated?
Or is there a better alternative/approach?
Do you really need to use Redis? Redis (and generally key-value stores) are not optimized for this kind of task.
If you need to stick with Redis you can create index to implement search by value. It will not be as storage effective and intuitive as e.g. SQL database table though. See documentation here: https://redis.io/topics/indexes
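The secondary-index pattern from that documentation can be sketched in plain Python, with dicts standing in for Redis structures (in real Redis the forward map would be SET/GET keys and the reverse index a set per value, e.g. SADD idx:value:&lt;v&gt; &lt;k&gt;). All names here are illustrative, not part of any Redis API:

```python
forward = {}       # key -> value          (Redis: SET key value)
by_value = {}      # value -> set of keys  (Redis: SADD idx:value:<v> key)

def put(key, value):
    old = forward.get(key)
    if old is not None:                  # keep the index consistent on update
        by_value[old].discard(key)
    forward[key] = value
    by_value.setdefault(value, set()).add(key)

def get_by_key(key):
    return forward.get(key)

def get_by_value(value):                 # no full scan: one index lookup
    return by_value.get(value, set())

put("user:1", "alice")
put("user:2", "bob")
put("user:3", "alice")
print(get_by_value("alice"))   # {'user:1', 'user:3'}
```

The cost is the extra storage for the index and the discipline of updating both structures together (in Redis you would wrap the two writes in a MULTI/EXEC transaction).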
I have a table with an integer identity column as a surrogate key for two other columns (int and datetime). To keep the value of this key in sync across test and production environments, I had the idea to make a trigger that sets the surrogate key to some deterministic value instead of the auto-generated identity (in this case a hash of the natural key). The tradeoff, as far as I can tell, is that I introduce the risk of collisions (which can be offset by changing the surrogate column to bigint).
CREATE TRIGGER dbo.TRG_TestTable_SetID ON dbo.TestTable
INSTEAD OF INSERT
AS
BEGIN
    -- Replace the identity value with a deterministic hash of the natural key.
    -- HASHBYTES('md5', ...) returns varbinary(16); CONVERT(bigint, ...)
    -- truncates on the left, so the stored key is the rightmost 8 bytes
    -- of the 16-byte hash.
    INSERT INTO dbo.TestTable (ID, IntKey, DateKey, MoreData)
    SELECT CONVERT(bigint, HASHBYTES('md5',
               CONVERT(binary(4), IntKey) + CONVERT(binary(8), DateKey))),
           IntKey, DateKey, MoreData
    FROM inserted
END
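The computation in the trigger can be modeled outside the database to reason about it. This is a sketch, not a byte-for-byte reproduction: it assumes SQL Server's CONVERT(bigint, &lt;varbinary(16)&gt;) keeps the rightmost 8 bytes, and it packs the datetime as a big-endian epoch timestamp, whereas SQL Server's internal datetime layout differs, so the actual key values will not match the trigger's:

```python
import hashlib
import struct
from datetime import datetime, timezone

def surrogate_key(int_key: int, date_key: datetime) -> int:
    """Deterministic 64-bit surrogate key from (int, datetime).

    MD5 over the raw key bytes, then the digest narrowed to a signed
    64-bit integer by keeping the last 8 bytes, mirroring the T-SQL
    CONVERT(bigint, HASHBYTES(...)) behavior.
    """
    raw = struct.pack(">i", int_key) + struct.pack(">q", int(date_key.timestamp()))
    digest = hashlib.md5(raw).digest()               # 16 bytes
    return int.from_bytes(digest[-8:], "big", signed=True)

d = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(surrogate_key(42, d))   # same inputs always yield the same key
```

Note that keeping only 8 of the 16 hash bytes halves the effective hash space, which is why widening the surrogate column to bigint only partially offsets the collision risk mentioned above.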
Is this a good solution from a design standpoint? Will it still perform better than using the natural composite key as the primary key?
Edit: The int in the natural key is a foreign key to another table, where it is the surrogate key for a guid and a varchar. So the "natural key" alternative on this table would be the rather ugly composite of guid, varchar, and datetime.
I have used similar techniques before for similar reasons and with good success. To get the deterministic qualities you want, you might try coercing the composite natural key column values to strings, string-concatenating them together, and then generating an MD5 hash from that to use as your deterministic primary key.
Some considerations:
Case-sensitivity. Unless some of your business keys are case-sensitive by design, it is a good idea to establish a convention in your system of downcasing or upcasing letters first, since 'a' is not the same as 'A' to a hash function. This helps avoid issues when the key is built from data that may have been entered manually. For example, a user might key in item number 'itm009876' instead of 'ITM009876', and your various source systems may not be robust enough to conform the value before storing it.
String coercion. Make sure that you coerce values into strings in a way that makes sense and is very specific: for example, use ISO dates and datetimes with time zone, or convert dates and datetimes to Unix timestamp integers before coercing to string.
String delimiter. Use a good separator between the strings before concatenating, such as ';'. (E.g., the concatenation of 'A'+'CB' should not be the same as 'AB'+'C'.)
Store hash as binary. If possible, store the MD5 hash as a 16-byte binary value on the table, and use a HEX() function to display it in a human-readable format. Storing an MD5 hash as binary uses exactly half the space of a 32-character hexadecimal string, which has advantages for lookup and join performance because the key is both shorter and completely avoids any cycles wasted on special string-comparison logic.
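The considerations above can be pulled together in a short sketch. The delimiter and normalization choices here are illustrative assumptions, not a fixed standard; the point is that every producer of keys must apply the exact same rules:

```python
import hashlib
from datetime import datetime, timezone

DELIM = ";"   # illustrative delimiter choice; must be agreed on system-wide

def composite_hash_key(*parts) -> bytes:
    """16-byte MD5 key from a composite natural key, following the
    considerations above: strings lowercased, datetimes rendered as
    ISO-8601 in UTC, parts joined with an explicit delimiter, and the
    digest kept as raw binary rather than a hex string."""
    norm = []
    for p in parts:
        if isinstance(p, datetime):
            norm.append(p.astimezone(timezone.utc).isoformat())
        elif isinstance(p, str):
            norm.append(p.lower())
        else:
            norm.append(str(p))
    return hashlib.md5(DELIM.join(norm).encode("utf-8")).digest()

key = composite_hash_key("ITM009876", 7, datetime(2024, 1, 1, tzinfo=timezone.utc))
print(key.hex())   # store the 16 raw bytes; display as hex for humans
```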
Pros
May avoid accidental duplication of row data at times
May avoid unnecessary round trips to a single authority that must generate or retrieve serial or UUID surrogate keys.
Single-column keys are easier for end users to work with.
Single-column keys are easier for downstream developers writing SQL, generating URLs, etc.
MD5 is old and well established, so it is well supported as a SQL function by most DBMSs, and you can compute the hashes there too as needed, without third-party extensions.
With MD5, collisions are extremely rare: your data center is more likely to be destroyed by a meteor than to experience a collision, even with hundreds of billions of rows in a single table. There is quite a bit of robust discussion about this online if you search for one popular methodology that employs hash keys: 'data vault hash keys'.
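The "extremely rare" claim can be checked with a back-of-envelope birthday bound. For n random keys in a space of 2^b values, P(at least one collision) ≈ n² / 2^(b+1), which is accurate while the probability is tiny:

```python
def collision_probability(n: float, bits: int) -> float:
    """Birthday-bound approximation: P(collision) ~ n^2 / 2^(bits+1)."""
    return n * n / 2.0 ** (bits + 1)

n = 3e11                          # "hundreds of billions" of rows
p = collision_probability(n, 128)  # MD5 is 128 bits
print(f"{p:.2e}")                  # ~1.3e-16: vanishingly small
```

Even truncating to 64 bits (as the bigint trigger above does) gives roughly n² / 2^65, which for hundreds of billions of rows is no longer negligible, so the full 128-bit digest is the safer choice.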
Cons
Collisions are of course still theoretically possible with MD5, and many organizations remain hesitant about this. If you must have more bytes of hash space, and you can live with the potential performance hit during joins and index updates, you can always choose a longer SHA hash.
Generation is complicated. You must choose and document the algorithm for generating the composite-key hashes very carefully and communicate it well to other developers in the organization, so that everyone does it the same way.
Because of the non-sequential nature of hashes, they can be inefficient to query in some scenarios, such as clustered-index tables. Be careful here, as some DBMSs use a clustered index by default, or may not even offer another option, such as MySQL's InnoDB. Heap tables, as supported (and default) in PostgreSQL and Microsoft SQL Server, are generally a better fit.
I am debating which datatype to use when storing a SHA-256 hash in SQL Server: should it be CHAR(64) or BINARY(32)? The column will be part of a unique clustered index. I know I'm probably splitting hairs at this point, but I want to do this right the first time, and I know that sometimes primitive data types are faster while at other times newer, more exotic types perform better. (Yes, I know char(64) isn't all that new, but it's newer than byte storage.)
I've looked around and can't find anything about the performance of one vs. the other in terms of search, etc.
You do know that by using CHAR(64), each row will occupy 64 bytes even if your key is just "A"?
I am not going to discuss the fact that you are using a string as a clustered index; I'll just assume you have a good reason for that. But why CHAR instead of VARCHAR? Are you planning to update the value? That would be the only reason I can see to use char instead of varchar.
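The size question itself is easy to settle outside the database. A SHA-256 digest is always 32 raw bytes, while its hexadecimal rendering is 64 characters, so CHAR(64) doubles the per-row key storage (and adds collation-aware string comparison) versus BINARY(32):

```python
import hashlib

digest = hashlib.sha256(b"some primary key material").digest()
hexstr = digest.hex()

print(len(digest))   # 32 -> fits BINARY(32)
print(len(hexstr))   # 64 -> needs CHAR(64), double the bytes per row
```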
I want to create a large table (about 45 billion rows) that is always accessed by a unique key.
Outside of the DB, the best structure to hold this is a Dictionary or a HashSet, but of course due to the size of data, it's not possible to do this outside of the database.
Does SQL Server provide a structure that's optimized for key-value access? I understand that a clustered key is very fast, but still it's an index and therefore there will be some additional disk reads associated with traversing index pages. What I would like to get from SQL Server is a "native" structure that stores data as key-value pairs and then makes it possible to access values based on keys.
In other words, my question is how to store in SQL Server 45 billion rows and efficiently access them WITHOUT having an index, clustered or non-clustered, because reading the index non-leaf pages may result in substantial IO, and since each value can be accessed by a unique key, it should be possible to have a structure where the hash of a key resolves into a physical location of the value. To get 1 value, we would need to do 1 read (unless there are hash collisions).
(an equivalent in Oracle is Hash Cluster)
Thanks for your help.
No such thing in SQL Server. Your only option is an index. If you're going to request all columns for a given key, use a clustered index. If you only need a subset, use a non-clustered index that includes only the columns you want, like this:
create index IX_MyBigTable on MyBigTable(keyColumn) include (col1, col2, col3youneed);
This will be pretty efficient.
According to my benchmarks, the best approach is to create a hash column for the key. Details.
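The benchmark details aren't reproduced here, but the hash-column idea itself can be sketched: index a fixed-size hash of a (possibly wide) key, seek on that narrow hash, then confirm the real key among the few candidates that share it. A dict stands in for the indexed column; all names are illustrative:

```python
import hashlib

index = {}   # hash-column value -> list of (key, row) candidates

def key_hash(key: str) -> bytes:
    return hashlib.md5(key.encode("utf-8")).digest()[:8]  # 8-byte hash column

def insert(key, row):
    index.setdefault(key_hash(key), []).append((key, row))

def lookup(key):
    for k, row in index.get(key_hash(key), []):  # seek on the narrow hash
        if k == key:                             # then verify the full key
            return row
    return None

insert("very-long-composite-key-123", {"v": 1})
print(lookup("very-long-composite-key-123"))   # {'v': 1}
```

In SQL Server this would be a persisted computed column over the key, indexed in place of the wide key itself; the verification step handles the rare hash collisions.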
Here is my question: why should I use auto-incrementing fields as primary keys on my tables instead of something like UUID values?
What are the main advantages of one over another? What are the problems and strengths of them?
Simple numbers consume less space. UUID values consume 128 bits each. Working with numbers is also simpler. For most practical purposes, 32-bit or 64-bit integers can serve well as the primary key; 2^64 is a very large number.
Consuming less space doesn't just save hard disk space. It means faster backups, better performance in joins, and having more real data cached in the database server's memory.
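The size difference is easy to verify, a UUID is twice the width of even the larger integer option:

```python
import struct
import uuid

print(len(uuid.uuid4().bytes))   # 16 bytes per UUID
print(struct.calcsize("q"))      # 8 bytes per 64-bit integer key
print(struct.calcsize("i"))      # 4 bytes per 32-bit integer key
print(2**64)                     # 18446744073709551616 distinct 64-bit values
```

And the gap compounds: the key is repeated in every index entry and every foreign-key column that references it.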
You don't have to use auto-incrementing primary keys, but I do. Here's why.
First, if you're using ints, they're smaller than UUIDs.
Second, it's much easier to query using ints than UUIDs, especially if your primary keys turn up as foreign keys in other tables.
Also, consider the code you'll write in any data access layer. A lot of my constructors take a single id as an int. It's clean, and in a type-safe language like C# - any problems are caught at compile time.
Drawbacks of autoincrementers? Potentially running out of space. I have a table whose id field is at 200M at the moment; it'll bust the 2-billion limit of a 32-bit int within a year if I leave it as is.
You could also argue that an autoincrementing id has no intrinsic meaning, but then the same is true of a UUID.
I guess by UUID you mean something like a GUID? GUIDs are better when you will later have to merge tables. For example, if you have local databases spread around the world, each can generate unique GUIDs for row identifiers, and the data can later be combined into a single database without the ids conflicting. With autoincrement in this case, you would need a composite key whose other half identifies the originating location, or you would have to rewrite the IDs as you imported data into the master database.
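The merge scenario is easy to demonstrate. Two "sites" insert rows independently; with autoincrement both start at 1 and collide on merge, while random UUIDs are, for all practical purposes, collision-free. Site names here are illustrative:

```python
import uuid

site_a_auto = [1, 2, 3]            # both sites count from 1
site_b_auto = [1, 2, 3]
print(set(site_a_auto) & set(site_b_auto))   # {1, 2, 3} -> conflicts on merge

site_a_guid = [uuid.uuid4() for _ in range(3)]
site_b_guid = [uuid.uuid4() for _ in range(3)]
print(set(site_a_guid) & set(site_b_guid))   # set() -> merges cleanly
```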