Hash lookup table primary key - sql

I have to populate a database with a set of $string,md5($string) CSV files, essentially a hash lookup table.
My question is:
Should I use the string as the primary key? The hash? Add an extra ID column?
I think the hash would be a good choice, since that's what I'll be querying the database by, but hashes can collide. The strings should be unique anyway (to save space), but I wanted a second opinion on it.
I'm asking with performance in mind, considering the table will be populated with at least 35 GB of data. Any suggestions are appreciated.

If the string is going to be used for foreign key references, then I would not (necessarily) recommend hashing. You can:
Create a serial (auto-incremented) id column as the primary key.
Create a unique index on name.
This should facilitate lookups in the table as well as verifying that name is unique. It is better to use fixed-length numbers for foreign key references than variable length strings.
If you use a hash value and really do not want duplicates, then you would need some mechanism for distinguishing between different strings with the same hash value. A natural choice would be some sort of incremental counter -- but that would leave you pretty close to the solution with just the counter and no hash. I don't, per se, see the advantage of storing such a hash value in the table.
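A minimal sketch of the layout described above, using Python's built-in sqlite3 module rather than the asker's actual database. The table name `lookup`, the column names, and the helper functions are assumptions made for illustration. The extra `md5` column with a non-unique index is not part of this answer's recommendation; it is included only because the question wants to look strings up by hash.

```python
import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE lookup (
        id   INTEGER PRIMARY KEY,   -- surrogate key, assigned automatically
        name TEXT NOT NULL UNIQUE,  -- unique index: each string is stored once
        md5  TEXT NOT NULL          -- deliberately not unique: hashes may collide
    );
    CREATE INDEX idx_lookup_md5 ON lookup (md5);
""")

def add_string(s):
    """Store a string together with its MD5 hex digest."""
    digest = hashlib.md5(s.encode("utf-8")).hexdigest()
    conn.execute("INSERT INTO lookup (name, md5) VALUES (?, ?)", (s, digest))

def find_by_hash(digest):
    """Return every stored string with this digest (usually zero or one)."""
    rows = conn.execute("SELECT name FROM lookup WHERE md5 = ?", (digest,))
    return [name for (name,) in rows]

add_string("hello")
add_string("world")
print(find_by_hash(hashlib.md5(b"hello").hexdigest()))
```

Because the lookup returns a list, two different strings with the same digest would both be returned instead of one silently replacing the other.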

I ended up using a SERIAL id field, so I could count how many entries I had.
The initial problem arose because I thought you could only index columns declared as PRIMARY KEY.
So the problem is solved now: I just indexed properly and performance is great!
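To illustrate that an index does not have to be on the primary key, here is a small sketch using Python's sqlite3 module (the asker's own database differs, and the table and index names are made up). It compares the query plan before and after creating a secondary index on the hash column.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE lookup (id INTEGER PRIMARY KEY, name TEXT, md5 TEXT)")

query = "EXPLAIN QUERY PLAN SELECT name FROM lookup WHERE md5 = ?"

def plan(sql):
    # The last column of each plan row is a human-readable description.
    return " | ".join(row[-1] for row in conn.execute(sql, ("x",)))

before = plan(query)  # no index on md5 yet, so the whole table is scanned
conn.execute("CREATE INDEX idx_lookup_md5 ON lookup (md5)")
after = plan(query)   # the secondary index is now used for the lookup

print(before)
print(after)
```

The exact wording of the plan text varies between SQLite versions, but the second plan should mention `idx_lookup_md5` while the first should not.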

Related

Database: Should ids be sequential?

I want to use an id as a primary key for my table. In each record, I am also storing an id from another source, but these ids are in no way sequential.
Should I add an (auto-incremented) column with a "new" id? It is very important that queries by the id are as fast as possible.
Some info:
The content of my table is only stored temporarily. The table is often cleared (TRUNCATE) and then filled with new content.
The database is SQL Server 2008.
After writing content to the table, I create an index on the id column.
Thanks!
As long as you are sure the supplied ids are unique, there's no need to create another (surrogate) id to use as primary key.
Under most circumstances, an index on the existing id should be sufficient. You can make it slightly faster by declaring it as a primary key.
From what you describe a new id is not necessary for performance. If you do add one, the table will be slightly larger, which has a (very small) negative effect on performance.
If the existing id is not numeric (or not an integer), then there might be a small gain from using a more efficient type for the index. But, your best bet is to make the existing id a primary key (although this might affect load performance).
Note: I usually prefer synthetic primary keys, so this answer is very specific to your question.
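As a rough illustration of using the existing, non-sequential id directly as the primary key, here is a sketch with Python's sqlite3 module. The question is about SQL Server, so this only demonstrates the idea; the table name `staging`, the column names, and the sample ids are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# source_id comes from the other system; gaps and arbitrary order are fine.
conn.execute("CREATE TABLE staging (source_id INTEGER PRIMARY KEY, payload TEXT)")

rows = [(90417, "c"), (12, "a"), (5550123, "d"), (733, "b")]
conn.executemany("INSERT INTO staging VALUES (?, ?)", rows)

def payload_for(source_id):
    """Look a row up by the externally supplied id, or return None."""
    row = conn.execute(
        "SELECT payload FROM staging WHERE source_id = ?", (source_id,)
    ).fetchone()
    return row[0] if row else None

print(payload_for(733))
```

The primary key also rejects a second row with the same `source_id`, which is the check that "as long as you are sure the ids are unique" otherwise leaves to the loader.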
If you are after speed, I would join the two IDs together (either in the application or in a stored procedure) and then put them in one column.

Surrogate key from all column hash

I would like to create a surrogate key for a hive table, but one that could be replicated every time the data was put in the table. Other tables would reference this table through the surrogate key, and the table could be regenerated to add more rows, and that association wouldn't be broken. My thought is to basically have a composite key of all columns in the table.
Is it reasonable to concatenate all of my columns and take the md5 hash of that string to use as an easy look-up to that row?
The problems that I see with this solution are:
If the data changes in the rows, the association will still be broken
There is no real guarantee that the hash values are unique (though with my numbers, collisions are very unlikely)
Notes on the data:
The data is partitioned by day, and there are around 100k rows for each day.
There are cases where two rows have the exact same data, and it's fine if they end up with the same key.
You have answered your own question:
There is no real guarantee that the hash values are unique (though with my numbers, collisions are very unlikely)
Keys need to be unique; that's their purpose. If you give me a record's key (be it surrogate or natural), I can find that record. Hashes are not guaranteed to be unique.
You need to go back and ask yourself WHY you want this surrogate key. If it's just for a unique identifier, then use your DB's unique identifier|sequence type and be done with it.
If there is a business requirement (The need to replicate the SK <- why?) then go back to that reason and try and come up with a more direct|simple solution for it.
(We tried hashes for type2 change detection - it did not work and we went back to column by column comparisons)
This concerns me:
There are cases that two rows have the exact same data and it's fine if they end up with the same key
If you have 2 records in your database that are exactly the same then you are missing data: even a sequence or timestamp, something that can be used to differentiate your records. If you don't have a natural key, you are probably missing something.
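For reference, the hashing idea from the question can be sketched as below. This is not an endorsement of the approach, and the function name `row_key` and the sample values are invented. The one detail worth showing is that the columns are serialized with explicit boundaries before hashing; plain concatenation would turn ("ab", "c") and ("a", "bc") into the same input string and therefore the same key, independently of any MD5 collision.

```python
import hashlib
import json

def row_key(row):
    """Derive a repeatable key from all column values of a row.

    Assumes every value is JSON-serializable (strings, numbers, None).
    """
    # JSON keeps column boundaries unambiguous, unlike naive concatenation.
    encoded = json.dumps(list(row), ensure_ascii=False, separators=(",", ":"))
    return hashlib.md5(encoded.encode("utf-8")).hexdigest()

a = row_key(("2024-01-01", "ab", "c"))
b = row_key(("2024-01-01", "a", "bc"))

print(a == row_key(("2024-01-01", "ab", "c")))  # same data, same key on every load
print(a != b)                                   # column boundaries are respected
```

Both concerns raised in the thread still apply: changing any value in a row changes its key, and two identical rows share one key, so the result identifies row content rather than an individual record.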

Warning message for a primary key constraint

I have a problem with assigning a primary key in one of the tables containing employee info. There is no unique column in that table, so the only option I am left with is using a combination of three columns as the primary key.
But it gives this warning message: "Warning! The maximum key length is 900 bytes. The index 'pk_hrempid' has maximum length of 1530 bytes. For some combination of large values, the insert/update operation will fail." I understand that this could become a major problem when inserting data in the future. Is there a solution for this warning?
My other question is: can I use an auto-increment value as a unique id, and is that recommended? I want to make sure that it does not cause problems in the future, as I have many tables containing employee info from other departments. Some employees may be present in two or more tables.
Any help is appreciated!
Whilst it sounds from your attempt at a compound primary key that you're attempting the best practice of using a "natural key", there's nothing 'wrong' with using an auto-incrementing ID field.
If your suggested fields are too large to be used as a key, perhaps they weren't the best choice in the first place. Could you add another "natural" key column with a better datatype perhaps?
Don't forget to take into account the optimizations that are possible by choosing good indexes and suitable datatypes for tables that are going to be heavily-queried.
This is a limitation of a primary key. You cannot have a PK larger than 900 bytes.
You could add an identity column to the table and set it as the primary key. I prefer to use GUIDs, as they are globally unique.
I'd go with an auto-increment type solution for the primary key. The problem with using personal data for this sort of thing is that you cannot guarantee uniqueness, which is the fundamental requirement of a primary key.
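A small sketch of the auto-increment approach, written with Python's sqlite3 module instead of the asker's SQL Server. All table and column names here (`employee`, `payroll`, and so on) are hypothetical. It shows two employees with identical personal data receiving distinct keys, and another table referring to an employee through that key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
    CREATE TABLE employee (
        id         INTEGER PRIMARY KEY AUTOINCREMENT,
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL
    );
    CREATE TABLE payroll (
        employee_id INTEGER NOT NULL REFERENCES employee (id),
        salary      INTEGER NOT NULL
    );
""")

def add_employee(first, last):
    """Insert an employee and return the generated key."""
    cur = conn.execute(
        "INSERT INTO employee (first_name, last_name) VALUES (?, ?)", (first, last)
    )
    return cur.lastrowid

alice = add_employee("Alice", "Smith")
also_alice = add_employee("Alice", "Smith")  # same name, but a distinct key
conn.execute("INSERT INTO payroll VALUES (?, ?)", (alice, 50000))
print(alice, also_alice)
```

Other department tables can reference `employee.id` in the same way as `payroll` does, so an employee who appears in several of them is still identified by a single small integer.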

Do I need a primary key if something will NOT be changed?

If I had a site where a user can flag another user post and it cannot be undone or changed, do I need to have a primary key? All my selects would be on the post_id and with a where clause to see if the user already flagged it.
It seems to me from some of your other posts that the reason you are trying to avoid adding a primary key to your table is to save space.
Stop thinking like that.
It's a bad idea to make non-standard optimizations like this without having tested them first to see if they actually work. Have you run some tests that show that you save a significant amount of space in your database by omitting the primary key on this table? Or are you just guessing?
Using a primary key doesn't necessarily mean that you will use more space. Depending on the database, if you omit the primary key it might add a hidden field for you anyway (for example, if you don't have a PK in MySQL/InnoDB, it adds a hidden clustered index on a synthetic column containing 6-byte row ID values). If you do use a primary key, rather than adding a new column you can just choose some existing columns that you know should be unique anyway. It won't take up any more space; it will just mean that the data will be stored in a different order to make it easier to search.
When you add an index, that index is going to take up extra space, as an index is basically just a copy of a few columns of the table, plus a link back to the row in the original table. Remember that hidden column the database uses when you don't have a PK? Well now it has to use that to find your rows, so you'll get a copy of it in your index too. If you use a primary key then you probably don't need one of your indexes that you would have added, so you're actually saving space here.
Besides all this, some useful database tools just won't work well if you don't have a primary key on your table. You will annoy everyone that has to maintain your database after you are gone.
So tell me, why do you think it's a good idea to NOT have one?
A primary key has nothing to do with whether data can be changed - it's a single point of reference for an entire row, which can make looking up and/or changing data faster.
All my selects would be on the post_id and with a where clause to see if the user already flagged it.
You need to provide more information about business rules. For example, should the system support more than one user flagging the same post?
If the answer is "no", then I would model a POST_STATUS_CODE table and have a foreign key to the table in your POSTS table.
If the answer is "yes", then I would still have a POST_STATUS_CODE table but also a table linking the POSTS and POST_STATUS_CODE tables - say POSTS_STATUS_XREF.
I have a post_flag table with post_id, user_id (who flagged it) and flag_type (ATM as a byte). I don't see how PK will make it faster in this case but I imagine it will take up 4 or 8 bytes per row. I was thinking about indexing post_id. If I do should I still create a PK?
At a minimum, I would make the primary key to be a combination of:
post_id
user_id
The reason being that a primary key ensures that there can't be duplicates.
A primary key can be made up of more than one column; this is called a compound key. It means that the pair of values is unique. That is, you can't have more than one row with the combination 1, 1, but you could have 1, 2 and 1, 3, etc. (and vice versa). Attempts to add duplicates will result in duplicate primary key errors.
Primary keys help speed up lookups and joins, so it's always nice to have if you can.
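The compound key on post_id and user_id can be sketched as follows with Python's sqlite3 module. The `post_flag` table and its three columns come from the question; the helper function `flag` is an assumption for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE post_flag (
        post_id   INTEGER NOT NULL,
        user_id   INTEGER NOT NULL,
        flag_type INTEGER NOT NULL,
        PRIMARY KEY (post_id, user_id)  -- one flag per user per post
    )
""")

def flag(post_id, user_id, flag_type):
    """Return True if the flag was stored, False if this user already flagged the post."""
    try:
        with conn:  # commit on success, roll back on error
            conn.execute(
                "INSERT INTO post_flag VALUES (?, ?, ?)", (post_id, user_id, flag_type)
            )
        return True
    except sqlite3.IntegrityError:
        return False

print(flag(1, 1, 0))  # first flag by user 1 on post 1
print(flag(1, 2, 0))  # a different user may flag the same post
print(flag(1, 1, 3))  # a repeat by user 1 is rejected by the key
```

With the key in place, the "has this user already flagged it" check from the question no longer needs a separate SELECT before the INSERT, and the key's index also serves lookups by post_id because that is its leading column.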
You don't need a primary key, not even if users are going to modify rows. A primary key optimizes the performance every time you query that table though. If you think your table will grow larger than about a thousand rows or so, then setting a primary key will give a noticeable performance boost.
The only advantage in not creating a primary key really is that it means you don't have to create one, which is fair enough I suppose :-P
You could just not bother creating one for now. You can always add one later. Not a big deal. Don't let anyone bully you into thinking you absolutely must create a primary key right now! You'll see it being horribly slow soon enough :-P and then you can just add the primary key at that point. If you don't have too many duplicates by then :-P
Best have one, if just because you may have to delete the occasional record manually (e.g. duplicates) and one should have a unique identifier for that.
The simple answer is yes. Every table should have a primary key (made of at least one column). What benefit do you get from not having one?
In such a situation, you might be able to get away without one, but I'd be inclined to throw a primary key on there anyway, simply because it's relatively simple to do and will save rework if the requirements change.
The software requirements may change rapidly. The customer may introduce new requirements. So having a primary key may be useful, because it lets you avoid unnecessary data migrations in such situations.
Read this: "Is it OK not to use a Primary Key When I don’t Need one?"
Yes, you do need a primary key.
You may as well use text files for storage if you don't think you do because it means you don't understand them...

Should I have a primary ID? I am indexing another field

Using SQLite, I need a table to hold a blob storing an MD5 hash and a 4-byte int. I plan to index the int, but this value will not be unique.
Do I need a primary key for this table? And is there an issue with indexing a non-unique value? (I assume there is no issue.)
Personally, I like to have a unique primary id on all tables. It makes finding unique records for updating/deleting easier.
How are you going to reference a single row in a SELECT * FROM Table WHERE ... or an UPDATE ... WHERE ...? Are you sure you want every matching row?
You already have one.
SQLite automatically creates an integer ROWID column for every row of every table (unless the table is declared WITHOUT ROWID). This can function as a primary key if you don't declare your own.
In general it's a good idea to declare your own primary key column. In the particular instance you mentioned, ROWID will probably be fine for you.
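A short sketch of the implicit ROWID, using Python's sqlite3 module. The table and column names (`hashes`, `digest`, `bucket`) are invented to match the blob-plus-int layout from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# No primary key is declared; SQLite still gives every row a rowid.
conn.execute("CREATE TABLE hashes (digest BLOB NOT NULL, bucket INTEGER NOT NULL)")
conn.execute("CREATE INDEX idx_hashes_bucket ON hashes (bucket)")  # non-unique index

conn.executemany(
    "INSERT INTO hashes VALUES (?, ?)",
    [(b"\x01" * 16, 7), (b"\x02" * 16, 7), (b"\x03" * 16, 9)],
)

rows = conn.execute("SELECT rowid, bucket FROM hashes ORDER BY rowid").fetchall()
print(rows)

# The rowid can single out one of several rows that share a bucket value.
first_rowid = rows[0][0]
conn.execute("DELETE FROM hashes WHERE rowid = ?", (first_rowid,))
```

The non-unique index on `bucket` is accepted without complaint, and two rows sharing the value 7 remain individually addressable through their rowids.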
My advice is to go with a primary key if you want to have referential integrity. However, there is no issue with indexing a non-unique value. The only thing is that your performance will degrade a little.
What are the consequences of letting two identical rows somehow get into this table?
One consequence is, of course, wasted space. But I'm talking about something more fundamental, here. There are times when duplicate rows in data give you wrong results. For example, if you grouped by the int column (field), and listed the count of rows in each group, a duplicate row (record) might throw you off, depending on what you are really looking for.
Relational databases work better if they are based on relations. Relations are always in first normal form. The primary reason for declaring a primary key is to prevent the table from getting out of first normal form, and thus not representing a relation.
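The grouping problem described above can be reproduced in a few lines with Python's sqlite3 module. The table layout and the sample rows are invented; one row is deliberately loaded twice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hashes (digest BLOB, bucket INTEGER)")  # no key at all
conn.executemany(
    "INSERT INTO hashes VALUES (?, ?)",
    [(b"aa", 7), (b"bb", 7), (b"bb", 7), (b"cc", 9)],  # (b"bb", 7) is loaded twice
)

# Counting rows per bucket includes the accidental duplicate.
raw = dict(conn.execute("SELECT bucket, COUNT(*) FROM hashes GROUP BY bucket"))

# Counting distinct rows per bucket gives the figure that was probably intended.
distinct = dict(conn.execute(
    "SELECT bucket, COUNT(*) FROM (SELECT DISTINCT digest, bucket FROM hashes) "
    "GROUP BY bucket"
))

print(raw)
print(distinct)
```

Declaring a primary key on `(digest, bucket)` would make the second insert of `(b"bb", 7)` fail at load time, so the two counts could never diverge.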