General SQL question about Primary Keys

I know this is pretty elementary but here it goes.
I would like to know how you know what columns are a primary key in a table that does not have a primary key? Is there a technique or something that I should read?
Thank you in advance

You need to take a look at your data structures.
A primary key must:
- never be NULL (no exceptions)
- reliably and uniquely identify each single row
and it helps if it's:
- small and easy to use
- stable (doesn't change at all, or at least not often)
- a single column (or at most two)
Check your data - which columns or sets of columns can fulfill these requirements?
Once you have those potential primary keys (the "candidate keys") - think about how you will access the data, and what other data might need to be associated with this one entity in question - what would make sense as a foreign key? Do you want to reference your department by its name? Probably not a good idea, since the name could be misspelled, it might change over time etc. By the department's office location? Bad choice, too. But something like a unique "department ID" might be a good idea.
If you don't find any appropriate column(s) in your actual data that could serve as primary key and would make sense, it's a common practice to introduce a "surrogate key" - an extra column, often an INT (and often something like an "auto-increment" INT) that will serve as an artificial identifier for each row. If you do this, one common best practice is to never show that artificial key on any data screen - it has no meaning whatsoever to the users of your system - so don't even show it to them.
Checking these requirements, and a lot of experience, will help you find the right primary key.
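The "check your data" step above can be sketched in code. This is a minimal sketch using Python's built-in sqlite3 module; the department table, its columns and the candidate_keys helper are hypothetical names made up for illustration. It only tests the two hard requirements (no NULLs, no duplicates) against the rows that exist today, so a column it reports is merely a candidate: whether it stays unique and stable is still a judgment call.

```python
import sqlite3
from itertools import combinations

def candidate_keys(conn, table, max_cols=2):
    """Return column combinations that are NULL-free and unique in the current data."""
    # Note: table and column names are interpolated directly, so only pass trusted names.
    cols = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    found = []
    for size in range(1, max_cols + 1):
        for combo in combinations(cols, size):
            # A superset of a key already found is trivially unique; skip it.
            if any(set(key) <= set(combo) for key in found):
                continue
            col_list = ", ".join(combo)
            not_null = " AND ".join(f"{c} IS NOT NULL" for c in combo)
            # Rows containing a NULL are filtered out, so they can never reach `total`.
            distinct = conn.execute(
                f"SELECT COUNT(*) FROM "
                f"(SELECT DISTINCT {col_list} FROM {table} WHERE {not_null})"
            ).fetchone()[0]
            if distinct == total:
                found.append(combo)
    return found

# Hypothetical sample data: names are unique, locations repeat, one extension is NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE department (dept_name TEXT, location TEXT, phone_ext TEXT)")
conn.executemany(
    "INSERT INTO department VALUES (?, ?, ?)",
    [("Sales", "Floor 1", "100"),
     ("Support", "Floor 1", "200"),
     ("Research", "Floor 2", None)],
)
keys = candidate_keys(conn, "department")
print(keys)
```

Here only dept_name passes the mechanical test, yet as discussed above a name can be misspelled or change, which is exactly the kind of reasoning the query cannot do for you.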

It really depends on the data itself. You need to determine what fields can be used to identify the record uniquely.

In SQL Server it'll have a key icon next to it. It's typically named ID or something with ID in it. It's also unique and typically auto-increments. When you look at the table in SQL Server Management Studio under table design, you'll see it towards the top of the list of columns with the little key icon.
It's a unique identifier that distinguishes each record from the others, kind of like how each person has an SSN.

Related

SQL - Must there always be a primary key?

There are a couple of similar questions already out there and the consensus seemed to be that a primary key should always be created.
But what if you have a single row table for storing settings (and let's not turn this into a discussion about why it might be good/bad to create a single row table please)?
Surely having a primary key on a single row table becomes completely useless?
It may seem completely useless, but it's also completely harmless, and I'd vote for harmless with good design principles vs. useless with no design principles every time.
Other people have commented, rightly, that you don't know how you're going to use the table in a year or five years... what if someone comes along and decides they want to duplicate the configuration -- move it to a distributed environment or add a test environment by using a duplicate configuration string or whatever. Having a field that acts like a primary key means that whenever you query the table, if you use the key, you'll be certain no matter what anyone else may do to your table, that you're getting the correct record.
You're right there are a million other aspects -- surrogate keys vs. intelligent keys, indexing, partitioning (silly on a single row table, I know), whatever... but without getting into that I'd vote add the key rather than not add it. You could have done it by the time you read this thread.
Short answer: no key, duplicate records possible. You're planning a single row now, but what about six months in the future, when your single row multiplies? Put a primary key on the table, even for a single row.
You could always base your primary key on the name of the setting. Then your table would become a key-value store.
But no, in many RDBMS you are not REQUIRED to have a primary key per table.
Having declared a primary key on a single row table in SQL will ensure that there will be no duplicates. Whether it is useless depends on your requirements. Usually it is a good idea to avoid duplicates.
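As an illustration of the key-value idea mentioned above, here is a minimal sketch using Python's built-in sqlite3 module. The settings table, its column names and the try_insert helper are assumptions made up for this example; the point is only that a primary key on the setting name makes the database itself refuse a second row for the same setting.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# The setting name is the primary key, so each setting can exist only once.
conn.execute("CREATE TABLE settings (name TEXT PRIMARY KEY, value TEXT NOT NULL)")
conn.execute("INSERT INTO settings (name, value) VALUES ('theme', 'dark')")

def try_insert(name, value):
    """Return True if the row was stored, False if the key constraint rejected it."""
    try:
        conn.execute("INSERT INTO settings (name, value) VALUES (?, ?)", (name, value))
        return True
    except sqlite3.IntegrityError:
        return False

new_setting_accepted = try_insert("timeout", "30")    # a new name is fine
duplicate_accepted = try_insert("theme", "light")     # 'theme' already exists
row_count = conn.execute("SELECT COUNT(*) FROM settings").fetchone()[0]
theme = conn.execute("SELECT value FROM settings WHERE name = 'theme'").fetchone()[0]
print(new_setting_accepted, duplicate_accepted, row_count, theme)
```

Without the key, the second 'theme' row would have been stored silently, and every later read would have to guess which one is current.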

Using string as PK vs using GUID or int Id with Unique Constraint for Names

Hi I was wondering what is the best practice for tables in which you have a record that must be unique. I've seen the two ways of doing that: use a Primary Key or add a Unique constraint to the column.
If you use a primary key, is it bad practice to have a primary key such as "UserName" that is varchar(*)? Does that impact performance enough that it is problematic? Is it best to use an integer id with a unique constraint on the username?
I see some other factors that may impact choosing a column as PK vs Unique. Am I right about these?
PK
- Column should be one that doesn't ever need to be changed
Unique
- Column could be changed later on
Having a primary key on the UserName is not the best idea, but it isn't as bad for performance as you might think.
The best idea would be using an ID (INT) as PRIMARY KEY and the UserName as UNIQUE.
Usernames change over time, which is why they are a bad candidate for a PK, especially since it is extremely likely you have child records associated with the username. For instance, suppose my username included some variation of my real name. If I then got divorced and returned to my maiden name, the last thing I want to do is be reminded of that SOB I was married to, and so I change my username. Do you really want to change the 2 million posts I've made in the last ten years as well? I didn't think so.
Yes, string comparisons are slower, but this may or may not be an issue depending on the overall amount of action the database will get. For a small company database with fewer than 200 users it's probably not a problem; for an Internet site with millions of users it's much more likely to be a problem.
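The integer-ID-plus-UNIQUE layout recommended above can be sketched as follows, using Python's built-in sqlite3 module. The users and posts tables and the sample names are hypothetical. The sketch shows that a rename touches one row in users, while the child rows, which reference the numeric id, are left alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.execute(
    "CREATE TABLE users (id INTEGER PRIMARY KEY, username TEXT NOT NULL UNIQUE)"
)
conn.execute(
    "CREATE TABLE posts ("
    " id INTEGER PRIMARY KEY,"
    " user_id INTEGER NOT NULL REFERENCES users(id),"
    " body TEXT)"
)
conn.execute("INSERT INTO users (id, username) VALUES (1, 'jane_smith')")
conn.executemany(
    "INSERT INTO posts (user_id, body) VALUES (?, ?)",
    [(1, "first post"), (1, "second post")],
)

# Renaming the user updates exactly one row; no post has to change.
rows_touched = conn.execute(
    "UPDATE users SET username = 'jane_doe' WHERE id = 1"
).rowcount
posts_under_new_name = conn.execute(
    "SELECT COUNT(*) FROM posts JOIN users ON users.id = posts.user_id"
    " WHERE users.username = 'jane_doe'"
).fetchone()[0]

# The UNIQUE constraint still prevents two accounts sharing a username.
try:
    conn.execute("INSERT INTO users (username) VALUES ('jane_doe')")
    duplicate_name_accepted = True
except sqlite3.IntegrityError:
    duplicate_name_accepted = False

print(rows_touched, posts_under_new_name, duplicate_name_accepted)
```

Had username been the primary key, the same rename would have needed to rewrite (or cascade to) every referencing row in posts.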
It may or may not be a good idea as others have already discussed. Let me just add one more detail...
I see some other factors that may impact choosing a column as PK vs Unique.
The main difference is usually related to clustering. Most DBMSes (that support clustering) automatically use the PK as a clustering index. For example, MySQL/InnoDB always clusters data and you can't even turn it off, while MS SQL Server clusters by default (you have to use special syntax to turn it off).
Should you choose to use clustering (or are forced by your DBMS), having fewer indexes is usually better (e.g. see "Disadvantages of clustering" in this article), even when leading to "fatter" foreign keys.

Do I need a primary key if something will NOT be changed?

If I had a site where a user can flag another user's post and it cannot be undone or changed, do I need to have a primary key? All my selects would be on the post_id and with a where clause to see if the user already flagged it.
It seems to me from some of your other posts that the reason you are trying to avoid adding a primary key to your table is to save space.
Stop thinking like that.
It's a bad idea to make non-standard optimizations like this without having tested them first to see if they actually work. Have you run some tests that show that you save a significant amount of space in your database by omitting the primary key on this table? Or are you just guessing?
Using a primary key doesn't necessarily mean that you will use more space. Depending on the database, if you omit the primary key it might add a hidden field for you anyway (for example if you don't have a PK in MySQL/InnoDB it adds a hidden clustered index on a synthetic column containing 6 byte row ID values (source)). If you do use a primary key, rather than adding a new column you can just choose some existing columns that you know should be unique anyway. It won't take up any more space, it will just mean that the data will be stored in a different order to make it easier to search.
When you add an index, that index is going to take up extra space, as an index is basically just a copy of a few columns of the table, plus a link back to the row in the original table. Remember that hidden column the database uses when you don't have a PK? Well now it has to use that to find your rows, so you'll get a copy of it in your index too. If you use a primary key then you probably don't need one of your indexes that you would have added, so you're actually saving space here.
Besides all this, some useful database tools just won't work well if you don't have a primary key on your table. You will annoy everyone that has to maintain your database after you are gone.
So tell me, why do you think it's a good idea to NOT have one?
A primary key has nothing to do with whether data can be changed - it's a single point of reference for an entire row, which can make looking up and/or changing data faster.
All my selects would be on the post_id and with a where clause to see if the user already flagged it.
You need to provide more information about business rules. For example, should the system support more than one user flagging the same post?
If the answer is "no", then I would model a POST_STATUS_CODE table and have a foreign key to the table in your POSTS table.
If the answer is "yes", then I would still have a POST_STATUS_CODE table but also a table linking the POSTS and POST_STATUS_CODE tables - say POSTS_STATUS_XREF.
I have a post_flag table with post_id, user_id (who flagged it) and flag_type (ATM as a byte). I don't see how PK will make it faster in this case but I imagine it will take up 4 or 8 bytes per row. I was thinking about indexing post_id. If I do should I still create a PK?
At a minimum, I would make the primary key to be a combination of:
post_id
user_id
The reason being that a primary key ensures that there can't be duplicates.
A primary key can be made up of more than one column - this is called a compound (or composite) key. It means that the pair of values is unique, i.e. you can't have more than one row with the combination 1,1, but you could also have 1,2, 1,3, etc. (and vice versa). Attempts to add duplicates will result in duplicate primary key errors.
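A minimal sketch of that compound key, using Python's built-in sqlite3 module. The post_flag table and its three columns come from the question; the flag and already_flagged helpers and the sample values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE post_flag ("
    " post_id INTEGER NOT NULL,"
    " user_id INTEGER NOT NULL,"
    " flag_type INTEGER NOT NULL,"
    " PRIMARY KEY (post_id, user_id))"  # the pair must be unique, not each column
)

def flag(post_id, user_id, flag_type):
    """Store a flag; return False if this user already flagged this post."""
    try:
        conn.execute(
            "INSERT INTO post_flag (post_id, user_id, flag_type) VALUES (?, ?, ?)",
            (post_id, user_id, flag_type),
        )
        return True
    except sqlite3.IntegrityError:
        return False

def already_flagged(post_id, user_id):
    """The lookup from the question; it can be served by the primary key's index."""
    row = conn.execute(
        "SELECT 1 FROM post_flag WHERE post_id = ? AND user_id = ?",
        (post_id, user_id),
    ).fetchone()
    return row is not None

# (1,1), (1,2) and (2,1) are all distinct pairs; the second (1,1) is rejected.
results = [flag(1, 1, 0), flag(1, 2, 0), flag(2, 1, 0), flag(1, 1, 3)]
print(results, already_flagged(1, 1), already_flagged(2, 2))
```

No extra ID column is added here, so the key costs no additional column storage; it reuses the two columns the table needs anyway.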
Primary keys help speed up lookups and joins, so it's always nice to have if you can.
You don't need a primary key, not even if users are going to modify rows. A primary key optimizes the performance every time you query that table though. If you think your table will grow larger than about a thousand rows or so, then setting a primary key will give a noticeable performance boost.
The only advantage in not creating a primary key really is that it means you don't have to create one, which is fair enough I suppose :-P
You could just not bother creating one for now. You can always add one later. Not a big deal. Don't let anyone bully you into thinking you absolutely must create a primary key right now! You'll see it being horribly slow soon enough :-P and then you can just add the primary key at that point. If you don't have too many duplicates by then :-P
Best have one, if just because you may have to delete the occasional record manually (e.g. duplicates) and one should have a unique identifier for that.
The simple answer is yes. Every table should have a primary key (made of at least one column). What benefit do you get from not having one?
In such a situation, you might be able to get away without one, but I'd be inclined to throw a primary key on there anyway, simply because it's relatively simple to do and will save rework if the requirements change.
The software requirements may change rapidly. The customer may introduce new requirements. So having a primary key may be useful, because it can spare you totally unnecessary data migrations in such situations.
Read this: "Is it OK not to use a Primary Key When I don’t Need one?"
Yes, you do need a primary key.
You may as well use text files for storage if you don't think you do because it means you don't understand them...

Is it OK not to use a Primary Key When I don't Need one

If I don't need a primary key should I not add one to the database?
You do need a primary key. You just don't know that yet.
A primary key uniquely identifies a row in your table.
The fact it's indexed and/or clustered is a physical implementation issue and unrelated to the logical design.
You need one for the table to make sense.
If you don't need a primary key then don't use one. I usually have the need for primary keys, so I usually use them. If you have related tables you probably want primary and foreign keys.
Yes, but only in the same sense that it's okay not to use a seatbelt if you're not planning to be in an accident. That is, it's a small price to pay for a big benefit when you need it, and even if you think you don't need it odds are you will in the future. The difference is you're a lot more likely to need a primary key than to get in a car accident.
You should also know that some database systems create a primary key for you if you don't, so you're not saving that much in terms of what's going on in the engine.
No, unless you can find an example of, "This database would work so much better if table_x didn't have a primary key."
You can make an argument to never use a primary key, if performance, data integrity, and normalization are not required. Security and backup/restore capabilities may not be needed, but eventually, you put on your big-boy pants and join the real world of database implementation.
Yes, a table should ALWAYS have a primary key... unless you don't need to uniquely identify the records in it. (I like to make absolute statements and immediately contradict them)
When would you not need to uniquely identify the records in a table? Almost never. I have done this before, though, for things like audit log tables: data that won't be updated or deleted, and won't be constrained in any way. Essentially structured logging.
A primary key will always help with query performance. So if you ever need to query using the "key" as the target of a "foreign key", or use it as a lookup, then yes, create a primary key.
I don't know. I have used a couple tables where there is just a single row and a single column. Will always only be a single row and a single column. There is no foreign key relationships.
Why would I put a primary key on that?
A primary key is mainly formally defined to aid referential integrity; however, if the table is very small, or is unlikely to contain unique data, then it's an unnecessary overhead.
Defining indexes on the table can normally be used to imply a primary key without formally declaring one.
However you should consider that defining the Primary key can be useful for Developers and Schema generation or SQL Dev tools, as having the meta data helps understanding, and some tools rely on this to correctly define the Primary/foreign key relationships in the model.
Well...
Each table in a relational DB needs a primary key. As already noted, a primary key is data that identifies a record uniquely...
You might get away with not having an "ID" field if you have an N-M table that joins 2 different tables, but you can uniquely identify the record by the values from both columns you join. (Composite primary key)
Having a table without a primary key is against the first normal form, and has no place in a relational DB.
You should always have a primary key, even if it's just on ID. Maybe NoSQL is what you're after instead (just asking)?
That depends very much on how sure you can be that you don't need one. If you have just the slightest bit of doubt, add one - you'll thank yourself later. An indicator being if the data you store could be related to other data in your DB at one point.
One use case I can think of is a logging kind of table, in which you simply dump one entry after the other (to properly process them later). You probably won't need a primary key there, if you're storing enough data to filter out the relevant messages (like a date). Of course, it's questionable to use an RDBMS for this.

What should I consider when selecting a data type for my primary key?

When I am creating a new database table, what factors should I take into account for selecting the primary key's data type?
Sorry to do that, but I found that the answers I gave to related questions (you can check this and this) could apply to this one. I reshaped them a little bit...
You will find many posts dealing with this issue, and each choice you'll make has its pros and cons. Arguments for these usually refer to relational database theory and database performance.
On this subject, my point is very simple: surrogate primary keys ALWAYS work, while Natural keys MIGHT NOT ALWAYS work one of these days, and this for multiple reasons: field too short, rules change, etc.
To this point, you've guessed here that I am basically a member of the uniqueIdentifier/surrogate primary key team, and even if I appreciate and understand arguments such as the ones presented here, I am still looking for the case where "natural" key is better than surrogate ...
In addition to this, one of the most important but always forgotten arguments in favor of this basic rule is related to code normalization and productivity:
Each time I create a table, shall I lose time
- identifying its primary key and its physical characteristics (type, size)?
- remembering these characteristics each time I want to refer to it in my code?
- explaining my PK choice to other developers in the team?
My answer is no to all of these questions:
- I have no time to lose trying to identify "the best Natural Primary Key" when the surrogate option gives me a bullet-proof solution.
- I do not want to remember that the Primary Key of my Table_whatever is a 10 characters long string when I write the code.
- I don't want to lose my time negotiating the Natural Key length: "well if You need 10 why don't you take 12 to be on the safe side?". This "on the safe side" argument really annoys me: If you want to stay on the safe side, it means that you are really not far from the unsafe side! Choose surrogate: it's bullet-proof!
So I've been working for the last five years with a very basic rule: each table (let's call it 'myTable') has its first field called 'id_MyTable' which is of uniqueIdentifier type. Even if this table supports a "many-to-many" relation, where a field combination offers a very acceptable Primary Key, I prefer to create this 'id_myManyToManyTable' field being a uniqueIdentifier, just to stick to the rule, and because, finally, it does not hurt.
The major advantage is that you don't have to care anymore about the use of Primary Key and/or Foreign Key within your code. Once you have the table name, you know the PK name and type. Once you know which links are implemented in your data model, you'll know the name of available foreign keys in the table.
And if you still want to have your "Natural Key" somewhere in your table, I advise you to build it following a standard model such as
Tbl_whatever
id_whatever, unique identifier, primary key
code_whatever, whateverTypeYouWant(whateverLengthYouEstimateTheRightOne), indexed
.....
Where id_ is the prefix for the primary key, and code_ is used for the "natural" indexed field. Some would argue that the code_ field should be set as unique. This is true, and it can be easily managed either through DDL or external code. Note that many "natural" keys are calculated (invoice numbers), so they are already generated through code.
I am not sure that my rule is the best one. But it is a very efficient one! If everyone was applying it, we would for example avoid time lost answering to this kind of question!
If using a numeric key, make sure the datatype is going to be large enough to hold the number of rows you might expect the table to grow to.
If using a guid, does the extra space needed to store the guid need to be considered? Will coding against guid PKs be a pain for developers or users of the application?
If using composite keys, are you sure that the combined columns will always be unique?
I don't really like what they teach in school, that is, using a 'natural key' (for example the ISBN in a book database) or even having a primary key made up of 2 or more fields. I would never do that. So here's my little advice:
Always have one dedicated column in every table for your primary key.
They all should have the same column name across all tables, i.e. "ID" or "GUID"
Use GUIDs when you can (if you don't need performance), otherwise incrementing INTs
EDIT:
Okay, I think I need to explain my choices a little bit.
Having a dedicated column named the same across all tables for your primary key just makes your SQL statements a lot easier to construct, and easier for someone else (who might not be familiar with your database layout) to understand. Especially when you're doing lots of JOINs and things like that. You won't need to look up what the primary key for a specific table is; you already know, because it's the same everywhere.
GUIDs vs. INTs doesn't really matter that much most of the time. Unless you hit the performance cap of GUIDs or are doing database merges, you won't have major issues with one or the other. BUT there's a reason I prefer GUIDs. The global uniqueness of GUIDs might always come in handy some day. Maybe you don't see a need for it now, but things like synchronizing parts of the database to a laptop / cell phone, or even finding data records without needing to know which table they're in, are great examples of the advantages GUIDs can provide. An Integer only identifies a record within the context of one table, whereas a GUID identifies a record everywhere.
In most cases I use an identity int primary key, unless the scenario requires a lot of replication, in which case I may opt for a GUID.
I (almost) never used meaningful keys.
Unless you have an ultra-convenient natural key available, always use a synthetic (a.k.a. surrogate) key of a numeric type. Even if you do have a natural key available, you might want to consider using a synthetic key anyway and placing an additional unique index on your natural key. Consider what happened to higher-ed databases that used social security numbers as PKs when federal law changed: the costs of changing over to synthetic keys were enormous.
Also, I have to disagree with the practice of naming every primary key the same, e.g. "id". This makes queries harder to understand, not easier. Primary keys should be named after the table. For example employee.employee_id, affiliate.affiliate_id, user.user_id, and so on.
Do not use a floating point numeric type, since floating point numbers cannot be properly compared for equality.
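The equality problem is easy to reproduce. A minimal sketch in plain Python, whose float is the usual IEEE 754 double; the dictionary stands in for a keyed lookup and is only an illustration, not how any particular database stores its index.

```python
from decimal import Decimal

# A key value that was computed rather than typed in.
computed_key = 0.1 + 0.2

# The obvious comparison fails: the sum is not exactly the double nearest to 0.3.
float_match = (computed_key == 0.3)

# Used as a lookup key, the row stored under the computed value cannot be found via 0.3.
rows_by_key = {computed_key: "row A"}
float_lookup = rows_by_key.get(0.3)

# Exact types do not have this problem.
decimal_match = (Decimal("0.1") + Decimal("0.2") == Decimal("0.3"))
int_match = (1 + 2 == 3)

print(float_match, float_lookup, decimal_match, int_match)
```

Integer and fixed-point decimal types compare exactly, which is what a key lookup or a join on the key relies on.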
Where do you generate it? Incrementing numbers don't fit well for keys generated by the client.
Do you want a data-dependent or independent key (sometimes you could use an ID from business data, can't say if this is always useful or not)?
How well can this type be indexed by your DB?
I have used uniqueidentifiers (GUIDs) or incrementing integers so far.
Cheers
Matthias
Numbers that have meaning in the real world are usually a bad idea, because every so often the real world changes the rules about how those numbers are used, in particular to allow duplicates, and then you've got a real mess on your hands.
I'm partial to using a generated integer key. If you expect the database to grow very large, you can go with bigint.
Some people like to use guids. The pro there is that you can merge multiple instances of the database without altering any keys but the con is that performance can be affected.
For a "natural" key, whatever datatype suits the column(s). Artificial (surrogate) keys are usually integers.
It all depends.
a) Are you fine having unique sequential numbers as your primary key? If yes, then an auto-incrementing integer (identity) column as your primary key will suffice. (A UniqueIdentifier is a GUID, which is unique but not a sequential number.)
b) If your business demand is such that you need to have alpha numeric primary key, then you got to go for varchar or nvarchar.
These are the two options I could think of.
A great factor is how much data you're going to store. I work for a web analytics company, and we have LOADS of data. So a GUID primary key on our pageviews table would kill us, due to the size.
A rule of thumb: For high performance, you should be able to store your entire index in memory. Guids could easily break this!
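A rough back-of-the-envelope check of that rule of thumb. The row count is a made-up assumption, and the figures count only the raw key bytes: a real index adds per-entry and per-page overhead that depends on the database, so treat these as lower bounds. The 4 and 8 byte sizes are the usual widths of 32-bit and 64-bit integers; a GUID/UUID is 128 bits.

```python
import uuid

ROWS = 1_000_000_000  # hypothetical size of a large pageviews-style table

# Raw width of one key value in bytes.
KEY_BYTES = {
    "int": 4,
    "bigint": 8,
    "guid": len(uuid.uuid4().bytes),  # a UUID is 128 bits = 16 bytes
}

def raw_key_gb(key_type, rows=ROWS):
    """Lower bound on the space taken by the key values alone, in GB (10**9 bytes)."""
    return KEY_BYTES[key_type] * rows / 10**9

sizes = {key_type: raw_key_gb(key_type) for key_type in KEY_BYTES}
print(sizes)
```

The key width is also paid again in every secondary index and every foreign key that refers to the table, so the gap between a 4-byte and a 16-byte key multiplies.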
Use natural keys when they can be trusted. Some sources of natural keys can't be trusted. Years ago, the Social Security Administration used to occasionally mess up and assign the same SSN to two different people. They've probably fixed that by now.
You can probably trust VINs for vehicles, and ISBNs for books (but not for pamphlets, which may not have an ISBN).
If you use natural keys, the natural key will determine the datatype.
If you can't trust any natural keys, create a synthetic key. I prefer integers for this purpose. Leave enough room for reasonable expansion.
I usually go with a GUID column primary key for all tables (rowguid in MSSQL). What could be natural keys I make unique constraints. A typical example would be a product identification number that the user has to make up and ensure is unique. If I need a sequence, like in an invoice, I build a table to keep a last number and a stored procedure to ensure serialized access. Or a Sequence in Oracle :-) I hate the "social security number" example for natural keys, as that number will not always be available in a registration process, resulting in a need for a scheme to generate dummy numbers.
I usually always use an integer, but here's an interesting perspective.
https://blog.codinghorror.com/primary-keys-ids-versus-guids/
Whenever possible, try to use a primary key that is a natural key. For instance, if I had a table where I logged one record every day, the logdate would be a good primary key. Otherwise, if there is no natural key, just use int. If you think you will have more than 2 billion rows, use a bigint. Some people like to use GUIDs, which work well, as they are unique and you will never run out of space. However, they are needlessly long, and hard to type in if you are just doing ad hoc queries.
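The "2 billion" figure above is the upper limit of a signed 32-bit integer. A small sketch of the arithmetic, assuming the common case where int is a signed 32-bit type and bigint a signed 64-bit type; the insert rate of 1,000 rows per second is a made-up number for illustration.

```python
SECONDS_PER_DAY = 86_400

def signed_max(bits):
    """Largest value a signed integer of the given width can hold."""
    return 2 ** (bits - 1) - 1

def days_until_exhausted(bits, inserts_per_second):
    """Days until an auto-incrementing key of this width runs out at a steady rate."""
    return signed_max(bits) / inserts_per_second / SECONDS_PER_DAY

int_max = signed_max(32)        # 2,147,483,647
bigint_max = signed_max(64)     # 9,223,372,036,854,775,807

# Hypothetical busy table receiving 1,000 inserts per second.
int_days = days_until_exhausted(32, 1000)
bigint_years = days_until_exhausted(64, 1000) / 365

print(int_max, bigint_max, round(int_days, 1), f"{bigint_years:.3g}")
```

At that rate an int key lasts under a month, while a bigint lasts hundreds of millions of years, so for high-volume tables the extra 4 bytes per row are usually cheap insurance.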