What could be the purpose of primary keys on all tables being derived from a single table? - sql

I've really been scratching my head over this and don't know how to ask the question well enough to find an answer on Google or StackOverflow etc.
There is a very old system used at work - I don't have access to the server side so can't view its tables, but I do know it's an SQL database, and I've done enough experimenting with the API to see what adding to each table does. I'm questioning how it allocates primary keys:
It has a lot of tables, each with a primary key as expected, but the primary key on any/all of its tables seems to be allocated so that there is absolutely no duplication of primary keys anywhere in the system.
e.g.
add row to table 1 get pk = 1
add row to table 2 get pk = 2
add row to table 1 again, get pk = 3
add row to table 10 and get pk = 4
Is this method some sort of old database technique?
What could be the purpose of doing this?
There are more funny nuances that I won't go into detail on (e.g. certain ranges of pk's being allocated to certain tables), but I just wanted to see if anyone recognises the main principle here and whether there's a point to it, or if it's just bad / weird design.

A primary key only needs to be unique within a single table. There is no such thing as a primary key across multiple tables.
This might be useful under some circumstances. For instance, this would allow all entities to be represented in a single table. This can be handy for "generic" information, such as adding comments to the entities.
More prosaically, I have seen this in older Oracle databases. Oracle did not have any automated mechanism for generating ids, so this required using a sequence. As a matter of convenience, laziness, or design, multiple tables might use the same sequence -- resulting in the behavior that you see.
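For instance, a single shared sequence produces exactly the pattern you describe. A minimal sketch in Oracle-style syntax (the table and column names are invented for illustration):

-- One sequence shared by every table in the schema.
CREATE SEQUENCE global_id_seq START WITH 1 INCREMENT BY 1;

-- Every insert, regardless of table, draws its key from the same sequence.
INSERT INTO table_one (id, name) VALUES (global_id_seq.NEXTVAL, 'first row');   -- id = 1
INSERT INTO table_two (id, name) VALUES (global_id_seq.NEXTVAL, 'second row');  -- id = 2
INSERT INTO table_one (id, name) VALUES (global_id_seq.NEXTVAL, 'third row');   -- id = 3

Because the keys are drawn from one counter, no value is ever reused across tables, even though each table's primary key only needs to be unique within that table.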

Related

SQL Server database design with foreign keys

I have the following partial database design:
All the tables are dependent on each other: the table bvd_docflow_subdocuments is dependent on the table bvd_docflow_subsets, and the table bvd_docflow_documents is dependent on bvd_docflow_subdocuments. So I thought I could be smart and use foreign keys on every table (and ON DELETE CASCADE). However, the FKs keep being carried down the further I go into the tables.
The problem is that the table bvd_docflow_documents has no reason to carry a reference to the docflow_documentset_id PK / FK. Is there a way (and maybe my design is crappy) for only the table directly above it to have an FK relationship with it, rather than all the tables above it?
Edit:
More explanation:
In the bvd_docflow_subsets table, information is stored about the objects used to create documents. There is a relation between that table and the bvd_docflow_subdocuments table, which stores master data about all the documents for a subset (docflow_subset_id is in both tables). This is the link between those two tables.
Going further down, we also have the table bvd_docflow_documents, which contains the actual document data. The link between bvd_docflow_documents and bvd_docflow_subdocuments is bvd_docflow_subdocument_id.
On every table I have a foreign key defined, so when data is removed from a table all the data linked to it is also removed.
However, when we look at the bvd_docflow_documents table, it has the foreign keys from all the other tables (docflow_subset_id and docflow_documentset_id), and there is the problem. The only foreign key needed on the bvd_docflow_documents table is docflow_subdocument_id and no other.
Edit 2
I have changed my design further and removed information that I don't need after initial import of the data.
See the following link for the complete database design:
https://sqldbm.com/Project/SQLServer/Share/_AUedvNutCEV2DGLJleUWA
The tables subsets, subdocuments and documents have a many-to-many relationship, so I thought a table in between those three (documents_subdocuments) is the way to go, where I define all the different keys for those tables.
I am not used to designing the database first and then building it. But for everything there is a first time, and I am trying to make a database that follows standards and uses the power of SQL Server the correct way.
I'll address the bottom-most table and ignore the rest for the most part.
But first some comments. Your schema is simply a model of a system. To provide feedback, one must understand this "system" and how it actually works to evaluate your model. In addition, it is important to understand your entities and your reasons for choosing them and modelling them in the specified manner. Without that understanding, all of this is guessing based on experience.
And another comment. Slapping an identity column into every table is just lazy modelling IMO. Others will disagree, but you need to also enforce all natural keys. Do you have natural keys? It is rare not to have any. Enforce those that do exist.
And one last comment. Stop the ridiculous pattern of prepending the column names with the table names. And you should really think long and hard about using very long table names. Given what you have, I sense you need a schema for your docflow stuff.
For the documents table, your current PK makes no sense. Again, you've slapped an identity column into the table. By itself, this column is a key for the table. The inclusion of any other columns does not make the key any more "unique" - that inclusion is logical nonsense. Following your pattern, you would designate the identity column as the primary key. But ...
According to your image, the documents table is related to one and only one subdocument. You added a foreign key to that table - which matches the image. You also added additional columns and foreign keys to the "higher" tables. So now a document "points" to a specific subdocument. It also points to a specific subset - which may have no relationship to the subdocument. The same thought applies to the other FK. I have a doubt that this is logically correct. So why do these columns (and related FKs) exist? Perhaps this is the result of premature optimization - which everyone knows is the root of all evil coding. Again, it is impossible to know if this is "right" or even "useful" for your model.
To answer your question "... is there a way", the answer is obviously yes. You remove the columns of which you complain. You added them - Why? Is this perhaps a problem with the tool you are using?
And some last comments. There is nothing special about "varchar(50)". Perhaps this is a placeholder that will be updated later. It may also be another sign of laziness. And generally speaking, columns with names like "type" and "code" tend to be foreign keys to "lookup" tables - because people like to add, modify, or remove these sorts of categorization values over time. I'm also concerned about the column name overlap among the tables. "Location" exists in multiple tables, as do action_code and action_id. And a column named "id" (action_id) suggests a lookup to another table - is it? Should it be? Is there a relationship between action_id and action_code? From a distance it is impossible to answer any of these questions.
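As an illustration of that last point, here is a hedged sketch of what a lookup table for those categorization values might look like (all names here are invented, not taken from your schema):

-- Lookup table holding the set of allowed action types.
CREATE TABLE action_types (
    action_type_id int NOT NULL PRIMARY KEY,
    description    varchar(50) NOT NULL
);

-- The referencing table stores only the id; this assumes it has (or gains)
-- an action_type_id column of the same type.
ALTER TABLE documents
    ADD CONSTRAINT fk_documents_action_types
    FOREIGN KEY (action_type_id) REFERENCES action_types (action_type_id);

That way the categorization values can be added or renamed over time without touching every row that uses them.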
But designing a database is more art than science. Sometimes you just need to create something, populate it with some sample data, and then determine if it works for your needs. Everyone will get something wrong in the first try. That is expected; that is how you learn. The most difficult part is actually completing your first attempt.

PK on table with (simulated) polymorphic association

I am trying to set up the right indices on a table I have just created, which contains 4 "polymorphic associations" and a PK ID. The 4 associations allow me to avoid quadrupling the number of tables for the addition I am making to the database, and should not be modified in this discussion. My question is: how should I set up the indices so that I get optimal performance (speed; space not so much)? None of the 4 keys is a candidate for PK. More specifically, all 4 are, but only one at a time. I have added a PK "ID" because I had read that adding a PK, even if not used, is better than not adding one. However, I am questioning this assertion more and more.
More about the table: the logic that only 1 of the 4 FKs should be used is enforced by an Access form. No one outside the dev team has access to the tables directly. I expect there will be no more than a couple hundred entries every month for as long as this database is in use. Assuming we use it 10 more years and average 500 entries a month (which is probably a bit more than what it will be), we should have no more than 60k entries in 10 years. Basically, this is not a hugely populated table.
The db and forms run on Access 2003 (yeah I know...).
I hope that is enough information for you to help me. In the image below you can see the table structure as it is right now. The 4 FKs are NoDemandeAmendementTransit, NoDemandeAmendementRubrique, NoAmendementTransit, NoAmendementRubrique.
Many thanks.
A more practical design is to create a single supertype table for all of the four subtypes you are referencing. Then reference the supertype table with a single foreign key instead of having four separate FKs. It's a design pattern you can find in most good books on database design and it is simpler and more efficient than having multiple "optional" foreign keys. It will also provide you with a more useful primary key.
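A rough sketch of what that supertype pattern could look like (generic SQL; "Demande" and the other names are invented stand-ins for your four subtype tables and the referencing table):

-- Supertype: one row for every amendment/request, whatever its kind.
CREATE TABLE Demande (
    DemandeID   int NOT NULL PRIMARY KEY,
    DemandeType varchar(30) NOT NULL      -- which of the four subtypes this row is
);

-- Each subtype table shares the supertype's key (one shown; the other three
-- follow the same pattern).
CREATE TABLE DemandeAmendementTransit (
    DemandeID int NOT NULL PRIMARY KEY REFERENCES Demande (DemandeID)
    -- subtype-specific columns ...
);

-- The new table then needs a single FK to the supertype instead of four
-- mutually exclusive, optional FKs.
CREATE TABLE Suivi (
    SuiviID   int NOT NULL PRIMARY KEY,
    DemandeID int NOT NULL REFERENCES Demande (DemandeID)
);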

SQL Server, does the id change if an element gets deleted?

I wondered about this: if I insert, let's say, 10 entries into a SQL Server table and then delete one of them, will the id/index change correspondingly?
Example:
1 | Simon Cowell | 56 years
2 | Frank Lampard| 24 years
3 | Harry Bennet | 12 years
If I delete #2, will Harry Bennet's index change to 2?
Thanks :)
EDIT:
Sorry for my outrage, had a bad day. And yes, I should have researched it myself, I deserve to be downvoted.
I don't ask for anything, I just want to say that I'm sorry :|
Since you seem to be conflating "id/index", let's talk a little bit about the primary key and indexes in the context of a relational database.
The "id" or primary key assigned to a row in a SQL database is the unique identifier for that row. It can consist of one or more columns. (When more than one column is involved it is known as a "composite" or "multi-part" key.) The primary key should really do nothing more than be a unique handle for addressing a row: the primary key should not contain any information about the entity represented by the row, especially if that info has the potential to be mutable; an example would be a part number that has a suffix that stands for the type of metal the part is made from; if that metal can possibly change from titanium to unobtainium, say, that part number would make a bad choice as a primary key; it would be better to have another column to store the type of metal than to make the metal-type suffix part of the primary key. "Meaningful" primary keys might have made some sense in legacy non-relational databases but in a relational database they are to be avoided.
When seeking to enforce the uniqueness of a primary key, a database engine can make use of an index so it can rapidly test whether the key value exists. It could use a binary algorithm to find the value, avoiding the need to scan the actual data "brute force", row by row, looking for the value. But the index that is used behind the scenes by the engine to assist it with the primary key housekeeping is not the same as the primary key itself.
If you have a simple sequential integer as your primary key, there's an infinite number of them, so there is no need to reuse an integer when it becomes available when the row to which it was assigned has been deleted. So the relational database engine won't automatically attempt to reuse it, and it won't by any means change the primary key values that have been assigned to all other rows in the table when "gaps" in the number sequence are created by a deletion. Many other rows in other tables could be referencing those values and having them be mutable would create either chaos or a huge inefficiency.
Hashing algorithms are another very efficient way a database engine can quickly test for the existence of a key value. It computes the location in the hashed-file where the key would be if it did exist, and then looks there for it. The rows are stored in no particular order, so such schemes are optimized for instant finding of records in a large table, not for culling records that have a common characteristic, such as all customers in zipcode 10023.
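If you want to see it for yourself, here is a quick throwaway test (SQL Server syntax, using a temp table so nothing permanent is created):

-- Throwaway table with an IDENTITY primary key.
CREATE TABLE #people (
    id   int IDENTITY(1,1) PRIMARY KEY,
    name varchar(50) NOT NULL
);

INSERT INTO #people (name) VALUES ('Simon Cowell');   -- id = 1
INSERT INTO #people (name) VALUES ('Frank Lampard');  -- id = 2
INSERT INTO #people (name) VALUES ('Harry Bennet');   -- id = 3

DELETE FROM #people WHERE id = 2;

SELECT id, name FROM #people;
-- Returns ids 1 and 3: Harry Bennet keeps id 3, and the gap at 2 simply remains.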
No. You can set up triggers or logic to do it if you want; however, it will not automatically do this.
No, it will not change automatically.
No, it won't. And hopefully, that's the answer you're hoping for. For any auto-generated identifiers (such as IDENTITY columns), you should, so far as possible, ignore the data type and treat it as an opaque "blob" of identity information.
It gets assigned during insert, and you can use it for cross-referencing purposes, but the fact that it's numeric is not something you should use or rely upon. It's just a stable identifier for the row.

Changing a table's primary key column referenced by foreign key in other tables

In our DB (on SQL Server 2005) we have a "Customers" table, whose primary key is Client Code, a surrogate bigint IDENTITY(1,1) key; the table is referenced by a number of other tables in our DB through foreign keys.
A new CR we are estimating would require us to change the ID column type to varchar, with the Client Code generation algorithm shifting from a simple numeric progression to a strict 2-char representation: codes range from 01 to 99, then progress like this:
1A -> 2A -> ... -> 9A -> 1B -> ... 9Z
I'm fairly new to database design, but I smell some serious problems here. First of all, what about this client code generation algorithm? What if I need a Client Code to go beyond the 9Z limit?
Then I have some questions: would this change be feasible, given that the table is already filled with a fair amount of data and referenced by multiple entities? If so, how would you approach this problem, and how would you implement Client Code generation?
I would leave the primary key as it is and would create another key (unique) on the client code generated.
I would do that anyway. It's always better to have a short number primary key instead of long char keys.
In some situations you might prefer a GUID (for replication purposes), but a number (int/bigint) is always preferable.
You can read more here and here.
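A minimal sketch of that approach, assuming the existing Customers table keeps its bigint identity PK (the ClientCode column name is just an example):

-- Keep the existing surrogate key; add the generated client code as its own column.
ALTER TABLE Customers ADD ClientCode varchar(2) NULL;

-- Once the codes are populated, enforce uniqueness with a separate constraint.
ALTER TABLE Customers ADD CONSTRAINT UQ_Customers_ClientCode UNIQUE (ClientCode);

The other tables can keep referencing the bigint key, and the human-facing code stays a plain attribute.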
My biggest concern with what you are proposing is that you will be limited to 360 primary records. That seems like a small number.
Performing the change is a multi-step operation. You need to create the new field in the core table and all its related tables.
To do an in-place update, you need to generate the code in the core table. Then you need to update all the related tables to have the code based on the old id. Then you need to add the foreign key constraint to all the related tables. Then you need to remove the old key field from all the related tables.
We only did that in our development server. When we upgraded the live databases, we created a new database for each and copied the data over using a python script that queried the old database and inserted into the new database. I now update that script for every software upgrade so the core engine stays the same, but I can specify different tables or data modifications. I get the bonus of having a complete backup of the original database if something unexpected happens when upgrading production.
One strong argument in favor of a non-identity/guid code is that you want a human readable/memorable code and you need to be able to move records between two systems.
Performance is not necessarily a concern in SQL Server 2005 and 2008. We recently went through a change where we moved from int ids everywhere to 7 or 8 character "friendly" record codes. We expected to see some kind of performance hit, but we in fact saw a performance improvement.
We also found that we needed a way to quickly generate a code. Our codes have two parts, a 3 character alpha prefix and a 4 or 5 digit suffix. Once we had a large number of codes (15000-20000), we were finding it too slow to parse the code into prefix and suffix and find the lowest unused code (it took several seconds). Because of this, we also store the prefix and the suffix separately (in the primary key table) so that we can quickly find the next available lowest code with a particular prefix. The cached prefix and suffix made the search almost free.
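As a sketch of the kind of query that becomes cheap once prefix and suffix live in their own (indexed) columns -- the table and column names here are invented, not our actual schema:

-- Find the lowest suffix for a given prefix whose successor is still free.
-- With an index on (CodePrefix, CodeSuffix) this is a seek rather than a parse
-- of every code string.
SELECT MIN(c.CodeSuffix) + 1 AS NextSuffix
FROM   Codes c
WHERE  c.CodePrefix = 'ABC'
  AND  NOT EXISTS (SELECT 1
                   FROM   Codes c2
                   WHERE  c2.CodePrefix = c.CodePrefix
                     AND  c2.CodeSuffix = c.CodeSuffix + 1);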
We allow the codes to be changed, and the changed values propagate via cascade update rules on the foreign key relationships. We keep an identity key on the core code table to simplify updating the code.
We don't use an ORM, so I don't know what specific things to be aware of with that. We also have on the order of 60,000 primary keys in our biggest instance, but have hundreds of tables related and tables with millions of related values to the code table.
One big advantage that we got was, in many cases, we did not need to do a join to perform operations. Everywhere in the software the user references things by friendly code. We don't have to do a lookup of the int ID (or a join) to perform certain operations.
The new code generation algorithm isn't worth thinking about. You can write a program to generate all possible codes in just a few lines of code. Put them in a table, and you're practically done. You just need to write a function to return the smallest one not yet used. Here's a Ruby program that will give you all the possible codes.
# test.rb -- generate a peculiar sequence of two-character codes.
i = 1
('A'..'Z').each do |c|
  (1..9).each do |n|
    printf("'%d%s', %d\n", n, c, i)
    i += 1
  end
end
The program produces CSV-style output that you should be able to import easily into a table. You need two columns to control the sort order. The new values don't naturally sort the way your requirements specify.
I'd be more concerned about the range than the algorithm. If you're right about the requirement, you're limited to 234 client codes. If you're wrong, and the range extends from "1A" to "ZZ", you're limited to less than a thousand.
To implement this requirement in an existing table, you need to follow a careful procedure. I'd try it several times in a test environment before trying it on a production table. (This is just a sketch. There are a lot of details.)
1. Create and populate a two-column table to map existing bigints to the new CHAR(2).
2. Create new CHAR(2) columns in all the tables that need them.
3. Update all the new CHAR(2) columns.
4. Create new NOT NULL UNIQUE or PRIMARY KEY constraints and new FOREIGN KEY constraints on the new CHAR(2) columns.
5. Rewrite user interface code (?) to target the new columns. (Might not be necessary if you rename the new CHAR(2) and old BIGINT columns.)
6. Set a target date to drop the old BIGINT columns and constraints.
And so on.
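A rough T-SQL sketch of the first few steps, with invented table names (Orders stands in for any referencing table):

-- 1. Mapping table from the old bigint keys to the new CHAR(2) codes.
CREATE TABLE ClientCodeMap (
    OldClientID bigint  NOT NULL PRIMARY KEY,
    NewCode     char(2) NOT NULL UNIQUE
);
-- (populate it from the generated list of codes, in whatever order you assign them)

-- 2. Add and fill the new column in each referencing table.
ALTER TABLE Orders ADD ClientCode char(2) NULL;

UPDATE o
SET    o.ClientCode = m.NewCode
FROM   Orders o
JOIN   ClientCodeMap m ON m.OldClientID = o.ClientID;

-- 3. Once every table is filled in, add the new constraints
--    (assumes Customers.ClientCode already has its UNIQUE or PRIMARY KEY constraint).
ALTER TABLE Orders
    ADD CONSTRAINT FK_Orders_Customers_Code
    FOREIGN KEY (ClientCode) REFERENCES Customers (ClientCode);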
Not really addressing whether this is a good idea or not, but you can change your foreign keys to cascade the updates. What will happen once you're done doing that is that when you update the primary key in the parent table, the corresponding key in the child table will be updated accordingly.
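For example, with hypothetical parent/child tables:

-- With ON UPDATE CASCADE, changing ParentTable.ParentCode automatically rewrites
-- the matching values in the child table.
ALTER TABLE ChildTable
    ADD CONSTRAINT FK_Child_Parent
    FOREIGN KEY (ParentCode) REFERENCES ParentTable (ParentCode)
    ON UPDATE CASCADE;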

ID fields in SQL tables: rule or law?

Just a quick database design question: Do you ALWAYS use an ID field in EVERY table, or just most of them? Clearly most of your tables will benefit, but are there ever tables that you might not want to use an ID field?
For example, I want to add the ability to add tags to objects in another table (foo). So I've got a table FooTag with a varchar field to hold the tag, and a fooID field to refer to the row in foo. Do I really need to create a clustered index around an essentially arbitrary ID field? Wouldn't it be more efficient to use fooID and my text field as the clustered index, since I will almost always be searching by fooID anyway? Plus using my text in the clustered index would keep the data sorted, making sorting easier when I have to query my data. The downside is that inserts would take longer, but wouldn't that be offset by the gains during selection, which would happen far more often?
What are your thoughts on ID fields? Bendable rule, or unbreakable law?
edit: I am aware that the example provided is not normalized. If tagging is to be a major part of the project, with multiple tables being tagged and other 'extras', a two-table solution would be a clear answer. However, in this simple case, would normalization be worthwhile? It would save some space, but require an extra join when running queries.
As in much of programming: rule, not law.
Proof by exception: Some two-column tables exist only to form relationships between other more meaningful tables.
If you are making tables that bridge between two or more other tables and the only fields you need are the dual PK/FK's, then I don't know why you would need ID column in there as well.
ID columns generally can be very helpful, but that doesn't mean you should go peppering them in at every occasion.
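For example, a pure bridge table can use the two FK columns together as its primary key (names invented):

-- Pure many-to-many link table: the two FK columns jointly form the primary key,
-- so no separate ID column is needed.
CREATE TABLE StudentCourse (
    StudentID int NOT NULL REFERENCES Students (StudentID),
    CourseID  int NOT NULL REFERENCES Courses (CourseID),
    PRIMARY KEY (StudentID, CourseID)
);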
As others have said, it's a general, rather than absolute, rule and there are plenty of exceptions (tables with composite keys for example).
There are occasional but useful cases where you might want to create an artificial ID in a table that already has a (usually composite) unique identifier. For example, in one system I've created a table to store part numbers; although the part numbers are unique, they may actually change - so we add an arbitrary integer PartID. Not so common, but it's a typical real-world example.
In general, what you really want is, if at all possible, some way to uniquely identify a record. It could be an id field or it could be a unique index (which does not have to be on just one field). Any time I thought I could get away without creating a way to uniquely identify a record, I have been proven wrong. Not all tables have a natural key, though, and if they do not, you really need an id field of some kind. If you have a natural key you could use that instead, but I find that even then I need an id field in most cases, to prevent having to do too much updating when the natural key changes (it always seems to change). Plus, having worked with literally hundreds of databases covering many different topics, I can tell you that a true natural key is rare. As others have mentioned, there is no need for an id field in a table that is simply there to join two tables that have a many-to-many relationship, but even that table should have a unique index.
If you need to retrieve records from that table by a unique id, then yes. If you will retrieve them by some other composite key made up of foreign keys, then no. The last thing you need is fields, data, and indexes that you do not use.
A clustered index does not need to be on primary key or a surrogate (identity column) either.
Your design, however, is not normalized. Typically for tagging, I use two tables: a table of tags (with a surrogate key) and a table of links from the tags to the subject table(s), using the surrogate key in the tag table and the primary key in the subject table. This allows your tags to apply to different entities (photos, articles, employees, locations, products, whatever). It allows you to enforce foreign key relationships to multiple tables, and also allows you to invent tag hierarchies and other things about the tag table.
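A rough sketch of that two-table approach applied to your example (assuming your existing foo table has fooID as its primary key):

-- Table of tags themselves, with a surrogate key.
CREATE TABLE Tag (
    TagID   int IDENTITY(1,1) PRIMARY KEY,
    TagText varchar(100) NOT NULL UNIQUE
);

-- Link table between tags and the tagged entity; the composite key is the PK.
CREATE TABLE FooTag (
    fooID int NOT NULL REFERENCES foo (fooID),
    TagID int NOT NULL REFERENCES Tag (TagID),
    PRIMARY KEY (fooID, TagID)
);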
As far as the indexes on this design, it will be dictated by the usage patterns.
In general developers love having an ID field on all tables except for 'linking' tables because it makes development much easier, and I am no exception to this. DBA's on the other hand see no problem with making natural primary keys made up of 3 or 4 columns. It can be a butting of heads to try and get a good database design.