SQL Find smallest unique key in table - sql

I have a table with 8 columns and would like to know the smallest possible unique key for this table. (there are no indexes).
This is as distinct as it gets:
select count(*)
from (select distinct id1, id2, id3,
id4, id5, id6,
id7, id8
from mytable);
Is there a quick and easy way to figure out what column combination is unique for this table? How to proceed?

No, there is no quick and easy way as people have said; as Barmar commented:
Any answer you get will be dependent on the data that happens to be loaded at the time. Any change made to the table could invalidate the result
Additionally you have not previously had a primary key searching for uniqueness could easily come up with a false positive. You haven't been enforcing uniqueness so the set of columns that should be unique might not be.
A unique key is determined by the data within your table. You need to understand the data in order to determine what should be unique.
There should be a natural way of having a unique key, i.e. if you've got a table of OS users then the unique key should probably be their username. Equally, if you have a table of employees there's no way to guarantee uniqueness across employee name (or any other attribute or set of attributes) so you will need to create a surrogate key.
Only you can determine this; study your data and you should be able to work it out.
Generally:
Every table should have a primary key; should should be created at the same time as the table.
There is nothing more important than understanding the data within your database. You are not able to effectively use the database until you understand what it is that you're looking at.

Related

Foreign and Primary Key Conceptual Questions

I am a newbie at SQL/PostgreSQL, and I had a conceptual question about foreign keys and keys in general:
Let's say I have two tables: Table A and Table B.
A has a bunch of columns, two of which are A.id, A.seq. The primary key is btree(A.id, A.seq) and it has a foreign key constraint that A.id references B.id. Note that A.seq is a sequential column (first row has value 1, second has 2, etc).
Now say B has a bunch of columns, one of which is the above mentioned B.id. The primary key is btree(B.id).
I have the following questions:
What exactly does btree do? What is the significance of having two column names in the btree rather than just one (as in btree(B.id)).
Why is it important that A references B instead of B referencing A? Why does order matter when it comes to foreign keys??
Thanks so much! Please correct me if I used incorrect terminology for anything.
EDIT: I am using postgres
A btree index stored values in sorted order, which means you can not only search for a single primary key value, but you can also efficiently search for a range of values:
SELECT ... WHERE id between 6060842 AND 8675309
PostgreSQL also supports other index types, but only btree is supported for a unique index (for example, the primary key).
In your B table, the primary key being a single column id means that only one row can exist for each value in id. In other words, it is unique, and if you search for a value by primary key, it will find at most one row (it may also find zero rows if you don't have a row with that value).
The primary key in your A table is for (id, seq). This means you can have multiple rows for each value of id. You can also have multiple rows for each value of seq as long as they are for different id values. The combination must be unique though. You can't have more than one row with the same pair of values.
When a foreign key in A references B, it means that the row must exist in B before you are allowed to store the row in A with the same id value. But the reverse is not necessary.
Example:
Suppose B is for users and A is for phones. You must store a user before you can store a phone record for that user. You can store one or more phones for that user, one per row in the A table. We say that a row in A therefore references the user row in B, meaning, "this phone belongs to user #1234."
But the reverse is not restricted. A user might have no phones (at least not known to this database), so there is no requirement for B to reference A. In other words, it is not required for a user to have a phone. You can store a row in B (the user) even if there is no row in A (the phones) for that user.
The reference also means you are not allowed to DELETE FROM B WHERE id = ? if there are rows in another table that reference that given row in B. Deleting that user would cause those other rows to become orphaned. No one would be able to know who those phones belonged to, if the user row they reference is deleted.
To your questions:
There are several strategies to implement unique keys. The most common one is to use an index that is "unique" using a "b-tree" strategy. That's what "btree" means in PostgreSQL.
Having two columns in a key just depends on how you want to design your table. When you have a key with more than one column that is called a "composite key".
When A references B, the columns in B must represent a "key". The columns in A do not represent a key, but just a reference to one. In fact the values in A for that column can be repeated; that is, multiple rows in A can point to the same row in B.
Your data structure makes no sense. Why would the primary key of A haver both id and name? Normally it would just be id. In some data models, you might have a version or timestamp added. I can't think of a reasonable data model where name would also be included.
In addition, B's foreign key would have to be to both id and name.
But, your question is what is btree for? Most databases don't have such an option. A primary key would typically be expressed as:
id int primary key;
constraint unq_t_id primary key (id);
btree is a type of index -- in fact the default type of index in all databases that I'm aware of. Databases that have a plethora of available types of indexes -- such as Postgres -- you can specify the index type associated with the primary key.

How can I create a composite primary key (id_x, id_y), where order is unique?

I have a similarity matrix between two items and I want to store it as a table in a relational database. Thus, I want the table to have three fields; the id of the first item, the id of the second item and the value of their similarity.
I think about making the two id columns a composite primary key. But how can I ensure that given two items (x, y), the composite key (id_x, id_y) is the same as (id_y, id_x) so I don't have duplicate entries?
Assuming your id values are using a data type that is well-ordered, one of the simplest ways to achieve this is to enforce a CHECK constraint such that id_x is less than id_y.
This may present some difficulties if you do not have a way to enforce this ordering before entries are created. Some databases allow computation within index definitions, in which case you may be able to define your index on (MIN(id_x,id_y),MAX(id_x,id_y)) and omit the CHECK. (For something like SQL Server, you can put such a computation into an indexed view definition)

Is it a bad idea to make a count the primary key?

Just thinking about database design issues. Suppose i have a table like this:
CREATE TABLE LEGACYD.CV_PLSQL_COUNT
(
RUN_DTTM DATE NOT NULL,
TABLE_NAME VARCHAR (80) NOT NULL,
COUNT Number(50) PRIMARY KEY
);
Is that a bad idea - to make the COUNT a PRIMARY KEY ? Generally, how should I make the decision of what is a primry key?
Is that a bad idea - to make the COUNT a PRIMARY KEY ? Generally, how should I make the decision of what is a primry key?
Candidate keys are based on functional dependencies. A primary key is one of the candidate keys. There's no formal way to look at a set of candidate keys and say, "This one must be the primary key." That decision is based on practical matters, not on formal logic.
Your table structure tells us these things.
COUNT is unique. No matter how many rows this table has, you'll never find two rows that have the same value for COUNT.
COUNT determines TABLE_NAME. That is, given a value for COUNT, we will forever find one and only one value for TABLE_NAME.
TABLE_NAME is not unique. We expect to find several rows that have the same value for TABLE_NAME.
COUNT determines RUN_DTTM. That is, given a value for COUNT, we will forever find one and only one value for RUN_DTTM.
RUN_DTTM is not unique. We expect to find several rows that have the same value for RUN_DTTM.
The combination of TABLE_NAME and RUN_DTTM is not unique. We expect to find several rows that have the same values for TABLE_NAME and RUN_DTTM in a single row.
There are no other determinants. That means that given a value for TABLE_NAME, we'll find multiple unrelated values for COUNT and RUN_DTTM. Likewise, if we're given a value for RUN_DTTM, or values for the pair of columns {TABLE_NAME, RUN_DTTM}.
If all those things are true, then COUNT might be a good primary key. But I doubt that all those things are true.
Based only on the column names--a risky way to proceed--I think it's far more likely that the only candidate key is {TABLE_NAME, RUN_DTTM}. I think it's also likely that either RUN_DTTM is misnamed, or RUN_DTTM has the wrong data type. If it's a date, you probably meant to name it RUN_DT; if it's a timestamp, the data type should be TIMESTAMP.

Table design, composite key

I have a table with some data summary which consist of client_id, location_id, category_id and summary columns. Values of the three id's columns are not unique.
At the moment I have created a composite key from client_id, location_id, category_id using primary keys. Those three columns will uniquely identify rows.
My question is, if I still should include unique primary key for that table for example column with auto-increment id ?
That depends completely on your uses of the table. If you don't want to refer to a given row in a query (for example, having a dependent table), the separate PK is unnecessary (eg. if you always ask for statistics for a given client and a given location and a given category). However, if you do have dependent tables, you probably want a separate PK as well.
If your composite key is the primary clustered index then I would say it's not necessary.

How do I implement this multi-table database design/constraint, normalized?

I have data that kinda looks like this...
Elements
Class | Synthetic ID (pk)
A | 2
A | 3
B | 4
B | 5
C | 6
C | 7
Elements_Xref
ID (pk) | Synthetic ID | Real ID (fk)
. | 2 | 77-8F <--- A class
. | 3 | 30-7D <--- A class
. | 6 | 21-2A <--- C class
. | 7 | 30-7D <--- C class
So I have these elements that are assigned synthetic IDs and are grouped into classes. But these synthetic IDs are then paired with Real IDs that we actually care about. There is also a constraint that a Real ID cannot recur in a single class. How can I capture all of this in one coherent design?
I don't want to jam the Real ID into the upper table because
It is nullable (there are periods where we don't know what the Real ID of something should be).
It's a foreign key to more data.
Obviously this could be done with triggers acting as constraints, but I'm wondering if this could be implemented with regular constraints/unique indexes. Using SQL Server 2005.
I've thought about having two main tables SyntheticByClass and RealByClass and then putting IDs of those tables into another xref/link table, but that still doesn't guarantee that the classes of both elements match. Also solvable via trigger.
Edit: This is keyword stuffing but I think it has to do with normalization.
Edit^2: As indicated in the comments below, I seem to have implied that foreign keys cannot be nullable. Which is false, they can! But what cannot be done is setting a unique index on fields where NULLs repeat. Although unique indexes support NULL values, they cannot constraint more than one NULL in a set. Since the Real ID assignment is initially sparse, multiple NULL Real IDs per class is more than likely.
Edit^3: Dropped the redundant Elements.ID column.
Edit^4: General observations. There seems to be three major approaches at work, one of which I already mentioned.
Triggers. Use a trigger as a constraint to break any data operations that would corrupt the integrity of the data.
Index a view that joins the tables. Fantastic, I had no idea you could do that with views and indexes.
Create a multi-column foreign key. Didn't think of doing this, didn't know it was possible. Add the Class field to the Xref table. Create a UNIQUE constraint on (Class + Real ID) and a foreign key constraint on (Class + Synthetic ID) back to the Elements table.
Comments from before the question was made into a 'bonus' question
What you'd like to be able to do is express that the join of Elements and Elements_Xref has a unique constraint on Class and Real ID. If you had a DBMS that supported SQL-92 ASSERTION constraints, you could do it.
AFAIK, no DBMS supports them, so you are stuck with using triggers.
It seems odd that the design does not constrain Real ID to be unique across classes; from the discussion, it seems that a given Real ID could be part of several different classes. Were the Real ID 'unique unless null', then you would be able to enforce the uniqueness more easily, if the DBMS supported the 'unique unless null' concept (most don't; I believe there is one that does, but I forget which it is).
Comments before edits made 2010-02-08
The question rules out 'jamming' the Real_ID in the upper table (Elements); it doesn't rule out including the Class in the lower table (Elements_Xref), which then allows you to create a unique index on Class and Real_ID in Elements_Xref, achieving (I believe) the required result.
It isn't clear from the sample data whether the synthetic ID in the Elements table is unique or whether it can repeat with different classes (or, indeed whether a synthetic ID can be repeated in a single class). Given that there seems to be an ID column (which presumably is unique) as well as the Synthetic ID column, it seems reasonable to suppose that sometimes the synthetic ID repeats - otherwise there are two unique columns in the table for no very good reason. For the most part, it doesn't matter - but it does affect the uniqueness constraint if the class is copied to the Elements_Xref table. One more possibility; maybe the Class is not needed in the Elements table at all; it should live only in the Elements_Xref table. We don't have enough information to tell whether this is a possibility.
Comments for changes made 2010-02-08
Now that the Elements table has the Synthetic ID as the primary key, things are somewhat easier. There's a comment that the 'Class' information actually is a 'month', but I'll try to ignore that.
In the Elements_Xref table, we have an unique ID column, and then a Synthetic ID (which is not marked as a foreign key to Elements, but presumably must actually be one), and the Real ID. We can see from the sample data that more than one Synthetic ID can map to a given Real ID. It is not clear why the Elements_Xref table has both the ID column and the Synthetic ID column.
We do not know whether a single Synthetic ID can only map to a single Real ID or whether it can map to several Real ID values.
Since the Synthetic ID is the primary key of Elements, we know that a single Synthetic ID corresponds to a single Class.
We don't know whether the mapping of Synthetic ID to Real ID varies over time (it might as Class is date-related), and whether the old state has to be remembered.
We can assume that the tables are reduced to the bare minimum and that there are other columns in each table, the contents of which are not directly material to the question.
The problem states that the Real ID is a foreign key to other data and can be NULL.
I can't see a perfectly non-redundant design that works.
I think that the Elements_Xref table should contain:
Synthetic ID
Class
Real ID
with (Synthetic ID, Class) as a 'foreign key' referencing Elements, and a NOT NULL constraint on Real ID, and a unique constraint on (Class, Real ID).
The Elements_Xref table only contains rows for which the Real ID is known - and correctly enforces the uniqueness constraint that is needed.
The weird bit is that the (Synthetic ID, Class) data in Elements_Xref must match the same columns in Elements, even though the Synthetic ID is the primary key of Elements.
In IBM Informix Dynamic Server, you can achieve this:
CREATE TABLE elements
(
class CHAR(1) NOT NULL,
synthetic_id SERIAL NOT NULL PRIMARY KEY,
UNIQUE(class, synthetic_id)
);
CREATE TABLE elements_xref
(
class CHAR(1) NOT NULL,
synthetic_id INTEGER NOT NULL REFERENCES elements(synthetic_id),
FOREIGN KEY (class, synthetic_id) REFERENCES elements(class, synthetic_id),
real_id CHAR(5) NOT NULL,
PRIMARY KEY (class, real_id)
);
I would:
Create a UNIQUE constraint on Elements(Synthetic ID, Class)
Add Class column to Elements_Xref
Add a FOREIGN KEY constraint on Elements_Xref table, referring to (Synthetic ID, Class)
At this point we know for sure that Elements_Xref.Class always matches Elements.Class.
Now we need to implement "unique when not null" logic. Follow the link and scroll to section "Use Computed Columns to Implement Complex Business Rules":
Indexes on Computed Columns: Speed Up Queries, Add Business Rules
Alternatively, you can create an indexed view on (Class, RealID) with WHERE RealID IS NOT NULL in its WHERE clause - that will also enforce "unique when not null" logic.
Create an indexed view for Elements_Xref with Where Real_Id Is Not Null and then create a unique index on that view
Create View Elements_Xref_View With SchemaBinding As
Select Elements.Class, Elements_Xref.Real_Id
From Elements_Xref
Inner Join Element On Elements.Synthetic_Id = Elements_Xref.Synthetic_Id
Where Real_Id Is Not Null
Go
Create Unique Clustered Index Elements_Xref_Unique_Index
On Elements_Xref_View (Class, Real_Id)
Go
This serves no other purpose other than simulating a unique index that treats nulls properly i.e. null != null
You can
Create a view from the the result set of joining Elements_Xref and Elements together on Synthetic ID
add a unique constraint on class, and [Real ID]. In other news, this is also how you do functional indexes in MSSQL, by indexing views.
Here is some sql:
CREATE VIEW unique_const_view AS
SELECT e.[Synthetic ID], e.Class, x.[Real ID]
FROM Elements AS e
JOIN [Elements_Xref] AS x
ON e.[Synthetic ID] = x.[Synthetic ID]
CREATE UNIQUE INDEX unique_const_view_index ON unique_const_view ( Class, [Real ID] );
Now, apparently, unbeknownst to myself this solution doesn't work in Microsoft-land-place because with MS SQL Server duplicate nulls will violate a UNIQUE constraint: this is against the SQL spec. This is where the problem is discussed about.
This is the Microsoft workaround:
create unique nonclustered index idx on dbo.DimCustomer(emailAddress)
where EmailAddress is not null;
Not sure if that is 2005, or just 2008.
I think a trigger is your best option. Constraints can't cross to other tables to get information. Same thing with a unique index (although I suppose a materialized view with an index might be possible), they are unique within the table. When you put the trigger together, remember to do it in a set-based fashion not row-by-row and test with a multi-row insert and multi-row update where the real key is repeated in the dataset.
I don't think either of your two reasons are an obstacle to putting Real ID in Elements. If a given element has 0 or 1 Real IDs (but never more than 1), it should absolutely be in the Elements table. This would then allow you to constrain uniqueness within Class (I think).
Could you expand on your two reasons not to do this?
Create a new table real_elements with fields Real ID, Class and Synthetic ID with a primary key of Class, RealId and add elements when you actually add a RealID
This constrains Real IDs to be unique for a class and gives you a way to match a class and real ID to the synthetic ID
As for Real ID being a foreign key do you mean that if it is in two classes then the data keyed off it will be the same. If so the add another table with key Real Id. This key is then a foreign key into real_elements and any other table needing real ID as foreign key