trying to determine unique identifier for database table - sql

I have a database table with many columns and there is no specified primary key. There isn't a list of super keys either. Besides iteratively trying all candidate keys/columns, is there a way for me, using SQL, to try and figure out whether a subset of keys can make a unique identifier for my table?
For example, a table may have 4 columns first name, last name, address and zip and the data I see is:
John, Smith, 1 main st, 00001
Mary, Smith, 1 main st, 00001
Mary, Smith, 2 sub st, 00002
In this case, I'll need first, last and zip as my unique key.
John, Smith, 1 main st, 00001
John, Smith, 1 main st, 00001
In this case, there is no unique key.
Please don't comment on my table construction and/or normalization of databases, I'm just trying to find a practical answer. Thanks.
This is my question: Besides iteratively trying all candidate keys/columns, is there a way for me, using SQL, to try and figure out whether a subset of keys can make a unique identifier for my table?

Looking for a subset of unique values in this case seems so specific to the particular data set. What if you arrive at a subset today and find you can't insert a new row tomorrow?
Use an artificial key, like an auto-incrementing integer.

In short: no, there's no way to do this in T-SQL really.
My advice: just add an ID INT IDENTITY PRIMARY KEY column to the table. It's guaranteed to be unique, it will be filled automagically when you create it, and it's fast and easy: no messy "is this really unique or are there any combinations of rows that violate the uniqueness" questions.
Just do it - it's the easiest way to go!!

You cannot find if a combination "can" make a primary key. You can find if one WILL make a good primary key for an existing set of data.
To find out whether a set of fields is a candidate key, count the distinct combinations of those fields (for example, using GROUP BY with ROLLUP) and compare that with COUNT(*). The set is unique on the current data only if the two numbers match.
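As a rough sketch of that comparison, here is a version using Python's built-in sqlite3 module instead of SQL Server. The table name people, the column names and the helper is_unique are made up for illustration; the rows are the ones from the question. It only tells you about the data that is in the table right now.

```python
import sqlite3

def is_unique(conn, table, columns):
    """Return True if no two rows of `table` share the same values in `columns`.

    Hypothetical helper: compares COUNT(*) with the number of distinct
    combinations of the given columns.
    """
    cols = ", ".join(f'"{c}"' for c in columns)
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    distinct = conn.execute(
        f'SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM "{table}")'
    ).fetchone()[0]
    return total == distinct

# Sample rows taken from the question; table and column names are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first TEXT, last TEXT, address TEXT, zip TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", [
    ("John", "Smith", "1 main st", "00001"),
    ("Mary", "Smith", "1 main st", "00001"),
    ("Mary", "Smith", "2 sub st", "00002"),
])

unique_first_last_zip = is_unique(conn, "people", ["first", "last", "zip"])
unique_last_zip = is_unique(conn, "people", ["last", "zip"])
```

Note that SELECT DISTINCT treats two NULLs as equal, so columns with NULLs may need separate handling.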

There is a much faster method.
Enterprise dbms have had it for many years but MS SQL Server 2005 (useable in 2008) and later provided the HashBytes() function. Convert the columns to CHAR() (VARCHAR on MS), concatenate them; then hash them; then compare the hashes. You can compare the two tables in a single SELECT command. IIRC max 8000 characters per row.
(If you use this answer, please undo and redo your Answer choice.)

If you are comparing two databases, you can check whether any duplicate rows exist in the source db with a query like this:
select a,b,c,d
from mytable
group by a,b,c,d
having count(*) > 1
Include all columns.
Then use all columns as the 'row key' to see if each row exists in the target system.

There are update anomalies in this schema:
you cannot insert a person without knowing his address.
A better approach is to separate it into three tables: one for persons, one for addresses, and one for PersonAddress:
> persons: id, firstname, lastname
> address: id, address
> personaddress: personid, addressid

You cannot find if a combination "can" make a primary key.
I actually disagree with this, I think it is possible to write a query that will SELECT all possible permutations of columns from the table and combine each permutation into a single unique value (the simplest, crudest way is to CAST them all to VARCHAR and connect them with a spacer character - a better way would be some kind of hash function).
With a single pass you would then have set of columns like P1, P12, P123, P2, P23, P3 etc (in case of three columns). Then you can do a query with COUNT(*) vs COUNT(DISTINCT) for each permutation column and you will see which permutations are unique.
Using dynamic SQL you could probably make it so that it would work on any table, although I don't know about the column limit for SQL Server.
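A minimal sketch of the brute-force idea, again using Python's sqlite3 module rather than dynamic T-SQL. It runs one query per combination instead of a single pass and skips the concatenate-and-hash step, so it is a simplification of what is described above. The function candidate_keys and the table names are hypothetical. It tries column combinations from smallest to largest and skips any combination that already contains a smaller unique set.

```python
import sqlite3
from itertools import combinations

def candidate_keys(conn, table):
    """Return the minimal column sets that are unique on the current data."""
    columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    found = []
    for size in range(1, len(columns) + 1):
        for combo in combinations(columns, size):
            # A superset of a known unique set is unique too, but not minimal.
            if any(set(key) <= set(combo) for key in found):
                continue
            cols = ", ".join(f'"{c}"' for c in combo)
            distinct = conn.execute(
                f'SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM "{table}")'
            ).fetchone()[0]
            if distinct == total:
                found.append(combo)
    return found

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (first TEXT, last TEXT, address TEXT, zip TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?, ?, ?)", [
    ("John", "Smith", "1 main st", "00001"),
    ("Mary", "Smith", "1 main st", "00001"),
    ("Mary", "Smith", "2 sub st", "00002"),
])
keys = candidate_keys(conn, "people")

# Second sample from the question: two identical rows, so nothing is unique.
conn.execute("CREATE TABLE dupes (first TEXT, last TEXT, address TEXT, zip TEXT)")
conn.executemany("INSERT INTO dupes VALUES (?, ?, ?, ?)", [
    ("John", "Smith", "1 main st", "00001"),
    ("John", "Smith", "1 main st", "00001"),
])
no_keys = candidate_keys(conn, "dupes")
```

The number of combinations grows as 2^n with the column count, so on a wide table you would probably want to cap the combination size. On the three-row sample the smallest unique sets turn out to be (first, address) and (first, zip).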

Related

Composite primary key: Finding one attribute using another

Data fields
I am designing a database table structure. Say that we need to record employee profiles from different companies. We have the following fields:
+---------+--------------+-----+--------+-----+
| Company | EmployeeName | Age | Gender | Tel |
+---------+--------------+-----+--------+-----+
It's possible that two employees from different companies may have the same name (and assume that no two employees in the same company have the same name). In this case a composite primary key (Company, EmployeeName) would be necessary in my opinion.
Search
Now I need to get all information by using only one of the 2 attributes in the primary key. For example,
I want to search all employees' profile of Company A:
SELECT EmployeeName, Age, Gender, Tel FROM table WHERE Company = 'Company A'
And I can also search for all employees named Donald across different companies:
SELECT Company, Age, Gender, Tel FROM table WHERE EmployeeName = 'Donald'
Strategy
In order to implement this requirement, my strategy would be storing all data in a single table, which is easy to read and understandable. However, I noticed that it may take a long time to search, as the query may need to iterate through all rows. I would like to retrieve this information as quickly as possible. Would there be a better strategy for this?
First, your rows should have a unique identifier for each row -- identity/auto-increment/serial, depending on the database. Second, you might reconsider names being unique. Why can't two people at the same company have the same name?
In any case, you have a primary key on, say, (company, name). For the opposite search you simply want another index on (name, company):
create index idx_profiles_name_company on profiles(name, company);
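To see whether both searches can actually use an index, here is a small sketch with Python's sqlite3 module and EXPLAIN QUERY PLAN. The column types and the plan helper are assumptions, and other databases have their own plan output, so treat this as an illustration rather than a benchmark.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE profiles (
    company TEXT NOT NULL,
    name    TEXT NOT NULL,
    age     INTEGER,
    gender  TEXT,
    tel     TEXT,
    PRIMARY KEY (company, name)
);
-- Second index with the columns reversed, for searches by name.
CREATE INDEX idx_profiles_name_company ON profiles(name, company);
""")

def plan(sql, params):
    """Return the 'detail' column of SQLite's query plan as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)

plan_by_company = plan(
    "SELECT name, age, gender, tel FROM profiles WHERE company = ?",
    ("Company A",),
)
plan_by_name = plan(
    "SELECT company, age, gender, tel FROM profiles WHERE name = ?",
    ("Donald",),
)
```

The company search should be served by the primary key's index and the name search by idx_profiles_name_company, so neither has to walk every row.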
A note explaining Gordon's suggestion for an identity on each row. This is supplemental to his answer above.
In theory there is nothing wrong with a primary key that crosses columns and in a db like PostgreSQL I like to have identity values as secondary keys (i.e. not null unique) and specify natural primary keys. Of course on MS SQL Server or MySQL/InnoDB that would be a recipe for problems. I would also not say "all" but rather "almost all" since there are times when breaking this rule is good.
Regardless, having an identity row simplifies a couple of things and it provides an abstraction around keys in case you get things wrong. Composite keys provide a couple issues that end up eating time (and possibly resulting in downtime) later. These include:
Joins on composite keys are often more expensive than those on simple values, and
Adding or changing a natural primary key which crosses columns is far harder when joins are involved
So depending on your db you should either specify a unique secondary key or make your natural primary key separate (which you should do depends on storage and implementation specifics).

Why is this table not normalized?

I am taking a database course and I am studying table normalization.
Could anyone explain to me why the second table in the first row on the right is not normalized?
It is not normalized because
For a student who has signed for more than one course, the entries in the table will be:
23 Jake Smith CS101 B+
23 Jake Smith B102 C+
Clearly the data is being repeated (redundant data). This leads to anomalies (insert, update, delete anomalies).
Ex: When you have to change the name of a student, say Jake Smith, you have to modify all of the rows; this is called an update anomaly.
Normalization is used to avoid these kinds of anomalies and redundant data storage.
The table on the right hand side in the second row handles this situation in a better way, as it stores id, name and DOB in a separate table, the edits can be made easily using id attribute on a single row.
There are several normal forms like 1NF, 2NF, 3NF etc. Each normal form has some constraints associated with it. Each Higher form being stricter than the previous one.
I suppose it is table for students grades. It is not normalized because it contains students names directly, instead of references to students records.
It's better not to include student_name into this table, but store all students data in separate students table and reference it by student_id foreign key (something like first table in second row except the ids.).
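A minimal sketch of that split, using Python's sqlite3 module. The table names students and grades and the columns course_code and grade are guesses, since the original tables are only visible in the question's image. The two rows are the ones quoted above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
-- One row per student: the name lives in exactly one place.
CREATE TABLE students (
    id   INTEGER PRIMARY KEY,
    name TEXT NOT NULL
);
-- One row per (student, course); the name is reached through student_id.
CREATE TABLE grades (
    student_id  INTEGER NOT NULL REFERENCES students(id),
    course_code TEXT NOT NULL,
    grade       TEXT NOT NULL,
    PRIMARY KEY (student_id, course_code)
);
INSERT INTO students VALUES (23, 'Jake Smith');
INSERT INTO grades VALUES (23, 'CS101', 'B+'), (23, 'B102', 'C+');
""")

# Renaming the student touches a single row, not one row per course.
conn.execute("UPDATE students SET name = ? WHERE id = ?", ("Jacob Smith", 23))

report = conn.execute("""
    SELECT s.id, s.name, g.course_code, g.grade
    FROM grades g JOIN students s ON s.id = g.student_id
    ORDER BY g.course_code
""").fetchall()
```

After the single UPDATE, both grade rows show the new name when joined back to students.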
It's not normalised because neither id nor student_name is the key (both have duplicates) so the key must be one of those (probably id) together with the course code. The other one (name) then doesn't depend on that key, but just on id.
The simple rule for 3NF is that every non-key column must depend on "the key, the whole key, and nothing but the key" - to which we all solemnly intone "so help me Codd"!
The higher normal forms deal with dependencies inside the parts of a key.
Because in the first table on the right the pair
23 - j.smith
appears twice, and that repetition does not adhere to Codd's first normal form.

How to force ID column to remain sequential even if a record has been deleted, in SQL Server?

I don't know what is the best wording of the question, but I have a table that has 2 columns: ID and NAME.
When I delete a record from the table, its ID is deleted with it and the sequence is broken.
take this example:
if I deleted row number 2, the sequence of ID column will be: 1,3,4
How to make it: 1,2,3
ID's are meant to be unique for a reason. Consider this scenario:
**Customers**
id value
1 John
2 Jackie
**Accounts**
id customer_id balance
1 1 $500
2 2 $1000
In the case of a relational database, say you were to delete "John" from the database. Now Jackie would take on the customer_id of 1. When Jackie goes in to check her balance, she will now be $500 short.
Granted, you could go through and update all of her other records, but A) this would be a massive pain in the ass. B) It would be very easy to make mistakes, especially in a large database.
Ids (primary keys in this case) are meant to be the rock that holds your relational database together, and you should always be able to rely on that value regardless of the table.
As JohnFx pointed out, should you want a value that shows the order of the user, consider using a built in function when querying.
In SQL Server identity columns are not guaranteed to be sequential. You can use the ROW_NUMBER function to generate a sequential list of ids when you query the data from the database:
SELECT
ROW_NUMBER() OVER (ORDER BY Id) AS SequentialId,
Id As UniqueId,
Name
FROM dbo.Details
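For anyone who wants to try this without a SQL Server instance, roughly the same query runs in SQLite (3.25 or later, which added window functions) through Python's sqlite3 module. The table details and its sample names are made up. It recreates the question's case of deleting row 2.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE details (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany(
    "INSERT INTO details (name) VALUES (?)",
    [("a",), ("b",), ("c",), ("d",)],
)

# Deleting row 2 leaves the stored ids as 1, 3, 4.
conn.execute("DELETE FROM details WHERE id = 2")

# The gap-free numbering is computed at query time and never stored.
rows = conn.execute("""
    SELECT ROW_NUMBER() OVER (ORDER BY id) AS sequential_id,
           id AS unique_id,
           name
    FROM details
    ORDER BY id
""").fetchall()
```

The stored ids stay 1, 3, 4 so anything referencing them remains valid, while sequential_id comes out as 1, 2, 3.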
If you want sequential numbers don't store them in the database. That is just a maintenance nightmare, and I really can't think of a very good reason you'd even want to bother.
Just generate them dynamically using T-SQL's ROW_NUMBER() function when you query the data.
The whole point of an Identity column is creating a reliable identifier that you can count on pointing to that row in the DB. If you shift them around you undermine the main reason you WANT an ID.
In a real world example, how would you feel if the IRS wanted to change your SSN every week so they could keep the Social Security Numbers sequential after people died off?

Bending the rules of UNIQUE column SQLITE

I am working with an extensive amount of third party data. Each data set has items with unique identifiers. So it is very easy for me to utilise UNIQUE column in SQLITE to enforce some data integrity.
Out of thousands of records I have id from third party source A matching 2 unique ids from third party source B.
Is there a way of bending the rules, and allowing a duplicate entry in a unique column? If not how should I reorganise my data to take care of this single edge case.
UPDATE:
CREATE TABLE "trainer" (
"id" INTEGER PRIMARY KEY AUTOINCREMENT,
"name" TEXT NOT NULL,
"betfair_id" INTEGER NOT NULL UNIQUE,
"racingpost_id" INTEGER NOT NULL UNIQUE
);
Problem data:
Miss Beverley J Thomas http://www.racingpost.com/horses/trainer_home.sd?trainer_id=20514
Miss B J Thomas http://www.racingpost.com/horses/trainer_home.sd?trainer_id=11096
vs. Miss Beverley J. Thomas http://form.horseracing.betfair.com/form/trainer/1/00008861
Both Racingpost entries (my primary data source) match a single Betfair entry. This is the only one (so far) out of thousands of records.
If racingpost should have had only 1 match it is an error condition.
If racingpost is allowed to have 2 matches per id, you must either have two ids, select one, or combine the data.
Since racingpost is your primary source, having 2 ids may make sense. However if you want to improve upon that data set combining that data or selecting the most useful may be more accurate. The real question is how much data overlaps between these two records and when it does can you detect it reliably. If the overlap is small or you have good detection of an overlap condition, then combining makes more sense. If the overlap is large and you cannot detect it reliably, then selecting the most recent updated or having two ids is more useful.
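If you go the "two ids" route, one possible reorganisation (only a sketch, not the only way) is to move the Racingpost ids into their own table, so each racingpost_id stays unique while one trainer can own several. The table trainer_racingpost is hypothetical; the ids come from the URLs in the question, using Python's sqlite3 module.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
CREATE TABLE trainer (
    id         INTEGER PRIMARY KEY AUTOINCREMENT,
    name       TEXT NOT NULL,
    betfair_id INTEGER NOT NULL UNIQUE
);
-- Hypothetical mapping table: each racingpost_id appears once,
-- but several of them may point at the same trainer.
CREATE TABLE trainer_racingpost (
    racingpost_id INTEGER PRIMARY KEY,
    trainer_id    INTEGER NOT NULL REFERENCES trainer(id)
);
""")

cur = conn.execute(
    "INSERT INTO trainer (name, betfair_id) VALUES (?, ?)",
    ("Miss Beverley J Thomas", 8861),
)
trainer_id = cur.lastrowid
conn.executemany(
    "INSERT INTO trainer_racingpost VALUES (?, ?)",
    [(20514, trainer_id), (11096, trainer_id)],
)

racingpost_ids = [row[0] for row in conn.execute(
    "SELECT racingpost_id FROM trainer_racingpost "
    "WHERE trainer_id = ? ORDER BY racingpost_id",
    (trainer_id,),
)]

# Uniqueness is still enforced per Racingpost id.
try:
    conn.execute("INSERT INTO trainer_racingpost VALUES (?, ?)", (20514, trainer_id))
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
```

This keeps both UNIQUE guarantees and handles the edge case without special-casing it, at the cost of an extra join when you look up a trainer by Racingpost id.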

Is using Null to represent a Value Bad Practice?

If I use null as a representation of everything in a database table is that bad practice ?
i.e.
I have the tables: myTable(ID) and myRelatedTable(ID,myTableID)
myRelatedTable.myTableID is a FK of myTable.ID
What I want to accomplish is: if myRelatedTable.myTableID is null then my business logic will interpret that as being linked to all myTable rows.
The reason I want to do this is because I have an unknown amount of rows that could be inserted into myTable after the myRelatedTable row is created and some rows in the myRelatedTable need to reference all existing rows in myTable.
I think you might agree that it would be bad to use the number 3 to represent a value other than 3.
By the same reasoning it is therefore a bad idea to use NULL to represent anything other than the absence of a value.
If you disagree and twist NULL to some other purpose, the maintenance programmers that come after you will not be grateful.
Not a good idea, because then you cannot use the "related to all entries" fact in SQL queries at all. At some point, you'll probably want/need to do this.
Ideally there should be no nulls at all. There should be another table to represent the relation.
If you are going to assign special meanings, however, NULL should only ever mean "not assigned", i.e. no relationship exists. Use negative numbers, e.g. -1, if you want to trigger some business layer trickery. It should be obvious to any developers that come across this in the future that -1 is an extraordinary value that should not be treated as normal.
I don't think NULL is the best way to do it, but you might use a separate tinyint column to indicate that the row in MyRelatedTable is related to everything in MyTable, e.g. MyRelatedTable.RelatedAll. That would make it more explicit for others who have to maintain it. Then you could do some sort of UNION query, e.g.
SELECT M.ID, R.ID AS RelatedTableID,....
FROM MyTable M INNER JOIN MyRelatedTable R ON R.myTableId = M.Id
UNION
SELECT M.ID, R.ID AS RelatedTableID,....
FROM MyTable M, MyRelatedTable R
WHERE R.RelatedAll = 1
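Here is a runnable sketch of that flag-column idea with Python's sqlite3 module. The sample ids and the related_pairs helper are invented, and the elided columns are simply left out. The point is that a row flagged RelatedAll = 1 picks up rows added to myTable later, without using NULL as a magic value.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE myTable (ID INTEGER PRIMARY KEY);
CREATE TABLE myRelatedTable (
    ID         INTEGER PRIMARY KEY,
    myTableID  INTEGER REFERENCES myTable(ID),
    RelatedAll INTEGER NOT NULL DEFAULT 0   -- 1 = related to every myTable row
);
INSERT INTO myTable (ID) VALUES (1), (2), (3);
-- Row 10 points at one specific row; row 11 is flagged as related to all.
INSERT INTO myRelatedTable (ID, myTableID, RelatedAll)
VALUES (10, 2, 0), (11, NULL, 1);
""")

def related_pairs(conn):
    """Return (myTable.ID, myRelatedTable.ID) pairs, explicit and 'all' links."""
    return conn.execute("""
        SELECT M.ID, R.ID AS RelatedTableID
        FROM myTable M INNER JOIN myRelatedTable R ON R.myTableID = M.ID
        UNION
        SELECT M.ID, R.ID AS RelatedTableID
        FROM myTable M CROSS JOIN myRelatedTable R
        WHERE R.RelatedAll = 1
        ORDER BY 1, 2
    """).fetchall()

pairs_before = related_pairs(conn)

# A row inserted later is picked up by the flagged row automatically.
conn.execute("INSERT INTO myTable (ID) VALUES (4)")
pairs_after = related_pairs(conn)
```

Because the "related to all" fact is an ordinary column, it can be used directly in WHERE clauses and joins.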
Yes, for the simple reason that NULL represents no value. Not a special value; not a blank value, but nothing.
If the foreign key is just a simple integer, and it's generated automatically, then you could use 0 to represent the "magic" value.
What you posted, namely that a NULL in a foreign key asserts a relationship with all the rows in the referenced table, is very non standard. Off the top of my head, I think it's fraught with dangers.
What most people who use NULLs in FKs mean by it is that it asserts a relationship to NONE of the rows in the referenced table. This is common in the case of optional relationships, ones that can occur zero times.
Example: We have an HR database, with a table called "EMPLOYEES". We have two columns, called "EmpID" and "SupervisorID". (Many people call the first column simply "ID"). Every employee in the table has an entry under SupervisorID with the sole exception of the CEO of the company. The CEO has a NULL in the SupervisorID column, meaning that the CEO has no supervisor. The CEO is accountable to the BOD, but that isn't represented in SupervisorID.
What you might mean by a relationship with ALL the rows in the referenced table is this: There's a POSSIBLE relationship between the row in question and ANY ONE of the rows in the referenced table. When you start to get into the questions of the facts that are true in the real world but unknown to the database you open a whole big can of worms.