Composite primary key: Finding one attribute using another - sql

Data fields
I am designing a database table structure. Say that we need to record employee profiles from different companies. We have the following fields:
+---------+--------------+-----+--------+-----+
| Company | EmployeeName | Age | Gender | Tel |
+---------+--------------+-----+--------+-----+
It's possible that two employees from different companies have the same name (assume that no two employees in the same company share a name). In this case, a composite primary key (Company, EmployeeName) would be necessary in my opinion.
Search
Now I need to get all information by using only one of the 2 attributes in the primary key. For example,
I want to search for the profiles of all employees of Company A:
SELECT EmployeeName, Age, Gender, Tel FROM table WHERE Company = 'Company A'
And I also want to search for all employees named Donald across the different companies:
SELECT Company, Age, Gender, Tel FROM table WHERE EmployeeName = 'Donald'
Strategy
To implement this requirement, my strategy would be to store all data in a single table, which is easy to read and understand. However, I noticed that a search may take a long time, as the query may need to iterate through all rows. I would like to retrieve this information as quickly as possible. Is there a better strategy for this?

First, your rows should have a unique identifier for each row -- identity/auto-increment/serial, depending on the database. Second, you might reconsider names being unique. Why can't two people at the same company have the same name?
In any case, you have a primary key on, say, (company, name). For the opposite search you simply want another index on (name, company):
create index idx_profiles_name_company on profiles(name, company);
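A minimal sketch of the two lookups, assuming SQLite through Python's sqlite3 module purely so the example is runnable (the question does not name a database, and query plans differ between engines). The table name profiles and the column names follow the index statement above; the sample rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE profiles (
        company TEXT NOT NULL,
        name    TEXT NOT NULL,
        age     INTEGER,
        gender  TEXT,
        tel     TEXT,
        PRIMARY KEY (company, name)      -- serves searches by company
    );
    -- Reversed column order serves searches by name.
    CREATE INDEX idx_profiles_name_company ON profiles(name, company);
""")
conn.executemany(
    "INSERT INTO profiles VALUES (?, ?, ?, ?, ?)",
    [
        ("Company A", "Donald", 40, "M", "555-0100"),  # made-up rows
        ("Company A", "Maria", 35, "F", "555-0101"),
        ("Company B", "Donald", 52, "M", "555-0102"),
    ],
)

def plan(sql, params):
    """Return the detail column of EXPLAIN QUERY PLAN as one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[-1] for row in rows)

by_company_sql = "SELECT name, age, gender, tel FROM profiles WHERE company = ?"
by_name_sql = "SELECT company, age, gender, tel FROM profiles WHERE name = ?"

by_company = conn.execute(by_company_sql, ("Company A",)).fetchall()
by_name = conn.execute(by_name_sql, ("Donald",)).fetchall()

# The plan text names the index the engine chose for the name search.
name_plan = plan(by_name_sql, ("Donald",))
print(name_plan)
```

Checking the plan like this is a cheap way to confirm that neither search falls back to a full table scan.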

A note explaining Gordon's suggestion for an identity on each row. This is supplemental to his answer above.
In theory there is nothing wrong with a primary key that crosses columns and in a db like PostgreSQL I like to have identity values as secondary keys (i.e. not null unique) and specify natural primary keys. Of course on MS SQL Server or MySQL/InnoDB that would be a recipe for problems. I would also not say "all" but rather "almost all" since there are times when breaking this rule is good.
Regardless, having an identity on each row simplifies a couple of things, and it provides an abstraction around keys in case you get things wrong. Composite keys cause a couple of issues that end up eating time (and possibly resulting in downtime) later. These include:
Joins on composite keys are often more expensive than those on simple values, and
Adding or changing a natural primary key which crosses columns is far harder when joins are involved
So depending on your db you should either specify a unique secondary key or make your natural primary key separate (which you should do depends on storage and implementation specifics).
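As a rough illustration of the "identity plus natural key" layout, here is a sketch assuming SQLite via Python's sqlite3. It shows only one of the two options (surrogate primary key, natural key as a unique constraint); the phone_numbers child table is hypothetical and exists only to show a join on the single-column key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE profiles (
        id      INTEGER PRIMARY KEY,     -- surrogate identity
        company TEXT NOT NULL,
        name    TEXT NOT NULL,
        UNIQUE (company, name)           -- natural key is still enforced
    );
    CREATE TABLE phone_numbers (         -- hypothetical child table
        profile_id INTEGER NOT NULL REFERENCES profiles(id),
        tel        TEXT NOT NULL
    );
""")

cur = conn.execute(
    "INSERT INTO profiles (company, name) VALUES (?, ?)", ("Company A", "Donald")
)
donald_id = cur.lastrowid
conn.execute("INSERT INTO phone_numbers VALUES (?, ?)", (donald_id, "555-0100"))

# The unique constraint rejects a second Donald at Company A.
try:
    conn.execute(
        "INSERT INTO profiles (company, name) VALUES (?, ?)", ("Company A", "Donald")
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

# Changing the natural key touches one row; the child rows keep pointing at id.
conn.execute("UPDATE profiles SET name = ? WHERE id = ?", ("Don", donald_id))
joined = conn.execute(
    "SELECT p.name, t.tel FROM profiles p "
    "JOIN phone_numbers t ON t.profile_id = p.id"
).fetchall()
```

The join uses one integer column, and a correction to the natural key does not cascade into the referencing table.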

Related

SQL database structure with two changing properties

Let's assume I am building the backend of a university management software.
I have a users table with the following columns:
id
name
birthday
last_english_grade
last_it_grade
profs table columns:
id
name
birthday
I'd like to have a third table with which I can determine all professors teaching a student.
So I'd like to assign multiple teachers to each student.
Those Professors may change any time.
New students may be added any time too.
What's the best way to achieve this?
The canonical way to do this would be to introduce a third junction table, which exists mainly to relate users to professors:
users_profs (
user_id,
prof_id,
PRIMARY KEY (user_id, prof_id)
)
The primary key of this junction table is the combination of a user and professor ID. Note that this table is fairly lean, and avoids the problem of repeating metadata for a given user or professor. Rather, user/professor information remains in your two original tables, and does not get repeated.
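A small sketch of how the junction table might be used, assuming SQLite via Python's sqlite3. The users and profs tables are cut down to id and name, and the sample people are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE profs (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE users_profs (
        user_id INTEGER NOT NULL REFERENCES users(id),
        prof_id INTEGER NOT NULL REFERENCES profs(id),
        PRIMARY KEY (user_id, prof_id)
    );
    INSERT INTO users VALUES (1, 'Alice'), (2, 'Bob');
    INSERT INTO profs VALUES (10, 'Prof. Kim'), (11, 'Prof. Lee');
    INSERT INTO users_profs VALUES (1, 10), (1, 11), (2, 10);
""")

def profs_for(user_id):
    """Names of all professors currently teaching the given student."""
    rows = conn.execute(
        """SELECT p.name
           FROM profs p
           JOIN users_profs up ON up.prof_id = p.id
           WHERE up.user_id = ?
           ORDER BY p.name""",
        (user_id,),
    ).fetchall()
    return [name for (name,) in rows]

before = profs_for(1)

# A professor change is one row removed (or added) in the junction table.
conn.execute("DELETE FROM users_profs WHERE user_id = ? AND prof_id = ?", (1, 11))
after = profs_for(1)
```

Adding a student or reassigning a professor never touches the users or profs rows themselves.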

Why is this table not normalized?

I am taking a database course and I am studying table normalization.
Could anyone explain to me why the second table in the first row on the right is not normalized?
It is not normalized because
For a student who has signed for more than one course, the entries in the table will be:
23 Jake Smith CS101 B+
23 Jake Smith B102 C+
Clearly the data is being repeated (redundant data). This leads to anomalies (insert, update and delete anomalies).
Example: when you have to change the name of a student, say Jake Smith, you have to modify all of his rows. This is called an update anomaly.
Normalization is used to avoid these kinds of anomalies and redundant data storage.
The table on the right-hand side in the second row handles this situation in a better way: since it stores id, name and DOB in a separate table, an edit can be made on a single row, located by its id attribute.
There are several normal forms like 1NF, 2NF, 3NF etc. Each normal form has some constraints associated with it. Each Higher form being stricter than the previous one.
I suppose it is a table of students' grades. It is not normalized because it contains student names directly, instead of references to student records.
It's better not to include student_name in this table, but to store all student data in a separate students table and reference it through a student_id foreign key (something like the first table in the second row, except for the ids).
It's not normalised because neither id nor student_name is the key (both have duplicates) so the key must be one of those (probably id) together with the course code. The other one (name) then doesn't depend on that key, but just on id.
The simple rule for 3NF is that every non-key column must depend on "the key, the whole key, and nothing but the key" - to which we all solemnly intone "so help me Codd"!
The higher normal forms deal with dependencies inside the parts of a key.
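To make the dependency argument concrete, here is a sketch of the split, assuming SQLite via Python's sqlite3. The table and column names (students, grades, course, grade) are guesses, since the original picture is not available; the two rows are the ones quoted in the first answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- name depends on id alone, so it lives with id.
    CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    -- grade depends on the whole key (student_id, course).
    CREATE TABLE grades (
        student_id INTEGER NOT NULL REFERENCES students(id),
        course     TEXT NOT NULL,
        grade      TEXT NOT NULL,
        PRIMARY KEY (student_id, course)
    );
    INSERT INTO students VALUES (23, 'Jake Smith');
    INSERT INTO grades VALUES (23, 'CS101', 'B+'), (23, 'B102', 'C+');
""")

# Renaming the student now changes exactly one row.
cur = conn.execute("UPDATE students SET name = 'Jacob Smith' WHERE id = 23")
rows_changed = cur.rowcount

# The original wide table can still be produced with a join.
report = conn.execute(
    """SELECT s.id, s.name, g.course, g.grade
       FROM grades g
       JOIN students s ON s.id = g.student_id
       ORDER BY g.course"""
).fetchall()
```

The join reproduces the unnormalized view on demand, without the name being stored once per course.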
Because in your first table on the right, the values
23 - j.smith
appear twice; they are repeated and do not adhere to Codd's first normal form.

PK for table that have not unique data

I have tables like
Company( #id_company, ... )
addresses( address, *id_company*, *id_city* )
cities( #id_city, name_city, *id_country* )
countries( #id_country, name_country )
What I want to know is:
Is it a good design? (A company can have many addresses.)
The important thing you may notice is that I didn't add a PK to the addresses table, because every address of a company will be different. Am I right?
Also, I will never have a WHERE clause in a SELECT that specifies an address.
First of all we should distinguish natural keys and technical keys. As to natural keys:
A country is uniquely identified by its name.
A city can be uniquely identified by its country and a unique name. For instance, there are two Frankfurts in Germany. To make sure which one we are talking about, we either use the distinct names Frankfurt/Main and Frankfurt/Oder or use the city name with its zip code range.
A company gets identified by its full name usually. Or use some tax id, code, whatever.
To uniquely identify a company address we would take the company plus country, city and address in the city (street name and number usually).
You've decided to use technical keys. That's okay. But you should still make sure that names are unique. You don't want France and France in your table, it must be there just once. You don't want Frankfurt and Frankfurt without any distinction in your city table for Germany either. And you don't want to have the same address twice entered for one company.
company( #id_company, name_company, ... ) plus a unique constraint on name_company or whatever makes a company unique
countries( #id_country, name_country ) plus a unique constraint on name_country
cities( #id_city, name_city, id_country ) plus a unique constraint on name_city, id_country
addresses( address, id_company, id_city ) with a unique constraint on all three columns
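A sketch of these constraints as DDL, assuming SQLite via Python's sqlite3. The company names and the street address are invented, and other databases spell some details differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE countries (
        id_country   INTEGER PRIMARY KEY,
        name_country TEXT NOT NULL UNIQUE
    );
    CREATE TABLE cities (
        id_city    INTEGER PRIMARY KEY,
        name_city  TEXT NOT NULL,
        id_country INTEGER NOT NULL REFERENCES countries(id_country),
        UNIQUE (name_city, id_country)
    );
    CREATE TABLE company (
        id_company   INTEGER PRIMARY KEY,
        name_company TEXT NOT NULL UNIQUE
    );
    CREATE TABLE addresses (
        address    TEXT NOT NULL,
        id_company INTEGER NOT NULL REFERENCES company(id_company),
        id_city    INTEGER NOT NULL REFERENCES cities(id_city),
        UNIQUE (address, id_company, id_city)
    );
    INSERT INTO countries VALUES (1, 'Germany');
    INSERT INTO cities VALUES (1, 'Frankfurt/Main', 1), (2, 'Frankfurt/Oder', 1);
    INSERT INTO company VALUES (1, 'Acme'), (2, 'Globex');   -- invented names
    INSERT INTO addresses VALUES ('1 Main St', 1, 1);
""")

def rejected(sql, params):
    """True if the statement violates a constraint, False if it succeeds."""
    try:
        conn.execute(sql, params)
        return False
    except sqlite3.IntegrityError:
        return True

dup_country = rejected(
    "INSERT INTO countries (name_country) VALUES (?)", ("Germany",))
dup_city = rejected(
    "INSERT INTO cities (name_city, id_country) VALUES (?, ?)", ("Frankfurt/Main", 1))
dup_address = rejected(
    "INSERT INTO addresses VALUES (?, ?, ?)", ("1 Main St", 1, 1))
# The same street address for a different company is allowed.
shared_address = rejected(
    "INSERT INTO addresses VALUES (?, ?, ?)", ("1 Main St", 2, 1))
```

Declaring the three-column constraint on addresses as PRIMARY KEY instead of UNIQUE would behave the same way for these inserts.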
From what you say, it looks like you want the addresses only for lookup. You don't want to use them in any other table, not now and not in the future. Well, then you are done. As you need a unique constraint on all three columns, you could just as well declare this as your primary key, but you don't have to.
Keep in mind, that to reference a company address in any other future table, you would have to store address + id_company + id_city in that table. At that point you would certainly like to have an address id instead. But you can add that when needed. For now you can do without.
It's okay - you might want to add a non-unique index on id_company so company address queries are sped up. Another option would be a joining table between Company and Address, but that would probably only be justified if Address stored more data (so searches would be slower).
This design is fine.
A (relational) table always has a (candidate) key. (One of which you can choose as the primary key, but candidate keys, aka keys, are what matter.) Because if no subset of columns smaller than set of all columns is unique then the key is the set of all columns.
Since every table has one, in SQL you should declare it. Eg in SQL if you want to declare a FOREIGN KEY constraint to the key of this table then you have to declare that column set a key via PRIMARY KEY, KEY or UNIQUE. Also, telling the DBMS what you know helps optimize your use of it.
What matters to determining keys are subsets of columns that are unique that don't have smaller subsets that are unique. Those are the keys.
A company, address or city is not unique since you are going to have multiple of each.
A (city,address) is not unique normally.
A (city,company) is not unique normally.
A (company,address) is not unique normally.
So (company,address,city) is the (only) (candidate) key.
Note that if there were only ever one city, then (company,address) would be the key. And if there were only ever one company, then (address,city) would be the key. So your given reason that the "because every address[+city?] of a company [?] will be different" isn't sound unless we're supposed to assume other things.
I'm making this an answer instead of a comment because of length. As to the address table having a defined primary key, the answer is yes. There are several good reasons but just consider this one.
Suppose a company had several addresses and a move required you to delete one of the addresses. You can't just delete where comp_id = x as that would delete all the addresses for that company. You have to have where comp_id = x and something_else where the something else must differentiate the one address from all the others for that company. So you have to have someone look at the different addresses to see how they differ and select the one difference that correctly identifies the one address and then write that correctly into the where clause.
That's a lot of work to do every time you want to delete (or update) an address.
It also means it's more difficult to write a parameterized delete statement that can be used to delete any address. Suppose a company has several locations in the same building: Shipping in Suite 101, Marketing in Suite 202 and IT in (of course) the basement. So the street, city, state, everything is the same, different only in Suite_No or whatever is used to refine the address.
Then consider your user. Most of the time, a user isn't going to be interested in seeing every single address you have listed for a company. He's only interested in Product Testing. You should be able to give them Product Testing's address and no other. Users are not known for their patience when presented with a data dump every time they do a query and it's up to them to select the one they're looking for.
It just solves so many problems to be able to specify where addr_id = x.
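A minimal sketch of the parameterized delete described here, assuming SQLite via Python's sqlite3. The column names (addr_id, comp_id, street, suite, department) and the rows are illustrative only, loosely following the suite example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE addresses (
        addr_id    INTEGER PRIMARY KEY,
        comp_id    INTEGER NOT NULL,
        street     TEXT NOT NULL,
        suite      TEXT,
        department TEXT
    );
    -- Same company, same building, differing only in the suite.
    INSERT INTO addresses VALUES
        (1, 7, '1 Main St', 'Suite 101', 'Shipping'),
        (2, 7, '1 Main St', 'Suite 202', 'Marketing'),
        (3, 7, '1 Main St', 'Basement',  'IT');
""")

def delete_address(addr_id):
    """One statement works for any address; returns the number of rows removed."""
    cur = conn.execute("DELETE FROM addresses WHERE addr_id = ?", (addr_id,))
    return cur.rowcount

deleted = delete_address(2)
remaining = [
    dept for (dept,) in conn.execute(
        "SELECT department FROM addresses WHERE comp_id = 7 ORDER BY addr_id"
    )
]
```

Nobody has to inspect the rows to find out which column happens to tell the three addresses apart.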
An address is a thing and should have its own table.
An address can exist without a company, therefore it should not have a foreign key to company. Also, what if you start selling to/buying from individuals?
A company can have zero, one, or many addresses.
Two or more companies can have the exact same address. Your assumption is flawed.
Use a junction table:
company -< company_address >- address

Database structure, one big entity for multiple entities

Suppose that I have a store-website where user can leave comments about any product.
Suppose that I have tables(entities) in my website database: let it be 'Shoes', 'Hats' and 'Skates'.
I don't want to create separate "comments" table for every entity (like 'shoes_comments', 'hats_comments', 'skates_comments').
My idea is to somehow store all the comments in one big table.
One way to do this, that I thought of, is to create a table:
table (comments):
ID (int, Primary Key),
comment (text),
Product_id (int),
isSkates (boolean),
isShoes (boolean),
isHats (boolean)
and like flag for every entity that could have comments.
Then when I want to get comments for some product the SELECT query would look like:
SELECT comment
FROM comments, ___SOMETABLE___
WHERE ____SOMEFLAG____ = TRUE
AND ___SOMETABLE___.ID = comments.Product_id
Is this an efficient way to implement database for needed functionality?
What other ways can I do this?
Sorry, this feels odd.
Do you indeed have one separate table for each product type? Don't they have common fields (e.g. name, description, price, product image, etc.)?
My recommendation as for tables: product for common fields, comments with foreign key to product but no hasX columns, hat with only the fields that are specific to the hat product line. The primary key in hat is either the product PK or an individual unique value (then you'd need an extra field for the foreign key to product).
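One way this recommendation could look, sketched with SQLite via Python's sqlite3. The specific columns (price, brim_width_cm, shoe_size) are invented for illustration, and the hat and shoe tables use the product PK as their own primary key, which is the first of the two options mentioned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE product (               -- fields common to every product
        id    INTEGER PRIMARY KEY,
        name  TEXT NOT NULL,
        price REAL NOT NULL
    );
    CREATE TABLE hat (                   -- hat-specific fields only
        product_id    INTEGER PRIMARY KEY REFERENCES product(id),
        brim_width_cm REAL
    );
    CREATE TABLE shoe (
        product_id INTEGER PRIMARY KEY REFERENCES product(id),
        shoe_size  INTEGER
    );
    CREATE TABLE comments (              -- one table, no per-type flags
        id         INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL REFERENCES product(id),
        comment    TEXT NOT NULL
    );
    INSERT INTO product VALUES (1, 'Fedora', 30.0), (2, 'Sneaker', 60.0);
    INSERT INTO hat VALUES (1, 6.5);
    INSERT INTO shoe VALUES (2, 42);
    INSERT INTO comments (product_id, comment) VALUES
        (1, 'Nice hat'), (2, 'Runs small'), (1, 'Fits well');
""")

def comments_for(product_id):
    """Comments are found the same way regardless of the product type."""
    rows = conn.execute(
        "SELECT comment FROM comments WHERE product_id = ? ORDER BY id",
        (product_id,),
    ).fetchall()
    return [text for (text,) in rows]

hat_comments = comments_for(1)
```

A new product type then needs only its own detail table; the comments table and the query stay unchanged.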
I would recommend you to make one table for the comments and use a foreign key of other tables in the comments table.
The "normalized" way to do this is to add one more entity (say, "Product") that groups all characteristics common to shoes, hats and skates (including comments)
              +-- 0..1 [Shoe]
              |
[Product] 1 --+-- 0..1 [Hat]
    1         |
    |         +-- 0..1 [Skate]
    *
[Comment]
Besides performance considerations, the drawback here is that there is nothing in the data model preventing a row in Product from being referenced both by a row in Shoe and by one in Hat.
There are other alternatives too (each with perks & flaws) - you might want to read something about "jpa inheritance strategies" - you'll find java-specific articles that discuss your same issue (just ignore the java babbling and read the rest)
Personally, I often end up using a single table for all entities in a hierarchy (shoes, hats and skates in our case) and sacrificing constraints on the altar of performance and simplicity (eg: not null in a field that is mandatory for shoes but not for hats and skates).

trying to determine unique identifier for database table

I have a database table with many columns and there is no specified primary key. There isn't a list of super keys either. Besides iteratively trying all candidate keys/columns, is there a way for me, using SQL, to try and figure out whether a subset of columns can make a unique identifier for my table?
For example, a table may have 4 columns first name, last name, address and zip and the data I see is:
John, Smith, 1 main st, 00001
Mary, Smith, 1 main st, 00001
Mary, Smith, 2 sub st, 00002
In this case, I'll need first, last and zip as my unique key.
John, Smith, 1 main st, 00001
John, Smith, 1 main st, 00001
In this case, there is no unique key.
Please don't comment on my table construction and/or normalization of databases, I'm just trying to find a practical answer. Thanks.
This is my question: Besides iteratively trying all candidate keys/columns, is there a way for me, using SQL, to try and figure out whether a subset of columns can make a unique identifier for my table?
Looking for a subset of unique values in this case seems so specific to the particular data set. What if you arrive at a subset today and find you can't insert a new row tomorrow?
Use an artificial key, like an auto-incrementing integer.
In short: no, there's no way to do this in T-SQL really.
My advice: just add a ID INT IDENTITY PRIMARY KEY column to the table. It's guaranteed to be unique, it will be filled automagically when you create it, it's fast and easy, no messy "is this really unique or are there any combinations of rows that violate the uniqueness" questions......
Just do it - it's the easiest way to go!!
You cannot find if a combination "can" make a primary key. You can find if one WILL make a good primary key for an existing set of data.
To find out whether a set of fields is a candidate key, you can count the distinct combinations of those fields (using GROUP BY with ROLLUP) and compare that with COUNT(*).
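The comparison can be sketched as follows, assuming SQLite via Python's sqlite3. It uses a SELECT DISTINCT subquery rather than ROLLUP, which is a swapped-in but equivalent way to count distinct combinations; the table name people and the column names are invented, and the rows come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (first_name TEXT, last_name TEXT, address TEXT, zip TEXT);
    INSERT INTO people VALUES
        ('John', 'Smith', '1 main st', '00001'),
        ('Mary', 'Smith', '1 main st', '00001'),
        ('Mary', 'Smith', '2 sub st',  '00002');
""")

def is_unique(table, columns):
    """True if no two rows of the current data share the same values in `columns`.

    Table and column names are interpolated into the SQL, so they must come
    from a trusted source such as the schema, never from user input.
    """
    cols = ", ".join(columns)
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    distinct = conn.execute(
        f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table})"
    ).fetchone()[0]
    return total == distinct

first_last_zip = is_unique("people", ["first_name", "last_name", "zip"])
last_zip = is_unique("people", ["last_name", "zip"])
```

As the answer says, a positive result only describes the rows that exist today; it does not prove that the columns will stay unique.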
There is a much faster method.
Enterprise dbms have had it for many years but MS SQL Server 2005 (useable in 2008) and later provided the HashBytes() function. Convert the columns to CHAR() (VARCHAR on MS), concatenate them; then hash them; then compare the hashes. You can compare the two tables in a single SELECT command. IIRC max 8000 characters per row.
(If you use this answer, please undo and redo your Answer choice.)
if you are comparing two databases, then you can see if any duplicate rows exist in the source db with structures like this:
select a,b,c,d
from mytable
group by a,b,c,d
having count(*) > 1
include all columns.
then use all columns as the 'row key' to see if it exists in the target system
There are update anomalies in this schema:
you cannot add a person without knowing his address.
A better approach is to separate it into three tables: one for persons, one for addresses and one for PersonAddress:
> persons: id, firstname, lastname
> address: id, address
> personaddress: personid, addressid
You cannot find if a combination "can" make a primary key.
I actually disagree with this, I think it is possible to write a query that will SELECT all possible permutations of columns from the table and combine each permutation into a single unique value (the simplest, crudest way is to CAST them all to VARCHAR and connect them with a spacer character - a better way would be some kind of hash function).
With a single pass you would then have set of columns like P1, P12, P123, P2, P23, P3 etc (in case of three columns). Then you can do a query with COUNT(*) vs COUNT(DISTINCT) for each permutation column and you will see which permutations are unique.
Using dynamic SQL you could probably make it so that it would work on any table, although I don't know about the column limit for SQL Server.
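A rough sketch of that brute-force idea, assuming SQLite driven from Python's sqlite3 instead of T-SQL dynamic SQL. It tries combinations rather than permutations, since column order does not affect uniqueness, and it issues one COUNT query per subset instead of building hashed columns in a single pass. The table name people and the column names are invented; the rows are the question's sample data.

```python
import sqlite3
from itertools import combinations

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE people (first_name TEXT, last_name TEXT, address TEXT, zip TEXT);
    INSERT INTO people VALUES
        ('John', 'Smith', '1 main st', '00001'),
        ('Mary', 'Smith', '1 main st', '00001'),
        ('Mary', 'Smith', '2 sub st',  '00002');
""")

def minimal_unique_sets(table, columns):
    """Column subsets that are unique in the current data and have no unique subset.

    The number of subsets grows as 2**len(columns), so this is only practical
    for narrow tables. Names are interpolated into SQL and must be trusted.
    """
    total = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
    found = []
    for size in range(1, len(columns) + 1):
        for subset in combinations(columns, size):
            # Skip supersets of a key that was already found.
            if any(set(key) <= set(subset) for key in found):
                continue
            cols = ", ".join(subset)
            distinct = conn.execute(
                f"SELECT COUNT(*) FROM (SELECT DISTINCT {cols} FROM {table})"
            ).fetchone()[0]
            if distinct == total:
                found.append(subset)
    return found

keys = minimal_unique_sets("people", ["first_name", "last_name", "address", "zip"])
print(keys)
```

If the table contains fully duplicated rows, as in the question's second example, the function returns an empty list because not even the full column set is unique.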