One lookup table or many lookup tables? [closed] - sql

I need to save basic member data with additional attributes such as gender, education, profession, marital_status, height, residency_status, etc.
I have around 15-18 lookup tables, all having (id, name, value); all attributes have string values.
Shall I create a members table tbl_members and 15-18 separate lookup tables, one for each of the above attributes:
tbl_members:
mem_Id
mem_email
mem_password
Gender_Id
education_Id
profession_id
marital_status_Id
height_Id
residency_status_Id
or shall I create only one lookup table tbl_Attributes and tbl_Attribute_Types?
tbl_Attributes:
att_Id
att_Value
att_Type_Id
Example data:
001 - Male - 001
002 - Female - 001
003 - Graduate - 002
004 - Masters - 002
005 - Engineer - 003
006 - Designer - 003
tbl_Attribute_Types:
att_type_Id
att_type_name
Example data:
001 - Gender
002 - Education
003 - Profession
To fill look-up drop-downs I can select something like:
SELECT A.att_id, A.att_value, AT.att_type_name
FROM tbl_Attributes A
INNER JOIN tbl_Attribute_Types AT ON AT.att_type_Id = A.att_type_Id
WHERE att_Type_Id = #att_Type_Id
and an additional table tbl_mem_att_value to save members' attributes and values
tbl_mem_att_value:
mem_id
att_id
Example data for member 001, who is Male, Masters, Engineer:
001 - 001
001 - 004
001 - 005
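For concreteness, here is a minimal DDL sketch of the single-lookup-table variant described above (the data types are my assumptions):
create table tbl_Attribute_Types
(
    att_type_Id int primary key,
    att_type_name varchar(50) not null
)
create table tbl_Attributes
(
    att_Id int primary key,
    att_Value varchar(100) not null,
    att_Type_Id int not null references tbl_Attribute_Types(att_type_Id)
)
create table tbl_mem_att_value
(
    mem_id int not null,
    att_id int not null references tbl_Attributes(att_Id),
    primary key (mem_id, att_id)
    -- note: nothing here stops a member from having, say, two genders;
    -- that weakness comes up in the answers below
)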
So my question is: shall I go for one lookup table or many lookup tables?
Thanks

Never use one lookup table for everything. It will make it more difficult to find things, and it will need to be joined into every query, probably multiple times, which can cause locking and blocking problems. Further, in one table you can't use good design to make sure the data type for the descriptor is correct. For instance, suppose you wanted a lookup of state abbreviations, which are two characters. If you use a one-size-fits-all table, then it has to be wide enough for the largest possible value of any lookup, and you lose the possibility of it rejecting an incorrect entry because it is too long. This is a guarantee of later data integrity issues.
Further you can't properly use foreign keys to make sure data entry is limited only to the correct values. This will also cause data integrity issues.
There is NO BENEFIT whatsoever to using one table except a few minutes of dev time (possibly the least important concern in designing a database). There are plenty of negatives.
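To illustrate the sizing and foreign key points, a minimal sketch (the table and column names are mine, not from the question):
create table state_lookup
(
    state_code char(2) primary key, -- exactly two characters, nothing longer gets in
    state_name varchar(50) not null
)
create table member_address
(
    address_id int primary key,
    street varchar(100),
    state_code char(2) not null references state_lookup(state_code)
)
-- 'Texas' would be rejected as too long for char(2),
-- and 'ZZ' would be rejected by the foreign key.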

The primary reason for using multiple lookup tables is that you can then enforce foreign key constraints. This is quite important for maintaining relational integrity.
The primary reason for using a single lookup table is so you have all the string values in one place. This can be very useful for internationalization of the software.
In general, I would go with separate reference tables, because relational integrity is generally a more important concern than internationalization.
There are secondary considerations. Many different reference tables are going to occupy more space than a single reference table -- with most of the pages being empty (how much space do you really need to store the gender lookup information?). However, with a relatively small number of reference tables, this is actually a pretty minor concern.
Another consideration in favor of a single table is that all the reference keys will have different values, which can prevent accidental joins to the wrong table. However, I avoid that problem anyway by naming join keys the same for both the primary key and the foreign key. So GenderId would be the primary key in Gender as well as the foreign key column.
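A quick sketch of that naming convention (my own definitions, not from the answer):
create table Gender
(
    GenderId int primary key,
    Name varchar(20) not null
)
create table Member
(
    MemberId int primary key,
    Email varchar(100) not null,
    GenderId int references Gender(GenderId) -- same name as the primary key column
)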

I've struggled with the same question myself. If the only thing in the lookup table is some sort of code or id and a text value, then it certainly works to just add an "attribute id" and throw it all in one table. The obvious advantage is that you then have only one table to create and manage. Searches might possibly be slower because there are more records to search, but presumably you create an index on attribute id + value id. At that point, whether performance is better with one big table or ten small tables probably depends on all sorts of details about how the database engine works and the pattern of access. That's a case where I'd say: unless in practice it proves to be a problem, don't worry about it.
Two caveats:
One: If you do create a single table, I'd create a code for the attribute name, and then another table to list the codes. Like:
lookup_attribute(attribute_id, attribute_name)
lookup_value(attribute_id, value_id, value_text)
Then the first table has records like
1, 'Gender'
2, 'Marital Status'
3, 'Education'
etc
And the second is
1, 1, 'Male'
1, 2, 'Female'
1, 3, 'Undecided'
2, 1, 'Single'
2, 2, 'Married'
2, 3, 'Divorced'
2, 4, 'Widowed'
3, 1, 'High School'
3, 2, 'Associates'
3, 3, 'Bachelors'
3, 4, 'Masters'
3, 5, 'Doctorate'
3, 6, 'Other'
etc.
(The value id could be unique for all attribute ids or it might only be unique within the attribute id, whatever works for you. It shouldn't matter.)
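A minimal DDL sketch of those two tables, assuming the value id is unique only within its attribute (the data types are my guesses):
create table lookup_attribute
(
    attribute_id int primary key,
    attribute_name varchar(50) not null
)
create table lookup_value
(
    attribute_id int not null references lookup_attribute(attribute_id),
    value_id int not null,
    value_text varchar(100) not null,
    primary key (attribute_id, value_id) -- value_id can repeat across attributes
)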
Two: If there is other data you need to store for some attribute besides just the text of a value, then break that out into a separate table. Say you had an attribute for "Membership Level", and then the user says there are different dues for each level that you need to record; now you have an extra field that applies to only this one attribute, and at that point it should become its own table. I've seen systems where they have a couple of pieces of extra data for each of several attributes, so they create a field called "extra data" or some such: for "membership level" it holds annual dues, for "store name" it holds the city where the store is, for "item number" it holds the number of units on hand of that item, and so on. The system quickly becomes a nightmare to manage.
Update
To retrieve values, let's suppose we have just gender and marital status as lookups. The principle is the same for any others.
So we have the monster lookup table as described above. Then we have the member table with, say
member (member_id, name, member_number, whatever, gender_id, marital_status_id)
To retrieve you just write
select m.member_id, m.name, m.member_number, m.whatever,
g.value_text as gender, ms.value_text as marital_status
from member m
join lookup_value g on g.attribute_id=1 and g.value_id=m.gender_id
join lookup_value ms on ms.attribute_id=2 and ms.value_id=m.marital_status_id
where m.member_id=#member_id
You could, alternatively, have:
member (member_id, name, member_number, whatever)
member_attributes (member_id, attribute_id, value_id)
Then you can get all the attributes with:
select a.attribute_name, v.value_text
from member_attributes ma
join lookup_attribute a on a.attribute_id=ma.attribute_id
join lookup_value v on v.attribute_id=a.attribute_id and v.value_id=ma.value_id
where ma.member_id=#member_id
It occurs to me as I try to write the queries that there's a distinct advantage to making the value id globally unique: Not only does that eliminate having to specify the attribute id in the join, but it also means that if you do have a field for, say, gender_id, you can still have a foreign key clause on it.
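A sketch of that globally-unique variant (my own DDL, following the answer's naming):
create table lookup_value
(
    value_id int primary key, -- globally unique across all attributes
    attribute_id int not null references lookup_attribute(attribute_id),
    value_text varchar(100) not null
)
-- Because value_id is globally unique, a plain column can carry a real foreign key:
create table member
(
    member_id int primary key,
    name varchar(100),
    gender_id int references lookup_value(value_id)
    -- caveat: the FK checks that the value exists in lookup_value, but it
    -- cannot restrict gender_id to gender values specifically
)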

Putting all the lookup values into a single table is usually referred to as Common Lookup Tables, or Massively Unified Code-Key (MUCK), and is generally considered a design error.
A good argument for why it's not a good idea can be found in the article below.
https://www.red-gate.com/simple-talk/sql/database-administration/five-simple-database-design-errors-you-should-avoid/

Related

Composite primary key: Finding one attribute using another

Data fields
I am designing a database table structure. Say that we need to record employee profiles from different companies. We have the following fields:
+---------+--------------+-----+--------+-----+
| Company | EmployeeName | Age | Gender | Tel |
+---------+--------------+-----+--------+-----+
It's possible that two employees from different companies may have the same name (and assume that no two employees have the same name in the same company). In this case a composite primary key (Company, EmployeeName) would be necessary, in my opinion.
Search
Now I need to get all information by using only one of the 2 attributes in the primary key. For example,
I want to search all employees' profile of Company A:
SELECT EmployeeName, Age, Gender, Tel FROM table WHERE Company = 'Company A'
And I can also search all employees from different company named Donald:
SELECT Company, Age, Gender, Tel FROM table WHERE EmployeeName = 'Donald'
Strategy
In order to implement this requirement, my strategy would be to store all data in a single table, which is easy to read and understand. However, I noticed that searching may take a long time, as the query may need to iterate through all rows. I would like to retrieve this information as quickly as possible. Would there be a better strategy for this?
First, your rows should have a unique identifier for each row -- identity/auto-increment/serial, depending on the database. Second, you might reconsider names being unique. Why can't two people at the same company have the same name?
In any case, you have a primary key on, say, (company, name). For the opposite search you simply want another index on (name, company):
create index idx_profiles_name_company on profiles(name, company);
A note explaining Gordon's suggestion for an identity on each row. This is supplemental to his answer above.
In theory there is nothing wrong with a primary key that crosses columns, and in a db like PostgreSQL I like to have identity values as secondary keys (i.e. not null unique) and specify natural primary keys. Of course, on MS SQL Server or MySQL/InnoDB that would be a recipe for problems. I would also not say "all" but rather "almost all", since there are times when breaking this rule is good.
Regardless, having an identity column simplifies a couple of things, and it provides an abstraction around keys in case you get things wrong. Composite keys present a couple of issues that end up eating time (and possibly resulting in downtime) later. These include:
Joins on composite keys are often more expensive than those on simple values, and
Adding or changing a natural primary key which crosses columns is far harder when joins are involved
So depending on your db you should either specify a unique secondary key or make your natural primary key separate (which one you should do depends on storage and implementation specifics).
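A minimal sketch of that pattern in PostgreSQL (the identity column is a secondary key; the names are my own):
create table profiles
(
    profile_id int generated always as identity unique, -- surrogate secondary key
    company varchar(100) not null,
    name varchar(100) not null,
    age int,
    gender varchar(10),
    tel varchar(30),
    primary key (company, name) -- natural composite primary key
);
-- The reverse search still needs its own index:
create index idx_profiles_name_company on profiles(name, company);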

Why is this table not normalized?

I am taking a database course and I am studying table normalization.
Could anyone explain to me why the second table in the first row on the right is not normalized?
It is not normalized because
For a student who has signed up for more than one course, the entries in the table will be:
23 Jake Smith CS101 B+
23 Jake Smith B102 C+
Clearly the data is being repeated (redundant data), which leads to insert, update, and delete anomalies.
For example, when you have to change the name of a student, say Jake Smith, you have to modify all of his rows; this is called an update anomaly.
Normalization is used to avoid these kinds of anomalies and redundant data storage.
The table on the right-hand side in the second row handles this situation in a better way: since it stores id, name and DOB in a separate table, edits can be made easily on a single row using the id attribute.
There are several normal forms, like 1NF, 2NF, 3NF etc. Each normal form has some constraints associated with it, and each higher form is stricter than the previous one.
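A sketch of the normalized split described above (hypothetical names, since the original image isn't reproduced here):
-- Student identity lives in one place...
create table students
(
    student_id int primary key,
    name varchar(100) not null
)
-- ...and grades reference it, so renaming Jake Smith touches one row.
create table grades
(
    student_id int not null references students(student_id),
    course_code varchar(10) not null,
    grade varchar(2),
    primary key (student_id, course_code)
)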
I suppose it is a table for students' grades. It is not normalized because it contains student names directly, instead of references to student records.
It's better not to include student_name in this table, but to store all student data in a separate students table and reference it by a student_id foreign key (something like the first table in the second row, except for the ids).
It's not normalised because neither id nor student_name is the key (both have duplicates), so the key must be one of those (probably id) together with the course code. The other one (the name) then doesn't depend on that key, but just on id.
The simple rule for 3NF is that every non-key column must depend on "the key, the whole key, and nothing but the key" - to which we all solemnly intone "so help me Codd"!
The higher normal forms deal with dependencies inside the parts of a key.
Because in your first table on the right the values
23 - j.smith
appear twice; that repetition does not adhere to Codd's first normal form.

Is it better to have int joins instead of string columns?

Let's say I have a User which has a status and the user's status can be 'active', 'suspended' or 'inactive'.
Now, when creating the database, I was wondering... would it be better to have a column with the string value (with an enum type, or a rule applied) so it's easier both to query and to know the current user status, or are joins better, so that I should join to a UserStatuses table which contains the possible user statuses?
Assuming, of course, that statuses cannot be created by the application user.
Edit: Some clarification
I would NOT use string joins; it would be an int join to the UserStatuses PK
My primary concern is performance
The possible statuses ARE STATIC and will NEVER change
On most systems it makes little or no difference to performance. Personally I'd use a short string for clarity and join that to a table with more detail as you suggest.
create table intLookup
(
pk integer primary key,
value varchar(20) not null
)
insert into intLookup (pk, value) values
(1,'value 1'),
(2,'value 2'),
(3,'value 3'),
(4,'value 4')
create table stringLookup
(
pk varchar(4) primary key,
value varchar(20) not null
)
insert into stringLookup (pk, value) values
('1','value 1'),
('2','value 2'),
('3','value 3'),
('4','value 4')
create table masterData
(
stuff varchar(50),
fkInt integer references intLookup(pk),
fkString varchar(4) references stringLookup(pk)
)
create index i on masterData(fkInt)
create index s on masterData(fkString)
insert into masterData
(stuff, fkInt, fkString)
select COLUMN_NAME, (ORDINAL_POSITION %4)+1,(ORDINAL_POSITION %4)+1 from INFORMATION_SCHEMA.COLUMNS
go 1000
This results in 300K rows.
select
*
from masterData m inner join intLookup i on m.fkInt=i.pk
select
*
from masterData m inner join stringLookup s on m.fkString=s.pk
On my system (SQL Server)
- the query plans, I/O and CPU are identical
- execution times are identical.
- The lookup table is read and processed once (in either query)
There is NO difference using an int or a string.
I think, as a whole, everyone has hit on important components of the answer to your question. However, they all have good points which should be taken together, rather than separately.
As logixologist mentioned, a healthy amount of Normalization is generally considered to increase performance. However, in contrast to logixologist, I think your situation is the perfect time for normalization. Your problem seems to be one of normalization. In this case, using a numeric key as Santhosh suggested which then leads back to a code table containing the decodes for the statuses will result in less data being stored per record. This difference wouldn't show in a small Access database, but it would likely show in a table with millions of records, each with a status.
As David Aldridge suggested, you might find that normalizing this particular data point will result in a more controlled end-user experience. Normalizing the status field will also allow you to edit the status flag at a later date in one location and have that change perpetuated throughout the database. If your boss is like mine, then you might have to change the Status of Inactive to Closed (and then back again next week!), which would be more work if the status field was not normalized. By normalizing, it's also easier to enforce referential integrity. If a status key is not in the Status code table, then it can't be added to your main table.
If you're concerned about the performance when querying in the future, then there are some different things to consider. To pull back status, if it's normalized, you'll be adding a join to your query. That join will probably not hurt you in any sized recordset but I believe it will help in larger recordsets by limiting the amount of raw text that must be handled. If your primary concern is performance when querying the data, here's a great resource on how to optimize queries: http://www.sql-server-performance.com/2007/t-sql-where/ and I think you'll find that a lot of the rules discussed here will also apply to any inclusion criteria you enforce in the join itself.
Hope this helps!
Christopher
The whole idea behind normalization is to keep the data from repeating (well, at least one of the concepts).
In this case a user can have only one status at a time (I assume), so there is no reason to put it in its own table; you would simply complicate things. The only reason you would have a separate table is if for some reason these statuses were not static, meaning next month you may add "Sort of Active" and "Maybe Inactive". That would mean changing code to make up for it if you didn't put them in their own table. You could create a maintenance page where users could add statuses, and then that would require you to create a separate table.
An issue to consider is whether these status values have attributes of their own.
For example, perhaps you would want to have a default sort order that is different from the alphabetical order of the status text. You might also want to treat two of the statuses in a particular way that you do not treat the others, and that could be an attribute.
If you have a need for that, or suspect a future need for that, then move the status text to a different table and use an integer key value for them.
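A sketch of a status table carrying its own attributes (the column names are my invention):
create table UserStatuses
(
    status_id int primary key,
    status_text varchar(20) not null,
    sort_order int not null, -- display order independent of the text
    can_log_in bit not null -- example of per-status behavior
)
insert into UserStatuses (status_id, status_text, sort_order, can_log_in) values
(1, 'active', 1, 1),
(2, 'suspended', 3, 0),
(3, 'inactive', 2, 0)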
I would suggest using integer values like 0, 1, 2 if the set is fixed. When interpreting the results in reports, we can translate these statuses back to strings.
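For example, a report could decode the integers inline (a sketch; the Users table and the codes are illustrative):
select user_id,
    case status
        when 0 then 'active'
        when 1 then 'suspended'
        when 2 then 'inactive'
    end as status_text
from Users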

How to create history fact table?

I have some entities in my Data Warehouse:
Person - with attributes personId, dateFrom, dateTo, and others that can be changed, e.g. last name, birth date and so on - a slowly changing dimension
Document - documentId, number, type
Address - addressId, city, street, house, flat
The relation between Person and Document is One-To-Many, and between Person and Address is Many-To-Many.
My target is to create history fact table that can answer us following questions:
1. What persons with what documents lived at a defined address on a defined date?
2. What history of residents does a defined address have over a defined interval of time?
This is not the only thing a DW is designed for, but I think it is the hardest part of DW design.
For example, Miss Brown with personId=1 and documents with documentId=1 and documentId=2 lived at the address with addressId=1 from 01/01/2005 to 02/02/2010, and then moved to addressId=2, where she has lived from 02/03/2010 to the current date (NULL?). But she changed her last name to Green on 04/05/2006, and replaced her first document (documentId=1) with documentId=3 on 06/07/2007. Mr Black with personId=2 and documentId=4 has lived at addressId=1 from 02/03/2010 to the current date.
The expected result of our query for question 2, where addressId=1 and the time interval is from 01/01/2000 to now, must be like:
Rows:
last_name="Brown", documentId=1, dateFrom=01/01/2005, dateTo=04/04/2006
last_name="Brown", documentId=2, dateFrom=01/01/2005, dateTo=04/04/2006
last_name="Green", documentId=1, dateFrom=04/05/2006, dateTo=06/06/2007
last_name="Green", documentId=2, dateFrom=04/05/2006, dateTo=06/06/2007
last_name="Green", documentId=2, dateFrom=06/07/2007, dateTo=02/01/2010
last_name="Green", documentId=3, dateFrom=06/07/2007, dateTo=02/01/2010
last_name="Black", documentId=4, dateFrom=02/03/2010, dateTo=NULL
I had an idea to create a fact table with the composite key (personId, documentId, addressId, dateFrom), but I have no idea how to load this table and then get the expected result with this structure.
I will be grateful for any help!
Interesting question, @Argnist!
So to create some common language for my example, you want a
DimPerson (PK=kcPerson, surrogate key for unique Persons=kPerson, type 2 dim)
DimDocument (PK=kcDocument, surrogate key for unique Documents=kDocument, type 2 dim)
DimAddress (PK=kcAddress, surrogate key for unique Addresses=kAddress, type 2 dim)
A colleague has written a short blog on the usage of two surrogate keys to explain the above dims: 'Using Two Surrogate Keys on Dimensions'.
I would always add
DimDate with PK in the form yyyymmdd
to any data warehouse with extra attribute columns.
Then you would have your fact table as
FactHistory (FKs=kcPerson, kPerson, kcDocument, kDocument, kcAddress, kAddress, kDate)
plus any additional measures.
Then joining on the "kc"s you can show the current Person/Document/Address dimension information.
If you joined on the "k"s you can show the historic Person/Document/Address dimension information.
The downside of this is that this fact table needs one row for each person/document/address/date combination. But it really is a very narrow table, since the table just has a number of foreign keys.
The advantage of this is it is very easy to query for the sorts of questions you were asking.
Alternatively, you could have your fact table as
FactHistory (FKs=kcPerson, kPerson, kcDocument, kDocument, kcAddress, kAddress, kDateFrom, kDateTo)
plus any additional measures.
This is obviously much more compact, but the querying becomes more complex. You could also put a view over the Fact table to make it easier to query!
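A sketch of such a view (the column names are my assumptions, following the answer's "k"/"kc" convention):
-- Re-expose the compact from/to fact with historic ("k") attributes joined in:
create view vResidencyHistory as
select f.kcAddress,
    p.last_name,
    d.document_number,
    f.kDateFrom,
    f.kDateTo
from FactHistory f
join DimPerson p on p.kPerson = f.kPerson
join DimDocument d on d.kDocument = f.kDocument
-- Question 2 then becomes a simple range filter, e.g.:
-- select * from vResidencyHistory
-- where kcAddress = 1
--   and kDateFrom <= '2023-12-31'
--   and (kDateTo is null or kDateTo >= '2000-01-01')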
The choice of solution depends on the frequency of change of the data. I suspect that it will not be changing that quickly, so the alternate design of the fact table may be better.
Hope that helps.

Basic question: how to properly redesign this schema

I am hopping onto a project that sits on top of a SQL Server 2008 DB with what seems to me like an inefficient schema. However, I'm not an expert at SQL, so I am seeking guidance.
In general, the schema has tables like this:
ID | A | B
ID is a unique identifier
A contains text, such as animal names. There's very little variety; maybe 3-4 different values in thousands of rows. This could vary with time, but still a small set.
B is one of two options, but stored as text. The set is finite.
My questions are as follows:
Should I create another table for names contained in A, with an ID and a value, and set the ID as the primary key? Or should I just put an index on that column in my table? Right now, to get a list of A's, it does "select distinct(a) from table" which seems inefficient to me.
The table has a multitude of columns for properties of A. It could be like: Color, Age, Weight, etc. I would think that this is better suited to a separate table with: ID, AnimalID, Property, Value. Each property is unique to the animal, so I'm not sure how that schema could enforce this (the current schema implies it, since each property is a column, so you can only have one value for each property).
Right now the DB is easily readable by a human, but its size is growing fast and I feel like the design is inefficient. There are currently no indexes at all. As I said, I'm not a pro, but I will read more on the subject. The goal is to have a fast system. Thanks for your advice!
This sounds like a database that might represent a veterinary clinic.
If the table you describe represents the various patients (animals) that come to the clinic, then properties specific to them are probably best kept on the primary table. But since, as you say, column "A" contains a species name, it might be worthwhile to link that to a secondary table to save on the redundancy of storing those names:
For example:
Patients
--------
ID Name SpeciesID Color DOB Weight
1 Spot 1 Black/White 2008-01-01 20
Species
-------
ID Species
1 Cocker Spaniel
If your main table should instead be grouped by customer or owner, then you may want to add an Animals table and link it:
Customers
---------
ID Name
1 John Q. Sample
Animals
-------
ID CustomerID SpeciesID Name Color DOB Weight
1 1 1 Spot Black/White 2008-01-01 20
...
As for your original column B, consider converting it to a boolean (BIT) if you only need to store two states. Barring that, consider CHAR to store a fixed number of characters.
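A minimal sketch of that shape (the names are my own; BIT is the SQL Server boolean type):
create table Species
(
    ID int identity primary key,
    Species varchar(50) not null
)
create table Animals
(
    ID int identity primary key,
    Name varchar(50) not null,
    SpeciesID int not null references Species(ID),
    B bit not null -- the two-state column from the question
)
-- The distinct list of species becomes a cheap scan of a tiny table:
select Species from Species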
Like most things, it depends.
Having the animal names directly in the table makes your reporting queries more efficient by removing the need for many joins.
Going with something like 3rd normal form (having an ID/Name table for the animals) makes your database smaller, but requires more joins for reporting.
Either way, make sure to add some indexes.