Preserve old record values in relational DB - sql

What is the best approach for the following situation?
Assume, that we have the following tables:
Companies:
COMPANY_ID
COMPANY_NAME
People:
PERSON_ID
PERSON_NAME
FK_COMPANY_ID
Meetings:
MEETING_ID
MEETING_NAME
MEETING_DATE
FK1_PERSON_ID
FK2_PERSON_ID
We have the following data:
/* Companies */
1, Company #1
2, Company #2
3, Company #3
/* People */
1, Vasya Pupkin, 1
2, Petya Vasechkin, 2
/* Meetings */
1, Meeting #1, 2014-02-01, 1, 2
Now imagine that the person from Company #1 moves to Company #3. Because the meeting reaches the company through the person, the meeting view will reflect the change and show data that is wrong for the past (the person worked for a different company at the time of the meeting).
So my question is: how do I preserve the view data for a meeting that has already passed?
Currently I see the following solution. Extend the Meetings table as follows:
MEETING_ID
MEETING_NAME
MEETING_DATE
FK1_PERSON_ID
FK2_PERSON_ID
FK1_COMPANY_ID
COMPANY_NAME_1
FK2_COMPANY_ID
COMPANY_NAME_2
That way, for historical purposes, I can select COMPANY_NAME_1 and COMPANY_NAME_2 instead of looking up the person's current company. Although this would even be faster than following the relation, it moves me away from a relational design, and because the same thing is needed in many places of a big project, I would end up with a lot of duplicated data.
Therefore, are there any better solutions for this problem?
UPDATE:
I see one possible solution in terms of relational DB. We need to introduce a new table:
Employments:
EMPLOYMENT_ID
FK_PERSON_ID
FK_COMPANY_ID
EMPLOYMENT_DATE
Assuming that a person can work for only one company at a time, we can tell for sure which company he worked for at the time of the meeting. However, a small issue remains: the company name (if the company was renamed) will be the latest one. This can be fixed with a similar approach (if needed):
Company_Names:
FK_COMPANY_ID
COMPANY_NAME
ACTUAL_FROM_DATE
It seems that this adds some complexity, but also flexibility.
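A minimal sketch of how the Employments table could answer "which company did this person work for at the time of the meeting?". It assumes SQLite through Python's sqlite3 module, ISO-formatted date strings, and invented employment dates (2013-01-01 and 2014-03-01 are not from the question). FK_COMPANY_ID is left out of People here, on the assumption that Employments replaces it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Companies (COMPANY_ID INTEGER PRIMARY KEY, COMPANY_NAME TEXT);
CREATE TABLE People (PERSON_ID INTEGER PRIMARY KEY, PERSON_NAME TEXT);
CREATE TABLE Employments (
    EMPLOYMENT_ID INTEGER PRIMARY KEY,
    FK_PERSON_ID INTEGER REFERENCES People(PERSON_ID),
    FK_COMPANY_ID INTEGER REFERENCES Companies(COMPANY_ID),
    EMPLOYMENT_DATE TEXT  -- ISO date on which the employment started
);
CREATE TABLE Meetings (
    MEETING_ID INTEGER PRIMARY KEY, MEETING_NAME TEXT, MEETING_DATE TEXT,
    FK1_PERSON_ID INTEGER, FK2_PERSON_ID INTEGER
);
INSERT INTO Companies VALUES (1,'Company #1'),(2,'Company #2'),(3,'Company #3');
INSERT INTO People VALUES (1,'Vasya Pupkin'),(2,'Petya Vasechkin');
-- Employment dates below are assumptions made up for this example.
INSERT INTO Employments VALUES
    (1, 1, 1, '2013-01-01'),
    (2, 2, 2, '2013-01-01'),
    (3, 1, 3, '2014-03-01');  -- person 1 moves to Company #3 after the meeting
INSERT INTO Meetings VALUES (1,'Meeting #1','2014-02-01',1,2);
""")

# Pick the latest employment that started on or before the meeting date.
AS_OF = """
SELECT c.COMPANY_NAME
FROM Meetings m
JOIN Employments e ON e.FK_PERSON_ID = m.FK1_PERSON_ID
JOIN Companies c ON c.COMPANY_ID = e.FK_COMPANY_ID
WHERE m.MEETING_ID = ?
  AND e.EMPLOYMENT_DATE = (
      SELECT MAX(e2.EMPLOYMENT_DATE)
      FROM Employments e2
      WHERE e2.FK_PERSON_ID = m.FK1_PERSON_ID
        AND e2.EMPLOYMENT_DATE <= m.MEETING_DATE)
"""
company_at_meeting = conn.execute(AS_OF, (1,)).fetchone()[0]
print(company_at_meeting)
```

The same subquery pattern would apply to Company_Names, using ACTUAL_FROM_DATE in place of EMPLOYMENT_DATE.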

This is a standard problem in real-world databases. Your solution looks good. Keep in mind that to find out where someone works now, you will need to do something like:
select * from employment where person_id=123 and
employment_date=(select max(employment_date)
from employment where person_id = 123)
That might be OK, but there are other solutions. For example, you can add a last_known column to employment, and then the query for finding where a person works becomes:
select * from employment where person_id=123 and
last_known=1
There is a question https://stackoverflow.com/a/19144370/4350148 that may also be helpful.
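One way the last_known flag could be kept consistent is to clear the old flag and insert the new row in a single transaction. This is only a sketch, assuming SQLite through Python's sqlite3 module; the helper name change_company, the column list and the sample dates are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE employment (
    person_id INTEGER,
    company_id INTEGER,
    employment_date TEXT,
    last_known INTEGER NOT NULL DEFAULT 0)""")

def change_company(conn, person_id, company_id, date):
    """Record a new employment and move the last_known flag to it."""
    with conn:  # both statements commit together or roll back together
        conn.execute(
            "UPDATE employment SET last_known = 0 WHERE person_id = ?",
            (person_id,))
        conn.execute(
            "INSERT INTO employment VALUES (?, ?, ?, 1)",
            (person_id, company_id, date))

change_company(conn, 123, 1, "2013-01-01")
change_company(conn, 123, 3, "2014-03-01")

# The history is still complete, and the current row is found without MAX().
current = conn.execute(
    "SELECT company_id FROM employment WHERE person_id = 123 AND last_known = 1"
).fetchall()
history_count = conn.execute("SELECT COUNT(*) FROM employment").fetchone()[0]
```

The trade-off is that the flag is redundant with employment_date, so every writer has to go through something like this helper.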

Related

Matching Entities - Self many to many

Working with a new master data management product, specifically people matching.
I have two tables: Person and PersonMatch. PersonMatch is a join table that matches rows from Person to another row in Person.
Person: 1,2,3,4 (Per the PersonMatch, these are all the same Person).
PersonMatch: 1+2, 2+3, 3+4, 4+1
I can't wrap my head around a query to treat all four entities from the Person table as the same. Thanks for any help!
It sounds to me like you need a specific column, a PersonUniqueId. All the Person instances that are the same could keep their own unique id to identify their differences, but they would each share the same PersonUniqueId, which reflects that they are really the same person, just with different or additional subsequent info.
If you do not go with a PersonUniqueId, then you need to figure out what in each record identifies two persons as being the same: perhaps the name, or something else. Without knowing more about your overall structure it is hard to say what the best direction is, but hopefully this is a start.
It might be that the PersonMatch table is not even needed. Alternately, you could put the PersonUniqueId column in that table and then do something like:
SELECT pm.PersonUniqueId, p.*
FROM Person p
JOIN PersonMatch pm ON p.PersonId = pm.PersonId
ORDER BY pm.PersonUniqueId
Hope this helps.
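If the PersonMatch pairs have to stay as they are, a shared id can also be derived from them with a recursive query. The sketch below assumes SQLite through Python's sqlite3 module and hypothetical column names PersonId1 and PersonId2 for the pair table. It labels every person with the smallest PersonId reachable through the matches; person 5 is an extra unmatched row added for contrast.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (PersonId INTEGER PRIMARY KEY);
CREATE TABLE PersonMatch (PersonId1 INTEGER, PersonId2 INTEGER);
INSERT INTO Person VALUES (1),(2),(3),(4),(5);
INSERT INTO PersonMatch VALUES (1,2),(2,3),(3,4),(4,1);
""")

GROUPS = """
WITH RECURSIVE
-- Treat every match as symmetric.
edges(a, b) AS (
    SELECT PersonId1, PersonId2 FROM PersonMatch
    UNION
    SELECT PersonId2, PersonId1 FROM PersonMatch
),
-- Everything reachable from each person; UNION drops repeats, so cycles end.
reach(root, member) AS (
    SELECT PersonId, PersonId FROM Person
    UNION
    SELECT r.root, e.b FROM reach r JOIN edges e ON e.a = r.member
)
SELECT member AS PersonId, MIN(root) AS PersonUniqueId
FROM reach
GROUP BY member
ORDER BY member
"""
rows = conn.execute(GROUPS).fetchall()
for person_id, unique_id in rows:
    print(person_id, unique_id)
```

Persons 1 to 4 end up with the same PersonUniqueId, which could then be stored in a column as suggested above.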

One lookup table or many lookup tables? [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
I need to save basic member data with additional attributes such as gender, education, profession, marital_status, height, residency_status etc.
I have around 15-18 lookup tables, all of the form (id, name, value); all attributes have string values.
Should I create a members table tbl_members and 15-18 separate lookup tables, one for each of the above attributes:
tbl_members:
mem_Id
mem_email
mem_password
Gender_Id
education_Id
profession_id
marital_status_Id
height_Id
residency_status_Id
or shall I create only one lookup table tbl_Attributes and tbl_Attribute_Types?
tbl_Attributes:
att_Id
att_Value
att_Type_Id
Example data:
001 - Male - 001
002 - Female - 001
003 - Graduate - 002
004 - Masters - 002
005 - Engineer - 003
006 - Designer - 003
tbl_Attribute_Types:
att_type_Id
att_type_name
Example data:
001 - Gender
002 - Education
003 - Profession
To fill look-up drop-downs I can select something like:
SELECT A.att_id, A.att_value, AT.att_type_name
FROM tbl_Attributes A
INNER JOIN tbl_Attribute_Types AT ON AT.att_type_Id = A.att_type_Id
WHERE A.att_Type_Id = @att_Type_Id
and an additional table tbl_mem_att_value to save member's attributes and values
tbl_mem_att_value:
mem_id
att_id
Example data for member_id 001, who is Male, Masters, Engineer:
001 - 001
001 - 004
001 - 005
So my question is shall I go for one lookup table or many lookup tables?
Thanks
Never use one lookup table for everything. It makes things harder to find, and it has to be joined in nearly every query, often several times, which can cause locking and blocking problems. Further, in one table you can't use good design to make sure the data type of the descriptor is correct. For instance, suppose you want a lookup of state abbreviations, which are two characters long. If you use a one-size-fits-all table, the column has to be wide enough for the largest possible value of any lookup, and you lose the ability to reject an incorrect entry because it is too long. This guarantees data integrity issues later.
Further you can't properly use foreign keys to make sure data entry is limited only to the correct values. This will also cause data integrity issues.
There is NO BENEFIT whatsoever to using one table except a few minutes of dev time (possibly the least important concern in designing a database). There are plenty of negatives.
The primary reason for using multiple lookup tables is that you can then enforce foreign key constraints. This is quite important for maintaining relational integrity.
The primary reason for using a single lookup table is so you have all the string values in one place. This can be very useful for internationalization of the software.
In general, I would go with separate reference tables, because relational integrity is generally a more important concern than internationalization.
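To make the foreign key argument concrete, here is a small sketch with one lookup table per attribute. It assumes SQLite through Python's sqlite3 module (where foreign keys must be switched on per connection), and the lookup table names tbl_gender and tbl_education are invented; only tbl_members and its columns come from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE tbl_gender (Gender_Id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tbl_education (education_Id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE tbl_members (
    mem_Id INTEGER PRIMARY KEY,
    mem_email TEXT,
    Gender_Id INTEGER REFERENCES tbl_gender(Gender_Id),
    education_Id INTEGER REFERENCES tbl_education(education_Id)
);
INSERT INTO tbl_gender VALUES (1,'Male'),(2,'Female');
INSERT INTO tbl_education VALUES (1,'Graduate'),(2,'Masters');
""")

# Valid: gender 1 and education 2 both exist in their own lookup tables.
conn.execute("INSERT INTO tbl_members VALUES (1, 'a@example.com', 1, 2)")

# Invalid: there is no education row with id 3, so the insert is rejected.
try:
    conn.execute("INSERT INTO tbl_members VALUES (2, 'b@example.com', 1, 3)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

member_count = conn.execute("SELECT COUNT(*) FROM tbl_members").fetchone()[0]
```

With a single shared lookup table, a plain foreign key on education_Id could only check that the id exists somewhere in that table, not that it is an education value.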
There are secondary considerations. Many different reference tables are going to occupy more space than a single reference table -- with most of the pages being empty (how much space do you really need to store the gender lookup information?). However, with a relatively small number of reference tables, this is actually a pretty minor concern.
Another consideration in using a single table is that all the reference keys will have different values. This is useful because it can prevent unlikely joins. However, I prevent this problem by naming join keys the same, both for the primary key and the foreign key. So, GenderId would be the primary key in Gender as well as the foreign key column.
I've struggled with the same question myself. If the only thing in the lookup table is some sort of code or id and a text value, then it certainly works to just add "attribute id" and throw it all in one table. The obvious advantage is that you then have only one table to create and manage. Searches might possibly be slower because there are more records to search, but presumably you create an index on attribute id + value id. At that point, whether performance is better having one big table or ten small tables probably depends on all sorts of details about how the database engine works and the pattern of access. That's a case where I'd say: unless it proves to be a problem in practice, don't worry about it.
Two caveats:
One: If you do create a single table, I'd create a code for the attribute name, and then another table to list the codes. Like:
lookup_attribute(attribute_id, attribute_name)
lookup_value(attribute_id, value_id, value_text)
Then the first table has records like
1, 'Gender'
2, 'Marital Status'
3, 'Education'
etc
And the second is
1, 1, 'Male'
1, 2, 'Female'
1, 3, 'Undecided'
2, 1, 'Single'
2, 2, 'Married'
2, 3, 'Divorced'
2, 4, 'Widowed'
3, 1, 'High School'
3, 2, 'Associates'
3, 3, 'Bachelors'
3, 4, 'Masters'
3, 5, 'Doctorate'
3, 6, 'Other'
etc.
(The value id could be unique for all attribute ids or it might only be unique within the attribute id, whatever works for you. It shouldn't matter.)
Two: If there is other data you need to store for some attribute besides just the text of a value, then break that out into a separate table. Like if you had an attribute for, say, "Membership Level", and then the user says that there are different dues for each level and you need to record this, then you have an extra field that applies only to this one attribute. At that point it should become its own table. I've seen systems where they have a couple of pieces of extra data for each of several attributes, and they create a field called "extra data" or some such, and for "membership level" it holds annual dues and for "store name" it holds the city where the store is and for "item number" it holds the number of units on hand of that item, etc, and the system quickly becomes a nightmare to manage.
Update
To retrieve values, let's suppose we have just gender and marital status as lookups. The principle is the same for any others.
So we have the monster lookup table as described above. Then we have the member table with, say
member (member_id, name, member_number, whatever, gender_id, marital_status_id)
To retrieve you just write
select m.member_id, m.name, m.member_number, m.whatever,
g.value_text as gender, ms.value_text as marital_status
from member m
join lookup_value g on g.attribute_id=1 and g.value_id=m.gender_id
join lookup_value ms on ms.attribute_id=2 and ms.value_id=m.marital_status_id
where m.member_id=@member_id
You could, alternatively, have:
member (member_id, name, member_number, whatever)
member_attribute (member_id, attribute_id, value_id)
Then you can get all the attributes with
select a.attribute_name, v.value_text
from member_attribute ma
join lookup_attribute a on a.attribute_id=ma.attribute_id
join lookup_value v on v.attribute_id=a.attribute_id and v.value_id=ma.value_id
where ma.member_id=@member_id
It occurs to me as I try to write the queries that there's a distinct advantage to making the value id globally unique: Not only does that eliminate having to specify the attribute id in the join, but it also means that if you do have a field for, say, gender_id, you can still have a foreign key clause on it.
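For anyone who wants to try the single-table variant, the first retrieval query can be run against a tiny data set. This sketch assumes SQLite through Python's sqlite3 module, value ids that are unique only within an attribute, and a made-up member named Alice.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lookup_attribute (attribute_id INTEGER PRIMARY KEY, attribute_name TEXT);
CREATE TABLE lookup_value (
    attribute_id INTEGER REFERENCES lookup_attribute(attribute_id),
    value_id INTEGER,
    value_text TEXT,
    PRIMARY KEY (attribute_id, value_id)
);
CREATE TABLE member (
    member_id INTEGER PRIMARY KEY, name TEXT,
    gender_id INTEGER, marital_status_id INTEGER
);
INSERT INTO lookup_attribute VALUES (1,'Gender'),(2,'Marital Status');
INSERT INTO lookup_value VALUES
    (1,1,'Male'),(1,2,'Female'),(2,1,'Single'),(2,2,'Married');
INSERT INTO member VALUES (1,'Alice',2,2);  -- hypothetical member
""")

# The lookup table is joined once per attribute, each time pinned to one attribute_id.
QUERY = """
SELECT m.member_id, m.name,
       g.value_text AS gender, ms.value_text AS marital_status
FROM member m
JOIN lookup_value g  ON g.attribute_id = 1 AND g.value_id = m.gender_id
JOIN lookup_value ms ON ms.attribute_id = 2 AND ms.value_id = m.marital_status_id
WHERE m.member_id = ?
"""
row = conn.execute(QUERY, (1,)).fetchone()
print(row)
```

Note that gender_id and marital_status_id both hold the value 2 here and still resolve to different texts, which is exactly why the attribute id has to appear in each join.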
Putting all the lookup values into a single table is usually referred to as a Common Lookup Table, or a Massively Unified Code-Key (MUCK) table, and is generally considered a design error.
A good explanation of why it is not a good idea can be found in the article below.
https://www.red-gate.com/simple-talk/sql/database-administration/five-simple-database-design-errors-you-should-avoid/

Table with set of 3 columns maximum which have same meaning

I have a table Person in which each person has at most 3 opinions, and the opinions differ from row to row; in other words, you will never find 2 persons that have the same opinion, so there is no many-to-many relationship between person and opinion.
I will never validate the opinions (for example, to check that no 2 persons share an opinion); they are stored for information only.
The question is:
Should I make just one table
Person ( #id_person , ... , opinion1 , opinion2 , opinion3 , ... )
or add a new table :
Person ( #id_person , ... )
opinions ( #id_opinion , opinion , *id_person* ) // id_person FK
I would rather not create a new opinions table, because it seems to add nothing: I would simply add new rows to it every time I add a new Person.
On the other hand, if I keep everything in one table and a person has just one opinion, will the unused columns waste space, even if I declare the opinion columns as varchar?
And if I create a new opinions table, will it need a surrogate primary key id_opinion?
An opinion can be a varchar(50).
I would recommend two tables. The first table is the persons table. This would have a PersonId and all sorts of other information about the person.
The second table would be the PersonOpinions table. This would have one row per opinion, with information such as:
PersonId
Opinion
Date and time of the opinion
Topic of the opinion (if appropriate)
Method for inputting the opinion (if appropriate)
From what you say, there is no need for a separate opinions reference table, because each opinion is basically unique. However, you probably do want to store the opinions themselves in a separate table, with a separate row per opinion.
You can use a trigger to enforce the constraint that a person only has three opinions. If you decide to change this in the future, then it will be easy with a two-table solution.
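A trigger of that kind might look like the sketch below. It assumes SQLite syntax, run through Python's sqlite3 module; other databases spell triggers differently, and the trigger name and error message are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Person (PersonId INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE PersonOpinions (
    PersonId INTEGER NOT NULL REFERENCES Person(PersonId),
    Opinion VARCHAR(50) NOT NULL
);
-- Refuse a new opinion when the person already has three.
CREATE TRIGGER max_three_opinions
BEFORE INSERT ON PersonOpinions
WHEN (SELECT COUNT(*) FROM PersonOpinions WHERE PersonId = NEW.PersonId) >= 3
BEGIN
    SELECT RAISE(ABORT, 'a person can have at most three opinions');
END;
INSERT INTO Person VALUES (1, 'Ann');
""")

for text in ("o1", "o2", "o3"):
    conn.execute("INSERT INTO PersonOpinions VALUES (1, ?)", (text,))

try:
    conn.execute("INSERT INTO PersonOpinions VALUES (1, 'o4')")
    fourth_rejected = False
except sqlite3.DatabaseError:
    fourth_rejected = True

opinion_count = conn.execute(
    "SELECT COUNT(*) FROM PersonOpinions WHERE PersonId = 1").fetchone()[0]
```

Raising the limit later means changing one number in the trigger, rather than adding an opinion4 column to Person.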
I would suggest 3 tables (2 base tables and 1 intersection table). Person and Opinion are 2 different entities that might share an M:M relationship, hence the need for an intersection table. I believe this will simplify things, instead of cluttering all the information into a single table.
You should have two tables. (Whether you want opinion ids is a separate issue. You wouldn't need to.)
// person [id_person] has name [name] and ...
person(id_person,name,...)
// person [id_person] has opinion [opinion]
opinion(id_person,opinion)
fk (id_person) references person
Your design is bad partly because: it is not clear what the answers to the following questions are, yet the answers must be known for someone to use the database. When you have given the answers (ie the rest of a full design of your table) they will illustrate other problems.
Suppose:
person p1 has opinion o11.
person p1 has opinion o12.
person p1 has no other opinions.
person p3 has opinion o21
person p3 has no other opinions.
A. What row(s) go in the table?
B. What are queries (without knowing the table value) for rows where
person p has opinion [opinion]
person [person] does not have opinion o
C. Every table needs a statement parameterized by its columns where the rows in the table are the rows that make the statement true.
What is the statement for your table?
Why isn't it
person [id_person] is named [name] and ...
AND person [id_person] holds opinion [opinion1]
AND person [id_person] holds opinion [opinion2]
AND person [id_person] holds opinion [opinion3]
...
You should experience the further badness in answering the questions. Besides being unclear, the description so far is also bad because the answers are ugly and your table is difficult to use. So much so that I don't even want to type examples of what I know they will be like. So I will leave the questions rhetorical, without examples, until you give more info.

Two way relationships in SQL queries

I have a small database that is used to track parts. For the sake of this example, the table looks like this:
PartID (PK), int
PartNumber, Varchar(50), Unique
Description, Varchar(255)
I have a requirement to define that certain parts are classified as similar to each other.
To do this I have setup a second table that looks like this:
PartID, (PK), int
SecondPartID, (PK), int
ReasonForSimilarity, Varchar(255)
Then a many-to-many relationship has been setup between the two tables.
The problem comes when I need to report on the parts that are considered similar because the relationship is two way I.E. if part XYZ123 is similar to ABC678 then ABC678 is considered to be similar to XYZ123. So if I wanted to list all parts that are similar to a given part I either need to ensure the relationship is setup in both directions (which is bad because data is duplicated) or need to have 2 queries that look at the table in both directions. Neither of these solutions feels right to me.
So, how should this problem be approached? Can this be solved with SQL alone or does my design need to change to accommodate the business requirement?
Consider the following parts XYZ123, ABC123, ABC234, ABC345, ABC456 & EFG456 which have been entered into the existing structure entered above. You could end up with data that looks like this (omitting the reason field which is irrelevant at this point):
PartID, SecondPartID
XYZ123, ABC123
XYZ123, ABC234
XYZ123, ABC345
XYZ123, ABC456
EFG456, XYZ123
My user wants to know "Which parts are similar to XYZ123". This could be done using a query like so:
SELECT SecondPartID
FROM tblRelatedParts
WHERE PartID = 'XYZ123'
The problem with this though is it will not pick out part EFG456 which is related to XYZ123 despite the fact that the parts have been entered the other way round. It is feasible that this could happen depending on which part the user is currently working with and the relationship between the parts will always be two-way.
The problem I have with this though is that I now need to check that when a user sets up a relationship between two parts it does not already exist in the other direction.
@Goran
I have done some initial tests using your suggestion and this is how I plan to approach the problem using your suggestion.
The data listed above is entered into the new table (Note that I have changed the partID to part number to make the example clearer; the semantics of my problem haven't changed though)
The table would look like this:
RelationshipID, PartNumber
1, XYZ123
1, ABC123
2, XYZ123
2, ABC234
3, XYZ123
3, ABC345
4, XYZ123
4, ABC456
5, EFG456
5, XYZ123
I can then retrieve a list of similar parts using a query like this:
SELECT PartNumber
FROM tblPartRelationships
WHERE RelationshipID = ANY (SELECT RelationshipID
FROM tblPartRelationships
WHERE PartNumber = 'XYZ123')
I'll carry out some more tests, and if this works I'll report back and accept the answer.
I've dealt with this issue by setting up a relationship table.
Part table:
PartID (PK), int
PartNumber, Varchar(50), Unique
Description, Varchar(255)
PartRelationship table:
RelationshipId (FK), int
PartID (FK), int
Relationship table:
RelationshipId (PK), int
Now similar parts simply get added to the PartRelationship table:
RelationshipId, PartId
1,1
1,2
Whenever you add another part with relationshipId = 1 it is considered similar to any part with relationshipId = 1.
Possible API solutions for adding relationships:
Create new relationship for each list of similar parts. Let client load, change and update the entire list whenever needed.
Retrieve relationship(s) for a similar object. Filter the list by some criteria so that only one remains or let client choose from existing relationships. Create, remove PartRelationship record as needed.
Retrieve list of relationships from Relationship table. Let client specify parts and relationships. Create, remove PartRelationship records as needed.
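As a rough sketch of how "all parts similar to a given part" could be read from this three-table layout: the example assumes SQLite through Python's sqlite3 module and puts three of the question's part numbers into one relationship group. QRS789 is an invented part that belongs to no group.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Part (PartID INTEGER PRIMARY KEY, PartNumber VARCHAR(50) UNIQUE);
CREATE TABLE Relationship (RelationshipId INTEGER PRIMARY KEY);
CREATE TABLE PartRelationship (
    RelationshipId INTEGER REFERENCES Relationship(RelationshipId),
    PartID INTEGER REFERENCES Part(PartID),
    PRIMARY KEY (RelationshipId, PartID)
);
INSERT INTO Part VALUES (1,'XYZ123'),(2,'ABC123'),(3,'EFG456'),(4,'QRS789');
INSERT INTO Relationship VALUES (1);
INSERT INTO PartRelationship VALUES (1,1),(1,2),(1,3);
""")

# Self-join on the group id: every other member of my groups is similar to me.
SIMILAR = """
SELECT DISTINCT p.PartNumber
FROM PartRelationship mine
JOIN PartRelationship other
  ON other.RelationshipId = mine.RelationshipId
 AND other.PartID <> mine.PartID
JOIN Part p ON p.PartID = other.PartID
WHERE mine.PartID = (SELECT PartID FROM Part WHERE PartNumber = ?)
ORDER BY p.PartNumber
"""
similar_to_efg = [r[0] for r in conn.execute(SIMILAR, ("EFG456",))]
similar_to_qrs = [r[0] for r in conn.execute(SIMILAR, ("QRS789",))]
```

The direction problem disappears because membership in a group has no direction, whichever part was entered first.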
Add a CHECK constraint e.g.
CHECK (PartID < SecondPartID);
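A sketch of how that constraint might be used with the original two-column table, assuming SQLite through Python's sqlite3 module. The helper add_similarity is invented: it sorts the two ids before inserting, so a pair can only ever be stored in one direction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tblRelatedParts (
    PartID INTEGER NOT NULL,
    SecondPartID INTEGER NOT NULL,
    ReasonForSimilarity VARCHAR(255),
    PRIMARY KEY (PartID, SecondPartID),
    CHECK (PartID < SecondPartID))""")

def add_similarity(conn, a, b, reason=None):
    """Store the pair with the smaller id first; a repeated pair is skipped."""
    low, high = sorted((a, b))
    # OR IGNORE skips rows that violate the primary key (and, in SQLite,
    # the CHECK as well, so a part paired with itself is silently dropped).
    conn.execute(
        "INSERT OR IGNORE INTO tblRelatedParts VALUES (?, ?, ?)",
        (low, high, reason))

add_similarity(conn, 5, 1, "same housing")
add_similarity(conn, 1, 5, "entered the other way round")
rows = conn.execute("SELECT PartID, SecondPartID FROM tblRelatedParts").fetchall()

# A direct insert in the wrong order is rejected by the CHECK constraint.
try:
    conn.execute("INSERT INTO tblRelatedParts VALUES (5, 1, NULL)")
    reversed_rejected = False
except sqlite3.IntegrityError:
    reversed_rejected = True
```

Reads still have to look at both columns, but duplicates in the opposite direction can no longer be created.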
I know this is old, but why not just run this query against your original schema? Fewer tables and rows.
SELECT SecondPartID
FROM tblRelatedParts
WHERE PartID = 'XYZ123'
UNION
SELECT PartID
FROM tblRelatedParts
WHERE SecondPartID = 'XYZ123'
I am dealing with a similar issue and looking at the two approaches and wondering why you thought the schema with the relationship table was better. It seems like the original issue still exists in the sense that you still need to manage the relationships between them from both directions.
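For reference, here is the UNION query run against the sample rows from the question. It is a sketch that assumes SQLite through Python's sqlite3 module and, like the question's example, stores part numbers directly in the key columns; the wrapper function similar_parts is invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE tblRelatedParts (
    PartID TEXT, SecondPartID TEXT,
    PRIMARY KEY (PartID, SecondPartID))""")
conn.executemany("INSERT INTO tblRelatedParts VALUES (?, ?)", [
    ("XYZ123", "ABC123"),
    ("XYZ123", "ABC234"),
    ("XYZ123", "ABC345"),
    ("XYZ123", "ABC456"),
    ("EFG456", "XYZ123"),
])

def similar_parts(conn, part):
    """Return parts related to `part`, whichever column it was entered in."""
    sql = """
        SELECT SecondPartID FROM tblRelatedParts WHERE PartID = ?
        UNION
        SELECT PartID FROM tblRelatedParts WHERE SecondPartID = ?
    """
    return sorted(r[0] for r in conn.execute(sql, (part, part)))

print(similar_parts(conn, "XYZ123"))
print(similar_parts(conn, "EFG456"))
```

EFG456 now shows up for XYZ123 even though it was entered the other way round.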
How about having two rows for each similarity? For example, if objects A and B are similar, your relation table will contain
A B
B A
I know this doubles your relation data, but the values are integers, so it won't overload your database. In return you gain the following:
you avoid UNION, which the DBMS has to de-duplicate and which gets more expensive when combined with ORDER BY or GROUP BY
you can express one-way relations: a is related to b, but b is not related to a. For example, John can replace Dave, but Dave cannot replace John.

How to create history fact table?

I have some entities in my Data Warehouse:
Person - with attributes personId, dateFrom, dateTo, and others that can change, e.g. last name, birth date and so on (a slowly changing dimension)
Document - documentId, number, type
Address - addressId, city, street, house, flat
The relations between (Person and Document) is One-To-Many and (Person and Address) is Many-To-Many.
My target is to create history fact table that can answer us following questions:
1. Which persons, with which documents, lived at a given address on a given date?
2. What is the history of residents of a given address over a given time interval?
This is not the only thing the DW is designed for, but I think it is the hardest part of its design.
For example, Miss Brown with personId=1 and documents documentId=1 and documentId=2 lived at the address with addressId=1 from 01/01/2005 to 02/02/2010, and then moved to addressId=2, where she has lived from 02/03/2010 to the current date (NULL?). She changed her last name to Mrs Green on 04/05/2006, and replaced her first document, documentId=1, with documentId=3 on 06/07/2007. Mr Black with personId=2 and documentId=4 has lived at addressId=1 from 02/03/2010 to the current date.
The expected result of the query for question 2, with addressId=1 and the time interval from 01/01/2000 to now, should look like:
Rows:
last_name="Brown", documentId=1, dateFrom=01/01/2005, dateTo=04/04/2006
last_name="Brown", documentId=2, dateFrom=01/01/2005, dateTo=04/04/2006
last_name="Green", documentId=1, dateFrom=04/05/2006, dateTo=06/06/2007
last_name="Green", documentId=2, dateFrom=04/05/2006, dateTo=06/06/2007
last_name="Green", documentId=2, dateFrom=06/07/2007, dateTo=02/01/2010
last_name="Green", documentId=3, dateFrom=06/07/2007, dateTo=02/01/2010
last_name="Black", documentId=4, dateFrom=02/03/2010, dateTo=NULL
I had an idea to create fact table with composite key (personId, documentId, addressId, dateFrom) but I have no idea how to load this table and then get that expected result with this structure.
I will be pleased for any help!
Interesting question @Argnist!
So to create some common language for my example, you want a
DimPerson (PK=kcPerson, surrogate key for unique Persons=kPerson, type 2 dim)
DimDocument (PK=kcDocument, surrogate key for unique Documents=kDocument, type 2 dim)
DimAddress (PK=kcAddress, surrogate key for unique Addresses=kAddress, type 2 dim)
A colleague has written a short blog on the usage of two surrogate keys to explain the above dims 'Using Two Surrogate Keys on Dimensions'.
I would always add
DimDate with PK in the form yyyymmdd
to any data warehouse with extra attribute columns.
Then you would have your fact table as
FactHistory (FKs=kcPerson, kPerson, kcDocument, kDocument, kcAddress, kAddress, kDate)
plus any additional measures.
Then joining on the "kc"s you can show the current Person/Document/Address dimension information.
If you joined on the "k"s you can show the historic Person/Document/Address dimension information.
The downside of this is that this fact table needs one row for each person/document/address/date combination. But it really is a very narrow table, since the table just has a number of foreign keys.
The advantage of this is it is very easy to query for the sorts of questions you were asking.
Alternatively, you could have your fact table as
FactHistory (FKs=kcPerson, kPerson, kcDocument, kDocument, kcAddress, kAddress, kDateFrom, kDateTo)
plus any additional measures.
This is obviously much more compact, but the querying becomes more complex. You could also put a view over the Fact table to make it easier to query!
The choice of solution depends on how frequently the data changes. I suspect that it will not be changing that quickly, so the alternate design of the fact table may be better.
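To give a feel for querying the compact kDateFrom/kDateTo design, here is a much simplified sketch. It assumes SQLite through Python's sqlite3 module, keeps only one key per dimension (the "kc" keys are left out), stores dates as yyyymmdd integers with NULL meaning "still current", and loads the address 1 rows from the expected result in the question, reading its dates as mm/dd/yyyy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE DimPerson (kPerson INTEGER PRIMARY KEY, last_name TEXT);
CREATE TABLE FactHistory (
    kPerson INTEGER REFERENCES DimPerson(kPerson),
    kDocument INTEGER,
    kAddress INTEGER,
    kDateFrom INTEGER,   -- yyyymmdd
    kDateTo INTEGER      -- yyyymmdd, NULL while the row is still current
);
-- One DimPerson row per version of the person (type 2 dimension).
INSERT INTO DimPerson VALUES (1,'Brown'),(2,'Green'),(3,'Black');
INSERT INTO FactHistory VALUES
    (1, 1, 1, 20050101, 20060404),
    (1, 2, 1, 20050101, 20060404),
    (2, 1, 1, 20060405, 20070606),
    (2, 2, 1, 20060405, 20070606),
    (2, 2, 1, 20070607, 20100201),
    (2, 3, 1, 20070607, 20100201),
    (3, 4, 1, 20100203, NULL);
""")

# Question 1: who lived at an address, with which documents, on a given date?
ON_DATE = """
SELECT p.last_name, f.kDocument
FROM FactHistory f
JOIN DimPerson p ON p.kPerson = f.kPerson
WHERE f.kAddress = ?
  AND f.kDateFrom <= ?
  AND (f.kDateTo IS NULL OR f.kDateTo >= ?)
ORDER BY f.kDocument
"""

def residents(conn, address, date):
    return conn.execute(ON_DATE, (address, date, date)).fetchall()

print(residents(conn, 1, 20061231))
print(residents(conn, 1, 20200101))
```

Question 2 would use the same table with an interval overlap test in place of the single date; a view over FactHistory could hide that condition from report writers.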
Hope that helps.