Storing Revisions of Relational Objects in an Efficient Way - sql

I'm not sure if this type of question has been answered before. In my database I have a product table and a specifications table. Each product can have multiple specifications, and I need to store the revisions of each product in the database in order to query them later for history purposes.
So I need an efficient way to store each product's relations to specifications every time users change those relations. The amount of data can also become very large. For example, suppose there are 100,000 products in the database: each product can have 30 specifications, and there are a minimum of 20 revisions per product. Storing all of that in a single table makes the volume of data enormous.
Any suggestions?

If this is purely for 'archival' purposes then maybe a separate table for the revisions is better.
However if you need to treat previous revisions equally to current revisions (for example, if you want to give users the ability to revert a product to a previous revision), then it is probably best to keep a single products table, rather than copying data between tables. If you are worried about performance, this is what indexes are for.
You can create a compound primary key for the product table, e.g. PRIMARY KEY (product_id, revision). A stored procedure that finds the current revision by selecting the row with the highest revision for a particular product_id may also be useful.
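As a rough sketch (the table and column names here are illustrative, and the exact syntax depends on your DBMS), that could look like:

    CREATE TABLE product (
        product_id INT          NOT NULL,
        revision   INT          NOT NULL,
        name       VARCHAR(200) NOT NULL,
        PRIMARY KEY (product_id, revision)
    );

    -- Specifications hang off a specific product revision
    CREATE TABLE specification (
        product_id INT          NOT NULL,
        revision   INT          NOT NULL,
        spec_name  VARCHAR(100) NOT NULL,
        spec_value VARCHAR(400) NULL,
        PRIMARY KEY (product_id, revision, spec_name),
        FOREIGN KEY (product_id, revision) REFERENCES product (product_id, revision)
    );

    -- Current state of a product = the row with the highest revision number
    SELECT p.*
    FROM   product p
    WHERE  p.product_id = @product_id
      AND  p.revision = (SELECT MAX(revision)
                         FROM   product
                         WHERE  product_id = @product_id);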

I would recommend having a history table that is an exact copy of the current table plus a HistoryDate column, and storing the revisions in that table. You can do this for all 3 tables in question.
By keeping the revision separate from the main tables, you will not incur any performance penalties when querying the main tables.
You could also look at keeping a column recording which user changed the data.
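A minimal sketch of such a history table, assuming SQL Server syntax and illustrative column names:

    -- Same columns as the live product table, plus audit columns
    CREATE TABLE product_history (
        product_id  INT          NOT NULL,
        name        VARCHAR(200) NOT NULL,
        -- ...the remaining columns copied from the product table...
        HistoryDate DATETIME     NOT NULL,  -- when this version was archived
        ChangedBy   VARCHAR(100) NULL       -- user who made the change
    );

    -- Before updating a product, copy the current row into the history table
    INSERT INTO product_history (product_id, name, HistoryDate, ChangedBy)
    SELECT product_id, name, GETDATE(), @user
    FROM   product
    WHERE  product_id = @product_id;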

Related

Should I apply type 2 history to tables with duplicate keys?

I'm working on a data warehouse project using BigQuery. We're loading daily files exported from various mainframe systems. Most tables have unique keys which we can use to create the type 2 history, but some tables, e.g. a ledger/positions table, can have duplicate rows. These files contain the full data extract from the source system every day.
We're currently able to maintain a type 2 history for most tables without knowing the primary keys, as long as all rows in a load are unique, but we have a challenge with tables where this is not the case.
One person on the project has suggested that the way to handle it is to "compare duplicates", meaning that if the DWH table has 5 identical rows and the staging table has 6 identical rows, we just insert one more, and if it is the other way around, we just close one of the records in the DWH table (by setting the end date to now). This could be implemented by adding an extra "sub row" key to the dataset like this:
    ROW_NUMBER() OVER (PARTITION BY <all data columns> ORDER BY SystemTime) AS data_row_nr
I've tried to find out if this is good practice or not, but without any luck. Something about it just seems wrong to me, and I can't see what unforeseen consequences can arise from doing it like this.
Can anybody tell me what the best way to go is when dealing with full loads of ledger data on a daily basis, for which we want to maintain some kind of history in the DWH?
No, I do not think it would be a good idea to introduce an artificial primary key based on all columns plus the index of the duplicated row.
You would solve the technical problem, but I doubt there would be much business value.
First of all you should distinguish between the two cases: the tables that come with a primary key are dimensions, where you can recognise changes and build history.
But the tables without a PK are most probably fact tables (i.e. transaction records), which are typically not full-loaded but loaded based on some delta criterion.
In any case you will never be able to recognise an update in those records; the only possible change is an insert (deletes are typically not relevant, as the data warehouse keeps a longer history than the source system).
So my to-do list:
Check whether the duplicates are intended or illegal.
Try to find a delta criterion to load the fact tables.
If everything else fails, make the primary key out of all columns plus a single attribute holding the number of the duplicate, and build the history from that (a rough sketch is below).
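If you do end up with that last option, the extra attribute from the question could be generated in the staging query roughly like this (the table and column names are placeholders for your own):

    -- Number identical rows so that (all data columns + data_row_nr) is unique
    SELECT
        s.*,
        ROW_NUMBER() OVER (
            PARTITION BY account_id, position_date, amount  -- i.e. all data columns
            ORDER BY SystemTime
        ) AS data_row_nr
    FROM staging.ledger_positions AS s;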

Connecting one foreign key to multiple tables (primary keys)

I am developing an application for making quotations. First you make a cost breakdown (or calculation), and based on that result you add an item to the quotation. The problem is that I have many products, so each category of product will have its own cost breakdown form with different parameters to be filled in. If I have only one table for the cost breakdown, it will be huge (a lot of fields in the table). I have a feeling that this is not the right approach, so I came up with the diagram below:
Is this solution even possible, or must I have N different FKs (one for each of my N cost breakdown tables)? Do you have any better solutions?
I also have another question: is my linking table "Quotation_QtnDetail" necessary?
It would be possible to store a reference to a particular value in one of these tables by having a CalculationType column indicating which table the record is in, along with a generic reference ID column (containing the ID of the relevant record). For example, if you were storing a CalcId of 123 and a CalculationType of 2, this would point to the record with ID 123 in the Calc2 table.
The downside to doing this is you're going to lose the ability to validate your data using FK constraints, and it will also make joins to your calculation tables a bit more complicated.
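A rough sketch of that idea, using the CalcId/CalculationType columns described above (the QtnItm columns and the TotalCost column are assumptions for the example):

    CREATE TABLE QtnItm (
        QtnItmId        INT NOT NULL PRIMARY KEY,
        CalcId          INT NOT NULL,  -- ID of the row in whichever Calc table applies
        CalculationType INT NOT NULL   -- 1 = Calc1, 2 = Calc2, ...
        -- note: no FK constraint is possible on CalcId
    );

    -- Joins have to branch on CalculationType, e.g. for calculations of type 2:
    SELECT i.QtnItmId, c2.TotalCost
    FROM   QtnItm i
    JOIN   Calc2 c2 ON c2.Id = i.CalcId
    WHERE  i.CalculationType = 2;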
Regarding the Quotation_QtnDetail table, unless a QtnDetail record could ever be linked to multiple Quotation records, there is no need for this extra linking table. Instead, just link it directly by adding a QtnId column to the QtnDetail table. Similarly, you may also be able to remove the Calc_QtnItm table if an item is only ever linked to a single calculation record.

Update and delete records in the fact table

I have a fact table with five dimension tables associated with it. Typically, the fact table contains the surrogate keys of each dimension and has no business/surrogate key of its own. I am trying to load the fact table with the data resulting from the staging fact table, i.e. insert new records. However, I notice the fact table can also handle other operations on data, such as updates or deletes. A conditional split was used in the SSIS package for this purpose, to check whether all surrogate keys are 0 and, if so, make the new insert. My question is: can I use the surrogate keys for updates or deletes?
I made an insert into the fact table just to give an idea of what the data will look like.
The answer is yes, you can. BUT, will there be a situation where one employee sold the same product, from the same supplier, to the same customer, on the same day? Perhaps a different order on the same day? (this is based on the data you present in the question)
If all the surrogate keys together can uniquely identify a record, update fact records to your heart's content. But if that is not the case, you could end up updating records you did not intend to update.
I tend to include an order number in the fact tables I design to help avoid that situation, but you may not have that in your actual fact tables. Including the order number in the fact table is a pattern referred to as a degenerate dimension. I have found it to be pretty handy.
Anyway, the answer is the same. You can update fact records based on surrogate keys, as long as all of them together can uniquely identify the row(s) you want to update.
Don't throw caution to the wind; be sure your data warehouse is designed such that you can do this if you need to. Being able to do in-place updates of facts, rather than delete and replace, can be nice in that there can be fewer steps in the ETL process.
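As a hedged illustration only (the fact table, staging table and measure columns below are made up), such an update keyed on all the surrogate keys could look like this in T-SQL:

    -- Only safe if the combination of surrogate keys is unique in FactSales
    UPDATE f
    SET    f.Quantity = s.Quantity,
           f.Amount   = s.Amount
    FROM   FactSales f
    JOIN   StagingFactSales s
           ON  s.DateKey     = f.DateKey
           AND s.ProductKey  = f.ProductKey
           AND s.CustomerKey = f.CustomerKey
           AND s.EmployeeKey = f.EmployeeKey
           AND s.SupplierKey = f.SupplierKey;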

SQL Server Table Design ; one table with type column vs multiple tables

I have made a website on which I have a blog and a products page. Posts and products are stored in different tables. I use Microsoft SQL Server.
I want to create a table to store the views for each post and for each product. My 2 possible designs are:
One table for all views
(id, ref_id, date, ip, type)
where type is either post or product
Two separate tables, table post_views and table product_views
post_views(id, post_id, date, ip)
product_views(id, product_id, date, ip)
Which design is better and why?
My reasoning:
Solution 1 requires more database space (we need to store the type value for each view) and also requires a more "complex" query, since we need to search by both id and type.
Solution 1 is more compact: we have fewer tables, but the performance won't be as good. If we have 1 million records, with 500k views for posts and 500k views for products, we would have to search all the views just to filter by date (as an example).
The pros of the first solution are its compactness and that I use one table less.
The second solution requires one more table, but the query performance and disk space would be better.
This is a very common design question that I face and I would love to receive an answer from a very good expert.
Since posts and products each have their own tables, go with the second option - meaning different tables for post views and product views.
The reason it's a better option is that it allows you to use foreign keys between the views tables and the posts and products tables, and keeps posts and products separated.
The first option would also allow you to use a foreign key between these tables, but it would mean you could only store views whose ref_id exists in both the posts and products tables. It would also force more cumbersome SELECT statements, with different joins depending on the type column.
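For reference, the second option could look roughly like this (the posts and products tables and their id columns are assumed to already exist):

    CREATE TABLE post_views (
        id      INT IDENTITY PRIMARY KEY,
        post_id INT         NOT NULL REFERENCES posts (id),
        date    DATETIME    NOT NULL,
        ip      VARCHAR(45) NOT NULL
    );

    CREATE TABLE product_views (
        id         INT IDENTITY PRIMARY KEY,
        product_id INT         NOT NULL REFERENCES products (id),
        date       DATETIME    NOT NULL,
        ip         VARCHAR(45) NOT NULL
    );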
With no more information than you have provided, I would go with the first option, simply because it is more concise. Unless you have a compelling reason to break them into two tables, you should keep them in one. It's easy enough to filter any queries by type as needed.
Another consideration is that it's more common practice in stored procs to have a variable field parameter (such as your type field) than it is to have a variable table name, so the function of any stored procs that you create will be a bit more obvious to those who come after you if you keep them in one table.

One mysql table with many fields or many (hundreds of) tables with fewer fields?

I am designing a system for a client where he is able to create data forms for the various products he sells himself.
The number of fields he will be using will not be more than 600-700 (worst-case scenario); it looks like he will probably be in the range of 400-500 (max).
I had 2 methods in mind for creating the database (using meta data):
a) Create a table for each product, which will hold only the fields necessary for that product. This will result in hundreds of tables, but each with only the necessary fields for its product.
or
b) Use one single table with all available form fields (anywhere from the current 300 up to the maximum of 700), resulting in one table with MANY fields, of which only about 10% will be used for each product entry (a product should usually not use more than 50-80 fields).
Which solution is best, keeping in mind that maintenance (creation, updates and changes) of the table(s) will be done using metadata, so I will not need to change the table(s) manually?
Thank you!
/**** UPDATE *****/
Just an update: even after all this time (and a lot of additional experience gathered), I need to mention that not normalizing your database is a terrible idea. What is more, an unnormalized database almost always (in my experience, just always) indicates a flawed application design as well.
I would have 3 tables:

product
    id
    name
    whatever else you need

field
    id
    field name
    anything else you might need

product_field
    id
    product_id
    field_id
    field value
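In SQL terms (MySQL, as in the question), a minimal sketch of that design could look like the following; the column types, lengths and the unique constraint are assumptions:

    CREATE TABLE product (
        id   INT AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(200) NOT NULL
        -- whatever else you need
    );

    CREATE TABLE field (
        id         INT AUTO_INCREMENT PRIMARY KEY,
        field_name VARCHAR(100) NOT NULL
        -- anything else you might need
    );

    CREATE TABLE product_field (
        id          INT AUTO_INCREMENT PRIMARY KEY,
        product_id  INT NOT NULL,
        field_id    INT NOT NULL,
        field_value VARCHAR(500) NULL,
        FOREIGN KEY (product_id) REFERENCES product (id),
        FOREIGN KEY (field_id)   REFERENCES field (id),
        UNIQUE (product_id, field_id)  -- one value per field per product
    );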
Your key deciding factor is whether normalization is required. Even though you are only adding data using an application, you'll still need to cater for anomalies, e.g. what happens if someone's phone number changes, and they insert multiple rows over the lifetime of the application? Which row contains the correct phone number?
As an example, you may find that you'll have repeating groups in your data, like one person with several phone numbers; rather than have three columns called "Phone1", "Phone2", "Phone3", you'd break that data into its own table.
There are other issues in normalisation, such as transitive or non-key dependencies. These concepts will hopefully lead you to a database table design without modification anomalies, as you should hope for!
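For example, the repeating phone-number columns would be broken out into something like this (the person table and the column names are just for illustration):

    -- One row per phone number instead of Phone1/Phone2/Phone3 columns
    CREATE TABLE person_phone (
        person_id    INT         NOT NULL,
        phone_number VARCHAR(30) NOT NULL,
        PRIMARY KEY (person_id, phone_number),
        FOREIGN KEY (person_id) REFERENCES person (id)
    );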
Pulegium's solution is a good way to go.
You do not want to go with the one-table-for-each-product solution, because the structure of your database should not have to change when you insert or delete a product. Only the rows of one or many tables should be inserted or deleted, not the tables themselves.
While it's possible that it may be necessary, having that many fields for something as simple as a product list sounds to me like you probably have a flawed design.
You need to analyze your potential table structures to ensure that each field contains no more than one piece of information (e.g., "2 hammers, 500 nails" in a single field is bad) and that each piece of information has no more than one field where it belongs (e.g., having phone1, phone2, phone3 fields is bad). Either of these situations indicates that you should move that information out into a separate, related table with a foreign key connecting it back to the original table. As pulegium has demonstrated, this technique can quickly break things down to three tables with only about a dozen fields total.