Best table structure for tracking state changes - sql

I'm currently trying to model an aspect of a system whereby stored components can change state, e.g. OK, FAILED, REPAIRED, etc. On the web side I will need to show the current state, but also the history of previous states (if any).
I'm torn between these two designs; could anyone shed any light on the best way to go? (I'm more a software dev than a DBA.)
Option one:
A statehistory table tracks each state change; the row with the highest sequence number is the current state: SQLFiddle example
Option two:
Similar to the above, except the current state is stored in the component table and only past states are in the history table. When the state changes, the old current state is inserted as the most recent history row and the new state is set in the component table: SQLFiddle example
As an aside: use either option one or two, but without the state lookup table, and just store the state text as a varchar (my thinking is this makes it easier to report from?): SQLFiddle example
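Roughly, the two options would look something like this (column names are illustrative only; the SQLFiddle examples have the full definitions):
-- Option one: history table only; the row with the highest sequence number is the current state
CREATE TABLE statehistory (
    id            INT PRIMARY KEY,
    component_id  INT NOT NULL,
    state_id      INT NOT NULL,      -- FK to the state lookup table
    sequence_no   INT NOT NULL,
    changed_at    DATETIME NOT NULL
);
-- Option two: only past states go into statehistory; the current state sits on the component row
CREATE TABLE component (
    id                INT PRIMARY KEY,
    name              VARCHAR(100),
    current_state_id  INT NOT NULL   -- FK to the state lookup table
);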
Thanks.
EDIT:
There are several component tables; should the state history table contain the data for all of them, or should there be a statehistory table per component type? Each component table will have hundreds of thousands of entries, which would make a shared statehistory table pretty large.
eg:
Table: component_a
Table: component_b
etc..
statehistory (
component_a_id,
component_b_id,
state_id,
...
)

I tend to do a hybrid of the two. I always store all state changes, including the current state, in the history table. That gives you a central place to query them. You can have a column IsCurrent BIT NOT NULL to make your life a little easier. Create a filtered unique index with the filter IsCurrent = 1 to enforce basic integrity rules.
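For example, a minimal sketch of that index (SQL Server syntax; table and column names follow those used later in this answer):
CREATE UNIQUE INDEX UX_StatusHistoryItems_Current
    ON StatusHistoryItems (ComponentID)
    WHERE IsCurrent = 1;
This guarantees at most one current row per component while leaving historic rows unconstrained.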
I also store the current state in the main table, probably not just as a copy but as a foreign key to the history table. That makes for very convenient querying, and looking up the current state is often useful. For indexing reasons you can of course also duplicate the values into the main table, but the more duplication you have, the more error-prone the system becomes.
If you want to avoid duplication but still index on the current status, you can create an indexed view that combines the main and history tables. You can then create an index on columns from both tables, e.g. on (StatusHistoryItems.Status, Components.Name), to support queries that ask for components with a specific status and a specific name. Such a query would be resolved as a single index seek on the view's index.
You'd create a view like this:
CREATE VIEW dbo.ComponentsWithCurrentStatus
WITH SCHEMABINDING AS
SELECT c.ID, c.Name, shi.Status
FROM dbo.Components c
JOIN dbo.StatusHistoryItems shi ON c.ID = shi.ComponentID
    AND shi.IsCurrent = 1 -- this condition joins exactly one history row per component
And index it (for an indexed view, SQL Server requires a unique clustered index on the view first). Now you have the current status together with all component data in one efficient index. No duplication, no denormalization at all. Just make sure that there is at least one status row with IsCurrent = 1 for each component.
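Concretely, the indexing might look like this (SQL Server; index names are made up and the view name follows the sketch above):
CREATE UNIQUE CLUSTERED INDEX UX_ComponentsWithCurrentStatus
    ON dbo.ComponentsWithCurrentStatus (ID);
-- the mixed-column index mentioned earlier
CREATE INDEX IX_ComponentsWithCurrentStatus_Status_Name
    ON dbo.ComponentsWithCurrentStatus (Status, Name);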
I recommend having a nightly validation job that checks data consistency and alerts you to problems. Denormalized data has a habit of becoming corrupted over time for various reasons.
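A minimal version of such a check, using the names from the sketches above, might be:
-- components that do not have exactly one IsCurrent row (this query should return nothing)
SELECT c.ID
FROM Components c
LEFT JOIN StatusHistoryItems shi
    ON shi.ComponentID = c.ID AND shi.IsCurrent = 1
GROUP BY c.ID
HAVING COUNT(shi.ComponentID) <> 1;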

Should I apply type 2 history to tables with duplicate keys?

I'm working on a data warehouse project using BigQuery. We're loading daily files exported from various mainframe systems. Most tables have unique keys which we can use to create the type 2 history, but some tables, e.g. a ledger/positions table, can have duplicate rows. These files contain the full data extract from the source system every day.
We're currently able to maintain a type 2 history for most tables without knowing the primary keys, as long as all rows in a load are unique, but we have a challenge with tables where this is not the case.
One person on the project has suggested that the way to handle it is to "compare duplicates", meaning that if the DWH table has 5 identical rows and the staging table has 6 identical rows, then we just insert one more, and if it is the other way around, we just close one of the records in the DWH table (by setting the end date to now). This could be implemented by adding an extra "sub-row" key to the dataset like this:
ROW_NUMBER() OVER (PARTITION BY "all data columns" ORDER BY SystemTime) AS data_row_nr
I've tried to find out whether this is good practice or not, but without any luck. Something about it just seems wrong to me, yet I can't see what unforeseen consequences could arise from doing it like this.
Can anybody tell me what the best way to go is when dealing with full loads of ledger data on a daily basis, for which we want to maintain some kind of history in the DWH?
No, I do not think it would be a good idea to introduce an artificial primary key based on all columns plus the index of the duplicated row.
You would solve the technical problem, but I doubt there would be much business value in it.
First of all, you should distinguish between the two cases: the tables you get with a primary key are dimensions, for which you can recognise changes and build history.
The tables without a PK, however, are most probably fact tables (i.e. transaction records), which are typically not fully loaded but loaded based on some delta criterion.
Anyway, you will never be able to recognise an update in those records; the only possible change is an insert (deletes are typically not relevant, as the data warehouse keeps a longer history than the source system).
So my to-do list:
Check if the duplicates are intended or illegal.
Try to find a delta criterion to load the fact tables.
If everything else fails, build the primary key from all columns plus an attribute holding the index of the duplicate, and build the history from that (see the sketch after this list).
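A rough sketch of that fallback in BigQuery standard SQL, where col_a and col_b stand in for the actual data columns (which are not named in the question) and staging_ledger is an assumed staging table name:
SELECT
  col_a,
  col_b,
  ROW_NUMBER() OVER (
    PARTITION BY col_a, col_b   -- i.e. all data columns
    ORDER BY SystemTime
  ) AS data_row_nr              -- duplicate index that becomes part of the artificial key
FROM staging_ledger;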

Changelog for a table

I want to design a changelog for a few tables. Let's call one of them table restaurant. Every time a user modifies the list of restaurants, the change should be logged.
Idea 1
My first idea was to create 2 tables. One contains all the restaurants: RESTAURANT_VALUE (restaurantId*, restaurantValueId*, address, phone, ..., username, insertDate). Every time a change is made, a new entry is created there. Then a table RESTAURANT (restaurantId*, restaurantValueId) links to the currently valid restaurantValueId. So one table holds both the current and the previous versions.
Idea 2
It starts with 2 tables as well. One of them contains all current restaurants, e.g. RESTAURANT_CURRENT, and a second table contains all changes, RESTAURANT_HISTORY. Therefore both need to have exactly the same columns. Every time a change occurs, the values from the 'current' table are copied into the history table and the new version is written to the 'current' table.
My opinion
Idea 1 doesn't care whether columns will ever be added or not, so maintenance and adding columns would be easy. However, I think as the database grows... wouldn't it slow down? Idea 2 has the advantage that the table with the values will never hold any 'old' rows and won't get crowded.
Theoretically, I think Idea 1 is the one to go with.
What do you think? Would you go for Idea 1 or another approach? Are there any other important practical considerations I am not aware of?
The approach strongly depends on your needs. Why would you want a history table?
If it's just for auditing purposes, then make a separate restaurant_history table (idea 2) to keep the history aside. If you want to present the history in the application, then go for a single restaurants table with one of the options below:
seq_no - a record version number, incremented with each update. If you need the current data, you must search for the highest seq_no for the given restaurant_id(s), so optionally also add a current marker, allowing a straightforward current = true filter
valid_from, valid_to - where valid_to is NULL for the current record
Sometimes there is also a need to query efficiently which attributes exactly changed. To do that easily, you can consider a history table at attribute level: (restaurant_id, attribute, old_value, new_value, change_date, user).
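A minimal sketch of the single-table variant with validity dates, plus the optional attribute-level history (table names and column types are assumptions where not given above):
CREATE TABLE restaurant (
    restaurant_id  INT NOT NULL,
    seq_no         INT NOT NULL,       -- version number, incremented per update
    address        VARCHAR(200),
    phone          VARCHAR(50),
    valid_from     DATETIME NOT NULL,
    valid_to       DATETIME NULL,      -- NULL marks the current version
    PRIMARY KEY (restaurant_id, seq_no)
);
CREATE TABLE restaurant_attribute_history (
    restaurant_id  INT NOT NULL,
    attribute      VARCHAR(100) NOT NULL,
    old_value      VARCHAR(500),
    new_value      VARCHAR(500),
    change_date    DATETIME NOT NULL,
    username       VARCHAR(100)        -- "user" is a reserved word in many databases
);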

Opinions on planning and avoiding data redundancy

I am going to be designing an app in VB.NET to work with an Access back-end database. I have been trying to think of ways to reduce data redundancy
and I have an example scenario below:
Let's imagine, for example's sake, that I have a customers table and need to highlight all customers in WI and send them a letter. The customers table contains all the customers and the properties associated with them (Name, Address, etc.), so we would query the table for rows where the state is "WI". Then we would take the results of that query and append them into a table with a "completion" indicator (so from 'CUSTOMERS' into, say, a 'WI_LETTERS' table).
Let's assume some processing needs to be done; when it's completed, a field in that table is marked 'complete', and the letters can then be printed with a mail merge (SELECT FROM 'WI_LETTERS' WHERE INDICATOR = COMPLETE).
That item is now completed and done. But let's say that every odd year (e.g. 2013) we also send a notice to everyone with a state of "WI". We now query the customers table when the year is odd and the customer's state is "WI", then append that data into a table called 'notices' with a completion indicator, and it is marked complete.
This seems to keep the data "task-based", as the data is organised solely around the task at hand. However, isn't this considered redundant data? This setup means there can be one transaction type to many accounts (even multiple times to the same account, year after year), but shouldn't it be one account to many transactions?
How can this design be made better?
You certainly don't want to start creating new tables for each individual task you perform. You may want to create several different tables for different types of tasks if the information you need to track (and hence the columns in those tables) will be quite different between the different types of tasks, but those tables should be used for all tasks of that particular type. You can maintain a field in those tables to identify the individual task to which each record applies (e.g., [campaign_id] for Marketing campaign mailouts, or [mail_batch_id], or similar).
You definitely don't want to start creating new tables like [WI_letters] that are segregated by State (or any client attribute). You already have the customers' State in the [Customers] table so the only customer-related attribute you need in your [Letters] table is the [CustomerID]. If you frequently want to see a list of Letters for Customers in Wisconsin then you can always create a saved Query (often called a View in other database systems) named [WI_Letters] that looks like
SELECT * FROM Letters INNER JOIN Customers ON Customers.CustomerID = Letters.CustomerID
WHERE Customers.State = "WI"
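A sketch of such a normalized Letters table (generic SQL; adapt the types for Access, and the campaign_id column is just the task identifier mentioned above; names are illustrative):
CREATE TABLE Letters (
    LetterID     INT PRIMARY KEY,
    CustomerID   INT NOT NULL REFERENCES Customers (CustomerID),
    campaign_id  INT NOT NULL,    -- which mailout/notice run this row belongs to
    Complete     BIT NOT NULL     -- the completion indicator
);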

One mysql table with many fields or many (hundreds of) tables with fewer fields?

I am designing a system for a client, where he is able to create data forms for various products he sells himself.
The number of fields he will be using will not be more than 600-700 (worst-case scenario). As it looks now, he will probably be in the range of 400-500 (max).
I had 2 methods in mind for creating the database (using meta data):
a) Create a table for each product, which will hold only the fields necessary for that product; this will result in hundreds of tables, but each with only the necessary fields for its product
or
b) Use one single table with all available form fields (anywhere from the current 300 to a max of 700), resulting in one table with MANY fields, of which only about 10% will be used for each product entry (a product should usually not use more than 50-80 fields)
Which solution is best, keeping in mind that table maintenance (creation, updates and changes) will be done using metadata, so I will not need to change the table(s) manually?
Thank you!
/**** UPDATE *****/
Just an update: even after this long time (and a lot of additional experience gathered), I need to mention that not normalizing your database is a terrible idea. What is more, a non-normalized database almost always (in my experience, just always) indicates a flawed application design as well.
I would have 3 tables (rough SQL sketch below):
product
    id
    name
    whatever else you need
field
    id
    field name
    anything else you might need
product_field
    id
    product_id
    field_id
    field value
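In SQL that might look roughly like this (types and lengths are assumptions):
CREATE TABLE product (
    id    INT PRIMARY KEY,
    name  VARCHAR(200) NOT NULL
    -- whatever else you need
);
CREATE TABLE field (
    id          INT PRIMARY KEY,
    field_name  VARCHAR(200) NOT NULL
    -- anything else you might need
);
CREATE TABLE product_field (
    id           INT PRIMARY KEY,
    product_id   INT NOT NULL,
    field_id     INT NOT NULL,
    field_value  VARCHAR(1000),
    FOREIGN KEY (product_id) REFERENCES product (id),
    FOREIGN KEY (field_id) REFERENCES field (id)
);
Each product then uses only as many product_field rows as it actually needs, instead of hundreds of mostly empty columns.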
Your key deciding factor is whether normalization is required. Even though you are only adding data using an application, you'll still need to cater for anomalies, e.g. what happens if someone's phone number changes, and they insert multiple rows over the lifetime of the application? Which row contains the correct phone number?
As an example, you may find that you'll have repeating groups in your data, like one person with several phone numbers; rather than have three columns called "Phone1", "Phone2", "Phone3", you'd break that data into its own table.
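For the phone example, the repeating group would be split out roughly like this (table and column names are assumptions):
CREATE TABLE person (
    person_id  INT PRIMARY KEY,
    name       VARCHAR(200) NOT NULL
);
CREATE TABLE person_phone (
    person_id   INT NOT NULL,
    phone_type  VARCHAR(20) NOT NULL,   -- e.g. 'home', 'work', 'mobile'
    phone       VARCHAR(50) NOT NULL,
    PRIMARY KEY (person_id, phone_type),
    FOREIGN KEY (person_id) REFERENCES person (person_id)
);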
There are other issues in normalisation, such as transitive or non-key dependencies. These concepts will hopefully lead you to a database table design without modification anomalies, which is what you should aim for!
Pulegium's solution is a good way to go.
You do not want to go with the one-table-for-each-product solution, because the structure of your database should not have to change when you insert or delete a product. Only the rows of one or many tables should be inserted or deleted, not the tables themselves.
While it's possible that it may be necessary, having that many fields for something as simple as a product list sounds to me like you probably have a flawed design.
You need to analyze your potential table structures to ensure that each field contains no more than one piece of information (e.g., "2 hammers, 500 nails" in a single field is bad) and that each piece of information has no more than one field where it belongs (e.g., having phone1, phone2, phone3 fields is bad). Either of these situations indicates that you should move that information out into a separate, related table with a foreign key connecting it back to the original table. As pulegium has demonstrated, this technique can quickly break things down to three tables with only about a dozen fields total.

Storing Revisions of Relational Objects in an Efficient Way

I'm not sure if this type of question has been answered before. In my database I have a product table and a specifications table; each product can have multiple specifications. I need to store the revisions of each product in the database in order to query them later for history purposes.
So I need an efficient way to store the products' relations to specifications each time users change these relations. The amount of data can also become very big. For example, suppose there are 100,000 products in the database: each product can have 30 specifications, and there is a minimum of 20 revisions of each product. By storing all the data in a single table, the amount of data becomes enormous.
Any suggestions?
If this is purely for 'archival' purposes then maybe a separate table for the revisions is better.
However if you need to treat previous revisions equally to current revisions (for example, if you want to give users the ability to revert a product to a previous revision), then it is probably best to keep a single products table, rather than copying data between tables. If you are worried about performance, this is what indexes are for.
You can create a compound primary key for the product table, e.g. PRIMARY KEY (product_id, revision). Maybe a stored proc to find the current revision—by selecting the row with the highest revision for a particular product_id—will be useful.
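A sketch of the compound key and the current-revision lookup (generic SQL; column names are assumptions and parameter syntax varies by database):
CREATE TABLE product (
    product_id  INT NOT NULL,
    revision    INT NOT NULL,
    name        VARCHAR(200),
    PRIMARY KEY (product_id, revision)
);
-- latest revision of a given product
SELECT *
FROM product p
WHERE p.product_id = @product_id
  AND p.revision = (SELECT MAX(revision)
                    FROM product
                    WHERE product_id = @product_id);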
I would recommend having a table that is an exact copy of the current table, plus a HistoryDate column, and storing the revisions in that table. You can do this for all 3 tables in question.
By keeping the revision separate from the main tables, you will not incur any performance penalties when querying the main tables.
You can also look at keeping a record of the user who changed the data.
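A sketch of that layout for the product table (the same pattern applies to the other tables; column names are assumptions):
CREATE TABLE product_history (
    product_id   INT NOT NULL,
    name         VARCHAR(200),
    -- ... the same columns as the main product table ...
    HistoryDate  DATETIME NOT NULL,   -- when this version was superseded
    ChangedBy    VARCHAR(100)         -- the user who changed the data
);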