Tree Structure With History in SQL Server

I'm not sure if it's a duplicate question, but I couldn't find anything on this subject.
What's the best way to store a tree structure in a table, with the ability to store the history of changes to that structure?
Thank you for any help.
UPDATE:
I have an Employees table and I need to store the structure of branches, departments, sections, sub-sections, etc. in a table.
I need to store historical information on employees' branches, departments, and so on, to be able to retrieve an employee's branch, department, section, and sub-section even if the structure has changed since.
UPDATE 2:
One solution is to save the whole structure in a history table on every change to the structure, but is that the best approach?
UPDATE 3:
There's also an Orders table. I must store the employee's position, branch, department, section, sub-section and other details with every order. That's the main reason for storing history, and it will be used very often. In other words, I should be able to show the DB data as of any past day.
UPDATE 4:
Maybe using hierarchyid is an option?
What if a node is renamed? What should I do, if I need the old name on old orders?

I think you are looking for something like this. It provides a complete tree structure. This one is used for a directory, but it can be used for anything: divisions, departments, sections, etc.
The History is separate, and it is best if you get your head around the Node structure before contemplating the history. For the history of any table, all that is required is the addition of a DateTime or TimeStamp column to the PK. The history stores before-images of the current rows.
Functions such as (a) resolving the path of the tree and (b) finding the relevant history rows that were current at some point in time are performed using pure SQL. With MS SQL Server you can use recursive CTEs for (a), or write a simpler form in a stored proc.
For (b), I would implement a derived version of the table (Node, Department, whatever) that returns the rows that were current at the relevant point in time; then use that for the historic version of (a).
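To make (a) concrete, here is a minimal recursive-CTE sketch; the table and column names (Node, ParentNodeId, Name) are illustrative, not taken from the linked data model:

WITH NodePath AS (
    -- Anchor: the root rows (no parent).
    SELECT NodeId, ParentNodeId, Name,
           CAST(Name AS VARCHAR(4000)) AS Path
    FROM   Node
    WHERE  ParentNodeId IS NULL
    UNION ALL
    -- Recursive step: append each child to its parent's path.
    SELECT n.NodeId, n.ParentNodeId, n.Name,
           CAST(p.Path + '/' + n.Name AS VARCHAR(4000))
    FROM   Node n
    JOIN   NodePath p ON p.NodeId = n.ParentNodeId
)
SELECT NodeId, Path FROM NodePath;

For (b), point the same query at the derived "as at some point in time" version of the table instead of Node.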
It is not necessary to copy and save the entire tree structure every time it changes.
Feel free to ask questions if you need any clarification.
Data Model
▶Tree Structure with History Data Model◀
Readers who are unfamiliar with the Relational Modelling Standard may find ▶IDEF1X Notation◀ useful.

How the historical information will be used determines whether you need a temporal solution or simply an auditing solution. In a temporal solution, you store a date range over which a given node applies in its given position, and the "current" hierarchy is derived by using the current date and time to query for active nodes in order to report on the hierarchy. To say that this is a complicated solution is an understatement.
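As a minimal sketch of the temporal idea (all names here are illustrative, not a definitive design): each node row carries a validity range, and the hierarchy "as of" any moment is a simple filter:

CREATE TABLE OrgNode (
    NodeId       INT          NOT NULL,
    ParentNodeId INT          NULL,      -- NULL for the root
    Name         VARCHAR(100) NOT NULL,
    ValidFrom    DATETIME     NOT NULL,
    ValidTo      DATETIME     NULL,      -- NULL = still current
    PRIMARY KEY (NodeId, ValidFrom)
);

DECLARE @AsOf DATETIME = GETDATE();  -- the point in time of interest

SELECT NodeId, ParentNodeId, Name
FROM   OrgNode
WHERE  ValidFrom <= @AsOf
  AND (ValidTo IS NULL OR ValidTo > @AsOf);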
Given that we are talking about an employee hierarchy, my bet is that an auditing solution will suffice. In an auditing solution, you have a table (or tables) for the current hierarchy and store changes somewhere else that is accessible. It should be noted that the "somewhere else" doesn't necessarily need to be data. If changes to the company hierarchy are infrequent, then you could even use a seriously low-tech solution of creating a report (or series of reports) of the company hierarchy and storing those in PDF format. When someone wants to know what the hierarchy looked like last May, they can go find the corresponding PDF printout.
However, if it is desired to have the audit trail be queryable, then you could consider something like SQL Server 2008's Change Tracking feature or a third-party solution which does something similar.
Remember that there is more to the question of "best" than the structure itself. There is a cost-benefit analysis of how much effort is required vs. the functionality it provides. If stakeholders need to query for the hierarchy at any moment in time and it fluctuates frequently (wouldn't want to work there :)) then a temporal solution may be best but will be more costly to implement. If the hierarchy changes infrequently and granular auditing is not necessary, then I'd be inclined to simply store PDFs of the hierarchy on a periodic basis. If there is a real need to query for changes or granular auditing is needed, then I'd look at an auditing tool.
Change Tracking
Doing a quick search, here's a couple of third-party solutions for auditing:
ApexSQL
Lumigent Technologies

There are several approaches; however, with your limited information I would say to use a single table with a parent relationship, and for the history you can simply implement auditing of the table.
UPDATE: Based on your new information, I would not use a single-table data store. It looks like your hierarchical structure is much more complex than a simple tree, and it also looks like you have some well-defined nodes at the top of the structure before getting into the sections and sub-sections, so multi-table relations might be a better fit.
As far as the audit tables go, there are plenty of resources to find out what will work best for you; there are per-row and per-column audits, etc.
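For reference, a minimal sketch of that simple single-table (adjacency list) variant with a plain audit table; all names are illustrative:

CREATE TABLE OrgUnit (
    OrgUnitId INT          NOT NULL PRIMARY KEY,
    ParentId  INT          NULL REFERENCES OrgUnit (OrgUnitId),  -- NULL = root
    Name      VARCHAR(100) NOT NULL
);

-- Audit: a before-image of each changed row, keyed by change time.
CREATE TABLE OrgUnitAudit (
    OrgUnitId INT          NOT NULL,
    AuditedAt DATETIME     NOT NULL,
    ParentId  INT          NULL,
    Name      VARCHAR(100) NOT NULL,
    PRIMARY KEY (OrgUnitId, AuditedAt)
);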

One thing to note is that you don't have to historize the tree structure itself. It only ever grows; it never gets smaller.
What changes is the USE of the nodes. Their data may change.
Take a site as an example: the path
/a/b/c
will be there forever. a may be a folder, b too, c a file. Later a may still be a folder, while b is a tombstone (no data here at the moment), as is c. But the tree in itself never changes.
Then add a version number and, for every node, a history / list of uses (node types) with a start version and a possibly NULL end version.
The code for showing a version X can then easily build out the "real" tree for that moment.
To support moves of nodes, have a "rehook" type that indicates the node was moved to another parent at that version.
Voila.
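A minimal sketch of that versioning scheme (all names illustrative): the node table only ever grows, and each use of a node is bounded by versions, which also answers the rename question above, since the name lives with the use:

CREATE TABLE TreeNode (
    NodeId   INT NOT NULL PRIMARY KEY,
    ParentId INT NULL REFERENCES TreeNode (NodeId)  -- immutable once created
);

CREATE TABLE NodeUse (
    NodeId       INT          NOT NULL REFERENCES TreeNode (NodeId),
    StartVersion INT          NOT NULL,
    EndVersion   INT          NULL,      -- NULL = still in use
    NodeType     VARCHAR(20)  NOT NULL,  -- e.g. 'folder', 'file', 'tombstone', 'rehook'
    Name         VARCHAR(100) NOT NULL,
    PRIMARY KEY (NodeId, StartVersion)
);

DECLARE @V INT = 5;  -- the version to show

SELECT n.NodeId, n.ParentId, u.NodeType, u.Name
FROM   TreeNode n
JOIN   NodeUse  u ON u.NodeId = n.NodeId
WHERE  u.StartVersion <= @V
  AND (u.EndVersion IS NULL OR u.EndVersion > @V);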

Related

Tool to extract a DB row and ALL its referenced "objects" (recursively)?

For our project it has become increasingly complicated to reproduce certain error conditions that show up in production use. Extracting and recreating certain conditions sometimes takes hours of re-entering data, mainly because the required "graph" of rows can be huge and there are many referential constraints that must be fulfilled; recreating these in an analysis DB (production cannot be used for this, for obvious reasons) in the correct order is often extremely complicated and tedious.
What would ease such analyses enormously would be some tool that - given a specific table and row-id as starting point - would traverse the entire graph as defined by a table's references (foreign keys) and emit all referenced entries recursively.
Ideally it would emit all these rows (table name, column names and their values) as sql-insert statements such that one could execute these as inserts scripts to load a relevant subset into another DB for analysis.
Does such a tool exist? I could imagine that this is not such a rare and exotic requirement. Or is this wishful dreaming, and am I in for a longer programming exercise?
The DB we are using is Oracle (v12) - in case that matters.
Hope I could make myself clear and convey the intention.
Years ago I did this, and it was very easy because I had object-relational mapping to Java objects. I could just pick up the "parent" object from the source DB (which would recursively traverse all the relationships and pick up all the children), open a connection to the target DB, and save the fully instantiated parent object tree. I don't know of any other tools. Folks in my company try to keep a pre-prod DB periodically refreshed from prod.
Taking hours to manually reproduce the problem conditions in the data seems like a long time, but so would hand-building a custom solution.
If you are having data problems, it is likely a bug in the code, so if you can get developers to "write the test that fails" based on the conditions, you'll be better off.
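If you do end up hand-building something, the foreign-key graph itself is at least queryable from Oracle's data dictionary. A minimal sketch (the starting table name ORDERS is a hypothetical example; generating the actual INSERT statements from this edge list is the longer programming exercise the question anticipates):

-- Walk the FK graph outward from a starting table; each joined row is one
-- edge: referencing table -> referenced (parent) table.
SELECT LEVEL,
       c.table_name      AS referencing_table,
       c.constraint_name AS fk_name,
       p.table_name      AS referenced_table
FROM   user_constraints c
JOIN   user_constraints p ON p.constraint_name = c.r_constraint_name
WHERE  c.constraint_type = 'R'
START WITH c.table_name = 'ORDERS'
CONNECT BY NOCYCLE c.table_name = PRIOR p.table_name;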

SQL database structure for housing historical data and displaying changes

Good morning,
This is more of a concept question than anything.
I am looking to design a database and interface that will track changes to the entries (in this case people) and display those changes readily.
(user experience would look something like this)
for user A
Date     Category            Activity
8/8/14   change position     position 1 -> position 2
8/9/14   change department   department a -> department b
...
...
The visual experience seems like it would benefit from an E-A-V design; however, I am designing the database to be easy to data-mine, and from my reading I think that E-A-V is not the right way to go.
Does it make sense to duplicate data just to display it?
If not, does anyone have a suggestion for how to query the history table and display the result? (Currently using jQuery and PHP to leverage the DB... I suppose I could do something interesting from a coding perspective to get it done.)
Thank you for your help,
Travis
Creating an efficient operational database environment and creating an 'easy-to-data-mine' environment are two separate (and often opposing) goals.
Others might disagree with me, but in my opinion it is best to create your database based on operational readiness (this means using your E-A-V design as mentioned above) and then worry about data transformation later. This may make it inconvenient later to transform the data for easy mining, but it accomplishes an incredibly important goal: eliminating the possibility of data error.
Once you have a good system in place where you can collect data appropriately, then you can create a warehouse or datamart environment to more conveniently extract that data.
This may sound like a lot of work but from a data integrity perspective, it is much safer than trying to create some system that is designed entirely for reporting. That's my personal opinion at least.
(Sorry, cannot comment yet.)
You have to analyse the data you need to persist.
If you have only a couple of tables, with no relationships, you probably don't need the database.
In this case the database solution will probably be slower (connection/transmission/security overhead...).
Well, if it's a few MBs of data, I would keep everything in one table.
You can easily load the whole data set in memory and do what you need to do.
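If you do keep the history in one table, the listing in the question is a single query. A minimal sketch (SQL Server flavoured; the table and column names are hypothetical):

CREATE TABLE UserChangeLog (
    UserId     INT          NOT NULL,
    ChangeDate DATE         NOT NULL,
    Category   VARCHAR(50)  NOT NULL,  -- e.g. 'change position'
    OldValue   VARCHAR(100) NOT NULL,
    NewValue   VARCHAR(100) NOT NULL
);

DECLARE @UserId INT = 42;  -- user A

SELECT ChangeDate AS [Date],
       Category,
       CONCAT(OldValue, ' -> ', NewValue) AS Activity
FROM   UserChangeLog
WHERE  UserId = @UserId
ORDER BY ChangeDate;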

Is it possible to use Nhibernate with partition of an object over several tables?

We have a system that gathers large quantities of data each month and performs rather advanced calculations that increase the size of the database even more. We have requirements from the customer that data must be stored for fast access three years back, and that we must be able to access older data (up to ten years); the latter can be lower-performance, but it requires some work. We want to avoid performance issues where the database and its tables grow out of proportion.
After discussing using SQL Enterprise (VERY costly and full of traps, since we haven't got the know-how), and since our system has so many tables that reference each other, we are leaning towards creating some kind of history tables to which we move data in a monthly fashion, and rewriting the select queries that we have, based on parameters, to search either in the regular table, in the history, or in both, depending on the situation.
Since we are also using NHibernate for mapping, I was wondering if it is possible to create a mapping file that handles this by itself (almost), using some sort of polymorphism or inheritance in which each object is stored in different tables based on parameters?
I know this sounds complicated and strange, and I know there are other methods to accomplish this, but in this question I would rather have people answer the question asked and not give other suggestions.
As far as I know, NHibernate can't do that (each class can be mapped to one table/view), but you can use SQL queries or stored procedures (depending on the version of NHibernate that you are using) to populate mapped objects.
In your case you can have a combined view, created by making a union of the different tables, and then use a SQL query to populate your entity.
Another solution is to create a summary object for your queries that uses that view; you can then use both HQL and Criteria to query this object.
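A minimal sketch of that union view (table and column names hypothetical); the summary class would then be mapped to the view instead of either concrete table:

CREATE VIEW AllMeasurements AS
SELECT Id, MeasuredAt, Amount, 0 AS IsHistoric FROM Measurement
UNION ALL
SELECT Id, MeasuredAt, Amount, 1 AS IsHistoric FROM MeasurementHistory;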
Short answer "no". I would not create views as you mention a lot of joining.
Personally I would create summary tables and map to these directly using a stateless session or a very least mutable=false on the class definition. Think of these summary tables as denormalised data for report only. The only drawback is if historic data changes on a regular basis then the summary tables also needs changing. If historical data never changes then this should be simple to achieve.
I would also most probably store these summary tables on another catalog rather than adding to the size of the current system.
Its not a quick win this one I am afraid.

Upgrade strategies for bad DB schema designs

I've shown up at a new job and discovered a database which is in dire need of some help. There are many, many things wrong with it, including:
No foreign keys...anywhere. They're faked by using ints and managing the relationship in code.
Practically every field is nullable, which doesn't reflect the actual data requirements
Naming conventions for tables and columns are practically non-existent
Varchars which are storing concatenated strings of relational information
Folks can argue "it works", and it does. But moving forward, it's a total pain to manage all of this in code, and it opens us up to bugs, IMO. Basically, the DB is being used as a flat file, since it's not doing a whole lot of work.
I want to fix this. The issues I see now are:
We have a lot of data (migration, possibly tricky)
All of the DB logic is in code (with migration comes big code changes)
I'm also tempted to do something "radical" like moving to a schema-free DB.
What are some good strategies when faced with an existing DB built upon a poorly designed schema?
Enforce Foreign Keys: If a relationship exists in the domain, then it should have a Foreign Key.
Renaming existing tables/columns is fraught with danger, especially if there are many systems accessing the Database directly. Gotchas include tasks that run only periodically; these are often missed.
Of Interest: Scott Ambler's article: Introduction To Database Refactoring
and Catalog of Database Refactorings
Views are commonly used to transition between changing data models because of the encapsulation they provide. A view looks like a table, but does not exist as a concrete object in the database; you can change which column is returned for a given column alias as desired. This allows you to set up your codebase to use a view, so you can move from the old table structure to the new one without the application needing to be updated. But it means the view has to return the data in the existing format. For example, your current data model has:
SELECT t.column -- a list of concatenated strings, assumed comma-separated
FROM OLD_TABLE t
...so the first version of the view would be the query above, but once you have created the new table that uses 3NF, the query for the view would use:
SELECT GROUP_CONCAT(t.column SEPARATOR ',')
FROM NEW_TABLE t
GROUP BY t.old_row_id -- hypothetical key tying the 3NF rows back to one row of the old table
...and the application code would never know that anything changed.
The problem with MySQL is that its view support is limited: you can't use variables within a view, nor can a view contain a subquery in the FROM clause.
The reality of the changes you wish to make is effectively rewriting the application from the ground up. Moving logic from the codebase into the data model will drastically change how the application gets the data. Model-View-Controller (MVC) is ideal to implement alongside changes like these, to minimize the cost of future changes.
I'd say leave it alone until you really understand it. Then make sure you don't start with one of the Things You Should Never Do.
Read Scott Ambler's book on Refactoring Databases. It covers a good many techniques for how to go about improving a database - including the transitional measures needed to allow both old and new programs to work with the changing design.
Create a completely new schema and make sure that it is fully normalized and contains any unique, check and not null constraints etc that are required and that appropriate data types are used.
Prepopulate each table that fills the parent role in a foreign key relationship with a single 'Unknown' record.
Create an ETL (Extract Transform Load) process (I can recommend SSIS (SQL Server Integration Services) but there are plenty of others) that you can use to refill the new schema from the existing one on a regular basis. Use the 'Unknown' record as the parent of any orphaned records - there will be plenty ;). You will need to put some thought into how you will consolidate duplicate records - this will probably need to be on a case by case basis.
Use as many iterations as are necessary to refine your new schema (ensure that the ETL Process is maintained and run regularly).
Create views over the new schema that match the existing schema as closely as possible.
Incrementally modify any clients to use the new schema making temporary use of the views where necessary. You should be able to gradually turn off parts of the ETL process and eventually disable it completely.
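As a small illustration of the 'Unknown' parent trick during the ETL load (schema and table names hypothetical):

-- Pre-seeded 'Unknown' parent, per the step above:
INSERT INTO new_schema.Department (DepartmentId, Name) VALUES (0, 'Unknown');

-- Orphaned rows fall back to the 'Unknown' parent during the load:
INSERT INTO new_schema.Employee (EmployeeId, Name, DepartmentId)
SELECT o.emp_id,
       o.emp_name,
       COALESCE(d.DepartmentId, 0)
FROM   old_schema.employees o
LEFT JOIN new_schema.Department d ON d.DepartmentId = o.dept_id;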
First see how bad the code related to the DB is. If it is all mixed in, with no DAO layer, you shouldn't think about a rewrite; but if there is a DAO layer, then it would be time to rewrite that layer and the DB along with it. If possible, make the migration tool based on using the two DAOs.
But my guess is there is no DAO, so you need to find which areas of the code you are going to be changing and which parts of the DB those relate to; hopefully you can cut it up into smaller parts that can be updated as you maintain. The biggest deal is to get FKs in there and to start checking for proper indexes; there is a good chance they aren't being done correctly.
I wouldn't worry too much about naming until the rest of the DB is under control. As for the NULLs: if the program chokes on a value being NULL, don't let it be NULL, but if the program can handle it I wouldn't worry about it at this point. In the future, if the code applies a default value, move that into the DB, but that is way down the line from the sound of things.
Do something about the varchars sooner rather than later. If anything, make that the first pure background fix to the program.
The other thing to do is estimate the effort of each area's change and then add that price to the cost of new development on that section of code. That way you can fix the parts as you add new features.

What's the best way to store/calculate user scores?

I am looking to design a database for a website where users will be able to gain points (reputation) for performing certain activities and am struggling with the database design.
I am planning to keep records of the things a user does so they may have 25 points for an item they have submitted, 1 point each for 30 comments they have made and another 10 bonus points for being awesome!
Clearly all the data will be there, but it seems like a lot of querying to get the total score for each user, which I would like to display next to their username (in the form of a level). For example, a query to the submitted-items table to get the scores for each item from that user, a query to the comments table, etc. And all this needs to be done for every user mentioned on a page... LOTS of queries!
I had considered keeping a score in the user table, which would seem a lot quicker to look up, but I've had it drummed into me that storing data that can be calculated from other data is BAD!
I've seen a lot of sites that do similar things (even Stack Overflow does something similar), so I figure there must be a "best practice" to follow. Can anyone suggest what it may be?
Any suggestions or comments would be great. Thanks!
I think that this is definitely a great question. I've had to build systems that have similar behavior to this--especially when the table with the scores in it is accessed pretty often (like in your scenario). Here's my suggestion to you:
First, create some tables like the following (I'm using SQL Server best practices, but name them however you see fit):
UserAccount        UserAchievement
-Guid (PK)         -Guid (PK)
-FirstName         -UserAccountGuid (FK)
-LastName          -Name
-EmailAddress      -Score
Once you've done this, go ahead and create a view that looks something like the following (no, I haven't verified this SQL, but it should be a good start):
SELECT [UserAccount].[FirstName] AS FirstName,
       [UserAccount].[LastName] AS LastName,
       SUM([UserAchievement].[Score]) AS TotalPoints
FROM   [UserAccount]
INNER JOIN [UserAchievement]
       ON [UserAccount].[Guid] = [UserAchievement].[UserAccountGuid]
GROUP BY [UserAccount].[FirstName],
         [UserAccount].[LastName]
-- Note: an ORDER BY is not allowed inside a SQL Server view definition;
-- sort in the query that selects from the view instead.
I know you've mentioned some concern about performance and a lot of queries, but if you build out a view like this, you won't ever need more than one. I recommend not making this a materialized view; instead, just index your tables so that the lookups that you need (essentially, UserAccountGuid) will enable fast summation across the table.
I will add one more point--if your UserAccount table gets huge, you may consider a slightly more intelligent query that would incorporate the names of the accounts you need to get roll-ups for. This will make it possible not to return huge data sets to your web site when you're only showing, you know, 3-10 users' information on the page. I'd have to think a bit more about how to do this elegantly, but I'd suggest staying away from "IN" statements since this will invoke a linear search of the table.
For very high read/write ratios, denormalizing is a very valid option. You can use an indexed view, and the data will be kept in sync declaratively (so you never have to worry about there being bad score data). The downside is that it IS kept in sync, so the updates to the stored total are a synchronous part of committing the score action. This would normally be quite fast, but it is a design decision. If you denormalize yourself, you can choose whether you want some kind of delayed update system.
Personally I would go with an indexed view for starting, and then later you can replace it fairly seamlessly with a concrete table if your needs dictate.
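A minimal sketch of the indexed-view variant (SQL Server; the names follow the tables above, and it assumes Score is declared NOT NULL, which indexed views require for SUM):

CREATE VIEW dbo.UserScore
WITH SCHEMABINDING
AS
SELECT UserAccountGuid,
       SUM(Score)   AS TotalPoints,
       COUNT_BIG(*) AS NumRows  -- mandatory in aggregated indexed views
FROM   dbo.UserAchievement
GROUP BY UserAccountGuid;
GO
-- The unique clustered index is what materializes the view:
CREATE UNIQUE CLUSTERED INDEX IX_UserScore ON dbo.UserScore (UserAccountGuid);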
In the past we've always used some sort of nightly or periodic cron job to calculate the current score and save it in the database, sort of like a persistent view of the SUM on the activities table. Like most "best practices", they are simply guidelines, and it's often better and more practical to deviate from a specific hard-nosed practice in very specific areas.
Plus, it's not really all that much of a deviation if you use the cron job, as the result is better viewed as a cache stored in the database.
If you have a separate scores table, you could update it each time an item is submitted or a comment is posted by a user. You could do this using a trigger or within the site's code.
The user scores would be updated continuously, and could be quickly queried for display.
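A minimal sketch of the trigger variant (SQL Server syntax; table names hypothetical, 1 point per comment as in the question):

CREATE TRIGGER trg_Comments_Score ON Comments
AFTER INSERT
AS
UPDATE s
SET    s.Score = s.Score + i.cnt
FROM   UserScores s
JOIN  (SELECT UserId, COUNT(*) AS cnt
       FROM inserted
       GROUP BY UserId) i ON i.UserId = s.UserId;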