How to store complex records for referencing historical revisions?

I have a table in my database that outlines complex processes in a work breakdown structure (similar to what's used to create Gantt charts). There are multiple rows for a particular process, each row outlining a hierarchical step of that process.
I then have a table with some product types, each being linked to a particular process. When an order for a particular product is placed - it is to be manufactured with the associated process.
In my situation, the processes can be dynamic (steps added or removed, for example).
I'm curious what the best way is to capture current and historical revisions of each process, so that even though a process may have evolved over time, I can go back to a particular order and determine what the process looked like at that time.
I'm sure there are multiple ways to go about this, using logging or triggers with a new history table - but I've had no experience doing something like this and I'd like to know what worked well for others.
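For illustration, a minimal sketch of the "history table" idea, with purely hypothetical table and column names: every edit to a process creates a new immutable revision, and each order records the revision that was current when it was placed.

```sql
-- Hypothetical schema sketch: processes are versioned through revisions.
CREATE TABLE process (
    process_id  INT PRIMARY KEY,
    name        VARCHAR(100) NOT NULL
);

-- Each change to a process creates a new, immutable revision row.
CREATE TABLE process_revision (
    revision_id  INT PRIMARY KEY,
    process_id   INT NOT NULL REFERENCES process (process_id),
    revision_no  INT NOT NULL,
    created_at   TIMESTAMP NOT NULL,
    UNIQUE (process_id, revision_no)
);

-- The hierarchical steps hang off a revision, never off the process directly,
-- so earlier step structures are never overwritten.
CREATE TABLE process_step (
    step_id        INT PRIMARY KEY,
    revision_id    INT NOT NULL REFERENCES process_revision (revision_id),
    parent_step_id INT NULL REFERENCES process_step (step_id),
    step_no        INT NOT NULL,
    description    VARCHAR(500) NOT NULL
);

-- An order points at the exact revision in effect when it was placed,
-- so the historical process can always be reconstructed.
CREATE TABLE work_order (
    order_id     INT PRIMARY KEY,
    product_id   INT NOT NULL,
    revision_id  INT NOT NULL REFERENCES process_revision (revision_id),
    ordered_at   TIMESTAMP NOT NULL
);
```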

Related

What is the best way to structure this database?

So I am in the process of building a database from my client's data. Each month they create roughly 25 CSVs, which are unique by their topic and attributes, but they all have one thing in common: a registration number.
The registration number is the only common variable across all of these CSVs.
My task is to move all of this into a database, for which I am leaning towards Postgres (if anyone believes NoSQL would be best for this then please shout out!).
The big problem: structuring this within a database. Should I create one table per month that houses all the data, with column 1 being registration and columns 2-200 being the attributes? Or should I put all the CSVs into Postgres as they are, and then join them later?
I'm struggling to get my head around how to structure this when there will be monthly updates to every registration, and we don't want to destroy historical data - we want to keep it for future benchmarks.
I hope this makes sense - I welcome all suggestions!
Thank you.
In some ways your question is too broad and asks for an opinion (SQL vs. NoSQL).
However, the gist of the question is whether you should load your data one month at a time or into a well-developed data model. Definitely the latter.
My recommendation is the following.
First, design the data model around how the data needs to be stored in the database, rather than how it is being provided. There may be one table per CSV file. I would be a bit surprised, though. Data often wants to be restructured.
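For instance, one possible shape, with hypothetical table and column names: a registration master plus one narrow table per subject area, keyed on registration and month, so each monthly load appends rows rather than overwriting history.

```sql
-- Hypothetical sketch only; real columns come from the CSV topics.
CREATE TABLE registration (
    registration_number TEXT PRIMARY KEY
);

CREATE TABLE topic_a_monthly (            -- one table per subject area
    registration_number TEXT NOT NULL REFERENCES registration,
    reporting_month     DATE NOT NULL,    -- first day of the month
    attribute_1         NUMERIC,
    attribute_2         NUMERIC,
    PRIMARY KEY (registration_number, reporting_month)
);
```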
Second, design the archive framework for the CSV files.
You should archive all the incoming files in a nice directory structure with files from each month. This structure should be able to accommodate multiple uploads per month, either for all the files or some of them. Mistakes happen and you want to be sure the input data is available.
Third, COPY (this is the Postgres command) the data into staging tables. This is the beginning of the monthly process.
Fourth, process the data -- including doing validation checks to load it into your data model.
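A rough sketch of steps three and four, continuing the hypothetical model above (the file path and column names are illustrative only):

```sql
-- The staging table mirrors the raw CSV; everything arrives as text.
CREATE TABLE staging_topic_a (
    registration_number TEXT,
    attribute_1         TEXT,
    attribute_2         TEXT
);

-- Step three: bulk-load the archived CSV for the month.
COPY staging_topic_a
FROM '/archive/2024-01/topic_a.csv'
WITH (FORMAT csv, HEADER true);

-- Step four: validate and move rows into the data model; rows that fail
-- the checks stay behind in staging for inspection.
INSERT INTO topic_a_monthly (registration_number, reporting_month, attribute_1, attribute_2)
SELECT registration_number,
       DATE '2024-01-01',
       attribute_1::numeric,
       attribute_2::numeric
FROM   staging_topic_a
WHERE  registration_number IS NOT NULL
  AND  attribute_1 ~ '^[0-9]+(\.[0-9]+)?$';
```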
There may be tweaks to this process, based on questions such as:
Does the data need to be available 24/7 even during the upload process?
Does a validation failure in one part of the data prevent uploading any data?
Are SQL checks (referential integrity and CHECK constraints) sufficient for validating the data?
Do you need to be able to "rollback" the system to any particular update?
These are just questions that can guide your implementation. They are not intended to be answered here.

Table with multiple foreign keys -- only one not null

I'm trying to design a system where an administrator will have to approve changes to the data and other various administrative tasks -- add a user, add an admin etc.
My idea is to have a notification table that contains these notifications, but the problem is that a notification can be any of the previously mentioned types, i.e. its data is stored in one of many tables. Here is a picture to describe my current plan -- note, I'm sure that it's not a proper ER diagram.
Also, the data goes into a pending table that reflects the table it will eventually wind up in, provided the data is approved -- it's a staging ground of sorts. So, a pending_user is a user that is not yet in the user table. And as you can see, the user table, amongst others, is not shown here, but one can use their imagination.
I'm concerned that the multiple null values in the pending table will have adverse effects that I'm not totally aware of, such as increased space usage and possibly increased query times. Also, I'm not sure how I'll implement the retrieval of these notifications. My naive approach is to select the first X notifications, analyze the rows to find the non-null column, retrieve the appropriate data and then load all the data in a response.
Is there a more straight forward pattern for this type of problem?
Thanks in advance for any help.
I think the traditional way is to provide various levels of access/read/write rights to users. These access rights define what actions a user can and can't perform. In this traditional approach, if a user has access to a certain function, they can perform it without further approval.
Also, traditionally there is some kind of audit log that contains a trace of all important changes to the data. With such a log it is possible to know who made a change (and when).
If you need to build a two-stage system, where a change has to go through an approval, I'd add a flag column to each important table that would indicate that values in the given row are not final and have to be approved. The table would store all historical changes to the data and with the help of this flag the system would know which variant is the latest approved version and which variant is pending and waiting for approval.
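A minimal sketch of that flag-column idea, with hypothetical table and column names (the same pattern would be repeated on each important table):

```sql
-- Every change inserts a new version row; the status flag says whether it
-- is the approved variant or still pending.
CREATE TABLE app_user (
    user_id     INT          NOT NULL,
    version_no  INT          NOT NULL,
    name        VARCHAR(100) NOT NULL,
    email       VARCHAR(200) NOT NULL,
    status      VARCHAR(10)  NOT NULL
                CHECK (status IN ('approved', 'pending', 'rejected')),
    changed_at  TIMESTAMP    NOT NULL,
    PRIMARY KEY (user_id, version_no)
);

-- Latest approved variant of each user (what the application normally shows).
SELECT u.*
FROM   app_user u
WHERE  u.status = 'approved'
  AND  u.version_no = (SELECT MAX(v.version_no)
                       FROM   app_user v
                       WHERE  v.user_id = u.user_id
                         AND  v.status = 'approved');

-- Variants still waiting for an administrator's approval.
SELECT * FROM app_user WHERE status = 'pending';
```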
I would not try to make a single universal table that would hold data related to changes in many different tables. Each table is different and approval process for each table is likely to be different. I doubt that you'll have more than a dozen entities that are important enough to go through this approval process.

Should I create multiple tables, or even databases for multiple users of a CRM

I'm working on creating an application best described as a CRM. There is a relatively complex table structure, and I'm thinking about allowing users to do a fair bit of customization (adding fields and the like). One concern is that I will be reaching a certain level of scale almost immediately. We have about 50,000 individual users who will be coming online within about nine months of launch. So I want to build to last.
I'm thinking about two and maybe even three options.
One table set with a userID column on everything, plus custom attributes handled by one table that indexes the custom attribute definitions and another table that holds their values, which can then be joined to the existing contact records for the user. -- From what I've read, this seems like the right option, but I keep feeling like it's not. It seems like once these tables start reaching millions of records, searching for just one user's records in every query is going to become a database hog.
For each user account, recreate the table set, prefixed with a unique identifier (the userID, for example). Then rather than using a WHERE userID=? everywhere, I can use a FROM ?_contacts. For attributes I could then have a custom attributes table where users could add additional columns for custom attributes. -- This feels like the simplest way to go, though of course when I decide to change the database structure there would be a migration from hell.
The third option, which I'm pretty confident is wrong, but for that reason alone I cannot rule out, is that a new database should be created for each user with all the requisite tables.
Am I crazy? Is option one really the best?
The first method is the best. Create individual userIds and then you can assign specific roles to them. Database retrieval time does depend on the number of records, but that is a trade-off you can manage by writing efficient SQL queries to fetch the data. You probably won't run out of memory or run into concurrency issues, because with a good server the performance ought to be fine, provided you are efficient in writing queries.
If you recreate the table set per user, you will just end up creating lots of tables, which can slow indexing and is bad practice. It is better to stay with a single relational schema and normalize the database and tables to improve efficiency.
Creating a new database for each and every user combines the drawbacks of both of the above, resulting in shabby and disorganized database access. If you decide to run individual database instances for every single user, you will just end up consuming your server's physical resources, like RAM and CPU, which will affect the service quality of all the other users.
Take up option 1. Assign separate userIds and assign them roles and privileges where needed. That is more efficient than the other two methods.
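For illustration, a rough sketch of option one with hypothetical table names: shared tables carrying a userID column, plus an attribute-definition table and an attribute-value table for the custom fields.

```sql
CREATE TABLE contact (
    contact_id INT PRIMARY KEY,
    user_id    INT NOT NULL,             -- the CRM account that owns the row
    name       VARCHAR(200) NOT NULL
);
CREATE INDEX ix_contact_user ON contact (user_id);

CREATE TABLE custom_attribute (
    attribute_id INT PRIMARY KEY,
    user_id      INT NOT NULL,           -- each account defines its own fields
    label        VARCHAR(100) NOT NULL
);

CREATE TABLE contact_attribute_value (
    contact_id   INT NOT NULL REFERENCES contact (contact_id),
    attribute_id INT NOT NULL REFERENCES custom_attribute (attribute_id),
    value        VARCHAR(500),
    PRIMARY KEY (contact_id, attribute_id)
);

-- Fetch one account's contacts with their custom fields; the index on
-- user_id keeps this selective even when the tables hold millions of rows.
SELECT c.contact_id, c.name, a.label, v.value
FROM   contact c
LEFT JOIN contact_attribute_value v ON v.contact_id   = c.contact_id
LEFT JOIN custom_attribute a        ON a.attribute_id = v.attribute_id
WHERE  c.user_id = 42;
```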

Display Statistics? SQL View, SQL Trigger, Cached Values

I am working on a web app and would like to be able to display some computed statistics about different objects to the user. Some good examples would be: the "Top Questions" page on this site - SO - which lists "Votes", "Answers", and "Views", or the "Comment" and "Like" counts for a list of posts on the Facebook News Feed. Computed values like these are used all over the web, and I am not sure of the best way to implement them.
I will describe a generic view of the problem in greater detail. I have a single parent table in my database (you can visualize it as a blog post). This table has a one-to-many relationship with 5 other tables (visualize them as comments and likes etc.). I need to display a list of ~20 parent table objects with the counts of each related child object (visualize it as a list of blog posts, each displaying the total number of comments and total number of likes). I can think of multiple ways to tackle the problem, but I am not sure which would be the FASTEST and most ACCURATE.
Here are a number of options I have come up with...
A) SQL Trigger - Create a trigger to increment and decrement a computed count column on the parent table as the child tables have inserts and deletes performed. I'm not sure about the performance tradeoffs of running the trigger every time a small child object is created or deleted. I am also unsure about potential concurrency issues (although in my current architecture each row in the db can only be added or deleted by the row creator).
B) SQL View - Just an easier way to query and will yield accurate results, but I am worried about the performance implications for this type of view.
C) SQL Indexed View - An indexed view would be accurate and potentially faster, but as each child table has rows that can be added or removed, the view would constantly have to be recalculated.
D) Cached Changes - Some kind of interim in-process solution that would cache changes to the child tables, compute net changes to the counts, and flush them to the db based on some parameter. This could potentially be coupled with a process that checks for accuracy every so often.
E) Something awesome I haven't thought of yet :) How does SO keep track of soooo many stats??
Using SQL Server 2008 R2.
**Please keep in mind that I am building something custom and it is not a blog/FB/SO; I am just using them as an example of a similar problem, so suggesting to just use those sites is unhelpful ;) I would love to hear how this is accomplished in live web apps that handle a decent volume of requests.
THANKS in Advance
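For reference, a minimal sketch of what option C could look like on SQL Server, assuming hypothetical Post and Comment tables; the unique clustered index materializes the counts and the engine keeps them in sync as child rows are inserted or deleted.

```sql
CREATE VIEW dbo.PostCommentCounts
WITH SCHEMABINDING
AS
SELECT  c.PostId,
        COUNT_BIG(*) AS CommentCount
FROM    dbo.Comment AS c
GROUP BY c.PostId;
GO

-- The first index on an indexed view must be a unique clustered index.
CREATE UNIQUE CLUSTERED INDEX IX_PostCommentCounts
    ON dbo.PostCommentCounts (PostId);
```

One such view per child table could then be joined to the ~20 parent rows being displayed.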

How to manage multiple versions of the same record

I am doing short-term contract work for a company that is trying to implement a check-in/check-out type of workflow for their database records.
Here's how it should work...
A user creates a new entity within the application. There are about 20 related tables that will be populated in addition to the main entity table.
Once the entity is created the user will mark it as the master.
Another user can make changes to the master only by "checking out" the entity. Multiple users can checkout the entity at the same time.
Once the user has made all the necessary changes to the entity, they put it in a "needs approval" status.
After an authorized user reviews the entity, they can promote it to master which will put the original record in a tombstoned status.
The way they are currently accomplishing the "check out" is by duplicating the entity records in all the tables. The primary keys include EntityID + EntityDate, so they duplicate the entity records in all related tables with the same EntityID and an updated EntityDate and give it a status of "checked out". When the record is put into the next state (needs approval), the duplication occurs again. Eventually it will be promoted to master at which time the final record is marked as master and the original master is marked as dead.
This design seems hideous to me, but I understand why they've done it. When someone looks up an entity from within the application, they need to see all current versions of that entity. This was a very straightforward way for making that happen. But the fact that they are representing the same entity multiple times within the same table(s) doesn't sit well with me, nor does the fact that they are duplicating EVERY piece of data rather than only storing deltas.
I would be interested in hearing your reaction to the design, whether positive or negative.
I would also be grateful for any resources you can point me to that might be useful for seeing how someone else has implemented such a mechanism.
Thanks!
Darvis
I've worked on a system like this which supported the static data for trading at a very large bank. The static data in this case is things like the details of counterparties, standard settlement instructions, currencies (not FX rates) etc. Every entity in the database was versioned, and changing an entity involved creating a new version, changing that version and getting the version approved. They did not however let multiple people create versions at the same time.
This led to a horribly complex database, with every join having to take version and approval state into account. In fact, the software I wrote for them was middleware that abstracted this complex, versioned data into something that end-user applications could actually use.
The only thing that could have made it any worse was to store deltas instead of complete versioned objects. So the point of this answer is - don't try to implement deltas!
This looks like an example of a temporal database schema. Often, in cases like this, a distinction is made between an entity's key (EntityID, in your case) and the row's primary key in the database (in your case {EntityID, date}, but often a simple integer). You have to accept that the same entity is represented multiple times in the database, at different points in its history. Every database row still has a unique ID; it's just that your database is tracking versions, rather than entities.
You can manage data like that, and it can be very good at tracking changes to data, and providing accountability, if that is required, but it makes all of your queries quite a bit more complex.
You can read about the rationale behind, and the design of, temporal databases on Wikipedia.
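A minimal sketch of that kind of versioned key, with hypothetical names:

```sql
-- One row per version of an entity; entity_id is the logical identity,
-- the surrogate version_id (or {entity_id, valid_from}) is the row identity.
CREATE TABLE entity_version (
    version_id  INT         PRIMARY KEY,
    entity_id   INT         NOT NULL,
    valid_from  TIMESTAMP   NOT NULL,
    valid_to    TIMESTAMP   NULL,        -- NULL = still current
    status      VARCHAR(15) NOT NULL
                CHECK (status IN ('checked_out', 'needs_approval', 'master', 'dead')),
    UNIQUE (entity_id, valid_from)
);

-- The current master version of each entity; most application queries would
-- filter like this (often hidden behind a view) to cope with the versioning.
SELECT *
FROM   entity_version
WHERE  status = 'master'
  AND  valid_to IS NULL;
```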
You are describing a homebrew Content Management System which was probably hacked together over time, is - for the reasons you state - redundant and inefficient, and given the nature of such systems in firms is unlikely to be displaced without massive organizational effort.