DB design for data update approval - SQL

I'm working on a project where we need to have data entered or updated by some users go through a pending status before being added into 'live data'.
Whilst preparing the data the user can save incomplete records. Whilst the data is in the pending status we don't want it to affect rules imposed on users editing the live data, e.g. a user working on the live data should not run up against a unique constraint when entering the same data that is already in the pending status.
I envisage that sets of data updates will be grouped into a 'data submission', and the data will be re-validated and corrected/rejected/approved when someone quality-controls the submission.
I've thought about two scenarios with regards to storing the data:
1) Keeping the pending-status data in the same table as the live data, but adding a flag to indicate its status. I could see issues here with having to remove constraints or make required fields nullable to support the 'incomplete' status data. Then there is the issue of how to handle updating existing data: you would have to add a new row for an update and link it back to the existing 'live' row. This seems a bit messy to me.
2) Add new tables that mirror the live tables and store the data in there until it has been approved. This would allow me to keep full control over the existing live tables, while the 'pending' tables can be abused with whatever the user feels he wants to put in there. The downside is that I will end up with a lot of extra tables/SPs in the db. Another issue I was thinking about was how a user might link between two records, where the record being linked to might be in the live table or in the pending table - but I suppose in this situation you could always take a copy of the linked record and treat it as an update?
Neither solution seems perfect, but the second one seems like the better option to me - is there a third solution?

Your option 2 very much sounds like the best idea. If you want to use referential integrity and all the nice things you get with a DBMS, you can't have the pending data in the same table. But there is no need for the data to be unstructured: pending data is still structured, and presumably you want the db to play its part in enforcing rules even on this data. Even if you didn't, pending data fits well into a standard table structure.
A separate set of tables sounds like the right answer. You can bring the primary key of the row being changed into the pending table so you know which item is being edited, or which item is being linked to.
I don't know your situation exactly, so this might not be appropriate, but one idea would be to have a separate table for storing the batch of edits being made, because then you can quality-control a batch, or submit a batch to live. Each pending table could have a batch key so you know which batch it is part of. You'll have to find a way to control multiple pending edits to the same rows (if you want to), but that doesn't seem too tricky a problem to solve.
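A rough sketch of that layout, in generic SQL (table and column names here are purely illustrative):

CREATE TABLE Submission (
    SubmissionId int PRIMARY KEY,
    SubmittedBy varchar(50) NOT NULL,
    SubmittedOn datetime NOT NULL,
    Status varchar(20) NOT NULL -- e.g. Draft / InReview / Approved / Rejected
);

-- The live table keeps all of its constraints
CREATE TABLE Customer (
    CustomerId int PRIMARY KEY,
    Email varchar(100) NOT NULL UNIQUE
);

-- The pending mirror relaxes them and carries keys back to the batch and the live row
CREATE TABLE Customer_Pending (
    PendingId int PRIMARY KEY,
    SubmissionId int NOT NULL REFERENCES Submission (SubmissionId),
    CustomerId int NULL REFERENCES Customer (CustomerId), -- NULL = new record, otherwise an edit of this live row
    Email varchar(100) NULL -- nullable so incomplete drafts can be saved
);

On approval, a quality-controlled batch gets copied into the live tables (where the real constraints have the final say) and the pending rows are cleared out.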
I'm not sure if this fits but it might be worth looking into 'Master Data Management' tools such as SQL Server's Master Data Services.

'Unit of work' is a good name for 'data submission'.
You could serialize it to a different place, like a (non-relational) document-oriented database, and only save to the relational DB on approval.
It depends on how many of the live data's constraints still need to apply to the unapproved data.

I think the second option is better. To manage this, you can create a view that combines both tables and work with the structure through the view.
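A minimal sketch of such a view, assuming a mirrored pending table as in option 2 (names are illustrative):

CREATE VIEW Customer_All AS
SELECT CustomerId, Email, 'LIVE' AS Source FROM Customer
UNION ALL
SELECT CustomerId, Email, 'PENDING' AS Source FROM Customer_Pending;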
Another good approach is to use an XML column in a separate table to store the necessary data (because of the unknown quantity/names of columns). You can create just one table with an XML column and a 'Type' column to determine which table the document relates to.
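For example, in SQL Server that single table might look something like this (a sketch; names are illustrative):

CREATE TABLE PendingDocument (
    PendingDocumentId int IDENTITY PRIMARY KEY,
    Type varchar(50) NOT NULL, -- which live table this document relates to
    Data xml NOT NULL
);

-- Pull pending customers back out by reaching into the XML
SELECT PendingDocumentId,
       Data.value('(/Customer/Email)[1]', 'varchar(100)') AS Email
FROM PendingDocument
WHERE Type = 'Customer';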

The first scenario seems good.
Add a Status column to the table. There is no need to remove the required-field constraints entirely; just add one check on the required fields based on the flag, e.g. if the flag is 1 (incomplete) NULL is allowed, otherwise it is not.
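One way to express that check is a CHECK constraint over nullable columns; a sketch (names are illustrative):

CREATE TABLE Customer (
    CustomerId int PRIMARY KEY,
    Email varchar(100) NULL, -- physically nullable so drafts can be saved
    StatusFlag tinyint NOT NULL DEFAULT 0, -- 1 = incomplete/pending
    CONSTRAINT CK_Customer_Required CHECK (StatusFlag = 1 OR Email IS NOT NULL)
);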
Regarding the second doubt: do you want to append the data, or update the whole record?

Custom user defined database fields, what is the best solution?

To keep this as short as possible, I'm going to use an example.
So let's say I have a simple database that has the following tables:
company - ( "idcompany", "name", "createdOn" )
user - ( "iduser", "idcompany", "name", "dob", "createdOn" )
event - ( "idevent", "idcompany", "name", "description", "date", "createdOn" )
Many users can be linked to a single company as well as to multiple events, and many events can be linked to a single company. All companies, users and events have the columns shown above in common. However, what if I wanted to give my customers the ability to add custom fields to both their users and their events, for any unique extra information they wish to store? These extra fields would be on a company-wide basis, not a per-record basis (so a company adding a custom field to their users would add it to all of their users, not just one specific user). The custom fields also need to be searchable and reportable, ideally automatically with some sort of report wizard. Considering the database is expected to have lots of traffic as well as lots of custom fields, what is the best solution for this?
My current research and findings in possible solutions:
To have generic placeholder columns such as "custom1", "custom2" etc.
** This is not viable as there will eventually be too many custom columns and there will be too many NULL values stored in the database
To have three tables per current table, e.g. user, user-custom-field, user-custom-field-value. The user table stays the same; the user-custom-field table contains the information about the new field, such as its name, data type etc.; and the user-custom-field-value table contains the value for the custom field.
** This one is more of a contender, if it were not for its complexity and table-size implications. I think it will be impossible to avoid a user-custom-field table if I want to automatically report on these fields, as I will have to store the information on how to report on these fields there. However, in order to pull almost any data you would have to do a million joins on the user-custom-field-value table, and you're now storing column data as rows, which in a database expected to have a lot of traffic as well as a lot of custom fields would soon cause a problem.
Create a new user and event table for each new company added to the system, removing the company id from within those tables and instead using it in the table name (e.g. user56, 56 being the company id). Then allow the user to trigger DB commands that add the new custom columns to the tables, giving them the power to decide whether a column has a default value, auto-increments, etc.
** Every time I have seen this solution, it has instantly been shut down by people saying it would be unmanageable, as you would eventually get thousands of tables. However, nobody really explains what they mean by unmanageable. Firstly, as far as my understanding goes, more tables is actually more efficient and produces faster search times, as the tables are much smaller. Secondly, yes, I understand that making any common table change would be difficult, but all you would have to do is run a script that changes the tables for each company. Finally, I actually see benefits to this method, as it would separate company data, making it impossible for one company to accidentally access another's data via a potential bug, plus it would potentially give the ability to back up and restore company data individually. If someone could elaborate on why this is perceived as a bad idea, it would be appreciated.
Convert fully or partially to a NoSQL database.
** Honestly, I have no experience with schemaless databases and don't really know how dynamic user-defined fields on a per-record basis would work (although I know it's possible). If someone could explain the implications of the switch, the differences in queries and the potential benefits, that would be appreciated.
Create a JSON column in each table that requires extra fields. Then add the extra fields into that JSON object.
** The issue I have with this solution is that it is nearly impossible to filter data via the custom columns. You would not be able to report on these columns and until you have received and processed them you don't really know what is in them.
Finally, if anyone has a solution not mentioned above, or any thoughts or disagreements with any of my notes, please tell me, as this is all I have been able to find or figure out for myself.
A typical solution is to have a JSON (or XML) column that contains the user-defined fields. This would be an additional column in each table.
This is the most flexible. It allows:
New fields to be created at any time.
No modification to the existing table to do so.
Supports any reasonable type of field, including types not readily available in SQL (e.g. arrays).
On the downside,
There is no validation of the fields.
Some databases support JSON but do not support indexes on them.
JSON is not "known" to the database for things like foreign key constraints and table definitions.
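As a sketch of this approach, assuming PostgreSQL's jsonb type (the field names are invented for illustration; "user" needs quoting because it is a reserved word):

ALTER TABLE "user" ADD COLUMN custom_fields jsonb NOT NULL DEFAULT '{}';

-- Set a company-defined field on one record
UPDATE "user"
SET custom_fields = custom_fields || '{"tshirt_size": "XL"}'
WHERE iduser = 42;

-- A GIN index makes the custom fields searchable, addressing the filtering concern
CREATE INDEX idx_user_custom ON "user" USING GIN (custom_fields);
SELECT * FROM "user" WHERE custom_fields @> '{"tshirt_size": "XL"}';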

What is the best method of logging data changes and user activity in an SQL database?

I'm starting a new application and was wondering what the best method of logging is. Some tables in the database will need to have every change recorded, and the user that made the change. Other tables may just need to have the last modified time recorded.
In previous applications I've used different methods to do this but want to hear what others have done.
I've tried the following:
Add a "modified" date-time field to the table to record the last time it was edited.
Add a secondary table just for recording changes in a primary table. Each row in the secondary table represents a changed field in the primary table. So one record update in the primary could create several records in the secondary table.
Add a table similar to no. 2, but one that records edits across three or four tables, referencing the table the edit relates to in an additional field.
What methods do you use and would recommend?
Also, what is the best way to record deleted data? I never like the idea that a user can permanently delete a record from the DB, so usually I have a boolean field 'deleted' which is changed to true when the record is deleted, and then it is filtered out of all queries at the model level. Any other suggestions on this?
Last one: what is the best method for recording user activity? At the moment I have a table which records logins/logouts/password changes etc. and, depending on what the action is, gives it a code of 1, 2, 3 etc.
Hope I haven't crammed too much into this question. Thanks.
I know it's a very old question, but I wanted to add a more detailed answer, as this is the first link I got when googling about DB logging.
There are basically two ways to log data changes:
on the application server layer
on the database layer.
If you can, just use logging on the server side. It is much clearer and more flexible.
If you need to log on the database layer you can use triggers, as @StanislavL said. But triggers can slow down your database's performance and limit you to storing the change log in the same database.
Also, you can look at transaction log monitoring.
For example, in PostgreSQL you can use the logical replication mechanism to stream changes in JSON format from your database to anywhere.
In a separate service you can then receive, handle and log the changes in any form and in any database (for example, just put the JSON you got into Mongo).
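A hedged sketch of that PostgreSQL route, assuming the wal2json output plugin is installed and wal_level is set to logical (slot name is illustrative):

-- One-time setup: a slot that decodes WAL changes into JSON
SELECT * FROM pg_create_logical_replication_slot('audit_slot', 'wal2json');

-- A separate service polls this (or streams it with pg_recvlogical)
SELECT data FROM pg_logical_slot_get_changes('audit_slot', NULL, NULL);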
You can add triggers to any tracked table to listen for insert/update/delete. In the triggers, just check the NEW and OLD values and write them to a special table with columns:
table_name
entity_id
modification_time
previous_value
new_value
user
It's hard to figure out which user made the change, but it is possible if you add a changed_by column to the tables you listen on.
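A minimal PostgreSQL sketch of such a trigger (it assumes each tracked table has an id column, and uses to_jsonb to capture whole rows rather than one row per changed field):

CREATE TABLE change_log (
    table_name        text,
    entity_id         bigint,
    modification_time timestamptz DEFAULT now(),
    previous_value    jsonb,
    new_value         jsonb,
    changed_by        text DEFAULT current_user -- the DB login, not the application user
);

CREATE FUNCTION log_change() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        INSERT INTO change_log (table_name, entity_id, new_value)
        VALUES (TG_TABLE_NAME, NEW.id, to_jsonb(NEW));
        RETURN NEW;
    ELSIF TG_OP = 'UPDATE' THEN
        INSERT INTO change_log (table_name, entity_id, previous_value, new_value)
        VALUES (TG_TABLE_NAME, NEW.id, to_jsonb(OLD), to_jsonb(NEW));
        RETURN NEW;
    ELSE -- DELETE: NEW is not assigned here, so use OLD
        INSERT INTO change_log (table_name, entity_id, previous_value)
        VALUES (TG_TABLE_NAME, OLD.id, to_jsonb(OLD));
        RETURN OLD;
    END IF;
END;
$$ LANGUAGE plpgsql;

-- PostgreSQL 11+; older versions use EXECUTE PROCEDURE
CREATE TRIGGER track_events
AFTER INSERT OR UPDATE OR DELETE ON events
FOR EACH ROW EXECUTE FUNCTION log_change();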

Is it sensible to have a table that does not reference any other in a database design?

I'd like to get some advice on database design. Specifically, consider the following (hypothetical) scenario:
Employees - table holding all employee details
Users - table holding employees that have username and password to access software
UserLog - table to track when users log in and log out, and to calculate time spent in the software
In this scenario, if an employee leaves the company I also want to make sure I delete them from the Users table so that they can no longer access the software. I can achieve this using ON DELETE CASCADE as part of the FK relationship between EmployeeID in Employees and Users.
However, I don't want to delete their details from the UserLog as I am interested in collating data on how long people spend on the software and the fact that they no longer work at the company does not mean their user behaviour is no longer relevant.
What I am left with is a table UserLog that has no relationships with any other tables in my database. Is this a sensible idea?
Having looked through books and googled online, I haven't come across any DB schemas with tables that have no relationships with others, so my gut instinct here is saying that my approach is not robust...
I'd appreciate some guidance please.
My personal preference in this case would be to "soft delete" an employee by adding a "DeletedDate" column to the Employees table. This will allow you to maintain referential integrity with your UserLog table and all details for all employees, past and present, remain available in the database.
The downside to this approach is that you need to add application logic to check for active employees.
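A sketch of that soft delete (SQL Server flavored; names are illustrative):

ALTER TABLE Employees ADD DeletedDate datetime NULL; -- NULL = still active

-- 'Delete' an employee without breaking the UserLog references
UPDATE Employees SET DeletedDate = GETDATE() WHERE EmployeeID = 42;

-- The application logic (or a view) filters to active employees
CREATE VIEW ActiveEmployees AS
SELECT * FROM Employees WHERE DeletedDate IS NULL;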
Yes, this is perfectly sensible. The log is just a raw audit of data that should never change. It doesn't need to be normalized (and shouldn't be) or linked to other tables.
Ideally, I would put write-heavy audit logging in a different database entirely than the read-heavy transactional day-to-day stuff. They may grow differently over time. But starting small it's fine to keep them in the same database as long as you understand the fundamental differences between them.
On a side note, I would recommend not deleting the users from the tables. Maybe have some kind of IsActive or IsDeleted bit on them that would effectively hide them from the application, but deleting should be avoided if possible.
The problem you have here is that it's perfectly possible to insert UserLog data for users that have never existed as there's no link to the table that defines valid users.
I would say that perhaps the better course of action would be to mark the users as invalid and remove all their personal details when they leave rather than delete the record entirely.
That's not to say there aren't situations where it is valid to have a table (or tables) on the database that don't reference others.
Is this a sensible idea?
The problem is this: since the data isn't linked, you can delete something from the employee table and still have references to it in the UserLog. After the employee information is deleted, you have no way of knowing what the log data ties back to. Is this OK? Technically, yes. There is nothing preventing you from doing it, but then why are you keeping the data in the first place? You also have no guarantee that the data in the table actually is about an employee. Someone could accidentally enter a wrong EmployeeID that doesn't belong to anyone. Keys help prevent data corruption. It's always better to have extra data than it is to have bad data.
What I've found is that you never want to delete data when possible. Space is cheap, and you can add flags etc. to show that a record isn't active. Yes, this does cause more work (though that can be quickly remedied by creating a view which only shows active employees), and saying that you should never delete data is far-fetched; but once you start linking data together, deleting becomes very difficult. If you find yourself not adding an FK just so you can delete records, it's a telltale sign you need to rethink your strategy.
Relying on cascade delete can be very dangerous too. The model you are describing means that any time you don't want data deleted, you have to remember not to add an FK linking that table back to Users. It doesn't take long for someone to forget this.
What you can do instead is use logical deletion, disabling a user by adding a boolean Deleted or Disabled column to the Users table.
Or replace the EmployeeId with the name of the employee in the UserLog.
An alternative to the soft-delete process is to store all the historical details you would want about the user at the time the log record is created, rather than storing the employee id. So you might have username, logintime, logouttime and sessionlength in your table.
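That alternative might look like this (a sketch, SQL Server flavored; the column names follow the suggestion above):

CREATE TABLE UserLog (
    UserLogId int IDENTITY PRIMARY KEY,
    Username varchar(50) NOT NULL, -- copied in at log time; no FK back to Employees needed
    LoginTime datetime NOT NULL,
    LogoutTime datetime NULL,
    SessionLength AS DATEDIFF(second, LoginTime, LogoutTime) -- computed column
);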
Sensible? Sure, in that it makes sense given your described need to keep those users indefinitely. The problem you'll run into is maintaining the tables: instead of a single cascading operation, you'll have to use at least two updates in order to insert a new user.
I think a table such as you are suggesting is perfectly fine. I frequently encounter log tables that do not have explicit relationships with other tables. Just because a database is "relational" doesn't mean everything has to relate, haha.
One thing that I do notice though is that you are using EmployeeID in the log, but not using it as a foreign key to your Employee table. I understand why you don't want that, since you will be dropping employees. But, if you are dropping them completely, then the EmployeeID column is meaningless.
A solution to this would be to keep a flag for employees, such as active, that tracks if they are active or not. That way, the log data is meaningful.
IANADBA, but it's generally considered very bad practice indeed to delete almost anything from a DB, ever. It would be far better here to have some kind of locked flag / "deleted" datestamp on your Users table and preserve your FK.

Designing an archive in a database. Some patterns, maybe?

We are currently building a web application in which one piece of functionality is the ability for users to create Events. Those events can later be deleted by the user or an administrator. However, the client requires that an event is never really physically deleted from the database, just marked as deleted. Users should only see non-deleted events, but administrators should be able to browse deleted ones as well. That's really all the functionality there is.
Now, I suggested that we simply add one extra column called "status", which would have a couple of valid values: ACTIVE and DELETED. This way we can distinguish between normal (active) and deleted events and write really simple queries (SELECT * FROM EVENTS WHERE STATUS = 'ACTIVE').
My colleague, however, disagreed. He pointed out that, regardless of the fact that right now active and deleted events share the same information (and thus can be stored in the same table), in the future the requirements may change and the client may, for example, need to store additional information about a deleted Event (like the date of deletion, who deleted it and why they did it - a sort of comment). He said that to fulfil those requirements in the future we would have to add extra columns to the EVENTS table that hold data specific to deleted events and not to active events. He proposed a solution where an additional table is created (like DELETED_EVENTS) with the same schema as the EVENTS table. Every deleted event would be physically deleted from the EVENTS table and moved to the DELETED_EVENTS table.
I strongly disagreed with his idea. Not only would it make the SQL queries more complex and less efficient, but it also goes totally against YAGNI. I also disagreed that my idea would force us to create additional (non-nullable) columns in the EVENTS table if the requirements changed in the future. In my scenario I would simply create a new table like DELETED_EVENTS_DATA (to hold the additional archive data) and add a reference key to the EVENTS table to maintain a one-to-one relationship between the EVENTS and DELETED_EVENTS_DATA tables.
Nevertheless, I was struck by the fact that two developers who usually share a similar view on software and database design could have such radically different opinions about how this requirement should be designed at the database level. I thought that maybe we are both going in the wrong direction and there is another (third) solution? Or is there more than one alternative?
How do you design for this sort of requirement? Are there any patterns or guidelines on how to do it properly? Any help will be deeply appreciated.
Don't use a status column.
At minimum you should have datedeleted and deletedby columns. Just knowing that something was removed isn't helpful; even if the client isn't asking for it right now, the very first time they go to look at the deleted events they will want to know who deleted them, in order to discern why.
If the events table is likely to grow pretty large, it is common to move the deleted/archived data into a different table entirely. Usually you will allocate those tables to a different database file, and that file usually lives on a different drive in order to keep performance up. I'm not saying a whole new database, just a different database file.
If you keep everything in the same table, all of your queries should have a where clause on (DateDeleted IS NULL). Obviously you don't have that requirement if the information is moved to a different table, which is why I recommend that way of doing things.
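A sketch of the same-table variant, for concreteness (SQL Server flavored; names are illustrative):

ALTER TABLE Events ADD
    DateDeleted datetime NULL,
    DeletedBy int NULL; -- who removed it

-- Soft delete
UPDATE Events
SET DateDeleted = GETDATE(), DeletedBy = @CurrentUserId
WHERE EventId = @EventId;

-- Every normal query filters the deleted rows out
SELECT * FROM Events WHERE DateDeleted IS NULL;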
I've found that taking snapshots of an object on every event (creation, update, etc.) and storing those snapshots (along with dates and user info) in another table allows you to meet all kinds of historical tracking needs over the lifetime of an application. You can then present the snapshots to the user, present chronological changes to the user, deduce the state of an object on a given date, etc.
I'm sure there are official design patterns out there; this is just one that I've refined over time, and it works well. It's not efficient with disk space, however.
EDIT: Also, when a user deleted an object, I would flag the record as deleted and take a final snapshot for the history table. You could hide the object from the interface indefinitely, or you could choose to show it; it depends on usage needs.
OK, the way we handle it is as follows.
We have an extra column on every table called 'Deleted'; this is a bit field. Then, as you rightly said, your queries are quite simple, as it's just a where clause to either filter the deleted records out or leave them in. The only thing you need to make sure of is that any reporting or stats you generate also filter out the deleted records.
Then, for the extra info you are talking about wanting to capture, that extra info would go in a separate 'audit'-like table. In our case we have made this extra table quite generic, and it can hold audit info for any table... see below how it works...
Event
EventId  EventName  ...  Deleted
1        Dinner          0
2        Supper          1
3        Lunch           0
4        Lunch           1
Audit
AuditId  EntityTypeId  EntityId     ActionTypeId  ActionDateTime     ... etc
1        1 (Event)     2 (EventId)  1 (Deleted)   2/1/2010 12:00:00
2        1 (Event)     4 (EventId)  1 (Deleted)   3/1/2010 12:00:00
Now, if there are other entities you want to capture as well (like Location, where Location is a table), it would look like this...
Audit
AuditId  EntityTypeId  EntityId        ActionTypeId  ActionDateTime     ... etc
1        1 (Event)     2 (EventId)     1 (Deleted)   1/1/2010 12:00:00
2        1 (Event)     4 (EventId)     1 (Deleted)   2/1/2010 12:00:00
3        2 (Location)  2 (LocationId)  1 (Deleted)   3/1/2010 12:00:00
4        2 (Location)  8 (LocationId)  1 (Deleted)   4/1/2010 12:00:00
5        2 (Location)  9 (LocationId)  1 (Deleted)   5/1/2010 12:00:00
Then, when you want to get out the extra audit data you are talking about, it's quite simple. The query would look something like this:
SELECT *
FROM Event E
INNER JOIN Audit A
ON E.EventId = A.EntityId
WHERE E.Deleted = 1
AND A.EntityTypeId = 1 -- Where 1 stands for events
This audit table can also capture other actions, not just deletes, via the ActionTypeId column. At the moment it just has 1 (which is delete), but you could have others as well.
Hope this helps
EDIT:
On top of this, if we have strong audit requirements, we do the following. None of the above changes, but we create a second database called 'xyz_Audit' which captures the pre and post state for every action that happens within the database. This second database has the same schema as the first database (without the Audit table), except that every table has two extra columns.
The first extra column is a PrePostFlag and the second is the AuditId. Hence the primary key now spans three columns: 'xyzId', 'PrePostFlag' and 'AuditId'.
By doing this we can give the admins full power to know who did what and when, what data changed and how it changed; and to undelete a record we just need to flip the deleted flag in the primary database.
Also, having this data in a different database allows us to apply different optimization, storage and management plans than on the main transactional database.
I would add the flag field for now, and only bother planning the rest once you actually know what you will have to do and the system has accumulated real-world data and user experience, so you have something to base your performance/complexity design decisions on.
It's often a judgement call in situations like this. Not knowing any more than you've told me, I would tend to go with your solution, though, which is to just have a virtual delete. I believe your application of YAGNI is good. If the client does in the future give requirements for logging stages in an event's life, it's likely that right now you will not correctly guess exactly what those requirements will be. This is especially true if the logic for dealing with events in the DB is well encapsulated (easy to change later).
However, if you know this client well, and you know of similar types of historical-type requirements they have had, and the functionality won't be well encapsulated, maybe your colleague is making a good guess. The key here is that whichever one of you is correct, it's not by much. Both sides have merit.
By the way, it would be better to have a boolean (yes/no) IsDeleted column, with an index beginning with that column. That will be quicker, though it perhaps would not make a big enough difference to matter.
A lot of this depends on the size of the tables and whether you really need additional information about the deletion.
In most cases, the deleted flag field is all you need. Then you create a view that selects only the records that have not been deleted. Use the view for all user queries instead of accessing the tables directly.
If you have auditing, you already know who marked the record as deleted and when.
If not, you should add those fields to your table.
Periodically, I might move deleted records to an archive table in order to improve query performance on the main table - say, move all records that were deleted more than 6 months ago. Then have another view that combines both the normal table and the archive table for the admins to query.
This combination of both approaches, in conjunction with using views, gets you the best of both worlds: your table stays relatively small for querying, everyone sees just the records they need to see, it is relatively easy to undelete something deleted by accident, and archiving old records can happen at a low-usage period of the day rather than at the moment the records are marked for deletion.
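A sketch of that combination (SQL Server flavored; it assumes Deleted and DateDeleted columns and an EventsArchive table with the same schema as Events):

-- Users query this; the base table stays hidden
CREATE VIEW ActiveEvents AS
SELECT * FROM Events WHERE Deleted = 0;

-- Periodic job: archive rows deleted more than 6 months ago
INSERT INTO EventsArchive
SELECT * FROM Events
WHERE Deleted = 1 AND DateDeleted < DATEADD(month, -6, GETDATE());

DELETE FROM Events
WHERE Deleted = 1 AND DateDeleted < DATEADD(month, -6, GETDATE());

-- Admins see everything
CREATE VIEW AllEvents AS
SELECT * FROM Events
UNION ALL
SELECT * FROM EventsArchive;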
When a user creates, modifies or deletes an event, create a new transaction object. Store everything about the change to the event in the transaction, and add it to a table with a reference to the event. That way you have an audit log of everything the user has done. This adds minimal complexity but also allows for extension. You could even add an undo feature later on with minimal, if any, change to your data model.
So if the user is viewing the logs, you can retrieve every log without a DELETE transaction associated with it, though administrators would be able to see everything.

SQL - Table Design - DateCreated and DateUpdated columns

For my application there are several entity classes: User, Customer, Post, and so on.
I'm about to design the database, and I want to store the dates when the entities were created and updated. This is where it gets tricky. Sure, one option is to add created_timestamp and updated_timestamp columns to each of the entity tables, but isn't that redundant?
Another possibility would be to create a log table that stores this information, and it could be made to keep track of updates for any entity.
Any thoughts? I'm leaning toward implementing the latter.
The single-log-table-for-all-tables approach has two main problems that I can think of:
The design of the log table will (probably) constrain the design of all the other tables. Most likely the log table would have one column named TableName and then another column named PKValue (which would store the primary key value for the record you're logging). If some of your tables have compound primary keys (i.e. more than one column), then the design of your log table would have to account for this (probably by having columns like PKValue1, PKValue2 etc.).
If this is a web application of some sort, then the user identity that would be available from a trigger would be the application's account, instead of the ID of the web app user (which is most likely what you really want to store in your CreatedBy field). This would only help you distinguish between records created by your web app code and records created otherwise.
CreatedDate and ModifiedDate columns aren't redundant just because they're defined in each table. I would stick with that approach and put insert and update triggers on each table to populate those columns. If I also needed to record the end-user who made the change, I would skip the triggers and populate the timestamp and user fields from my application code.
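A sketch of that per-table approach (SQL Server flavored; the Post table and PostId column are illustrative):

ALTER TABLE Post ADD
    CreatedDate datetime NOT NULL DEFAULT GETDATE(), -- the default covers the insert side
    ModifiedDate datetime NULL;

CREATE TRIGGER trg_Post_Modified ON Post
AFTER UPDATE
AS
BEGIN
    UPDATE p
    SET ModifiedDate = GETDATE()
    FROM Post p
    JOIN inserted i ON i.PostId = p.PostId;
END;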
I do the latter, with a "log" or "events" table. In my experience, the "updated" timestamp becomes frustrating pretty quickly, because a lot of the time you find yourself in a fix where you want more than just the very latest update time.
How often will you need to include the created/updated timestamps in your presentation layer? If the answer is anything more than "once in a great great while", I think you would be better served by having those columns in each table.
On a project I worked on a couple of years ago, we implemented triggers which updated what we called an audit table (it stored basic information about the changes being made, one audit table per table). This included modified date (and last modified).
They were only applied to key tables (not joins or reference data tables).
This removed a lot of the normal frustration of having to account for LastCreated & LastModified fields, but introduced the annoyance of keeping the triggers up to date.
In the end the trigger/audit table design worked well and all we had to remember was to remove and reapply the triggers before ETL(!).
It's for a web-based CMS I work on. The creation and last-updated dates will be displayed on most pages, and there will be lists of the most recently created (and updated) pages. The admin interface will also use this information.