Sql Database structure for housing historical data and display changes - sql

Good morning,
This is more of a concept question than anything.
I am looking to design a database and interface that will track changes to the entries (in this case people) and display those changes readily.
(user experience would look something like this)
for user A
Date Category Activity
8/8/14 change position position 1 -> position 2
8/9/14 change department department a -> department b
...
...
The visual experience seems like it would benefit from an E-A-V design; however, I am designing the database to be easy to data mine, and from my reading I think that E-A-V is not the right way to go.
Does it make sense to duplicate data just to display it?
If not, does anyone have a suggestion for how to query the history table and display it? (I'm currently using jQuery and PHP to leverage the DB... I suppose I could do something interesting from a coding perspective to get it done.)
thank you for your help,
Travis

Creating an efficient operational database environment and creating an 'easy-to-data-mine' environment are two separate (and often opposing) goals.
Others might disagree with me but in my opinion it is best to create your database based on operational readiness (This means using your E-A-V design as mentioned above) and then worry about data transformation later. This may make it inconvenient later to transform the data to allow for easy mining but it will accomplish an incredibly important goal which is to eliminate the possibility for data error.
Once you have a good system in place where you can collect data appropriately, then you can create a warehouse or datamart environment to more conveniently extract that data.
This may sound like a lot of work but from a data integrity perspective, it is much safer than trying to create some system that is designed entirely for reporting. That's my personal opinion at least.
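To make the display from the question concrete, here is a minimal sketch, not a definitive design. It assumes a current-state `person` table plus an append-only `person_history` change log; both tables and all column names are hypothetical. It uses Python with SQLite purely for illustration (the asker is on PHP), but the query itself is plain SQL.

```python
import sqlite3


def build_history_demo():
    """Create an in-memory database with a person table and a change log."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE person (
            person_id  INTEGER PRIMARY KEY,
            name       TEXT NOT NULL,
            position   TEXT NOT NULL,
            department TEXT NOT NULL
        );
        -- Hypothetical change log: one row per changed attribute.
        CREATE TABLE person_history (
            history_id INTEGER PRIMARY KEY,
            person_id  INTEGER NOT NULL REFERENCES person(person_id),
            changed_on TEXT NOT NULL,   -- ISO date, so text ordering is date ordering
            category   TEXT NOT NULL,
            old_value  TEXT,
            new_value  TEXT
        );
        INSERT INTO person VALUES (1, 'user A', 'position 2', 'department b');
        INSERT INTO person_history
            (person_id, changed_on, category, old_value, new_value)
        VALUES
            (1, '2014-08-08', 'change position', 'position 1', 'position 2'),
            (1, '2014-08-09', 'change department', 'department a', 'department b');
    """)
    return conn


def history_for(conn, person_id):
    """Return (date, category, activity) rows for one person, oldest first."""
    cursor = conn.execute(
        """
        SELECT changed_on, category, old_value || ' -> ' || new_value
        FROM person_history
        WHERE person_id = ?
        ORDER BY changed_on, history_id
        """,
        (person_id,),
    )
    return cursor.fetchall()


conn = build_history_demo()
for changed_on, category, activity in history_for(conn, 1):
    print(changed_on, category, activity)
```

The `person` table stays easy to query for current values, while the log only grows; a warehouse or datamart could later be fed from either.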

(sorry cannot comment yet)
You have to analyse the data you need to persist.
If you have only a couple of tables, with no relationships, you probably don't need a database.
In that case the database solution will probably be slower (connection/transmission/security overhead...).
If it's only a few MBs of data, I would keep everything in one table.
You can easily load the whole data set in memory and do what you need to do.

Related

Change detection in complex system

This might seem like a fairly specific question but I'm wondering if there is any technology/pattern out there that might help me in a current project. I have a hugely complex database which is updated by multiple systems. I now need to do change tracking on various bits of data that is spread across multiple tables so that I can send it to a third party system.
I've considered a number of options but unfortunately I can't seem to come to any other conclusion than using database triggers. I'm thinking of storing a flag in a table (or queue) to identify the rows that have changed and then building an xml diff containing the changed data to send to a web service. This feels a little dirty so I was wondering if anyone could think of a better alternative.
Depending on the database platform you're using, you might look into Change Data Capture. Since you mention .NET, here's some info about it: http://technet.microsoft.com/en-us/library/bb522489(v=sql.105).aspx
Other database systems may offer something similar.
Another option would be insert/update/delete triggers on the tables, however triggers should be approached carefully as they can cause some significant performance problems if not done right.
And yet another option still would be what you describe - some sort of flag to monitor for changes. Simple CREATED and MODIFIED timestamp fields can go a long way here: rather than just a bit indicator suggesting that the row may need attention, you'll know when the update happened, and your export process can be programmed accordingly (e.g., select * from table where modified > getdate()-1).
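A rough sketch of the last two options side by side, using Python and SQLite only because it runs anywhere (SQL Server syntax differs). The `customer` and `change_queue` tables and the trigger name are made up for illustration, and timestamps are passed in explicitly so the example is repeatable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL,
        modified    TEXT NOT NULL          -- ISO timestamp of the last change
    );
    -- Hypothetical queue of changed rows for the export job to drain.
    CREATE TABLE change_queue (
        queue_id   INTEGER PRIMARY KEY,
        table_name TEXT NOT NULL,
        row_id     INTEGER NOT NULL,
        changed_at TEXT NOT NULL
    );
    -- Trigger option: every update leaves a marker in the queue.
    CREATE TRIGGER customer_after_update AFTER UPDATE ON customer
    BEGIN
        INSERT INTO change_queue (table_name, row_id, changed_at)
        VALUES ('customer', NEW.customer_id, NEW.modified);
    END;
""")

conn.executemany(
    "INSERT INTO customer (customer_id, name, modified) VALUES (?, ?, ?)",
    [(1, "Acme", "2024-01-01 09:00:00"), (2, "Globex", "2024-01-01 09:00:00")],
)
conn.execute(
    "UPDATE customer SET name = ?, modified = ? WHERE customer_id = ?",
    ("Acme Ltd", "2024-01-03 12:00:00", 1),
)


def rows_modified_since(conn, cutoff):
    """Timestamp option: pick up everything touched after the cutoff."""
    return conn.execute(
        "SELECT customer_id, name FROM customer WHERE modified > ? "
        "ORDER BY customer_id",
        (cutoff,),
    ).fetchall()


changed = rows_modified_since(conn, "2024-01-02 00:00:00")
queued = conn.execute(
    "SELECT table_name, row_id, changed_at FROM change_queue"
).fetchall()
print(changed)
print(queued)
```

The timestamp query needs no trigger at all, but it cannot see deletes; the queue can, if a delete trigger is added.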

Database Modification or start over?

So I'm currently working on rebuilding an existing website that is used internally at my company for project management, at heart it is a bug tracking utility that has some customer support and accounting operations linked into it.
Currently the database model is very repetitive, a good example of this is, currently a UserId is linked into a record (FK relationship into a user table that contains all the information about the user) and then all the information about the user also exists in the table.
I've been tasked with improving the website and the functionality of the model; however, I want to reduce the repetition of data in the website (is this normalization or is that the breaking apart of unlinked items into separate tables?). I'm not sure what the best method of doing this would be. I'm thinking of generating the creation scripts for the database and creating a new database project in VS to then modify the database, then generating some scripts to populate the new database model from the old database.
I plan on using the Entity Framework and ASP.NET MVC 2 to build the website as I think it provides the most flexible model moving forward for the modification and maintenance of the website.
The reason I ask all of this is because I'm very familiar with using databases and modifying existing ones to be used in applications and websites but I'm trying to discover the best way to build one.
I'm curious if there is any material on the best way to do this or if I should be using a different tool to do this with?
Edit: Providing more information on the model
There are 4 major areas that we have that are used:
1. Cases (Bugs, Features, Working Tasks, Etc)
2. Tickets (Tech Support Events)
3. Errors (Errors Generated from our logging Library, Basically a stack trace with customer information)
4. License (Keeps track of each customer's License, allows modification to those licenses)
These are the Objects that are intermixed and used throughout the above 4 major areas.
Users (People who use the system)
Customers (People who use our software)
Stores (Places where our customers use our software)
Products (Our Software)
Relationships
Cases:
A Case has to have a User, can have a Customer, Store, Error, Ticket and/or Product
Tickets:
A Ticket has to have a User and a Customer, can have a Store, Error and/or Product
Errors:
An Error has to have a Product, can have a Case, Ticket, Store, and/or Product
Licenses:
A License has to have a Product and Customer, can have a Store
Like I said very basic website, with a not super complex database, if done correctly.
Currently the database has no FK constraints, replication of lots of information across each table and lots of extra tables that are duplicates with different names.
E.g.
Each Case type has a separate table so there is a FeatureRequest, Bug, Tasks, Completed, etc table that all contain the same information.
Normalization is about storing data without redundancy or anomalies.
One example of an anomaly could be when attributes about a user in your main table are not in sync with the users table. Someone changes information about that user in one table without reflecting the changes in the redundant copy. The problem is that it's hard to know which change is the correct one.
Some people think that normalization is just about breaking apart tables into littler tables, because that's what they see as the most common type of change. But that's not the goal of normalization. It's just by coincidence that most mistakes of non-normalization involve stuffing too much data into one table where multiple tables would be correct.
It's hard to answer your question about whether to modify your database in-place or whether to create a whole new database and migrate to it.
What I would do in your case is to design a properly normalized database, and then examine the differences between that and your existing database. Imagine what you would have to do for each difference, to change your old database to the new one, versus a data migration. It could be that only a few changes are needed, only dropping the redundant columns. Or it could be that some major rework is needed. It's impossible to tell until you do the work of creating a normalized data model so you can compare.
The bigger task might be to adapt your application code that uses the database. One way to ease this transition is to create database views on top of the normalized database, which mimic your old non-normalized database. That way hopefully you don't have to rewrite every bit of code in your app all at once, you can keep some of it the same at least until you can refactor the code.
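As a sketch of the view idea, assuming a much-simplified version of the asker's model (the `users`, `cases` and `legacy_cases` names and columns are hypothetical): user attributes live only in `users`, and a view rebuilds the old wide shape for code that has not been refactored yet. Python with SQLite is used here only to keep the example runnable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized tables: user details are stored exactly once.
    CREATE TABLE users (
        user_id   INTEGER PRIMARY KEY,
        user_name TEXT NOT NULL,
        email     TEXT NOT NULL
    );
    CREATE TABLE cases (
        case_id INTEGER PRIMARY KEY,
        title   TEXT NOT NULL,
        user_id INTEGER NOT NULL REFERENCES users(user_id)
    );
    -- View with the shape of the old table, user columns repeated per case.
    CREATE VIEW legacy_cases AS
        SELECT c.case_id, c.title, c.user_id, u.user_name, u.email
        FROM cases AS c
        JOIN users AS u ON u.user_id = c.user_id;

    INSERT INTO users VALUES (1, 'alice', 'alice@old.example');
    INSERT INTO cases VALUES (100, 'Crash on save', 1), (101, 'Add export', 1);
""")

# One update in one place; every row of the legacy view reflects it.
conn.execute(
    "UPDATE users SET email = ? WHERE user_id = ?", ("alice@new.example", 1)
)

legacy_rows = conn.execute(
    "SELECT case_id, title, user_name, email FROM legacy_cases ORDER BY case_id"
).fetchall()
print(legacy_rows)
```

Old read paths can select from `legacy_cases` unchanged, and the update anomaly described earlier cannot occur because there is no second copy to drift.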
Also having a good set of regression tests in place is ideal, so you can be sure your app still does all the tasks it is supposed to do, as you refactor the database and the code that uses the database.
Re your comment: You mention that you're adding new functionality to the user model at the same time. I would find it too confusing to try to do this simultaneously with refactoring. Refactoring typically does not change functionality, it only changes implementation. But refactoring adds value because it makes the code easier to maintain or debug, improves efficiency, or prepares you to make future functionality changes more easily.
I would recommend that you bite the bullet and add your new user model features to the old non-normalized database. It's good to get the benefit of new features in the short term, and also you need to develop those features first to understand them well enough to account for them in your big refactoring project.
Here are some suggestions for resources to help you truly understand what normalization means:
SQL and Relational Theory by C. J. Date
A Simple Guide to Five Normal Forms in Relational Database Theory by William Kent
Database Normalization at Wikipedia and its sub-pages for each respective normal form
SQL Antipatterns Volume 1: Avoiding the Pitfalls of Database Programming by me, Bill Karwin. I wrote a chapter about database normalization that I hope explains it in plain English and with good examples.
Here are a couple of resources for managing changes to a database:
Refactoring Databases by Scott W. Ambler and Pramodkumar J. Sadalage
Agile Database Techniques: Effective Strategies for the Agile Software Developer by Scott W. Ambler
How long do you have, and how big is the database?
It's very difficult to answer this question black and white without being immersed in your environment and business case. It really doesn't seem like your limitation is technology wise, just to choose between solutions.
Re-creating is what programmers instinctively go for. However, in the "real world", sometimes we put a lot of effort into something that isn't used that much or won't last that long.
So food for thought. How long will it take you to re-do the database, how much will it cost. Will working with what's existent be sufficient for the functionality asked?

Upgrade strategies for bad DB schema designs

I've shown up at a new job and discovered a database which is in dire need of some help. There are many, many things wrong with it, including:
No foreign keys...anywhere. They're faked by using ints and managing the relationship in code.
Practically every field can be NULL, which isn't really true
Naming conventions for tables and columns are practically non-existent
Varchars which are storing concatenated strings of relational information
Folks can argue, "It works", which it does. But moving forward, it's a total pain to manage all of this with code and opens us up to bugs IMO. Basically, the DB is being used as a flat file since it's not doing a whole lot of work.
I want to fix this. The issues I see now are:
We have a lot of data (migration, possibly tricky)
All of the DB logic is in code (with migration comes big code changes)
I'm also tempted to do something "radical" like moving to a schema-free DB.
What are some good strategies when faced with an existing DB built upon a poorly designed schema?
Enforce Foreign Keys: If a relationship exists in the domain, then it should have a Foreign Key.
Renaming existing tables/columns is fraught with danger, especially if there are many systems accessing the Database directly. Gotchas include tasks that run only periodically; these are often missed.
Of Interest: Scott Ambler's article: Introduction To Database Refactoring
and Catalog of Database Refactorings
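For the foreign-key point, a small sketch of what the database does for you once the constraint exists. The `customer` and `orders` tables are hypothetical, and SQLite is used only because it runs without setup; note that SQLite needs `PRAGMA foreign_keys = ON` per connection, which most other engines do not.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite only enforces foreign keys when this pragma is switched on.
conn.execute("PRAGMA foreign_keys = ON")
conn.executescript("""
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE orders (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id)
    );
    INSERT INTO customer VALUES (1, 'Acme');
""")

# A valid reference is accepted.
conn.execute("INSERT INTO orders (order_id, customer_id) VALUES (1, 1)")

# A reference to a customer that does not exist is refused by the database,
# instead of relying on application code to check it.
try:
    conn.execute("INSERT INTO orders (order_id, customer_id) VALUES (2, 42)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(rejected, order_count)
```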
Views are commonly used to transition between changing data models because of the encapsulation. A view looks like a table, but does not exist as a finite object in the database - you can change what column is being returned for a given column alias as desired. This allows you to setup your codebase to use a view, so you can move from the old table structure to the new one without the application needing to be updated. But it means the view has to return the data in the existing format. For example - your current data model has:
SELECT t.column -- a list of concatenated strings, assuming comma separated
FROM TABLE t
...so the first version of the view would be the query above, but once you created the new table that uses 3NF, the query for the view would use:
SELECT GROUP_CONCAT(t.column SEPARATOR ',')
FROM NEW_TABLE t
...and the application code would never know that anything changed.
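A runnable sketch of that second view, with assumptions stated: the `item_tag` table and `item_legacy` view are invented names, and it uses SQLite, whose spelling is `group_concat(x, ',')` rather than MySQL's `GROUP_CONCAT(x SEPARATOR ',')`. SQLite does not promise the order of the concatenated values here, so the check below compares them as a set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- New 3NF table: one row per (item, tag) instead of one
    -- comma-separated string per item.
    CREATE TABLE item_tag (
        item_id INTEGER NOT NULL,
        tag     TEXT    NOT NULL,
        PRIMARY KEY (item_id, tag)
    );
    INSERT INTO item_tag VALUES (1, 'red'), (1, 'large'), (2, 'blue');

    -- View that hands the application the old concatenated format.
    CREATE VIEW item_legacy AS
        SELECT item_id, group_concat(tag, ',') AS tags
        FROM item_tag
        GROUP BY item_id;
""")


def legacy_tags(conn, item_id):
    """Read the tags the way the old code did: one string per item."""
    row = conn.execute(
        "SELECT tags FROM item_legacy WHERE item_id = ?", (item_id,)
    ).fetchone()
    return row[0] if row else None


print(legacy_tags(conn, 1))
print(legacy_tags(conn, 2))
```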
The problem with MySQL is that its view support is limited - a view definition can't refer to user variables, nor (before MySQL 5.7.7) can it contain a subquery in its FROM clause.
The reality to the changes you wish to make is effectively rewriting the application from the ground up. Moving logic from the codebase into the data model will drastically change how the application gets the data. Model-View-Controller (MVC) is ideal to implement with changes like these, to minimize the cost of future changes like these.
I'd say leave it alone until you really understand it. Then make sure you don't start with one of the Things You Should Never Do.
Read Scott Ambler's book on Refactoring Databases. It covers a good many techniques for how to go about improving a database - including the transitional measures needed to allow both old and new programs to work with the changing design.
Create a completely new schema and make sure that it is fully normalized and contains any unique, check and not null constraints etc that are required and that appropriate data types are used.
Prepopulate each table that fills the parent role in a foreign key relationship with a single 'Unknown' record.
Create an ETL (Extract Transform Load) process (I can recommend SSIS (SQL Server Integration Services) but there are plenty of others) that you can use to refill the new schema from the existing one on a regular basis. Use the 'Unknown' record as the parent of any orphaned records - there will be plenty ;). You will need to put some thought into how you will consolidate duplicate records - this will probably need to be on a case by case basis.
Use as many iterations as are necessary to refine your new schema (ensure that the ETL Process is maintained and run regularly).
Create views over the new schema that match the existing schema as closely as possible.
Incrementally modify any clients to use the new schema making temporary use of the views where necessary. You should be able to gradually turn off parts of the ETL process and eventually disable it completely.
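The 'Unknown' parent and the repeatable ETL step above might look roughly like this. It is a sketch under assumptions: the `old_*` tables stand in for the existing schema, every name is hypothetical, and plain SQL through Python's sqlite3 stands in for a real ETL tool such as SSIS.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when asked
conn.executescript("""
    -- Existing schema: no constraints, so customer_id can point at nothing.
    CREATE TABLE old_customer (id INTEGER, name TEXT);
    CREATE TABLE old_ticket (id INTEGER, customer_id INTEGER, subject TEXT);
    INSERT INTO old_customer VALUES (1, 'Acme');
    INSERT INTO old_ticket VALUES
        (10, 1,    'Login fails'),
        (11, 99,   'Orphaned ticket'),
        (12, NULL, 'No customer at all');

    -- New schema: constrained, with the 'Unknown' parent prepopulated.
    CREATE TABLE customer (
        customer_id INTEGER PRIMARY KEY,
        name        TEXT NOT NULL
    );
    CREATE TABLE ticket (
        ticket_id   INTEGER PRIMARY KEY,
        customer_id INTEGER NOT NULL REFERENCES customer(customer_id),
        subject     TEXT NOT NULL
    );
    INSERT INTO customer VALUES (0, 'Unknown');
""")


def run_etl(conn):
    """Refill the new schema from the old one; safe to run repeatedly."""
    with conn:  # one transaction: commit on success, roll back on error
        conn.execute("DELETE FROM ticket")
        conn.execute("DELETE FROM customer WHERE customer_id <> 0")
        conn.execute(
            "INSERT INTO customer (customer_id, name) "
            "SELECT id, name FROM old_customer"
        )
        # Orphaned or missing parents fall back to the 'Unknown' record (0).
        conn.execute("""
            INSERT INTO ticket (ticket_id, customer_id, subject)
            SELECT t.id, COALESCE(c.id, 0), t.subject
            FROM old_ticket AS t
            LEFT JOIN old_customer AS c ON c.id = t.customer_id
        """)


run_etl(conn)
run_etl(conn)  # a second run gives the same result
tickets = conn.execute(
    "SELECT ticket_id, customer_id FROM ticket ORDER BY ticket_id"
).fetchall()
print(tickets)
```

Consolidating duplicate records is deliberately left out; as noted above, that usually has to be decided case by case.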
First see how bad the code that touches the DB is. If it is all mixed in with no DAO layer, you shouldn't think about a rewrite; but if there is a DAO layer, then it would be time to rewrite that layer and the DB along with it. If possible, base the migration tool on the two DAOs.
But my guess is there is no DAO, so you need to find which areas of the code you are going to be changing and which parts of the DB they relate to. Hopefully you can cut it up into smaller parts that can be updated as you maintain. The biggest deal is to get FKs in there and start checking for proper indexes; there is a good chance they aren't being done correctly.
I wouldn't worry too much about naming until the rest of the DB is under control. As for the NULLs: if the program chokes on a value being NULL, don't let it be NULL; if the program can handle it, I wouldn't worry about it at this point. In the future, if the code is supplying a default value, move that default into the DB, but that is way down the line from the sound of things.
Do something about the Varchars sooner rather than later. If anything, make that the first pure background fix to the program.
The other thing to do is estimate the effort of each area's change and then add that price to the cost of new development on that section of code. That way you can fix the parts as you add new features.

SQL and Flat Files... In harmony?

I was just thinking, how quick it would be to store the actual data of an application in a flat file.
Now, you can't just go storing everything in a flat file... sometimes sorts and searches are required, and to go through directories and files recursively could be a pain.
Now, imagine, you stored all your search-able data in a database, and had a pointer field, that pointed to a data file?
This would be very specific per app, however- so long as all my search-able data is stored in the database, why should I store the actual data in a database?
(Locking, Data integrity aside) it would be faster, I am sure... but how much, and is it worth doing it?
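For reference, the layout being described could be sketched like this: searchable fields in a table, the payload in a file, and a path column tying them together. The `document` table, its columns and the file naming are all invented for illustration.

```python
import os
import sqlite3
import tempfile


def store_document(conn, data_dir, title, body):
    """Write the body to a flat file and index it in the database."""
    cur = conn.execute(
        "INSERT INTO document (title, path) VALUES (?, '')", (title,)
    )
    doc_id = cur.lastrowid
    path = os.path.join(data_dir, f"{doc_id}.txt")
    # The file write and the row update are two separate steps that are
    # not atomic together: a crash in between leaves them out of sync.
    with open(path, "w", encoding="utf-8") as fh:
        fh.write(body)
    conn.execute("UPDATE document SET path = ? WHERE doc_id = ?", (path, doc_id))
    conn.commit()
    return doc_id


def load_by_title(conn, title):
    """Search in the database, then follow the pointer to the file."""
    row = conn.execute(
        "SELECT path FROM document WHERE title = ?", (title,)
    ).fetchone()
    if row is None:
        return None
    with open(row[0], encoding="utf-8") as fh:
        return fh.read()


with tempfile.TemporaryDirectory() as data_dir:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE document ("
        "doc_id INTEGER PRIMARY KEY, title TEXT NOT NULL, path TEXT NOT NULL)"
    )
    store_document(conn, data_dir, "release notes", "Fixed the login bug.")
    loaded = load_by_title(conn, "release notes")
    missing = load_by_title(conn, "no such title")
    conn.close()

print(loaded)
```

Whether this is faster than keeping the body in the table depends on the data and access pattern; the sketch makes no claim either way.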
Well, you often want to do things in queries beyond searching on the data. For instance, you might not search on a field called cost_center, but you might have a CASE statement that processes things differently depending on the information in the field. Or you might need to concatenate information together. You might update one field based on the information in another field. You might not search on a field today and need to search on it tomorrow.
A properly designed relational database can easily perform well with terabytes of data.
And frankly you should never even consider "data integrity aside". If you don't have data integrity you don't have data.
As to whether what you want is a good idea, it depends on the type of data you are storing and the types of things you intend to do with it. There isn't enough information to say for sure.
Well "Locking, Data integrity aside" should mean a faster system. If you drop constraints you should improve performance.
But in practical terms, I don't think it's going to be faster. There's a lot of development time behind RDBMSs and that's why they are quick. Sure, non-relational databases perform better than them in highly parallel situations and in scenarios which take advantage of their qualities, for instance. However, your idea does not offer an improvement such as exploiting parallelism... any performance advantage would come from dropping the qualities of RDBMSs...
As well as other answers...
Sharing of data: how are multiple clients going to access data on a share?
Backup/Restore: synching of text and "searchable"
Security/permissions on text data
Change anomalies
There is no need to implement a SQL database just to perform searches. Lots of applications store their data in XML, and you can search in many ways, e.g., using Lucene. How fast it is entirely depends on the quantity of data and how you structure it - just like a database.
It can perform very fast, but can complicate things when you want to run more than one app server.
Btrieve was essentially what you describe. Back in the DOS days it was a very fast database.

MySQL design question - which is better, long tables or multiple databases?

So I have an interesting problem that's been the fruit of lots of good discussion in my group at work.
We have some scientific software producing SQLite files, and this software is basically a black box. We don't control its table designs, formats, etc. It's entirely conceivable that this black box's output could change, and our design needs to be able to handle that.
The SQLite files are entire databases which our users would like to query across. There are two ways (that we see) of implementing this: one, to create a single database and a backend in Python that appends tables from each database to the master database; and two, to query across separate databases' tables and unify the results in Python.
Both methods run into trouble when the black box alters its table structures, say for example by renaming a column, splitting up a table, etc. We have to take this into account, and we've discussed translation tables that translate queries of columns from one table format to another.
We're interested in ease of implementation, how well the design handles a change in database/table layout, and speed. Also, a last dimension is how well it would work with existing Python web frameworks (Django doesn't support cross-database queries, and neither does SQLAlchemy, so we know we are in for a lot of programming.)
If you find yourself querying across databases, you should look into consolidating. Cross-database queries are evil.
If your queries are essentially relegated to individual databases, then you may want to stick with multiple databases, as clearly their separation is necessary.
You cannot accommodate arbitrary changes in a database's schema without categorizing and anticipating that change in some way. In the very best case with nontrivial changes, you can sometimes simply ignore new data or tables, in the worst case, your interpretation of the data will entirely break down.
I've encountered similar issues where users need data pivoted out of a normalized schema. The schema does NOT change. However, their required output format requires a fixed number of hierarchical levels. Thus, although the database design accommodates all the changes they want to make, their chosen view of that data cannot be maintained in the face of their changes. Thus it is impossible to maintain the output schema in the face of data change (not even schema change). This is not to say that it's not a valid output or input schema, but that there are limits beyond which their chosen schema cannot be used. At this point, they have to revise the output contract, the pivoting program (which CAN anticipate this and generate new columns) can then have a place to put the data in the output schema.
My point being: the semantics and interpretation of new columns and new tables (or removal of columns and tables which existing logic may depend on) is nontrivial unless new columns or tables can be anticipated in some way. However, in these cases, there are usually good database designs which eliminate those strategies in the first place:
For instance, a particular database schema can contain any number of tables, all with the same structure (although there is no theoretical reason they could not be consolidated into a single table). A particular kind of table could have a set of columns all similarly named (although this "array" violates normalization principles and could be normalized into a commonkey/code/value schema).
Even in a data warehouse ETL situation, a new column is going to have to be determined whether it is a fact or a dimensional attribute, and then if it is a dimensional attribute, which dimension table it is best assigned to. This could somewhat be automated for facts (obvious candidates would be scalars like decimal/numeric) by inspecting the metadata for unmapped columns, altering the DW table (yikes) and then loading appropriately. But for dimensions, I would be very leery of automating something like this.
So, in summary, I would say that schema changes in a good normalized database design are the least likely to be able to be accommodated because: 1) the database design already anticipates and accommodates a good deal of change and flexibility and 2) schema changes to such a database design are unlikely to be able to be anticipated very easily. Conversely, schema changes in a poorly normalized database design are actually easier to anticipate as shortcomings in the database design are more visible.
So, my question to you is: How well-designed is the database you are working from?
You say that you know that you are in for a lot of programming...
I'm not sure about that. I would go for a quick and dirty solution not a 'generic' solution because generic solutions like the entity attribute value model often have a bad performance. Don't do client side joining (unifying the results) inside your Python code because that is very slow. Use SQL for joining, it is designed for that purpose. Users can also make their own reports with all kind of reporting tools that generate sql statements. You don't have to do everything in your app, just start with solving 80% of the problems, not 100%.
If something breaks because something inside the black box changes you can define views for backward compatibility that keeps your app functioning.
Maybe the scientific software will add a lot of new features, and maybe it will change its data model because of those new features? That is possible, but then you will have to change your application anyway to take advantage of those new features.
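A quick sketch of doing the unification in SQL instead of in Python code, using SQLite's `ATTACH DATABASE`. Everything about the files is assumed: the `measurement` table, its columns, and the idea that one run calls a column `temp` while a later one calls it `temperature`. The alias in the first SELECT is where a rename gets absorbed.

```python
import os
import sqlite3
import tempfile


def make_run(path, column_name, rows):
    """Stand-in for the black box: writes one SQLite file per run."""
    conn = sqlite3.connect(path)
    conn.execute(f"CREATE TABLE measurement (sample TEXT, {column_name} REAL)")
    conn.executemany("INSERT INTO measurement VALUES (?, ?)", rows)
    conn.commit()
    conn.close()


def query_across(path_a, path_b):
    """Let SQLite combine both files; Python only receives the result."""
    conn = sqlite3.connect(path_a)
    conn.execute("ATTACH DATABASE ? AS run_b", (path_b,))
    rows = conn.execute("""
        SELECT 'run_a' AS source, sample, temp AS temperature
        FROM main.measurement
        UNION ALL
        SELECT 'run_b', sample, temperature
        FROM run_b.measurement
        ORDER BY source, sample
    """).fetchall()
    conn.close()
    return rows


with tempfile.TemporaryDirectory() as tmp:
    path_a = os.path.join(tmp, "run_a.sqlite")
    path_b = os.path.join(tmp, "run_b.sqlite")
    make_run(path_a, "temp", [("s1", 20.5)])                      # older layout
    make_run(path_b, "temperature", [("s1", 21.0), ("s2", 19.5)])  # renamed column
    combined = query_across(path_a, path_b)

print(combined)
```

SQLite limits how many databases can be attached to one connection at a time, so with many files the append-to-master approach from the question may still be needed.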
It sounds to me as if your problem isn't really about MySQL or SQLite. It's about the sharing of data, and the contract that needs to exist between the supplier of data and the user of the same data.
To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built, and database theory was first being solidified, in the 1960s and 1970s, the sharing of data was the central purpose in building databases. Today, databases are frequently used where files would have served equally well. Your situation may be a case in point.
In your situation, you have a beggar's contract with your data suppliers. They can change the format of the data, and maybe even the semantics, and all you can do is suck it up and deal with it. This situation is by no means uncommon.
I don't know the specifics of your situation, so what follows could be way off target.
If it was up to me, I would want to build a database that was as generic, as flexible, and as stable as possible, without losing the essential features of structured and managed data. Maybe, some design like star schema would make sense, but I might adopt a very different design if I were actually in your shoes.
This leaves the problem of extracting the data from the databases you are given, transforming the data into the stable format the central database supports, and loading it into the central database. You are right in guessing that this involves a lot of programming. This process, known as "ETL" in data warehousing texts, is not the simplest of programming challenges.
At least ETL collects all the hard problems in one place. Once you have the data loaded into a database that's built for your needs, and not for the needs of your suppliers, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as simple as a video game. There are challenges at that level, but they aren't the same kind of challenges I'm talking about here.
Read up on data warehousing, and especially data marts. The description may seem daunting to you at first, but it can be scaled down to meet your needs.