Store files and comments for a web application. Table design - sql

I have a web appliaction with several entities (tables).Each one has his CRUD pages.
I'd like to add for some the, the ability to add comments and attach files.
I was thinking of two scenarios.
One table for all comments/files - table would have some id for the entity and the particular record.
For each entity a separate comments/files table.
The files would be stored on the disk in a directory.In the table would be the name of the file and some additional info.

In term of application Design having one unique table for all coments seems to make sense. In term of application code that mean the same SQL will be reused for all entities. It's the 'classical way' used by most applications, extending on having the same acitive records and controllers used to handle comments and attachments for all objects.
In term of SQL thesecond solution could be usefull in some databases like MySQL to get more Memory Cache benefit. Every comment/attachmlent added in the 1st solution would drop from the memory cache all requests impacting the comment table. With individual tables a comment on one entity would not invalidate queries on other entities. But you would alos require more file descriptors and a bigger table cache.... so to choose this solution you would need a decision based on real-life, precise, case, where you would be able to compare the benefits in database access speed. And when you will add new entities you'll certainly find your each-entity-have-a-comment-table solution boring, things could have been automated by using 1st solution.

It's a tradeoff. With a single comments table, you get a simple, DRY (don't repeat yourself) schema, but you don't get foreign key constraints and thus no cascade deletion. Thus, if you delete an entity with comments, you must also remember to delete the comments!
If you go with multiple comment tables, you get FK constraints and cascade deletion, but you have a "wet" schema (you are repeating yourself). For example, each comment table might have a commentbody column. If you change that column definition, you have to change it in every comment table!
One interesting solution for a DRY-er schema could involve table inheritance (see http://www.postgresql.org/docs/9.0/interactive/ddl-inherit.html) but please read section 5.8.1. Caveats, as there are some "gotchas" regarding indexing, at least in postgres.
Either way, kudos to you for thinking carefully about your database design!

Related

Is there a database design pattern name for reducing duplicate join table data?

I have two tables with a join table to allow a many-to-many relationship.
It's a very familiar design pattern. It indicates which Branches each Member has access to.
As the number of members and branches increases I end up with a lot of data in the join table that is duplicated across members. Members tend to have access to the same groups of Branches as other Members.
So I'm looking at normalizing my data by creating a MemberProfile table that is effectively immutable. And rather than creating MemberBranch records for every Member I check for a matching MemberProfile, use if it already exists, or create one if it doesn't:
The idea being if I have a million Members with only a hundred access profiles this will save me a lot of space in my database.
I'm happy that it all works and that the development effort is worth is.
My question is "Is this a standard database design pattern, and if so, what is it called?"
EDIT: It's been pointed out that this is compressing the data not normalizing it. Which is the intent behind the design.
Unless your many:many table is always the join of particular other base tables, one is not normalizing. You aren't normalizing here. Normalization does not introduce new column names. It just rearranges the current ones among different base tables.
You are just compressing/encoding your data. There is not necessarily any benefit in this, since now some queries and updates will be slower although your database is smaller. (You have reported that it is worth it in your case.)
I understand you'd like to put a label on that precise transformation, but unfortunately, there aren't many books that discuss database design or refactoring patterns. One of the few is Martin Fowler's Refactoring Databases, which you may know for his work on analysis patterns (he also has a great blog, worth following!). In that book, Martin presents a bunch of refactoring patterns that can be applied to databases and has put a name on common database transformations, including the one you have presented, which he called Split Table.
Split Table. Vertically split (e.g. by columns) an existing table into one or more tables.
A catalog of the database refactorings presented in that book are available here.
Hi I don't know about a pattern name but I've used the same principle before.
To keep this performing well, introduce a checksum to memberProfile based upon the branches for the profile, this way a lookup for an existing profile is plain easy and fast.
But do remember that the checksum is not necessarily unique, in case of collisions you will still have to check the branches, but only for the profiles sharing the same checksum.
Cleanup can be a scheduled task is is nothing more then deleting the profiles without users.

Merge the same Access database frequently (daily/weekly)

I need to use one Access(2007)database on 2 offline locations and then get all the data back in one database. Some advised me to use SharePoint, but after some trial and frustration I wonder if it's really the best way.
Is it possible to manage this in an automated way, with update queries or so?
I have 26 tables, but only 14 need to be updated frequently. I use autonumber to create the parentkey and use cascade updating for the linked tables.
If your data can handle it, it's probably better to use a more natural key for the tables that require frequent updating. I.e. ideally you can uniquely identify a record my some combination of the columns in that record. Autonumbers in two databases can, and very likely will, step on each other, then when you do merge any records based on an old auto number need to be mapped properly. That can be done but is kind of a pain. It'd be nicer to avoid it all from the start.
As for using Sharepoint (I assume the suggestion is to replace your tables with lists, not to just put your accdb on SP) it has a lot of limitations in terms of the kinds of indices that can be created and relationships you can establish. Maybe your data are simple enough to live with this. I'm yet to be able to justify the move.
ultimate the answer to your question is YES it is possible to manage the synchonization with insert/update queries and very likely some VBA (possibly lots depending on how complicated your table hierarchy is). You'll need to be vigilant about two people updating a single record. You'll need to come up with some means to resolve the conflict.

SQL database checklist

I need to create a Sql Database Checklist,
I have some basics points like
Each table must have a primary key
Normalize data to third normal form
Check for Integrity column the column value should be incremented properly.
But can anyone help me to enhance this list ?
Objects conform to a single naming convention
Create foreign key relationships
Apply appropriate index(es)
Use of schema or other mechanisms for controlling read/write access, etc
Consideration given to how long data should be kept before deletion or archive
Version control over scripts for updating the database structure
Mechanism for applications to determine version of database
Backup and recovery plans in place
First, It would help if this is supposed to be a recuring check list, or a checklist for each new instance. Also, is there a specific implementation in mind like SQL Server? MySQL? (this is where the real check list begins). For example, you want to keep an eye on the Transactions Log if its SQL Server...
If this is a relational DB, ER diamgrams go a long way in making sure that you have your problem domain identified and analyzed. You are right track using third normal form where practical. I want to emphasize practical because you also want to try and anticipate and identify which data will be used more than others. If data is highly accessed, you may want consider indexing more than just the primary and/or denormalizing to 2nd normal form. (uses more space, but better performance). Remember that accessing data and updating data are inversely related when indexing is concerned. Hope this helps.

Upgrade strategies for bad DB schema designs

I've shown up at a new job and discovered database which is in dire need of some help. There are many many things wrong with it, including
No foreign keys...anywhere. They're faked by using ints and managing the relationship in code.
Practically every field can be NULL, which isn't really true
Naming conventions for tables and columns are practically non-existent
Varchars which are storing concatenated strings of relational information
Folks can argue, "It works", which it is. But moving forward, it's a total pain to manage all of this with code and opens us up to bugs IMO. Basically, the DB is being used as a flat file since it's not doing a whole lot of work.
I want to fix this. The issues I see now are:
We have a lot of data (migration, possibly tricky)
All of the DB logic is in code (with migration comes big code changes)
I'm also tempted to do something "radical" like moving to a schema-free DB.
What are some good strategies when faced with an existing DB built upon a poorly designed schema?
Enforce Foreign Keys: If a relationship exists in the domain, then it should have a Foreign Key.
Renaming existing tables/columns is fraught with danger, especially if there are many systems accessing the Database directly. Gotchas include tasks that run only periodically; these are often missed.
Of Interest: Scott Ambler's article: Introduction To Database Refactoring
and Catalog of Database Refactorings
Views are commonly used to transition between changing data models because of the encapsulation. A view looks like a table, but does not exist as a finite object in the database - you can change what column is being returned for a given column alias as desired. This allows you to setup your codebase to use a view, so you can move from the old table structure to the new one without the application needing to be updated. But it means the view has to return the data in the existing format. For example - your current data model has:
SELECT t.column --a list of concatenated strings, assuming comma separated
FROM TABLE t
...so the first version of the view would be the query above, but once you created the new table that uses 3NF, the query for the view would use:
SELECT GROUP_CONCAT(t.column SEPARATOR ',')
FROM NEW_TABLE t
...and the application code would never know that anything changed.
The problem with MySQL is that the view support is limited - you can't use variables within it, nor can they have subqueries.
The reality to the changes you wish to make is effectively rewriting the application from the ground up. Moving logic from the codebase into the data model will drastically change how the application gets the data. Model-View-Controller (MVC) is ideal to implement with changes like these, to minimize the cost of future changes like these.
I'd say leave it alone until you really understand it. Then make sure you don't start with one of the Things You Should Never Do.
Read Scott Ambler's book on Refactoring Databases. It covers a good many techniques for how to go about improving a database - including the transitional measures needed to allow both old and new programs to work with the changing design.
Create a completely new schema and make sure that it is fully normalized and contains any unique, check and not null constraints etc that are required and that appropriate data types are used.
Prepopulate each table that fills the parent role in a foreign key relationship with a single 'Unknown' record.
Create an ETL (Extract Transform Load) process (I can recommend SSIS (SQL Server Integration Services) but there are plenty of others) that you can use to refill the new schema from the existing one on a regular basis. Use the 'Unknown' record as the parent of any orphaned records - there will be plenty ;). You will need to put some thought into how you will consolidate duplicate records - this will probably need to be on a case by case basis.
Use as many iterations as are necessary to refine your new schema (ensure that the ETL Process is maintained and run regularly).
Create views over the new schema that match the existing schema as closely as possible.
Incrementally modify any clients to use the new schema making temporary use of the views where necessary. You should be able to gradually turn off parts of the ETL process and eventually disable it completely.
First see how bad the code is related to the DB if it is all mixed in no DAO layer you shouldn't think about a rewrite but if there is a DAO layer then it would be time to rewrite that layer and DB along with it. If possible make the migration tool based on using the two DAOs.
But my guess is there is no DAO so you need to find what areas of the code you are going to be changing and what parts of the DB that relates to hopefully you can cut it up into smaller parts that can be updated as you maintain. Biggest deal is to get FKs in there and start checking for proper indexes there is a good chance they aren't being done correctly.
I wouldn't worry too much about naming until the rest of the db is under control. As for the NULLs if the program chokes on a value being NULL don't let it be NULL but if the program can handle it I wouldn't worry about it at this point in the future if it is doing a default value move that to the DB but that is way down the line from the sound of things.
Do something about the Varchars sooner rather then later. If anything make that the first pure background fix to the program.
The other thing to do is estimate the effort of each areas change and then add that price to the cost of new development on that section of code. That way you can fix the parts as you add new features.

What are common pitfalls when developing a new SQL database?

I know, I quite dislike the catch-all survey type questions, but I couldn't think of a better way to find out what I think I need to know. I'm very green in the world of database development, having only worked on a small number of projects that merely interacted with the database rather than having to actually create a new one from scratch. However, things change and now I am faced with creating my own database.
So far, I have created the tables I need and added the columns that I think I need, including any link tables for many-many relationships and columns for one-to-many relationships. I have some specific questions on this, but I felt that rather than get just these answered, it would make more sense to ask about things I may not even know, which I should address now rather than 6 months from now when we have a populated database and client tools using it.
First the questions on my database which have led me to realise I don't know enough:
How do I ensure my many-to-many link tables and my one-to-many columns are up-to-date when changes are made to the referenced tables? What problems may I encounter?
I am using nvarchar(n) and nvarchar(MAX) for various text fields. Should I use varchar equivalents instead (I had read there may be performance risks in using nvarchar)? Are there any other gotchas regarding the selection of datatypes besides being wary of using fixed length char arrays to hold variable length information? Any rules on how to select the appropriate datatype?
I use int for the ID column of each table, which is my primary key in all but the link tables (where I have two primary keys, the IDs of the referenced table rows). This ID is set as the identity. Are there pitfalls to this approach?
I have created metadata tables for things like unit types and statuses, but I don't know if this was the correct thing to do or not. Should you create new tables for things like enumerated lists or is there a better way?
I understand that databases are complex and the subject of many many worthy tomes, but I suspect many of you have some tips and tricks to augment such reading material (though tips for essential reading would also be welcome).
Community wiki'd due to the rather subjective nature of these kinds of posts. Apologies if this is a duplicate, I've conducted a number of searches for something like this but couldn't find any, though this one is certainly related. Thanks.
Update
I just found this question, which is very similar in a roundabout way.
Not normalising
Not using normalisation
Trying to implement a denormalised schema from the start
Seriously:
Foreign keys will disallow deletes or updates from the parent tables. Or they can be cascaded.
Small as possible: 2 recent SO questions datatypes and (n)varchar
May not be portable and your "natural key" (say "product name") still needs a unique constraint. Otherwise no, but remember that an IDENTITY column is a "surrogate key"
Edit: Say you expect to store fruit with columns FruitID and FruitName. You have no way to restrict to one occurence of "Apple" or "Orange" because although this is your "natural key", you are using a surrogate key (FruitID). So, to maintain integrity, you need a unique constraint on FruitName
Not sure or your meaning, sorry. Edit: Don't do it. Ye olde "One true lookup table" idea.
I'll reply to your subjective query with some vague generalities. :)
The most common pitfall of designing a database is the same pitfall of any programming solution, not fully understanding the problem being solved. In the case of a database, it is understanding the nature of the data. How big it is, how it comes and goes, what business rules must it adhere to.
Here are some questions to ponder.
What is updated the most frequently? Is keeping that table write-locked going to lock up queries? Will it become a hot spot? Even a seemingly well normalized schema can be a poor performer if you don't understand your read versus write ratios.
What are your external interface needs? I've been on projects where the dotted line to "that other system" nearly scuttled the whole project because implementing it was delayed until everything else was in place, that is to say, everything else was inflexible.
Any other unspoken requirements? My favorite is date sensitivity. All the data is there, your reports are beautiful, the boss looks them over and asks, when did that datum change? Who did it and when? Is the database supposed to track itself and its users, or just the data? Will your front end do it for you?
Just some things to think about.
It does sounds like you've got a good grasp on what you're meant to be doing, and indeed there isn't "one true path" to doing databases.
Have you set up cascades for your hierarchical objects (i.e., a single delete at the 'head' of your object in the database will delete all entries in tables relating to that entry)?
Your link tables and 1:n columns should be foreign keys, so there isn't much to worry about if the data changes. By "two primary keys" here, did you mean indexes?
As for metadata tables, I've done them in the past, and I've not done them. A single char status with SQL comment can suffice for a limited set of statuses, but beyond a certain amount, or where you can think of adding more in the future, you might want to have a reference to another table of metadata, or maybe a char(8ish). E.g., I've seen user tables have "NORMAL", "ADMIN", "SUPER", "GUEST", etc for user type, which could have been 1,2,3,4,5 fkeys to a "UserType" table, but with such a restricted enumeration does it matter? Other people have a table of permissions (booleans of what a user can do) instead - many ways to skin a cat.
You might find some usable stuff in these slides:
[http://www.slideshare.net/billkarwin/sql-antipatterns-strike-back][1]
I also am a beginner to database design, but I found this online tutorial very, very helpful:
Database design with UML and SQL, 3rd edition
The author explains all the fundamental design aspects of database, and in a very clear manner. Before I found this online guide I did a lot of wikipedia reading about normalization. While that helped, this author explains the exact same stuff (through 3rd normal form, at least) but in a much, much easier to read way. It pretty much addresses all your questions as well.
I'd suggest a good book. The best IMO is this:
http://www.amazon.com/Server-2005-Database-Design-Optimization/dp/1590595297/ref=ntt_at_ep_dpt_1
In addition to not normalizing, a common problem I see is overindexing, done before there are performance measurements that take into account your in-production mix of reads vs. writes.
It's really, really easy to add an index to speed up a query, and harder to figure out which one to remove when you have several that are getting updated during an INSERT or UPDATE.
The middle ground is to go after obvious secondary indexes (e.g., for common, frequent lookups by name on large tables), deferring other candidate indexes until you have reasonable performance tests in place.
Among other things, not using primary keys, not thinking ahead about whether you'll be using indexed views (and designing tables accordingly; I once had to drop and recreate a large table at my site to change its ANSI_NULL attribute to ON so that I could then use it with an indexed view), using indices.