I am creating a database management webapp with an unusual requirement: the user must be able to add new 'fields' to existing objects. For example, the table 'Employees' holds the names and IDs of employees. Suddenly the owner of the system wants to know whether his employees have a driver's license, but neither our table nor the app anticipated that.
The only options that come to my mind are to:
1) Add a big varchar field storing additional properties as JSON or something similar
2) Add a table 'additional properties' that allows creating new property objects linked by PK to existing users
Then we will have:

    TABLE USER             TABLE PROPERTIES
    - ID    <------------- - FK
    - NAME                 - NAME  (driver license)
                           - VALUE (true)
How bad is the second idea? Are there any better options other than going NoSQL?
There is no issue with either approach. EAV (entity-attribute-value) models have been part of relational databases, probably since the earliest databases were created.
They do have some downsides:
The value attribute is (generally) a string, so values of other types have to be coerced to and from strings rather than being stored natively in a typed column.
The values cannot be part of declared foreign key relationships.
Validating values is tricky -- very complicated check constraints, for instance.
There is no (easy) way to ensure that an entity has a particular value.
But for user-defined or sparsely populated values, EAV is definitely a reasonable choice.
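As a rough illustration of the EAV option, here is a minimal sketch; the table and column names are placeholders I'm assuming, not anything from your schema:

    -- Hypothetical EAV sketch; all names are illustrative assumptions
    CREATE TABLE employee (
        id   INT PRIMARY KEY,
        name VARCHAR(100) NOT NULL
    );

    CREATE TABLE employee_property (
        employee_id INT          NOT NULL REFERENCES employee (id),
        name        VARCHAR(50)  NOT NULL,   -- e.g. 'driver_license'
        value       VARCHAR(255) NULL,       -- everything is stored as a string
        PRIMARY KEY (employee_id, name)
    );

    -- "Does employee 42 have a driver's license?"
    SELECT value
    FROM employee_property
    WHERE employee_id = 42 AND name = 'driver_license';

The downsides listed above are visible here: value is a plain string, so booleans, dates, and numbers all get coerced into it, and nothing forces every employee to have a 'driver_license' row.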
JSON is another reasonable choice. That does, however, require a one-time change to the database to add a JSON column. Some databases offer indexing on JSON values, which can improve performance.
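A minimal sketch of the JSON route, reusing the illustrative employee table from the sketch above and assuming MySQL-style JSON functions (the exact syntax and indexing options vary by database):

    -- One-time schema change to add the JSON column
    ALTER TABLE employee ADD COLUMN extra_properties JSON NULL;

    -- Record the new fact without touching the table definition again
    UPDATE employee
    SET extra_properties = JSON_SET(COALESCE(extra_properties, JSON_OBJECT()),
                                    '$.driver_license', 'yes')
    WHERE id = 42;

    -- Read it back; some databases can also index expressions like this
    SELECT id, name, extra_properties->>'$.driver_license' AS driver_license
    FROM employee;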
If "has-drivers-license" is a one-time change, then you might just want a separate table with the same primary key. The next time that a new column is needed, you can modify the "options" table, rather than the main table. This allows better support for validating values (all values are unique, for example) or defining foreign key constraints.
Related
In my example, I have a watch, which indicates that a user wants notifications about events on a different item, say a group or an organization.
I see two ways to do this:
Have a groupwatch resource, with a groupwatch table with id, user, group (group being an FK to the group resource and table); and an orgwatch resource, with an orgwatch table with id, user, organization (organization being an FK to the organization resource and table)
Have a generic watch resource, with a watch table, with id,user,type,typeid. type is one of group or organization, and typeid is the ID of the group or organization being watched.
Since both of them are watches, it seems a waste to have two different tables and resources to watch 2 different objects. It gets worse if I start watching 4, 5, 6, 20, 50 different types of resources.
On the other hand, a foreign key relationship appears impossible if I just have a generic typeid, which means that my database (if relational) and my framework (activerecord or anything else) cannot enforce it correctly.
How do I best implement this type of "association to different types of record/table for each record in my table"?
UPDATE:
Are my only choices for doing this:
separate tables/resources for each watch type, which enables the database to enforce relational integrity and do joins
single table for all watches, but I will have to enforce relational integrity and do joins at the app level?
If you add a new type of resource once every six months, you may want to define your tables in such a way that adding new resources involves changing data definitions. If you add a new resource type every week, you may want to make your data definitions stay the same when you add new types. There's a downside to either choice.
If you do choose to define tables in such a way that the types are visible in the table structure, there are two patterns often used with type/subtype (aka class/subclass) situations.
One pattern has been called "single table inheritance". Put data about all the types in a single table, and leave some columns NULL wherever they do not apply.
Another pattern has been called "class table inheritance". Define one table for the superclass, with all the data that is common to all the types. Then define tables for each subtype (subclass) to contain class specific data. Make the primary key of the subtype tables a duplicate of the primary key in the supertype table, and also declare it as a foreign key that references the primary key of the supertype table. It's going to be up to the app, at insert time, to replicate the value of the primary key in the supertype table over in the subtype table.
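A rough sketch of class table inheritance for the watch example; the table and column names (including app_user, grp, and organization as the referenced tables) are assumptions for illustration only:

    -- Supertype table: data common to every kind of watch
    CREATE TABLE watch (
        watch_id INT PRIMARY KEY,
        user_id  INT NOT NULL REFERENCES app_user (user_id)
    );

    -- Subtype tables: shared primary key, which is also an FK to the supertype
    CREATE TABLE group_watch (
        watch_id INT PRIMARY KEY REFERENCES watch (watch_id),
        group_id INT NOT NULL REFERENCES grp (group_id)
    );

    CREATE TABLE org_watch (
        watch_id INT PRIMARY KEY REFERENCES watch (watch_id),
        org_id   INT NOT NULL REFERENCES organization (org_id)
    );

At insert time the application writes a row to watch and a row to the matching subtype table, using the same watch_id in both.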
I like Fowler's treatment of these two patterns.
http://martinfowler.com/eaaCatalog/classTableInheritance.html
http://www.martinfowler.com/eaaCatalog/singleTableInheritance.html
This matter of sharing primary keys has a few beneficial effects.
First, it enforces the one-to-one nature of the IS-A relationships.
Second, it makes it easy to find out whether a given entry belongs to a desired subtype, by just joining with the subtype table. You don't really need an extra type field.
Third, it speeds up the joins, because of the index that gets built when you declare a primary key.
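For example, with the assumed names from the sketch above, finding the group watches (and their groups) is an ordinary join, with no type column involved:

    SELECT w.watch_id, w.user_id, gw.group_id
    FROM watch w
    JOIN group_watch gw ON gw.watch_id = w.watch_id;   -- only group watches survive the join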
If you want a structure that can adapt to new attributes without changing data definitions, you can look into E-A-V design. Be careful, though. Sometimes this results in data that is nearly impossible to use, because the logical structure is so obscure. I usually think of E-A-V as an anti-pattern for this reason, although there are some who really like the results they get from it.
On a project I'm working on, we have an old database structure we're migrating data from into a new database structure, and we need to preserve the old keys for a few tables for backwards compatibility with some existing application functionality.
Currently, there are two approaches we are considering for addressing this need:
Create an extra nullable field for each table and insert the old key into that new field
Create companion table(s) that contain the old and new key mappings
Note: new data will not generate old ID keys, so in approach #1, eventually the nullable field will contain nulls over time for new records.
Which approach is better for a cleaner database design, and data management long-term?
Do you see any issues with either approach, and if so, what issues?
Is there a #3 approach that I haven't thought of yet?
You mention SQL, but is it SQL Server?
If SQL Server, look into SET IDENTITY_INSERT. This allows you to explicitly insert values into an auto-increment (identity) column instead of that column being in a protected mode.
However, I believe that if you explicitly include the PK and its value in the insert statement, it will respect that and save the original key in the original column you are hoping to retain, without having to force in yet another column for backward-compatibility purposes.
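A minimal T-SQL sketch of that, using a hypothetical dbo.Employees table with an identity primary key (the names and values are placeholders):

    -- Temporarily allow explicit values in the identity column during the migration
    SET IDENTITY_INSERT dbo.Employees ON;

    INSERT INTO dbo.Employees (EmployeeID, Name)   -- the identity column must be listed explicitly
    VALUES (1042, 'Jane Smith');                   -- 1042 is the old key being preserved

    SET IDENTITY_INSERT dbo.Employees OFF;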
I am designing a database that contains a table, reference, with a column, type, that holds one of several predefined values (e.g., book, movie, magazine, etc.). I intend the range of possible values to expand over time (e.g., if I realize that I missed the academic_paper type, I want to be able to add it).
The easiest solution would seem to be to simply store a string representing the type into the table. But this sounds like it would result in a lot of wasted space.
The other solution I thought of is creating a new table reference_types, which the type column references via a foreign key. This seems to have the added benefit of ensuring valid type values (so that I won't accidentally mistype a "magzine" somewhere in my code) and possibly allowing faster queries for all media of a certain type (since integer comparisons should be much faster than string comparisons), but it would also slow my application down a bit, since joins would be required whenever I need the reference type, and it would probably complicate logic because of those extra joins.
What are your thoughts on schema design for this problem?
Your second solution is the correct one. Create a secondary table to store your reference types and link them using a foreign key.
For further reading on this subject the search term you'd want to use is 'database normalisation'.
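A minimal sketch of that lookup-table approach, with names chosen only for illustration:

    CREATE TABLE reference_types (
        reference_type_id INT PRIMARY KEY,
        name              VARCHAR(50) NOT NULL UNIQUE   -- 'book', 'movie', 'magazine', ...
    );

    CREATE TABLE reference (
        reference_id      INT PRIMARY KEY,
        title             VARCHAR(200) NOT NULL,
        reference_type_id INT NOT NULL REFERENCES reference_types (reference_type_id)
    );

    -- Adding the missed type later is a row insert, not a schema change
    INSERT INTO reference_types (reference_type_id, name) VALUES (7, 'academic_paper');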
Create the reference_types table, and in your references table use an integer key but also add a reference_type_name field.
You can query the references table to get the integer key and print its name when needed without performing a join to the other table, and still use the reference_types table for other operations; just keep the type names in both tables equal.
I know it sounds redundant, but it's really the fastest way to do a simple query by int key and still have it all together.
It depends: if you will want to attach other information to the reference types, use the second approach. If not, use the first one, because it's faster and the information stored is only a string (you can always select distinct values to retrieve your types).
I have a table that will contain information for 3 other tables. The design I have is that this table will have a column that holds the object's ID and another column that holds the object's type (and thus the table that the row refers to).
Two questions:
a) Is that the best design or is there something else more widely accepted?
b) What is the recommended procedure to ensure that IDs are valid for the given object's type?
If I understood your question correctly, each row in your table links to exactly one of the three other tables.
Your approach (type field + one foreign key field) is a valid design, and it's useful if you want to create a general-purpose table that contains meta-information about your data (e.g. a list of records that should be retransmitted for replication).
Another approach, which might be more suitable for real application-level data, would be to have three columns, each being a foreign key to one of the three tables, and to add a constraint that requires exactly two of those fields to be null (a minimal sketch follows the list below). This approach has the following advantages:
The three FKs do not need to have the same data type.
The JOIN syntax becomes more natural (not involving the type field).
You can add referential integrity constraints on those FK columns.
You don't need to ensure correctness of the type field -- in fact, you don't need the type field at all. The type is determined implicitly by the one FK column which is not null.
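Here is a minimal sketch of that idea under assumed names (table_a, table_b, table_c stand in for your three tables); the CHECK constraint enforces that exactly one of the foreign keys is set, though some older databases ignore CHECK constraints, so verify support in yours:

    CREATE TABLE detail (
        detail_id INT PRIMARY KEY,
        a_id      INT NULL REFERENCES table_a (a_id),
        b_id      INT NULL REFERENCES table_b (b_id),
        c_id      INT NULL REFERENCES table_c (c_id),
        -- exactly one FK is non-null, i.e. exactly two of the three are NULL
        CHECK (
            (CASE WHEN a_id IS NOT NULL THEN 1 ELSE 0 END) +
            (CASE WHEN b_id IS NOT NULL THEN 1 ELSE 0 END) +
            (CASE WHEN c_id IS NOT NULL THEN 1 ELSE 0 END) = 1
        )
    );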
a) I'm supposing you have a one-to-many relationship between object types and objects. In a normal design you'd have a reference from the object type column in the objects table to the primary key of the object types table.
b) I would enforce referential integrity in the relationship properties (this depends on the DBMS you are using). It's also up to you to use cascading on updates and deletes. This way, an update or a delete of the primary key in the object types table would be reflected in the objects table, updating its foreign key column (the object type column) or deleting the records that have that object type.
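For instance, a sketch of that relationship with cascading enabled (the names are placeholders and the exact syntax depends on your DBMS):

    CREATE TABLE object_types (
        object_type_id INT PRIMARY KEY,
        name           VARCHAR(50) NOT NULL
    );

    CREATE TABLE objects (
        object_id      INT PRIMARY KEY,
        object_type_id INT NOT NULL,
        FOREIGN KEY (object_type_id)
            REFERENCES object_types (object_type_id)
            ON UPDATE CASCADE   -- key changes in object_types propagate here
            ON DELETE CASCADE   -- deleting a type deletes its objects
    );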
The basics of DB schema design are easy, but more complicated situations can be really complicated to figure out what's best. There is a lot of personal subjectivity that can come into play here, and even performance can be a factor in denormalizing a design.
Disclaimer aside, my personal recommendation is to never use a column to store more than one kind of FK, i.e. a column for FKs should store FKs that point only to a single table. If you don't do this, you have to map the cascade of that column's data into multiple sub-select queries inside your code, and it can begin to get more messy than you expected. Your given "Problem No. 2, ensuring validity between type and FK" is just the beginning of a whole world of pain that will cascade throughout your source code.
Assuming you change the design to use one field per FK reference, I would also check whether each FK field in your main "information-holding table" will be fully valid for each record. If not, I would move out the FK columns that will only be applicable some of the time to a separate table.
We're using Visual Studio Database Edition (DBPro) to manage our schema. This is a great tool that, among the many things it can do, can analyse our schema and T-SQL code based on rules (much like what FxCop does with C# code), and flag certain things as warnings and errors.
Some example rules might be that every table must have a primary key, no underscores in column names, every stored procedure must have comments, etc.
The number of rules built into DBPro is fairly small, and a bit odd. Fortunately DBPro has an API that allows the developer to create their own. I'm curious as to the types of rules you and your DB team would create (both schema rules and T-SQL rules). Looking at some of your rules might help us decide what we should consider.
Thanks - Randy
Some of mine. Not all could be tested programmatically:
No Hungarian-style prefixes (like "tbl" for table, "vw" for view)
If there is any chance this would ever be ported to Oracle, no identifiers longer than 30 characters.
All table and column names expressed in lower-case letters only
Underscores between words in column and table names--we differ on this one obviously
Table names are singular ("customer" not "customers")
Words that make up table, column, and view names are not abbreviated, concatenated, or acronym-based unless necessary.
Indexes will be prefixed with "IX_".
Primary Keys are prefixed with "PK_".
Foreign Keys are prefixed with "FK_".
Unique Constraints are prefixed with "UC_".
I suspect most of my list would be hard to put in a rules engine, but here goes:
If possible I'd have it report any tables that are defined wider than the number of bytes that can be stored in a single record (excluding varchar(max) and text type fields) and/or a data page.
I want all related PK and FK columns to have the same name if at all possible. The only time it isn't possible is when you need to have two FKs in the same table relating to one PK and even then, I would name it the name of the PK and a prefix or suffix describing the difference. For instance if I had a PersonID PK and a table needed to have both the sales rep id and the customer id, they would be CustomerPersonID, and RepPersonID.
I would check to make sure all FKs have an index.
I would want to know about all fields that are required but have no default value. Depending on what it is, you may not want to define a default, but I would want to be able to easily see which ones don't have one, to hopefully find the ones that should.
I would want all triggers checked to see that they are set-based and not designed to run for one row at a time.
No table without a defined Unique index or PK. No table where the PK is more than one field. No table where the PK is not an int.
No object names that use reserved words for the database I'm using.
No fields with the word Date as part of the name that are not defined as date or datetime.
No table without an associated audit table.
No field called SSN, SocialSecurityNumber, etc. that is not encrypted. Same for any field named CreditCardNumber.
No user defined datatypes (In SQL Server at least, these are far more trouble than they are worth.)
No views that call other views. Experience has shown me these are often a performance disaster waiting to happen, especially if they go more than one layer deep.
If using replication, no table without a GUID field.
All tables should have a DateInserted field and InsertedBy field (even with auditing, it is often easier to research data problems if this info is easily available.)
Consistent use of the same case in naming. It doesn't matter which as long as all use the same one.
No tables with a field called ID. Hate these with a passion. They are so useless. ID fields should be named tablenameID if a PK and with the PK name if an FK.
No spaces or special characters in object names. In other words, if the name needs special handling for the database to recognize it in the correct context in a query, don't use it.
If it is going to analyze code as well, I'd want to see any code that uses a cursor or a correlated subquery. Why create performance problems from the start?
I would want to see if a proc uses dynamic SQL and, if so, whether it has an input bit parameter called Debug (and code to only print the dynamic SQL statement rather than execute it when the Debug variable is set to 1); see the sketch after this list.
I'd want to be able to check that, if a proc contains more than one statement that changes data (insert/update/delete), there is also an explicit transaction and error trapping to roll the whole thing back if any part of it fails.
I'm sure I could think of more.
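As a rough T-SQL sketch of the dynamic SQL Debug convention mentioned in the list above (the procedure, table, and parameter names are placeholders, not anything from the original post):

    CREATE PROCEDURE dbo.SearchEmployees
        @NameFilter NVARCHAR(100),
        @Debug      BIT = 0
    AS
    BEGIN
        DECLARE @sql NVARCHAR(MAX);

        SET @sql = N'SELECT EmployeeID, Name FROM dbo.Employees '
                 + N'WHERE Name LIKE @NameFilter;';

        IF @Debug = 1
            PRINT @sql;   -- just show the statement that would run
        ELSE
            EXEC sp_executesql @sql,
                 N'@NameFilter NVARCHAR(100)',
                 @NameFilter = @NameFilter;
    END;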