I like the entity-attribute-value thing because I can add new fields and have the rows automatically removed when the foreign table row is removed, but I don't like the fact that I can't enforce a data type. And the select queries are complicated.
Are there better ways that don't involve creating a table for each attribute?
If I create a very big table with every possible attribute, will this table take up space even if most rows will have NULL on most columns?
You can enforce data types in an EAV model by using multiple value fields. This gets a bit tricky because you need another column to specify the type and then additional constraints to specify that only one value is filled and that it matches the type.
In most databases, you can handle this using check constraints.
In addition, you can use just a single string value and then enforce contents of the string using check constraints. This is often sufficient. Such constraints make good use of regular expressions in databases that support them.
As for your second question. Each row is going to occupy space for the entity/attribute columns. Whether or not the NULL value occupies any space depends on the database, but this space would typically be small.
Related
We have a database schema that involves a heavy amount of polymorphism/inheritance/subtyping. Certain foreign keys such as a comment subject, or a permission target, point to a large number of tables, and certain properties like globally unique codes are shared by multiple tables.
The way we have this structured is via table-per-class inheritance, where the parent and child tables share a UUID primary key. UUID's were chosen so that tables could be merged and separated to adjust to business needs without worrying about collisions.
Currently there is no explicit discriminator column on the parent types, but I would like to add one for a couple reasons. It would rule out the possibility of erroneously pointing two different concrete subtypes to the same parent type. It would also frequently save joins, as often the child-specific data is not needed, only the knowledge of which subtype matches a given id.
I can think of a few possible approaches:
Just use a plain text field that stores the name of the concrete subtype table.
Do the same as above but with a custom Enum type that lists the possible tables.
Use an "id" field that points to a lookup table.
(2) seems like it would be a nicer option than (1), however it has the rather large downside of not allowing me to drop a value from the Enum without a ton of migration pain. This is particularly painful if this discriminator column shows up on a lot of tables, which it likely will.
(3) is often used for "enums that need to change", however it'd require hardcoding UUID/int id values into the DDL in order to properly handle the foreign keys on the subtypes, which seems like a bit of a deal breaker.
This has me leaning towards (1), but I was wondering if there was a better option. Perhaps even a text type optimized for frequently repeated identifiers with very limited character sets.
In the light of your explanations, I would use the first option.
If you use short strings, that shouldn't waste noticeable performance and storage space (the space taken up by a short string is one byte more than the string itself). You'd have to use check constraints to constrain the strings that can be used, but that's not a big hassle.
I am designing a database to contain a table reference, with a column type that is one of several predefined values (e.g., book, movie, magazine, etc.). I intend the range of possible values to expand over time (e.g. if I realize that I missed the academic_paper type, I want to be able to put that in).
The easiest solution would seem to be to simply store a string representing the type into the table. But this sounds like it would result in a lot of wasted space.
The other solution I thought of is creating a new table reference_types, which the type column references in its foreign key. This seems to have the added benefit of ensuring valid foreign keys (so that I won't accidentally mistype a "magzine" somewhere in my code), possible allow for faster queries for all media of a certain type (since integer comparisons should be much faster than string comparisons), but also slow my application down a bit as joins would be required whenever I need the reference type, and probably complicate logic because of those extra joins.
What are your thoughts on schema design for this problem?
Your second solution is the correct one. Create a secondary table to store your reference types and link them using a foreign key.
For further reading on this subject the search term you'd want to use is 'database normalisation'.
Create the reference_types table. And in your references table use integer and also add a reference_type_name field.
You can query the references table to get the integer key and print its name when needed without performing a join to the other table, and still use that table to perfom other operations, just keep both tables with equal type names.
I know it sonds redundant, but it's really the fastest way to do a simple query by int key and have it all together.
It depends, if you will want to add some other information to reference types, then use the second approach. If not, use the first one because it's faster and the information stored is only a string (you can always select unique to retrieve your types). Read this article for more info.
The user wants to add new fields in UI dynamically. This new field should get stored in database and they should be allowed to perform CRUD on it.
Now I can do this by specifying a XML but I wanted a better way where these new columns are searchable. Also the idea of firing ALTER statement and adding a new column seems wrong.
Can anyone help me with a design pattern on database server side of how to solve this problem?
This can be approached using a key value system. You create a table with the primary key column(s) of the table you want to annotate, a column for the name of the attribute, and a column for its value. When you user wants to add an attribute (say height) to the record of person 123 you add a row to the new table with the values (123, 'HEIGHT', '140.5').
In general you cast the values to TEXT for storage but if you know all the attributes will be numeric you can choose a different type for the value column. You can also (not recommended) use several different value columns depending on the type of the data.
This technique has the advantage that you don't need to modify the database structure to add new attributes and attributes are only stored for those records that have them. The disadvantage is that querying is not as straightforward as if the columns were all in the main data table.
There is no reason why a qualified business user should not be allowed to add a column to a table. It is less likely to cause a problem than just about anything else you can imagine including adding a new row to a table or changing. the value of a data element.
Using either of the methods described above do not avoid any risk; they are simply throwbacks to COBOL filler fields or unnecessary embellishments of the database function. The result can still be unnormalized and inaccurate.
These same business persons add columns to spreadsheets and tables to Word documents without DBAs getting in their way.
Of course, just adding the column is the smallest part of getting an information system to work, but it is often the case that it is perceived to be an almost insurmountable barrier. It is in fact 5 min worth of work assuming you know where to put it. Adding a column to the proper table with the proper datatype is easy to do, easy to use, and has the best chance of encouraging data quality.
Find out what the maximum number of user-added fields will be and add them before hand. For example 'User1', 'User2', 'User3', 'User4'...etc. You can then enable the fields on the UI based on some configurable settings.
I'm wanting to store a wide array of categorical data in MySQL database tables. Let's say that for instance I want to to information on "widgets" and want to categorize attributes in certain ways, i.e. shape category.
For instance, the widgets could be classified as: round, square, triangular, spherical, etc.
Should these categories be stored within a table to reference them best from an application? Another possibility, I would imagine, would be to add a column to widgets that contained a shape column that contained a tiny int. That way my application could search shapes by that and then use a coordinating enum type that would map the shape int meanings.
Which would be best? Or is there another solution that I'm not thinking of yet?
Define a category table for each attribute grouping. IE:
WIDGET_SHAPE_TYPE_CODES
WIDGET_SHAPE_TYPE_CODE (primary key)
DESCRIPTION
Then use a foreign key reference in the WIDGETS table:
WIDGETS
WIDGET_ID (primary key)
...
WIDGET_SHAPE_TYPE_CODE (foreign key)
This has the benefit of being portable to other databases, and more obvious relationships which means simpler maintenance.
What I would do is start with a Widgets table that has a category field that is a numeric type. If you also use the category table the numeric category is a foreign key that relates to a row in the category table. A numeric type is nice and small for better performance.
Optionally you can add a category table containing a a primary key numeric value, and a text description. This matches up the numeric value to a human friendly text value. This table can be used to convert the numbers to text if you just want to run reports directly from the database. The nice thing about having this table is you don't need to update an executable if you add a new category. I would add such a table to my design.
MySQL's ENUM is handy but it stores int the table as a string so it uses up more space in the table than is really needed. However it does have the advantage of preventing values that are not recognized from being stored. Preventing the storage of invalid numeric values is possible, but not as elegantly as ENUM. The other problem with ENUM is because it is regarded as a string, the database must do more work if you are selecting by the value because instead of comparing a single number, multiple characters have to be compared.
If you really want to you can have an enumeration in your code that coverts the numeric category back into something more application code friendly, but you are making your code more difficult to maintain by doing this. However it can have a performance advantage because fewer bytes have to be returned when you run a query. I would try to avoid this because it requires updating the application code every time a category is added to the database. If you really need to squeeze performance out of the database you could select the whole category table, and select the widgets table and merge them in application code, but that is a rare circumstance since the DB client almost always has a fast connection to the DB server and a few more bytes over the network are insignificant.
I think the best way is use ENUM, for example thereare pre defined enum type in mysql - http://dev.mysql.com/doc/refman/5.0/en/enum.html
I am refactoring an old Oracle 10g schema to try to introduce some normalization. In one of the larger tables, there is a text field that has at most, 10-15 possible values. In my mind, it seems that this field is an example of unnecessary data duplication and should be extracted to a separate table.
After examining the data, I cannot find one relevant piece of information that could be associated with that text value. Basically, if I pulled that value out and put it into its own table, it would be the only field in that table. It exists today as more of a 'flag' field. Should I create a two-column table with a surrogate key, keep it as it is, or do something entirely different? Am I doing more harm than good by trying to minimize data duplication on this field?
You might save some space by extracting the column to a separate table. This is called a lookup table. It can give you a couple of other benefits:
You can declare a foreign key constraint to the lookup table, so you can rely on the column in the main table never having any value other than the 10-15 values you want.
It's easy to query for a concise list of all permitted values, by querying the lookup table. This can be faster than using SELECT DISTINCT on the main table's column. It also returns values that are permitted, but not currently used in the main table.
If you change a value in the lookup table, it automatically applies to all rows in the main table that reference it.
However, creating a lookup table with one column is not strictly normalization. You're just replacing one value with another. The attribute in the main table either already supports a normal form, or not.
Using surrogate keys (vs. natural keys) also has nothing to do with normalization. A lot of people make this mistake.
However, if you move other attributes into the lookup table, attributes that depend only on the lookup value and therefore would create repeating groups (violating 3NF) in the main table if you left them there, then that would be normalization.
If you want normalization break it out.
I think of these types of data in DBs as the equivalent of enums in C,C++,C#. Mostly you put them in the table as documentation.
I often have an ID, Name, Description, and auditing columns for them (eg modified by, modified date, create date, create by, active.) The description field is rarely used.
Example (some might say there are more than just 2)
Gender
ID Name Audit Columns...
1 Male
2 Female
Then in your contacts you would have a GenderID column which would link to this one.
Of course you don't "need" the table. You could have external documentation somewhere that says 1=Male, 2=Female -- but I think these tables serve to document a system.
If it's really a free-entry text field that's not re-used somewhere else in the database, and there's just a single field without repeated instances, I'd probably go ahead and leave it as it is. If you're determined to break it out I'd create a 'validation' table with a surrogate key and the text value, then put the surrogate key in the base table.
Share and enjoy.
Are these 10-15 values actually meaningful, or are they really just flags? If they're meaningful pieces of text and it seems wasteful to replicate them, then sure create a lookup table. But if they're just arbitrary flag values, then your new table will be nothing more than a mapping from one arbitrary value to another, and not terribly helpful.
A completely separate question is whether all or most of the rows in your big table even have a value for this column. If not, then indeed you have a good opportunity for normalization and can create a separate table linking the primary key from your base table with the flag value.
Edit: One thing. If there's some chance that one of these "flag" values is likely to be wholesale replaced with another value at some point in the future, that would be another good reason to create a table.