SQL database design with some user defined fields - sql

I'm developing a database schema to handle collection of data and later, reporting on this data.
After a requirements discussion, it seems that either an entity-attribute-value (EAV) solution, or a flat table solution would be alright - since the data is somewhat sparse but not highly sparse.
However, user defined fields will become a must in the future, but I understand that querying and optimizing an RDBMS with EAV tables can become complex.
I've taken a look at the discussion here, and I was thinking an option similar to option 1 would be possible. For example, have a number of set fields, then a number of spare fields that users can define the labels of.
In terms of reporting, is there any downside in using this approach rather than using EAV?

You will regret EAV, especially when it comes to reporting
Make sure you're aware of existing data model patterns before you try anything: Ready to use database model patterns
Familiarize yourself with Table Inheritance: How can you represent inheritance in a database?
Consider allowing users to modify their own schemas: https://martinfowler.com/bliki/UserDefinedField.html
EAV is almost always a really bad idea. If you still need custom fields after trying the above, use a blob type (like JSON or XML) with indexing: http://backchannel.org/blog/friendfeed-schemaless-mysql . Postgres's binary jsonb is fast and allows indexing/querying

Related

Should I use EAV database design model or a lot of tables

I started a new application and now I am looking at two paths and don't know which is good way to continue.
I am building something like eCommerce site. I have a categories and subcategories.
The problem is that there are different type of products on site and each has different properties. And site must be filterable by those product properties.
This is my initial database design:
Products{ProductId, Name, ProductCategoryId}
ProductCategories{ProductCategoryId, Name, ParentId}
CategoryProperties{CategoryPropertyId, ProductCategoryId, Name}
ProductPropertyValues{ProductId, CategoryPropertyId, Value}
Now after some analysis I see that this design is actually EAV model and I read that people usually don't recommend this design.
It seems that dynamic sql queries are required for everything.
That's one way and I am looking at it right now.
Another way that I see is probably named a LOT WORK WAY but if it's better I want to go there.
To make table
Product{ProductId, CategoryId, Name, ManufacturerId}
and to make table inheritance in database wich means to make tables like
Cpus{ProductId ....}
HardDisks{ProductId ....}
MotherBoards{ProductId ....}
erc. for each product (1 to 1 relation).
I understand that this will be a very large database and very large application domain but is it better, easier and performance better than the option one with EAV design.
EAV is rarely a win. In your case I can see the appeal of EAV given that different categories will have different attributes and this will be hard to manage otherwise. However, suppose someone wants to search for "all hard drives with more than 3 platters, using a SATA interface, spinning at 10k rpm?" Your query in EAV will be painful. If you ever want to support a query like that, EAV is out.
There are other approaches however. You could consider an XML field with extended data or, if you are on PostgreSQL 9.2, a JSON field (XML is easier to search though). This would give you a significantly larger range of possible searches without the headaches of EAV. The tradeoff would be that schema enforcement would be harder.
This questions seems to discuss the issue in greater detail.
Apart from performance, extensibility and complexity discussed there, also take into account:
SQL databases such as SQL Server have full-text search features; so if you have a single field describing the product - full text search will index it and will be able to provide advanced semantic searches
take a look at no-sql systems that are all the rage right now; scalability should be quite good with them and they provide support for non-structured data such as the one you have. Hadoop and Casandra are good starting points.
You could very well work with the EAV model.
We do something similar with a Logistics application. It is built on .net though.
Apart from the tables, your application code has to handle the objects correctly.
See if you can add generic table for each object. It works for us.

Normalisation and multi-valued fields

I'm having a problem with my students using multi-valued fields in access and getting confused about normalisation as a result.
Here is what I can make out. Given a 1-to-many relationship, e.g.
Articles Comments
-------- --------
artID{PK} commID{PK}
text text
artID{FK}
Access makes it possible to store this information into what appears to be one table, something like
Articles
--------
artID{PK}
text
comment
+ value
"value" referring to multiple comment values for the comment "column", which access actually stores as a separate table. The specifics of how the values are stored - table, its PK and FK - is completely hidden, but it is possible to query the multi-valued field, e.g. in the example above with the query
INSERT INTO article( [comment].Value )
VALUES ('thank you')
WHERE artID = 1;
But the query doesn't quite reveal the underlying structure of the hidden table implementing the multi-valued field.
Given this (disaster, in my view) - my problem is how to help newcomers to database design and normalisation understand what Access is offering them, why it may not be helpful, and that it is not a reason to ignore the basics of the relational model. More specifically:
Are there better ways, besides queries as above, to reveal the structure behind multi-valued fields?
Are there good examples of where the multi-valued field is not good enough, and shows the advantage of normalising explicitly?
Are there straightforward ways to obtain the multi-select visual output of Access multi-values, but based on separate, explicit tables?
Thanks!
I cannot give you advice in using this feature, because I never used it; however, I can give you reasons not to use it.
I want to have full control on what I'm doing. This is not the case for multi-valued fields, therefore I don't use them.
This feature is not expandable. What if you want to add a date field to your comments, for instance?
It is sometimes necessary to upsize an Access (backend) database to a "big" database (SQL Server, Oracle). These Databases don't offer such a feature. It is often the customer who decides which database has to be used. Recently I had to migrate an Access application (frontend) using an Oracle backend to a SQL-Server backend because my client decided to drop his Oracle server. Therefore it is a good idea to restrict yourself to use only common features.
For common tasks like editing lookup tables I created generic forms. My existing solutions will not work with multi-valued fields.
I have a (self-made) tool that synchronizes changes in the structure of the database on my developer’s site with the database on the client’s site. This tool cannot deal with multi-valued fields.
I have tools for the security management that can grant SELECT, INSERT, UPDATE and DELETE rights on tables or revoke them. Again, the management tool does not work with multi-valued fields.
Having a separate table for the comments allows you to quickly inspect all the comments (by opening the table). You cannot do this with multi-valued fields.
You will not see the 1 to n relation between the articles and the comments in a database diagram.
With a separate table you can choose whether you want to cascade deletes to the details table or not. If you don't, you will not be able to delete an article as long as there are comments attached to it. This can be desirable, if you want to protect the comments from being deleted inadvertently.
It is important to realize the difference between physical and logical relationships. Today the whole internet and web services (SOAP) quite much realizes on a data format that is multi-value in nature.
When you represent multi-value data with a relational database (such as Access), then behind the scenes you are using a traditional (and legitimate) relation. I cannot stress that as such, then the use of multi-value columns in Access is in fact a LEGITIMATE relational model.
The fact that table is not exposed does not negate this issue. In fact, if you represent an invoice (master record, and repeating details) as a XML data cube, then we see two things:
1) you can build and represent that invoice with a relational database like Access
2) such a relational data model that is normalized can ALSO be represented as a SINGLE xml string.
3) deleting the XML record (or string) means that cascade delete of the child rows (invoice details) MUST occur.
So while it is true that Multi-Value fields been added to Access to deal with SharePoint, it is MOST important to realize that such data can be mapped to a relational database (if you could not do this, then Access could not consume that XML data using relational database tables as ACCESS CURRENTLY DOES RIGHT NOW).
And with the web such as XML, and SharePoint then the need to consume and manage and utilize such data is not only widespread, but is in fact a basic staple of the internet.
As more and more data becomes of a complex nature, we find the requirement for multi-value data exploding in use. Anyone who used that so called "fad" the internet is thus relying and using data that is in fact VERY OFTEN XML and is multi-value (complex) in nature.
As long as the logical (not physical) relational data model is kept, then use of multi-value columns to represent such data is possible and this is exactly what Access is doing (it is mapping the relational data model to a complex model). Note that the complex (xml) data model does NOT necessary have to be relational in nature. However, if you ARE going to map such data to Access then the complex multi-value model MUST CONFORM TO A RELATIONAL data model.
This is EXACTLY what is occurring in Access.
The fact that such a correct and legitimate math relational model is not exposed is of little issue here. Are we to suggest that because Excel does not expose the binary codes used then users will never learn about computers? Or perhaps we all must program in assembler so we all correctly learn how computers works.
At the end of the day, who cares and why does this matter? The fact that people drive automatic cars today does not toss out the concept that they are using different gears to operate that car. The idea that we shut down all of society because someone is going to drive an automatic car or in this case use complex data would be galactic stupid on our part.
So keep in mind that extensions to SQL do exist in Access to query the multi-value data, but as well pointed out here those underlying tables are not exposed. However, as noted, exposing such tables would STILL REQUIRE one to not change or mess with cascade delete since that feature is required TO MAINTAIN A INTERSECTION OF FEATURES and a CORRECT MATH relational model between the complex data model (xml) and that of using two related tables to represent such data.
In other words, you can use related tables to represent the complex data model IF YOU REMOVE the ability of users to play with the referential integrity options. The RI options MUST remain as set in those hidden tables else such data will not be able to make the trip BACK to the XML or complex data model of which it was consumed from.
As noted, in regards to users being taught how gasoline reacts with oxygen for that of learning to drive a car, or using a word processor and being forced to learn a relational model and expose the underlying tables makes little sense here.
However, the points made here in regards to such tables being exposed are legitimate concerns.
The REAL problem is SQL server and Oracle etc. cannot consume or represent that complex data WHILE ACCESS CAN CONSUME such data.
As noted, the complex data ship has LONG ago sailed! XML, soap, and the basic technologies of the internet are based on this complex data model.
In effect, SQL server, Oracle and most databases cannot that consume this multi-value data represent it without users having to create and model such data in a relational fashion is a BIG shortcoming of SQL server etc.
Access stands alone in this ability to consume this data.
So, for anyone who used a smartphone, iPad or the web, you are using basic technologies that are built around using complex data, something that Access now allows.
It is likely that the rest of the industry will have to follow suit given that more and more data is complex in nature. If the database industry does not change, then the mainstream traditional relational database system will NOT be the resting place of such data.
A trend away from storing data in related tables is occurring at a rapid pace right now and products like SharePoint, or even Google docs is proof of this concept. So Access is only reacting to market pressures and it is likely that other database vendors will have to follow suit or simply give up on being part of the "fad" called the internet.
XML and complex data structures are STAPLE and fact of our industry right now – this is not an issue we all should run away from, but in fact embrace.
Albert D. Kallal (Access MVP)
Edmonton, Alberta Canada
kallal#msn.com
The technical discussion is interesting. I think the real problem lies in student understanding. Because it is available in Access students will use it, and initially it will probably provide a simple solution to some design problems. The negatives will occur later when they try and use the data. Maybe a simple example demonstrating the problems would persuade some students to avoid using multi-valued fields ? Maybe an example of storing the data in another, more usable format would help ?
Good luck !
Peter Bullard
MS Access does a great job of simplifying database management and abstracting out a lot of complexity. This however makes the learning of dbms concepts a bit difficult. Have you tried using other 'standard' dbms tools like MySQL (or even sqlite). From a learning perspective they may be better.
I know this post is old. But, it's not quite the same as every other post I've seen on this topic. This one has someone making a good case for using Multi Valued Fields...
As someone who is trying who is still trying very hard to get their head around Access, I find the discussion for and against using the Multi Valued Fields incredibly frustrating.
I'm trying to sort through it all, but if everyone is so against them, what is an alternative method? It seems that in every search result I find everyone is either telling you how to use Multi Valued Fields and Controls or telling you how horrible and what a mistake they are. Many people refer to an alternative to them, but nobody says "Here's an example". I'm here to learn about these things. And while I know that this is a simpler concept for a lot of people in these forums, I could really use some examples to take a look at.
I'm at a point where I have to decide which way to go. It would be wonderful to compare examples of using Multi Valued Fields and alternatives and using a control to select multiple values.
Or am I wrong and the functionality of a combobox where you can select multiple items is only available through Access?
I want to address the last of your questions first. There is a way of providing a visual presentation of a parent child relationship. It's called subforms. If you get help about subforms in Access, it will explain the concept.
I have used subforms in a project where I wanted to display the transaction header in a form and the transaction details in a subform. There is nothing to hinder this construct even when the data is stored in two normalized tables.
Of course, this affects the screen, not the database. That's the whole point. Normalization is relevant to storage and retrieval, not to other uses of data.

Database EAV Pros/Cons and Alternatives

I have been looking for a database solution to allow user defined fields and values (allowing an unlimited number). At first glance, EAV seemed like the right fit, but after some reading I am not sure anymore.
What are the pros and cons of EAV?
Is there an alternative database method to allow user defined attributes/fields and values?
This is not to be considered an exhaustive answer, but just a few points on the topic.
Since the question is also tagged with the [sql] tag, let me say that, in general, relational databases aren't particularly suitable for storing data using the EAV model. You can still design an EAV model in SQL, but you will have to sacrifice many advantages that a relational database would give. Not only you won't be able to enforce referential integrity, use SQL data types for values and enforce mandatory attributes, but even the very basic queries can become difficult to write. In fact, to overcome this limitation, several EAV solutions rely on data duplication, instead of joining with related tables, which as you can imagine, has plenty of drawbacks.
If you really require a schemaless design, "allowing an unlimited number of attributes", your best bet is probably to use a NoSQL solution. Even though the weaknesses of EAV relative to relational databases also apply to NoSQL alternatives, you will be offered additional features that are difficult to achieve with conventional SQL databases. For example, usually NoSQL datastores can be scaled much easier than relational databases, simply because they were designed to solve some sort of scalability problem, and they intentionally dropped features that make scaling difficult.
Many cloud computing platforms (such as those offered by Amazon, Google and Microsoft) are featuring datastores based on the EAV model, where an arbitrary number of attributes can be associated with a given entity. If you are considering deploying your application to the cloud, you may consider this both as a business advantage, as well as a technical one, because the strong competition between the big vendors is pushing the value-to-cost ratios to very high levels, by continually pushing up on the features and pushing down the financial and implementation costs.
Have a look at posgtres hstore http://www.postgresql.org/docs/9.0/static/hstore.html
this will do exactly what you want without most of the disadvantages
The Streams Platform proposes the alternative way based on Streams (actually, it's the Domain Model), Fields and Assignments entities.
Is there an alternative database method to allow user defined attributes/fields and values?
One alternative is to change the database schema based on user input: for example when the user wants a new field, then add a corresponding column to the database.

What should one map strings to in a database ORM?

Strings are unbounded, but it seems every normal relational database requires that a column declare its maximum length. This seems to be a rather significant discrepancy and I'm curious how typical ORMs handle this.
Using a 'text' column type would theoretically give you much more string-like storage, but as I understand it text columns are not queryable, or at least not efficiently (non-indexed?).
I'm thinking of using something like NHibernate perhaps, but my ORM needs are relatively simple, so if I can just write it myself it might save some bloat.
SqlServer for instance stores only the actual size of the data. For me, defining a large enough size for strings is sufficient that the user doesn't recognize the limits.
Exmaples:
name of a product, person etc: 500
Paths, Urls etc: 1000
Comments, free text: 2000 or even more
NHibernate does not anything with the size at runtime. You need to use some kind of validator or let the database either cut or throw an exception.
Quote: "my ORM needs are relatively simple". It's hard to say if NHibernate is overkill or not. Data access isn't generally that simple.
As a simple guide of the top of my head, take NHibernate if:
You have a fine-grained or complex domain model. You need to map inheritance.
You want your domain model somewhat independent from the database model.
You need some lazy loading features
You want to be database independent, eg. run it on SqlServer or Oracle
If you think that a class per table is what you need, you don't actually need a ORM.
According to my knowledge, they dont handle it, either you let the ORM define the schema, then:
The ORM will decide the size either defaulted or defined by your config.
Or the schema is not deined by the ORM, then it just has to obey the rules, if you insert too large strings, then you'll get errors from the DB.
I would stay on varcharish types, e.g. varchar2 for oracle or nvarchar for sql server, unless you're dealing with clobs.

MySQL design question - which is better, long tables or multiple databases?

So I have an interesting problem that's been the fruit of lots of good discussion in my group at work.
We have some scientific software producing SQLlite files, and this software is basically a black box. We don't control its table designs, formats, etc. It's entirely conceivable that this black box's output could change, and our design needs to be able to handle that.
The SQLlite files are entire databases which our user would like to query across. There are two ways (we see) of implementing this, one, to create a single database and a backend in Python that appends tables from each database to the master database, and two, querying across separate databases' tables and unifying the results in Python.
Both methods run into trouble when the black box produces alters its table structures, say for example renaming a column, splitting up a table, etc. We have to take this into account, and we've discussed translation tables that translate queries of columns from one table format to another.
We're interested in ease of implementation, how well the design handles a change in database/table layout, and speed. Also, a last dimension is how well it would work with existing Python web frameworks (Django doesn't support cross-database queries, and neither does SQLAlchemy, so we know we are in for a lot of programming.)
If you find yourself querying across databases, you should look into consolidating. Cross-database queries are evil.
If your queries are essentially relegated to individual databases, then you may want to stick with multiple databases, as clearly their separation is necessary.
You cannot accommodate arbitrary changes in a database's schema without categorizing and anticipating that change in some way. In the very best case with nontrivial changes, you can sometimes simply ignore new data or tables, in the worst case, your interpretation of the data will entirely break down.
I've encountered similar issues where users need data pivoted out of a normalized schema. The schema does NOT change. However, their required output format requires a fixed number of hierarchical levels. Thus, although the database design accommodates all the changes they want to make, their chosen view of that data cannot be maintained in the face of their changes. Thus it is impossible to maintain the output schema in the face of data change (not even schema change). This is not to say that it's not a valid output or input schema, but that there are limits beyond which their chosen schema cannot be used. At this point, they have to revise the output contract, the pivoting program (which CAN anticipate this and generate new columns) can then have a place to put the data in the output schema.
My point being: the semantics and interpretation of new columns and new tables (or removal of columns and tables which existing logic may depend on) is nontrivial unless new columns or tables can be anticipated in some way. However, in these cases, there are usually good database designs which eliminate those strategies in the first place:
For instance, a particular database schema can contain any number of tables, all with the same structure (although there is no theoretical reason they could not be consolidated into a single table). A particular kind of table could have a set of columns all similarly named (although this "array" violates normalization principles and could be normalized into a commonkey/code/value schema).
Even in a data warehouse ETL situation, a new column is going to have to be determined whether it is a fact or a dimensional attribute, and then if it is a dimensional attribute, which dimension table it is best assigned to. This could somewhat be automated for facts (obvious candidates would be scalars like decimal/numeric) by inspecting the metadata for unmapped columns, altering the DW table (yikes) and then loading appropriately. But for dimensions, I would be very leery of automating somethings like this.
So, in summary, I would say that schema changes in a good normalized database design are the least likely to be able to be accommodated because: 1) the database design already anticipates and accommodates a good deal of change and flexibility and 2) schema changes to such a database design are unlikely to be able to be anticipated very easily. Conversely, schema changes in a poorly normalized database design are actually more easy to anticipate as shortcomings in the database design are more visible.
So, my question to you is: How well-designed is the database you are working from?
You say that you know that you are in for a lot of programming...
I'm not sure about that. I would go for a quick and dirty solution not a 'generic' solution because generic solutions like the entity attribute value model often have a bad performance. Don't do client side joining (unifying the results) inside your Python code because that is very slow. Use SQL for joining, it is designed for that purpose. Users can also make their own reports with all kind of reporting tools that generate sql statements. You don't have to do everything in your app, just start with solving 80% of the problems, not 100%.
If something breaks because something inside the black box changes you can define views for backward compatibility that keeps your app functioning.
Maybe the scientific software will add a lot of new features and maybe it will change its datamodel because of those new features..? That is possible but then you will have to change your application anyways to take profit from those new features.
It sounds to me as if your problem isn't really about MySQL or SQLlite. It's about the sharing of data, and the contract that needs to exist between the supplier of data and the user of the same data.
To the extent that databases exist so that data can be shared, that contract is fundamental to everything about databases. When databases were first being built, and database theory was first being solidified, in the 1960s and 1970s, the sharing of data was the central purpose in building databases. Today, databases are frequently used where files would have served equally well. Your situation may be a case in point.
In your situation, you have a beggar's contract with your data suppliers. They can change the format of the data, and maybe even the semantics, and all you can do is suck it up and deal wth it. This situation is by no means uncommon.
I don't know the specifics of your situation, so what follows could be way off target.
If it was up to me, I would want to build a database that was as generic, as flexible, and as stable as possible, without losing the essential features of structured and managed data. Maybe, some design like star schema would make sense, but I might adopt a very different design if I were actually in your shoes.
This leaves the problem of extracting the data from the databases you are given, transforming the data into the stable format the central database supports, and loading it into the central database. You are right in guessing that this involves a lot of programming. This process, known as "ETL" in data warehousing texts, is not the simplest of programming challenges.
At least ETL collects all the hard problems in one place. Once you have the data loaded into a database that's built for your needs, and not for the needs of your suppliers, turning the data into valuable information should be relatively easy, at least at the programming or SQL level. There are even OLAP tools that make using the data as simple as a video game. There are challenges at that level, but they aren't the same kind of challenges I'm talking about here.
Read up on data warehousing, and especially data marts. The description may seem daunting to you at first, but it can be scaled down to meet your needs.