SQL vs NoSQL for data that will be presented to a user after multiple filters have been added - sql

I am about to embark on a project for work that is very outside my normal scope of duties. As a SQL DBA, my initial inclination was to approach the project using a SQL database but the more I learn about NoSQL, the more I believe that it might be the better option. I was hoping that I could use this question to describe the project at a high level to get some feedback on the pros and cons of using each option.
The project is relatively straightforward. I have a set of objects that have various attributes. Some of these attributes are common to all objects whereas some are common only to a subset of the objects. What I am tasked with building is a service where the user chooses a series of filters that are based on the attributes of an object and then is returned a list of objects that matches all^ of the filters. When the user selects a filter, he or she may be filtering on a common or subset attribute but that is abstracted on the front end.
^ There is a chance, depending on user feedback, that the list of objects may match only some of the filters and the quality of the match will be displayed to the user through a score that indicates how many of the criteria were matched.
After watching this talk by Martin Fowler (http://www.youtube.com/watch?v=qI_g07C_Q5I), it would seem that a document-style NoSQL database should suit my needs but given that I have no experience with this approach, it is also possible that I am missing something obvious.
Some additional information - The database will initially have about 5,000 objects with each object containing 10 to 50 attributes but the number of objects will definitely grow over time and the number of attributes could grow depending on user feedback. In addition, I am hoping to have the ability to make rapid changes to the product as I get user feedback so flexibility is very important.
Any feedback would be very much appreciated and I would be happy to provide more information if I have left anything critical out of my discussion. Thanks.

This problem can be solved by using two separate pieces of technology. The first is a reasonably well-designed database schema in a modern RDBMS. By modeling the application using the usual principles of normalization, you'll get really good response out of storage for individual CRUD statements.
Searching this schema, as you've surmised, is going to be a nightmare at scale. Don't do it. Instead look into using Solr/Lucene as your full text search engine. Solr's support for dynamic fields means you can add new properties to your documents/objects on the fly and immediately have the ability to search inside your data if you have designed your Solr schema correctly.

I'm not an expert in NoSQL, so I will not be advocating it. However, I have a few points that can help you address your questions regarding the relational database structure.
The first thing I see right away is that you are talking about inheritance (at least conceptually). Your objects inherit from each other, so derived objects have additional attributes. Say you are adding a new type of object: the first thing you need to do (conceptually) is find a base (parent) object type for it that has a subset of its attributes, and then add the new attributes on top (extending the base object type).
Once you get used to thinking this way, the next thing to look at is inheritance mapping patterns for relational databases. I'll borrow terms from Martin Fowler to describe them here.
You can store an inheritance chain in the database in one of three ways:
1 - Single table inheritance: Whole inheritance chain is in one table. So, all new types of objects go into the same table.
Advantages: your search query has only one table to search, which is usually faster than a join, for example.
Disadvantages: table grows faster than with option 2 for example; you have to add a type column that says what type of object is the row; some rows have empty columns because they belong to other types of objects.
2 - Concrete table inheritance: Separate table for each new type of object.
Advantages: if search affects only one type, you search only one table at a time; each table grows slower than in option 1 for example.
Disadvantages: you need to use union of queries if searching several types at the same time.
3 - Class table inheritance: One table for the base type object with its attributes only, additional tables with additional attributes for each child object type. So, child tables refer to the base table with PK/FK relations.
Advantages: all types are present in the base table, so it is easy to search them all together using common attributes.
Disadvantages: base table grows fast because it contains part of child tables too; you need to use join to search all types of objects with all attributes.
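To make the three layouts concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, the two object types (vehicle, book) and their attributes are made up for illustration; the point is only how a search on a common attribute looks under each mapping.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- 1. Single table inheritance: one table, a type column, NULLs for unused attributes
CREATE TABLE item_single (
    id INTEGER PRIMARY KEY,
    item_type TEXT NOT NULL,
    name TEXT NOT NULL,      -- common attribute
    wheel_count INTEGER,     -- only meaningful for vehicles
    page_count INTEGER       -- only meaningful for books
);
INSERT INTO item_single VALUES (1, 'vehicle', 'Truck', 6, NULL), (2, 'book', 'Atlas', NULL, 300);

-- 2. Concrete table inheritance: one complete table per type
CREATE TABLE vehicle_concrete (id INTEGER PRIMARY KEY, name TEXT NOT NULL, wheel_count INTEGER);
CREATE TABLE book_concrete (id INTEGER PRIMARY KEY, name TEXT NOT NULL, page_count INTEGER);
INSERT INTO vehicle_concrete VALUES (1, 'Truck', 6);
INSERT INTO book_concrete VALUES (2, 'Atlas', 300);

-- 3. Class table inheritance: base table plus one child table per type
CREATE TABLE item_base (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE vehicle_child (id INTEGER PRIMARY KEY REFERENCES item_base(id), wheel_count INTEGER);
CREATE TABLE book_child (id INTEGER PRIMARY KEY REFERENCES item_base(id), page_count INTEGER);
INSERT INTO item_base VALUES (1, 'Truck'), (2, 'Atlas');
INSERT INTO vehicle_child VALUES (1, 6);
INSERT INTO book_child VALUES (2, 300);
""")

# Single table: one plain SELECT.
single = conn.execute(
    "SELECT name FROM item_single WHERE name LIKE 'A%' ORDER BY name"
).fetchall()

# Concrete tables: a UNION across every type you want to search.
concrete = conn.execute("""
    SELECT name FROM vehicle_concrete WHERE name LIKE 'A%'
    UNION ALL
    SELECT name FROM book_concrete WHERE name LIKE 'A%'
    ORDER BY name
""").fetchall()

# Class tables: filter on the base table, join children to get type-specific attributes.
class_table = conn.execute("""
    SELECT b.name, v.wheel_count, k.page_count
    FROM item_base b
    LEFT JOIN vehicle_child v ON v.id = b.id
    LEFT JOIN book_child k ON k.id = b.id
    WHERE b.name LIKE 'A%'
""").fetchall()
```

All three queries find the same object; what differs is how many tables each one has to touch.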
Which one to choose?
It's a trade-off, obviously. If you expect to have many types of objects added, I would go with concrete table inheritance, which gives reasonable query and scaling options. Class table inheritance does not seem very friendly to fast queries and scalability. Single table inheritance seems to work better with a small number of types.
Your call, my friend!

May as well make this an answer. I should comment that I'm not strong in NoSQL, so I tend to lean towards SQL.
I'd do this as a three table set. You will see it referred to as entity-attribute-value (or name-value pair) logic on the web...it's a way of handling multiple dynamic attributes for items. Let's say you have a bunch of products and each one has a few attributes.
Prd 1 - a,b,c
Prd 2 - a,d,e,f
Prd 3 - a,b,d,g
Prd 4 - a,c,d,e,f
So here are 4 products and 7 attributes (a through g)...the same theory will work for hundreds of products and thousands of attributes. The standard way of holding this in one table requires the product info along with 7 columns to store the data (in this setup every row leaves at least two of them null). A new attribute means altering the table to add another column and coming up with a script to populate it for existing rows, or just leaving it null for all of them. Not the most fun, and it can be a headache.
The alternative to this is a name-value pair setup. You want a 'header' table to hold the common values amongst your products (like name or price...things that all products always have). In our example above, you will notice that attribute 'a' is used on every record...this means attribute 'a' can be part of the header table as well. We'll call the key column here 'header_id'.
The second table is a reference table that simply stores the attributes that can be assigned to a product and gives each one an ID. We'll call the table attribute, with attr_id as its key. Rather straightforward: each attribute above will be one row.
Quick example:
attr_id, attribute_name, notes
1,b, the length of time the product takes to install
2,c, spare part required
etc...
It's just a list of all of your attributes and what that attribute means. In the future, you will be adding a row to this table to open up a new attribute for each header.
Final table is a mapping table that actually holds the info. You will have your product id, the attribute id, and then the value. Normally called the detail table:
prd1, b, 5 mins
prd1, c, needs spare jack
prd2, d, 'misc text'
prd3, b, 15 mins
See how the data is stored as product key, value label, value? Any future product added can have any combination of any attributes stored in this table. Adding new attributes is adding a new line to the attribute table and then populating the details table as needed.
I believe there is a wiki for it too... http://en.wikipedia.org/wiki/Entity-attribute-value_model
After this, it's simply figuring out the best methodology to pivot out your data (I'd recommend Postgres as an opensource db option here)
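For the filtering use case in the question, you may not even need a pivot. Below is a rough sketch in Python with the built-in sqlite3 module (SQLite is just a stand-in here, and the search helper and its require_all flag are my own invention). It reuses the header/attribute/detail layout above and counts how many of the requested filters each product matches, which also covers the "score" idea from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE header (header_id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE attribute (attr_id INTEGER PRIMARY KEY, attribute_name TEXT NOT NULL UNIQUE);
CREATE TABLE detail (
    header_id INTEGER NOT NULL REFERENCES header(header_id),
    attr_id INTEGER NOT NULL REFERENCES attribute(attr_id),
    value TEXT NOT NULL,
    PRIMARY KEY (header_id, attr_id)  -- at most one value per product and attribute
);
INSERT INTO header VALUES (1, 'prd1'), (2, 'prd2'), (3, 'prd3');
INSERT INTO attribute VALUES (1, 'b'), (2, 'c'), (3, 'd');
INSERT INTO detail VALUES
    (1, 1, '5 mins'), (1, 2, 'needs spare jack'),
    (2, 3, 'misc text'),
    (3, 1, '15 mins'), (3, 3, 'misc text');
""")


def search(filters, require_all=True):
    """Return (product name, number of filters matched), best match first."""
    if not filters:
        return []
    # One clause per filter; names and values are bound as parameters, never interpolated.
    clauses = " OR ".join("(a.attribute_name = ? AND d.value = ?)" for _ in filters)
    params = [part for pair in filters.items() for part in pair]
    sql = f"""
        SELECT h.name, COUNT(*) AS score
        FROM detail d
        JOIN attribute a ON a.attr_id = d.attr_id
        JOIN header h ON h.header_id = d.header_id
        WHERE {clauses}
        GROUP BY h.header_id, h.name
        HAVING COUNT(*) >= ?
        ORDER BY score DESC, h.name
    """
    # Require every filter to match, or accept any product matching at least one.
    params.append(len(filters) if require_all else 1)
    return conn.execute(sql, params).fetchall()


all_match = search({"b": "15 mins", "d": "misc text"})
partial = search({"b": "15 mins", "d": "misc text"}, require_all=False)
```

With require_all=False the same query returns partially matching products too, ordered by how many criteria they hit.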

Related

Normalization of SQL tables

I am creating some tables for a project and just realized that many of the tables have the same structure (Id, Name), but are used for different things. How far should I go with normalization? Should I build them all into one table or keep them apart for better understanding? How does it affect performance?
Example 1:
TableObjectType (used for types of objects in the log)
Id Name
1 User
2 MobileDevice
3 SIMcard
TableAction (used for types of actions in a log)
Id Name
1 Create
2 Edit
3 Delete
TableStatus (used for a status a device can have)
Id Name
1 Stock
2 Lost
3 Repair
4 Locked
Example 2:
TableConstants
Id Name
1 User
2 MobileDevice
3 SIMcard
4 Create
5 Edit
6 Delete
7 Stock
8 Lost
9 Repair
10 Locked
Ignore the naming, as my tables have other names, but I am using these for clarification.
The downside of using one table for all constants is that if I want to add more later on, they don't really come in "groups", but on the other hand in SQL I should never rely on a specific order when I use the data.
Just because a table has a similar structure to another doesn't mean it stores the data describing identical entities.
There are some obvious reasons not to go with example 2.
Firstly, you may want to limit the values in your ObjectTypeID column to values that are valid object types. The obvious way to do this is to create a foreign key relationship to the ObjectType table. Creating a similar check on TableConstants would be much harder (in most database engines, you can't use the foreign key constraint in this way).
Secondly, it makes the database self describing - someone who is inspecting the schema will understand that "object type" is a meaningful concept in your business domain. This is important for long-lived applications, or applications with large development teams.
Thirdly, you often get specific business logic with those references - for instance, "status" often requires some logic to say "you can't modify a record in status LOCKED". This business logic often requires storing additional data attributes - that's not really possible with a "Constants" table.
Fourthly - "constants" have to be managed. If you have a large schema, very quickly people start to re-use constants to reflect slightly different concepts. Your "create" constant might get applied to a table storing business requests as well as your log events. This becomes almost unintelligible - and if the business decides log events don't refer to "create" but "write", your business transactions all start to look wrong.
What you could do is to use an ENUM (many database engines support this) to model attributes that don't have much logic beyond storing a name. This removes risks 1, 2 and 4, but does mean your logic is encoded in the database schema - adding a new object type is a schema change, not a data insertion.
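The first point is easy to see in a quick sketch. This uses Python's sqlite3 module as a stand-in engine; the LogEntry and LogEntryLoose tables are hypothetical, the lookup tables follow the question's examples.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when this is on
conn.executescript("""
-- Example 1: one lookup table per concept
CREATE TABLE ObjectType (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
CREATE TABLE LogEntry (
    Id INTEGER PRIMARY KEY,
    ObjectTypeId INTEGER NOT NULL REFERENCES ObjectType(Id)
);
INSERT INTO ObjectType VALUES (1, 'User'), (2, 'MobileDevice'), (3, 'SIMcard');

-- Example 2: a single table for all constants
CREATE TABLE TableConstants (Id INTEGER PRIMARY KEY, Name TEXT NOT NULL);
CREATE TABLE LogEntryLoose (
    Id INTEGER PRIMARY KEY,
    ObjectTypeId INTEGER NOT NULL REFERENCES TableConstants(Id)
);
INSERT INTO TableConstants VALUES (1, 'User'), (2, 'MobileDevice'), (3, 'SIMcard'), (7, 'Stock');
""")


def accepts(sql):
    """Return True if the statement succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


# 7 is not an object type, so the dedicated lookup table rejects it...
strict_accepts_status = accepts("INSERT INTO LogEntry (ObjectTypeId) VALUES (7)")
# ...but in the constants table 7 is 'Stock', a status, and the foreign key is satisfied.
loose_accepts_status = accepts("INSERT INTO LogEntryLoose (ObjectTypeId) VALUES (7)")
```

The constants table happily lets a status be recorded as an object type, which is exactly the kind of mistake the separate tables prevent.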
I think that generally it is better to keep the tables apart (it helps documentation too). In some particular cases (the choice is yours) you could merge all similar tables into one (adding another column, such as TAB_TYPE, to distinguish them): this could give you some advantage when developing apps and reduces the overall number of tables (if that is a problem for you).
If they are all relatively small tables (with not many records), you should not have performance problems.

SQL: Best way to conditionally relate multiple tables to a single table based on row value in the main table

In a nutshell here's the situation, we have a database that is used to build a hierarchy of "locations". (Example: Street Address > Building 1 > First Floor > Room).
Each of the locations are stored in a table. Each location can be of a different "type". The types are defined in another table. (We use the types of locations to restrict what locations can be added to a location).
Here's the quandary we are facing: We need to be able to store different types of information for different types of locations. For example, a location type of "building" may need to have the address stored whereas a location type of "room" may need to have dimensions or paint color stored.
Obviously we could create a table for each location type we define to hold the properties required for the particular location type and then use application logic to query the appropriate table to pull in a particular location's additional information. Is there a more elegant or practical way to accomplish this relationally in the database without having to rely on application logic?
Thanks!
The relationally pure way to do this would be to implement your initial suggestion i.e. have a separate table such as BUILDING_LOCATIONS that holds the attributes that are only applicable to this type of location. A different [TYPE]_LOCATIONS table would be created for each type of location that has its own attributes. In this way you could use standard database constraint functionality to ensure the integrity of the data in these tables.
Another method would be to add a series of nullable columns to the LOCATIONS table such as BUILDING_ADDRESS and ROOM_DIMENSIONS. This is not relationally pure as it means null values can exist in this table. However, you can still use standard database functionality to ensure the integrity of the data. It can be a bit more convoluted if certain values are mandatory in certain situations. Also, if there are many types of location with many differing attributes the number of columns in the table can become unwieldy.
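As a rough illustration of the "mandatory in certain situations" part, here is a sketch with CHECK constraints, run through Python's sqlite3 module. The column names and the rule that only buildings carry an address are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE locations (
    id INTEGER PRIMARY KEY,
    parent_id INTEGER REFERENCES locations(id),
    location_type TEXT NOT NULL CHECK (location_type IN ('building', 'floor', 'room')),
    name TEXT NOT NULL,
    building_address TEXT,
    room_dimensions TEXT,
    -- the address is mandatory for buildings and not allowed for anything else
    CHECK ((location_type = 'building' AND building_address IS NOT NULL)
        OR (location_type <> 'building' AND building_address IS NULL)),
    -- dimensions are optional, but only rooms may have them
    CHECK (room_dimensions IS NULL OR location_type = 'room')
)
""")


def accepts(sql):
    """Return True if the insert succeeds, False if a constraint rejects it."""
    try:
        conn.execute(sql)
        return True
    except sqlite3.IntegrityError:
        return False


valid_building = accepts(
    "INSERT INTO locations (id, location_type, name, building_address) "
    "VALUES (1, 'building', 'Building 1', '1 Main Street')"
)
valid_room = accepts(
    "INSERT INTO locations (id, parent_id, location_type, name, room_dimensions) "
    "VALUES (2, 1, 'room', 'Room 101', '4m x 5m')"
)
building_without_address = accepts(
    "INSERT INTO locations (location_type, name) VALUES ('building', 'Building 2')"
)
room_with_address = accepts(
    "INSERT INTO locations (location_type, name, building_address) "
    "VALUES ('room', 'Room 102', '2 Side Street')"
)
```

It works, but you can see how the constraint list grows with every new location type, which is the unwieldiness mentioned above.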
Another method is the Entity-Attribute-Value model. Generally, this is to be avoided if at all possible. It is not relationally pure, as your column values are now no longer defined over domains, and it is extremely difficult, if not impossible, to ensure the integrity of the data. Any real attempt to do so will require a lot of bespoke coding (which needs to be carefully implemented to cater for things like concurrency control that database constraints give you for free) as you cannot use standard database constraints. However, if you are just interested in storing values for information and not doing anything with them you could use this method.
The EAV method does have a danger that because it appears so easy to add attributes to an entity, it becomes the default way of doing so. It is then used to add attributes for which vital processing is dependent and, because you cannot ensure the integrity of the data using this method, you find the values being used are meaningless and the whole logical basis for the processing is destroyed.

How to model a mutually exclusive relationship in SQL Server

I have to add functionality to an existing application and I've run into a data situation that I'm not sure how to model. I am being restricted to the creation of new tables and code. If I need to alter the existing structure I think my client may reject the proposal.. although if its the only way to get it right this is what I will have to do.
I have an Item table that can be linked to any number of tables, and these tables may increase over time. The Item can only be linked to one other table, but the record in the other table may have many items linked to it.
Examples of the tables/entities being linked to are Person, Vehicle, Building, Office. These are all separate tables.
Examples of Items are Pen, Stapler, Cushion, Tyre, A4 Paper, Plastic Bag, Poster, Decoration.
For instance a Poster may be allocated to a Person or Office or Building. In the future if they add a Conference Room table it may also be added to that.
My initial thoughts are:
Item
{
ID,
Name
}
LinkedItem
{
ItemID,
LinkedToTableName,
LinkedToID
}
The LinkedToTableName field will then allow me to identify the correct table to link to in my code.
I'm not overly happy with this solution, but I can't quite think of anything else. Please help! :)
Thanks!
It is not a good practice to store table names as column values. This is a bad hack.
There are two standard ways of doing what you are trying to do. The first is called single-table inheritance. This is easily understood by ORM tools but trades off some normalization. The idea is, that all of these entities - Person, Vehicle, whatever - are stored in the same table, often with several unused columns per entry, along with a discriminator field that identifies what type the entity is.
The discriminator field is usually an integer type, that is mapped to some enumeration in your code. It may also be a foreign key to some lookup table in your database, identifying which numbers correspond to which types (not table names, just descriptions).
The other way to do this is multiple-table inheritance, which is better for your database but not as easy to map in code. You do this by having a base table which defines some common properties of all the objects - perhaps just an ID and a name - and all of your "specific" tables (Person etc.) use the base ID as a unique foreign key (usually also the primary key).
In the first case, the exclusivity is implicit, since all entities are in one table. In the second case, the relationship is between the Item and the base entity ID, which also guarantees uniqueness.
Note that with multiple-table inheritance, you have a different problem - you can't guarantee that a base ID is used by exactly one inheritance table. It could be used by several, or not used at all. That is why multiple-table inheritance schemes usually also have a discriminator column, to identify which table is "expected." Again, this discriminator doesn't hold a table name, it holds a lookup value which the consumer may (or may not) use to determine which other table to join to.
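One common way to make the database enforce the discriminator is a composite foreign key, sketched below with Python's sqlite3 module. This is an illustration under my own naming (Entity, EntityType, OwnerEntityID), not something taken from the question's schema, and it only guarantees that a base ID is used by at most one sub-type table, not that it is used at all.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when this is on
conn.executescript("""
CREATE TABLE Entity (
    EntityID INTEGER PRIMARY KEY,
    EntityType TEXT NOT NULL CHECK (EntityType IN ('person', 'office')),
    UNIQUE (EntityID, EntityType)  -- target of the composite foreign keys below
);
CREATE TABLE Person (
    EntityID INTEGER PRIMARY KEY,
    EntityType TEXT NOT NULL DEFAULT 'person' CHECK (EntityType = 'person'),
    FullName TEXT NOT NULL,
    FOREIGN KEY (EntityID, EntityType) REFERENCES Entity (EntityID, EntityType)
);
CREATE TABLE Office (
    EntityID INTEGER PRIMARY KEY,
    EntityType TEXT NOT NULL DEFAULT 'office' CHECK (EntityType = 'office'),
    RoomNumber TEXT NOT NULL,
    FOREIGN KEY (EntityID, EntityType) REFERENCES Entity (EntityID, EntityType)
);
CREATE TABLE Item (
    ID INTEGER PRIMARY KEY,
    Name TEXT NOT NULL,
    OwnerEntityID INTEGER REFERENCES Entity (EntityID)  -- at most one owner per item
);
INSERT INTO Entity VALUES (1, 'person'), (2, 'office');
INSERT INTO Person (EntityID, FullName) VALUES (1, 'Ann');
INSERT INTO Office (EntityID, RoomNumber) VALUES (2, 'B-204');
INSERT INTO Item VALUES (1, 'Poster', 2), (2, 'Stapler', 1), (3, 'Pen', 1);
""")

# Entity 1 is registered as a person, so it cannot also appear in Office:
# the CHECK forces EntityType = 'office' and (1, 'office') does not exist in Entity.
try:
    conn.execute("INSERT INTO Office (EntityID, RoomNumber) VALUES (1, 'C-1')")
    reuse_rejected = False
except sqlite3.IntegrityError:
    reuse_rejected = True

items_of_ann = conn.execute("""
    SELECT i.Name
    FROM Item i
    JOIN Person p ON p.EntityID = i.OwnerEntityID
    WHERE p.FullName = 'Ann'
    ORDER BY i.Name
""").fetchall()
```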
Multiple-table inheritance is a closer match to your current schema, so I would recommend going with that unless you need to use this with Linq to SQL or a similar ORM.
See here for a good detailed tutorial: Implementing Table Inheritance in SQL Server.
Find something common to Person, Vehicle, Building, Office. For the lack of a better term I have used Entity. Then implement super-type/sub-type relationship between the Entity and its sub-types. Note that the EntityID is a PK and a FK in all sub-type tables. Now, you can link the Item table to the Entity (owner).
In this model, one item can belong to only one Entity; one Entity can have (own) many items.
Your link table is OK.
The trouble you will have is that you will need to generate dynamic SQL at runtime. Parameterized SQL does not typically allow the objects in the FROM list to be parameters.
If you want to avoid this, you may be able to denormalize a little, say by creating a table to hold the id (assuming the ids are unique across the other tables), a type_id representing which table is the source, and a generated description, e.g. the name value from the initial record.
You would trigger the creation of this denormalized list when the base info is modified, and you could use it for generalized queries, resorting to your dynamic queries only when needed at runtime.

Inheritance in Database Design

I am designing a new laboratory database with MANY types of my main entities.
The table for each entity will hold fields common to ALL types of that entity (entity_id, created_on, created_by, etc). I will then use concrete inheritance (separate table for each unique set of attributes) to store all remaining fields.
I believe that this is the best design for the standard types of data which come through the laboratory daily. However, we often have special samples, which are often accompanied by specific values the originator wants stored.
Question: How should I model special (non-standard) types of entities?
Option 1: Use entity-value for special fields
One table (entity_id, attribute_name, numerical_value) would hold all data for any special entity.
+ Fewer tables.
- Cannot enforce requiring a particular attribute.
- Must convert (pivot) rows to columns which is inefficient.
Option 2: Strict concrete inheritance.
Create separate table for each separate special case.
+ Follows in accordance with all other rules
- Overhead of many tables with only a few rows.
Option 3: Concrete inheritance with special tables under a different user.
Put all special tables under a different user.
+ Keeps all special and standard tables separate.
+ Easier to search for common standard table in a list without searching through all special tables.
- Overhead of many tables with only a few rows.
Actually the design you described (common table plus subtype-specific tables) is called Class Table Inheritance.
Concrete Table Inheritance would have all the common attributes duplicated in the subtype tables, and you'd have no supertype table as you do now.
I'm strongly against EAV. I consider it an SQL antipattern. It may seem like an elegant solution because it requires fewer tables, but you're setting yourself up for a lot of headache later. You identified a couple of the disadvantages, but there are many others. IMHO, EAV is used appropriately only if you absolutely must not create a new table when you introduce a new subtype, or if you have an unbounded number of subtypes (e.g. users can define new attributes ad hoc).
You have many subtypes, but still a finite number of them, so if I were doing this project I'd stick with Class Table Inheritance. You may have few rows of each subtype, but at least you have some assurance that all rows in each subtype have the same columns, you can use NOT NULL if you need to, you can use SQL data types, you can use referential integrity constraints, etc. From a relational perspective, it's a better design than EAV.
One more option that you didn't mention is called Serialized LOB. That is, add a BLOB column for a semi-structured collection of custom attributes. Store XML, YAML, JSON, or your own DSL in that column. You won't be able to parse individual attributes out of that BLOB easily with SQL, you'll have to fetch the whole BLOB back into your application and extract individual attributes in code. So in some ways it's less convenient. But if that satisfies your usage of the data, then there's nothing wrong with that.
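A minimal sketch of the Serialized LOB idea, using Python's sqlite3 and json modules. The sample table, the custom_attributes column and the attribute names are invented for the example, and JSON is just one of the formats mentioned above.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
CREATE TABLE sample (
    entity_id INTEGER PRIMARY KEY,
    created_by TEXT NOT NULL,
    custom_attributes TEXT NOT NULL DEFAULT '{}'  -- serialized JSON, opaque to SQL here
)
""")

# A special sample carries extra values; a standard one just gets the default.
conn.execute(
    "INSERT INTO sample VALUES (?, ?, ?)",
    (1, "alice", json.dumps({"ph": 7.2, "originator_code": "X-17"})),
)
conn.execute("INSERT INTO sample (entity_id, created_by) VALUES (?, ?)", (2, "bob"))


def load_sample(entity_id):
    """Fetch one row and decode the custom attributes in application code."""
    row = conn.execute(
        "SELECT entity_id, created_by, custom_attributes FROM sample WHERE entity_id = ?",
        (entity_id,),
    ).fetchone()
    if row is None:
        return None
    return {"entity_id": row[0], "created_by": row[1], "custom": json.loads(row[2])}


special = load_sample(1)
standard = load_sample(2)
```

Note that the filtering on custom values happens in the application, after the row has been fetched, which is the inconvenience described above.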
I think it depends mostly on how you want to use this data.
First of all, I don't really see the benefit of option 3 over option 2. I think separating the special tables in another schema will make your application harder to maintain, especially if later on commonalities are found between 'special values'.
As another option I would say:
- Store the special values in an XML fragment (or blob). Most databases have ability to query on XML structures these days, so without the need for many extra tables, you would keep your flexibility for a small performance hit.
If you put all the special values in one table, you get a very sparse table. Most normal DBMSes cannot handle this very well, but there are some implementations that specialize in this. You could benefit from that.
Do you often need to query the key-value pairs? If you basically access that table through its entry_id, I think having a key-value table is not a bad design. An extra index on the key column might even help you when you do need to query for special values. If you build an application layer on top of your database, the key-value table will map onto a Map or Hash structure, which can also easily be used.
It also depends on the different types of values you want to store. If there are many different types, that need to be easily accessed (instead of being serialized/deserialized to XML/Character-String) you might want to store the type in a separate column, but that will usually lead to a very complicated design.
Hope this helps (a little bit).
-Maarten
http://en.wikipedia.org/wiki/Entity-Attribute-Value_model
Suggest you read about the problems with entity value tables before deciding to use them.
Oracle can deal with sparsely filled tables quite well. I think you can use an approach similar to the one Salesforce uses. They use tables with a lot of columns and create columns when needed. You can index those columns much better than an EAV model.
So it is flexible but it performs better than an EAV model.
Read: Ask Tom 1, Ask Tom 2, High Scalability and SalesForce.
The "Option 1" pattern is also called the "Universal Relation". At first look it seems like a shortcut that avoids potentially difficult data modeling. It trades effortless data modeling for not being able to do a simple select, update, or delete without dramatically more effort than it would take on a more conventional data model with multiple tables.

What is the preferred way to store custom fields in a SQL database?

My friend is building a product to be used by different independent medical units.
The database stores a vast collection of measurements taken at different times, like the temperature, blood pressure, etc...
Let us assume these are held in a table called exams with columns temperature, pressure, etc... (as well as id, patient_id and timestamp). Most of the measurements are stored as floats, but some are of other types (strings, integers...)
While many of these measurements are handled by their product, it needs to allow the different medical units to record and process other custom measurements. A very nifty UI allows the administrator to edit these customs fields, specify their name, type, possible range of values, etc...
He is unsure as to how to store these custom fields.
He is leaning towards a separate table (say a table custom_exam_data with fields like exam_id, custom_field_id, float_value, string_value, ...)
I worry that this will make searching both more difficult to achieve and less efficient.
I am leaning towards modifying the exam table directly (while avoiding conflicts on column names with some scheme like prefixing all custom fields with an underscore or naming them custom_1, ...)
He worries about modifying the database dynamically and having different schemas for each medical unit.
Hopefully some people with more experience can weigh in on this issue.
Notes:
He is using Ruby on Rails, but I think this question is pretty much framework agnostic, except for the fact that he is only looking for solutions in SQL databases.
I simplified the problem a bit, since the custom fields need to be available for more than one table, but I believe this doesn't really impact the direction to take.
(added) A very generic reporting module will need to search, sort, generate stats, etc.. of this data, so it is required that this data be stored in the columns of the appropriate type
(added) User inputs will be filtered, for the standard fields as well as for the custom fields. For example, numbers will be checked within a given range (can't have a temperature of -12 or +444), etc... Thus, conversion to the appropriate SQL type is not a problem.
I've had to deal with this situation many times over the years, and I agree with your initial idea of modifying the DB tables directly, and using dynamic SQL to generate statements.
Creating string UserAttribute or Key/Value columns sounds appealing at first, but it leads to the inner-platform effect where you end up having to re-implement foreign keys, data types, constraints, transactions, validation, sorting, grouping, calculations, et al. inside your RDBMS. You may as well just use flat files and not SQL at all.
SQL Server provides INFORMATION_SCHEMA views that let you query table schemas at runtime, and ordinary DDL statements (CREATE TABLE, ALTER TABLE) let you create and modify them. This gives you full type checking, constraints, transactions, calculations, and everything else you need already built in; don't reinvent it.
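Here is a rough sketch of what adding a custom field as a real column can look like. It uses Python's sqlite3 module rather than SQL Server, so PRAGMA table_info stands in for an INFORMATION_SCHEMA.COLUMNS query; the add_custom_column helper, the custom_ prefix and the type mapping are assumptions for illustration. Identifiers cannot be bound as parameters, so the field name is validated before it goes anywhere near the statement.

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE exams (id INTEGER PRIMARY KEY, patient_id INTEGER NOT NULL, temperature REAL)"
)

# Map the UI's field types to column types; anything else is refused.
ALLOWED_TYPES = {"float": "REAL", "integer": "INTEGER", "string": "TEXT"}
# Deliberately strict: lowercase letters, digits and underscores only.
IDENTIFIER = re.compile(r"[a-z][a-z0-9_]{0,30}")


def column_names():
    """List the current columns of exams (second field of each table_info row)."""
    return [row[1] for row in conn.execute("PRAGMA table_info(exams)")]


def add_custom_column(field_name, field_type):
    """Add a typed column named custom_<field_name> to exams."""
    if not IDENTIFIER.fullmatch(field_name):
        raise ValueError(f"invalid field name: {field_name!r}")
    if field_type not in ALLOWED_TYPES:
        raise ValueError(f"unsupported field type: {field_type!r}")
    column = f"custom_{field_name}"
    if column in column_names():
        raise ValueError(f"column already exists: {column}")
    conn.execute(f"ALTER TABLE exams ADD COLUMN {column} {ALLOWED_TYPES[field_type]}")
    return column


glucose_column = add_custom_column("glucose", "float")
conn.execute("INSERT INTO exams (patient_id, temperature, custom_glucose) VALUES (1, 37.1, 5.4)")
high_glucose = conn.execute("SELECT patient_id FROM exams WHERE custom_glucose > 5").fetchall()

try:
    add_custom_column("x; DROP TABLE exams", "float")
    injection_blocked = False
except ValueError:
    injection_blocked = True
```

Once the column exists, the custom field is searched and sorted like any built-in one.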
It's strange that so many people come up with ad-hoc solutions for this when there's a well-documented pattern for it:
Entity-Attribute-Value (EAV) Model
Two alternatives are XML and Nested Sets. XML is easier to manage but generally slow. Nested Sets usually require some type of proprietary database extension to do without making a mess, like CLR types in SQL Server 2005+. They violate first-normal form, but are nevertheless the fastest-performing solution.
Microsoft Dynamics CRM achieves this by altering the database design each time a change is made. Nasty, I think.
I would say a better option would be to consider an attribute table. Even though these are often frowned upon, it gives you the flexibility you need, and you can always create views using dynamic SQL to pivot the data out again. Just make sure you always use LEFT JOINs and FKs when creating these views, so that the Query Optimizer can do its job better.
I have seen a use of your friend's idea in a commercial accounting package. The table was split into two, first contained fields solely defined by the system, second contained fields like USER_STRING1, USER_STRING2, USER_FLOAT1 etc. The tables were linked by identity value (when a record is inserted into the main table, a record with same identity is inserted into the second one). Each table that needed user fields was split like that.
Well, whenever I need to store some unknown type in a database field, I usually store it as String, serializing it as needed, and also store the type of the data.
This way, you can have any kind of data, working with any type of database.
I would be inclined to store the measurement in the database as a string (varchar) with another column identifying the measurement type. My reasoning is that it will presumably come from the UI as a string, and casting to any other datatype may introduce corruption before the user input gets stored.
The downside is that when you go to filter result-sets by some measurement metric you will still have to perform a casting but at least the storage and persistence mechanism is not introducing corruption.
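To show why the cast matters when filtering, here is a small sketch with Python's sqlite3 module. The measurement table and the sample values are made up; the point is that comparing the stored strings directly sorts them as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE measurement (
    exam_id INTEGER NOT NULL,
    measurement_type TEXT NOT NULL,
    value TEXT NOT NULL          -- everything is stored as a string
);
INSERT INTO measurement VALUES
    (1, 'weight_kg', '9.5'),
    (2, 'weight_kg', '37.2'),
    (3, 'weight_kg', '10.1');
""")

# Without a cast the comparison is textual, so '9.5' > '10' because '9' > '1'.
as_text = conn.execute("""
    SELECT exam_id FROM measurement
    WHERE measurement_type = 'weight_kg' AND value > '10'
    ORDER BY exam_id
""").fetchall()

# With a cast the filter behaves numerically (at the cost of converting every row).
as_number = conn.execute("""
    SELECT exam_id FROM measurement
    WHERE measurement_type = 'weight_kg' AND CAST(value AS REAL) > 10
    ORDER BY exam_id
""").fetchall()
```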
I can't tell you the best way but I can tell you how Drupal achieves a sort of schemaless structure while still using the standard RDBMSs available today.
The general idea is that there's a schema table with a list of fields. Each row really only has two columns, the 'table':String column and the 'column':String column. For each of these columns it actually defines a whole table with just an id and the actual data for that column.
The trick really is that when you are working with the data it's never more than one join away from the bundle table that lists all the possible columns so you end up not losing as much speed as you might otherwise think. This will also allow you to expand much farther than just a few medical companies unlike the custom_ prefix you were proposing.
MySQL is very fast at returning row data for short rows with few columns. In this way this scheme ends up fairly quick while allowing you lots of flexibility.
As to search, my suggestion would be to index the page content instead of the database content. Use Solr to parse through rendered pages and hold links to the actual page instead of trying to search through the database using clever SQL.
Define two new tables: custom_exam_schema and custom_exam_data.
custom_exam_data has an exam_id column, plus an additional column for every custom attribute.
custom_exam_schema would have a row to describe how to interpret each of the columns of the custom_exam_data table. It would have columns like name, type, minValue, maxValue, etc.
So, for example, to create a custom field to track the number of fingers a person has, you would add ('fingerCount', 'number', 0, 10) to custom_exam_schema and then add a column named fingerCount to the custom_exam_data table.
Someone might say it's bad to change the database schema at run time, but I'd argue that configuring these custom fields is part of set up and won't happen too often. Still, this method lets you handle changes at any time and doesn't risk messing around with your core table schemas.
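A minimal sketch of this two-table idea, again with Python's sqlite3 module. The define_field and record helpers are hypothetical, only two field types are handled, and the new column is added to custom_exam_data so the core exam table is left alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE custom_exam_schema (
    name TEXT PRIMARY KEY,
    type TEXT NOT NULL,
    minValue REAL,
    maxValue REAL
);
CREATE TABLE custom_exam_data (exam_id INTEGER PRIMARY KEY);
""")

SQL_TYPES = {"number": "REAL", "text": "TEXT"}


def define_field(name, field_type, min_value=None, max_value=None):
    """Describe the field in custom_exam_schema and add its column to custom_exam_data."""
    if not name.isidentifier():
        raise ValueError(f"invalid field name: {name!r}")
    sql_type = SQL_TYPES[field_type]
    conn.execute(
        "INSERT INTO custom_exam_schema VALUES (?, ?, ?, ?)",
        (name, field_type, min_value, max_value),
    )
    conn.execute(f'ALTER TABLE custom_exam_data ADD COLUMN "{name}" {sql_type}')


def record(exam_id, name, value):
    """Validate a value against the field's range, then store it for the exam."""
    row = conn.execute(
        "SELECT minValue, maxValue FROM custom_exam_schema WHERE name = ?", (name,)
    ).fetchone()
    if row is None:
        raise KeyError(name)
    low, high = row
    if (low is not None and value < low) or (high is not None and value > high):
        raise ValueError(f"{name}={value} is outside [{low}, {high}]")
    # The name was validated when the field was defined, so quoting it here is safe.
    conn.execute("INSERT OR IGNORE INTO custom_exam_data (exam_id) VALUES (?)", (exam_id,))
    conn.execute(f'UPDATE custom_exam_data SET "{name}" = ? WHERE exam_id = ?', (value, exam_id))


define_field("fingerCount", "number", 0, 10)
record(1, "fingerCount", 10)

try:
    record(2, "fingerCount", 11)
    out_of_range_rejected = False
except ValueError:
    out_of_range_rejected = True

stored = conn.execute("SELECT fingerCount FROM custom_exam_data WHERE exam_id = 1").fetchone()
```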
Let's say that your friend's database has to store data values from multiple sources such as demographic values, diagnoses, interventions, physionomic values, physiologic exam values, hospitalisation values, etc.
He might also have to define choices. Let's say his database is missing the patient's race and the unit staff need it (some diseases are more or less likely depending on race); they might want to use a drop-down with several choices.
I would propose using another table to hold these choices, or simply a "Custom_field_choices" table, which in the end is exactly the same thing under a different name.
Considering that the database :
- needs to be flexible
- that data from multiple tables can be added and be customized
- that you might want to keep the integrity of the main structure of your database for distribution and uniformity purpose
- that data MUST have a limit and alarms and warnings
- that data must have units ( 10 kg or 10 pounds) ?
- that data can have a selection of choices
- that data can be with different rights (from simple user to admin)
- that these data might be needed to generate reports without modifying the code (automation)
- that these data might be needed to make cross reference analysis within the system without modifying the code
the custom table would be my solution, modifying each table would end up being too risky.
I would store those custom fields in a table where each record (dataType, dataValue, dataUnit) takes one row. So there would be a one-to-many relation from one sample to its data. You can also create a table to record all the kinds of custom types you would use. For example:
create table DataType
(
id int primary key,
name varchar(100) not null unique,
description text,
uri varchar(255) -- can be used for an ONTOLOGY
);
create table DataRecord
(
id int primary key,
sample_id int not null, -- reference to the sample
dataType_id int not null references DataType(id),
value varchar(100), -- the value as a string
unit varchar(50) -- g, mg/ml, etc. (could also link to a table describing the units, just like DataType)
);