How to use the BLOB pattern as an alternative to EAV

In the literature and on forums, people often say that EAV is evil and suggest the Serialized LOB pattern as an alternative, but they rarely explain concretely how to use it.
I wonder how to overcome the problems that come with using the BLOB pattern as an alternative to EAV.
Let's assume that we store all custom fields of the entity in a custom_fields column as a string, for example as JSON, something like this:
{"customField1": "value1", "customField2": "value2", …,
"customFieldN": "valueN"}
Let's assume the table subscribers has fields:
id, email, custom_fields (where all custom fields are stored)
How to overcome the following problems:
1. How do we search by separate custom fields, for example, to find entities matching customField1 = value1 AND customField2 = value2?
2. How do we maintain data integrity? For example, if we delete a custom field definition for the entity, how do we delete all values of that field from the entity?
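For problem 1, modern engines can query and even index into the JSON blob directly. Here is a minimal sketch using Python's sqlite3, assuming a SQLite build that includes the JSON1 functions (MySQL and PostgreSQL have analogous JSON operators); the table follows the question, the data is invented:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE subscribers (id INTEGER PRIMARY KEY, email TEXT, custom_fields TEXT)")
conn.execute("INSERT INTO subscribers (email, custom_fields) VALUES (?, ?)",
             ("a@example.com", '{"customField1": "value1", "customField2": "value2"}'))
conn.execute("INSERT INTO subscribers (email, custom_fields) VALUES (?, ?)",
             ("b@example.com", '{"customField1": "other"}'))

# Search by individual custom fields inside the JSON blob:
rows = conn.execute(
    """SELECT email FROM subscribers
       WHERE json_extract(custom_fields, '$.customField1') = 'value1'
         AND json_extract(custom_fields, '$.customField2') = 'value2'"""
).fetchall()
print(rows)  # [('a@example.com',)]

# An expression index makes lookups on a frequently queried field cheap:
conn.execute("CREATE INDEX idx_cf1 ON subscribers (json_extract(custom_fields, '$.customField1'))")
```

This answers the "how do I search inside the blob" objection, though problem 2 (integrity when a field definition is deleted) still requires an UPDATE that rewrites the JSON of every affected row.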

Related

SQLAlchemy Category column

I am using a SQLAlchemy database to hold data for a flask application. I would like one column in my database to represent a category (e.g. the possible categories may be A, B or C).
I have seen in documentation that this can be achieved by a simple relationship which relates two tables. One table to hold some live data (inclusive of a category ID and a category) and another table to relate a category id to the associated category. http://flask-sqlalchemy.pocoo.org/2.3/quickstart/#simple-relationships
Would this method be considered good practice for including some kind of "category" column in my database? Or is there a simpler/better way. In this case my aim is to prioritise simplicity while maintaining good practice (don't really need best practice if it entails too much complexity).
Additionally, if my category names will never change, is it bad practice to use a constant list of category names to compare input data with in order to validate it? If so, why?
This is more of an SQL question and it isn't related to Python at all.
Anyway, it is actually better to use a reference table, as you first suggested.
In this case, a Category table with one-to-many relationship. This allows you to change category name, and enrich Category with more details (like description) that might become useful in the future.
The other way, using a constant list, is considered bad practice - especially using Enums. You can read more about it in this article: 8 Reasons Why MySQL's ENUM Data Type Is Evil
You can read more about this dilemma here.
Hope it helps.
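As a sketch of the reference-table approach, here is the same schema in plain SQL via Python's sqlite3 (the SQLAlchemy relationship from the linked quickstart maps to the same structure; table and column names here are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when this is on
conn.execute("CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)")
conn.execute("""CREATE TABLE item (
    id INTEGER PRIMARY KEY,
    data TEXT,
    category_id INTEGER NOT NULL REFERENCES category(id))""")
conn.executemany("INSERT INTO category (name) VALUES (?)", [("A",), ("B",), ("C",)])
conn.execute("INSERT INTO item (data, category_id) VALUES ('live data', 1)")

# Reading an item back with its category name is a plain join:
row = conn.execute("""SELECT item.data, category.name
                      FROM item JOIN category ON item.category_id = category.id""").fetchone()
print(row)  # ('live data', 'A')

# Invalid categories are rejected by the database, not by application code:
try:
    conn.execute("INSERT INTO item (data, category_id) VALUES ('bad', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True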

Database design: Better to store more columns or use a single column with a dictionary on Backend?

I'm building a web app which takes preferences and saves them.
The dataset I want to save will consist of a unique ID, a search string and some finite list of parameters which could be represented as True or False. This list of parameters could get up to say 10 in number.
I haven't decided what type of database I'm using but assuming it has rows and columns, would it be more efficient to have ID, search string and all the parameters as separate columns OR would it be more efficient to have ID, search string and then a single column representing all my parameters using some sort of dictionary that I would translate on the back end.
For example, I could represent options A, C and D as A-C-D in a single column and then use a dictionary on retrieval to work with it in the application. Or else I would be using ColA: True, ColB: False, ColC: True, ColD: True, ..., ColN in the table and working with that when I pull it through.
Would it be more useful to choose an SQL style DB over something like MongoDB in either case?
The answer to this depends. Normally, one uses relational databases to store relational information. This would mean that you have separate columns for options and values. There are traditionally two ways of doing this.
The most common is a normalized form, where each option has a column in a Users table. The key is the user id and you can just read the values. This works very well when there is a finite list of options that doesn't change much. It is also really useful when you want to query the table by options -- which users have a particular option, for instance.
Another method is called entity-attribute-value (EAV). In this method, the UserOptions table would have a separate row for each user and each option. The key would normally consist of the user/option pair (and the option itself might be an id that references a master list of options). This is flexible; it is easy to add values and it can handle an unlimited number of options per user. The downside is that getting all options for a user can be cumbersome; there is no data type validation on the values; implementing check constraints to validate values is tricky.
A third method can be useful for some purposes. That is to store all the options in a single "string" -- more typically, a JSON object. This is useful when you are using the database only for its ACID properties and don't need to query individual options. You can read the "options object" into your application, and it parses them into the options.
And, these are three examples of methods of solving the problem. There are also hybrid approaches that combine elements from more than one solution.
Which solution works best for you depends on your application. If you just have a handful of predetermined options, I would go with the first suggestion, a single column per option in a table.
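The first suggestion (one column per option) can be sketched like this, a minimal example with Python's sqlite3 and invented table and option names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE users (
    id INTEGER PRIMARY KEY,
    search_string TEXT,
    opt_a INTEGER NOT NULL DEFAULT 0,   -- booleans stored as 0/1
    opt_b INTEGER NOT NULL DEFAULT 0,
    opt_c INTEGER NOT NULL DEFAULT 0)""")
conn.execute("INSERT INTO users (search_string, opt_a, opt_c) VALUES ('cats', 1, 1)")
conn.execute("INSERT INTO users (search_string, opt_b) VALUES ('dogs', 1)")

# "Which users have a particular option?" is a plain, indexable predicate:
rows = conn.execute("SELECT search_string FROM users WHERE opt_a = 1").fetchall()
print(rows)  # [('cats',)]
```

Adding an eleventh option means an ALTER TABLE, which is the trade-off against the EAV and JSON approaches described above.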
Neither of the two options you specified is ideal.
Would it be more efficient to have ID, search string and all the parameters as separate columns
The problem is that not only does this assume you have a fixed maximum number of parameters, but querying this data requires you to always include every param column. An example query would look like this:
SELECT Id, <other fields>, Param1, Param2, Param3, Param4, ..., Param10
FROM YourTable
WHERE <stuff>
This can be very cumbersome on the back-end trying to check for NULL values, and you may run into the situation where you don't have enough columns. Plus, indexing would be very high overhead to add an index to each Param.
In short, don't do that method.
OR would it be more efficient to have ID, search string and then a single column representing all my parameters using some sort of dictionary that I would translate on the back end.
Also, no. There is a large problem with this method when it comes to querying data. If, say, you wanted to retrieve all records with parameter xyz, you would need to construct a query that parses out all of the params and compares them. Such a query cannot be indexed, and performance will be dreadful. In addition, it requires more coding on the application layer to actually make sense of the data returned.
Proposed Solution
You should make a separate table for the parameters. The structure would look something similar to this:
Dataset:            DatasetParameters:
  Id                  DatasetId
  <Other Fields>      Parameter
Using this structure, let's say for ID 1, you have parameters A, B, C, and D. You can insert four rows into DatasetParameters:
DatasetId   Parameter
---------------------
1           A
1           B
1           C
1           D
If you want to add more parameters later, you can simply insert (or delete, should you wish to remove) from this table with the DatasetId being the ID of the Dataset table.
To query this, all you would need to do is use a JOIN:
SELECT D.*, P.Parameter
FROM Dataset D
INNER JOIN DatasetParameters P ON D.Id = P.DatasetId
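The proposed structure can be run end to end with Python's sqlite3 (names follow the answer; SearchString stands in for the question's "search string" field):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Dataset (Id INTEGER PRIMARY KEY, SearchString TEXT)")
conn.execute("""CREATE TABLE DatasetParameters (
    DatasetId INTEGER REFERENCES Dataset(Id),
    Parameter TEXT,
    PRIMARY KEY (DatasetId, Parameter))""")
conn.execute("INSERT INTO Dataset (Id, SearchString) VALUES (1, 'example')")
conn.executemany("INSERT INTO DatasetParameters VALUES (1, ?)",
                 [("A",), ("B",), ("C",), ("D",)])

# One row per (dataset, parameter) pair comes back from the join:
rows = conn.execute(
    """SELECT D.SearchString, P.Parameter
       FROM Dataset D
       INNER JOIN DatasetParameters P ON D.Id = P.DatasetId
       ORDER BY P.Parameter"""
).fetchall()
print(rows)  # [('example', 'A'), ('example', 'B'), ('example', 'C'), ('example', 'D')]
```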

Enum types in database schema

This might be sort of a basic db question, but I'm more used to working with objects rather than tables. Let's say I have an object 'Movie' with property 'genre'. Genre should be restricted by using enumerated types (eg. the only valid genres are Horror, Action, Comedy, Drama). How should this translate to a db schema?
I could put a 'genre' column in the Movies table and rely on checking inputs to ensure that a 'genre' assignment is valid?
Or, I could include a Genres table with pre-filled rows, and then in the Movies table include a column with a foreign key to the Genres table?
I'm leaning towards the first option, but are there pitfalls/etc. that I'm not considering?
I lean toward using the lookup table, your second option. The reason I prefer this is that I can add a new genre simply by adding a row to the Genres table. There would be no need to modify code or to modify the enum definition in the schema.
See also my answer to How to handle enumerations without enum fields in a database?
Here is a useful heuristic: Do you treat all values the same from the client code?
If you do, then just use the lookup table. Even if you don't envision adding new values1 now, requirements tend to change as time marches on, and the lookup table will allow you to do that without changing the client code. Your case seems to fall into that category.
If you don't, then enum is likely more appropriate - the "knowledge" about each distinct value is contained in your client code anyway, so there is nothing useful left to store in the database.
The gray zone is if you do a little bit of both. E.g. you need to treat values in special ways, but there is still some additional field (associated to each value) that you can treat generically (e.g. just display it to the user). Or you need to treat just some values in special ways. In cases like these, I'd lean towards the lookup table.
1 Or deleting or modifying old values.
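Both options from the question can be sketched side by side with Python's sqlite3 (the CHECK constraint is the in-schema version of option one; the Genres table is option two; data and table names are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")

# Option 1: constrain the column in place; adding a genre later means altering the schema.
conn.execute("""CREATE TABLE movies_v1 (
    title TEXT,
    genre TEXT CHECK (genre IN ('Horror', 'Action', 'Comedy', 'Drama')))""")
try:
    conn.execute("INSERT INTO movies_v1 VALUES ('Bad Movie', 'Musical')")
    check_rejected = False
except sqlite3.IntegrityError:
    check_rejected = True
print(check_rejected)  # True

# Option 2: lookup table; adding a genre is just an INSERT, no schema change.
conn.execute("CREATE TABLE genres (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL)")
conn.execute("""CREATE TABLE movies_v2 (
    title TEXT,
    genre_id INTEGER NOT NULL REFERENCES genres(id))""")
conn.executemany("INSERT INTO genres (name) VALUES (?)",
                 [("Horror",), ("Action",), ("Comedy",), ("Drama",)])
conn.execute("INSERT INTO genres (name) VALUES ('Documentary')")  # no ALTER TABLE needed
count = conn.execute("SELECT COUNT(*) FROM genres").fetchone()[0]
print(count)  # 5
```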

How can I design a DB where the user can define the fields and types of a detail table in a M-D relationship?

My application has one table called 'events' and each event has approx 30 standard fields, but also user defined fields that could be any name or type, in an 'eventdata' table. Users can define these event data tables, by specifying x number of fields (either text/double/datetime/boolean) and the names of these fields. This 'eventdata' (table) can be different for each 'event'.
My current approach is to create a lookup table for the definitions. So if I need to query all 'event' and 'eventdata' per record, I do so in an M-D relationship using two queries (i.e. select * from events, then for each record in 'events', select * from 'some table').
Is there a better approach? I have implemented this so far, but most of my queries require two distinct calls to the DB - I cannot simply join my master 'events' table with different 'eventdata' tables for each record in 'events'.
I guess my main question is: can I join my master table with different detail tables for each record?
E.g.
SELECT E.*, E.Tablename
FROM events E
LEFT JOIN 'E.tablename' T ON E._ID = T.ID
If not, is there a better way to design my database, considering I have no idea how many user-defined fields there may be or what type they will be.
There are four ways of handling this.
Add several additional fields named "Custom1", "Custom2", "Custom3", etc. These should have a datatype of varchar or similar
Add a field to hold the unstructured data (like an XML column).
Create a table of name/value pairs which are associated with some type of template. Let them manage the template. You'll have to use pivot tables or similar to get the data out.
Use a database like MongoDB or another NoSql style product to store this.
That said, the first one has the advantage of being fast but limits the number of custom fields to the number you defined. Older mainframe-type applications work this way. SalesForce CRM used to.
The second option means that each record can have its own custom fields. However, depending on your database, there are definite challenges here. Tried this; don't recommend it.
The third one is generally harder to code for but allows for extreme flexibility. SalesForce and other applications have gone this route; including a couple I'm responsible for. The downside is that Microsoft apparently acquired a patent on doing things this way and is in the process of suing a few companies over it. Personally, I think that's bullcrap; but whatever. Point is, use at your own risk.
The fourth option is interesting. We've played with it a bit and the performance is great while coding is pretty darn simple. This might be your best bet for the unstructured data.
Those types of joins won't work because you would need to pivot the eventdata table to make it columns instead of rows. Therefore it depends on which database technology you are using.
Here is an example with MySQL: How to pivot a MySQL entity-attribute-value schema
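The pivot works the same way in most engines via conditional aggregation; a sketch with Python's sqlite3 and invented attribute names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE eventdata (event_id INTEGER, name TEXT, value TEXT)")
conn.executemany("INSERT INTO eventdata VALUES (?, ?, ?)", [
    (1, "color", "red"), (1, "size", "L"),
    (2, "color", "blue"),
])

# Turn attribute rows into columns: one MAX(CASE ...) per known attribute name.
rows = conn.execute("""
    SELECT event_id,
           MAX(CASE WHEN name = 'color' THEN value END) AS color,
           MAX(CASE WHEN name = 'size'  THEN value END) AS size
    FROM eventdata
    GROUP BY event_id
    ORDER BY event_id""").fetchall()
print(rows)  # [(1, 'red', 'L'), (2, 'blue', None)]
```

Note the catch this thread keeps circling: the query has to enumerate the attribute names, so the application must generate it from the definitions table.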
My approach would be to avoid using a different table for each event, if that's possible.
I would use something like:
Event (EventId, ..., ...)
EventColumnType (EventColumnTypeId, EventTypeId, ColumnName)
EventColumnData (EventColumnTypeId, Data)
You are then limited in the type of data you can store (everything would have to be strings, for example), but the number of events and columns is unrestricted.
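A sketch of that schema with Python's sqlite3, adding an EventId column (implied but not shown in the answer) and ON DELETE CASCADE, which is the usual answer to the "delete a field definition and all its values" integrity problem in this kind of design:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # required for the cascade to fire in SQLite
conn.execute("""CREATE TABLE EventColumnType (
    EventColumnTypeId INTEGER PRIMARY KEY,
    ColumnName TEXT)""")
conn.execute("""CREATE TABLE EventColumnData (
    EventColumnTypeId INTEGER REFERENCES EventColumnType(EventColumnTypeId) ON DELETE CASCADE,
    EventId INTEGER,
    Data TEXT)""")
conn.execute("INSERT INTO EventColumnType VALUES (1, 'severity')")
conn.execute("INSERT INTO EventColumnData VALUES (1, 10, 'high')")
conn.execute("INSERT INTO EventColumnData VALUES (1, 11, 'low')")

# Removing the column definition removes every stored value for it in one statement:
conn.execute("DELETE FROM EventColumnType WHERE EventColumnTypeId = 1")
remaining = conn.execute("SELECT COUNT(*) FROM EventColumnData").fetchone()[0]
print(remaining)  # 0
```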
What I'm getting from your description is that you have an event table, and then a separate EventData table for each and every event.
Rather than that, why not have a single EventCustomFields table that contains a foreign key to the event table, a field name (event + field being the PK) and a field value.
Sure, it's not the best. You'd be stuck serializing the value or storing everything as a string. And you'd still be stuck doing two queries, one for the event table and one to get its custom fields, but at least you wouldn't have a new table for every event in the system (yuck x10)
Another (arguably worse) option is to serialize the custom fields into a single column of the table and then deserialize when you need them. So your query would be something like
Select E.*, C.*
From events E, customFields C
Where E.ID = C.ID
Is it possible to just impose a limit on your users? I know the tables underneath Sharepoint 2007 had a bunch of columns for custom data that were just named like CustomString1, CustomDate2, etc. That may end up easier than some of the approaches above, where everything is in one column (though that's an approach I've taken as well), and I would think it would scale up better.
The answer to your main question is: no. You can't have different rows in the result set with different columns. The result set is kind of like a table, so each row has to have the same columns. You can fake it with padding and dummy columns, but that's probably not much better.
You could try defining a fixed event data table, with (say) ten of each type of column. Then you'd store the usage metadata in a separate table and just read that in at system startup. The metadata would tell you that event type "foo" has a field "name" mapped to column string0 in the event data table, a field named "reporter" mapped to column string1, and a field named "reportDate" mapped to column date0. It's ugly and wastes space, but it's reasonably flexible. If you're in charge of the database, you can even define a view on the table so to the client it looks like a "normal" table. If the clients create their own tables and just stick the table name in the event record, then obviously this won't fly.
If you're really hardcore, you can write a database procedure to query the table structures and serialize everything to a list of key/type/value tuples, returning that in one long string as the last column, but that's probably not much handier than what you're doing now.
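The fixed-table-plus-metadata idea above can be sketched with Python's sqlite3; every table, column, and field name here is invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Fixed event data table: a few generic slots of each type.
conn.execute("""CREATE TABLE event_data (
    event_id INTEGER, event_type TEXT,
    string0 TEXT, string1 TEXT, date0 TEXT)""")
# Usage metadata: which logical field maps to which physical slot, per event type.
conn.execute("""CREATE TABLE field_map (
    event_type TEXT, field_name TEXT, column_name TEXT)""")
conn.executemany("INSERT INTO field_map VALUES (?, ?, ?)", [
    ("foo", "name", "string0"),
    ("foo", "reporter", "string1"),
    ("foo", "reportDate", "date0"),
])
conn.execute("INSERT INTO event_data VALUES (1, 'foo', 'outage', 'alice', '2024-01-01')")

# At system startup, read the metadata once, then build queries from it.
# (Interpolating the column name is safe here because it comes from our own
# metadata table, not from user input.)
mapping = {f: c for f, c in conn.execute(
    "SELECT field_name, column_name FROM field_map WHERE event_type = 'foo'")}
col = mapping["reporter"]
row = conn.execute(f"SELECT {col} FROM event_data WHERE event_type = 'foo'").fetchone()
print(row)  # ('alice',)
```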

Define Generic Data Model for Custom Product Types

I want to create a product catalog that allows for intricate details on each of the product types in the catalog. The product types have vastly different data associated with them; some with only generic data, some with a few extra fields of data, some with many fields that are specific to that product type. I need to easily add new product types to the system and respect their configuration, and I'd love tips on how to design the data model for these products as well as how to handle persistence and retrieval.
Some products will be very generic and I plan to use a common UI for editing those products. The products that have extensible configuration associated with them will get new views (and controllers) created for their editing. I expect all custom products to have their own model defined but to share a common base class. The base class would represent the generic product that has no custom fields.
Example products that need to be handled:
Generic product
  Description
Light Bulb
  Description
  Type (with an enum of fluorescent, incandescent, halogen, LED)
  Wattage
  Style (enum of flood, spot, etc.)
Refrigerator
  Description
  Make
  Model
  Style (with an enum in the domain model)
  Water Filter information
    Part number
    Description
I expect to use MEF for discovering what product types are available in the system. I plan to create assemblies that contain product type models, views, and controllers, drop those assemblies into the bin, and have the application discover the new product types, and show them in the navigation.
Using SQL Server 2008, what would be the best way to store products of these various types, allowing for new types to be added without having to grow the database schema?
When retrieving data from the database, what's the best way to translate these polymorphic entities into their correct domain models?
Updates and Clarifications
To avoid the Inner Platform Effect, if there is a database table for every product type (to store the products of that type), then I still need a way to retrieve all products that spans product types. How would that be achieved?
I talked with Nikhilk in more detail about his SharePoint reference. Specifically, he was talking about this: http://msdn.microsoft.com/en-us/library/ms998711.aspx. It actually seems pretty attractive. No need to parse XML; and there is some indexing that could be done allowing for simple and fast queries over the data. For instance, I could say "find all 75-watt light bulbs" by knowing that the first int column in the row is the wattage when the row represents a light bulb. Something (NHibernate?) in the app tier would define the mapping from the product type to the userdata schema.
Voted down the schema that has the Property Table because this could lead to lots of rows per product. This could lead to index difficulties, plus all queries would have to essentially pivot the data.
Use a Sharepoint-style UserData table, that has a set of string columns, a set of int columns, etc. and a Type column.
Then you have a list of types table that specifies the schema for each type - its properties, and the specific columns they map to in the UserData table.
With things like Azure and other utility computing storage you don't even need to define a table. Every store object is basically a dictionary.
I think you need to go with a data model like --
Product Table
ProductId (PK)
ProductName
Details
Property Table
PropertyId (PK)
ProductId (FK)
ParentPropertyId (FK - Self referenced to categorize properties)
PropertyName
PropertyValue
PropertyValueTypeId
Property Value Lookup Table
PropertyValueLookupId (PK)
PropertyId (FK)
LookupValue
And then have a dynamic view based on this. You could use the PropertyValueTypeId column to identify the type, using a convention like (0 - string, 1 - integer, 2 - float, 3 - image, etc.) - but ultimately everything is stored untyped. You could also use this column to select the control template to render the corresponding property to the user.
You can use the Value Lookup table to keep the allowed values for a specific property (so that the user can choose one from a list).
Summarizing, let's look at the options under consideration for storing product information:
1) some XML format in the database
2) similar to the post above about having x number of type-defined columns (SharePoint approach)
3) a generic table with name and type definitions stored in a lookup table and values in a secondary table with columns id, propertyid, value (similar to #2, however this approach would provide unlimited property information)
4) some hybrid of the above options, where the product table would have x common columns (for storage of properties common to all products) plus y user-defined columns (m of integer type and n of varchar type). This takes the best of #2 and a normalized structure, as if you knew all the properties of all products. You would get the best SQL performance for the properties you use the most (probably those common across all products) while still allowing custom columns for product-specific properties.
Are there other options? In my opinion, #4 above is the best hybrid of the combinations.
dave
Put as much of the shared anticipated structure in traditional normalized 3NF model, then augment with XML columns as appropriate.
I don't see MEF (or any other ORM) being able to do all this transparently.
I think you should avoid the Inner Platform Effect and actually build tables for your specialized entities. You'll be writing specific code to manage them so why not have proper backing tables too?
It will make your deployment slightly harder - drop in an assembly and run a script - but it will probably save you a lot of pain in the long run.
Jeff,
We currently use an XML field in the Products table to handle all product-specific data. So our Products table has a few common fields that all products share, an XML field which contains whatever a particular product needs additionally, and a few computed fields that reach into the XML and surface some of the frequently queried fields as "virtual" fields on the Products table (e.g. "Style" would be set to whatever the current product defines, or NULL if the product doesn't have a Style property).
So far, we've been quite flexible with that approach - if you create some decent XSD schemas for your XML, you can even create C# proxy classes for these fields.
Works nicely for us - joining the best of both the relational and XML worlds.
Marc
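The computed-field trick Marc describes translates directly to JSON blobs as well; a sketch using SQLite generated columns via Python's sqlite3 (requires SQLite 3.31+ with the JSON1 functions; Marc's actual setup is SQL Server with XML columns, and all names below are invented):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# 'style' is a virtual column that surfaces one frequently queried field
# out of the product-specific blob, so it can be selected (and indexed) directly.
conn.execute("""CREATE TABLE products (
    id INTEGER PRIMARY KEY,
    name TEXT,
    details TEXT,  -- JSON blob of product-specific fields
    style TEXT GENERATED ALWAYS AS (json_extract(details, '$.style')) VIRTUAL)""")
conn.execute("INSERT INTO products (name, details) VALUES (?, ?)",
             ("flood light", '{"style": "flood", "wattage": 75}'))
conn.execute("INSERT INTO products (name, details) VALUES (?, ?)",
             ("fridge", '{"make": "Acme"}'))

rows = conn.execute("SELECT name, style FROM products ORDER BY id").fetchall()
print(rows)  # [('flood light', 'flood'), ('fridge', None)]
```

As in Marc's setup, products without the property simply surface NULL, so the common query path never has to parse the blob itself.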