Database schema changes needed yearly. Which Strategy should be used?

Every year our company holds a conference/stand where participants can show their products.
We have a web application that lets the participants sign up for the conference.
They can enter information such as the name of their company, billing information, and so on.
The requirements for what information the participants need to enter seem to vary from year to year.
For example, one year the participants might need to enter the size of the stand they want; the next year this is no longer needed, and so on.
One year you might just have to enter the total number of m^2 you want, while the next year you might need to enter the length, height and number of floors you want.
Over the years, this has caused the DB schema to become quite crazy.
We now have a lot of 'obsolete' fields and tables in our database, and it's beginning to look quite messy.
For historical reasons, we can't just reset the schema back to basics for each year.
We might need some of the data from the old conferences.
So: does anyone have a good idea of how we can deal with this?
The only solutions I can think of are:
1) Version our database for each conference
2) Store all of the 'varying' information as XML
If anyone knows some good literature on how to handle evolving databases and deal with obsolete data, that would be great!

Much as I hate to say this, this might be a case where the Entity-Attribute-Value structure would work best.
http://en.wikipedia.org/wiki/Entity-Attribute-Value_model
Note that this is not a model to use lightly; there are significant problems with it. But this is exactly the kind of problem it is designed to solve.

I would consider using a name-value approach for all the extended data. Essentially, you first define the data that stays static from year to year. This will be things like company information; the definition of an address, for example, doesn't change year after year. These will be modeled normally.
Then you would define a table that contains a master list of all the questions you have, linked somehow to tell you which year those questions are valid for. This table might also hold other attributes of the question that could let you dynamically create a GUI on top of it, such as regular expressions to validate the type of data.
Note that a really naive version of this would not be the end state of what I would model. I would probably have another table that correlates a year to a question, and this is what I would link the company to; this way we can reuse questions over and over.
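For what it's worth, the layout described above could be sketched roughly like this. It is only a minimal sketch using Python's built-in sqlite3 module; every table name, column name, year and sample value below is an assumption made up for illustration, not something from the original schema.

```python
import sqlite3

# Hypothetical schema: all names and sample values are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE company (
        company_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    -- Master list of every question ever asked.
    CREATE TABLE question (
        question_id      INTEGER PRIMARY KEY,
        label            TEXT NOT NULL,
        validation_regex TEXT            -- lets a GUI validate input dynamically
    );
    -- Which questions apply in which conference year (questions can be reused).
    CREATE TABLE conference_question (
        year        INTEGER NOT NULL,
        question_id INTEGER NOT NULL REFERENCES question(question_id),
        PRIMARY KEY (year, question_id)
    );
    -- One row per company, year and question.
    CREATE TABLE answer (
        company_id  INTEGER NOT NULL REFERENCES company(company_id),
        year        INTEGER NOT NULL,
        question_id INTEGER NOT NULL,
        value       TEXT NOT NULL,
        PRIMARY KEY (company_id, year, question_id),
        FOREIGN KEY (year, question_id)
            REFERENCES conference_question(year, question_id)
    );
""")

conn.execute("INSERT INTO company VALUES (1, 'Acme')")
conn.executemany("INSERT INTO question VALUES (?, ?, ?)", [
    (1, "Stand area (m^2)", r"^\d+$"),
    (2, "Stand length (m)", r"^\d+$"),
    (3, "Number of floors", r"^\d+$"),
])
conn.executemany("INSERT INTO conference_question VALUES (?, ?)", [
    (2023, 1), (2024, 2), (2024, 3),
])
conn.executemany("INSERT INTO answer VALUES (?, ?, ?, ?)", [
    (1, 2023, 1, "40"), (1, 2024, 2, "8"), (1, 2024, 3, "2"),
])


def form_for_year(year):
    """Return the labels of the questions that apply in the given year."""
    rows = conn.execute("""
        SELECT q.label
          FROM conference_question cq
          JOIN question q ON q.question_id = cq.question_id
         WHERE cq.year = ?
         ORDER BY q.question_id
    """, (year,))
    return [label for (label,) in rows]


print(form_for_year(2023))
print(form_for_year(2024))
```

Old answers stay queryable by year, and a new conference only needs new rows in conference_question (and possibly question), not new columns.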

"We now have a lot of 'obsolete' fields and tables in our database, and it's beginning to look quite messy. For historical reasons, we can't just reset the schema back to basics for each year. We might need some of the data from the old conferences."
If you might need them, they're not obsolete.
I would code the front-end generically however. This means having a system that can handle any form of stand area configuration (in the example you give), and maybe more in the future if that should occur.
If you have tables like "standarea" (area in m^2), "standsize" (length, width, height, etc) - then you would have objects in your model to match these (StandArea, StandSize) - these could both extend a common base class StandData.
One year one table gets data set, the next year another table gets the data. Your DAO will try to load each object from each table (by a parent, err, stand_uid field) and then set the StandData field in your "ConferenceApplication" object to whatever it discovered.
The other option is to just have all possible fields in a single table, and allow them to be empty.

Related

Querying SQL using a Code column vs extended table

I am setting up a fairly large dataset (catalogue) in a SQL database (I'd guesstimate ∼100k records) to store information about products. Each product is characterized by about 20-30 properties, so that would basically mean 20-30 columns. The system is set up so that each of these properties is linked to a code, and each product is therefore characterized by a unique string made by concatenating all these property codes (the string has to be unique; if two product codes are the same, then the two products are actually the same product). What I am trying to figure out is whether, SQL-wise, there is any difference between storing the catalogue as a table of 20-30 columns and having just one column with the code and decoding the properties from the code. The difference is that in one case I would do
SELECT * FROM Catalogue WHERE Color='RED'
versus
SELECT * FROM Catalogue WHERE Code LIKE '____R____________'
Also, the single code column might make it easier to check whether a product already exists, as I would be comparing one column instead of 20-30 columns. I could also just add an extra column to the complete table to store the code, and use one method for one kind of operation and the other method for another.
I have almost no knowledge of how the SQL engine works so I might be completely off with my reasoning here.

The code approach seems silly. Why do I phrase it this way?
You have a few dozen columns with attributes and you know what they are. Why would you NOT include that information in the data model?
I am also amused by how you are going to distinguish these comparisons:
WHERE Code LIKE '____R____________'
WHERE Code LIKE '___R_____________'
WHERE Code LIKE '_____R___________'
WHERE Code LIKE '____R___________'
That just seems like a recipe for spending half the rest of your future life on debugging -- if not your code then someone else's.
And, with separate columns, you can create indexes for commonly used combinations.
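To make the comparison concrete, here is a small sketch using Python's built-in sqlite3 module. It assumes a made-up 3-character code (size, color, shape) rather than the real 17-character one, and the index name is my own invention. It shows the two query styles side by side, plus how a pattern that is off by one position silently matches a different property.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical 3-character code: position 1 = size, 2 = color, 3 = shape.
conn.executescript("""
    CREATE TABLE Catalogue (
        Size  TEXT NOT NULL,
        Color TEXT NOT NULL,
        Shape TEXT NOT NULL,
        Code  TEXT NOT NULL UNIQUE   -- concatenated code, kept for duplicate checks
    );
    CREATE INDEX idx_catalogue_color ON Catalogue (Color);
""")
conn.executemany("INSERT INTO Catalogue VALUES (?, ?, ?, ?)", [
    ("S", "R", "C", "SRC"),
    ("M", "B", "C", "MBC"),
    ("R", "G", "Q", "RGQ"),   # here 'R' is a size, not a color
])

# Query on the real column: the intent is obvious from the SQL itself.
by_column = conn.execute(
    "SELECT Code FROM Catalogue WHERE Color = 'R' ORDER BY Code").fetchall()

# Same question asked through the code: correct only if the position is right.
by_pattern = conn.execute(
    "SELECT Code FROM Catalogue WHERE Code LIKE '_R_' ORDER BY Code").fetchall()

# One underscore misplaced: no error, just rows for the wrong property.
off_by_one = conn.execute(
    "SELECT Code FROM Catalogue WHERE Code LIKE 'R__' ORDER BY Code").fetchall()

print(by_column)
print(by_pattern)
print(off_by_one)

# Ask SQLite how it would run the column query; the exact wording of the plan
# varies between SQLite versions, so no expected output is shown here.
for row in conn.execute(
        "EXPLAIN QUERY PLAN SELECT * FROM Catalogue WHERE Color = 'R'"):
    print(row)
```

The UNIQUE constraint on Code still gives the cheap "does this product already exist?" check from the question, so keeping both the columns and the code is a reasonable compromise.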
If not all rows have all attributes -- or if the attributes can be expanded in the future -- you might want a structure with a separate line for each attribute:
entityId  code   value
1         Color  Red
This is called an entity-attribute-value (EAV) model and is appropriate under some circumstances.

User characteristics database schema

I'm really battling an issue where I have a Users table that has a growing number of user characteristics (religion, smoking preferences, etc). The strategy I've used thus far has been to add a column for each preference that keys off onto another table.
For example, if User XYZ has a ReligionId of 3, that could mean they're Christian. At runtime, if I need their religion, I join onto another table.
This strategy has worked so far. However, I'm getting concerned about the number of columns in the tables as the number of preferences is increasing. Also, this strategy leads to many joins if I need to get all values for a single user.
I'd like to find out the most normalized way of representing this data. Anybody have any ideas?

I'd like to find out the most normalized way of representing this data.
Well, from what you describe, you seem to have quite a normalized database.
What you are looking for if you want to reduce the number of joins is denormalization.
For instance, if you want to access a subset of those user preferences with a smaller number of joins, you might want to cache them in a UserDetails table, and link that in the User table with a UserDetailsId foreign key.
This might actually be feasible in case you have a subset of seldom-changing values (for instance one's religion does not often change).
The drawback is that in case one of these changes you might have to change the info in two places (depending on if you want to also keep the normalized version of that data or not).
I hope this helps. Feel free to ask for additional clarification.

Three dimensional database table

We have all been there - consider the following example - first, the client says "every user shall only have one profile picture", so we add a field for that to the users table - half a year later, requirements change and a user actually needs to have n profile pictures.
Now, this seems only possible if you add a new table such as user_pictures to handle the new cardinality 1:n instead of 1:1. Oftentimes this can get very complicated. Whenever I come across this problem, I wonder why we don't use all three dimensions that we can think in. A two dimensional table is limited in a way that it is somewhat incomplete - what if, referring to our problem with the profile picture again, the picture field in the users table had a depth, and that depth made the field an array that perfectly represented both cardinalities 1:1 and 1:n at the same time.
Table fields would simply become arrays and automatically support both cardinalities - wouldn't that be something? At least I would use it. Is there something like it out there already?

Oracle has support for arrays as well as nested tables. Either seems to fit your requirements. These days, though, people prefer to model everything as tables and relationships to keep things simple and consistent, so modern RDBMSes don't generally support this stuff, and I don't believe it ever made it into standard SQL either.

The standard many-to-many approach, many users to many profile pictures, is easily covered by the three table approach:
Table: Users
Table: Pictures
Table: User_Pictures
However, if you move to a NoSQL approach, you can store a User document (usually in JSON format) that stores an array of profile pictures for that user in a single table.
@gordy +1 for the Oracle link. I wasn't sure if any RDBMS supported arrays.

You are describing a denormalization technique (multiple columns for instances of one field) and it usually leads to tears unless you thoroughly understand the consequences of violating basic relational principles.
A classic difficulty comes when you want to query on the field ("find the user who has this picture") and you discover that an SQL statement with "AND picture IN (pic1, pic2, pic3)" can't be indexed and your optimizer starts planning its revenge.
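By contrast, with a separate link table the "find the user who has this picture" question becomes an ordinary lookup on one column. A minimal sketch with Python's built-in sqlite3 module; the table layout follows the three-table approach mentioned in another answer here, while the column names, index name and sample rows are my own assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users    (user_id    INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE pictures (picture_id INTEGER PRIMARY KEY, url  TEXT NOT NULL);
    -- One row per (user, picture) pair: covers 1:1, 1:n and n:m alike.
    CREATE TABLE user_pictures (
        user_id    INTEGER NOT NULL REFERENCES users(user_id),
        picture_id INTEGER NOT NULL REFERENCES pictures(picture_id),
        PRIMARY KEY (user_id, picture_id)
    );
    -- Supports "who has this picture?" lookups.
    CREATE INDEX idx_user_pictures_picture ON user_pictures (picture_id);
""")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, "alice"), (2, "bob")])
conn.executemany("INSERT INTO pictures VALUES (?, ?)", [
    (10, "a.png"), (11, "b.png"), (12, "c.png"),
])
conn.executemany("INSERT INTO user_pictures VALUES (?, ?)", [
    (1, 10), (1, 11), (2, 11),
])


def users_with_picture(picture_id):
    """Names of all users linked to the given picture."""
    rows = conn.execute("""
        SELECT u.name
          FROM user_pictures up
          JOIN users u ON u.user_id = up.user_id
         WHERE up.picture_id = ?
         ORDER BY u.name
    """, (picture_id,))
    return [name for (name,) in rows]


def picture_count(user_id):
    """How many pictures a user has; no schema change needed when n grows."""
    (count,) = conn.execute(
        "SELECT COUNT(*) FROM user_pictures WHERE user_id = ?",
        (user_id,)).fetchone()
    return count


print(users_with_picture(11))
print(picture_count(1))
```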

Implementing a Flexible Relationship in a RDBMS -- What really are the tradeoffs?

I have a bunch of products with a bunch of different possible attributes for each product. E.g. Product A has a name, size, color, shape. Product B has a name, calories, sugar, etc. One way to solve this is like:
1) Create tables
Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)
This allows for maximum flexibility, but I have heard a lot of people recommend against this although I am not sure why. I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design.
Everyone who explains to me why this is bad does so in the context of a completely flexible relational design where you don't ever create real tables past a few basic initial tables (e.g. object, attribute, object_attribute) -- which I think we all can agree is bad. But this is a much more limited and contained version of that (only Products, not every object in the system), so I don't think it is fair to group these two architectures together.
What issues have you encountered (experience or theoretical) that makes this design so bad?
2) Another way to solve this is to create a Product table with a bunch of columns like Size, Color, Shape, Weight, Sugar, etc and then include some extra columns at the end to give us some flexibility. This will create generally sparse rows filled mostly with NULLs. People tend to like this approach, but my question is how many columns can you have before this approach loses its performance benefits? If you have 200 columns, I imagine this is no longer a smart move, but what about 100 columns? 50 columns? 25 columns?
3) The final approach I know about is to store all of the attributes as a blob (JSON perhaps) in a single column of the Products table. I like this approach but it doesn't feel right. Queries are hard. And if you want to be able to easily change the name of an attribute later, you either have to parse every record individually or have them keyed in your blob by some id. If you go the id path then you will need another table Attributes and things start to look like approach #1 from above except you won't be able to join the attribute_id with your blob, so I hope you didn't want to query anything by attribute name.
What I like about this approach though is you can query one product and in your code you can easily access all the properties it has -- fast. And if you delete a product, you won't have to cleanup other tables -- easy to stay consistent.
4) I have read some things about being able to index strongly typed xml formats in some RDBMSs, but I honestly don't know much about this approach.
I am stuck. I feel like approach #1 is the best bet, but everything I read says that way stinks. What is the right way to think about this problem to be able to decide what is the best method for a given situation? More ideas than what I have listed are obviously welcomed!

You can probably find a great deal about this topic by doing a Google search on "entity attribute value antipattern".
One of the issues with this approach is that you end up mixing meta-data with actual data. Your "attribute" has to now tell the database what exactly is held in the "value" column. This can make it very difficult to handle this data in front-ends, reporting software, etc.
Second, you're going to have a very hard time actually enforcing any data integrity in the database. When your product has an attribute of "weight" what's to stop someone from putting "22 inches" in the value? Or a non-numeric value completely. You might say, "Well, my application will handle that." Then you need to change your application every time that you want to add a new attribute because the application needs to know how to handle it. If you're going to go through all of that work, just add a new column.
Third, how do you enforce that a given product has all of the attributes that it needs? In a row you can make a column NOT NULL, and a value is then required to get that row into the database. You can't enforce that in the EAV model.
Fourth, this kind of a model usually leads to a lot of confusion. People aren't sure what "attributes" are supported, or they duplicate an attribute, or they forget to handle an attribute when creating a report. As an example, if I have an attribute for "Weight(kg)" and another attribute for "Weight(lbs)" and someone asks me, "What's the heaviest product in your database?" I'd better remember that I need to check both attributes.
Fifth, this model usually also leads to laziness. Hey, there's no reason to actually do any analysis of the products that our system can handle, because whatever comes along we'll just add some attributes. In my experience, companies are much better off doing the analysis required to create a good database design rather than fall back on an antipattern like this. You'll learn things about the database, the application, and likely the business as well.
Sixth, it might take a LOT of joins to get a single row of data for a given product. You can return the attributes as separate rows, but now you have to come up with customized list boxes to list those products, etc. Similarly, writing search queries against this model can be very difficult and in both of these situations you're likely to have performance issues.
These are just a few of the problems which I've encountered over the years. I'm sure that there are others.
What the correct solution is for your system depends a lot on the specifics of your business and application. Rather than a sparse row, you might consider using subtype tables if your products fall into a few categories that share common attributes.
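The second and third points above (data integrity and required attributes) can be illustrated with a short sketch. This uses Python's built-in sqlite3 module; the table and column names are assumptions for illustration. The typeof() test in the CHECK is there because SQLite is loosely typed and would otherwise keep a non-numeric string in a REAL column; a strictly typed RDBMS would reject it through the column type alone.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Conventional design: the database itself guards the weight column.
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL,
        weight_kg  REAL NOT NULL
                   CHECK (typeof(weight_kg) = 'real' AND weight_kg > 0)
    );
    -- EAV design: the value column has to accept anything.
    CREATE TABLE product_attribute (
        product_id INTEGER NOT NULL,
        attribute  TEXT NOT NULL,
        value      TEXT
    );
""")


def try_insert(sql, params):
    """Run an INSERT and report whether the database accepted it."""
    try:
        conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


typed_sql = "INSERT INTO product (name, weight_kg) VALUES (?, ?)"
eav_sql = "INSERT INTO product_attribute VALUES (?, ?, ?)"

typed_ok = try_insert(typed_sql, ("Widget", 2.5))
typed_bad_unit = try_insert(typed_sql, ("Gadget", "22 inches"))
typed_missing = try_insert(typed_sql, ("Gizmo", None))
eav_bad_unit = try_insert(eav_sql, (1, "weight", "22 inches"))

print(typed_ok, typed_bad_unit, typed_missing, eav_bad_unit)
```

In the EAV table, nothing stops "22 inches" from being stored as a weight, and nothing forces a weight row to exist at all; both checks would have to live in application code.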

There are many problems with flexible data models but the first one that is likely to bite you is the fact that queries get unwieldy very quickly. For example, if you wanted to get the Size attribute for every product, the query is relatively easy.
SELECT p.name product_name,
       pa.value product_size
  FROM product p
       left outer join attribute a on (a.name = 'size')
       left outer join product_attribute pa on (p.product_id = pa.product_id and
                                                pa.attribute_id = a.attribute_id)
If you want to get the size and some other attribute like color, things get trickier
SELECT p.name product_name,
       pa_size.value product_size,
       pa_color.value product_color
  FROM product p
       left outer join attribute a_size on (a_size.name = 'size')
       left outer join product_attribute pa_size on (p.product_id = pa_size.product_id and
                                                     pa_size.attribute_id = a_size.attribute_id)
       left outer join attribute a_color on (a_color.name = 'color')
       left outer join product_attribute pa_color on (p.product_id = pa_color.product_id and
                                                      pa_color.attribute_id = a_color.attribute_id)
Very quickly, when you start wanting to grab 10 attributes or write complex searches (show me products where the color is blue and the size is medium), the queries start to get very complicated both for developers to write and maintain and for the database optimizer to generate the query plan for. If you're joining 30 tables together, the optimizer would have to prune the tree of plans it considers very, very quickly to be able to generate a query plan in a reasonable time frame. That tends to lead the optimizer to discard promising paths too early and to generate less than optimal paths for many of your queries.
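To get a feel for how fast this grows, here is a rough sketch of a helper that generates the pivot query for an arbitrary list of attributes. It uses Python's built-in sqlite3 module; the helper name eav_pivot_sql and the sample products are hypothetical, and the table and column names are assumed to match the queries above. Each requested attribute costs two more outer joins.

```python
import sqlite3


def eav_pivot_sql(attribute_names):
    """Build a query returning one row per product, one column per attribute.

    Names are interpolated into SQL aliases, so only plain identifiers are
    accepted here.
    """
    select = ["p.name AS product_name"]
    joins = []
    for name in attribute_names:
        if not name.isidentifier():
            raise ValueError(f"unsafe attribute name: {name!r}")
        select.append(f"pa_{name}.value AS product_{name}")
        joins.append(
            f"LEFT OUTER JOIN attribute a_{name} ON (a_{name}.name = '{name}')\n"
            f"LEFT OUTER JOIN product_attribute pa_{name}\n"
            f"  ON (pa_{name}.product_id = p.product_id\n"
            f"      AND pa_{name}.attribute_id = a_{name}.attribute_id)"
        )
    return ("SELECT " + ",\n       ".join(select)
            + "\nFROM product p\n" + "\n".join(joins)
            + "\nORDER BY p.product_id")


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE product   (product_id   INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE attribute (attribute_id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
    CREATE TABLE product_attribute (
        product_id   INTEGER NOT NULL,
        attribute_id INTEGER NOT NULL,
        value        TEXT,
        PRIMARY KEY (product_id, attribute_id)
    );
    INSERT INTO product   VALUES (1, 'Shirt'), (2, 'Mug');
    INSERT INTO attribute VALUES (1, 'size'), (2, 'color');
    INSERT INTO product_attribute VALUES (1, 1, 'M'), (1, 2, 'blue'), (2, 2, 'white');
""")

two_attr_sql = eav_pivot_sql(["size", "color"])
rows = conn.execute(two_attr_sql).fetchall()
print(two_attr_sql)
print(rows)

# Ten attributes already means twenty outer joins in a single statement.
ten_attr_sql = eav_pivot_sql([f"attr{i}" for i in range(10)])
print(ten_attr_sql.count("LEFT OUTER JOIN"))
```

With conventional columns the same result is just SELECT name, size, color FROM product, which is the whole point of the complaint.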
This, in turn, means that you very quickly get to a point where new development is bottlenecked because developers can't get their queries right or developers can't get their queries to return quickly enough. Whatever time you saved up front by not gathering the requirements to determine what the valid attributes are quickly gets used up with the 47th iteration of "Why can't I get the data I want out of this putrid data model?"
Beyond this cost to developers, you end up creating a lot of costs for the organization as a whole.
No query tool is going to handle this sort of data model well. So all the users that can currently fire up their favorite query tool and run some reports out of your database are now stuck waiting for developers to write their reports and do their extracts for them.
Data quality becomes very hard to enforce. It becomes very hard to check conditions that involve multiple attributes (e.g. if a product's size is Medium then the weight must be between 1 and 10 pounds; if a product's height is specified then a width is required as well), so people don't make those checks. They don't write the reports to identify where these sorts of rules are violated. So the data ends up being a bit bucket that downstream processes decide they can't use because it isn't sufficiently complete.
You're moving too much of the initial requirements discussion off into the future when understanding the core entities will likely lead to a much better design overall. If you can't agree on a set of attributes that the first version of the product needs to support, you don't really understand what that version is supposed to do. Even if you successfully code a very generic application, that means that it is going to require a lot of time to configure once you've built it (because someone will have to figure out what attributes it supports at that point). And then you'll discover when the application is being configured that you missed a ton of requirements that only became clear when the attributes were defined-- you can't know that width is required if height is specified if you don't know whether they're going to store height or width in the first place.
In the worst case, the response to this problem during configuration is to immediately determine that you need to provide a flexible way to specify business rules and to specify workflows so that the people configuring the application can quickly code their business rules when they add new attributes and so that they can control the flow of the application by grouping attributes together or skipping certain pages (e.g. have a page where make & model are required if the product type is car, skip that page if not). But in order to do that, you're going to end up building an entire development environment. And you're going to push the job of actually coding the application to the folks that are configuring the product. Unless you happen to be really good at building development environments, and unless the people configuring the product are really developers, this doesn't end well.

I mean, if those tables were called Teams, Players, Team_Players we would all agree that this is proper relational design.
No, we wouldn't. Here's why.
You started with this.
Products (id, name)
Attributes (id, name)
Product_Attributes (product_id, attribute_id, value as string)
Let's drop the id numbers, so we can see what's really going on. (Longer column names for clarity.)
Products (product_name)
Attributes (attribute_name)
Product_Attributes (product_name, attribute_name, value as string)
And translating that to teams and players . . .
Teams (team_name)
Players (player_name)
Team_Players (team_name, player_name, value as string)
So for sample data we might have
Team                 Player            Value
--------------------------------------------
St. Louis Cardinals  Boggs, Mitchell   ?
St. Louis Cardinals  Carpenter, Chris  ?
St. Louis Cardinals  Franklin, Ryan    ?
St. Louis Cardinals  Garcia, Jaime     ?
What on earth belongs in place of the question marks? Let's say we want to record number of games played. Now the sample data looks like this.
Team                 Player            Value
--------------------------------------------
St. Louis Cardinals  Boggs, Mitchell   23
St. Louis Cardinals  Carpenter, Chris  15
St. Louis Cardinals  Franklin, Ryan    19
St. Louis Cardinals  Garcia, Jaime     14
Want to store batting average, too? You can't. Not only can you not store batting average along with games played, you can't tell by looking at the database whether Mitch Boggs played in 23 games, had 23 hits, scored 23 runs, had 23 "at bats", had 23 singles, or struck out 23 times.

The reason why this approach is so bad is that you don't know how many times you have to join to the table to get all the attributes. Plus, joining to the same table 20 times tends to create a performance block of massive proportions. I am assuming that Products will be at the heart of your system and thus be a critical place for performance.
Now you say that the product attributes will be drastically different. I disagree. There will be many attributes that are common to a large number of your products: things like price, units, size, color, dimensions, weight. Those should be in the product table as common properties. These are also the ones that the user is most likely to be searching for when picking a product.
Other properties are useful as a description but not for most anything else (they won't be searched on or put into the order details). Put those in a description or notes field.
Finally you are left with the few attributes which might be different. But how different are they? If they are common to a particular type of product (books have these attributes, cameras have those), then a related table for that type of product might work well.
Once you have done your job and figured all this out, then add the flexibility of an EAV table if you still need one. The steps above should cover 98+% of the real requirements.
(Also it's kind of hard to design the order details table if you don't know the attribute fields you need to record for the order - you can't rely on the products table for that)
(Oh, and I agree wholeheartedly with what @Tom H is saying as well.)

Best pattern for storing (product) attributes in SQL Server

We are starting a new project where we need to store product and many product attributes in a database. The technology stack is MS SQL 2008 and Entity Framework 4.0 / LINQ for data access.
The products (and Products Table) are pretty straightforward (a SKU, manufacturer, price, etc..). However there are also many attributes to store with each product (think industrial widgets). These may range from color to certification(s) to pipe size. Every product may have different attributes, and some may have multiples of the same attribute (Ex: Certifications).
The current proposal is that we will basically have a name/value pair table with a FK back to the product ID in each row.
An example of the attributes Table may look like this:
ProdID  AttributeName  AttributeValue
123     Color          Blue
123     FittingSize    1.25
123     Certification  AS1111
123     Certification  EE2212
123     Certification  FM.3
456     Pipe           11
678     Color          Red
999     Certification  AE1111
...
Note: Attribute name would likely come from a lookup table or enum.
So the main question here is: Is this the best pattern for doing something like this? How will the performance be? Queries will be based on a JOIN of the product and attributes table, and generally need many WHEREs to filter on specific attributes - the most common search will be to find a product based on a set of known/desired attributes.
If anyone has any suggestions or a better pattern for this type of data, please let me know.
Thanks!
-Ed

You are about to re-invent the dreaded EAV model, Entity-Attribute-Value. This is notorious for having problems in real-life, for various reasons, many covered by Dave's answer.
Luckily the SQL Customer Advisory Team (SQLCAT) has a whitepaper on the topic, Best Practices for Semantic Data Modeling for Performance and Scalability. I highly recommend this paper. Unfortunately, it does not offer a panacea, a cookie-cutter solution, since the problem has no solution. Instead, you'll learn how to find the balance between a fixed queryable schema and a flexible EAV structure, a balance that works for your specific case:
Semantic data models can be very complex and until semantic databases are commonly available, the challenge remains to find the optimal balance between the pure object model and the pure relational model for each application. The key to success is to understand the issues, make the necessary mitigations for those issues, and then test, test, and test. Scalability testing is a critical success factor if you are going to find that optimal design.

This is going to be problematic for a couple of reasons:
Your entity queries will be much harder to write. Transforming the results of those queries into something resembling a ViewModel when it comes time for presentation is going to be painful because it will involve a pivot for each product.
Understanding what your datatypes will be is going to be tough when it comes time to read certain types of data. Are you planning on storing this as strings? For example, DateTimes hold more data than the default .ToString() implementation writes to the string. You're also going to have issues if you try to store floating-point values.
Your objects' data integrity is at risk. There will be a temptation to put properties which should be just attributes of your main product tables in this "bucket o' data". Maybe the design will be semi-sane to begin with, but I guarantee you that after a certain amount of time, folks will start to just throw properties in the bag. It'll then be very tough to keep your objects' integrity with such a loosely defined structure.
Your indexes will most likely be suboptimal. Again think of a property which should be on your product table. Instead of being able to index on just one column, you will now be forced to make a potentially very large composite index on your "type" table.
Since you're apparently planning to throw out proper datatypes and use strings, the performance of range queries for numeric data will likely be poor.
Your table will get big, slowing backups and queries. Instead of an integer being 4 bytes, you're going to have to store far more for an integer of any size.
Better to normalize the table in a more "traditional" way using "IS-A" relationships. For example, you might have Pipes, which are a type of Product, but have a couple more attributes. You might have Stoves, which are a type of product, but have a couple more attributes still.
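As a rough illustration of the "IS-A" layout, here is a minimal sketch with Python's built-in sqlite3 module. Pipes and Stoves come from the paragraph above; the specific columns (diameter_in, burner_count), SKUs and prices are invented for the example. Each subtype table shares the product's primary key and carries properly typed columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")   # SQLite leaves FK checks off by default
conn.executescript("""
    CREATE TABLE product (
        product_id   INTEGER PRIMARY KEY,
        sku          TEXT NOT NULL UNIQUE,
        manufacturer TEXT NOT NULL,
        price        REAL NOT NULL
    );
    -- A pipe IS-A product: same key, plus pipe-only columns.
    CREATE TABLE pipe (
        product_id  INTEGER PRIMARY KEY REFERENCES product(product_id),
        diameter_in REAL NOT NULL
    );
    -- A stove IS-A product as well, with its own typed columns.
    CREATE TABLE stove (
        product_id   INTEGER PRIMARY KEY REFERENCES product(product_id),
        burner_count INTEGER NOT NULL
    );
""")
conn.executemany("INSERT INTO product VALUES (?, ?, ?, ?)", [
    (1, "PIPE-125", "Acme", 12.0),
    (2, "PIPE-050", "Acme", 7.5),
    (3, "STOVE-4", "Hotco", 499.0),
])
conn.executemany("INSERT INTO pipe VALUES (?, ?)", [(1, 1.25), (2, 0.5)])
conn.execute("INSERT INTO stove VALUES (?, ?)", (3, 4))

# A numeric range filter on a real numeric column, with a single join.
wide_pipes = conn.execute("""
    SELECT p.sku, pipe.diameter_in
      FROM product p
      JOIN pipe ON pipe.product_id = p.product_id
     WHERE pipe.diameter_in >= 1.0
     ORDER BY p.sku
""").fetchall()
print(wide_pipes)

# A subtype row cannot exist without its parent product.
try:
    conn.execute("INSERT INTO pipe VALUES (?, ?)", (99, 2.0))
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```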
If you really have a generic database and all sorts of other properties which aren't going to be subject to data integrity rules, you very well may want to consider storing data in an XML column. It's hard to tell you what the correct design choice is unless I know a lot more about your business.
IMO this is a design antipattern. The siren song of this idea has lured many a developer onto the rocks of an unmaintainable application.

I know it is an old one - however there might be other readers...
I have seen the approach of balancing EAV against modeled attributes. Well - it is still EAV. "EAV's are like drugs" is pretty much true. So what about thinking it through once more - and let's be really aggressive:
I still liked the supertype approach, where a lot of tables use the same primary key from a key generator. Let's reuse this one. So what about creating a new table for each set of attributes - all having the primary key from the same key generator? E.g. you would have a table with the fields "color,pipe", another table "fittingsize,pipe", and so on. The requirement "volatility of attributes" screams for a carefully (automatically) maintained data dictionary anyway.
This approach is fully normalized and can be fully automated. You can check whether a specific attribute set has already been materialized as a table by hashing attribute name clusters, e.g. crc32(lower('color~fittingsize~pipe')), where the attribute names need to be sorted alphabetically. Of course this requires storing the hash in the data dictionary. Based on the data dictionary each object can be searched (using 'UNION'), especially if the data dictionary itself is a table. Having the data dictionary as a table also allows you to use its primary (surrogate) key as the basis for unique table names, to end up with tables like 'attributes1', 'attributes2', ... Most databases nowadays support some billion tables - so we are sort of safe on that end as well. You could even have a product catalogue with very common attributes that references the extended attribute tables.
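The hashing step could look something like the sketch below, in Python with the standard zlib module. The function names and the in-memory dict standing in for the data dictionary table are assumptions for illustration. Since CRC32 is only 32 bits wide, the sketch also keeps the canonical string so that a hash collision is detected rather than silently mapping two attribute sets to one table.

```python
import zlib


def canonical_form(attribute_names):
    """Lower-case the names, sort them alphabetically, and join them with '~'."""
    return "~".join(sorted(name.lower() for name in attribute_names))


def attribute_set_hash(attribute_names):
    """CRC32 of the canonical form, so the order of the names does not matter."""
    return zlib.crc32(canonical_form(attribute_names).encode("utf-8"))


# Stand-in for the data dictionary table: hash -> (canonical form, table name).
data_dictionary = {}


def table_for(attribute_names):
    """Return the table for this attribute set, registering a new one if needed."""
    key = attribute_set_hash(attribute_names)
    canonical = canonical_form(attribute_names)
    if key in data_dictionary:
        stored_canonical, table = data_dictionary[key]
        if stored_canonical != canonical:
            raise RuntimeError("crc32 collision between two attribute sets")
        return table
    table = f"attributes{len(data_dictionary) + 1}"
    data_dictionary[key] = (canonical, table)
    return table


print(canonical_form(["Pipe", "Color", "FittingSize"]))
print(table_for(["Pipe", "Color", "FittingSize"]))
print(table_for(["fittingsize", "COLOR", "pipe"]))   # same set, different order
print(table_for(["color", "pipe"]))                  # a different set
```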
An open issue is 1:n data sets. I am afraid you need to sort them out in separate tables. However, this very much depends on your data presentation and querying strategy. Should they always be presented as a comma-separated string attached to the product, or do you want to be able to, e.g., query for all products with a certain Certification?
Before you flame this approach please consider this: it is meant only for use cases where you have a very high volatility of attributes, in quantity and quality. It also presupposes that you cannot know most of the attributes at the point in time when the solution is created. So do not discuss this in a context where you can model your attributes upfront, which would enable you to balance trade-offs much better.

In short, you cannot go all one route. If you use an EAV like your example you will have a myriad of problems like those outlined by the other posters not the least of which will be performance and data integrity. Let me reiterate, that using an EAV as the core of your solution will fail when you get to reporting and analysis. However, as you have also stated, you might have hundreds of attributes that change regularly.
The solution, IMO, is a hybrid. For common attributes, use columns/standard schema. For additional, arbitrary attributes, use an EAV. However, the rule with the EAV data is that you can never, ever, under any circumstances, write a query that includes a sort or filter on an attribute. I.e., you can never write Where AttributeName = 'Foo'. The EAV portion of the schema represents a bag of data that is merely there for tracking purposes. In fact, I have seen many people implement this solution by using Xml for the EAV portion. The moment someone does want to search, filter, sort or place an EAV value in a specific spot on a report, that attribute must be elevated to a top level column in the products table.
The key to this hybrid approach is discipline. It will seem simple enough to add a filter, sort or put an attribute in a specific spot somewhere on a report especially when you get pressure from management. You must resist this temptation. Once you go down the dark path... If you do not think that you can maintain that level of discipline in your development team, then I would not use an EAV. As I've mentioned before, EAV's are like drugs: in small quantities and used under the right circumstances they can be beneficial. Too much will kill you.
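A minimal sketch of the hybrid, using Python's built-in sqlite3 and json modules. JSON text is swapped in for the Xml/EAV bag purely to keep the example short; the column names, the promote helper and the sample rows are assumptions for illustration. Queries only ever touch real columns, and an attribute that someone needs to filter on gets elevated to a column first.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE product (
        product_id INTEGER PRIMARY KEY,
        sku        TEXT NOT NULL UNIQUE,   -- common attribute: real column
        price      REAL NOT NULL,          -- common attribute: real column
        extras     TEXT NOT NULL DEFAULT '{}'   -- opaque bag, never filtered on
    )
""")
conn.executemany(
    "INSERT INTO product (sku, price, extras) VALUES (?, ?, ?)",
    [
        ("P-123", 9.5, json.dumps({"Color": "Blue", "FittingSize": 1.25})),
        ("P-678", 4.0, json.dumps({"Color": "Red"})),
    ],
)


def promote(attribute, column, sql_type):
    """Elevate one key from the extras bag to a real column and backfill it.

    column and sql_type are interpolated into DDL, so they must come from
    trusted code, never from user input.
    """
    conn.execute(f"ALTER TABLE product ADD COLUMN {column} {sql_type}")
    rows = conn.execute("SELECT product_id, extras FROM product").fetchall()
    for product_id, extras in rows:
        bag = json.loads(extras)
        value = bag.pop(attribute, None)   # move the value out of the bag
        conn.execute(
            f"UPDATE product SET {column} = ?, extras = ? WHERE product_id = ?",
            (value, json.dumps(bag), product_id),
        )


# Management now wants to filter by color: promote it, then query the column.
promote("Color", "color", "TEXT")
red = conn.execute("SELECT sku FROM product WHERE color = 'Red'").fetchall()
remaining = json.loads(conn.execute(
    "SELECT extras FROM product WHERE sku = 'P-123'").fetchone()[0])
print(red)
print(remaining)
```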

Rather than have a name-value table, create the usual Product table structure containing all the common attributes, and add an XML column for the attributes that vary by product.
I have used this structure before and it worked quite well.
As @Dave Markle mentions, the name-value approach can lead to a world of pain.
