Setting up Dim and Fact tables for a Data Warehouse

I'm tasked with creating a data warehouse for a client. The tables involved don't really follow the traditional examples out there (product/orders), so I need some help getting started. The client is essentially a processing center for cases (similar to a legal case). Each day, new cases are entered into the DB under the "cases" table. Each column contains some bit of info related to the case. As the case is being processed, additional one-to-many tables are populated with events related to the case. There are quite a few of these event tables; example tables might be case-open, case-dept1, case-dept2, case-dept3, etc. Each of these tables has a caseid which maps back to the "cases" table. There are also a few lookup tables involved as well.
Currently, the reporting needs relate to exposing bottlenecks in the various stages and the granularity is at the hour level for certain areas of the process.
I may be asking too much here, but I'm looking for some direction as to how I should setup my Dim and Fact tables or any other suggestions you might have.

The fact table is the case event and it is 'factless' in that it has no numerical value. The dimensions would be time, event type, case and maybe some others depending on what other data is in the system.
You need to consolidate the event tables into a single fact table, labelled with an 'event type' dimension. The throughput/bottleneck reports are calculating differences between event times for specific combinations of event types on a given case.
The reports should calculate the event-event times and possibly bin them into a histogram. You could also label certain types of event combinations and apply the label to the events of interest. These events could then have the time recorded against them, which would allow slice-and-dice operations on the times with an OLAP tool.
If you want to benchmark certain stages in the life-cycle progression, you would have a table with columns: case type, event type 1, event type 2, benchmark time.
With a bit of massaging, you might be able to use a data mining toolkit or even a simple regression analysis to spot correlations between case attributes and event-event times (YMMV).
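A minimal sketch of that consolidated design and the event-to-event timing query might look like this (all table/column names, data types, and the SQL Server style DATEDIFF are assumptions, not a prescribed design):

CREATE TABLE DimEventType (EventTypeKey INT PRIMARY KEY, EventTypeDesc VARCHAR(100));
CREATE TABLE DimCase      (CaseKey      INT PRIMARY KEY /* case attributes... */);
CREATE TABLE DimTime      (TimeKey      INT PRIMARY KEY, EventTimestamp DATETIME /* hour-level attributes... */);

-- the 'factless' fact: one row per event on a case
CREATE TABLE FactCaseEvent (
    CaseKey      INT NOT NULL REFERENCES DimCase(CaseKey),
    EventTypeKey INT NOT NULL REFERENCES DimEventType(EventTypeKey),
    TimeKey      INT NOT NULL REFERENCES DimTime(TimeKey)
);

-- elapsed hours between two event types on the same case, e.g. 'case-open' to 'dept1-complete'
SELECT e1.CaseKey,
       DATEDIFF(HOUR, t1.EventTimestamp, t2.EventTimestamp) AS HoursElapsed
FROM FactCaseEvent e1
JOIN DimTime       t1  ON t1.TimeKey        = e1.TimeKey
JOIN DimEventType  et1 ON et1.EventTypeKey  = e1.EventTypeKey AND et1.EventTypeDesc = 'case-open'
JOIN FactCaseEvent e2  ON e2.CaseKey        = e1.CaseKey
JOIN DimTime       t2  ON t2.TimeKey        = e2.TimeKey
JOIN DimEventType  et2 ON et2.EventTypeKey  = e2.EventTypeKey AND et2.EventTypeDesc = 'dept1-complete';

Results like these could then be binned into the histogram mentioned above.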

I suggest you check out Kimball's books, particularly this one, which should have some examples to get you thinking about applications to your problem domain.
In any case, you need to decide whether a dimensional model is even appropriate. It is quite possible to treat a 3NF database as an 'enterprise data warehouse' and serve reporting from it with different indexes, summaries, and so on.
Without seeing your current schema, it's REALLY hard to say. It sounds like you will end up with several star models with some conformed dimensions tying them together. So you might have a case dimension as one of your conformed dimensions. The facts from each of the other tables would go into fact tables that link both to the conformed dimensions and to any other dimensions appropriate to those facts. For instance, if there is an employee id in case-open, the case-open fact table would link to a conformed employee dimension. This conformed dimension might be linked several times from several of your subsidiary fact tables.
Kimball's modeling method is fairly straightforward, and can be followed like a recipe. You need to start by identifying all your facts, grouping them into fact tables, identifying individual dimensions on each fact table and then grouping them as appropriate into dimension tables, and identifying the type of each dimension.

Like any other facet of development, you must approach the problem from the end requirements ("user stories" if you will) backwards. The most conservative approach for a warehouse is to simply represent a copy of the transaction database. From there, guided by the requirements, certain optimizations can be made to enhance the performance of certain data access patterns. I believe it is important, however, to see these as optimizations and not assume that a data warehouse automatically must be a complex explosion of every possible dimension over every fact. My experience is that for most purposes, a straight representation is adequate or even ideal for 90+% of analytical queries. For the remainder, first consider indexes, indexed views, additional statistics, or other optimizations that can be made without affecting the structures. Then if aggregation or other redundant structures are needed to improve performance, consider separating these into a "data mart" (at least conceptually) which provides a separation between primitive facts and redundancies thereof. Finally, if the requirements are too fluid and the aggregation demands too heavy to efficiently function this way, then you might consider wholesale explosions of data i.e. star schema. Again though, limit this to the smallest cross section of the data as possible.

Here's what I came up with essentially. Thx NXC
Fact Events
  EventID
  TimeKey
  CaseID

Dim Events
  EventID
  EventDesc

Dim Time
  TimeKey

Dim Regions
  RegionID
  RegionDesc

Cases
  CaseID
  RegionID
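Read as DDL, that outline might look roughly like this (data types and constraints are my assumptions):

CREATE TABLE DimTime    (TimeKey  INT PRIMARY KEY /* hour/date attributes... */);
CREATE TABLE DimEvents  (EventID  INT PRIMARY KEY, EventDesc VARCHAR(100));
CREATE TABLE DimRegions (RegionID INT PRIMARY KEY, RegionDesc VARCHAR(100));
CREATE TABLE Cases      (CaseID   INT PRIMARY KEY, RegionID INT REFERENCES DimRegions(RegionID));

CREATE TABLE FactEvents (
    EventID INT NOT NULL REFERENCES DimEvents(EventID),
    TimeKey INT NOT NULL REFERENCES DimTime(TimeKey),
    CaseID  INT NOT NULL REFERENCES Cases(CaseID)
);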

This may be a case of choosing a solution before you've considered the problem. Not all data warehouses fit the star schema model. I don't see that you are aggregating any data here. So far we have a factless fact table and at least one rapidly changing dimension (cases).
Looking at what I see so far, I think the central entity in this database should be the case. Trying to put the event in the middle doesn't seem right. Try looking at it a different way: perhaps case, events, and case events to start.

Related

Are dimension tables and domain tables the same thing?

I think I know what a domain table is (it basically contains all the possible values that some other column can contain), and I've looked up dimension table in Wikipedia. Unfortunately, I'm having a hard time understanding the description they have there, because they explain it in terms of another piece of jargon: "fact table", which is explained to "consists of the measurements, metrics or facts of a business process." To me, that's very tautological, which is not helpful. Can someone explain this in plain English?
Short version:
Domains represent data you've pulled out of your fact table to make the fact table smaller.
Dimensions represent axes that you've pre-aggregated along for faster querying.
Here is a long version in plain English:
You start with some set of facts. For instance every single sale your company has received, with date, product, price, geographical location, customer name - whatever your full combination of information might be - for each sale. You can put these facts into a huge table.
A large variety of queries that you want to run are in principle some fairly simple query on your fact table. However, your fact table is freaking huge. You need to make the queries faster.
(1) The first trick to making it faster is to move data out of it so it is smaller. So you can take every column that is "long text", put its possible values into a domain table, and replace the original column with an id into that table. This will make your fact table much smaller, and you can still get at your original data if you need it. This makes it much faster to query all rows since they take up less data.
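For example, trick (1) might look like this (a sketch; table and column names are illustrative):

-- the long text lives once in a small domain table
CREATE TABLE product_domain (
    product_id   INT PRIMARY KEY,
    product_name VARCHAR(200) NOT NULL
);

-- the huge fact table keeps only the small id
CREATE TABLE sales_fact (
    sale_date   DATE          NOT NULL,
    product_id  INT           NOT NULL REFERENCES product_domain(product_id),
    customer_id INT           NOT NULL,
    amount      DECIMAL(10,2) NOT NULL
);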
That's fine if you have a small enough data set that querying the whole fact table is acceptably fast. But a lot of companies have too much data for this to be enough, so they have to be smarter.
(2) The second trick to making it faster is to pre-compute queries. Here is one way to do this. Identify a set of dimensions, and then pre-compute along dimensions and combinations of dimensions.
For instance, customer name is one dimension: some queries are per customer name, and others are across all customers. So you can add to your fact table pre-computed facts that have pre-aggregated data across all customers, and customer name has become a dimension.
Another good candidate for a dimension is geographical location. You can add summary records that aggregate, by county, by state, and across all locations. This summarizing is done after you've done the customer name aggregation, and so it will automatically have a record for total sales for all customers in a given zip code.
Repeat for any number of other dimensions.
Now when someone comes along with a query, odds are good that their query can be rewritten to take advantage of your pre-aggregated dimensions to only look at a few pre-aggregated facts, rather than all of the individual sales records. This will greatly speed up queries.
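As a sketch of trick (2), continuing the illustrative sales_fact table above, a pre-aggregated summary along the customer and month dimensions might be built like this (standard-SQL EXTRACT; names are illustrative):

CREATE TABLE sales_by_customer_month (
    customer_id  INT           NOT NULL,
    sale_year    INT           NOT NULL,
    sale_month   INT           NOT NULL,
    total_amount DECIMAL(14,2) NOT NULL,
    PRIMARY KEY (customer_id, sale_year, sale_month)
);

INSERT INTO sales_by_customer_month (customer_id, sale_year, sale_month, total_amount)
SELECT customer_id,
       EXTRACT(YEAR FROM sale_date),
       EXTRACT(MONTH FROM sale_date),
       SUM(amount)
FROM sales_fact
GROUP BY customer_id, EXTRACT(YEAR FROM sale_date), EXTRACT(MONTH FROM sale_date);

A per-customer, per-month query can then read a handful of summary rows instead of scanning every individual sale.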
In practice, this winds up pre-aggregating more than you really need. So people building data warehouses do clever things which let them trade off effort spent pre-aggregating combinations that nobody may want versus run-time effort of having to compute a combination on the fly that could have been done in advance.
You can start with http://en.wikipedia.org/wiki/Star_schema if you want to dig deeper on this topic.
Fact Tables and Dimension Tables, taken together, make up a Star Schema. A Star Schema is a representation, in SQL tables, of a Multidimensional data model. A multidimensional data model stores statistics, "facts", as values in a multidimensional space, where the "location" in each dimension establishes part of the context for the fact. The multidimensional data model was developed in the context of advancing the concept of data warehousing.
Dimension tables provide a key to each dimension, and attributes relevant to that dimension.
An MDDB can be stored in a data cube specially built for that purpose instead of using an SQL (relational) database. Cognos is one vendor that has its own data cube product out there. There are some advantages to using an SQL database and a star schema, over using a special purpose data cube product. There are other advantages to using a data cube product. Sometimes the advantages to the SQL plus Star schema approach outweigh the advantages of a data cube product.
Some of the advantages obtained by normalization can be obtained by designing a Snowflake Schema instead of a Star schema. However, neither star schema nor snowflake schema are going to be free from update anomalies. They are generally used in data warehousing or reporting databases, and copying data from an operational DB into one of these databases is something of a programming challenge. There are tools sold for this purpose.
A Fact table is a table which contains measurements or metrics or facts of business processes. Example:
"monthly sales number" in the Sales business process
"monthly profit amount" in the Profit business process
Most of them are additive (sales, profit), some are semi-additive (balance as of) and some are not additive (unit price).
The level of detail in a fact table is called the "grain" of the table, i.e. the granularity can be fine or coarse. The fact table also contains foreign keys to the dimension tables.
Dimension tables, on the other hand, are tables that contain attributes which help describe the facts in the fact table.
The following are types of Dimension Tables:
Slowly Changing Dimensions
Junk Dimensions
Conformed Dimensions
Degenerate Dimensions
To learn more, you can go through the Data Warehousing Tutorials

Database warehouse design: fact tables and dimension tables

I am building a poor man's data warehouse using a RDBMS. I have identified the key 'attributes' to be recorded as:
sex (true/false)
demographic classification (A, B, C etc)
place of birth
date of birth
weight (recorded daily): The fact that is being recorded
My requirements are to be able to run 'OLAP' queries that allow me to:
'slice and dice'
'drill up/down' the data and
generally, be able to view the data from different perspectives
After reading up on this topic area, the general consensus seems to be that this is best implemented using dimension tables rather than normalized tables.
Assuming that this assertion is true (i.e. the solution is best implemented using fact and dimension tables), I would like to seek some help in the design of these tables.
'Natural' (or obvious) dimensions are:
Date dimension
Geographical location
Which have hierarchical attributes. However, I am struggling with how to model the following fields:
sex (true/false)
demographic classification (A, B, C etc)
The reason I am struggling with these fields is that:
They have no obvious hierarchical attributes which will aid aggregation (AFAIA) - which suggests they should be in a fact table
They are mostly static or very rarely change - which suggests they should be in a dimension table.
Maybe the heuristic I am using above is too crude?
I will give some examples of the type of analysis I would like to carry out on the data warehouse - hopefully that will clarify things further.
I would like to aggregate and analyze the data by sex and demographic classification - e.g. answer questions like:
How do male and female weights compare across different demographic classifications?
Which demographic classification (male AND female) shows the most increase in weight this quarter?
etc.
Can anyone clarify whether sex and demographic classification are part of the fact table, or whether they are (as I suspect) dimension tables?
Also assuming they are dimension tables, could someone elaborate on the table structures (i.e. the fields)?
The 'obvious' schema:
CREATE TABLE sex_type (is_male int);
CREATE TABLE demographic_category (id int, name varchar(4));
may not be the correct one.
Not sure why you feel that using an RDBMS is a poor man's solution, but I hope this may help.
Tables dimGeography and dimDemographic are so-called mini-dimensions; they allow for slicing based on demographic and geography without having to join dimUser, and also to capture user's current demographic and geography at the time of measurement.
And by the way, when in the DW world, be verbose -- Gender = 'female', AgeGroup = '30-35', EducationLevel = 'university', etc.
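The original answer refers to a diagram that isn't reproduced here; as a guess at what the dimDemographic mini-dimension might contain (columns are illustrative):

CREATE TABLE dimDemographic (
    DemographicKey   INT PRIMARY KEY,
    Gender           VARCHAR(20) NOT NULL,  -- 'female', 'male', ...
    AgeGroup         VARCHAR(10) NOT NULL,  -- '30-35', ...
    EducationLevel   VARCHAR(30) NOT NULL,  -- 'university', ...
    DemographicClass CHAR(1)     NOT NULL   -- 'A', 'B', 'C', ...
);

The fact row would carry DemographicKey directly, so slicing by demographic never has to join through dimUser.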
Star schema searches are the SQL equivalent of the intersection points of Venn Diagrams. As your sample queries clearly show, SEX_TYPE and DEMOGRAPHIC_CATEGORY are sets you want to search by and hence must be dimensions.
As for the table structures, I think your design for SEX_TYPE is misguided. For starters it is easier, more intuitive, to design queries on the basis of
where sex_type.name = 'FEMALE'
than
where sex_type.is_male = 1
Besides, in the real world sex is not a boolean. Most applications should gather UNKNOWN and TRANSGENDER as well, and that's certainly true for health/medical apps which is what you seem to be doing. Furthermore, it will avoid some unpleasant office arguments if you have any female co-workers.
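In other words, the SEX_TYPE dimension might look something like this (a sketch; column names and values are illustrative):

CREATE TABLE sex_type (
    id   INT         PRIMARY KEY,
    name VARCHAR(20) NOT NULL   -- 'MALE', 'FEMALE', 'TRANSGENDER', 'UNKNOWN'
);

-- queries then read naturally:
-- SELECT ... FROM fact_weight f JOIN sex_type s ON s.id = f.sex_type_id WHERE s.name = 'FEMALE';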
Edit
"I am thinking of how to deal with
cases of new sex_types and demographic
categories not already in the
database"
There was a vogue for not having foreign keys in Data Warehouses. But they provide useful metadata which a query optimizer can use to derive the most efficient search path. This is particularly important when there is a lot of data and ad hoc queries to process. Dealing with new dimension values is always going to be hard, unless your source systems provide you with notification. This really depends on your set-up.
Generally, all numeric quantities and measures are columns in the fact table(s). Then everything else is a dimensional attribute. Which dimension they belong in is rather pragmatic and depends on the data.
In addition to the suggestions you have already received, I saw no mention of degenerate dimensions. In these cases, things like an invoice number or a sequence number/timestamp which is different for every fact need to be stored in the fact table, otherwise the dimension table will become 1-1 with the fact table.
A key design decision in your case is probably the analysis of data related to age if the study is ongoing. Because people's ages change with time, they will move to another age group at some point. Depending on whether the groups are fixed at the beginning of a study or not, this may determine how you want to aggregate. I'm not necessarily saying you should have a group dimension and get to age through that, but that you may need to determine the correct age/demographic dimension during the ETL. But this depends on the end use (or accommodate both with two dimension roles linked from the fact table - initial demographics, which never changes, and current demographics which will change with time).
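A sketch of that two-role idea, with the fact linking to the demographic dimension twice (names and types are assumptions):

CREATE TABLE FactWeight (
    UserKey               INT          NOT NULL,
    DateKey               INT          NOT NULL,
    InitialDemographicKey INT          NOT NULL,  -- fixed when the subject enters the study
    CurrentDemographicKey INT          NOT NULL,  -- re-derived during ETL as the person ages
    WeightKg              DECIMAL(5,2) NOT NULL
);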
A similar thing could apply with geography. Although you can obviously track a person's geography by analysing current geography changes over time, the point of a dimensional DW is to have all the relevant dimensions linked straight to the fact (things which you might normally derive in a normalized model through the network of an Entity-Relationship model - these get locked in at the time of ETL). This redundancy makes analysis quicker on the dimensional model in traditional RDBMSes.
Note that a lot of this does not apply in massively parallel DWs like Teradata, which don't perform well with star schemas - they like all the data normalized and linked up to the same primary index because they use the primary index to distribute the data over the processing units.
What OLAP / presentation tier tool are you intending to use? These often have features of their own to support building of cubes, hierarchies, aggregations, etc.
Normal Form is usually the most sound basis for a flexible and efficient Data Warehouse, although Marts are sometimes denormalized to support a specific set of reporting requirements. In the absence of any other information I suggest you aim to ensure your database is in at least Boyce-Codd / 5th Normal Form.

Best pattern for storing (product) attributes in SQL Server

We are starting a new project where we need to store product and many product attributes in a database. The technology stack is MS SQL 2008 and Entity Framework 4.0 / LINQ for data access.
The products (and Products Table) are pretty straightforward (a SKU, manufacturer, price, etc..). However there are also many attributes to store with each product (think industrial widgets). These may range from color to certification(s) to pipe size. Every product may have different attributes, and some may have multiples of the same attribute (Ex: Certifications).
The current proposal is that we will basically have a name/value pair table with a FK back to the product ID in each row.
An example of the attributes Table may look like this:
ProdID AttributeName AttributeValue
123 Color Blue
123 FittingSize 1.25
123 Certification AS1111
123 Certification EE2212
123 Certification FM.3
456 Pipe 11
678 Color Red
999 Certification AE1111
...
Note: Attribute name would likely come from a lookup table or enum.
So the main question here is: Is this the best pattern for doing something like this? How will the performance be? Queries will be based on a JOIN of the product and attributes table, and generally need many WHEREs to filter on specific attributes - the most common search will be to find a product based on a set of known/desired attributes.
If anyone has any suggestions or a better pattern for this type of data, please let me know.
Thanks!
-Ed
You are about to re-invent the dreaded EAV model, Entity-Attribute-Value. This is notorious for having problems in real-life, for various reasons, many covered by Dave's answer.
Luckily the SQL Customer Advisory Team (SQLCAT) has a whitepaper on the topic,
Best Practices for Semantic Data Modeling for Performance and Scalability. I highly recommend this paper. Unfortunately, it does not offer a panacea, a cookie-cutter solution, since the problem has no general solution. Instead, you'll learn how to find the balance between a fixed queryable schema and a flexible EAV structure, a balance that works for your specific case:
Semantic data models can be very complex and until semantic databases are commonly available, the challenge remains to find the optimal balance between the pure object model and the pure relational model for each application. The key to success is to understand the issues, make the necessary mitigations for those issues, and then test, test, and test. Scalability testing is a critical success factor if you are going to find that optimal design.
This is going to be problematic for a couple of reasons:
Your entity queries will be much harder to write. Transforming the results of those queries into something resembling a ViewModel when it comes time for presentation is going to be painful because it will involve a pivot for each product.
Understanding what your datatypes will be is going to be tough when it comes time to read certain types of data. Are you planning on storing this as strings? For example, DateTimes hold more data than the default .ToString() implementation writes to the string. You're also going to have issues if you try to store floating-point values.
Your objects' data integrity is at risk. There will be a temptation to put properties which should be just attributes of your main product tables in this "bucket o' data". Maybe the design will be semi-sane to begin with, but I guarantee you that after a certain amount of time, folks will start to just throw properties in the bag. It'll then be very tough to keep your objects' integrity with such a loosely defined structure.
Your indexes will most likely be suboptimal. Again think of a property which should be on your product table. Instead of being able to index on just one column, you will now be forced to make a potentially very large composite index on your "type" table.
Since you're apparently planning to throw out proper datatypes and use strings, the performance of range queries for numeric data will likely be poor.
Your table will get big, slowing backups and queries. Instead of an integer being 4 bytes, you're going to have to store far more for an integer of any size.
Better to normalize the table in a more "traditional" way using "IS-A" relationships. For example, you might have Pipes, which are a type of Product, but have a couple more attributes. You might have Stoves, which are a type of product, but have a couple more attributes still.
If you really have a generic database and all sorts of other properties which aren't going to be subject to data integrity rules, you very well may want to consider storing data in an XML column. It's hard to tell you what the correct design choice is unless I know a lot more about your business.
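To make the first point above concrete: reading the EAV rows from the question back into one row per product requires a pivot along these lines (a sketch; the attribute table name is assumed):

SELECT ProdID,
       MAX(CASE WHEN AttributeName = 'Color'       THEN AttributeValue END) AS Color,
       MAX(CASE WHEN AttributeName = 'FittingSize' THEN AttributeValue END) AS FittingSize
FROM ProductAttributes
GROUP BY ProdID;
-- multi-valued attributes such as Certification don't fit a single pivoted column at all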
IMO this is a design antipattern. The siren song of this idea has lured many a developer onto the rocks of an unmaintainable application.
I know it is an old one - however there might be other readers...
I have seen the approach of balancing EAV against attribute modelling. Well - it is still EAV. "EAV's are like drugs" is pretty much true. So what about thinking it through once more - and let's be really aggressive:
I still liked the supertype approach, where a lot of tables use the same primary key from a key generator. Let's reuse this one. So what about creating a new table for each set of attributes - all having their primary key from the same key generator? E.g. you would have a table with the fields "color, pipe", another table "fittingsize, pipe", and so on. The requirement "volatility of attributes" screams for a carefully (automatically) maintained data dictionary anyway.
This approach is fully normalized and can be fully automated. You can check whether a specific attribute set has already materialized as a table by hashing the attribute name cluster, e.g. crc32(lower('color~fittingsize~pipe')), where the attribute names need to be sorted alphabetically. Of course this requires the hash to be in the data dictionary. Based on the data dictionary, each object can be searched (using 'UNION'), especially if the data dictionary itself is a table. Having the data dictionary as a table also allows you to use its (surrogate) primary key as the basis for unique table names, so you end up with tables like 'attributes1', 'attributes2', ... Most databases nowadays support some billion tables - so we are sort of safe on that end as well. You could even have a product catalogue with very common attributes that references the extended attribute tables.
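A small sketch of the hashing idea (MySQL-style, since CRC32() exists there; the data dictionary table and its columns are illustrative):

CREATE TABLE attribute_set_dictionary (
    set_id          INT          AUTO_INCREMENT PRIMARY KEY,
    set_hash        INT UNSIGNED NOT NULL UNIQUE,
    attribute_names VARCHAR(500) NOT NULL    -- e.g. 'color~fittingsize~pipe', sorted alphabetically
);

-- has a table for this attribute combination already materialized?
SELECT set_id
FROM attribute_set_dictionary
WHERE set_hash = CRC32(LOWER('color~fittingsize~pipe'));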
An open issue is 1:n data sets. I am afraid you need to sort them out into separate tables. However, this very much depends on your data presentation and querying strategy. Should they always be presented as a comma-separated string attached to the product, or do you want to, e.g., be able to query for all products with a certain Certification?
Before you flame this approach, please consider this: it is meant only for use cases where you have a very high volatility of attributes - in quantity and quality. Also, the premise was that you cannot know most of the attributes at the point in time when the solution is created. So do not discuss this in a context where you can model your attributes upfront, which would enable you to balance the trade-offs much better.
In short, you cannot go all one route. If you use an EAV like your example, you will have a myriad of problems like those outlined by the other posters, not the least of which will be performance and data integrity. Let me reiterate that using an EAV as the core of your solution will fail when you get to reporting and analysis. However, as you have also stated, you might have hundreds of attributes that change regularly.
The solution, IMO, is a hybrid. For common attributes, use columns/standard schema. For additional, arbitrary attributes, use an EAV. However, the rule with the EAV data is that you can never, ever, under any circumstances, write a query that includes a sort or filter on an attribute. I.e., you can never write Where AttributeName = 'Foo'. The EAV portion of the schema represents a bag of data that is merely there for tracking purposes. In fact, I have seen many people implement this solution by using Xml for the EAV portion. The moment someone does want to search, filter, sort or place an EAV value in a specific spot on a report, that attribute must be elevated to a top level column in the products table.
The key to this hybrid approach is discipline. It will seem simple enough to add a filter, sort or put an attribute in a specific spot somewhere on a report especially when you get pressure from management. You must resist this temptation. Once you go down the dark path... If you do not think that you can maintain that level of discipline in your development team, then I would not use an EAV. As I've mentioned before, EAV's are like drugs: in small quantities and used under the right circumstances they can be beneficial. Too much will kill you.
Rather than have a name-value table, create the usual Product table structure containing all the common attributes, and add an XML column for the attributes that vary by product.
I have used this structure before and it worked quite well.
As @Dave Markle mentions, the name-value approach can lead to a world of pain.
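A sketch of what that could look like in SQL Server 2008 (column names and the XML shape are illustrative, not a prescribed design):

CREATE TABLE Products (
    ProductID       INT IDENTITY PRIMARY KEY,
    SKU             VARCHAR(50)   NOT NULL,
    Manufacturer    VARCHAR(100)  NOT NULL,
    Price           DECIMAL(10,2) NOT NULL,
    ExtraAttributes XML           NULL       -- the attributes that vary by product
);

-- querying into the XML, assuming a shape like <attrs><attr name="Color">Blue</attr></attrs>
SELECT ProductID
FROM Products
WHERE ExtraAttributes.value('(/attrs/attr[@name="Color"])[1]', 'varchar(50)') = 'Blue';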

To aggregate or not to aggregate, that is the database schema design question

If you're doing min/max/avg queries, do you prefer to use aggregation tables or simply query across a range of rows in the raw table?
This is obviously a very open-ended question and there's no one right answer, so I'm just looking for people's general suggestions. Assume that the raw data table consists of a timestamp, a numeric foreign key (say a user id), and a decimal value (say a purchase amount). Furthermore, assume that there are millions of rows in the table.
I have done both and am torn. On one hand aggregation tables have given me significantly faster queries but at the cost of a proliferation of additional tables. Displaying the current values for an aggregated range either requires dropping entirely back to the raw data table or combining more fine-grained aggregations. I have found that keeping track in the application code of which aggregation table to query, and when, is more work than you'd think, and that schema changes will be required, as the original aggregation ranges will invariably not be enough ("But I wanted to see our sales over the last 3 pay periods!").
On the other hand, querying from the raw data can be punishingly slow but lets me be very flexible about the data ranges. When the range bounds change, I simply change a query rather than having to rebuild aggregation tables. Likewise the application code requires fewer updates. I suspect that if I was smarter about my indexing (i.e. always having good covering indexes), I would be able to reduce the penalty of selecting from the raw data but that's by no means a panacea.
Is there any way I can have the best of both worlds?
We had that same problem and ran into the same issues you ran into. We ended up switching our reporting to Analysis Services. There is a learning curve with MDX and Analysis services itself, but it's been great. Some of the benefits we have found are:
You have a lot of flexibility for querying any way you want. Before we had to build specific aggregates, but now one cube answers all our questions.
Storage in a cube is far smaller than the detailed data.
Building and processing the cubes takes less time and produces less load on the database servers than the aggregates did.
Some CONS:
There is a learning curve around building cubes and learning MDX.
We had to create some tools to automate working with the cubes.
UPDATE:
Since you're using MySQL, you could take a look at Pentaho Mondrian, which is an open source OLAP solution that supports MySQL. I've never used it though, so I don't know if it will work for you or not. I'd be interested in knowing if it works for you, though.
It helps to pick a good primary key (i.e. [user_id, used_date, used_time]). For a constant user_id it's then very fast to do a range condition on used_date.
But as the table grows, you can reduce your table size by aggregating into a table like [user_id, used_date]. For every range where the time of day doesn't matter, you can then use that table. Another way to reduce the table size is archiving old data that you don't (allow) query anymore.
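A sketch of that rolled-up table, assuming a hypothetical raw table raw_usage(user_id, used_date, used_time, amount) along the lines described in the question:

CREATE TABLE usage_by_day (
    user_id      INT           NOT NULL,
    used_date    DATE          NOT NULL,
    total_amount DECIMAL(12,2) NOT NULL,
    PRIMARY KEY (user_id, used_date)
);

INSERT INTO usage_by_day (user_id, used_date, total_amount)
SELECT user_id, used_date, SUM(amount)
FROM raw_usage
GROUP BY user_id, used_date;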
I always lean towards raw data. Once aggregated, you can't go back.
Nothing to do with deletion - unless there's the simplest of aggregated data sets, you can't accurately revert/transpose the data back to raw.
Ideally, I'd use a materialized view (assuming that the data can fit within the constraints) because it is effectively a table. But MySQL doesn't support them, so the next consideration would be a view with the computed columns, or a trigger to update an actual table.
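For the trigger option, a minimal MySQL sketch (reusing the illustrative raw_usage and usage_by_day tables from the earlier sketch) might be:

-- keep the daily summary current as raw rows arrive
CREATE TRIGGER raw_usage_after_insert
AFTER INSERT ON raw_usage
FOR EACH ROW
    INSERT INTO usage_by_day (user_id, used_date, total_amount)
    VALUES (NEW.user_id, NEW.used_date, NEW.amount)
    ON DUPLICATE KEY UPDATE total_amount = total_amount + NEW.amount;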
This is an old question, but currently I found this useful; it was answered by a MicroStrategy engineer.
BTW, there are also existing solutions (like cube.dev and Dremio), so you don't have to build this yourself.

Dealing with "hypernormalized" data

My employer, a small office supply company, is switching suppliers and I am looking through their electronic content to come up with a robust database schema; our previous schema was pretty much just thrown together without any thought at all, and it's pretty much led to an unbearable data model with corrupt, inconsistent information.
The new supplier's data is much better than the old one's, but their data is what I would call hypernormalized. For example, their product category structure has 5 levels: Master Department, Department, Class, Subclass, Product Block. In addition the product block content has the long description, search terms and image names for products (the idea is that a product block contains a product and all variations - e.g. a particular pen might come in black, blue or red ink; all of these items are essentially the same thing, so they apply to a single product block). In the data I've been given, this is expressed as the products table (I say "table" but it's a flat file with the data) having a reference to the product block's unique ID.
I am trying to come up with a robust schema to accommodate the data I'm provided with, since I'll need to load it relatively soon, and the data they've given me doesn't seem to match the type of data they provide for demonstration on their sample website (http://www.iteminfo.com). In any event, I'm not looking to reuse their presentation structure so it's a moot point, but I was browsing the site to get some ideas of how to structure things.
What I'm unsure of is whether or not I should keep the data in this format, or for example consolidate Master/Department/Class/Subclass into a single "Categories" table, using a self-referencing relationship, and link that to a product block (product block should be kept separate as it's not a "category" as such, but a group of related products for a given category). Currently, the product blocks table references the subclass table, so this would change to "category_id" if I consolidate them together.
I am probably going to be creating an e-commerce storefront making use of this data with Ruby on Rails (or that's my plan, at any rate) so I'm trying to avoid getting snagged later on or having a bloated application - maybe I'm giving it too much thought but I'd rather be safe than sorry; our previous data was a real mess and cost the company tens of thousands of dollars in lost sales due to inconsistent and inaccurate data. Also I am going to break from the Rails conventions a little by making sure that my database is robust and enforces constraints (I plan on doing it at the application level, too), so that's something I need to consider as well.
How would you tackle a situation like this? Keep in mind that I have the data to be loaded already in flat files that mimic a table structure (I have documentation saying which columns are which and what references are set up); I'm trying to decide if I should keep them as normalized as they currently are, or if I should look to consolidate; I need to be aware of how each method will affect the way I program the site using Rails, since if I do consolidate, there will be essentially 4 "levels" of categories in a single table, but that definitely seems more manageable than separate tables for each level, since apart from Subclass (which directly links to product blocks) they don't do anything except show the next level of category under them. I'm always at a loss for the "best" way to handle data like this - I know of the saying "Normalize until it hurts, then denormalize until it works" but I've never really had to implement it until now.
I would prefer the "hypernormalized" approach over a denormal data model. The self referencing table you mentioned might reduce the number of tables down and simplify life in some ways, but in general this type of relationship can be tricky to deal with. Hierarchical queries become a pain, as does mapping an object model to this (if you decide to go that route).
A couple of extra joins is not going to hurt and will keep the application more maintainable. Unless performance degrades due to the excessive number of joins, I would opt to leave things like they are. As an added bonus if any of these levels of tables needed additional functionality added, you will not run into issues because you merged them all into the self referencing table.
I totally disagree with the criticisms about self-referencing table structures for parent-child hierarchies. The linked list structure makes UI and business layer programming easier and more maintainable in most cases, since linked lists and trees are the natural way to represent this data in languages that the UI and business layers would typically be implemented in.
The criticism about the difficulty of maintaining data integrity constraints on these structures is perfectly valid, though the simple solution is to use a closure table that hosts the harder check constraints. The closure table is easily maintained with triggers.
The tradeoff is a little extra complexity in the DB (closure table and triggers) for a lot less complexity in UI and business layer code.
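A minimal closure-table sketch (names are illustrative; the triggers that keep it in sync are not shown):

CREATE TABLE Category (
    CategoryID INT          PRIMARY KEY,
    ParentID   INT          NULL REFERENCES Category(CategoryID),
    Name       VARCHAR(100) NOT NULL
);

CREATE TABLE CategoryClosure (
    AncestorID   INT NOT NULL REFERENCES Category(CategoryID),
    DescendantID INT NOT NULL REFERENCES Category(CategoryID),
    Depth        INT NOT NULL,                -- 0 = self, 1 = direct child, ...
    PRIMARY KEY (AncestorID, DescendantID)
);

-- all descendants of category 42, with no recursive query needed
SELECT c.*
FROM CategoryClosure cc
JOIN Category c ON c.CategoryID = cc.DescendantID
WHERE cc.AncestorID = 42
  AND cc.Depth > 0;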
If I understand correctly, you want to take their separate tables and turn them into a hierarchy that's kept in a single table with a self-referencing FK.
This is generally a more flexible approach (for example, if you want to add a fifth level), BUT SQL and relational data models don't tend to work well with linked lists like this, even with newer syntax like MS SQL Server's CTEs. Admittedly, CTEs make it much better though.
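For illustration, walking a self-referencing Categories table with a recursive CTE might look like this (SQL Server syntax; table and column names are assumptions):

WITH CategoryTree AS (
    SELECT CategoryID, ParentID, Name, 0 AS Depth
    FROM Categories
    WHERE ParentID IS NULL                 -- start at the master departments
    UNION ALL
    SELECT c.CategoryID, c.ParentID, c.Name, t.Depth + 1
    FROM Categories c
    JOIN CategoryTree t ON c.ParentID = t.CategoryID
)
SELECT * FROM CategoryTree;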
It can be difficult and costly to enforce things, like that a product must always be on the fourth level of the hierarchy, etc.
If you do decide to do it this way, then definitely check out Joe Celko's SQL for Smarties, which I believe has a section or two on modeling and working with hierarchies in SQL or better yet get his book that is devoted to the subject (Joe Celko's Trees and Hierarchies in SQL for Smarties).
Normalization implies data integrity, that is: each normal form reduces the number of situations where your data is inconsistent.
As a rule, denormalization has a goal of faster querying, but leads to increased space, increased DML time, and, last but not least, increased efforts to make data consistent.
One usually writes code faster (writes faster, not the code faster) and the code is less prone to errors if the data is normalized.
Self-referencing tables almost always turn out to be much worse to query and perform worse than normalized tables. Don't do it. It may look to you to be more elegant, but it is not and is a very poor database design technique. Personally, the structure you described sounds just fine to me - not hypernormalized. A properly normalized database (with foreign key constraints as well as default values, triggers (if needed for complex rules) and data validation constraints) is also far likelier to have consistent and accurate data. I agree about having the database enforce the rules; likely this is part of why the last application had bad data, because the rules were not enforced in the proper place and people were able to easily get around them. Not that the application shouldn't check as well (no point even sending an invalid date, for instance, for the database to fail on insert). Since you are redesigning, I would put more time and effort into designing the necessary constraints and choosing the correct data types (do not store dates as string data, for instance), than in trying to make the perfectly ordinary normalized structure look more elegant.
I would bring it in as close to their model as possible (and if at all possible, I would get files which match their schema - not a flattened version). If you bring the data directly into your model, what happens if data they send starts to break assumptions in the transformation to your internal application's model?
Better to bring their data in, run sanity checks and check that assumptions are not violated. Then if you do have an application-specific model, transform it into that for optimal use by your application.
Don't denormalize. Trying to achieve a good schema design by denormalizing is like trying to get to San Francisco by driving away from New York. It doesn't tell you which way to go.
In your situation, you want to figure out what a normalized schema would like. You can base that largely on the source schema, but you need to learn what the functional dependencies (FD) in the data are. Neither the source schema nor the flattened files are guaranteed to reveal all the FDs to you.
Once you know what a normalized schema would look like, you now need to figure out how to design a schema that meets your needs. If that schema is somewhat less than fully normalized, so be it. But be prepared for difficulties in programming the transformation between the data in the flattened files and the data in your designed schema.
You said that previous schemas at your company cost it tens of thousands of dollars due to inconsistency and inaccuracy. The more normalized your schema is, the more protected you are from internal inconsistency. This leaves you free to be more vigilant about inaccuracy. Consistent data that's consistently wrong can be as misleading as inconsistent data.
Is your storefront (or whatever it is you're building, not quite clear on that) always going to be using data from this supplier? Might you ever change suppliers or add additional, different suppliers?
If so, design a general schema that meets your needs, and map the vendor data to it. Personally I'd rather suffer the (incredibly minor) 'pain' of a self-referencing Category (hierarchical) table than maintain four (apparently semi-useless) levels of Category variants and then next year find out they've added a 5th, or introduced a product line with only three...
For me, the real question is: what fits the model better?
It's like comparing a Tuple and a List.
Tuples are a fixed size and are heterogeneous -- they are "hypernormalized".
Lists are an arbitrary size and are homogeneous.
I use a Tuple when I need a Tuple and a List when I need a List; they fundamentally serve different purposes.
In this case, since the product structure is already well defined (and I assume not likely to change) then I would stick with the "Tuple approach". The real power/use of a List (or recursive table pattern) is when you need it to expand to an arbitrary depth, such as for a BOM or a genealogy tree.
I use both approaches in some of my databases, depending upon the need. However, there is also the "hidden cost" of a recursive pattern, which is that not all ORMs (not sure about AR) support it well. Many modern DBs have support for "join-throughs" (Oracle), hierarchy IDs (SQL Server) or other recursive patterns. Another approach is to use a set-based hierarchy (which generally relies on triggers/maintenance). In any case, if the ORM used does not support recursive queries well, then there may be the extra "cost" of using the DB features directly - either in terms of manual query/view generation or management such as triggers. If you don't use a funky ORM, or simply use a logic separator such as iBatis, then this issue may not even apply.
As far as performance, on new Oracle or SQL Server (and likely others) RDBMS, it ought to be very comparable so that would be the least of my worries: but check out the solutions available for your RDBMS and portability concerns.
Everybody who recommends against introducing a hierarchy in the database is considering only the option of a self-referencing table. That is not the only way to model a hierarchy in the database.
You may use a different approach, that provides you with easier and faster querying without using recursive queries.
Let's say you have a big set of nodes (categories) in your hierarchy:
Set1 = (Node1 Node2 Node3...)
Any node in this set can also be another set by itself, that contains other nodes or nested sets:
Node1=(Node2 Node3=(Node4 Node5=(Node6) Node7))
Now, how can we model that? Let's have each node have two attributes that set the boundaries of the nodes it contains:
Node = { Id: int, Min: int, Max: int }
To model our hierarchy, we just assign those min/max values accordingly:
Node1 = { Id = 1, Min = 1, Max = 10 }
Node2 = { Id = 2, Min = 2, Max = 2 }
Node3 = { Id = 3, Min = 3, Max = 9 }
Node4 = { Id = 4, Min = 4, Max = 4 }
Node5 = { Id = 5, Min = 5, Max = 7 }
Node6 = { Id = 6, Min = 6, Max = 6 }
Node7 = { Id = 7, Min = 8, Max = 8 }
Now, to query all nodes under the Set/Node5:
select n.* from Nodes as n, Nodes as s
where s.Id = 5 and s.Min < n.Min and n.Max < s.Max
The only resource-consuming operation would be if you want to insert a new node, or move some node within the hierarchy, as many records will be affected, but this is fine, as the hierarchy itself does not change very often.
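For example, continuing the numbering above, one way to insert a new leaf Node8 under Node5 would be to shift everything to the right of Node6 and then insert (a sketch):

UPDATE Nodes SET Min = Min + 1 WHERE Min > 6;   -- Node7's Min moves from 8 to 9
UPDATE Nodes SET Max = Max + 1 WHERE Max > 6;   -- Node5, Node3, Node1 (and Node7) have their Max increased by one
INSERT INTO Nodes (Id, Min, Max) VALUES (8, 7, 7);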