I am building a poor man's data warehouse using an RDBMS. I have identified the key 'attributes' to be recorded as:
sex (true/false)
demographic classification (A, B, C etc)
place of birth
date of birth
weight (recorded daily): The fact that is being recorded
My requirements are to be able to run 'OLAP' queries that allow me to:
'slice and dice'
'drill up/down' the data and
generally, be able to view the data from different perspectives
After reading up on this topic area, the general consensus seems to be that this is best implemented using dimension tables rather than normalized tables.
Assuming that this assertion is true (i.e. the solution is best implemented using fact and dimension tables), I would like to seek some help in the design of these tables.
'Natural' (or obvious) dimensions are:
Date dimension
Geographical location
Both of these have hierarchical attributes. However, I am struggling with how to model the following fields:
sex (true/false)
demographic classification (A, B, C etc)
The reason I am struggling with these fields is that:
They have no obvious hierarchical attributes which will aid aggregation (AFAIA) - which suggests they should be in a fact table
They are mostly static or very rarely change - which suggests they should be in a dimension table.
Maybe the heuristic I am using above is too crude?
I will give some examples of the type of analysis I would like to carry out on the data warehouse - hopefully that will clarify things further.
I would like to aggregate and analyze the data by sex and demographic classification - e.g. answer questions like:
How do male and female weights compare across different demographic classifications?
Which demographic classification (male AND female) shows the greatest increase in weight this quarter?
etc.
Can anyone clarify whether sex and demographic classification are part of the fact table, or whether they are (as I suspect) dimension tables?
Also assuming they are dimension tables, could someone elaborate on the table structures (i.e. the fields)?
The 'obvious' schema:
CREATE TABLE sex_type (is_male int);
CREATE TABLE demographic_category (id int, name varchar(4));
may not be the correct one.
Not sure why you feel that using an RDBMS is a poor man's solution, but hope this may help.
Tables dimGeography and dimDemographic are so-called mini-dimensions; they allow for slicing based on demographic and geography without having to join dimUser, and also capture the user's current demographic and geography at the time of measurement.
And by the way, in the DW world, be verbose -- Gender = 'female', AgeGroup = '30-35', EducationLevel = 'university', etc.
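For illustration only, here is a minimal sketch of what such a mini-dimension and the weight fact table might look like. The names dimDemographic, dimGeography and dimUser come from the paragraph above; the column names and the factWeight table itself are assumptions.

-- Sketch only: a demographic mini-dimension with verbose attribute values,
-- and a fact table that references it alongside the other dimensions.
CREATE TABLE dimDemographic (
    DemographicKey  int PRIMARY KEY,
    Gender          varchar(20),   -- 'female', 'male', ...
    AgeGroup        varchar(10),   -- '30-35', ...
    EducationLevel  varchar(30)    -- 'university', ...
);

CREATE TABLE factWeight (
    UserKey         int,           -- FK to dimUser
    DateKey         int,           -- FK to the date dimension
    GeographyKey    int,           -- FK to dimGeography
    DemographicKey  int,           -- FK to dimDemographic, as of the measurement date
    WeightKg        decimal(6,2)   -- the measured fact
);

Because the demographic key is stored on each fact row, a slice by AgeGroup or Gender never needs to touch dimUser.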
Star schema searches are the SQL equivalent of the intersection points of Venn Diagrams. As your sample queries clearly show, SEX_TYPE and DEMOGRAPHIC_CATEGORY are sets you want to search by and hence must be dimensions.
As for the table structures, I think your design for SEX_TYPE is misguided. For starters, it is easier and more intuitive to design queries on the basis of
where sex_type.name = 'FEMALE'
than
where sex_type.is_male = 1
Besides, in the real world sex is not a boolean. Most applications should cater for UNKNOWN and TRANSGENDER as well, and that's certainly true for health/medical apps, which is what you seem to be doing. Furthermore, it will avoid some unpleasant office arguments if you have any female co-workers.
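For illustration, a sketch of a sex/gender dimension along those lines (the exact column names and value list are assumptions, not a prescription):

-- Sketch only: a small dimension with descriptive names rather than a boolean
CREATE TABLE sex_type (
    id    int PRIMARY KEY,
    name  varchar(20) NOT NULL     -- 'FEMALE', 'MALE', 'TRANSGENDER', 'UNKNOWN'
);

INSERT INTO sex_type (id, name)
VALUES (1, 'FEMALE'), (2, 'MALE'), (3, 'TRANSGENDER'), (4, 'UNKNOWN');

Queries can then filter on sex_type.name = 'FEMALE', and adding a new value later is just another row.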
Edit
"I am thinking of how to deal with
cases of new sex_types and demographic
categories not already in the
database"
There was a vogue for not having foreign keys in Data Warehouses. But they provide useful metadata which a query optimizer can use to derive the most efficient search path. This is particularly important when there is a lot of data and ad hoc queries to process. Dealing with new dimension values is always going to be hard, unless your source systems provide you with notification. This really depends on your set-up.
Generally, all numeric quantities and measures are columns in the fact table(s). Then everything else is a dimensional attribute. Which dimension they belong in is rather pragmatic and depends on the data.
In addition to the suggestions you have already received, I saw no mention of degenerate dimensions. In these cases, things like an invoice number or a sequence-number timestamp that is different for every fact need to be stored in the fact table; otherwise the dimension table would become 1-1 with the fact table.
A key design decision in your case is probably the analysis of data related to age, if the study is ongoing. Because people's ages change with time, they will move to another age group at some point. Depending on whether the groups are fixed at the beginning of a study or not, this may determine how you want to aggregate. I'm not necessarily saying you should have a group dimension and get to age through that, but that you may need to determine the correct age/demographic dimension during the ETL. This depends on the end use (or accommodate both with two dimension roles linked from the fact table - initial demographics, which never change, and current demographics, which will change with time).
A similar thing could apply with geography. Although you can obviously track a person's geography by analysing current geography changes over time, the point of a dimensional DW is to have all the relevant dimensions linked straight to the fact (things which you might normally derive in a normalized model through the network of an Entity-Relationship model - these get locked in at the time of ETL). This redundancy makes analysis quicker on the dimensional model in traditional RDBMSes.
Note that a lot of this does not apply in massively parallel DWs like Teradata, which don't perform well with star schemas - they like all the data normalized and linked up to the same primary index, because they use the primary index to distribute the data over the processing units.
What OLAP / presentation tier tool are you intending to use? These often have features of their own to support building of cubes, hierarchies, aggregations, etc.
Normal Form is usually the most sound basis for a flexible and efficient Data Warehouse, although Marts are sometimes denormalized to support a specific set of reporting requirements. In the absence of any other information I suggest you aim to ensure your database is in at least Boyce-Codd / 5th Normal Form.
Related
So I'm trying to learn some basic database design principles and decided to download a copy of the sr27 database provided by the USDA. The database is storing nutritional information on food, and statistical information on how these nutritional values were derived.
When I first started this project, my thoughts were: well, I want to be able to search for food names, and I will probably want to do some basic statistical modeling on the most common nutritional values like calories, proteins, fats, etc. So, the thought was simple, just make 3 tables that look like this:
One table for food names
One table for common nutritional values (1-1 relationship with names)
One table for other nutritional values (1-1 relationship with names)
However, it's not clear that this is even necessary. Do you gain anything from partitioning the columns (or values) based on the idea: I like to do searches on names, so let's keep that as one table for less overhead, and I like to do calculations on common nutritional values, so let's keep that as another table. (Question 1) Or does proper indexing make this moot?
My next question is then: Why in the world did the USDA decide to use 12 tables? Is this considered good database design practice, or would they have been better off merging a lot of these tables? (this excerpt is taken from the PDF provided in the USDA link above, pg 29)
"Do you gain anything from partitioning the columns (or values) based on the idea: I like to do searches on names, so let's keep that as one table for less overhead, and I like to do calculations on common nutritional values, so let's keep that as another table. (Question 1) Or does proper indexing make this moot?"
If you just have a list of items and you want to summarize on just some of them, then indexing is the way to address performance, not arbitrarily splitting some columns into another table.
Also, do read up on Normalization.
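As a hedged illustration (the food_item table and its columns are invented, not the actual SR27 layout), indexes on the searched and summarized columns let a single table serve both workloads:

-- Invented names: index the columns you search and calculate on
CREATE INDEX idx_food_name   ON food_item (long_description);
CREATE INDEX idx_food_energy ON food_item (energy_kcal);

-- Name searches and calculations then both work off the same table
SELECT AVG(energy_kcal)
FROM   food_item
WHERE  long_description LIKE 'Cheese%';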
"My next question is then: Why in the world did the USDA decide to use 12 tables? Is this considered good database design practice, or would they have been better off merging a lot of these tables? (this excerpt is taken from the PDF provided in the USDA link above, pg 29)"
Probably because the types of questions they want to ask are not exactly the same ones you are trying to ask.
They clearly have more info about each food - like groups, nutrients, weights, and they are also apparently tracking where the source data is coming from...
There are important rules for designing relational databases - normal forms - that reduce certain anomalies and cut down on I/O operations. This design is usual for OLTP databases - and I have had occasion to see terribly slow databases built by developers with zero knowledge of it. Analytical (OLAP) databases are a little different - wide tables are used there, and some modern OLAP databases with column stores support them.
PostgreSQL is a classic row-store database, so putting everything in one table is not common and is not a good strategy. You can use views to expose typical, often-used perspectives on the data, so that the complex schema can be invisible (transparent) to you.
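For example, a sketch of such a view (the table and column names here are assumed, not the actual SR27 schema):

-- Assumed names: hide the joins behind a view so the data reads like one wide table
CREATE VIEW food_nutrition AS
SELECT f.food_id,
       f.long_description,
       n.nutrient_name,
       v.amount
FROM   food_description    f
JOIN   nutrient_value      v ON v.food_id = f.food_id
JOIN   nutrient_definition n ON n.nutrient_id = v.nutrient_id;

Queries against food_nutrition then look like queries against a single table, while the normalized schema stays intact underneath.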
Most articles about OLAP are written with a transactional system in mind (think customers/orders). And the few that aren't like that, specifically those that explain 'factless fact tables' are usually based around some kind of event, see http://www.1keydata.com/datawarehousing/factless-fact-table.html
What about databases that only house demographic data? For example, one table with a customer id, and several related tables, each with one customer attribute such as citizenship, age, etc.
If I want to create a model where the user can slice and aggregate on these extended customer attributes, does it even make sense to build a cube, or is that the wrong tool for the job?
Would the citizenship table be a dimension and the customer table be a fact?
What numeric fact are you aggregating? The only thing you can aggregate here is the count. Generally, without some kind of numeric fact to aggregate, there really isn't much you can analyse in OLAP except for drilling up/down dimensions. In this case it's more of a data discovery function.
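A quick sketch of what that counting looks like (table and column names are assumptions): the customer table plays the role of a factless fact, citizenship is a dimension, and the only measure is the row count.

-- Assumed names: the "measure" is just COUNT(*) over the customer rows
SELECT c.citizenship_name,
       COUNT(*) AS customer_count
FROM   customer    cu
JOIN   citizenship c ON c.citizenship_id = cu.citizenship_id
GROUP  BY c.citizenship_name;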
I think I know what a domain table is (it basically contains all the possible values that some other column can contain), and I've looked up dimension table in Wikipedia. Unfortunately, I'm having a hard time understanding the description they have there, because they explain it in terms of another piece of jargon: "fact table", which is explained to "consists of the measurements, metrics or facts of a business process." To me, that's very tautological, which is not helpful. Can someone explain this in plain English?
Short version:
Domains represent data you've pulled out of your fact table to make the fact table smaller.
Dimensions represent axes that you've pre-aggregated along for faster querying.
Here is a long version in plain English:
You start with some set of facts. For instance every single sale your company has received, with date, product, price, geographical location, customer name - whatever your full combination of information might be - for each sale. You can put these facts into a huge table.
A large variety of queries that you want to run are in principle some fairly simple query on your fact table. However, your fact table is freaking huge. You need to make the queries faster.
(1) The first trick to making it faster is to move data out of it so it is smaller. So you can take every column that is "long text", put its possible values into a domain table, and replace the original column with an id into that table. This will make your fact table much smaller, and you can still get at your original data if you need it. This makes it much faster to query all rows since they take up less data.
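A minimal sketch of this trick (all names invented): the long product description lives once in a domain table, and the fact table stores only its id.

-- Invented names: domain table plus a slimmed-down fact table
CREATE TABLE product_domain (
    product_id    int PRIMARY KEY,
    product_name  varchar(200)
);

CREATE TABLE sales_fact (
    sale_date    date,
    product_id   int,             -- replaces the long "product name" text column
    customer_id  int,
    location_id  int,
    sale_amount  decimal(12,2)
);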
That's fine if you have a small enough data set that querying the whole fact table is acceptably fast. But a lot of companies have too much data for this to be enough, so they have to be smarter.
(2) The second trick to making it faster is to pre-compute queries. Here is one way to do this. Identify a set of dimensions, and then pre-compute along dimensions and combinations of dimensions.
For instance, customer name is one dimension: some queries are per customer name, and others are across all customers. So you can add to your fact table pre-computed facts that have pre-aggregated data across all customers, and customer name has become a dimension.
Another good candidate for a dimension is geographical location. You can add summary records that aggregate, by county, by state, and across all locations. This summarizing is done after you've done the customer name aggregation, and so it will automatically have a record for total sales for all customers in a given zip code.
Repeat for any number of other dimensions.
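As a sketch of one such pre-aggregation (reusing the invented sales_fact from above, and assuming a sentinel customer_id of -1 to mark "all customers"):

-- Invented names: add summary rows across all customers to the fact table
INSERT INTO sales_fact (sale_date, product_id, customer_id, location_id, sale_amount)
SELECT sale_date, product_id, -1, location_id, SUM(sale_amount)
FROM   sales_fact
WHERE  customer_id <> -1          -- aggregate only the detail rows
GROUP  BY sale_date, product_id, location_id;

A query "across all customers" then just filters on customer_id = -1 instead of scanning every detail row.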
Now when someone comes along with a query, odds are good that their query can be rewritten to take advantage of your pre-aggregated dimensions to only look at a few pre-aggregated facts, rather than all of the individual sales records. This will greatly speed up queries.
In practice, this winds up pre-aggregating more than you really need. So people building data warehouses do clever things which let them trade off effort spent pre-aggregating combinations that nobody may want versus run-time effort of having to compute a combination on the fly that could have been done in advance.
You can start with http://en.wikipedia.org/wiki/Star_schema if you want to dig deeper on this topic.
Fact Tables and Dimension Tables, taken together, make up a Star Schema. A Star Schema is a representation, in SQL tables, of a Multidimensional data model. A multidimensional data model stores statistics, "facts", as values in a multidimensional space, where the "location" in each dimension establishes part of the context for the fact. The multidimensional data model was developed in the context of advancing the concept of data warehousing.
Dimension tables provide a key to each dimension, and attributes relevant to that dimension.
An MDDB can be stored in a data cube specially built for that purpose instead of using an SQL (relational) database. Cognos is one vendor that has its own data cube product out there. There are some advantages to using an SQL database and a star schema, over using a special purpose data cube product. There are other advantages to using a data cube product. Sometimes the advantages to the SQL plus Star schema approach outweigh the advantages of a data cube product.
Some of the advantages obtained by normalization can be obtained by designing a Snowflake Schema instead of a Star schema. However, neither star schema nor snowflake schema are going to be free from update anomalies. They are generally used in data warehousing or reporting databases, and copying data from an operational DB into one of these databases is something of a programming challenge. There are tools sold for this purpose.
A fact table is a table which contains the measurements, metrics, or facts of business processes. Examples:
"monthly sales number" in the Sales business process
"monthly profit amount" in the Profit business process
Most of them are additive (sales, profit), some are semi-additive (balance as of) and some are not additive (unit price).
The level of detail in a fact table is called the "grain" of the table, i.e. the granularity can be fine or coarse. A fact table also contains foreign keys to the dimension tables.
Dimension tables, on the other hand, are the tables which contain attributes that help describe the facts in the fact table.
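As a generic illustration (names invented, not tied to any particular product), the foreign keys tie each fact row to the dimensions that describe it, and the measures differ in how they can be added up:

-- Invented names: grain here is one row per product per day
CREATE TABLE dim_date (
    date_key        int PRIMARY KEY,
    calendar_date   date,
    calendar_month  int,
    calendar_year   int
);

CREATE TABLE dim_product (
    product_key   int PRIMARY KEY,
    product_name  varchar(100),
    category      varchar(50)
);

CREATE TABLE fact_sales (
    date_key      int REFERENCES dim_date (date_key),
    product_key   int REFERENCES dim_product (product_key),
    sales_amount  decimal(12,2),   -- additive measure
    unit_price    decimal(12,2)    -- non-additive measure
);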
The following are types of Dimension Tables:
Slowly Changing Dimensions
Junk Dimensions
Conformed Dimensions
Degenerate Dimensions
To know more, you can go through the Data Warehousing Tutorials.
I understand that cubes are optimized data structures for aggregating and "slicing" large amounts of data. I just don't know how they are implemented.
I can imagine a lot of this technology is proprietary, but are there any resources that I could use to start implementing my own cube technology?
Set theory and lots of math are probably involved (and welcome as suggestions!), but I'm primarily interested in implementations: the data structures and query algorithms.
Thanks!
There is a fantastic book that describes many internal details of SSAS implementation, including storage and query mechanism details:
http://www.amazon.com/Microsoft-Server-Analysis-Services-Unleashed/dp/0672330016
In a star-schema database, facts are usually acquired and stored at the finest grain.
So let's take the SalesFact example from Figure 10 in http://www.ciobriefings.com/Publications/WhitePapers/DesigningtheStarSchemaDatabase/tabid/101/Default.aspx
Right now, the grain is Product, Time (at a day granularity), Store.
Let's say you want that rolled up by month, pre-aggregated (this particular example is very unlikely to need pre-aggregation, but if the sales were detailed by customer, by minute, pre-aggregation might be necessary).
Then you would have a SalesFactMonthly (or add a grain discrimination to the existing fact table since the dimensions are the same - sometimes in aggregation, you may actually lose dimensions just like you can lose grain, for instance if you only wanted by store and not by product).
ProductID
TimeID (only linking to DayOfMonth = 1)
StoreID
SalesDollars
And you would get this by doing:
INSERT INTO SalesFactMonthly (ProductID, TimeID, StoreID, SalesDollars)
SELECT sf.ProductID
,(SELECT TimeID FROM TimeDimension WHERE Year = td.Year AND Month = td.Month AND DayOfMonth = 1) -- One way to find the single month dimension row
,sf.StoreID
,SUM(sf.SalesDollars)
FROM SalesFact AS sf
INNER JOIN TimeDimension AS td
ON td.TimeID = sf.TimeID
GROUP BY sf.ProductID, sf.StoreID, td.Year, td.Month -- every non-aggregated column must appear in the GROUP BY
What happens in cubes is you basically have fine-grain stars and pre-aggregates together - but every implementation is proprietary - sometimes you might not even have the finest-grain data in the cube, so it can't be reported on. But every way you might want to slice the data needs to be stored at that grain, otherwise you can't produce analysis that way.
Generally, a data warehouse uses a relational database, but the tables aren't normalized like an operational relational database.
A data warehouse is subject oriented. Data warehouse subject tables usually have the following characteristics:
Many indexes.
No joins, except to lookup tables.
Duplicated data; the subject table is highly denormalized.
Contains derived and aggregated information.
The database tables in a data warehouse are arranged in a star schema. A star schema is basically one subject table with an array of look up tables. The keys of the look up tables are foreign keys in the subject table. If you draw an entity relationship diagram of the subject table, the look up tables would surround the subject table like star points.
As far as the queries, that depends on the subject tables and the number of rows. Generally, expect queries to take a long time (many minutes, sometimes hours).
Here's a general article to get you started: Developing a Data Warehouse Architecture
Here's a high level overview of the design of a star schema: Designing the Star Schema Database
I'm tasked with creating a datawarehouse for a client. The tables involved don't really follow the traditional examples out there (product/orders), so I need some help getting started. The client is essentially a processing center for cases (similar to a legal case). Each day, new cases are entered into the DB under the "cases" table. Each column contains some bit of info related to the case. As the case is being processed, additional one-to-many tables are populated with events related to the case. There are quite a few of these event tables, example tables might be: (case-open, case-dept1, case-dept2, case-dept3, etc.). Each of these tables has a caseid which maps back to the "cases" table. There are also a few lookup tables involved as well.
Currently, the reporting needs relate to exposing bottlenecks in the various stages and the granularity is at the hour level for certain areas of the process.
I may be asking too much here, but I'm looking for some direction as to how I should setup my Dim and Fact tables or any other suggestions you might have.
The fact table is the case event and it is 'factless' in that it has no numerical value. The dimensions would be time, event type, case and maybe some others depending on what other data is in the system.
You need to consolidate the event tables into a single fact table, labelled with an 'event type' dimension. The throughput/bottleneck reports are calculating differences between event times for specific combinations of event types on a given case.
The reports should calculate the event-event times and possibly bin them into a histogram. You could also label certain types of event combinations and apply the label to the events of interest. These events could then have the time recorded against them, which would allow slice-and-dice operations on the times with an OLAP tool.
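As a sketch (table, column and event-type names are assumptions), an event-to-event time between 'case-open' and a department event becomes a self-join of the consolidated fact table; note that timestamp subtraction syntax varies by database.

-- Assumed names: elapsed time per case between two labelled event types
SELECT o.case_id,
       d.event_time - o.event_time AS time_to_dept1
FROM   fact_case_event o
JOIN   fact_case_event d  ON d.case_id = o.case_id
JOIN   dim_event_type  ot ON ot.event_type_key = o.event_type_key
JOIN   dim_event_type  dt ON dt.event_type_key = d.event_type_key
WHERE  ot.event_type_name = 'case-open'
AND    dt.event_type_name = 'case-dept1';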
If you want to benchmark certain stages in the life-cycle progression, you would have a table that goes: case type, event type 1, event type 2, benchmark time.
With a bit of massaging, you might be able to use a data mining toolkit or even a simple regression analysis to spot correlations between case attributes and event-event times (YMMV).
I suggest you check out Kimball's books, particularly this one, which should have some examples to get you thinking about applications to your problem domain.
In any case, you need to decide if a dimensional model is even appropriate. It is quite possible to treat a 3NF database as an 'enterprise data warehouse', with different indexes or summaries, or whatever.
Without seeing your current schema, it's REALLY hard to say. Sounds like you will end up with several star models with some conformed dimensions tying them together. So you might have a case dimension as one of your conformed dimensions. The facts from each other table would be in fact tables which link both to the conformed dimension and any other dimensions appropriate to the facts, so for instance, if there is an employee id in case-open, that would link to an employee conformed dimension, from the case-open-fact table. This conformed dimension might be linked several times from several of your subsidiary fact tables.
Kimball's modeling method is fairly straightforward, and can be followed like a recipe. You need to start by identifying all your facts, grouping them into fact tables, identifying individual dimensions on each fact table and then grouping them as appropriate into dimension tables, and identifying the type of each dimension.
Like any other facet of development, you must approach the problem from the end requirements ("user stories" if you will) backwards. The most conservative approach for a warehouse is to simply represent a copy of the transaction database. From there, guided by the requirements, certain optimizations can be made to enhance the performance of certain data access patterns. I believe it is important, however, to see these as optimizations and not assume that a data warehouse automatically must be a complex explosion of every possible dimension over every fact. My experience is that for most purposes, a straight representation is adequate or even ideal for 90+% of analytical queries. For the remainder, first consider indexes, indexed views, additional statistics, or other optimizations that can be made without affecting the structures. Then, if aggregation or other redundant structures are needed to improve performance, consider separating these into a "data mart" (at least conceptually) which provides a separation between primitive facts and redundancies thereof. Finally, if the requirements are too fluid and the aggregation demands too heavy to efficiently function this way, then you might consider wholesale explosions of data, i.e. a star schema. Again though, limit this to the smallest cross-section of the data as possible.
Here's what I came up with essentially. Thx NXC
Fact Events
EventID
TimeKey
CaseID
Dim Events
EventID
EventDesc
Dim Time
TimeKey
Dim Regions
RegionID
RegionDesc
Cases
CaseID
RegionID
This may be a case of choosing a solution before you've considered the problem. Not all datawarehouses fit into the star schema model. I don't see that you are aggregating any data here. So far we have a factless fact table and at least one rapidly changing dimension (cases).
Looking at what I see so far, I think the central entity in this database should be the case. Trying to stick the event in the middle doesn't seem right. Try looking at it a different way. Perhaps case, events, and case events to start.