Objective: I have a relational database (RDB) model. A few tables have a timestamp attribute.
I want to create a date dimension for my multidimensional model.
Viewing the Microsoft Tutorial Solution 3, I noticed that the Date dimension table's FullDateAlternateKey attribute has the same format as the timestamp attribute in the RDB's tables.
Question: So, I was wondering if there is a way to automatically generate a Date dimension table schema (with the FullDateAlternateKey as primary key) and populate it with the data from the timestamps in the RDB's tables?
Then I could make the timestamp attribute in the RDB's tables a foreign key to the Date dimension table in my multidimensional model.
Don't.
First, decide the "grain" of your dimension. It sounds like you want a DATE dimension, so the grain will be a day.
Then, decide the columns you want in the dimension. Examples are week number, day number in week, day number in year, day name, month name, etc.
Next, build a spreadsheet that contains one row per date, for the range of dates you need, and calculates the columns you require.
Finally, load and process the dimension, from the spreadsheet, using your preferred ETL/ELT method.
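For reference, a minimal sketch (in T-SQL, with illustrative column names) of the kind of table the spreadsheet would be loaded into:

CREATE TABLE DimDate (
    DateKey          int          NOT NULL PRIMARY KEY,  -- e.g. 20170804
    FullDate         date         NOT NULL,              -- 2017-08-04
    DayOfWeekNumber  tinyint      NOT NULL,              -- 1-7
    DayName          varchar(10)  NOT NULL,              -- 'Friday'
    DayOfYearNumber  smallint     NOT NULL,              -- 1-366
    WeekOfYearNumber tinyint      NOT NULL,              -- 1-53
    MonthNumber      tinyint      NOT NULL,              -- 1-12
    MonthName        varchar(10)  NOT NULL,              -- 'August'
    QuarterNumber    tinyint      NOT NULL,              -- 1-4
    CalendarYear     smallint     NOT NULL               -- 2017
);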
The reason you don't build it from the incoming data values is that you may have gaps in the data. A date dimension should have ALL dates in your desired range (e.g., 1900-01-01 to 2999-12-31) so that your BI tools can eventually use it for time series reporting. If you don't have ALL the dates and try to show dates on the x-axis of a graph, you will get misleading visualisations.
Another reason for using a spreadsheet as your source is that the DATE dimension is one of the most volatile dimensions in your design. Your users will ask for new columns and variations on columns (e.g., "Can we have a column with the date formatted like 4th of August, 2017?"), and a spreadsheet is a very fast way to manage the data and rebuild the dimension when necessary.
Step 1:
Choose the granularity and generate the time dimension key based on that granularity. For example, with a granularity of one hour, you would need keys like 2001020323 (yyyymmddhh).
There is no automatic way to do this using SSAS, so it's best to use a script to build the time dimension table in the underlying data source, then import it into the data source view (DSV) in SSAS and use it to build the time dimension.
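As an illustration only, a T-SQL sketch of such a script (table and column names are assumptions, not SSAS requirements):

-- Generate one row per hour for the desired range, with a yyyymmddhh smart key.
;WITH Hours AS (
    SELECT CAST('2001-01-01 00:00' AS datetime) AS h
    UNION ALL
    SELECT DATEADD(hour, 1, h) FROM Hours WHERE h < '2001-12-31 23:00'
)
SELECT
    CAST(CONVERT(char(8), h, 112)
         + RIGHT('0' + CAST(DATEPART(hour, h) AS varchar(2)), 2) AS bigint) AS TimeKey,  -- e.g. 2001020323
    CAST(h AS date)   AS CalendarDate,
    DATEPART(hour, h) AS HourOfDay
INTO DimTime
FROM Hours
OPTION (MAXRECURSION 0);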
Step 2:
I have to match the time dimension's keys, so I need an ETL process/job/script taking my timestamp as input and returning a key for that timestamp that matches the keys in the time dimension table.
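In practice this can be a single expression in the load query; a sketch along the same lines (StagingFact and EventTimestamp are assumed names):

SELECT
    CAST(CONVERT(char(8), src.EventTimestamp, 112)
         + RIGHT('0' + CAST(DATEPART(hour, src.EventTimestamp) AS varchar(2)), 2) AS bigint) AS TimeKey,  -- yyyymmddhh
    src.*
FROM StagingFact AS src;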
Related
I have two Measure Groups, one with a time grain of year/month (YearMonthDim), the other with a time grain of datetime (CalendarDim). Can I link the month-grained fact to the CalendarDimension so I can build reports that join both fact tables on the time dimension?
I just made a quick fix and added the YearMonthDim to the fact table with the datetime granularity. Is there another way to solve this?
The term I was looking for was Role Playing Dimension.
I connected the second measure group at Month granularity to the Month attribute of the existing CalendarDimension, whose grain is a day.
It now works like a charm.
I am in the middle of designing a data warehouse. There are multiple fact tables, and it is highly likely that hundreds of facts will be inserted into each fact table. Even though it's a bit early, I am already thinking about optimizations.
I have two tables for time:
date (a unique row for each day)
time of the day (a unique row for each minute in a day)
In all my fact tables I have the full date column.
What does your experience say: should I use SELECT statements in code to look up the dimension IDs from the time dimension tables, or should I allow the time dimension columns in the fact tables to be nullable and use triggers to fill in the values?
Date and time-of-day dimensions are the (very unusual) case in data warehousing where a surrogate key with "magic" values is beneficial. You can make the primary keys in the date dimension integers with values like 20110516, and in the time-of-day dimension either 1-1440 or 1-2400.
I suggest calculating the corresponding values in your fact records and adding fields for them, say, CALENDAR_ID and TIME_OF_DAY_ID. Depending on the size of your data, you are likely to benefit from indexing on CALENDAR_ID and maybe even partitioning on it. If you are sure of the quality of your data, you can skip the foreign key constraints on these fields to gain some performance during loading.
No nulls allowed for FKs in fact tables.
Simply use your ETL to lookup keys from date and time dimension for each row of the fact table.
No triggers in DW, all loading and key-lookup is done through the ETL application.
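As an illustration, both smart keys can be computed inline during the load; a sketch (staging table and column names are assumptions):

SELECT
    CAST(CONVERT(char(8), s.SaleTimestamp, 112) AS int) AS CALENDAR_ID,      -- e.g. 20110516
    DATEPART(hour, s.SaleTimestamp) * 60
        + DATEPART(minute, s.SaleTimestamp) + 1          AS TIME_OF_DAY_ID,  -- 1-1440
    s.SaleAmount
FROM StagingSale AS s;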
CouchDB employs a cool pattern that can be used in a multitude of other scenarios. I'm talking about the persisted B-tree index of map/reduce results. The idea is to precalculate the aggregated data and store it at different levels of the B-tree index. The index can then be used to efficiently query the aggregate without having to reaggregate all the data all the time. Then, if any leaf-level value changes, only the ascending path through the tree has to get recalculated.
For example, if the data is price over time, the index could store the SUM and the COUNT of items at day, month, and year levels. Then, if anybody wants to query average price year-to-date all you had to do is sum up all the SUMs and COUNTs for all the full months since year start, plus all the days available for the last month, then divide total SUM by total COUNT. If a past price has to change, the change has to propagate through the index, but only corresponding day's and month's and year's values have to be updated, and even then the values for other days and other months within the year can be reused for the calculation.
What is the generic name of this approach? Does anything similar exist in any of the popular RDBMSes? Any experience with using this in practice?
Materialized view
"A materialized view is a database object that contains the results of a query. They are local copies of data located remotely, or are used to create summary tables based on aggregations of a table's data. Materialized views, which store data based on remote tables, are also known as snapshots."
This is from a Wikipedia article that mainly discusses storing query results in the context of an RDBMS.
Personally I prefer the term "indexed view". I actually found that wikipedia article by searching for "indexed view" on Google.
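For the price-over-time example above, a minimal sketch as a SQL Server indexed view (dbo.PriceHistory, PriceDate, and Price are assumed names; Price is assumed NOT NULL, as indexed views require):

CREATE VIEW dbo.PriceDailyAgg
WITH SCHEMABINDING
AS
SELECT
    PriceDate,
    SUM(Price)   AS TotalPrice,
    COUNT_BIG(*) AS ItemCount  -- COUNT_BIG(*) is required in an indexed view with GROUP BY
FROM dbo.PriceHistory
GROUP BY PriceDate;
GO
-- The unique clustered index is what materializes and maintains the daily aggregates.
CREATE UNIQUE CLUSTERED INDEX IX_PriceDailyAgg ON dbo.PriceDailyAgg (PriceDate);

Year-to-date averages can then be assembled from the view's daily SUMs and COUNTs instead of rescanning the detail rows.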
We're thinking about adding a weekly summary table to our little data warehouse. We have a classic time dimension down to the daily level (Year/Month/Day) with the appropriate Week/Quarter/etc. columns.
We'd like to have the time key in this new weekly summary table reference our time dimension. What's the best practice here—have the time key reference the id of the first day in the week it represents? Or the last day? Or something entirely different?
By convention, the fact tables with date period aggregations (week, month...) reference the DateKey of the last day of the period -- so, for this example you would reference the last day of the week.
Kind of logical too, the week must end in order to be aggregated.
It is important to clearly state (somewhere) that the grain of the fact table is one-week, so that report designers are aware of this.
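One simple way to resolve that key is from the date dimension itself; a sketch with illustrative column names:

SELECT
    CalendarYear,
    WeekOfYearNumber,
    MAX(DateKey) AS WeekEndingDateKey  -- last day of each week
FROM DimDate
GROUP BY CalendarYear, WeekOfYearNumber;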
Days are a good example of an entity best identified by natural keys: their representations in the Gregorian calendar.
To identify a week or a month, it's best to use its first day. In Oracle, you can easily retrieve it by a call to TRUNC:
SELECT TRUNC(fact_date, 'month'), SUM(fact_value)
FROM fact
GROUP BY TRUNC(fact_date, 'month')
In other systems it's a little bit more complex but quite easy too.
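For example, in SQL Server (2012 or later) the same month truncation could be written as:

SELECT DATEFROMPARTS(YEAR(fact_date), MONTH(fact_date), 1) AS month_start,
       SUM(fact_value)
FROM fact
GROUP BY DATEFROMPARTS(YEAR(fact_date), MONTH(fact_date), 1)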
What about making a new dimension "Week"?
You can create a relationship between the time and week dimensions if you need to.
Apropos an earlier answer: I would actually expect to store data associated with an interim level of the time dimension hierarchy (when it relates to an atomic measurement for that interim time period) by attaching it to the key of the first day of the period. This makes loading much more straightforward (especially with months; weeks might always require some calculation) and also simplifies reporting. Nonetheless, it is a convention, and as long as you pick a common-sense option (and stick to it) you will be fine.
BTW, do not create a week dimension. You should be using a rich time dimension with all the hierarchies available within it for year, quarter, month, week, day, etc. (bearing in mind there are often multiple, exclusive hierarchies). In this instance only, I would also recommend a meaningful surrogate key in the form 20100920: dates are immutable and in this format can easily be stored as int columns, so there is little value in using meaningless keys for dates (or in dim_time either). If you have ever had to write queries to dereference data where meaningless SKs are used for the time dimension, you know the (unnecessary) pain...
I would like to analyse data per hour using SSAS. The built-in date dimension does not create any hour attributes.
Currently I am creating a new table with HourOfDay and HourOfDayName fields and will use this table to create a date dimension.
Could anyone tell me if there is a common way of achieving time-of-day based analysis using SSAS 2005?
Thanks
Typically you create separate day and time dimensions. This is done to prevent the dimension from growing too large. You can add special descriptive attributes to the time dimension to designate time periods that are relevant to your business or type of analysis (different shifts in a factory, for example). Then you would just use the time dimension to slice the data like any other dimension.
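A minimal sketch of such a time-of-day table at minute grain, including the HourOfDay / HourOfDayName attributes mentioned in the question (all names illustrative):

-- One row per minute of the day (1440 rows).
;WITH Minutes AS (
    SELECT 0 AS MinuteOfDay
    UNION ALL
    SELECT MinuteOfDay + 1 FROM Minutes WHERE MinuteOfDay < 1439
)
SELECT
    MinuteOfDay + 1                              AS TimeOfDayKey,   -- 1-1440
    MinuteOfDay / 60                             AS HourOfDay,      -- 0-23
    CAST(MinuteOfDay / 60 AS varchar(2)) + ':00' AS HourOfDayName,  -- '0:00' .. '23:00'
    MinuteOfDay % 60                             AS MinuteOfHour
INTO DimTimeOfDay
FROM Minutes
OPTION (MAXRECURSION 1440);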
You can also build more interesting analysis paths by pivoting your design and using a time period, i.e. a duration, as a measure. This often requires creating a new fact table or using relational views.