Combining Relational and OLAP data in an MDX Query

I have an SSAS 2008 cube that is being used to house end of day financial data from the stock market. The cube is only processed once a day after the market closes, so it never has any information about the current intraday trading data. I also have a relational database that houses the current intraday trading information for stocks. I am trying to find a way to combine those two data sources so that I can perform calculations such as a 30 day moving average for a stock that is based off of its current price, as well as the previous 29 days of historical data. I am using SSAS Standard edition, so I don't have access to features such as Proactive Caching or multiple partitions to help me process the current data in near real time.
Is there any way to dynamically include rows from my SQL database in my fact table, just for the context of an individual query? Essentially, I want to bring a small subset of data into the cube temporarily in order to perform a certain calculation.

No; instead, you should create a measure group that maps to your OLTP table.

You should be able to create a partition for the current day's data and specify ROLAP as the storage mode.
To simplify maintenance, I would probably create a view for the fact table and, in the definition, use date functions in the where clause. Something like:
CREATE VIEW CurrentTrades AS
SELECT * FROM factTrades
WHERE TradingDate >= DATEADD(dd, DATEDIFF(dd, 0, GETDATE()), 0)
  AND TradingDate < DATEADD(dd, DATEDIFF(dd, 0, GETDATE()) + 1, 0)
You could then use that view as the data source for the ROLAP partition.
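To keep the cube from double counting, the historical MOLAP partition needs the complementary filter. A minimal sketch along the same lines, assuming the same factTrades table:
CREATE VIEW HistoricalTrades AS
SELECT * FROM factTrades
WHERE TradingDate < DATEADD(dd, DATEDIFF(dd, 0, GETDATE()), 0)
That view would back the partition you process once a day after the close, while CurrentTrades backs the ROLAP partition.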

You can incrementally process the cube's data at specific intervals during the day, depending on how long it takes to process the new data (assuming, of course, that some delay is acceptable).

It is possible to write your own DLL and call it from within MDX. It's not terribly graceful but I've done it in the past.
Not a great idea for thousands of rows of data, but if you need fewer than 100 or so, your function call can pass a value from MDX to the DLL, which in turn queries the SQL database and returns the numbers. The results are then displayed in the cellset alongside the numbers from OLAP.

Related

How to provide YTD, 12M, and Annualized measures in SQL Server data warehouse?

The project requires a usable data warehouse (DW) in SQL Server tables. The client prefers no Analysis Services, with the SQL Server DW providing everything they need.
They are starting to use Power BI, and have expressed the desire to provide all facts and measures in SQL Server tables, as opposed to a multidimensional cube. The client has also used SSRS (to a large degree), and some users work in Excel.
At first they only required the Revenue FACT for Period x Product x Location. This uses a periodic snapshot type of fact, not a transactional grain fact.
So, then, to provide YTD measures for all periods, my first challenge was filling in the "empty" facts, for which there was no revenue, but there was revenue in prior (and/or subsequent) periods. The empty fact table includes a column for the YTD measure.
I solved that by creating empty facts -- no revenue -- for the no revenue periods, such as this:
Period 1, Loc1, Widget1, $1 revenue, $1 YTD
Period 2, Loc1, Widget1, $0 revenue, $1 YTD (this $0 "fact" created for YTD)
Period 3, Loc1, Widget1, $1 revenue, $2 YTD
I'm just using YTD as an example, but the requirements include measures for Last 12 Months, and Annualized, in addition to YTD.
I found that the best way to tally YTD was actually to create records to hold the measure (YTD) where there is no fact coming from the transactional data (meaning: no revenue for that combination of dimensions).
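For reference, a minimal T-SQL sketch of this densify-then-accumulate pattern. The table and column names (FactRevenue, DimPeriod, etc.) are illustrative, and the windowed running total assumes SQL Server 2012 or later:
WITH Dense AS (
    -- one row per period x location x product, with zero revenue where no fact exists
    SELECT dp.PeriodKey, dp.FiscalYear, c.LocationKey, c.ProductKey,
           COALESCE(f.Revenue, 0) AS Revenue
    FROM DimPeriod AS dp
    CROSS JOIN (SELECT DISTINCT LocationKey, ProductKey FROM FactRevenue) AS c
    LEFT JOIN FactRevenue AS f
        ON  f.PeriodKey   = dp.PeriodKey
        AND f.LocationKey = c.LocationKey
        AND f.ProductKey  = c.ProductKey
)
SELECT PeriodKey, LocationKey, ProductKey, Revenue,
       -- running total that restarts at each fiscal year boundary
       SUM(Revenue) OVER (PARTITION BY LocationKey, ProductKey, FiscalYear
                          ORDER BY PeriodKey
                          ROWS UNBOUNDED PRECEDING) AS RevenueYTD
FROM Dense;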
Now the requirements need the revenue fact by two more dimensions, Market Segment and Customer. This means I need to refactor my existing stored procedures to do the same process, but now for a more granular fact:
Period x Widget x Location x Market x Customer
This will result in creating many more records to hold the YTD (and other) measures. There will be far more records for these measures than there are real facts.
Here are what I believe are possible solutions:
Just do it in SQL DW table(s). This makes it easy to use wherever needed (as it is now).
Do it in Power BI -- presumably as DAX expressions in the PBIX?
SSAS Tabular -- is Tabular an appropriate place to calculate the YTD and other measures, or should they be handled at the reporting layer?
For what it's worth, the client is reluctant to use SSAS Tabular because they want to keep the number of layers to a minimum.
Follow up questions:
Is there a SQL Server architecture to provide this sort of solution as I did it, maybe reducing the number of records necessary?
If they use PowerBI for YTD, 12M, Annualized measures, what do I need to provide in the SQL DW, anything more than the facts?
Is this something that SSAS Tabular solves, inherently?
This is my experience:
I have always maintained that the DW should be where all of your data is. Then any client tool can use that DW and get the same answer.
I came across the same issue in a recent project: generating "same day last year" type calcs (along with YTD, fiscal YTD, etc.). SQL Server seemed like the obvious place to put these 'sparse' facts, but as I discovered (as you have), the sparsity just gets bigger and more complicated as the dimensions increase. You end up blowing out the size, continually coming back to chase down those missing sparse facts, and, worst of all, having to come up with weird 'allocation' rules to push measures down to the required level of detail.
IMHO DAX is the place to do this, but there is a lot of pain in learning the language, especially if you come from a traditional relational background. But I really do think it's the best thing since SQL, if you can just get past the learning curve.
One of the most obvious advantages of using DAX, rather than the DW, is that DAX knows what the current filters are in the client tool (Power BI, Excel, or whatever) at run time and can adjust its calculation automatically. Obviously you can't do that with figures baked into the DW. For example, it can detect that the page, chart, or row is filtered on a given date, so your current-year/prior-year calcs automatically produce the correct YTD for that date.
DAX has a number of 'calendar' type functions (called "time intelligence"), but they only work for one particular type of calendar and come with a lot of constraints, so you usually end up creating your own calendar table and building your calculations around it.
My advice is to start here: https://www.daxpatterns.com/ and try generating some YTD calcs in DAX
For what it's worth, the client is reluctant to use SSAS Tabular
because they want to keep the number of layers to a minimum.
Power BI already has a (required) modelling layer that effectively uses SSAS Tabular internally, so you already have an additional logical layer; it just lives in the same tool as the reporting layer. The difference is that doing the modelling only in Power BI currently isn't an "enterprise" approach: features such as model version control, partitioned loads, and advanced row-level security aren't supported by Power BI (although who knows what next month will bring).
Layers are not bad things as long as you keep them under control; otherwise we should all just go back to monolithic COBOL programs.
It's certainly possible to start doing your modelling purely in Power BI and then, at a later stage when you need the features, control, and scalability, migrate to SSAS Tabular.
One thing to consider is that the SSAS Tabular PaaS offering in Azure can get pretty pricey, but if you ever need partitioned loads (i.e. loading just this week's data into a very large model with a lot of history), you'll need to use it.
Is there a SQL Server architecture to provide this sort of solution as I did it, maybe reducing the number of records necessary?
I guess that architecture would be defining the records in views, which has a lot of obvious drawbacks. There is also a SPARSE column designator, but that just optimises storage for columns with lots of NULLs, which may not even be the case here.
If they use PowerBI for YTD, 12M, Annualized measures, what do I need to provide in the SQL DW, anything more than the facts?
You definitely need a comprehensive calendar table defining the fiscal year.
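For illustration, a minimal sketch of such a table; the column names are hypothetical, and a real calendar table would carry many more attributes:
CREATE TABLE DimDate (
    DateKey         date     NOT NULL PRIMARY KEY,
    CalendarYear    smallint NOT NULL,
    CalendarMonth   tinyint  NOT NULL,
    FiscalYear      smallint NOT NULL,
    FiscalPeriod    tinyint  NOT NULL,
    FiscalYearStart date     NOT NULL  -- anchors YTD-style calcs to the fiscal boundary
);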
Is this something that SSAS Tabular solves, inherently?
If you only want to report by calendar periods (1st Jan to 31st Dec), then the built-in time intelligence is "inherent", but if you want to report by fiscal periods, the built-in time intelligence can't be used. Regardless, you still need to define the DAX calcs yourself, and they can get really big.
First, SSAS Tabular and Power BI use the same engine. So they are equally applicable.
And the ability to define measures that can be calculated across any slice of the data, using any of a large number of categorical attributes, is one of the main reasons why you want something like SSAS Tabular or Power BI in front of your SQL Server. (The others are caching, simplified end-user reporting, the ability to mash up data across sources, and custom security.)
Ideally, SQL Server should provide the Facts, along with single-column joins to any dimension tables, including a Date Dimension table. Power BI / SSAS Tabular would then layer on the DAX Measure definitions, the Filter Flow behavior and perhaps Row Level Security.

Creative use of date partitions

I have some data that I would like to partition by date, and also partition by an internally-defined client id.
Currently, we store this data using the table-per-date model. It works well, but querying individual client ids is slow and expensive.
We have considered creating a table per client id and using date partitioning within those tables. The only issue is that this would force us to incur thousands of load jobs per day, and would also require the data to be partitioned by client id in advance.
Here is a potential solution I came up with:
- Stick with the table-per-date approach (e.g. log_20170110)
- Create a dummy date column which we use as the partition date, and set that date to <client id>-01-01 (e.g. for client id 1235, set _PARTITIONTIME to 1235-01-01)
This would allow us to load data per-day, as we do now, would give us partitioning by date, and would leverage the date partitioning functionality to partition by client id. Can you see anything wrong with this approach? Will BigQuery allow us to store data for the year 200, or the year 5000?
PS: We could also use a scheme that pushes the dates past zero Unix time, e.g. adding 2000 to the year, or spreading the last two digits across the month and day, e.g. 1235 => 2012-03-05.
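For concreteness, a sketch of the first encoding in BigQuery Standard SQL (the table name is illustrative and client_id is assumed to be an INT64 column; note the DATE constructor only accepts years 1 through 9999, so client ids outside that range would need one of the remapping schemes above):
SELECT
  DATE(client_id, 1, 1) AS dummy_partition_date  -- e.g. 1235 -> DATE '1235-01-01'
FROM my_dataset.staging_log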
Will BigQuery allow us to store data for the year 200, or the year 5000?
Yes, any date between 0001-01-01 and 9999-12-31.
So formally speaking this is an option (and, by the way, whether it holds up depends on how many clients you plan to have / already have).
See more about same idea at https://stackoverflow.com/a/41091896/5221944
Meantime, I would expect BigQuery to soon gain the ability to partition by an arbitrary field. Maybe at NEXT 2017 - just guessing :o)
The suggested idea is likely to create performance issues for queries as the number of partitions increases. Generally speaking, date partitioning works well with up to a few thousand partitions.
client_ids are generally unrelated to each other and are ideal for hashing. While we work towards supporting richer partitioning flavors, one option is to hash your client_ids into N buckets (~100?) and have N partitioned tables. That way you can query across your N tables for a given date. Using, for example, 100 tables would drop the cost down to 1% of what it would be using one table with all the client_ids. It should also scan a small number of partitions, improving performance accordingly. Unfortunately, this approach doesn't address the concern of putting the client ids in the right table (that has to be managed by you).
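A sketch of the bucketing in BigQuery Standard SQL (the bucket count and table names are illustrative):
SELECT
  MOD(ABS(FARM_FINGERPRINT(CAST(client_id AS STRING))), 100) AS bucket_id
FROM my_dataset.staging_log
Rows with bucket_id = 7 would then be loaded into a table such as log_bucket_07, and a query for a given client id only ever has to touch one of the 100 tables.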

Delta refresh of SSAS Cube

I have a scenario where an SSAS cube's data needs to be refreshed. We want to avoid using a full refresh that takes an hour, and do a 'delta' refresh. The delta refresh should
1) Update fact records that have changed
2) Insert fact records that are new
3) Delete fact records that no longer exist
Consider a fact table with three dimensions: Company, Security, FiscalYear
and two measures: Qty, Amount
Scenario: In the fact table, a record with Company A, Security A, FiscalYear A has the measure Qty changed from 2 to 20. Previously the cube correctly showed the Qty to be 2. After the update:
If we do a Full refresh, it correctly shows 20. But in order to get this, we have to suffer a full hour of cube processing.
We tried adding a timestamp column to the fact table, splitting the cube into Current and Old partitions, fully refreshing the Current partition, and merging it into the Old partition, as seems to be the popular suggestion. When we browse the cube, it shows 22, which is incorrect.
We tried an Incremental refresh of the cube: same issue, it shows 22, which is also incorrect. (In both cases the old value of 2 and the new value of 20 end up in the cube together, summing to 22, because these techniques only add rows; they never update or delete existing ones.)
So what I am trying to ascertain here is whether there is any way to process a cube so that it only takes the changes (and by that I mean updates, inserts AND deletes, not just inserts!) and applies them to the data inside an SSAS cube.
Any help would be greatly appreciated!
Thanks!
No, there is no way to do this. The only control you have over processing is the granularity of what you process. For instance, if you know that data over a certain age will never change, you can put data over that age in its own partition and exclude it from processing.
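A hedged sketch of those partition source queries, using the fact table from the question (the table name and the 2013 cutoff are illustrative). The key is to split on a stable attribute such as FiscalYear rather than on a modification timestamp, so an updated row can never be counted in both partitions:
-- "Old" partition: processed once, then excluded from the daily run
SELECT Company, Security, FiscalYear, Qty, Amount FROM FactTable WHERE FiscalYear < 2013
-- "Current" partition: fully reprocessed each run, which is far faster than the whole cube
SELECT Company, Security, FiscalYear, Qty, Amount FROM FactTable WHERE FiscalYear >= 2013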

How do you design an aggregation framework for large datasets?

I have a large dataset of incoming messages and what I want to do is provide aggregated statistics for each message owner, such as rate of messages per day, week, last two weeks, and year. The aggregations can be simple, such as a word count, or more complex, such as keywords used...either way, I want to -- in an organized fashion -- precalculate and store these aggregations so that when I do something like:
Person.word_count_last_10_days
-- that this query isn't run on the entire message archive database, but pulls from a table of precalculated aggregations...something like:
SELECT SUM(value) FROM aggregations
WHERE
category = 'word_count' AND
timeframe = 'day' AND date > '2013-05-18' AND date < '2013-05-28' AND
person_id = 42
GROUP BY person_id
And aggregations of larger timeframes, such as "year", would simply count up all the days that make up that year.
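For example, the year rollup could be just another query over the same table (same illustrative schema as above):
SELECT person_id, SUM(value) AS word_count_2013
FROM aggregations
WHERE category = 'word_count'
AND timeframe = 'day'
AND date >= '2013-01-01' AND date < '2014-01-01'
GROUP BY person_id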
The overall objective is to decouple the analytics browsing from the massive message archive. For the most part, there's no reason for the analytics system to actually query the message archive, if the aggregations table contains all the data needed.
This strikes me as a very common use case... it doesn't matter whether it's done through Hadoop or through non-parallel processing... so I was wondering if there is already a framework/wrapper/design methodology that provides some convention for this, rather than writing one completely from scratch? I'm using Ruby, but language doesn't matter; I'm just interested in whatever frameworks/wrappers (in any language) have already been conceived.
I would look into OLAP/cubes for this kind of work.
Here is an open source OLAP server: http://mondrian.pentaho.com/
The idea is that with a cube you can set up pre-processed aggregations, run them, and afterwards query the results quickly.
The MDX language is the equivalent of SQL for cubes - and it has a pretty steep learning curve - but some of the basic stuff should be easy to handle out of the box.
It takes a bit of reading to get up to speed on cubes in general. Check out: http://en.wikipedia.org/wiki/OLAP_cube.
It is well worth it for pre-processed aggregations.

How to reuse process result of fixed data

In a financial system, each year's transactions are stored in a separate table, so there are Transactions2007, Transactions2008, ..., Transactions2012 tables in the system. They all have the same table design. The data in the tables for previous years never changes, but the current year's data is updated daily.
I want to build a cube on the union of tables of all years. The question is how to prevent SSAS from reprocessing previous years.
When processing the cube, you can set the process option to Process Incremental and then, in the Configuration Dialog, specify a query that selects data only from the most recent table.
I handled it by partitioning the cube by the time dimension and processing only the most recent partition.
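A sketch of the partition source queries for that approach, using the year tables from the question:
-- Partitions for 2007 through 2011: processed once, never touched again
SELECT * FROM Transactions2007
-- ...one query per historical year table...
-- Partition for the current year: the only one reprocessed by the daily run
SELECT * FROM Transactions2012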