Need help optimizing a multi-level DAX SUMMARIZE calculation

I'm looking for advice on how to optimize a multi-level DAX SUMMARIZE query. It is very slow; I suspect it is running in O(n^3) because of the nesting. Unfortunately, I need several levels because the hierarchy levels Order > Order Line > Order Detail must be aggregated differently:
Units are summed up to the Detail level.
The Detail sums are averaged up to the Line level.
The Line averages are summed up to the Order level.
SUMX(
    SUMMARIZE(
        'FACT Opportunity',
        Opportunity[LineId],
        "Units",
        AVERAGEX(
            SUMMARIZE(
                'FACT Opportunity',
                Opportunity[DetailId],
                "SumDetail",
                SUM('FACT Opportunity'[Units])
            ),
            [SumDetail]
        )
    ),
    [Units]
)
Any help or advice you could provide would be very much appreciated.

It's very hard to provide optimisation advice without seeing the data and the data model (it would be great if they were included in the question).
The key issue here is that the presence of duplicates makes the "Units" fact non-additive, meaning that you can't simply roll it up the hierarchy. As a result, you are forced into a very expensive triple loop.
An obvious solution, then, is to make "Units" fully additive. You can compute de-duplicated (adjusted for duplicates) Units and store them in fact Opportunity as a calculated column:
Adjusted Units =
DIVIDE (
    'FACT Opportunity'[Units],
    CALCULATE ( COUNT ( 'FACT Opportunity'[DetailId] ) )
)
Here, you divide Units by the number of rows that share the same DetailId (usually it will be 1, but in the case of duplicated DetailIds it will be 2, and so on).
This calculated column will increase your data loading time a bit, but save a lot of query time. To further optimize, consider pre-calculating it in a data warehouse.
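If you do push it down to the warehouse, a minimal sketch of that pre-calculation could look like the following (the table and column names here are assumptions based on the question; adjust to your actual schema):
-- Spread Units evenly across the rows that share a DetailId,
-- so the resulting AdjustedUnits column is fully additive.
SELECT
    f.*,
    f.Units * 1.0 / COUNT(*) OVER (PARTITION BY f.DetailId) AS AdjustedUnits
FROM fact_opportunity AS f;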
The adjusted Units are fully additive, so your DAX is now simple:
Total Units = SUM('FACT Opportunity'[Adjusted Units])
It should work correctly on any level of the Order > Line > Detail hierarchy (unless there are additional problems not described in the question), and it should be fast.

Optimizing Summarize in DAX

I have a DAX formula that gives me a count of IDs that appear in the fact table in a month, averaged over the year. I can put this measure in a table and it breaks down by row with no issues (by adding attributes from dimensions).
Measure:= AVERAGEX(
    SUMMARIZE(
        CALCULATETABLE(fact_table; FILTER('Time_Dimension'; 'Time_Dimension'[Last_month] <> "LAST"));
        Time_Dimension[Month Name];
        "Count"; DISTINCTCOUNT(fact_table[ID])
    );
    [Count]
)
But it's terribly slow (I have three measures like this on a single table) and the fact table is big (around 300 million rows).
I was reading that SUMMARIZE performs really badly with aggregations and should be replaced with SUMMARIZECOLUMNS.
I wrote this formula
Measure_v2:= AVERAGEX(
    SUMMARIZECOLUMNS(
        Time_Dimension[Month Name];
        FILTER(Time_Dimension;
            Time_Dimension[Month Name] <> "LAST"
        );
        "Count"; DISTINCTCOUNT(fact_table[ID])
    );
    [Count]
)
And it works when I visualize the measure on its own, but when I try to put it in a context (like the table above) it gives me the error "Can't use SUMMARIZECOLUMN and ADDMISSINGITEMS() in this context". How can I make a sustainable optimization of the original SUMMARIZE function?
Before optimizing SUMMARIZE, I would revisit the overall approach. If your goal is to calculate the average fact count per year-month, there is a simpler (and faster) way.
[ID Count]:=CALCULATE(COUNT('fact_table'[ID]),'Time_Dimension'[Last_month] <> "LAST")
[Average ID Count]:=AVERAGEX( VALUES('Time_Dimension'[Year_Month]), [ID Count])
assuming that:
you have a year-month attribute in your time dimension;
IDs in your fact table are unique (and therefore, a simple count is enough).
If this solution does not solve your problem, then please post your data model - it's hard to optimize without knowing the data structure.
On a side note, I would remove the ID field from the fact table. It adds no value to the model and consumes huge amounts of memory. Your objective can be achieved by simply counting rows:
[Fact Count]:=CALCULATE(COUNTROWS('fact_table'),'Time_Dimension'[Last_month] <> "LAST")

MDX Calculated measure based on IF and greater than

I am editing my question to state exactly what I want.
I have two columns, Actual Units and Future Units, from Fact A and Fact B respectively, but at the same granular level. I also have Demand Units from Fact B.
My requirements are:
1. Projected Units = COALESCE(Actual Units, Future Units)
2. Stock Units = IF(Projected Units > Demand Units, Demand Units, Projected Units)
3. Stock Rate = Stock Units / Demand Units
I cannot join the two facts at the data source view level and do the calculation there, because they are very large tables and I think the performance would be very slow. If doing the calculations at the data source view level is the only way, please let me know.
When calculating the grand total, MDX sums up A, sums up B, and then compares them.
If you want the calculation to occur at the row level (checking whether B>A) then edit the Data Source View and add a new calculated column to the table your measure group is based upon. The calculated column should be:
CASE WHEN B>A THEN A ELSE B END
Then create a Sum measure based upon that new column.
This approach will perform much better than any purely MDX approach that calculates this at a very detailed grain. If your fact tables had 500,000 rows or fewer and you had a degenerate dimension at the same grain as the grain you need to calculate at, we could possibly do it in MDX. But since you are concerned about SQL query performance, I am assuming the tables are big. Just remember that SQL runs once, at processing time; MDX is evaluated in every query, at query time. So do expensive things in SQL when you can.
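As a sketch, the full requirement from the question could be covered with two such columns, assuming the two facts have already been aligned into one table or view at the shared grain (the table and column names below are assumptions based on the question):
SELECT
    f.*,
    COALESCE(f.ActualUnits, f.FutureUnits) AS ProjectedUnits,
    -- Stock Units = the lesser of Projected Units and Demand Units
    CASE WHEN COALESCE(f.ActualUnits, f.FutureUnits) > f.DemandUnits
         THEN f.DemandUnits
         ELSE COALESCE(f.ActualUnits, f.FutureUnits)
    END AS StockUnits
FROM fact_b AS f;
Stock Rate can then be a simple calculated member dividing the Sum of StockUnits by the Sum of DemandUnits.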

"Running Product" aggregate/ windowed function in PostgreSql?

I am trying to normalize End-of-Day stock prices in PostgreSql.
Let's say I have a stock table defined as such:
create table eod (
date date not null,
stock_id int not null,
eod_split decimal(16,8) not null,
close decimal(12,6) not null,
constraint pk_eod primary key (date, stock_id)
);
Data in this table might look like this:
"date","stock_id","eod_split","close"
"2014-06-13",14010920,"1.00000000","182.560000"
"2014-06-13",14010911,"1.00000000","91.280000"
"2014-06-13",14010923,"1.00000000","41.230000"
"2014-06-12",14010911,"1.00000000","92.290000"
"2014-06-12",14010920,"1.00000000","181.220000"
"2014-06-12",14010923,"1.00000000","40.580000"
"2014-06-11",14010920,"1.00000000","182.250000"
"2014-06-11",14010911,"1.00000000","93.860000"
"2014-06-11",14010923,"1.00000000","40.860000"
"2014-06-10",14010911,"1.00000000","94.250000"
"2014-06-10",14010923,"1.00000000","41.110000"
"2014-06-10",14010920,"1.00000000","184.290000"
"2014-06-09",14010920,"1.00000000","186.220000"
"2014-06-09",14010911,"7.00000000","93.700000"
"2014-06-09",14010923,"1.00000000","41.270000"
"2014-06-06",14010923,"1.00000000","41.480000"
"2014-06-06",14010911,"1.00000000","645.570000"
"2014-06-06",14010920,"1.00000000","186.370000"
"2014-06-05",14010920,"1.00000000","185.980000"
"2014-06-05",14010911,"1.00000000","647.350000"
"2014-06-05",14010923,"1.00000000","41.210000"
...
"2005-03-04",14010920,"1.00000000","92.370000"
"2005-03-04",14010911,"1.00000000","42.810000"
"2005-03-04",14010923,"1.00000000","25.170000"
"2005-03-03",14010923,"1.00000000","25.170000"
"2005-03-03",14010911,"1.00000000","41.790000"
"2005-03-03",14010920,"1.00000000","92.410000"
"2005-03-02",14010920,"1.00000000","92.920000"
"2005-03-02",14010923,"1.00000000","25.260000"
"2005-03-02",14010911,"1.00000000","44.121000"
"2005-03-01",14010920,"1.00000000","93.300000"
"2005-03-01",14010923,"1.00000000","25.280000"
"2005-03-01",14010911,"1.00000000","44.500000"
"2005-02-28",14010923,"1.00000000","25.160000"
"2005-02-28",14010911,"2.00000000","44.860000"
"2005-02-28",14010920,"1.00000000","92.580000"
"2005-02-25",14010923,"1.00000000","25.250000"
"2005-02-25",14010920,"1.00000000","92.800000"
"2005-02-25",14010911,"1.00000000","88.990000"
"2005-02-24",14010923,"1.00000000","25.370000"
"2005-02-24",14010920,"1.00000000","92.640000"
"2005-02-24",14010911,"1.00000000","88.930000"
"2005-02-23",14010923,"1.00000000","25.200000"
"2005-02-23",14010911,"1.00000000","88.230000"
"2005-02-23",14010920,"1.00000000","92.100000"
...
"2003-02-24",14010920,"1.00000000","78.560000"
"2003-02-24",14010911,"1.00000000","14.740000"
"2003-02-24",14010923,"1.00000000","24.070000"
"2003-02-21",14010920,"1.00000000","79.950000"
"2003-02-21",14010923,"1.00000000","24.630000"
"2003-02-21",14010911,"1.00000000","15.000000"
"2003-02-20",14010911,"1.00000000","14.770000"
"2003-02-20",14010920,"1.00000000","79.150000"
"2003-02-20",14010923,"1.00000000","24.140000"
"2003-02-19",14010920,"1.00000000","79.510000"
"2003-02-19",14010911,"1.00000000","14.850000"
"2003-02-19",14010923,"1.00000000","24.530000"
"2003-02-18",14010923,"2.00000000","24.960000"
"2003-02-18",14010911,"1.00000000","15.270000"
"2003-02-18",14010920,"1.00000000","79.330000"
"2003-02-14",14010911,"1.00000000","14.670000"
"2003-02-14",14010920,"1.00000000","77.450000"
"2003-02-14",14010923,"1.00000000","48.300000"
"2003-02-13",14010920,"1.00000000","75.860000"
"2003-02-13",14010911,"1.00000000","14.540000"
"2003-02-13",14010923,"1.00000000","46.990000"
Note the "eod_split" column. When a split value other than 1 is recorded, it means that the stock's shares split by that factor. In other words, when the split is 2.0, the number of outstanding shares doubled, but the value of each individual share halved from that point on. If the stock was worth $100 per share, it is now worth $50 per share.
If you graph this with raw numbers, this sort of thing is truly ugly. Sharp cliffs show up, when the overall value of the company did not significantly change... and when you have multiple splits, you end up with a graph that does not properly reflect the trending of the company, often by a large margin. In the above example, where there was a 2:1 split, your close prices for a stock would look something like 100, 100, 100, 50, 50, 50.
I want to use this table to create a "normalized" price, in a reasonably efficient manner (there's quite a few records to chunk through). Continuing the sample, this would show the stock prices at 50, 50, 50, 50, 50, 50. If there were multiple splits, the data should still be consistent and smooth, if we ignored actual market value changes.
My idea is, if I can create a CTE of a "running product" aggregate of the split value, going back in time, I can define date ranges per stock and what the modifier value to apply to the closing cost should be, then join that back to the eod table and select into a new table the adjusted close value for each stock.
...the problem is, I cannot wrap my head around how to do that in anything other than a whole bunch of temp tables and multi-step processes. I do not know of any built-in functionality to make this easier, either.
Can someone show me how I can generate the normalized data?
You don't need a CTE. You just need a cumulative product. Postgres doesn't have one built in. But, arithmetic to the rescue!
select eod.*,
       exp(sum(ln(eod_split)) over (partition by stock_id order by date)) as cume_split,
       close * exp(sum(ln(eod_split)) over (partition by stock_id order by date)) as normalized_price
from eod;
Hilarious: looking for this solution, I found that an associate had already asked about it. The basic algebra behind this ingenious solution is that exp(ln(a) + ln(b)) = a*b for positive values, so a windowed SUM of logarithms acts as a running product: https://blog.prepscholar.com/natural-log-rules
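Note that this rebases everything to the original, pre-split share basis; in the 2:1 example above you would get 100, 100, 100, 100, 100, 100. If you instead want the latest, post-split basis (50, 50, 50, ...), one sketch under the same schema is to also divide by the stock's total split product:
select date, stock_id,
       -- the running product up to each row, divided by the total product for
       -- the stock, amounts to dividing by the splits occurring after the row
       close * exp(sum(ln(eod_split)) over (partition by stock_id order by date))
             / exp(sum(ln(eod_split)) over (partition by stock_id))
       as normalized_close
from eod;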

Create a Calculated Member to remove the need to load a pre-calculated measure

I am trying to replicate the following SQL statement in MDX so that I can create a calculated member in the cube using the base loaded members, instead of having to calculate it outside the cube in the table and then load it:
SUM(CASE WHEN (A.SALES_TYPE_CD = 1) AND (A.REG_SALES = 0)
         THEN A.WIN_SALES
         ELSE 0
    END) AS Z_SALES
I am currently loading SALES_TYPE_CD as a dimension and REG_SALES and WIN_SALES as measures.
I also have a few other dimensions in the cube, but for simplicity let's just say I have two others, LOCATION and ITEM.
The LOCATION dimension has 3 levels, "Region"->"District"->"Store", ordered from top to bottom.
The ITEM dimension has 3 levels, "CLASS"->"SUBCLASS"->"SKU", ordered from top to bottom.
The SALES TYPE dimension has 2 levels, "SALES_TYPE_GROUP"->"SALES_TYPE_CD", ordered from top to bottom.
I know that I cannot create a simple calculated member in the cube which crossjoins the "SALES_TYPE" dimension with another dimension to get the answer I want.
I would think it requires a more complicated MDX statement, something like:
CREATE MEMBER CURRENTCUBE.[Measures].[Z_Sales]
AS 'SUM(
        FILTER(
            CROSSJOIN(LEAVES(), [Sales Type].[Sales Type].[Sales_Type_CD].&[1]),
            [Measures].[REG_SALES] = 0
        ),
        [Measures].[WIN_SALES]
    )',
FORMAT_STRING = '#,#',
VISIBLE = 1;
But this does not seem to return the desired result.
What would be the proper MDX code to generate the desired result?
I did a bunch of testing with the data, and I now know that there is no way to get the right answer using MDX alone in this scenario. As "Greg" and "Tab" suggested, the only way would be to have REG_SALES as a dimension. Since it is a measure, that is out of the question because of the large number of possible values for its decimal(18,2) data type.
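For reference, the row-level pre-computation in the relational source would mirror the SQL at the top of the question; a sketch (the source table name here is an assumption):
-- Row-level Z_SALES, aggregated in the cube by a plain Sum measure.
SELECT a.*,
       CASE WHEN (a.SALES_TYPE_CD = 1) AND (a.REG_SALES = 0)
            THEN a.WIN_SALES
            ELSE 0
       END AS Z_SALES
FROM sales_fact AS a;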
Thanks for taking the time to answer the question.

MDX Query SUM PROD to do Weighted Average

I'm building a cube in MS BIDS. I need to create a calculated measure that returns the average of the rank value weighted by the number of searches. I want this value to be calculated at any level, no matter which dimensions have been applied to break down the data.
I am trying to do something like the following:
I have one measure called [Rank Search Product], which I want to compute at the lowest level possible and then sum all values of:
IIf([Measures].[Searches] IS NOT NULL, [Measures].[Rank] * [Measures].[Searches], NULL)
And then my weighted average measure uses this:
IIf([Measures].[Rank Search Product] IS NOT NULL AND SUM([Measures].[Searches]) <> 0,
SUM([Measures].[Rank Search Product]) / SUM([Measures].[Searches]),
NULL)
I'm totally new to writing MDX queries and so this is all very confusing to me. The calculation should be
([Rank][0]*[Searches][0] + [Rank][1]*[Searches][1] + [Rank][2]*[Searches][2] ...)
/ SUM([searches])
I've also tried to follow what is explained in this link http://sqlblog.com/blogs/mosha/archive/2005/02/13/performance-of-aggregating-data-from-lower-levels-in-mdx.aspx
Currently, loading my data into a pivot table in Excel returns #VALUE! for all calculations of my custom measures.
Please help!
First of all, you need an intermediate measure, let's say Rank times Searches, in the cube. The most efficient way to implement this is to calculate it when processing the measure group. You would extend your fact table with a column, e.g. in a view, or add a named calculation in the data source view. The SQL expression for this column would be something like Searches * Rank. In the cube definition, set the aggregation function of this measure to Sum and make it invisible. Then just define your weighted average as
[Measures].[Rank times Searches] / [Measures].[Searches]
or, to avoid irritating results for zero/null values of searches:
IIf([Measures].[Searches] <> 0, [Measures].[Rank times Searches] / [Measures].[Searches], NULL)
Since Analysis Services 2012 SP1, you can abbreviate the latter to
Divide([Measures].[Rank times Searches], [Measures].[Searches], NULL)
Then the MDX engine will apply everything automatically across all dimensions for you.
In the second expression, the <> 0 test also covers the null case, since in numerical contexts MDX evaluates null as zero (in contrast to SQL).
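For concreteness, the named calculation or view column mentioned above might look like this (a sketch; the fact table and column names are assumptions):
-- Row-level product, summed by the cube as the [Rank times Searches] measure.
SELECT f.*,
       f.Searches * f.Rank AS RankTimesSearches
FROM fact_search AS f;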
Finally, as I interpret the link in your question, you could leave your measure Rank times Searches at the SQL/Data Source View level as anything, maybe just 0 or null, and then add the following to your calculation script:
({[Measures].[Rank times Searches]}, Leaves()) = [Measures].[Rank] * [Measures].[Searches];
From my point of view, this solution is not as clear as directly calculating the value as described above. I would also think it could be slower, at least if you use aggregations for some partitions in your cube.