I have the following data:
DATE        COUNTRY  ITEM     Value
2005-01-01  UK       op_rate  30%
2005-01-01  UK       proc     1000
2005-01-01  UK       export   750
2005-01-01  ITA      op_rate  45%
2005-01-01  ITA      proc     500
2005-01-01  ITA      export   350
Basically, the data is in long (normalized) format, and it includes both ratios (the op_rate) and volume items such as exported volumes ("export") and processed volumes ("proc").
I need to aggregate "proc" and "export" by SUM, but not "op_rate", for which I need a weighted average by "proc".
In this case the aggregated op_rate would be:
(0.45 * 500 + 0.30 * 1000) / (500 + 1000) = 0.35 // instead of a 0.75 SUM or a 0.375 AVERAGE
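As a quick sanity check, here is that arithmetic in plain Python (numbers taken from the sample data above):

# Weighted average of op_rate, weighted by proc
rates = [0.45, 0.30]
procs = [500, 1000]
print(sum(r * p for r, p in zip(rates, procs)) / sum(procs))  # 0.35
print(sum(rates))                                             # 0.75 (plain SUM, wrong)
print(sum(rates) / len(rates))                                # 0.375 (plain AVERAGE, also wrong)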
All the examples I can find for weighted averages work across measures, but none covers weighting by another item stored in the same column.
Any help most welcome!
I understand that you are reluctant to change your model. The problem you have here is that you are trying to consume a highly normalised table for analysis in an OLAP tool. OLAP tools prefer fact/dimension star schemas, and Tabular/Power BI is no different. I suspect this structure is going to continue to cause problems with future requirements too. Taking the hit on changing it now is the best time to do it, as it will only get more difficult the longer you leave it.
This isn't to say that you can't do what you want with the tools as they stand, but the resulting DAX will be less efficient and the storage required will be sub-optimal.
So with that caveat/lecture given (!) here is how you can do it.
op_rate_agg =
VAR pivoted =
    ADDCOLUMNS (
        SUMMARIZE ( 'Query1', Query1[COUNTRY], Query1[DATE] ),
        -- pivot the ITEM rows into columns on the fly
        "op_rate", CALCULATE ( AVERAGE ( Query1[Value] ), Query1[ITEM] = "op_rate" ),
        "proc", CALCULATE ( SUM ( Query1[Value] ), Query1[ITEM] = "proc" )
    )
RETURN
    -- weighted average: SUM ( op_rate * proc ) / SUM ( proc )
    DIVIDE ( SUMX ( pivoted, [op_rate] * [proc] ), SUMX ( pivoted, [proc] ) )
It is really inefficient, as it has to build the pivoted set on every execution, and you will see that the query plan does a lot more work than it would if you persisted this as a proper fact table. If your model is large you will likely have performance issues with this measure and any measure that references it.
#RADO is correct. You should definitely pivot your ITEM column to get this.
Then a weighted average on op_rate can be written simply as
= DIVIDE (
    SUMX ( Table3, Table3[op_rate] * Table3[proc] ),
    SUMX ( Table3, Table3[proc] )
)
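For illustration only, here is the same pivot-then-weight logic sketched in Python/pandas, using the sample data from the question (this just mirrors the DAX above; it is not part of the model):

import pandas as pd

# Sample data in the original long (normalized) format
df = pd.DataFrame({
    "DATE":    ["2005-01-01"] * 6,
    "COUNTRY": ["UK", "UK", "UK", "ITA", "ITA", "ITA"],
    "ITEM":    ["op_rate", "proc", "export"] * 2,
    "Value":   [0.30, 1000, 750, 0.45, 500, 350],
})

# Pivot ITEM into columns: one row per DATE/COUNTRY
wide = df.pivot_table(index=["DATE", "COUNTRY"], columns="ITEM", values="Value")

# proc and export aggregate by SUM; op_rate by proc-weighted average
print(wide["proc"].sum())                                           # 1500
print(wide["export"].sum())                                         # 1100
print((wide["op_rate"] * wide["proc"]).sum() / wide["proc"].sum())  # 0.35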
I'm looking for advice on how to optimize a multi-level DAX SUMMARIZE query. This one is very slow: I think it runs in O(n^3) because of the nesting. Unfortunately, I need several levels because the hierarchy levels Order > Order Line > Order Detail need to be calculated differently.
Units need to sum up to the Detail level
That needs to be averaged up to the Line level
That needs to be summed up to the Order level
SUMX (
    SUMMARIZE (
        'FACT Opportunity',
        Opportunity[LineId],
        "Units", AVERAGEX (
            SUMMARIZE (
                'FACT Opportunity',
                Opportunity[DetailId],
                "SumDetail", SUM ( 'FACT Opportunity'[Units] )
            ),
            [SumDetail]
        )
    ),
    [Units]
)
Any help or advice you could provide would be very much appreciated.
It's very hard to provide optimisation advice without seeing the data and the data model (it would be great if they were included in the question).
The key issue here is that the presence of duplicates makes the "Units" fact non-additive, meaning that you can't simply roll it up the hierarchy. As a result, you are forced into a very expensive triple loop.
An obvious solution then is to make "Units" fully additive. You can compute de-duplicated (adjusted for duplicates) Units and store them in fact Opportunity, as a calculated column:
Adjusted Units =
DIVIDE (
    'FACT Opportunity'[Units],
    CALCULATE ( COUNT ( 'FACT Opportunity'[DetailId] ) )
)
Here, you divide Units by the number of rows carrying the current DetailId (usually it will be 1, but in the case of a duplicated DetailId it will be 2, and so on), so duplicates share the value between them.
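As a sketch of why this works, in plain Python with hypothetical rows: once each duplicate carries only its share of the value, a simple sum is correct at every level:

from collections import Counter

# Hypothetical fact rows: (line_id, detail_id, units), with one duplicated detail
rows = [
    (1, "D1", 10),
    (1, "D2", 20),
    (1, "D2", 20),  # duplicate of D2
]

# Adjust each row: divide units by the number of rows with the same detail_id
dup_counts = Counter(detail for _, detail, _ in rows)
adjusted = [units / dup_counts[detail] for _, detail, units in rows]

# A plain sum now rolls up correctly: 10 + 20/2 + 20/2 = 30
print(sum(adjusted))  # 30.0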
This calculated column will increase your data loading time a bit, but save a lot of query time. To further optimize, consider pre-calculating it in a data warehouse.
The adjusted Units are fully additive, so your DAX is now simple:
Total Units = SUM('FACT Opportunity'[Adjusted Units])
It should work correctly on any level of the Order > Line > Detail hierarchy (unless there are additional problems not described in the question), and it should be fast.
I would like to calculate a percentage based on one of my measures.
For example:
I have a measure with the distinct-count aggregator.
I would like to calculate the percentage of that measure, based on the currently visible information.
For example:
gender   users-distinct-count   percentage
male     25                     25% (25/100)
female   41                     41% (41/100)
unk      34                     34% (34/100)
But, if I filter out unk, I want the percentage to be out of 25+41, i.e. 66
gender   users-distinct-count   percentage
male     25                     37.8% (25/66)
female   41                     62.2% (41/66)
I also want that, when viewing the data across different dimensions, the total is updated accordingly.
I tried this:
<CalculatedMember name="user_percentage" caption="Users Percentage"
    formula="[Measures].[user_count] / ([Measures].[user_count], [dim1].[All Dim1], [dim2].[All Dim2])"
    dimension="Measures" visible="true">
</CalculatedMember>
but when filtering values on the dimensions (like removing "unk"), the total remains the same (it is still computed over all dimension members).
Thanks,
You should do it at the client level, not the schema level.
The schema has no idea what you're querying on your rows or columns, only the client does.
Some client tools allow you to create a calculated measure as a % of the visible values, but that has to be done by the query.
Example:
WITH
SET ROWSET AS { [Gender].[Male], [Gender].[Female] }
MEMBER [Gender].[Visible] AS Aggregate ( ROWSET, [Measures].[user_count] )
MEMBER [Measures].[Percentage] AS
    ( [Measures].[user_count], [Gender].CurrentMember )
        / ( [Measures].[user_count], [Gender].[Visible] )
SELECT
    ROWSET ON Rows,
    { [Measures].[user_count], [Measures].[Percentage] } ON Columns
FROM [My Cube]
As you must reference the set selected on rows when defining the percentage, you cannot define it at the schema level.
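For illustration only, here is the "percentage of visible members" logic in plain Python, using the counts from the example (this just demonstrates the arithmetic the query performs):

# Distinct counts per gender from the example
counts = {"male": 25, "female": 41, "unk": 34}

def percentages(visible):
    """Percentages computed over only the currently visible members."""
    total = sum(counts[g] for g in visible)
    return {g: round(100 * counts[g] / total, 1) for g in visible}

print(percentages(["male", "female", "unk"]))  # 25.0, 41.0, 34.0
print(percentages(["male", "female"]))         # 37.9, 62.1 (out of 66)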
I'm building a cube in MS BIDS. I need to create a calculated measure that returns the weighted-average of the rank value weighted by the number of searches. I want this value to be calculated at any level, no matter what dimensions have been applied to break-down the data.
I am trying to do something like the following:
I have one measure, [Rank Search Product], which I want to evaluate at the lowest level possible and then sum all values of:
IIf([Measures].[Searches] IS NOT NULL, [Measures].[Rank] * [Measures].[Searches], NULL)
And then my weighted average measure uses this:
IIf([Measures].[Rank Search Product] IS NOT NULL AND SUM([Measures].[Searches]) <> 0,
SUM([Measures].[Rank Search Product]) / SUM([Measures].[Searches]),
NULL)
I'm totally new to writing MDX queries, so this is all very confusing to me. The calculation should be:
(Rank[0]*Searches[0] + Rank[1]*Searches[1] + Rank[2]*Searches[2] + ...) / SUM(Searches)
I've also tried to follow what is explained in this link http://sqlblog.com/blogs/mosha/archive/2005/02/13/performance-of-aggregating-data-from-lower-levels-in-mdx.aspx
Currently, loading my data into a pivot table in Excel returns #VALUE! for all calculations of my custom measures.
Please help!
First of all, you need an intermediate measure, let's say Rank times Searches, in the cube. The most efficient way to implement this is to calculate it when processing the measure group: extend your fact table with an extra column, e.g. in a view, or add a named calculation in the data source view. The SQL expression for this column would be something like Searches * Rank. In the cube definition, set the aggregation function of this measure to Sum and make it invisible. Then just define your weighted average as
[Measures].[Rank times Searches] / [Measures].[Searches]
or, to avoid irritating results for zero/null values of searches:
IIf([Measures].[Searches] <> 0, [Measures].[Rank times Searches] / [Measures].[Searches], NULL)
Since Analysis Services 2012 SP1, you can abbreviate the latter to
Divide([Measures].[Rank times Searches], [Measures].[Searches], NULL)
Then the MDX engine will apply everything automatically across all dimensions for you.
In the second expression, the <> 0 test also covers the null case, because in numerical contexts MDX evaluates NULL as zero - in contrast to SQL.
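A sketch of why the summed intermediate measure yields the correct weighted average at any roll-up level (plain Python, hypothetical data):

# Hypothetical leaf rows: (region, rank, searches)
rows = [
    ("A", 2.0, 100),
    ("A", 4.0, 300),
    ("B", 3.0, 200),
]

def weighted_avg(subset):
    """Sum the precomputed rank*searches products, then divide once."""
    rank_times_searches = sum(r * s for _, r, s in subset)
    searches = sum(s for _, _, s in subset)
    return rank_times_searches / searches if searches else None

print(weighted_avg([row for row in rows if row[0] == "A"]))  # 3.5
print(weighted_avg(rows))                                    # grand total: ~3.33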
Finally, as I interpret the link in your question, you could leave your measure Rank times Searches at the SQL/data source view level as anything - maybe just 0 or null - and would then add the following to your calculation script:
({[Measures].[Rank times Searches]}, Leaves()) = [Measures].[Rank] * [Measures].[Searches];
From my point of view, this solution is not as clear as calculating the value directly, as described above. I would also expect it to be slower, at least if you use aggregations for some partitions in your cube.
I have a table:
LocationId   OriginalValue   Mean
1            0.45            3.99
2            0.33            3.99
3            16.74           3.99
4            3.31            3.99
and so forth...
How would I work out the Standard Deviation using this table and also what would you recommend - STDEVP or STDEV?
As explained below, you probably want STDEVP. To use it, simply:
SELECT STDEVP(OriginalValue)
FROM yourTable
From here:
STDEV is used when the group of numbers being evaluated is only a partial sampling of the whole population. The denominator for dividing the sum of squared deviations is N-1, where N is the number of observations (a count of items in the data set). Technically, subtracting the 1 is referred to as "non-biased."
STDEVP is used when the group of numbers being evaluated is complete - it's the entire population of values. In this case, the 1 is NOT subtracted and the denominator for dividing the sum of squared deviations is simply N itself, the number of observations (a count of items in the data set). Technically, this is referred to as "biased." Remembering that the P in STDEVP stands for "population" may be helpful. Since the data set is not a mere sample but consists of ALL the actual values, this standard deviation function can return a more precise result.
Generally, you should use STDEV when you have to estimate the standard deviation based on a sample. But if you have the entire column of data as input, use STDEVP.
In general, if your data represents the entire population, use STDEVP; otherwise, use STDEV.
Note that for large samples the two functions return nearly the same value, so it is better to use STDEV in that case.
In statistics, there are two types of standard deviations: one for a sample and one for a population.
The sample standard deviation, generally denoted by the letter s, is used as an estimate of the population standard deviation.
The population standard deviation, generally denoted by the lower-case Greek letter sigma, is used when the data constitutes the complete population.
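To see the difference numerically, here is a quick check in Python using the OriginalValue column from the question:

import statistics

values = [0.45, 0.33, 16.74, 3.31]

print(statistics.stdev(values))   # sample standard deviation (divides by N-1), like STDEV
print(statistics.pstdev(values))  # population standard deviation (divides by N), like STDEVP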
It is difficult to answer your question directly -- sample or population -- because it is difficult to tell what you are working with: a sample or a population. It often depends on context.
Consider the following example.
If I want to know the standard deviation of the age of students in my class, then I would use STDEVP, because the class is my population. But if I want to use my class as a sample of the population of all students in the school (this would be what is known as a convenience sample, and would likely be biased, but I digress), then I would use STDEV, because my class is a sample. The resulting value would be my best estimate of the population standard deviation.
As mentioned above (1) for large sample sizes (say, more than thirty), the difference between the two becomes trivial, and (2) generally you should use STDEV, not STDEVP, because in practice we usually don't have access to the population. Indeed, one could argue that if we always had access to populations, then we wouldn't need statistics. The entire point of inferential statistics is to be able to make inferences about a population based on the sample.
I'm creating Analysis Services cubes in Visual Studio BIDS, and have a question about summing in calculated members.
The data has to do with commercial real estate transactions. I want to sum square feet of building space involved in sales transactions for each region. I'm going to use that result in a weighted average calculation. However, I only want to sum the square feet of transactions which have non-null values for the corresponding building capitalization rate (cap rate) member.
Here is a drill-down to Athens in the cube browser:
Note that Athens has 15 values for square feet, but only 5 values for cap rate, reflecting my relational data source as shown here:
So, I only want to sum the five square-feet values that have associated cap-rate values. Doing the math on the relational query result above, you can see that this should produce a sum just over 900K, not the 2-million-plus sum shown in the BIDS screenshot.
My attempt at this calculation:
sum(
    descendants(
        [Property].[Property by Region].CurrentMember,
        [Property].[Property by Region].[Metro Area]
    ),
    iif(
        [Measures].[Cap Rate] is null or [Measures].[Sq Ft] is null,
        0,
        [Measures].[Sq Ft]
    )
)
ends up including the square-feet values that have no corresponding cap rates, so I still end up with a value in the 2 millions.
Why is my iif() clause not working as one would expect?
I was finally able to create the weighted average calculation using a combination of Named Calculations in the Data Source View (DSV) and a calculated member (in the cube script). First, I went to the DSV and added a named calculation called xWeightedCapRt with a formula as follows:
CASE WHEN CapRate IS Null THEN Null Else CapRate * SqFt END
In the cube, I then added xWeightedCapRt as a New Measure. I set its aggregation function to Sum and left its Visible property set to True temporarily.
I created an additional Named Calculation called "xSqFt", defined as:
CASE WHEN CapRate IS Null THEN Null Else SqFt END
and again created a corresponding measure.
On the Calculation tab (of the cube designer) I created a new calculated member, [WAvg Cap Rate by Sq Ft], with the following formula:
[Measures].[x Weighted Cap Rt] / [Measures].[x Sq Ft]
After deploying and processing the cube, I was able to verify that the weighted average calculation matched my spreadsheet numbers. At that point, I set the Visible property of the two intermediate measures to False and redeployed.
What I've learned is that calculations at the "row-level" are best performed through the DSV. You can then use those to build up more complex calculations within the cube.
(NOTE: One thing that needs to be added to the steps above is logic to handle division by zero.)
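For illustration only, here is the same row-level logic in plain Python (hypothetical rows; the zero-denominator guard from the note above is included):

# Hypothetical property rows: (sq_ft, cap_rate); cap_rate may be None
rows = [
    (100_000, 0.065),
    (250_000, None),   # no cap rate: excluded from numerator and denominator alike
    (180_000, 0.072),
]

# Mirror the two DSV named calculations: a NULL cap rate nulls out the row
weighted = sum(sq * rate for sq, rate in rows if rate is not None)
sq_ft = sum(sq for sq, rate in rows if rate is not None)

# WAvg Cap Rate by Sq Ft, guarding against division by zero
wavg = weighted / sq_ft if sq_ft else None
print(wavg)  # 0.0695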
Couldn't you have done a NonEmpty() around the Descendants() on the cap rate measure?