DistinctCount - suppliers per department over a period of time - slow performance - MDX

In a model that contains the following dimensions:
- Time - month granularity - 5 years - 20 quarters - 60 months
- Suppliers - 6,000 suppliers at the lowest level
- Departments - 500 departments at the lowest level
I need the distinct count of suppliers for each department.
I use the following query:
WITH MEMBER [Measures].[#suppliers] AS
    DistinctCount(
        ([Supplier].[Supplier].[supplier].Members, [Measures].[amount])
    )
SELECT [Measures].[#suppliers] ON 0,
    Order([Department].[Department].[department].Members,
        [Measures].[#suppliers], BDESC) ON 1
FROM [cube]
WHERE [Time].[Time].[2017 10] : [Time].[Time].[2018 01]
The time component may vary, as the dashboard user is free to choose a reporting period.
But the MDX is very slow: it takes about 38 ms to calculate the measure for each row. I want to use this measure to rank the departments, calculate a cumulative %, and assign scores to those values, so, as you can imagine, performance will not improve.
I have tried using functions and caching the result, but for me the results got worse (according to the log, 2x as slow).
What can I do to improve the performance?

The fastest option is to add a measure, in the schema definition, that calculates the distinct count of the Supplier ID column of the fact table behind [Measures].[Amount].
The other options do not scale as the number of suppliers grows.
Nonetheless, why did you use DistinctCount instead of Count(NonEmpty())?
DistinctCount is mainly for counting the members/tuples in a set that differ from one another. It only makes sense when the same member can appear twice in the set; since the initial set here has no duplicates, it adds nothing.
Count(NonEmpty()) filters the set with NonEmpty and then counts the items that remain, and this can easily be calculated in parallel.
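Rewritten that way, the query from the question might look like the following sketch (same names as above, untested against your cube):
WITH MEMBER [Measures].[#suppliers] AS
    Count(
        NonEmpty(
            [Supplier].[Supplier].[supplier].Members,
            [Measures].[amount]
        )
    )
SELECT [Measures].[#suppliers] ON 0,
    Order([Department].[Department].[department].Members,
        [Measures].[#suppliers], BDESC) ON 1
FROM [cube]
WHERE [Time].[Time].[2017 10] : [Time].[Time].[2018 01]
NonEmpty drops the suppliers with no [Measures].[amount] in the selected period before the plain Count runs, which should give the same result as the DistinctCount version here while being easier for the engine to evaluate.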

Related

SSAS MDX Calculation - Sum based off a group value

I work for a hotel company and I have set up a fact table with the granularity of a stay night for each guest, e.g. if a guest stays for 3 nights, there would be a row for each night of the stay.
What I am trying to do is create a measure for the occupancy percentage (rooms booked divided by available rooms).
I have a column in the fact table that says how many rooms the hotel has, but just summing up that value doesn't work, because then it is just multiplying the number of rooms by the number of guests. So I need to sum up the total guests and then divide by the number of rooms that particular hotel has. Does this make sense?
[Measures].[On The Books] / [Measures].[Rooms Available]
The SQL for this would be:
SELECT stay.PropertyKey, prop.RoomsAvailable, stay.StayDateKey, COUNT(stay.Confirmation) AS Confirmation,
CAST(COUNT(stay.Confirmation) AS DECIMAL(13,9)) / CAST(prop.RoomsAvailable AS DECIMAL(13,9)) AS OccupancyPercentage
FROM dbo.FactStayNight stay
INNER JOIN
(
SELECT DISTINCT PropertyKey, RoomsAvailable
FROM dbo.FactStayNight
) prop
ON stay.PropertyKey = prop.PropertyKey
GROUP BY stay.PropertyKey, stay.StayDateKey, prop.RoomsAvailable
Your fact table is good, apart from the column with the total number of rooms. The fact rows are at the granularity level "Room", but the total number of rooms is at the granularity level "Entire hotel".
(You can imagine a "Real estate assets" hierarchy dimension, assuming you don't have one:
Hotel
Floor
Room
)
Possible solutions:
Add a "number of rooms" available in your Date dimension, at the Day level (strictly, "Night" level). This will sum commensurably with COUNT(Guests staying on that day). You could even adjust this number to reflect e.g. rooms under repair in particular periods.
You could implement a Room dimension, with each guest's Fact_NightStayed assigned to a Room. Then make what is technically called a "headcount" table, just like your Fact_NightStayed. But this table would be a "roomcount" table: a row indicates that a room exists on a particular day (or, if you decide, that a room exists and is usable i.e. not broken/being repaired). Pre-populate this table with one row per room per date, into the future up to a date you decide (this would be an annual refresh process). Then, joining Fact_NightStayed to Fact_RoomCount, your measure would be COUNT(NightStayed)/COUNT(RoomCount).
Watch out for aggregating this measure (however you implement it) over time: the aggregation function itself from the Day leaf level up the Date hierarchy should be AVG rather than SUM.
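With the roomcount approach, the cube-side measure reduces to a plain ratio of two row counts. A minimal sketch (the measure, dimension, and cube names here are illustrative, not taken from your model):
WITH MEMBER [Measures].[Occupancy %] AS
    -- Illustrative names: [Night Stayed Count] counts Fact_NightStayed rows,
    -- [Room Count] counts Fact_RoomCount rows
    IIF([Measures].[Room Count] = 0, NULL,
        [Measures].[Night Stayed Count] / [Measures].[Room Count]),
    FORMAT_STRING = 'Percent'
SELECT [Measures].[Occupancy %] ON 0,
    [Property].[Property].Members ON 1
FROM [Hotel Cube]
Because both counts are regular additive measures, this ratio aggregates correctly across hotels; only the time axis needs the AVG treatment described above.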

QlikView: aggregating a calculated expression

I have a table that is used to calculate a daily completion score by individuals at various locations. Example: on day 1, 9/10 people completed the task, so the location score is 90%. The dimension is "ReferenceDate." The expression is a calculation of count(distinct if(taskcompleted=yes, AccountNumber)) / count(distinct AccountNumber).
Now, I want to report on the average scores per month. I DO NOT want to aggregate all the data and then divide; I want the daily average. Example:
day 1: 9/10 = 90%
day 2: 90/100 = 90% (many more people showed up at the same location)
The average of the two days is 90%.
It's not 99/110,
and it is also not distinct(99) / distinct(110). It is simply (.9 + .9) / 2.
Does this make sense?
What I have now is a line graph showing the daily trend across many months. I need to roll that up into bar charts by month and then compare multiple locations so we can see what locations are having the lower average completion scores.
You need to use the aggr() function to tell QlikView to do the calculation day by day and then average the answers.
It should look something like this (the lines are just split to show which terms work together):
avg(
aggr(
count(distinct if(taskcompleted=yes, AccountNumber))
/ count(distinct AccountNumber)
,ReferenceDate)
)

MDX - sum costs up to a given date

This is a slight modification of what I stumbled upon while searching the web:
Let's say I have a dimension PROJECTS which contains:
project_id - unique id
category - category of a cost
project_date - the date up to which the costs are summed
My warehouse also has the dimension of TIME with date, and a dimension COSTS containing values of costs. Those three dimensions are connected by the measure group EXPENSES which has:
id_date
id_cost
id_project
I want to write an MDX query which would group the projects by their category and sum up all the costs, but only those that do not exceed the date given in the project_date attribute of the PROJECTS dimension (each category has the same project_date; I know it's redundant, but I can't change it...).
I'm not sure, but maybe something along these lines?
SELECT
[COSTS].[COST] ON 0,
[PROJECTS].[category] ON 1
FROM [CUBE]
WHERE
[PROJECTS].[project_date] < #project_date
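The WHERE clause above is not valid MDX, since a comparison cannot go there. One hedged possibility is a running sum over the TIME dimension up to the category's project_date, assuming the project_date keys match the TIME date keys so LinkMember can translate between the two hierarchies, and assuming a [Measures].[COST] measure on the EXPENSES measure group (if the cost value only lives in the COSTS dimension, it would first have to be exposed as a measure):
WITH MEMBER [Measures].[Cost up to project_date] AS
    SUM(
        -- all TIME members from the beginning up to the linked project_date
        NULL : LinkMember(
            Exists([PROJECTS].[project_date].[project_date].Members,
                [PROJECTS].[category].CurrentMember).Item(0),
            [TIME].[date]),
        [Measures].[COST]
    )
SELECT [Measures].[Cost up to project_date] ON 0,
    [PROJECTS].[category].[category].Members ON 1
FROM [CUBE]
Exists() picks the single project_date that goes with the current category (the question states each category has exactly one), and NULL : member is the range from the first date up to that member.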

What is the best partitioning strategy for multiple distinct count measures in a cube

I have a cube that has a fact table with a month's worth of data. The fact table is 1.5 billion rows.
Fact table contains the following columns { DateKey,UserKey,ActionKey, ClientKey, ActionCount } .
The fact table contains one row per user per client per action per day, with the number of activities performed.
Now I want to calculate the following measures in my cube (see the combined query sketch after the list):
Avg Days Engaged per user
AVG([Users].[User Key].[User Key].Members, [Measures].[DATE COUNT])
Users Engaged >= 14 Days
SUM([Users].[User Key].[User Key].Members, IIF([Measures].[DATE COUNT] >= 14, 1, 0))
Avg Requests Per User
IIF([Measures].[USER COUNT] = 0, 0 ,[Measures].[ACTIVITY COUNT]/[Measures].[USER COUNT])
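Put together, and assuming the measure and attribute names above (the cube name [Cube] is a placeholder), the three calculations might be tested with a single query like this sketch:
WITH
MEMBER [Measures].[Avg Days Engaged Per User] AS
    AVG([Users].[User Key].[User Key].Members, [Measures].[DATE COUNT])
MEMBER [Measures].[Users Engaged >= 14 Days] AS
    SUM([Users].[User Key].[User Key].Members,
        IIF([Measures].[DATE COUNT] >= 14, 1, 0))
MEMBER [Measures].[Avg Requests Per User] AS
    IIF([Measures].[USER COUNT] = 0, 0,
        [Measures].[ACTIVITY COUNT] / [Measures].[USER COUNT])
SELECT { [Measures].[Avg Days Engaged Per User],
    [Measures].[Users Engaged >= 14 Days],
    [Measures].[Avg Requests Per User] } ON 0
FROM [Cube]
Note the explicit .Members on the [User Key] level, which the AVG and SUM set functions require.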
To do this, I have created two distinct count measures, DATE COUNT and USER COUNT, which are distinct count aggregations on the DateKey and UserKey columns of the fact table. I now want to partition the measure groups (there are three of them, because each distinct count measure goes into its own measure group).
What is the best strategy to partition the cube? I have read the Analysis Services distinct count optimization guide end to end, and it mentions that partitioning by non-overlapping user IDs is the best strategy for single-user queries, and user x time is best for single-user, time-sliced queries.
Should I:
- partition the cube into 75 partitions (1.5 billion rows / 20 million rows per partition), each with non-overlapping, sequential user IDs;
- partition it into 31 partitions, one per day, with overlapping user IDs but distinct days in each partition;
- partition it into 31 x 3 = 93 partitions, i.e. break the cube down per day and then split each day into 3 equal parts with non-overlapping user IDs within the day (users would still overlap between days); or
- partition by ActionKey into 45 partitions of unequal size, since the measures are most often sliced by Action?
I'm a bit confused, because the paper only talks about optimizing a single distinct count measure, whereas I need distinct counts on both users and dates for my measures.
Any tips?
I would first take a step back and try the Many-to-Many dimension count technique to achieve Distinct Count results without the overhead of actual Distinct Count aggregations.
Probably the best explanation of this is the "Distinct Count" section of the "Many to Many Revolution 2.0" paper:
http://www.sqlbi.com/articles/many2many/
Note Solution C is the one I am referring to.
You will usually find this solution scales much better than a standard "Distinct Count" measure. For example, I have one cube with 2 billion rows in the biggest fact table (and only 4 partitions) and an "M2M Distinct Count" fact of 9 million rows; performance is great, e.g. 6-7 hours to completely reprocess all the data, and less than 5 seconds for most queries. The server is OK but not great: a VM with 4 cores and 32 GB RAM (shared with SQL, SSRS, SSIS, etc.), no SSD.
I think you can get carried away with too many partitions and overcomplicate the design. The basic engine can do wonders with careful design.

SSAS Daily Calculation Rolled up to Any Dimension

I'm trying to create a daily calculation in my cube, or an MDX statement, that will do a calculation daily and roll it up to any dimension. I've been able to get the correct values back; however, the performance is not what I think it should be.
My fact table has 4 dimensions, one of which is a daily date (time) dimension. I have a formula that uses 4 other measures in this fact table; those need to be calculated daily and then geometrically linked across the time dimension.
The following MDX statement works and produces the correct value, but it is very slow. I have tried using exp(sum(log + 1)) - 1 (sketched at the end below), and multiply seems to perform a little better, but not well enough. Is there another approach to this, or is there something wrong with my MDX statement?
I have tried defining aggregations for [Calendar_Date] and [Dim_Y].[Y ID], but the query does not seem to use them.
WITH
MEMBER Measures.MyCustomCalc AS
    (
        (Measures.x - Measures.y) - (Measures.z - Measures.j)
    ) / Measures.x
MEMBER Measures.LinkedCalc AS
    ASSP.MULTIPLY(
        [Dim_Date].[Calendar Date].Members,
        Measures.MyCustomCalc + 1
    ) - 1
SELECT
    Measures.LinkedCalc ON COLUMNS,
    [Dim_Y].[Y ID].Members ON ROWS
FROM
    [My DB]
The above query takes 7 seconds to run with the following record counts:
Measure: 98,160 records
Dim_Date: 5,479 records
Dim_Y: 42 records
We had assumed that, by defining an aggregation, the number of calculations we'd be performing would be only 42 * the number of days, in this case a maximum of 5,479 records.
Any help or suggestions would be greatly appreciated!
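For reference, the exp(sum(log)) variant mentioned above could be written along the lines of this sketch (same names as in the query; Exp and Log come from the VBA function library that SSAS exposes, and the leaf-level set is an assumption, so test it against the actual cube):
WITH
MEMBER Measures.MyCustomCalc AS
    ((Measures.x - Measures.y) - (Measures.z - Measures.j)) / Measures.x
MEMBER Measures.LinkedCalc AS
    -- Product(1 + r) - 1 rewritten as Exp(Sum(Log(1 + r))) - 1,
    -- summing over the leaf (day) level so the All member is excluded
    Exp(Sum([Dim_Date].[Calendar Date].[Calendar Date].Members,
        Log(Measures.MyCustomCalc + 1))) - 1
SELECT
    Measures.LinkedCalc ON COLUMNS,
    [Dim_Y].[Y ID].Members ON ROWS
FROM
    [My DB]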