What is the best partitioning strategy for multiple distinct count measures in a cube - ssas

I have a cube that has a fact table with a month's worth of data. The fact table is 1.5 billion rows.
Fact table contains the following columns { DateKey,UserKey,ActionKey, ClientKey, ActionCount } .
The fact table contains one row per user per client per action per day, with the number of activities performed.
Now I want to calculate the following measures in my cube:
Avg Days Engaged per user
AVG([Users].[User Key].[User Key], [Measures].[DATE COUNT])
Users Engaged >= 14 Days
SUM([Users].[User Key].[User Key], IIF([Measures].[DATE COUNT] >= 14, 1, 0))
Avg Requests Per User
IIF([Measures].[USER COUNT] = 0, 0 ,[Measures].[ACTIVITY COUNT]/[Measures].[USER COUNT])
So to do this, I have created two distinct count measures, DATE COUNT and USER COUNT, which are distinct count aggregations on the DateKey and UserKey columns of the fact table. I now want to partition the measure groups (there are three of them, because each distinct count measure goes into its own measure group).
What is the best strategy to partition the cube? I have read the Analysis Services distinct count optimization guide end to end, and it mentions that partitioning the cube by non-overlapping user IDs is the best strategy for single-user queries, and that user x time is the best for single-user, time-set queries.
I want to know which of these schemes I should use:
- 75 partitions (1.5 billion rows / 20 million rows per partition), each holding a non-overlapping, sequential range of user IDs;
- 31 partitions, one per day, with overlapping user IDs but distinct days in each partition;
- 93 partitions (31 * 3), where I break the cube down by day and then split each day into 3 equal parts with non-overlapping user IDs within each day (but users will overlap between days); or
- 45 unequal partitions by ActionKey, since the measures are most often sliced by Action.
I'm a bit confused because the paper only talks about optimizing a single distinct count measure, whereas I need distinct counts on both users and dates for my measures.
Any tips?
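For scale, the rows-per-partition arithmetic behind the four options above can be sketched as follows (the row counts are the ones stated in the question; the scheme labels are just shorthand for the options listed):

```python
# Rough rows-per-partition arithmetic for each proposed partitioning scheme,
# using the figures from the question (1.5 billion rows, one month of data).
TOTAL_ROWS = 1_500_000_000

schemes = {
    "75 non-overlapping user-id ranges": 75,
    "31 daily partitions": 31,
    "93 = 31 days x 3 user-id ranges": 93,
    "45 ActionKey partitions (unequal)": 45,
}

for name, partitions in schemes.items():
    avg_rows = TOTAL_ROWS // partitions
    print(f"{name}: ~{avg_rows / 1e6:.0f}M rows/partition")
```

This is only average sizing, of course; the ActionKey option would produce unequal partitions, so its average is less informative than the others.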

I would first take a step back and try the Many-to-Many dimension count technique to achieve Distinct Count results without the overhead of actual Distinct Count aggregations.
Probably the best explanation of this is the "Distinct Count" section of the "Many to Many Revolution 2.0" paper:
http://www.sqlbi.com/articles/many2many/
Note Solution C is the one I am referring to.
You will usually find this solution scales much better than a standard Distinct Count measure. For example, I have one cube with 2 billion rows in the biggest fact table (and only 4 partitions) and an M2M distinct-count fact of 9 million rows; performance is great, e.g. 6-7 hours to completely reprocess all the data, and less than 5 seconds for most queries. The server is OK but not great: a VM with 4 cores and 32 GB RAM (shared with SQL Server, SSRS, SSIS, etc.), no SSD.
I think you can get carried away with too many partitions and overcomplicating the design. The basic engine can do wonders with careful design.

Related

How to find count of items in a table by 2 groups

To keep it simple, I have a table with 3 columns:
Request Id (e.g. REQ1, REQ2, and so on)
Status (possible values - In Planning, Work in Progress, In Review, Completed)
Due Type (possible values - Due Today, Due Tomorrow, This Week, Next Week, In 15 Days)
Now, all I want to find out is how to arrive at a result which will tell how many In Planning are Due Today, how many In Review are Due Tomorrow, and so on.
I tried using COUNT with OVER (PARTITION BY ...), but it gives me the count of statuses and the count of due types separately, not a combination of both that maintains the relation between them.
Are you just looking for aggregation?
select due_type, status, count(*)
from t
group by due_type, status;
If so, this is a basic SQL query. You should brush up on group by and other SQL fundamentals.

How to partition based on the month and year in Azure SQL Data Warehouse

I am going to use ADF to copy 5 billion rows to Azure SQL Data Warehouse. Azure SQL DWH will distribute the table into 60 distributions by default, but I want to add another 50 partitions based on month and year, as follows:
PARTITION ( DateP RANGE RIGHT FOR VALUES
(
'2015-01-01', '2015-02-01', ... '2018-01-01', '2018-02-01', '2018-03-01', '2018-04-01', '2018-05-01', ...
))
But, the column that I am using to partition the table includes date and time together :
2015-01-01 00:30:00
Do you think my partitioning approach is correct?
5B rows / (50 partitions x 60 distributions) = ~1.7M rows per distribution per partition on average.
That's probably too many partitions, but if you have a lot of single-month queries it might be worth it. You would definitely want to defragment your columnstores after load.
I tend to agree with David that this is probably overkill for the number of partitions. You'll want to make sure that you have a fairly even distribution of data, and at roughly 1.7M rows per distribution you'll be on the lower side. You can probably move to quarter-based partitions (e.g. boundary values of '2018-01-01', '2018-04-01', '2018-07-01' with RANGE RIGHT) to get good query performance. This would give you 4 partitions a year since 2015 (or 20 total). So the math is:
5B rows / (20 partitions x 60 distributions) = ~4.17M rows per distribution per partition.
While the number of partitions does matter for partition elimination scenarios, this is a fact table with columnstore indexes which will do an additional level of index segment elimination during query time. Over partitioning can make the situation worse rather than better.
Microsoft's guideline is that, when sizing partitions, especially for columnstore-indexed tables in Azure SQL DW, the minimum volume should be 60 million rows per partition; anything lower may not give optimal performance. The logic is that there should be a minimum of 1 million rows per distribution per partition, and since every partition is internally split across the 60 distributions, the minimum works out to 60M rows per proposed partition.
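The sizing math from the answers above can be sanity-checked with a small script; the 60M-rows-per-partition figure follows directly from the 1M-rows-per-distribution columnstore guideline:

```python
# Sketch of the partition-sizing arithmetic discussed above, checked against
# the 1M-rows-per-distribution columnstore guideline for Azure SQL DW.
TOTAL_ROWS = 5_000_000_000
DISTRIBUTIONS = 60            # fixed distribution count in Azure SQL DW
MIN_ROWS_PER_DISTRIBUTION = 1_000_000

def rows_per_distribution(partitions: int) -> float:
    """Average rows per distribution per partition."""
    return TOTAL_ROWS / (partitions * DISTRIBUTIONS)

for partitions in (50, 20):   # monthly vs. quarterly proposals
    r = rows_per_distribution(partitions)
    verdict = "meets" if r >= MIN_ROWS_PER_DISTRIBUTION else "falls below"
    print(f"{partitions} partitions: {r / 1e6:.2f}M rows/distribution "
          f"({verdict} the 1M guideline)")
```

Both proposals technically clear the 1M floor here, but the 50-partition monthly scheme sits close to it, which is why the answers lean toward fewer, larger partitions.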

Distinctcount - suppliers for departments over a period of time - slow performance

In a model that contains the following dimensions:
- Time - month granularity - 5 years - 20 quarters - 60 months
- Suppliers - 6,000 suppliers at the lowest level
- Departments - 500 departments at the lowest level
I need to have the distinct count of the suppliers for each department.
I use the function:
with member [Measures].[#Suppliers] as
distinctcount(([Supplier].[Supplier].[Supplier].members
,[Measures].[Amount]))
select [Measures].[#Suppliers] on 0
, order([Department].[Department].[Department].members, [Measures].[#Suppliers], BDESC) on 1
from [cube]
where [Time].[Time].[2017 10]:[Time].[Time].[2018 01]
The time component may vary, as the dashboard user is free to choose a reporting period.
But the MDX is very slow: it takes about 38 ms to calculate the measure for each row. I want to use this measure to rank the departments, calculate a cumulative %, and assign scores to these values, so as you can imagine, performance will only get worse.
I have tried to use functions and cache the results, but for me the results got worse (according to the log, 2x as bad).
What can I do to improve the performance?
A quick win is to add, in the schema definition, a measure that calculates the distinct count on the Supplier ID of the table associated with [Measures].[Amount].
The other approaches are not scalable as the number of suppliers grows.
Nonetheless, why did you use DistinctCount instead of Count(NonEmpty())?
DistinctCount is mainly for calculating the number of members/tuples that are different within a set. It only makes sense if it's possible to have the same member twice in a set; as our initial members have no duplicates, it's unnecessary here.
Count(NonEmpty()) filters the set down to the non-empty tuples and counts the number of items in the set. This can easily be calculated in parallel.

How to do bitwise operations in SSAS cube for aggregations using MDX

I want to model a fact table for our users to help us calculate DAU (Daily Active Users), WAU (Weekly Active Users) and MAU (Monthly Active Users).
The definitions of these measures are as follows:
1. DAU are users who are active every day during the last 28 days.
2. WAU are users who are active on at least one day in each 7-day period during the last 28 days.
3. MAU are users who are active at least 20 days during the last 28 days.
I have built an SSAS cube with my fact table and user dimension table as follows:
Fact : { date, user_id, activity_name}
Dimension: { date, user_id, gender, age, country }
Now I want to build a cube over this data so that we can see all the measures in any given day for last 28 days.
I initially thought of storing 28 days of data for all users in SQL Server and then doing a distinct count on date to see which measures they fall into, but this proved very expensive, since the data per day is huge: almost 10 million rows.
So my next thought was to model the fact table (before moving it to SQL) so that it has a new column called "active_status", a 32-bit integer used as a bitmap.
Basically, I'll store a binary number (or its decimal equivalent) like 11000001101111011111111111111, which has a bit set on the days the user is active and clear on the days the user is not active.
This way I can compress 28 days worth of data in a single day before loading into data mart
Now the problem is, I think MDX doesn't support bitwise operations on columns in expressions for calculated members the way regular SQL does. I was hoping to create calculated measures daily_active_users, weekly_active_users and monthly_active_users in MDX that look at this active_status bitmap for each user and use bitwise operations to determine the status.
Any suggestions on how to solve this problem? If MDX doesn't allow bitwise operations, what else can I do in SSAS to achieve this?
Thanks for the help.
Additional notes:
@Frank
Interesting thought about using a view to do the conversion from bitset to a dimension category, but I'm afraid it won't work, because I have a few dimensions connected to this fact table that have many-to-many relationships. For example, I have a dimension called DimLanguage and another called DimCountry, and they have a many-to-many relationship. What I ultimately want to do in the cube is calculate the DAU/WAU/MAU, which are COUNT(DISTINCT UserId), based on the combination of dimensions. So, for example, a user may not be MAU for the dimension country US because he is only active 15 days out of 28 ....but he will be considered
If you do not want to show the bitmap data to the users of the cube, but just the categories DAU, WAU, and MAU, you should do the conversion from bitmap to category at data-loading time. Just create a dimension table containing, e.g., the following data:
id category
-- --------
1 DAU
2 WAU
3 MAU
Then define a view on your fact table that evaluates the bitmap data and, for each user and each date, just calculates the id value of the category the user is in. This is then conceptually a foreign key to the dimension table. Use this view instead of the fact table in your cube.
All the bitmap evaluations are thus done on the relational side, where you have the bit operators available.
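As a sketch of what such a view's logic would compute, here is the bitmap-to-category classification in Python (the category ids match the dimension table above; the 0 id for "other" and the bit layout, bit 0 = oldest day, are assumptions for illustration; in T-SQL the same window tests would use the & operator):

```python
# Sketch: classify a user's 28-day activity bitmap into a category id,
# mirroring what the relational view would do with bitwise operators.
# Assumed layout: bit i (0 = oldest day) is set if the user was active that day.

DAU, WAU, MAU, OTHER = 1, 2, 3, 0   # ids matching the dimension table above

def classify(active_bits: int) -> int:
    active_bits &= (1 << 28) - 1            # keep only the 28 day bits
    days = bin(active_bits).count("1")      # number of active days
    if days == 28:
        return DAU                           # active every single day
    # WAU: active at least once in each of the four 7-day windows
    week_mask = (1 << 7) - 1                 # 0b1111111
    if all((active_bits >> (7 * w)) & week_mask for w in range(4)):
        return WAU
    if days >= 20:
        return MAU                           # active 20+ of the 28 days
    return OTHER

print(classify((1 << 28) - 1))                  # all 28 days active -> 1 (DAU)
print(classify(0b1_0000001_0000001_0000001))    # one day per week   -> 2 (WAU)
```

Note the check order (DAU, then WAU, then MAU) matters, since the definitions overlap; this mirrors the ordering point made about the CASE expression below.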
EDIT
As your requirement is that you need to aggregate the bitmap data in Analysis Services using bitwise OR as the aggregation method, I see no simple way to do that.
What you could do, however, is have 28 separate columns, say Day1 to Day28, each either 0 or 1. These could be of type byte to save some space. You would use Maximum as the aggregation method, which is equivalent to a binary OR on a single bit.
Then, it would not be really complex to calculate the final measure, as we know the values are either zero or one, and thus we can just sum across the days:
CASE
WHEN Measures.[Day1] + ... + Measures.[Day28] = 28 THEN 'DAU'
WHEN Measures.[Day1] + ... + Measures.[Day7] >= 1 AND
Measures.[Day8] + ... + Measures.[Day14] >= 1 AND
Measures.[Day15] + ... + Measures.[Day21] >= 1 AND
Measures.[Day22] + ... + Measures.[Day28] >= 1 THEN 'WAU'
WHEN Measures.[Day1] + ... + Measures.[Day28] >= 20 THEN 'MAU'
ELSE 'Other'
END
The order of the clauses in the CASE is relevant, as the first condition matching is taken, and your definitions of WAU and MAU have some intersection.
If you have finally tested everything, you would make the measures Day1 to Day28 invisible in order not to confuse the users of the cube.

SQL: Minimising rows in subqueries/partitioning

So here's an odd thing. I have limited SQL access to a database - the most relevant restriction here being that if I create a query, a maximum of 10,000 rows is returned.
Anyway, I've been trying to have a query return individual case details, but only at busy times - say when 50+ cases are attended to in an hour. So, I inserted the following line:
COUNT(CaseNo) OVER (PARTITION BY DATEADD(hh,
DATEDIFF(hh, 0, StartDate), 0)) AS CasesInHour
... And then used this as a subquery, selecting only those cases where CasesInHour >= 50
However, it turns out that the 10,000 rows limit affects the partitioning - when I tried to run over a longer period nothing came up, as it was counting the cases in any given hour from only a (fairly random) much smaller selection.
Can anyone think of a way to get around this limit? The final total returned will be much lower than 10,000 rows, but it will be looking at far more than 10,000 as a starting point.
If this is really MySQL we're talking about, sql_big_selects and max_join_size affect the number of rows examined, not the number of rows returned. So you'll need to reduce the number of rows examined by being more selective and using proper indexes.
For example, the following query may be examining over 10,000 rows:
SELECT * FROM stats
To limit the selectivity, you might want to grab only the rows from the last 30 days:
SELECT * FROM stats
WHERE created > DATE_SUB(NOW(), INTERVAL 30 DAY)
However, this only reduces the number of rows examined if there is an index on the created column and the cardinality of the index is sufficient to reduce the rows examined.