How to do bitwise operations in SSAS cube for aggregations using MDX - sql

I want to model a fact table for our users to help us calculate DAU (Daily Active Users), WAU (Weekly Active Users) and MAU (Monthly Active Users).
The definitions of these measures are as follows:
1. DAU are users who are active every day during the last 28 days.
2. WAU are users who are active on at least one day in each 7-day period during the last 28 days.
3. MAU are users who are active on at least 20 days during the last 28 days.
I have built an SSAS cube with my fact table and user dimension table as follows:
Fact : { date, user_id, activity_name}
Dimension: { date, user_id, gender, age, country }
Now I want to build a cube over this data so that we can see all the measures on any given day for the last 28 days.
I initially thought of storing 28 days of data for all users in SQL Server and then doing a count distinct on date to see which measure each user falls into, but this proved very expensive since the data per day is huge: almost 10 million rows.
So my next thought was to model the fact table (before moving it to SQL) such that it has a new column called "active_status", which is a 32-bit integer column used as a bit field.
Basically, I'll store a binary number (or its decimal equivalent) like 1100000110111101111111111111 (28 bits), which has a bit set on the days the user is active and cleared on the days the user is not active.
This way I can compress 28 days' worth of data into a single row per user before loading it into the data mart.
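For example, building that bitmap in T-SQL could look something like this (a sketch only: the raw table name Activity and the snapshot date @as_of are placeholders; bit 0 represents the snapshot day and bit 27 the oldest day):

DECLARE @as_of date = '2015-10-01';  -- hypothetical snapshot date

SELECT user_id,
       SUM(POWER(CAST(2 AS int), DATEDIFF(day, a.[date], @as_of))) AS active_status
FROM (SELECT DISTINCT user_id, [date] FROM dbo.Activity) AS a   -- dedupe multiple activities per day
WHERE a.[date] >  DATEADD(day, -28, @as_of)
  AND a.[date] <= @as_of
GROUP BY user_id;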
Now the problem is, I think MDX doesn't support bitwise operations on columns in the expressions for calculated members the way regular SQL does. I was hoping to create calculated measures daily_active_users, weekly_active_users and monthly_active_users using MDX that look at this active_status bitmap for each user and do bitwise operations to determine the status.
Any suggestions on how to solve this problem? If MDX doesn't allow bitwise operations, what else can I do in SSAS to achieve this?
Thanks for the help.
Additional notes:
@Frank
Interesting thought about using a view to do the conversion from bitset to a dimension category, but I'm afraid it won't work, because I have a few dimensions connected to this fact table that have many-to-many relationships. For example, I have a dimension called DimLanguage and another dimension called DimCountry, and they have a many-to-many relationship. What I ultimately want to do in the cube is calculate DAU/WAU/MAU, which are COUNT(DISTINCT UserId), based on the combination of dimensions. So, for example, a user may not be MAU for the country US because he is only active 15 days out of 28... but he will be considered

If you do not want to show the bitmap data to the users of the cube, but just the categories DAU, WAU and MAU, you should do the conversion from bitmap to category at data loading time. Just create a dimension table containing e.g. the following data:
id category
-- --------
1 DAU
2 WAU
3 MAU
Then define a view on your fact table that evaluates the bitmap data, and for each user and each date just calculates the id value of the category the user is in. This is then conceptually a foreign key to the dimension table. Use this view instead of the fact table in your cube.
All the bitmap evaluations are thus done on the relational side, where you have the bit operators available.
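A possible shape for that view (a sketch: Stage(date, user_id, active_status) is an assumed staging table whose active_status column holds the 28-day bitmap as an int, and the category ids match the dimension table above; since the definitions overlap, the WHEN order matters, DAU first, then WAU, then MAU):

CREATE VIEW dbo.vFactActivity AS
SELECT s.[date],
       s.user_id,
       CASE
           WHEN s.active_status & 268435455 = 268435455 THEN 1   -- all 28 bits set: DAU
           WHEN s.active_status & 127       > 0                  -- days 1-7
            AND s.active_status & 16256     > 0                  -- days 8-14
            AND s.active_status & 2080768   > 0                  -- days 15-21
            AND s.active_status & 266338304 > 0 THEN 2           -- days 22-28: WAU
           WHEN bits.days_active >= 20 THEN 3                    -- MAU
       END AS category_id                                        -- NULL if none match
FROM dbo.Stage AS s
CROSS APPLY (SELECT COUNT(*) AS days_active                     -- popcount of the bitmap
             FROM (VALUES (1),(2),(4),(8),(16),(32),(64),(128),(256),(512),
                          (1024),(2048),(4096),(8192),(16384),(32768),(65536),
                          (131072),(262144),(524288),(1048576),(2097152),(4194304),
                          (8388608),(16777216),(33554432),(67108864),(134217728)) AS m(mask)
             WHERE s.active_status & m.mask > 0) AS bits;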
EDIT
As your requirement is that you need to aggregate the bitmap data in Analysis Services using bitwise OR as the aggregation method, I see no simple way to do that.
What you could do, however, would be to have 28 single columns, say Day1 to Day28, which would be either 0 or 1. These could be of type byte to save some space. You would use Maximum as aggregation method, which is equivalent to binary OR on a single bit.
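On the relational side, deriving those columns from the bitmap could look something like this (same assumed staging table; SIGN() collapses any non-zero masked value to 1, and the pattern continues through Day27 with the mask doubling each time):

SELECT [date],
       user_id,
       CAST(SIGN(active_status & 1)         AS tinyint) AS Day1,   -- 2^0
       CAST(SIGN(active_status & 2)         AS tinyint) AS Day2,   -- 2^1
       CAST(SIGN(active_status & 4)         AS tinyint) AS Day3,   -- 2^2
       -- ... Day4 through Day27 follow the same pattern ...
       CAST(SIGN(active_status & 134217728) AS tinyint) AS Day28   -- 2^27
FROM dbo.Stage;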
Then, it would not be really complex to calculate the final measure, as we know the values are either zero or one, and thus we can just sum across the days:
CASE
WHEN Measures.[Day1] + ... + Measures.[Day28] = 28 THEN 'DAU'
WHEN Measures.[Day1] + ... + Measures.[Day7] >= 1 AND
Measures.[Day8] + ... + Measures.[Day14] >= 1 AND
Measures.[Day15] + ... + Measures.[Day21] >= 1 AND
Measures.[Day22] + ... + Measures.[Day28] >= 1 THEN 'WAU'
WHEN Measures.[Day1] + ... + Measures.[Day28] >= 20 THEN 'MAU'
ELSE 'Other'
END
The order of the clauses in the CASE is relevant, as the first matching condition is taken, and your definitions of WAU and MAU have some intersection.
Once you have tested everything, you would make the measures Day1 to Day28 invisible in order not to confuse the users of the cube.

Related

SQL Retention based on cohort and period

I have already seen all the related posts, but none have been able to help me.
I have the following fields:
Where:
SOLD_AT is the date of each transaction
CUSTOMER_ID is a unique ID for each customer
COHORT is the date (Year-Month) of the first purchase of the user in that row
ORDER_MONTH is the date (Year-Month) of the purchase in that row
PERIOD_NUMBER is the date difference in months between COHORT and ORDER_MONTH
N_CUSTOMERS is the number of customers in each PERIOD_NUMBER in each COHORT
In case it is useful, I have the queries with which I obtained these fields, but I think including them would only add noise, since the definition of each variable is more useful.
What I need to do, and am not able to do, is add an additional field for the retention of each period number of each cohort (not a pivot table over the period numbers of each cohort).
Specifically, I need the retention of each period number to be the number of users of that period divided by the number of users of the previous period, in this way:
To do this in Python, I simply do:
cohort_pivot = df_cohort.pivot_table(index='cohort',
                                     columns='period_number',
                                     values='n_customers')
cohort_size = cohort_pivot.iloc[:, 0]
retention_matrix1 = cohort_pivot.divide(cohort_size, axis=0)
and I can then unpivot and take out the retention for each period of each cohort to create an additional column with this value.
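In SQL, the equivalent per-row calculation can be sketched with standard window functions (the table name cohort_counts is an assumption; the columns are the ones defined above). Note that the text asks for division by the previous period, while the pandas snippet divides by the first period (the cohort size), so both variants are shown:

SELECT cohort,
       period_number,
       n_customers,
       CAST(n_customers AS float)
           / LAG(n_customers) OVER (PARTITION BY cohort ORDER BY period_number)
           AS retention_prev,      -- vs. the previous period, as described in the text
       CAST(n_customers AS float)
           / FIRST_VALUE(n_customers) OVER (PARTITION BY cohort ORDER BY period_number)
           AS retention_cohort     -- vs. period 0, matching the pandas snippet
FROM cohort_counts;

Because window functions work row by row, this needs no knowledge of how many period numbers exist, which keeps it dynamic for any company that is loaded.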
The closest thing I found was the accepted answer in this post, but I cannot know in advance the number of period numbers or historical months I am going to have, since the code has to be dynamic for any company that is loaded. (For example, in DBT, which is the tool I'm using, you can create dynamic pivot tables instead of static ones that require knowing this information, but as I said, I need to create the field, not the pivot table.)
Any ideas will be more than welcome, thank you very much

Rolling Balances with Allocated Transactions

I need to calculate the start/end balances by day for each Site/Department.
I have a source table, call it “Source”, that has the following fields:
Site
Department
Date
Full_Income
Income_To_Allocate
Payments_To_Allocate
There are 4 Sites (SiteA/SiteB/SiteC/SiteD), Sites B-D have only 1 department and Site A has 10 departments.
This table is “mostly” a daily summary. I say “mostly” as the daily detail from 2018 was lost and instead we just have the monthly summary inputted as one entry on the last day of the month. For 2018 there is only data going back to September. From 1/1/2019 the summary is actually daily.
Any Income in the Full_Income field will be given to that Site/Department at 100% value.
Any Income in the Income_To_Allocate field will be spread among all the Site/Departments using the below logic:
(
  (Prior_Month_Site_Department_Balance + This_Month_Site_Department_Full_Income)
  /
  (Prior_Month_All_Department_Balance + This_Month_All_Department_Full_Income)
)
*
(This_Month_All_Department_Income_To_Allocate)
Any Payments in the Payments_To_Allocate field will be spread among all the Site/Departments using the below logic:
(
  (Prior_Month_Site_Department_Balance + This_Month_Site_Department_Full_Income)
  /
  (Prior_Month_All_Department_Balance + This_Month_All_Department_Full_Income)
)
*
(This_Month_All_Department_Payments_To_Allocate)
The idea behind these pieces of logic is to spread the allocated pieces based on the % of business each Site/Department did when looking at the Full_Income data.
The Balance would be calculated with this logic:
Start Balance:
Prior day Ending Balance
Ending Balance:
Prior day Ending Balance + (Site_Department_Full_Income) + (Site_Department_Allocated_Income) - (Site_Department_Allocated_Payments)
I have tried using the LAG function to grab the prior info I need for these calculations. I always get really close, but I wind up stuck on the fact that the ending balance is calculated using the post-spread values for the allocated income and reseeds, while the calculation for the spread uses the prior-month balance info. This ends up being almost circular logic, but with a finite starting point. I am at a loss for how to make this work.
I am using SQL Server 2012. Let me know if you need any more details.
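Since each month's spread needs the prior month's post-allocation balance, one way to break the near-circularity is to materialize the balances strictly in month order. A rough sketch under assumed names (MonthlySummary and Balances are hypothetical tables at a monthly grain, with decimal amount columns; daily start/end balances can then be rolled forward within each month):

-- MonthlySummary(month_start, site, department, full_income, income_to_alloc, payments_to_alloc)
-- Balances(month_start, site, department, end_balance) starts empty; missing prior rows count as 0.
DECLARE @m date = (SELECT MIN(month_start) FROM dbo.MonthlySummary);
DECLARE @last date = (SELECT MAX(month_start) FROM dbo.MonthlySummary);

WHILE @m <= @last
BEGIN
    INSERT INTO dbo.Balances (month_start, site, department, end_balance)
    SELECT x.month_start, x.site, x.department,
           x.prior_bal + x.full_income
               + x.pct * x.tot_income       -- allocated income
               - x.pct * x.tot_payments     -- allocated payments
    FROM (SELECT s.month_start, s.site, s.department, s.full_income,
                 COALESCE(b.end_balance, 0) AS prior_bal,
                 SUM(s.income_to_alloc)   OVER () AS tot_income,
                 SUM(s.payments_to_alloc) OVER () AS tot_payments,
                 (COALESCE(b.end_balance, 0) + s.full_income)
                     / NULLIF(SUM(COALESCE(b.end_balance, 0) + s.full_income) OVER (), 0)
                     AS pct                 -- this Site/Department's share of the business
          FROM dbo.MonthlySummary AS s
          LEFT JOIN dbo.Balances AS b
                 ON b.site = s.site
                AND b.department = s.department
                AND b.month_start = DATEADD(month, -1, s.month_start)
          WHERE s.month_start = @m) AS x;

    SET @m = DATEADD(month, 1, @m);
END;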

Creating a calculated column (not aggregate) that changes value based on context SSAS tabular DAX

Data: I have a single row that represents an annual subscription to a product. It has an overall startDate and endDate; there is also a third date, startDate + 1 month, called endDateNew. I also have a non-related date table (called table X).
Output I'm looking for: I need a new column called Categorisation that will return 'New' if the date selected in table X is between startDate and endDateNew, and 'Existing' if the date is between startDate and endDate.
Problem: The column seems to evaluate immediately, without taking into account the date context from the non-related date table. I kind of expected this to happen in Visual Studio (where it assumes the context is all records?), but when previewing in Excel it carries this same value through.
The bit that is working: I have an aggregate (an active subscriber count) that correctly counts the subscription as active over the months selected in table X.
The SQL equivalent on an individual date:
CASE
    WHEN '2015-10-01' BETWEEN startDate AND endDateNew THEN 'New'
    WHEN '2015-10-01' < endDate THEN 'Existing'
END AS Category
where the value would be calculated for each date in table X
Thanks!
Ross
Calculated columns are only evaluated at model refresh/process time. This is by design. There is no way to make a calculated column change based on run-time changes in filter context from a pivot table.
Ross,
Calculated columns work differently than they do in Excel. Optimally, the value is known when the record is first added to the model.
Your example is kinda similar to a slowly changing dimension.
There are several possible solutions. Here are two and a half:
Full process on the last 32 days of data every time you process the subscriptions table (which may be unacceptably inefficient).
OR
Create a new table 'Subscription scd' with the primary key from the subscriptions table and your single calculated column of 'Subscription Age in Days'. Like an outrigger. This table could be reprocessed more efficiently than reprocessing the subscriptions table, so process the subscriptions table as incrementals only and do a full process on this table for the data within the last 32 days instead.
OR
Decide which measures are interesting within the 'new/existing' context and write explicit measures for them, using a dynamic filter on the date column in the measures,
e.g. define:
'Sum of Sales - New Subscriptions',
'Sum of Sales - Existing Subscriptions',
'Distinct Count of New Subscriptions - Last 28 Days', etc

What is the best partitioning strategy for multiple distinct count measures in a cube

I have a cube that has a fact table with a month's worth of data. The fact table is 1.5 billion rows.
The fact table contains the following columns: { DateKey, UserKey, ActionKey, ClientKey, ActionCount }.
The fact table contains one row per user per client per action per day, with the number of activities performed.
Now I want to calculate the below measures in my cube as follows:
Avg Days Engaged per user
AVG([Users].[User Key].[User Key], [Measures].[DATE COUNT])
Users Engaged >= 14 Days
SUM([Users].[User Key].[User Key], IIF([Measures].[DATE COUNT] >= 14, 1, 0))
Avg Requests Per User
IIF([Measures].[USER COUNT] = 0, 0 ,[Measures].[ACTIVITY COUNT]/[Measures].[USER COUNT])
So to do this, I have created two distinct count measures, DATE COUNT and USER COUNT, which are distinct aggregations on the DateKey and UserKey columns of the fact table. I now want to partition the measure groups (there are 3 of them, because each distinct count measure goes into its own measure group).
What is the best strategy to partition the cube? I have read the Analysis Services distinct count optimization guide end to end, and it mentions that partitioning the cube by non-overlapping user ids is the best strategy for single-user queries, and user-by-time is the best for single-user time-set queries.
I want to know whether I should:
1. partition the cube into 75 partitions (1.5 billion rows / 20 million rows per partition), each with non-overlapping, sequential user ids, or
2. partition it into 31 partitions, one per day, with overlapping user ids but distinct days in each partition, or
3. partition it into 31 * 3 = 93 partitions, breaking the cube down per day and then splitting each day into 3 equal parts with non-overlapping user ids within each day (users will still overlap between days), or
4. partition by ActionKey into 45 partitions of unequal size, since most of the time the measures are sliced by Action?
I'm a bit confused, because the paper only talks about optimizing a single distinct count measure, whereas I need distinct counts on both users and dates for my measures.
Any tips?
I would first take a step back and try the Many-to-Many dimension count technique to achieve Distinct Count results without the overhead of actual Distinct Count aggregations.
Probably the best explanation of this is the "Distinct Count" section of the "Many to Many Revolution 2.0" paper:
http://www.sqlbi.com/articles/many2many/
Note Solution C is the one I am referring to.
You usually find this solution scales much better than a standard "Distinct Count" measure. For example, I have one cube with 2 billion rows in the biggest fact (and only 4 partitions), and an "M2M Distinct Count" fact of 9 million rows; performance is great, e.g. 6-7 hours to completely reprocess all the data and less than 5 seconds for most queries. The server is OK but not great: a VM with 4 cores and 32 GB RAM (shared with SQL, SSRS, SSIS etc.), no SSD.
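As a very rough illustration of the data-side idea (the paper's Solution C has more moving parts, so treat this as a sketch with assumed names): the heavy fact is paired with a much smaller fact of distinct combinations, which a plain Count measure can then aggregate through the many-to-many relationship instead of running a Distinct Count over 1.5 billion rows:

-- Hypothetical names: compress the 1.5bn-row FactActivity down to one row
-- per user per day, the grain the distinct count actually needs.
SELECT DateKey, UserKey
INTO dbo.FactUserDay
FROM dbo.FactActivity
GROUP BY DateKey, UserKey;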
I think you can get carried away with too many partitions, overcomplicating the design. The basic engine can do wonders with careful design.

SSAS Daily Calculation Rolled up to Any Dimension

I'm trying to create a daily calculation in my cube, or an MDX statement, that will do a calculation daily and roll it up to any dimension. I've been able to successfully get the values back; however, the performance is not what I think it should be.
My fact table has 4 dimensions, 1 of which is a daily date (time) dimension. I have a formula that uses 4 other measures in this fact table, and those need to be calculated daily and then geometrically linked across the time dimension.
The following MDX statement works great and produces the correct value, but it is very slow. I have tried using exp(sum(log+1))-1, and Multiply seems to perform a little better, but not well enough. Is there another approach to this solution, or is there something wrong with my MDX statement?
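For reference, the identity both approaches rely on is the standard geometric-linking rewrite (assuming every daily value r_d satisfies 1 + r_d > 0):

prod_d (1 + r_d) - 1 = exp( sum_d ln(1 + r_d) ) - 1

which turns the product across days, something SSAS cannot aggregate natively, into a sum, which its storage engine handles well.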
I have tried defining aggregations for [Calendar_Date] and [Dim_Y].[Y ID], but the query does not seem to use these aggregations.
WITH
MEMBER Measures.MyCustomCalc AS
    (
        (Measures.x - Measures.y) - (Measures.z - Measures.j)
    ) / Measures.x
MEMBER Measures.LinkedCalc AS
    ASSP.MULTIPLY(
        [Dim_Date].[Calendar Date].Members,
        Measures.MyCustomCalc + 1
    ) - 1
SELECT
    Measures.LinkedCalc ON COLUMNS,
    [Dim_Y].[Y ID].Members ON ROWS
FROM [My DB]
The above query takes 7 seconds to run with the following number of records:
Measure: 98,160 records
Dim_Date: 5,479 records
Dim_Y: 42 records
We had assumed that, by defining an aggregation, the number of calculations we'd be performing would only be 42 * the number of days, in this case a maximum of 5,479 records.
Any help or suggestions would be greatly appreciated!