I have a table with the following data:
User,Platform,Dt,Activity_Flag,Total_Purchases
1,iOS,05/05/2016,1,1
1,Android,05/05/2016,1,2
2,iOS,05/05/2016,1,0
2,Android,05/05/2016,1,2
3,iOS,05/05/2016,1,1
3,Android,06/05/2016,1,3
1,iOS,06/05/2016,1,2
4,Android,06/05/2016,1,2
1,Android,06/05/2016,1,0
3,iOS,07/05/2016,1,2
2,iOS,08/05/2016,1,0
I want to do a GROUPING SETS (Platform, Dt, (Platform, Dt), ()) aggregation to find, for each combination of Platform and Dt, the following:
Total Purchases
Total Unique Users
Average Purchases per User per Day
The first two are simple, as they can be achieved via sum(Total_Purchases) and count(distinct User) respectively.
The problem I have is with the last metric. The result set should look like this, but I don't know how to get the last column calculated correctly:
Platform,Dt,Total_Purchases,Total_Unique_Users,Average_Purchases_Per_User_Per_Day
Android,05/05/2016,4,2,2.0
iOS,05/05/2016,2,3,0.7
Android,06/05/2016,5,3,1.7
iOS,06/05/2016,2,1,2.0
iOS,07/05/2016,2,1,2.0
iOS,08/05/2016,0,1,0.0
,05/05/2016,6,3,2.0
,06/05/2016,7,3,2.3
,07/05/2016,2,1,2.0
,08/05/2016,0,1,0.0
Android,,9,4,1.8
iOS,,6,3,1.2
,,15,4,1.6
For the first ten rows, the average purchases per user per day is a simple division of Total_Purchases by Total_Unique_Users, as each of these rows covers a single date only. But the final three rows show that this division does not produce the desired result: those rows need to take the average for each day in turn to arrive at the overall per-day figure.
If this isn't clear please let me know and I'll be happy to explain better. This is my first post on this site!
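One way this could be approached (a sketch only, assuming SQL Server syntax for CROSS APPLY and a source table named purchases): compute the per-day ratios at the two daily grains first, then attach to each grouping-set row either its single day's ratio or the average of its days' ratios.

WITH daily AS (
    -- per-day ratio at (Platform, Dt) grain and at (Dt) grain (Platform is NULL)
    SELECT Platform, Dt,
           SUM(Total_Purchases) * 1.0 / COUNT(DISTINCT [User]) AS Day_Ratio
    FROM purchases
    GROUP BY GROUPING SETS ((Platform, Dt), (Dt))
),
gs AS (
    -- total purchases and distinct users for every requested grouping set
    SELECT Platform, Dt,
           SUM(Total_Purchases)   AS Total_Purchases,
           COUNT(DISTINCT [User]) AS Total_Unique_Users
    FROM purchases
    GROUP BY GROUPING SETS (Platform, Dt, (Platform, Dt), ())
)
SELECT g.Platform, g.Dt, g.Total_Purchases, g.Total_Unique_Users,
       a.Avg_Ratio AS Average_Purchases_Per_User_Per_Day
FROM gs AS g
CROSS APPLY (
    -- single-day rows pick up that day's ratio; multi-day rows average it
    SELECT AVG(d.Day_Ratio) AS Avg_Ratio
    FROM daily AS d
    WHERE ISNULL(d.Platform, '') = ISNULL(g.Platform, '')
      AND (g.Dt IS NULL OR d.Dt = g.Dt)
) AS a;

The ISNULL trick assumes the empty string never occurs as a real Platform value; in production, GROUPING() flags would be a safer way to tell the "all platforms" rows apart from genuine NULLs.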
I'm trying to get a count of the number of policies issued per month. This is close to returning the correct information:
SELECT COUNT(policy_no), LEFT(issue_date, 6)
FROM table_a
WHERE indicator = 'fln'
GROUP BY LEFT(issue_date, 6)
The indicator narrows it down to the types of policies I want. The only problem I'm having is that there will be an entry with an identical policy number every year as the policy renews. I need to count each policy only once, at its lowest issue date, not once per renewal. If a policy was issued in November of 2010, I want it counted that one time, not once each for November 2010, 2011, 2012, etc. The issue dates are in the format yyyymmdd; only the year and month are relevant.
I'm sure this is an easy one for the more experienced among you; I haven't been able to piece it together from other questions on this forum. Any help would be appreciated!
Something like this will get what you want:
SELECT LEFT(FirstIssued, 6) AS YYYYMM, COUNT(DISTINCT Policy_No) AS NumPolicies
FROM
(
    -- earliest issue date per policy, so renewals collapse to one row
    SELECT Policy_No, MIN(issue_date) AS FirstIssued
    FROM table_a
    WHERE indicator = 'fln'
    GROUP BY Policy_No
) A
GROUP BY LEFT(FirstIssued, 6)
The key is to first find the minimum date for each policy, before aggregating the counts. Note that the only months that will appear are those with at least one policy, so if you would prefer to see 0s, you need to add in a date generator.
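If you do want the 0s, here is a minimal sketch of that date-generator idea in SQL Server syntax (the 2010-2015 reporting range is an assumption):

WITH Months AS (
    -- one row per month over the assumed reporting range
    SELECT CAST('20100101' AS date) AS MonthStart
    UNION ALL
    SELECT DATEADD(MONTH, 1, MonthStart)
    FROM Months
    WHERE MonthStart < '20151101'
),
FirstIssue AS (
    -- earliest issue date per policy, as in the query above
    SELECT Policy_No, MIN(issue_date) AS FirstIssued
    FROM table_a
    WHERE indicator = 'fln'
    GROUP BY Policy_No
)
SELECT CONVERT(char(6), m.MonthStart, 112) AS YYYYMM,
       COUNT(f.Policy_No) AS NumPolicies  -- 0 for months with no new policies
FROM Months AS m
LEFT JOIN FirstIssue AS f
    ON LEFT(f.FirstIssued, 6) = CONVERT(char(6), m.MonthStart, 112)
GROUP BY CONVERT(char(6), m.MonthStart, 112)
ORDER BY YYYYMM
OPTION (MAXRECURSION 366);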
I have a cube that has a fact table with a month's worth of data. The fact table is 1.5 billion rows.
The fact table contains the following columns: { DateKey, UserKey, ActionKey, ClientKey, ActionCount }.
It has one row per user per client per action per day, with the number of activities performed.
Now I want to calculate the following measures in my cube:
Avg Days Engaged per user
AVG([Users].[User Key].[User Key], [Measures].[DATE COUNT])
Users Engaged >= 14 Days
SUM([Users].[User Key].[User Key], IIF([Measures].[DATE COUNT] >= 14, 1, 0))
Avg Requests Per User
IIF([Measures].[USER COUNT] = 0, 0 ,[Measures].[ACTIVITY COUNT]/[Measures].[USER COUNT])
So to do this, I have created two distinct count measures, DATE COUNT and USER COUNT, which are distinct aggregations on the DateKey and UserKey columns of the fact table. I now want to partition the measure groups (there are three of them, because each distinct count measure goes into its own measure group).
What is the best strategy to partition the cube? I have read the Analysis Services distinct count guide end to end, and it mentions that partitioning the cube by non-overlapping user IDs is the best strategy for single-user queries, and that user-by-time partitioning is the best for single-user time-set queries.
I want to know which of these I should choose:
75 partitions (1.5 billion rows / 20 million rows per partition), each with non-overlapping, sequential user IDs;
31 partitions, one per day, with overlapping user IDs but distinct days in each partition;
31 * 3 = 93 partitions, where I break the cube down per day and then further split each day into 3 equal parts with non-overlapping user IDs within each day (though users will overlap between days); or
45 unequally sized partitions by ActionKey, since the measures are most often sliced by Action.
I'm a bit confused because the paper only talks about optimizing a single distinct count measure, whereas I need distinct counts on both users and dates for my measures.
Any tips?
I would first take a step back and try the Many-to-Many dimension count technique to achieve Distinct Count results without the overhead of actual Distinct Count aggregations.
Probably the best explanation of this is the "Distinct Count" section of the "Many to Many Revolution 2.0" paper:
http://www.sqlbi.com/articles/many2many/
Note Solution C is the one I am referring to.
You usually find this solution scales much better than a standard "Distinct Count" measure. For example, I have one cube with 2 billion rows in the biggest fact table (and only 4 partitions) and an "M2M Distinct Count" fact of 9 million rows. Performance is great: 6-7 hours to completely reprocess all the data, and less than 5 seconds for most queries. The server is OK but not great: a VM with 4 cores and 32 GB RAM (shared with SQL, SSRS, SSIS, etc.), no SSD.
I think you can get carried away with too many partitions and overcomplicate the design. The basic engine can do wonders with careful design.
I need to calculate the average client age across a population. I have one enrollment record per client per month, with an age dimension key. So, calculating an average age for one month is not a problem, but when I need to do it for a time period of a year, that's where I flounder. How can I write a calculated measure that takes the age of only the first record per client in the time range?
Thanks everyone for the suggestions.
Michael
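To make the intended semantics concrete, here is the relational equivalent as a hedged SQL sketch (the Enrollment table, its column names, and having the age value directly on the row are all assumptions):

WITH FirstInRange AS (
    -- number each client's enrollment records within the period, oldest first
    SELECT ClientKey, Age,
           ROW_NUMBER() OVER (PARTITION BY ClientKey
                              ORDER BY EnrollmentMonth) AS rn
    FROM Enrollment
    WHERE EnrollmentMonth BETWEEN '20160101' AND '20161231'
)
SELECT AVG(CAST(Age AS decimal(9, 2))) AS AvgClientAge
FROM FirstInRange
WHERE rn = 1;  -- keep only each client's first record in the range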
I would make age a measure with the aggregation function "FirstChild". This means that for the time dimension, the first child is taken at each hierarchy level, while for all other dimensions the age is summed. Hence this age measure should be made invisible, and only the average age measure, calculated from it by dividing by the number of clients, should be visible.
For this to work properly, you need to set the "Type" property of your time dimension to "Time". It also requires the Enterprise or Developer edition of SQL Server; the "FirstChild" aggregation function is not supported in Standard edition.
I have a problem and was wondering if anyone could help, or whether it is even possible to have an algorithm for something like this.
I need to create a predictive ordering wizard. Based on previous sales, we will determine that a certain amount of an item is required, e.g. 31 apples. Now I need to work out the number of cases that need to be ordered. If the cases come in, say, 60, 30, 15, and 10 apples, the order should be one case of 30 and one case of 10 apples.
The number of items that needs to be ordered changes in each row of the result set. The case sizes can also change for each item, so some items may have an option of 5 different cases and some items may end up with an option of only one case.
Other examples: I need 39 cans of Coke and the cases only come in 24 per case, therefore I need 2 cases. I need 2 shots of Baileys and the bottles of Baileys come in 50cl or 70cl, therefore I need the 50cl.
The result set's columns are ItemName, ItemSize, QuantityRequired, PackSize and PackSizeMultiple.
ItemName is the item to be ordered. ItemSize is the size the item is used in, e.g. a can of Coke. QuantityRequired is how many of the item, in this case cans of Coke, need to be ordered. PackSize is the size of the case. PackSizeMultiple is the number to multiply the item by to work out how many of the items are in the case.
PS: this will be a query in SQL Server 2008.
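For the examples above, a greedy pick (take the largest case that still fits, then the smallest case to cover any remainder) reproduces both answers. Here is a minimal T-SQL sketch for a single item; the #Pack table and all names are assumptions, and note that greedy is not guaranteed optimal for every set of case sizes:

CREATE TABLE #Pack (ItemName varchar(50), PackSize int);
INSERT INTO #Pack VALUES ('Apples', 60), ('Apples', 30), ('Apples', 15), ('Apples', 10);

DECLARE @ItemName varchar(50) = 'Apples',
        @Remaining int = 31,           -- QuantityRequired
        @Pick int;
DECLARE @Picks TABLE (PackSize int);

WHILE @Remaining > 0
BEGIN
    SET @Pick = NULL;
    -- largest pack that fits; if none fits, the smallest pack (overshoot)
    SELECT TOP (1) @Pick = PackSize
    FROM #Pack
    WHERE ItemName = @ItemName
    ORDER BY CASE WHEN PackSize <= @Remaining THEN 0 ELSE 1 END,
             CASE WHEN PackSize <= @Remaining THEN -PackSize ELSE PackSize END;

    IF @Pick IS NULL BREAK;            -- no pack sizes defined for the item

    INSERT INTO @Picks (PackSize) VALUES (@Pick);
    SET @Remaining = CASE WHEN @Pick >= @Remaining THEN 0
                          ELSE @Remaining - @Pick END;
END;

SELECT PackSize, COUNT(*) AS NumCases  -- 31 apples -> one case of 30, one of 10
FROM @Picks
GROUP BY PackSize;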
Sounds like you need a UOM (Unit of Measure) table and a function to calculate the co-pack case count and the unit count, with the UOM type based on time between orders. You would also need a cron cycle and a freeze table managed by week/time interval, in order to create a frozen view of the quantity sold each week and the number of units since the last order.

Based on the two orders preceding your prior order, you would set the current prediction from the minimum time between the last two freeze cycles containing an order and the number of days between them. From the average time between orders and the unit quantity in each order, you can create a unit decay ratio (a percentage based on days) and store it in each slice going forward. With a reference to this data you will be able to create a prediction that lets you trigger a notice to sales, or a message to the client, to reorder. In addition, if you capture response data from sales based on unit count feedback from the client, you can reference actuals and tune your decay rate against your prediction.

You should also consider managing and rolling up these freezes by month, so that you can view historical trends and forecast revenue based on reorder velocity and the same period last year. Basically this is similar to sales forecasting, swapping out your opportunity percentage-to-close for a Predicted Remaining Quantity percentage.