I have an hourly (timestamp) dataset of events from the past month.
I would like to check the performance of events that occurred between certain hours, group them together and average the results.
For example: AVG income of the hours 23:00-02:00 per user:
So if I have this data set below. I'd like to summarise the coloured rows and then average them (the result should be 218).
I tried NTILE but it couldn't divide the data properly, ignoring the irrelevant hours.
Is there a good way to create these custom buckets using SQL?
dataset
From the description I'm not exactly sure how you want to aggregate. If you provide an example dataset I can update the answer.
However you can easily achieve this with an AVG and IF statement.
AVG(IF(EXTRACT(HOUR FROM timestamp_field) BETWEEN 0 AND 4, value, NULL)) AS avg_value
Using the above you can then group by either day or month to get the aggregation level you want.
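The conditional-average idea can be tried out with a small sketch. It uses Python's built-in sqlite3 module so it runs anywhere; the table name events, its columns and the sample rows are made up for illustration, and CASE plus strftime stand in for IF and EXTRACT, which SQLite does not have. It assumes the 23:00-02:00 window means the hours 23, 0 and 1, and it averages the matching rows directly rather than summing per night first.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: one row per user per hour
conn.execute("CREATE TABLE events (user_id INTEGER, ts TEXT, income REAL)")
rows = [
    (1, "2023-01-01 22:00:00", 50.0),   # outside the window
    (1, "2023-01-01 23:00:00", 100.0),
    (1, "2023-01-02 00:00:00", 200.0),
    (1, "2023-01-02 01:00:00", 300.0),
    (1, "2023-01-02 03:00:00", 999.0),  # outside the window
    (2, "2023-01-01 23:00:00", 10.0),
    (2, "2023-01-02 12:00:00", 500.0),  # outside the window
]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# The window wraps around midnight, so it needs OR rather than BETWEEN.
# Rows outside the window yield NULL, which AVG ignores.
query = """
SELECT user_id,
       AVG(CASE
               WHEN CAST(strftime('%H', ts) AS INTEGER) >= 23
                 OR CAST(strftime('%H', ts) AS INTEGER) < 2
               THEN income
           END) AS avg_income
FROM events
GROUP BY user_id
ORDER BY user_id
"""
result = conn.execute(query).fetchall()
print(result)
```

Adding a date expression to the GROUP BY would give the per-day or per-month level mentioned above.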
In a model that contains the following dimensions:
- Time - granularity month - 5 years - 20 quarters - 60 months
- Suppliers- 6000 suppliers at lowest level
- departments - 500 departments on lowest level
I need to have the distinct count of the suppliers for each department.
I use the function:
with member [measures].[#suppliers] as
distinctcount(([Supplier].[Supplier].[supplier].members
,[Measures].[amount]))
select [Measures].[#suppliers] on 0
, order([Department].[Department].[department].members, [#suppliers], BDESC) on 1
from [cube]
where [Time].[Time].[2017 10]:[Time].[Time].[2018 01]
The time component may vary, as the dashboard user is free to choose a reporting period.
But the MDX is very slow. It takes about 38ms to calculate the measure for each row. I want to use this measure to rank the departments and to calculate a cumulative % and assign scores to these values. As you can imagine performance will not improve.
I have tried to use functions and cache the result, but results - for me - got worse (according to the log 2x as bad).
What can I do to improve the performance?
The fastest option is to add, in the schema definition, a measure that calculates the distinct count on the Supplier ID of the table associated with [Measures].[Amount].
The other approaches do not scale as the number of suppliers grows.
Nonetheless, why did you use DistinctCount instead of Count(NonEmpty())?
DistinctCount is mainly for calculating the number of distinct members/tuples in a set. It only makes sense if the same member can appear twice in the set. As our initial members contain no duplicates, it adds nothing here.
Count(NonEmpty()) filters the set with NonEmpty and counts the number of items left in it. This can easily be calculated in parallel.
I have a SQL question.
I am trying to find the average injection volume per month. Currently my code takes the sum of all days of injection, and divides them by the TOTAL DAYS in the month.
Sum(W1."INJECTION_VOLUME" /
EXTRACT(DAY FROM LAST_DAY(W1."INJECTION_DATE"))) AS "AVGINJ"
This is not what I wanted.
I need to take the injection_volume and divide by the total days in the DATA.
i.e. right now the data has only 8 days of injection volume; let's say it totals 3000.
So right now the SQL computes 3000/31.
I need it to be 3000/8 (the total days in the data for the current month).
Also, this should only be for the current month. All other completed months should be divided by the total days in the month.
Use
SELECT
SUM(W1.INJECTION_VOLUME) / COUNT(DISTINCT MyDateField)
FROM MyTable
WHERE X=Value
This gives you what you're after.
SUM(W1.INJECTION_VOLUME) is the total volume for the dataset.
COUNT(DISTINCT MyDateField) gives you the number of days, no matter how many records you have.
So if you have 100 records but only 5 unique days in this period, this expression gives you 5.
Note that this kind of calc is normally worked out with
SUM(A) / SUM(B)
not
SUM(A/B)
They give you completely different answers.
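A quick way to see the difference is to run both forms on two rows. This is only an illustration using Python's sqlite3 module; the table t and its columns a and b are placeholders.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a REAL, b REAL)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(10.0, 2.0), (30.0, 10.0)])

# SUM(a) / SUM(b) = 40 / 12, whereas SUM(a / b) = 5 + 3
ratio_of_sums, sum_of_ratios = conn.execute(
    "SELECT SUM(a) / SUM(b), SUM(a / b) FROM t"
).fetchone()
print(ratio_of_sums, sum_of_ratios)
```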
In order to get the average of the data for the current month you will need to divide by the number of days that have data in that month (an aggregate cannot be nested inside SUM, so the division goes outside it):
SUM(`W1`.`INJECTION_VOLUME`) / COUNT(DISTINCT `W1`.`INJECTION_DATE`)
To get all other data as the full month you'll need to combine your code:
SUM(`W1`.`INJECTION_VOLUME` / EXTRACT(DAY FROM LAST_DAY(`W1`.`INJECTION_DATE`)))
With an IF, in a query grouped by month. So something like this:
IF(
  EXTRACT(YEAR_MONTH FROM MAX(`W1`.`INJECTION_DATE`)) = EXTRACT(YEAR_MONTH FROM NOW()),
  SUM(`W1`.`INJECTION_VOLUME`) / COUNT(DISTINCT `W1`.`INJECTION_DATE`),
  SUM(`W1`.`INJECTION_VOLUME` / EXTRACT(DAY FROM LAST_DAY(`W1`.`INJECTION_DATE`)))
)
Note: this is untested and I'm not sure about the RDBMS you are using so you may need to change the code slightly to make it work.
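For anyone who wants to experiment with the rule from the question (the current month divided by the days present in the data, completed months divided by the calendar days), here is a rough sketch using Python's sqlite3 module. The table name injections and the sample volumes are invented, SQLite's strftime replaces EXTRACT and LAST_DAY, and "today" is passed in as a fixed parameter so the result is reproducible. It also assumes injection_date holds plain dates with no time part.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE injections (injection_date TEXT, injection_volume REAL)")
# Completed month: 31 days of 100 each (3100 in total)
rows = [("2024-01-%02d" % d, 100.0) for d in range(1, 32)]
# Month in progress: only 8 days of data, 3000 in total
rows += [("2024-02-%02d" % d, 375.0) for d in range(1, 9)]
conn.executemany("INSERT INTO injections VALUES (?, ?)", rows)

today = "2024-02-08"  # stand-in for NOW()

query = """
SELECT strftime('%Y-%m', MAX(injection_date)) AS month,
       SUM(injection_volume) /
       CASE
           -- current month: divide by the days that actually have data
           WHEN strftime('%Y-%m', MAX(injection_date)) = strftime('%Y-%m', :today)
           THEN COUNT(DISTINCT injection_date)
           -- completed month: divide by the number of calendar days
           ELSE CAST(strftime('%d', MAX(injection_date),
                              'start of month', '+1 month', '-1 day') AS INTEGER)
       END AS avg_inj
FROM injections
GROUP BY strftime('%Y-%m', injection_date)
ORDER BY month
"""
result = conn.execute(query, {"today": today}).fetchall()
print(result)
```

The aggregates sit side by side inside the CASE instead of one being nested in the other, which is what most engines require.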
I have a table of data with the following:
User,Platform,Dt,Activity_Flag,Total_Purchases
1,iOS,05/05/2016,1,1
1,Android,05/05/2016,1,2
2,iOS,05/05/2016,1,0
2,Android,05/05/2016,1,2
3,iOS,05/05/2016,1,1
3,Android,06/05/2016,1,3
1,iOS,06/05/2016,1,2
4,Android,06/05/2016,1,2
1,Android,06/05/2016,1,0
3,iOS,07/05/2016,1,2
2,iOS,08/05/2016,1,0
I want to do a GROUPING SETS (Platform,Dt,(Platform,Dt),()) aggregation to be able to find for each combination of Platform and Dt the following:
Total Purchases
Total Unique Users
Average Purchases per User per Day
The first two are simple as these can be achieved via a sum(Total_Purchases) and count(distinct user) respectively.
The problem I have is with the last metric. The result set should look like this but I don't know how to get the last column to be calculated correctly:
Platform,Dt,Total_Purchases,Total_Unique_Users,Average_Purchases_Per_User_Per_Day
Android,05/05/2016,4,2,2.0
iOS,05/05/2016,2,3,0.7
Android,06/05/2016,5,3,1.7
iOS,06/05/2016,2,1,2.0
iOS,07/05/2016,2,1,2.0
iOS,08/05/2016,0,1,0.0
,05/05/2016,6,3,2.0
,06/05/2016,7,3,2.3
,07/05/2016,2,1,2.0
,08/05/2016,0,1,0.0
Android,,9,4,1.8
iOS,,6,3,1.2
,,15,4,1.6
For the first ten rows we see that getting the Average purchase per user per day is a simple division of the first two columns as the dimension in these rows represent a single date only. But when we look at the final 3 rows we see that the division is not the way to achieve the desired result. This is because it needs to take an average for each day in turn to get the overall per day amount.
If this isn't clear please let me know and I'll be happy to explain better. This is my first post on this site!
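One way to read "take an average for each day in turn" is a two-level aggregation: first compute purchases per unique user for every day, then average those daily figures. The sketch below tries this on the sample rows with Python's sqlite3 module. SQLite has no GROUPING SETS, so only the per-platform and overall levels are shown as separate queries; the table name activity and the column name user_id are chosen for the example, and Activity_Flag is left out because it is always 1.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE activity (user_id INTEGER, platform TEXT, dt TEXT, total_purchases INTEGER)"
)
rows = [
    (1, "iOS", "05/05/2016", 1), (1, "Android", "05/05/2016", 2),
    (2, "iOS", "05/05/2016", 0), (2, "Android", "05/05/2016", 2),
    (3, "iOS", "05/05/2016", 1), (3, "Android", "06/05/2016", 3),
    (1, "iOS", "06/05/2016", 2), (4, "Android", "06/05/2016", 2),
    (1, "Android", "06/05/2016", 0), (3, "iOS", "07/05/2016", 2),
    (2, "iOS", "08/05/2016", 0),
]
conn.executemany("INSERT INTO activity VALUES (?, ?, ?, ?)", rows)

# Inner query: purchases per unique user for each platform and day.
# Outer query: average of those daily values per platform.
per_platform = conn.execute("""
    SELECT platform, AVG(per_day)
    FROM (
        SELECT platform, dt,
               SUM(total_purchases) * 1.0 / COUNT(DISTINCT user_id) AS per_day
        FROM activity
        GROUP BY platform, dt
    )
    GROUP BY platform
    ORDER BY platform
""").fetchall()

# Same idea across all platforms: one value per day, then the average.
overall = conn.execute("""
    SELECT AVG(per_day)
    FROM (
        SELECT dt,
               SUM(total_purchases) * 1.0 / COUNT(DISTINCT user_id) AS per_day
        FROM activity
        GROUP BY dt
    )
""").fetchone()[0]

print(per_platform, overall)
```

Rounded to one decimal these match the last three rows of the expected result (1.8, 1.2 and 1.6). In an engine that supports GROUPING SETS, the inner query could feed the grouping-sets query instead.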
I'm currently working on a project in which I want to aggregate data (resolution = 15 minutes) to weekly values.
I have 4 weeks and the view should include a value for each week AND every station.
My dataset includes more than 50 stations.
What I have is this:
select name, avg(parameter1), avg(parameter2)
from data
where week in ('29','30','31','32')
group by name
order by name
But it only displays the avg value of all weeks. What I need is avg values for each week and each station.
Thanks for your help!
The problem is that when you do a 'GROUP BY' on just name you then flatten the weeks and you can only perform aggregate functions on them.
Your best option is to do a GROUP BY on both name and week so something like:
select name, week, avg(parameter1), avg(parameter2)
from data
where week in ('29','30','31','32')
group by name, week
order by name
PS - It's not entirely clear whether you need one set of results for stations and one for weeks, or a set of results for every week at every station (which is what this answer provides). If you require the former then separate queries are the way to go.
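To see the effect of grouping by both columns, here is a small runnable sketch using Python's sqlite3 module. The stations, weeks and parameter values are invented; only the query mirrors the one above, with week added to the ORDER BY so the output order is predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE data (name TEXT, week TEXT, parameter1 REAL, parameter2 REAL)")
rows = [
    ("A", "29", 1.0, 10.0), ("A", "29", 3.0, 30.0), ("A", "30", 5.0, 50.0),
    ("B", "29", 2.0, 20.0), ("B", "30", 4.0, 40.0), ("B", "30", 6.0, 60.0),
]
conn.executemany("INSERT INTO data VALUES (?, ?, ?, ?)", rows)

# One output row per station and week instead of one row per station
result = conn.execute("""
    select name, week, avg(parameter1), avg(parameter2)
    from data
    where week in ('29','30','31','32')
    group by name, week
    order by name, week
""").fetchall()
for row in result:
    print(row)
```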