DAX: obtaining a time series from Start and End Dates

I have a series of events with a Start and End date, with a Value and a series of other attributes.
Country --- Location --- Start-Date --- End-Date --- Value per day
Italy Rome 2018-01-01 2018-03-15 50
Belgium BXL 2017-12-04 2017-12-06 120
Italy Milan 2018-03-17 2018-04-12 80
I want to convert this, in DAX, to a monthly time-series like:
Country --- Location --- Month --- Value per day
Italy Rome 2018-01 50
Italy Rome 2018-02 50
Italy Rome 2018-03 22.58 (= 50 /31*(31-17) days)
The value is a weighted average of industrial capacity.
I have done this with a CROSS JOIN with the Calendar table, but this is quite heavy and requires calculating every possible value, whereas an on-the-fly calculation would likely be faster.
Any help?
Many thanks

A measure along these lines should work:
Total =
VAR DayDiff =
    SUMMARIZE (
        Table1,
        Table1[End-Date],
        "DayDiff", DATEDIFF ( MIN ( Table1[Start-Date] ), MAX ( Table1[End-Date] ), DAY )
    )
RETURN
    SUMX ( DayDiff, [DayDiff] )
You do not have to use Country, Location, and Month as filters in the DAX above, as they will already be available in the filter context you use (e.g. a PivotTable).
Please paste sample rows if this does not work.
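For comparison, the monthly expansion the question describes can be sketched outside DAX, e.g. in pandas (data and column names taken from the question; the per-month value is `Value * overlapping_days / days_in_month`). Note that with an inclusive day count this gives 24.19 for Rome in March rather than the 22.58 in the question; adjust the `+ 1` to match your day-counting convention:

```python
import pandas as pd

df = pd.DataFrame({
    "Country": ["Italy", "Belgium", "Italy"],
    "Location": ["Rome", "BXL", "Milan"],
    "Start-Date": pd.to_datetime(["2018-01-01", "2017-12-04", "2018-03-17"]),
    "End-Date": pd.to_datetime(["2018-03-15", "2017-12-06", "2018-04-12"]),
    "Value": [50, 120, 80],
})

rows = []
for _, r in df.iterrows():
    # one output row per month the event touches
    for month_start in pd.date_range(r["Start-Date"].replace(day=1),
                                     r["End-Date"], freq="MS"):
        month_end = month_start + pd.offsets.MonthEnd(0)
        # days of the event falling inside this month (inclusive of both ends)
        overlap = (min(r["End-Date"], month_end)
                   - max(r["Start-Date"], month_start)).days + 1
        rows.append({
            "Country": r["Country"],
            "Location": r["Location"],
            "Month": month_start.strftime("%Y-%m"),
            "Value per day": round(r["Value"] * overlap / month_end.day, 2),
        })

monthly = pd.DataFrame(rows)
print(monthly)
```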

Related

What logic should be used to label customers (monthly) based on the categories they bought more often in the preceding 4 calendar months?

I have a table that looks like this:
user | type     | quantity | order_id | purchase_date
john | travel   | 10       | 1        | 2022-01-10
john | travel   | 15       | 2        | 2022-01-15
john | books    | 4        | 3        | 2022-01-16
john | music    | 20       | 4        | 2022-02-01
john | travel   | 90       | 5        | 2022-02-15
john | clothing | 200      | 6        | 2022-03-11
john | travel   | 70       | 7        | 2022-04-13
john | clothing | 70       | 8        | 2022-05-01
john | travel   | 200      | 9        | 2022-06-15
john | tickets  | 10       | 10       | 2022-07-01
john | services | 20       | 11       | 2022-07-15
john | services | 90       | 12       | 2022-07-22
john | travel   | 10       | 13       | 2022-07-29
john | services | 25       | 14       | 2022-08-01
john | clothing | 3        | 15       | 2022-08-15
john | music    | 5        | 16       | 2022-08-17
john | music    | 40       | 18       | 2022-10-01
john | music    | 30       | 19       | 2022-11-05
john | services | 2        | 20       | 2022-11-19
where I have many different users and multiple types, with purchases made daily.
I want to end up with a table of this format:
user | label           | month
john | travel          | 2022-01-01
john | travel          | 2022-02-01
john | clothing        | 2022-03-01
john | travel-clothing | 2022-04-01
john | travel-clothing | 2022-05-01
john | travel-clothing | 2022-06-01
john | travel          | 2022-07-01
john | travel          | 2022-08-01
john | services        | 2022-10-01
john | music           | 2022-11-01
where the label would record the most popular type (based on % of quantity sold) for each user in a timeframe of the last 4 months (including the current month). So for instance, for March 2022 john ordered 200/339 clothing (Jan to and including Mar) so his label is clothing. But for months where two types are almost even I'd want to use a double label like for April (185 travel 200 clothing out of 409). In terms of rules this is not set in stone yet but it's something like, if two types are around even (e.g. >40%) then use both types in the label column; if three types are around even (e.g. around 30% each) use three types as label; if one label is 40% but the rest is made up of many small % keep the first label; and of course where one is clearly a majority use that. One other tricky bit is that there might be missing months for a user.
I think regarding the rules I need to just compare the % of each type, but I don't know how to retrieve the type as a label afterwards. In general, I don't have the SQL/BigQuery logic very clearly in my head. I have done some things, but nothing that comes close to the target table.
Broken down in steps, I think I need 3 things:
group by user, type, month and get the partial and total count (I have done this)
then retrieve the counts for the past 4 months (have done something but it's not exactly accurate yet)
compare the ratios and make the label column
I'm not very clear on the SQL/BigQuery logic here, so please advise me on the correct steps to achieve the above. I'm working on BigQuery, but general SQL logic will also help.
Consider the approach below. It looks a little bit messy and has room for optimization, but I hope it gives you an idea or a direction to address your problem.
WITH aggregation AS (
  SELECT
    user, type, DATE_TRUNC(purchase_date, MONTH) AS month, month_no,
    SUM(quantity) AS net_qty,
    SUM(SUM(quantity)) OVER w1 AS rolling_qty
  FROM sample_table,
    UNNEST([EXTRACT(YEAR FROM purchase_date) * 12 + EXTRACT(MONTH FROM purchase_date)]) month_no
  GROUP BY 1, 2, 3, 4
  WINDOW w1 AS (
    PARTITION BY user ORDER BY month_no RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
  )
),
rolling AS (
  SELECT user, month, ARRAY_AGG(STRUCT(type, net_qty)) OVER w2 AS agg, rolling_qty
  FROM aggregation
  QUALIFY ROW_NUMBER() OVER (PARTITION BY user, month) = 1
  WINDOW w2 AS (
    PARTITION BY user ORDER BY month_no RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
  )
)
SELECT user, month, ARRAY_TO_STRING(ARRAY(
  SELECT type FROM (
    SELECT type, SUM(net_qty) / SUM(SUM(net_qty)) OVER () AS pct
    FROM r.agg
    GROUP BY 1
  )
  QUALIFY IFNULL(FIRST_VALUE(pct) OVER (ORDER BY pct DESC) - pct, 0) < 0.10 -- set threshold to 0.1
), '-') AS label
FROM rolling r
ORDER BY month;
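The same rolling-window labeling can be sketched in pandas as a cross-check (data is the first three months from the question; the tie rule here is the simplified "every type within 10 points of the leader" threshold from the answer above, not the full rule set the question describes):

```python
import pandas as pd

df = pd.DataFrame({
    "user": ["john"] * 6,
    "type": ["travel", "travel", "books", "music", "travel", "clothing"],
    "quantity": [10, 15, 4, 20, 90, 200],
    "purchase_date": pd.to_datetime([
        "2022-01-10", "2022-01-15", "2022-01-16",
        "2022-02-01", "2022-02-15", "2022-03-11"]),
})

# a running month number makes "last 4 calendar months" a simple range check
df["month_no"] = df["purchase_date"].dt.year * 12 + df["purchase_date"].dt.month
monthly = df.groupby(["user", "type", "month_no"], as_index=False)["quantity"].sum()

labels = []
for (user, m), _ in monthly.groupby(["user", "month_no"]):
    # last 4 calendar months, inclusive of the current one
    window = monthly[(monthly["user"] == user)
                     & monthly["month_no"].between(m - 3, m)]
    pct = window.groupby("type")["quantity"].sum()
    pct = pct / pct.sum()
    top = pct.max()
    # keep every type within 10 points of the leader, largest share first
    label = "-".join(pct[pct > top - 0.10].sort_values(ascending=False).index)
    labels.append({"user": user, "month_no": m, "label": label})

print(pd.DataFrame(labels))
```

For this subset it yields travel for January and February and clothing for March, matching the target table.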

Quicksight Calculated field: sum of average?

The dataset I have is currently like so:
country | itemid | device   | num_purchases | total_views_per_country_and_day | day
USA     | ABC    | iPhone11 | 2             | 900                             | 2022-06-15
USA     | ABC    | iPhoneX  | 5             | 900                             | 2022-06-15
USA     | DEF    | iPhoneX  | 8             | 900                             | 2022-06-15
UK      | ABC    | iPhone11 | 10            | 350                             | 2022-06-15
UK      | DEF    | iPhone11 | 20            | 350                             | 2022-06-15
total_views_per_country_and_day is already pre-calculated to be the sum grouped by country and day. That is why for each country-day pair, the number is the same.
I have a Quicksight analysis with a filter for day.
The first thing I want is to have a table on my dashboard that shows the number of total views for each country.
However, if I were to do it with the dataset just like that, the table would sum everything:
country | total_views
USA     | 900+900+900 = 2700
UK      | 350+350 = 700
So what I did was create a calculated field that is the average of total_views. This worked, but only if my day filter on the dashboard was for ONE day.
When filtered for day = 2022-06-15: correct
country | avg(total_views)
USA     | 2700/3 = 900
UK      | 700/2 = 350
But let's say we have data from 2022-06-16 as well, the averaging method doesn't work, because it will average based on the entire dataset. So, example dataset with two days:
country | itemid | device   | num_purchases | total_views_per_country_and_day | day
USA     | ABC    | iPhone11 | 2             | 900                             | 2022-06-15
USA     | ABC    | iPhoneX  | 5             | 900                             | 2022-06-15
USA     | DEF    | iPhoneX  | 8             | 900                             | 2022-06-15
UK      | ABC    | iPhone11 | 10            | 350                             | 2022-06-15
UK      | DEF    | iPhone11 | 20            | 350                             | 2022-06-15
USA     | ABC    | iPhone11 | 2             | 1000                            | 2022-06-16
USA     | ABC    | iPhoneX  | 5             | 1000                            | 2022-06-16
UK      | ABC    | iPhone11 | 10            | 500                             | 2022-06-16
UK      | DEF    | iPhone11 | 20            | 500                             | 2022-06-16
Desired Table Visualization:
country | total_views
USA     | 900 + 1000 = 1900
UK      | 350 + 500 = 850
USA calculation: (900 * 3)/3 + (1000 * 2) /2 = 900 + 1000
UK calculation: (350 * 2) /2 + (500 * 2) /2 = 350 + 500
Basically---a sum of averages.
However, instead it is calculated like:
country | avg(total_views)
USA     | [(900 * 3) + (1000 * 2)] / 5 = 940
UK      | [(350 * 2) + (500 * 2)] / 4 = 425
I want to be able to use this calculation later on as well to calculate num_purchases / total_views. So ideally I would want it to be a calculated field. Is there a formula that can do this?
I also tried, instead of calculated field, just aggregating total_views by average instead of sum in the analysis -- exact same issue, but I could actually keep a running total if I include day in the table visualization. E.G.
country | day        | running total of avg(total_views)
USA     | 2022-06-15 | 900
USA     | 2022-06-16 | 900+1000 = 1900
UK      | 2022-06-15 | 350
UK      | 2022-06-16 | 350+500 = 850
So you can see that the total (2nd and 4th row) is my desired value. However, this is not exactly what I want... I don't want to have to add the day into the table to get it right.
I've tried avgOver with day as a partition, that also requires you to have day in the table visualization.
sum({total_views_per_country_and_day}) / distinct_count( {day})
Basically, your average is calculated as the sum of the metric divided by the number of unique days. The above should help.
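As a sanity check outside QuickSight (in pandas): the desired sum-of-per-day-values can be reproduced by deduplicating to one row per country/day before summing. Note that the sum ÷ distinct-days formula matches the target only when every day contributes the same number of rows per country; in the two-day example USA has 3 rows on 06-15 and 2 on 06-16, so the two calculations differ there:

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["USA"] * 3 + ["UK"] * 2 + ["USA"] * 2 + ["UK"] * 2,
    "total_views_per_country_and_day":
        [900, 900, 900, 350, 350, 1000, 1000, 500, 500],
    "day": ["2022-06-15"] * 5 + ["2022-06-16"] * 4,
})

# target: one value per country/day, then sum across days
target = (df.drop_duplicates(["country", "day"])
            .groupby("country")["total_views_per_country_and_day"].sum())
print(target)   # UK 850, USA 1900

# the sum / distinct-days formula from the answer
formula = df.groupby("country").apply(
    lambda g: g["total_views_per_country_and_day"].sum() / g["day"].nunique())
print(formula)  # UK 850.0, USA 2350.0 -- differs for USA (uneven row counts per day)
```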

How can I convert monthly data to yearly data in pandas dataframe?

My data looks like that:
data_dte | Year | Month | usg_apt | Total
01/1990  | 1990 | 1     | JFK     | 80
01/1990  | 1990 | 1     | MIA     | 100
01/1990  | 1990 | 1     | ORD     | 58
I want to have a yearly total for each "usg_apt" instead of monthly.
"usg_apt" stands for "US Gateway Airport Code".
Assuming that the dataframe in question is called df, maybe you could try df.groupby(["usg_apt", "Year"])['Total'].agg('sum').
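A minimal runnable version of that suggestion (rows from the question, plus one hypothetical February row added here to show the yearly roll-up actually combining months):

```python
import pandas as pd

df = pd.DataFrame({
    "data_dte": ["01/1990", "01/1990", "01/1990", "02/1990"],
    "Year": [1990, 1990, 1990, 1990],
    "Month": [1, 1, 1, 2],
    "usg_apt": ["JFK", "MIA", "ORD", "JFK"],  # 02/1990 JFK row is illustrative
    "Total": [80, 100, 58, 70],
})

# one Total per airport per year
yearly = df.groupby(["usg_apt", "Year"])["Total"].sum().reset_index()
print(yearly)  # JFK 1990 -> 150, MIA 1990 -> 100, ORD 1990 -> 58
```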

How to measure an average count from a set of days each with their own data points, in SQL/LookerML

I have the following table:
id | decided_at | reviewer
1 2020-08-10 13:00 john
2 2020-08-10 14:00 john
3 2020-08-10 16:00 john
4 2020-08-12 14:00 jane
5 2020-08-12 17:00 jane
6 2020-08-12 17:50 jane
7 2020-08-12 19:00 jane
What I would like to do is get the difference between the min and max decided_at for each day, and the total count of ids from the min through the max. Currently, I'm only able to get this data for the past day.
Desired output:
Date | Time(h) | Count | reviewer
2020-08-10 3 3 john
2020-08-12 5 4 jane
From this, I would like to get the average of this data over the past x number of days.
Example:
If today was the 13th, filter on the past 2 days (48 hours)
Output:
reviewer | reviews/hour
jane 5/4 = 1.25
Example 2:
If today was the 13th, filter on the past 3 days (72 hours)
reviewer | reviews/hour
john 3/3 = 1
jane 5/4 = 1.25
Ideally, if this is possible in LookML without the use of a derived table, it would be nicest to have that. Otherwise, a solution in SQL would be great and I can try to convert to LookerML.
Thanks!
In SQL, one solution is to use two levels of aggregation:
select reviewer, sum(cnt) / sum(time_h) review_per_hour
from (
    select
        reviewer,
        date(decided_at) decided_date,
        count(*) cnt,
        timestampdiff(hour, min(decided_at), max(decided_at)) time_h
    from mytable
    where decided_at >= current_date - interval 2 day
    group by reviewer, date(decided_at)
) t
group by reviewer
The subquery filters on the date range, aggregates by reviewer and day, and computes the number of records and the difference between the minimum and the maximum date, as hours. Then, the outer query aggregates by reviewer and does the final computation.
The actual function to compute the date difference varies across databases; timestampdiff() is supported in MySQL - other engines all have alternatives.

Dependent and bound filters in a pivot table

In Sheet1 I have raw data that looks something like this:
event_name | country | event_datetime | event_day | vertical | some_metric
fun day 2018 | uk | 1/1/2018 22:00 | 1/1/2018 | something | 100
fun day 2018 | uk | 1/1/2018 23:00 | 1/1/2018 | something | 200
fun day 2018 | uk | 2/1/2018 00:00 | 2/1/2018 | something | 300
fun day 2017 | uk | 1/1/2017 22:00 | 1/1/2017 | something | 400
fun day 2017 | uk | 1/1/2017 23:00 | 1/1/2017 | something | 500
fun day 2017 | uk | 2/1/2017 00:00 | 2/1/2017 | something | 600
event_datetime is rounded to the nearest hour. events can run across multiple days.
In Sheet2 I create a pivot table using all this data. The filters are event_name, country, event_datetime; the first column is vertical and the values is sum(some_metric).
1. Is there a way of making the dates that show up dependent on the event_name selected? e.g. if I select fun day 2018, I just want the dates that correspond to this event to show up in the event_datetime filter dropdown (i.e. for fun day 2018 only 1/1/2018 and 2/1/2018 with all the corresponding times should come up). At the moment, all the dates show up for any event.
2. Is there a way to "group" the event_datetime's so that if an event is say 36 hours, I can select 24 hour / 30 hour views for that event? e.g. for fun day 2018 the 24 hour view would be anything with date 1/1/2018, 30 hours would be anything with date 1/1/2018 and the first 6 hours of 2/1/2018.
I am using Microsoft Excel on Mac Version 16.14.1. If there is a structure change I can make in the raw data itself which would enable 1/2 above to be simpler in the pivot please let me know and I can edit the SQL generating this data.
Is there a way of making the dates that show up [in the filter dropdown] dependent on the event_name selected? No. But try using a Slicer...they can give you the kind of effect you're after, where things that are 'selectable' show up at the top of the Slicer. You may need to play around with the settings.
Is there a way to "group" the event_datetime [by time elapsed]? Yes, but it will require you to add another column to your data that shows time elapsed, which will then let you set a Less Than filter on that field. But because this only works for rowfields, you'll need to use a workaround using a second hidden PivotTable and a slicer to get it to work. I have an answer somewhere on this site that I'll try to find and link to later today.
Given your requirements, I think you should just use a simple Table (and not a PivotTable) to filter and display this data. Hopefully the mac version you have allows you to put slicers on a Table, like the most recent versions of Excel do.
You can handle the time grouping requirement by adding a couple of new columns and a couple of parameter cells like below:
The orange cells have been assigned the named ranges of DayOfInterest and HoursWithin respectively
Time Elapsed: =[@[event_datetime]]-DayOfInterest
Time Groupings: =AND([@[Time Elapsed]]>0,[@[Time Elapsed]]<HoursWithin/24)
This lets you set the parameter in the Hours Within to anything you want (24, 36, 48) and this then causes the Time Groupings formula to show True/False, which you can then filter on.
With a bit of VBA you can also make it more dynamic...i.e. you can have the filter apply automatically whenever someone changes the orange Input cells. You just need to add this VBA to the Sheet Module corresponding to the sheet this is in.
Option Explicit
Private Sub Worksheet_Change(ByVal Target As Range)
If Not Intersect(Target, [DayOfInterest]) Is Nothing _
Or Not Intersect(Target, [HoursWithin]) Is Nothing Then
Application.ScreenUpdating = False
With ActiveWorkbook.SlicerCaches("Slicer_Time_Groupings")
.ClearAllFilters
On Error Resume Next
.SlicerItems("TRUE").Selected = True
.SlicerItems("FALSE").Selected = False
On Error GoTo 0
End With
Application.ScreenUpdating = True
End If
End Sub