What logic should be used to label customers (monthly) based on the categories they bought most often in the preceding 4 calendar months?

I have a table that looks like this:
user | type     | quantity | order_id | purchase_date
john | travel   | 10       | 1        | 2022-01-10
john | travel   | 15       | 2        | 2022-01-15
john | books    | 4        | 3        | 2022-01-16
john | music    | 20       | 4        | 2022-02-01
john | travel   | 90       | 5        | 2022-02-15
john | clothing | 200      | 6        | 2022-03-11
john | travel   | 70       | 7        | 2022-04-13
john | clothing | 70       | 8        | 2022-05-01
john | travel   | 200      | 9        | 2022-06-15
john | tickets  | 10       | 10       | 2022-07-01
john | services | 20       | 11       | 2022-07-15
john | services | 90       | 12       | 2022-07-22
john | travel   | 10       | 13       | 2022-07-29
john | services | 25       | 14       | 2022-08-01
john | clothing | 3        | 15       | 2022-08-15
john | music    | 5        | 16       | 2022-08-17
john | music    | 40       | 18       | 2022-10-01
john | music    | 30       | 19       | 2022-11-05
john | services | 2        | 20       | 2022-11-19
where I have many different users and multiple types, with purchases made daily.
I want to end up with a table in this format:
user | label           | month
john | travel          | 2022-01-01
john | travel          | 2022-02-01
john | clothing        | 2022-03-01
john | travel-clothing | 2022-04-01
john | travel-clothing | 2022-05-01
john | travel-clothing | 2022-06-01
john | travel          | 2022-07-01
john | travel          | 2022-08-01
john | services        | 2022-10-01
john | music           | 2022-11-01
where the label records the most popular type (based on % of quantity sold) for each user over a window of the last 4 months, including the current month. For instance, for March 2022 john ordered 200/339 clothing (January through and including March), so his label is clothing. But for months where two types are almost even I'd want to use a double label, like for April (185 travel and 200 clothing out of 409). The rules are not set in stone yet, but roughly:
- if two types are around even (e.g. each >40%), use both types in the label column;
- if three types are around even (e.g. around 30% each), use all three as the label;
- if one type is at 40% and the rest is made up of many small percentages, keep just that type;
- and where one type is clearly the majority, use it alone.
One other tricky bit is that there might be missing months for a user.
Regarding the rules, I think I just need to compare the % of each type, but I don't know how to retrieve the type as the label afterwards. In general, I don't have the SQL/BigQuery logic very clear in my head. I have done some things, but nothing that comes close to the target table.
Broken down into steps, I think I need 3 things:
1. group by user, type and month, and get the partial and total count (I have done this; a rough sketch is below);
2. retrieve the counts for the past 4 months (I have done something, but it's not exactly accurate yet);
3. compare the ratios and build the label column.
I'm not very clear on the SQL/BigQuery logic here, so please advise me on the correct steps to achieve the above. I'm working in BigQuery, but generic SQL logic will also help.
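For step 1, what I have looks roughly like this (only a sketch; sample_table stands in for the real table name):

-- Quantity per user, type and calendar month, plus each month's total per user.
SELECT
  user,
  type,
  month,
  type_qty,
  SUM(type_qty) OVER (PARTITION BY user, month) AS month_total_qty
FROM (
  SELECT
    user,
    type,
    DATE_TRUNC(purchase_date, MONTH) AS month,
    SUM(quantity) AS type_qty
  FROM sample_table
  GROUP BY user, type, month
);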

Consider the approach below. It looks a little bit messy and has room for optimization, but I hope it gives you an idea or a direction for addressing your problem. The final QUALIFY keeps every type whose share of the rolling 4-month quantity is within 10 percentage points of the top type's share, which is what produces the multi-part labels like travel-clothing.
WITH aggregation AS (
  -- Monthly quantity per user and type, plus a rolling 4-month total per user.
  SELECT user, type, DATE_TRUNC(purchase_date, MONTH) AS month, month_no,
         SUM(quantity) AS net_qty,
         SUM(SUM(quantity)) OVER w1 AS rolling_qty
  FROM sample_table,
       UNNEST([EXTRACT(YEAR FROM purchase_date) * 12 + EXTRACT(MONTH FROM purchase_date)]) month_no
  GROUP BY 1, 2, 3, 4
  WINDOW w1 AS (
    PARTITION BY user ORDER BY month_no RANGE BETWEEN 3 PRECEDING AND CURRENT ROW
  )
),
rolling AS (
  -- One row per user and month, carrying every (type, net_qty) pair from the 4-month window.
  SELECT user, month, ARRAY_AGG(STRUCT(type, net_qty)) OVER w2 AS agg, rolling_qty
  FROM aggregation
  QUALIFY ROW_NUMBER() OVER (PARTITION BY user, month) = 1
  WINDOW w2 AS (PARTITION BY user ORDER BY month_no RANGE BETWEEN 3 PRECEDING AND CURRENT ROW)
)
SELECT user, month, ARRAY_TO_STRING(ARRAY(
  SELECT type FROM (
    SELECT type, SUM(net_qty) / SUM(SUM(net_qty)) OVER () AS pct
    FROM r.agg GROUP BY 1
  )
  QUALIFY IFNULL(FIRST_VALUE(pct) OVER (ORDER BY pct DESC) - pct, 0) < 0.10 -- set threshold to 0.1
), '-') AS label
FROM rolling r
ORDER BY month;
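For intuition, the same label expression applied to a hand-made April 2022 window from the question (185 travel, 200 clothing, 4 books, 20 music, 409 in total) keeps both clothing and travel, since both shares are within 10 percentage points of the top one:

-- Toy check of the threshold rule on one window's (type, net_qty) pairs.
SELECT ARRAY_TO_STRING(ARRAY(
  SELECT type FROM (
    SELECT type, SUM(net_qty) / SUM(SUM(net_qty)) OVER () AS pct
    FROM UNNEST([
      STRUCT('travel' AS type, 185 AS net_qty),
      ('clothing', 200),
      ('books', 4),
      ('music', 20)
    ])
    GROUP BY 1
  )
  QUALIFY IFNULL(FIRST_VALUE(pct) OVER (ORDER BY pct DESC) - pct, 0) < 0.10
), '-') AS label;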

Related

Postgres rank() without duplicates

I'm ranking race data for a series of cycling events. Racers win various amounts of points for their position in races. I want to retain the discrete event scoring, but also rank the racers in the series. For example, consider a sub-query that returns this:
License # | Rider Name | Total Points | Race Points | Race ID
123       | Joe        | 25           | 5           | 567
123       | Joe        | 25           | 12          | 234
123       | Joe        | 25           | 8           | 987
456       | Ahmed      | 20           | 12          | 567
456       | Ahmed      | 20           | 8           | 234
You can see Joe has 25 points, as he won 5, 12, and 8 points in three races. Ahmed has 20 points, as he won 12 and 8 points in two races.
Now for the ranking, what I'd like is:
Place | License # | Rider Name | Total Points | Race Points | Race ID
1     | 123       | Joe        | 25           | 5           | 567
1     | 123       | Joe        | 25           | 12          | 234
1     | 123       | Joe        | 25           | 8           | 987
2     | 456       | Ahmed      | 20           | 12          | 567
2     | 456       | Ahmed      | 20           | 8           | 234
But if I use rank() and order by "Total Points", I get:
Place | License # | Rider Name | Total Points | Race Points | Race ID
1     | 123       | Joe        | 25           | 5           | 567
1     | 123       | Joe        | 25           | 12          | 234
1     | 123       | Joe        | 25           | 8           | 987
4     | 456       | Ahmed      | 20           | 12          | 567
4     | 456       | Ahmed      | 20           | 8           | 234
Which makes sense, since there are three "ties" at 25 points.
dense_rank() solves this problem, but if there are legitimate ties across different racers, I want there to be gaps in the rank (e.g. if Joe and Ahmed both had 25 points, the next racer would be in third place, not second).
The easiest way to solve this, I think, would be to issue two queries: one with the "duplicate" racers eliminated, and then a second one where I can retain the individual race data, which I need for the points breakdown display.
I can also probably, given enough effort, think of a way to do this in a single query, but I'm wondering if I'm not just missing something really obvious that could accomplish this in a single, relatively simple query.
Any suggestions?
You have to break this into steps to get what you want, but that can be done in a single query with common table expressions:
with riders as ( -- get individual riders
  select distinct license, rider, total_points
  from racists
), places as ( -- calculate non-dense rankings
  select license, rider, rank() over (order by total_points desc) as place
  from riders
)
select p.place, r.* -- join rankings into main table
from places p
join racists r on (r.license, r.rider) = (p.license, p.rider);
db<>fiddle here

How to measure an average count from a set of days each with their own data points, in SQL/LookerML

I have the following table:
id | decided_at | reviewer
1 2020-08-10 13:00 john
2 2020-08-10 14:00 john
3 2020-08-10 16:00 john
4 2020-08-12 14:00 jane
5 2020-08-12 17:00 jane
6 2020-08-12 17:50 jane
7 2020-08-12 19:00 jane
What I would like to do is get the difference between the min and max for each day, and the total count of ids from the min through the max (the min, everything in the range between min and max, and the max). Currently, I'm only able to get this data for the past day.
Desired output:
Date | Time(h) | Count | reviewer
2020-08-10 3 3 john
2020-08-12 5 4 jane
From this, I would like to get the average of this data over the past x number of days.
Example:
If today was the 13th, filter on the past 2 days (48 hours)
Output:
reviewer | reviews/hour
jane 5/4 = 1.25
Example 2:
If today was the 13th, filter on the past 3 days (72 hours)
reviewer | reviews/hour
john 3/3 = 1
jane 5/4 = 1.25
Ideally, if this is possible in LookML without the use of a derived table, it would be nicest to have that. Otherwise, a solution in SQL would be great and I can try to convert it to LookML.
Thanks!
In SQL, one solution is to use two levels of aggregation:
select reviewer, sum(cnt) / sum(diff_h) as review_per_hour
from (
    select
        reviewer,
        date(decided_at) as decided_date,
        count(*) as cnt,
        timestampdiff(hour, min(decided_at), max(decided_at)) as diff_h
    from mytable
    where decided_at >= current_date - interval 2 day
    group by reviewer, date(decided_at)
) t
group by reviewer
The subquery filters on the date range, aggregates by reviewer and day, and computes the number of records and the difference between the minimum and the maximum date, as hours. Then, the outer query aggregates by reviewer and does the final computation.
The actual function to compute the date difference varies across databases; timestampdiff() is supported in MySQL - other engines all have alternatives.
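For instance, a rough SQL Server translation of the subquery would be (a sketch, assuming the same mytable):

-- DATEDIFF replaces TIMESTAMPDIFF and CAST(... AS date) replaces DATE();
-- the outer aggregation stays the same.
select
    reviewer,
    cast(decided_at as date) as decided_date,
    count(*) as cnt,
    datediff(hour, min(decided_at), max(decided_at)) as diff_h
from mytable
where decided_at >= dateadd(day, -2, cast(getdate() as date))
group by reviewer, cast(decided_at as date);

In PostgreSQL the difference can be written as extract(epoch from (max(decided_at) - min(decided_at))) / 3600, and in BigQuery as timestamp_diff(max(decided_at), min(decided_at), hour).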

Function to get rolling average with lowest 2 values eliminated?

This is my sample data, with the Current_rating column being my desired output.
Date       | Name   | Subject          | Importance | Location | Time  | Rating | Current_rating
12/08/2020 | David  | Work             | 1          | London   | -     | -      | 4
1/08/2020  | David  | Work             | 3          | London   | 23.50 | 4      | 3.66
2/10/2019  | David  | Emails           | 3          | New York | 18.20 | 3      | 4.33
2/08/2019  | David  | Emails           | 3          | Paris    | 18.58 | 4      | 4
11/07/2019 | David  | Work             | 1          | London   | -     | 3      | 4
1/06/2019  | David  | Work             | 3          | London   | 23.50 | 4      | 4
2/04/2019  | David  | Emails           | 3          | New York | 18.20 | 3      | 5
2/03/2019  | David  | Emails           | 3          | Paris    | 18.58 | 5      | -
12/08/2020 | George | Updates          | 2          | New York | -     | -      | 2
1/08/2019  | George | New Appointments | 5          | London   | 55.10 | 2      | -
I need to use a function to get the values in the Current_rating column. Current_rating takes the previous 5 results from the Rating column for each name, eliminates the lowest 2 results, then averages the remaining 3. Some names may not have 5 results: if there are 3 or fewer results I just need the average of those, and if there are 4 results I need to eliminate the lowest value and average the remaining 3. To pick the right 5 previous results it will also need to be sorted by date. Is this possible? Thanks for your time in advance.
What a pain! I think the simplest method might be to use arrays and then unnest() and aggregate:
select t.*, r.current_rating
from (select t.*,
             array_agg(rating) over (partition by name
                                     order by date
                                     rows between 4 preceding and current row
                                    ) as rating_5
      from t
     ) t cross join lateral
     (select avg(r) as current_rating
      from (select u.*
            from unnest(t.rating_5) with ordinality u(r, n)
            where r is not null
            order by r desc
            limit 3
           ) r
     ) r
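To try this out, a throwaway Postgres table with just the columns the query needs (names and types are assumptions) can be filled from a few rows of the sample data:

-- Minimal table for the query above; the null rating mirrors the '-' entries.
create table t (name text, date date, rating int);

insert into t (name, date, rating) values
    ('David',  '2019-03-02', 5),
    ('David',  '2019-04-02', 3),
    ('David',  '2019-06-01', 4),
    ('David',  '2020-08-12', null),
    ('George', '2019-08-01', 2);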

Group overlapping differing date-ranges with different start- and enddate in SQL

I need help with grouping overlapping date ranges in Microsoft SQL Server 18.1. A sample of the data looks like this. I need to be able to group based on ID, Name, StartDate and EndDate. This should be done so that if one of ID 1's date ranges overlaps with, or is less than 7 days apart from, the next of ID 1's date ranges, they are assigned the same grouping ID. If the time gap between two lines is bigger than 7 days, they should be separated into two groupings.
The data is characterized by differing StartDate and EndDate values for almost all of the rows, so it cannot simply be grouped by StartDate and EndDate. Instead the goal is to group all rows for one person whose date ranges overlap or are less than 7 days apart, and show the StartDate and EndDate for the full period by using MIN and MAX, as shown in the "desired output window". Each person in the dataset can have up to 200 rows of differing date ranges that need to be grouped whenever they overlap with that person's other date ranges.
I suppose the solution requires running through all rows until every row is grouped based on ID, Name and overlapping date ranges. The point is that match-grouping periods 1-4 first requires period 1 to be matched with period 2, then periods 1-2 to be matched with period 3, and periods 1-3 with period 4. It can happen that period 1's date range (e.g. 01-01-2019 - 30-05-2019) ends later than period 2's (e.g. 05-02-2019 - 24-04-2019); in that case the comparison of periods 1-2 against period 3 should use the MAX EndDate of periods 1-2, i.e. 30-05-2019 in this case.
(Illustration omitted: Period 1 through Period 4 drawn as overlapping date ranges.)
I need help writing code that gets me from step 0 (raw data) to step 1 (grouping by overlapping date ranges that are less than 7 days apart). I have tried CASE, LAG, LEAD, PARTITION BY and a few different kinds of loops, but haven't found a way to solve the problem.
Step 0 - Raw Data:
ID Name StartDate EndDate
1 Peter Hanson 01-01-2018 15-02-2019
1 Peter Hanson 05-01-2019 23-02-2019
1 Peter Hanson 30-02-2019 18-04-2019
2 Eric Schmidt 05-01-2019 18-03-2019
2 Eric Schmidt 07-01-2019 25-05-2019
3 Martin Boyle 08-03-2018 12-01-2019
3 Martin Boyle 15-01-2019 17-04-2019
3 Martin Boyle 18-04-2019 12-05-2019
3 Martin Boyle 29-04-2019 31-09-2019
Step 1- Grouping by overlapping date-ranges (less than 7 days apart):
ID Name StartDate EndDate Grouping
1 Peter Hanson 01-01-2018 15-02-2019 1
1 Peter Hanson 05-01-2019 23-02-2019 1
1 Peter Hanson 30-02-2019 18-04-2019 2
2 Eric Schmidt 05-01-2019 18-03-2019 3
2 Eric Schmidt 07-01-2019 25-05-2019 3
3 Martin Boyle 08-03-2018 12-01-2019 4
3 Martin Boyle 23-01-2019 17-04-2019 5
3 Martin Boyle 18-04-2019 12-05-2019 5
3 Martin Boyle 29-04-2019 31-09-2019 5
Step 2 - Desired Output window:
ID Name StartDate EndDate Grouping
1 Peter Hanson 01-01-2019 23-02-2019 1
1 Peter Hanson 30-02-2019 18-04-2019 2
2 Eric Schmidt 05-01-2019 25-05-2019 3
3 Martin Boyle 08-03-2018 12-01-2019 4
3 Martin Boyle 23-01-2019 31-09-2019 5
I hope that somebody can help with this task.
You want to identify where a group starts. Based on your description, you can use lag() -- although if total overlaps are allowed, then a cumulative max() is more appropriate.
Then the groups are the cumulative sum of the starts, and the rest is aggregation:
select id, name, min(startdate), max(enddate),
       dense_rank() over (order by id, min(startdate)) as grouping
from (select t.*,
             sum(case when prev_enddate >= dateadd(day, -7, startdate)
                      then 0 else 1
                 end) over (partition by id order by startdate) as grp
      from (select t.*,
                   lag(enddate) over (partition by id order by startdate) as prev_enddate
            from t
           ) t
     ) t
group by id, name, grp;
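If one range can be completely contained inside an earlier one (the "total overlaps" case mentioned above), the innermost subquery can compare against the latest end date seen so far instead of the previous row's end date. A sketch of that variant, against the same table:

-- Cumulative max of the end dates up to (but excluding) the current row.
select t.*,
       max(enddate) over (partition by id
                          order by startdate
                          rows between unbounded preceding and 1 preceding) as prev_enddate
from t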

SQL Query for a table

I’m looking for a little assistance. I have a table called equipment. One row is an order of some type of equipment.
Here are the fields:
num_id | date       | player_id | order_id | active  | jersey  | comment
BIGINT | DATE       | BIGINT    | BIGINT   | CHAR(1) | CHAR(3) | VARCHAR(1024)
11     | 2018-01-01 | 123       | 1        | Y       | XL      |
11     | 2018-01-01 | 123       | 2        | Y       | M       | Purple
11     | 2018-01-01 | 123       | 3        | Y       | L       | White, Red
13     | 2018-01-11 | 456       | 1        | N       | S       | Yellow, Light Blue
14     | 2018-02-01 | 789       | 1        | Y       | M       | Orange, Black
15     | 2018-02-02 | 101       | 1        | Y       | XL      | Shield
15     | 2018-02-02 | 101       | 2        | Y       | XL      | Light Green, Grey
I need to write a query that shows one row for each month with the columns
Month
Total Orders
Total Products ordered
And one extra column for a total count of each size sold.
Is this easy? Any help would be appreciated.
EDIT: To answer people's questions below, SQL Server is the DBMS. My apologies. Also, I am struggling because I don't know how to get the month from a date. And adding the column for size counts has me baffled, but I haven't fully investigated that portion. I feel like I've done the rest individually, just never in one succinct query.
It looks weird here and I don't know how to add a table to Stack Overflow, so I'll try to make it a little more visually appealing:
The end goal, I think, would be like this:
Month    | Total Orders | Total Products Ordered | Size Count
January  | 1            | 3                      | S-0, M-1, L-1, XL-2
February | 3            | 6                      | S-1, M-2, L-1, XL-3
Or this:
Month    | Total Orders | Total Products Ordered | S Count | M Count | L Count | XL Count
January  | 1            | 3                      | 0       | 1       | 1       | 2
February | 3            | 6                      | 1       | 2       | 1       | 3
You need PIVOT.
It basically turns rows into columns, which is exactly your case.
https://www.codeproject.com/Tips/500811/Simple-Way-To-Use-Pivot-In-SQL-Query
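A minimal sketch of what that could look like here (SQL Server, assuming the equipment table and columns from the question, and treating each distinct num_id per month as one order):

-- Size counts per month via PIVOT, joined to the order/product totals.
-- RTRIM() because jersey is CHAR(3); DATENAME(month, ...) extracts the month name.
SELECT p.[Month],
       o.[Total Orders],
       o.[Total Products Ordered],
       p.S  AS [S Count],
       p.M  AS [M Count],
       p.L  AS [L Count],
       p.XL AS [XL Count]
FROM (
    SELECT DATENAME(month, [date]) AS [Month], RTRIM(jersey) AS size
    FROM equipment
) src
PIVOT (COUNT(size) FOR size IN ([S], [M], [L], [XL])) p
JOIN (
    SELECT DATENAME(month, [date]) AS [Month],
           COUNT(DISTINCT num_id)  AS [Total Orders],
           COUNT(*)                AS [Total Products Ordered]
    FROM equipment
    GROUP BY DATENAME(month, [date])
) o ON o.[Month] = p.[Month];

If you prefer the single Size Count text column instead, conditional aggregation (SUM(CASE WHEN jersey = 'S' THEN 1 ELSE 0 END) per size) plus string concatenation works just as well.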