How to write a SQL query that counts rows where a distinct id occurs 7 days after the first occurrence of that unique id? - sql

I am looking to return a date, the count of unique_ids whose first occurrence falls on that date, the number of unique_ids that occur again 7 days after their first occurrence, and the percentage of occurrences after 7 days divided by the number of first occurrences.
example data_import table
+------------+-------------+------------------------------------------------------+
| time       | distinct_id | note                                                 |
+------------+-------------+------------------------------------------------------+
| 2018/10/01 | 1           | first instance of `1`                                |
| 2018/10/01 | 2           | also first instance, but does not occur 7 days later |
| 2018/10/02 | 1           | should be disregarded (not first instance of `1`)    |
| 2018/10/02 | 3           | first instance of `3`                                |
| 2018/10/08 | 1           | occurs 7 days after the first instance of `1`        |
| 2018/10/08 | 1           | don't count: 2nd instance of `1` on this day         |
| 2018/10/09 | 3           | 7 days after the first instance of `3`               |
| 2018/10/09 | 1           | 7 days after a non-first instance of `1`             |
+------------+-------------+------------------------------------------------------+
And the expected return.
+------------+---------------------+------------------------+---------------------------+
| time       | num_of_1st_instance | num_occur_7_days_after | percent_used_7_days_after |
+------------+---------------------+------------------------+---------------------------+
| 2018/10/01 | 2                   | 1                      | .50                       |
| 2018/10/02 | 1                   | 1                      | 1.0                       |
| 2018/10/03 | 0                   | 0                      | 0                         |
+------------+---------------------+------------------------+---------------------------+
The query I have written is close, but it counts occurrences other than the first for a distinct_id.
In my example, this query would include the occurrence of distinct_id 1 on 2018/10/02 and its occurrence seven days later on 2018/10/09. That is not wanted, because the 2018/10/02 occurrence of distinct_id 1 is not its first.
SELECT
    DATE(data_import.time) AS date,
    count(distinct data_import.distinct_id) AS num_installs_on_install_date,
    count(distinct future_activity.distinct_id) AS num_occur_7_days_after,
    count(distinct future_activity.distinct_id) / count(distinct data_import.distinct_id)::float AS percent_used_7_days_after
FROM data_import
LEFT JOIN data_import AS future_activity
    ON data_import.distinct_id = future_activity.distinct_id
    AND DATE(data_import.time) = DATE(future_activity.time) - INTERVAL '7 days'
    AND data_import.time = (SELECT time
                            FROM data_import
                            WHERE distinct_id = future_activity.distinct_id
                            ORDER BY time
                            LIMIT 1)
GROUP BY DATE(data_import.time)
I hope that I explained this clearly. Please let me know how I can change my current query, or suggest a different approach to the solution.

Hmmm. Does this do what you want?
select date(di.time) as time,
       sum( (seqnum = 1)::int ) as first_instance,
       sum( (seqnum = 1)::int * flag_7day ) as num_after_7_day,
       sum( (seqnum = 1)::int * flag_7day ) * 1.0 /
           nullif(sum( (seqnum = 1)::int ), 0) as ratio
from (select di.*,
             row_number() over (partition by distinct_id order by time) as seqnum,
             (case when exists (select 1
                                from data_import di2
                                where di2.distinct_id = di.distinct_id
                                  and date(di2.time) = date(di.time) + interval '7 days')
                   then 1 else 0
              end) as flag_7day
      from data_import di
     ) di
group by date(di.time);
This doesn't return dates with no rows at all, and the ratio is null on days with no first instances. Those days seem a bit awkward with respect to the ratio, so I'm not 100% sure that you really need them. If you do, it is easy enough to include a generate_series() to generate all dates in the range that you want.
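For completeness, a minimal sketch of the generate_series() idea, assuming the fixed range 2018-10-01 through 2018-10-09 (in practice you would derive the bounds from min/max of data_import):
with daily as (
    -- the grouped query from above, minus the ratio
    select date(di.time) as time,
           sum((seqnum = 1)::int) as first_instance,
           sum((seqnum = 1)::int * flag_7day) as num_after_7_day
    from (select di.*,
                 row_number() over (partition by distinct_id order by time) as seqnum,
                 (case when exists (select 1
                                    from data_import di2
                                    where di2.distinct_id = di.distinct_id
                                      and date(di2.time) = date(di.time) + interval '7 days')
                       then 1 else 0
                  end) as flag_7day
          from data_import di) di
    group by date(di.time)
)
select d.day::date as time,
       coalesce(daily.first_instance, 0) as first_instance,
       coalesce(daily.num_after_7_day, 0) as num_after_7_day,
       coalesce(daily.num_after_7_day * 1.0 /
                nullif(daily.first_instance, 0), 0) as ratio
from generate_series(date '2018-10-01', date '2018-10-09',
                     interval '1 day') as d(day)
left join daily on daily.time = d.day::date;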

Related

How to limit a SUM using postgres

I want to limit a SUM to 2 hours in Postgres SQL. I don't want to limit the result of the sum, but the values that it is going to sum.
For example, the following table:
| CODE | HOUR  |
| ---- | ----- |
| 1    | 02:20 |
| 2    | 01:50 |
| 3    | 00:30 |
| 4    | 03:00 |
In this case, if I SUM('02:20', '01:50', '00:30', '03:00'), the result would be 07:40.
But what I want is to limit the column HOUR to 02:00. So if a value is > 02:00, it is replaced with 02:00, only in the SUM.
So the SUM would effectively be ('02:00', '01:50', '00:30', '02:00'), and the result would be 06:20.
Use a CASE expression as the argument of the aggregate:
select sum(case when hour < '2:00' then hour else '2:00' end)
from my_table
Test it in Db<>fiddle.
It's still not perfect, but the idea is this:
select sum(x.anything::interval)   -- sum() is defined for interval, not time
from (select id,
             time,
             case when time <= '02:00' then time::text
                  else '02:' || lpad(extract(minute from time::time)::text, 2, '0')
             end as anything
      from time_table) x
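As an aside, if the HOUR column is stored as an interval, least() expresses the cap more directly than a CASE; a minimal sketch under that assumption:
select sum(least(hour, interval '2 hours')) as total  -- cap each value at 2 hours before summing
from my_table;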

SQL Day-over-Day count miscalculation

I'm encountering a bug in my SQL code that calculates the day-over-day (DoD) count difference. This table (curr_day) summarizes the count of trades on each business day (i.e. excluding weekends and government-mandated holidays) and is joined to a similar, day-lagged table (prev_day). The join is based on the day's rank; for example, the first day in the curr_day table is Jan-01 with rank 1, while the first day (rank 1) in the prev_day table is Dec-31.
My issue is that the trade count never shows a positive change (see table below), only negative or no changes. The problem does not affect other fields that calculate the value of a trade, only the count of trades on a given day.
Sample query:
with curr_day as (
    select GROUP, DATE, COUNT
    from DB
    where DATE is not HOLIDAY),
prev_day as (
    select rank() over (partition by GROUP order by DATE) as RANK,
           GROUP, DATE, COUNT
    from curr_day)
select curr.GROUP, curr.DATE, curr.COUNT, curr.COUNT - prev_day.COUNT as DoD_Cnt_Diff
from (select rank() over (partition by GROUP order by DATE) as RANK,
             GROUP, DATE, COUNT
      from curr_day
      where DATE >= (select min(DATE) + 1 from curr_day)) curr
left join prev_day
       on curr.RANK = prev_day.RANK
      and curr.GROUP = prev_day.GROUP;
Output table
Date | Group | Count | DoD_Cnt_Diff
2020-12-31 | A | 1 | 0
2021-01-01 | A | 1 | 0
2021-01-02 | A | 0 | -1
2021-01-03 | A | 1 | (null)
2021-01-04 | A | 0 | -1
2021-01-05 | A | 0 | 0
2021-12-31 | B | 0 | 0
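For what it's worth, a lag()-based sketch computes the same day-over-day difference without the rank self-join. Names follow the question's schematic style (GROUP, DATE, and COUNT are reserved words, hence the quoting; the holiday filter is kept schematic, as in the question):
select "GROUP",
       "DATE",
       "COUNT",
       "COUNT" - lag("COUNT") over (partition by "GROUP"
                                    order by "DATE") as DoD_Cnt_Diff
from DB
where "DATE" is not HOLIDAY   -- schematic filter, as in the question
order by "GROUP", "DATE";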

Merge rows based on a condition

Is it possible to merge a collection of rows based on a condition in Spark SQL, using a SQL query?
If the difference between the purch_dt values of two consecutive rows (ordered by line_num) is less than 5 days, they should be combined into one row, and the merged row should carry the max purch_dt of its group. I tried using the LEAD function, but I can't get it to reset after each false condition and treat the following rows as a new group, and I can't get the max purch_dt for each group.
Input:
orderid | line_num | purch_dt
1 | 1 | 10-02-2020
1 | 2 | 12-02-2020
1 | 3 | 14-02-2020
1 | 4 | 21-03-2020
1 | 5 | 23-03-2020
Output:
orderid | purch_dt
1 | 14-02-2020 -- lines 1-3 combined into 1 row because the difference is <5 days between each
1 | 23-03-2020 -- lines 4-5 combined into 1 row because the difference is <5 days between each
Total Output rows = 2 because we have 2 groups.
Please note that line_num 4 starts a new group, since its difference from line_num 3 is greater than 5 days. Hence it gets its own merged record.
I have the SQL below so far, but I can't get it to break out and create the groups.
create temporary view next_dt as
select
    orderid,
    line_num,
    purch_dt,
    LEAD(purch_dt) over (partition by orderid order by line_num asc) AS next_purch_dt
from orders;

select *
from (
    select
        orderid,
        line_num,
        purch_dt,
        CASE WHEN datediff(next_purch_dt, purch_dt) < 5 OR next_purch_dt IS NULL THEN 'Y'
             ELSE 'N'
        END AS flg
    from next_dt) t
WHERE flg = 'Y';
Any help is appreciated.
UPDATE:
Slight change in the requirements:-
The comparison now has to be made between two different fields in consecutive records: the purch_dt of the current record and the return_dt of the next record.
Also, when a merged group is output, its purch_dt should come from the record with the least line_num in the group, and its return_dt from the record with the max line_num in that group.
Input:
orderid | line_num | purch_dt | return_dt
1 | 1 | 10-02-2020 | 10-02-2020
1 | 2 | 12-02-2020 | 13-02-2020
1 | 3 | 14-02-2020 | 14-02-2020
1 | 4 | 21-03-2020 | 23-03-2020
1 | 5 | 23-03-2020 | 24-03-2020
Output:
orderid | purch_dt | return_dt
1 | 10-02-2020 | 14-02-2020
1 | 21-03-2020 | 24-03-2020
Total Output rows = 2 because we have 2 groups.
Note that each output record contains the purch_dt of the record with the min line_num in its group, and the return_dt of the record with the max line_num in that group.
You almost got it; the query below worked for me:
sql("""create temporary view next_dt_orders as
select *
from (
select
orderid,line_num,purch_dt,
case when datediff(
(lead(purch_dt) over (partition by orderid order by line_num asc)),
purch_dt) < 5
then "N"
else "Y"
end as flag
from
orders) tab
where
flag='Y'""")
sql("select * from next_dt_orders").show()
+-------+--------+----------+----+
|orderid|line_num| purch_dt|flag|
+-------+--------+----------+----+
| 1| 3|2020-02-14| Y|
| 1| 5|2020-03-23| Y|
+-------+--------+----------+----+
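For the updated requirement, a gaps-and-islands sketch along the same lines might look like this (untested; it assumes purch_dt and return_dt increase with line_num within each group, so min/max pick the first and last rows' values):
sql("""
with flagged as (
    select orderid, line_num, purch_dt, return_dt,
           -- start a new group when there is no previous row, or when this
           -- row's return_dt is 5+ days after the previous row's purch_dt
           case when lag(purch_dt) over (partition by orderid order by line_num) is null
                  or datediff(return_dt,
                              lag(purch_dt) over (partition by orderid order by line_num)) >= 5
                then 1 else 0
           end as new_grp
    from orders),
grouped as (
    select *,
           sum(new_grp) over (partition by orderid order by line_num) as grp
    from flagged)
select orderid,
       min(purch_dt)  as purch_dt,   -- from the lowest line_num in the group
       max(return_dt) as return_dt   -- from the highest line_num in the group
from grouped
group by orderid, grp""")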

Redshift querying data on dates

I am trying to query data from a Redshift table.
I have a users table with columns such as name, age, gender, and created_at. For example:
| name | age | gender | created_at |
| ---- | --- | ------ | ---------- |
| X    | 24  | F      | some_date  |
I need to query the above table in such a way that I get additional columns such as created_this_week, created_last_week, created_last_4_week, current_month, last_month, etc.
Each additional flag column should be 'Y' when the row satisfies the corresponding condition: created in the current week, last week, the current month, last month, or the last 4 weeks (excluding this week, so the 4 weeks starting last week), something like the below.
| name | age | gender | created_at  | current_week | last_week | last_4_week | current_mnth | last_mnth |
| ---- | --- | ------ | ----------- | ------------ | --------- | ----------- | ------------ | --------- |
| X    | 24  | F      | CURRENTDATE | Y            | N         | N           | Y            | N         |
| F    | 21  | M      | lst_wk_dt   | N            | Y         | Y           | depends      | depends   |
I am new to PostgreSQL and Redshift, and still in my learning phase. I spent the past few hours trying to do this myself but was unsuccessful. I'd really appreciate it if someone could help me out with this one.
You would use case expressions:
select t.*,
(case when created_at >= now() - interval '1 week' then 'Y' else 'N' end) as week1,
(case when created_at >= now() - interval '4 week' then 'Y' else 'N' end) as week4,
. . .
from t;
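If you need calendar weeks and months rather than rolling windows, date_trunc() works in both PostgreSQL and Redshift. A sketch, assuming created_at is a timestamp in a users table (names taken from the question):
select u.*,
       case when date_trunc('week', created_at) = date_trunc('week', current_date)
            then 'Y' else 'N' end as current_week,
       case when date_trunc('week', created_at) = date_trunc('week', current_date) - interval '1 week'
            then 'Y' else 'N' end as last_week,
       case when created_at >= date_trunc('week', current_date) - interval '4 week'
             and created_at <  date_trunc('week', current_date)
            then 'Y' else 'N' end as last_4_week,
       case when date_trunc('month', created_at) = date_trunc('month', current_date)
            then 'Y' else 'N' end as current_mnth,
       case when date_trunc('month', created_at) = date_trunc('month', current_date) - interval '1 month'
            then 'Y' else 'N' end as last_mnth
from users u;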

How to aggregate based on various conditions

Let's say I have a table which stores itemID, Date, and total_shipped over a period of time:
ItemID | Date | Total_shipped
__________________________________
1 | 1/20/2000 | 2
2 | 1/20/2000 | 3
1 | 1/21/2000 | 5
2 | 1/21/2000 | 4
1 | 1/22/2000 | 1
2 | 1/22/2000 | 7
1 | 1/23/2000 | 5
2 | 1/23/2000 | 6
Now I want to aggregate over several periods of time. For example, I want to know how many of each item were shipped every two days, and in total. The desired output should look something like:
ItemID | Jan20-Jan21 | Jan22-Jan23 | Jan20-Jan23
_____________________________________________
1 | 7 | 6 | 13
2 | 7 | 13 | 20
How do I do that in the most efficient way?
I know I can write three different subqueries, but I think there should be a better way. My real data is large and there are several different time periods to consider; in my real problem I want the shipped items for current_week, last_week, two_weeks_ago, three_weeks_ago, last_month, two_months_ago, and three_months_ago, so I do not think writing 7 different subqueries would be a good idea.
Here is the general idea of what I can already run, but it is very expensive for the database:
WITH
sq1 as (
    SELECT ItemID, sum(Total_Shipped) sum1
    FROM my_table
    WHERE Date BETWEEN '1/20/2000' and '1/21/2000'
    GROUP BY ItemID),
sq2 as (
    SELECT ItemID, sum(Total_Shipped) sum2
    FROM my_table
    WHERE Date BETWEEN '1/22/2000' and '1/23/2000'
    GROUP BY ItemID),
sq3 as (
    SELECT ItemID, sum(Total_Shipped) sum3
    FROM my_table
    GROUP BY ItemID)
SELECT DISTINCT t.ItemID, sq1.sum1, sq2.sum2, sq3.sum3
FROM my_table t
JOIN sq1 on t.ItemID = sq1.ItemID
JOIN sq2 on t.ItemID = sq2.ItemID
JOIN sq3 on t.ItemID = sq3.ItemID
I don't know why you have tagged this question with multiple databases.
Anyway, you can use conditional aggregation as follows in Oracle:
select
item_id,
sum(case when "date" between date'2000-01-20' and date'2000-01-21' then total_shipped end) as "Jan20-Jan21",
sum(case when "date" between date'2000-01-22' and date'2000-01-23' then total_shipped end) as "Jan22-Jan23",
sum(case when "date" between date'2000-01-20' and date'2000-01-23' then total_shipped end) as "Jan20-Jan23"
from my_table
group by item_id
Cheers!!
Use FILTER:
select
item_id,
sum(total_shipped) filter (where date between '2000-01-20' and '2000-01-21') as "Jan20-Jan21",
sum(total_shipped) filter (where date between '2000-01-22' and '2000-01-23') as "Jan22-Jan23",
sum(total_shipped) filter (where date between '2000-01-20' and '2000-01-23') as "Jan20-Jan23"
from my_table
group by 1
item_id | Jan20-Jan21 | Jan22-Jan23 | Jan20-Jan23
---------+-------------+-------------+-------------
1 | 7 | 6 | 13
2 | 7 | 13 | 20
(2 rows)
Db<>fiddle.
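The same FILTER pattern extends to the relative windows mentioned in the question. A Postgres sketch using date_trunc() for calendar weeks and months (column names as above; adjust the windows to taste):
select item_id,
       sum(total_shipped) filter (where date >= date_trunc('week', current_date))
           as current_week,
       sum(total_shipped) filter (where date >= date_trunc('week', current_date) - interval '1 week'
                                    and date <  date_trunc('week', current_date))
           as last_week,
       sum(total_shipped) filter (where date >= date_trunc('month', current_date) - interval '1 month'
                                    and date <  date_trunc('month', current_date))
           as last_month
from my_table
group by item_id;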