Redshift count with variable - sql

Imagine I have a table on Redshift with a structure similar to this one. Product_Bill_ID is the primary key of the table.
| Store_ID | Product_Bill_ID | Payment_Date        |
|----------|-----------------|---------------------|
| 1        | 1               | 01/10/2016 11:49:33 |
| 1        | 2               | 01/10/2016 12:38:56 |
| 1        | 3               | 01/10/2016 12:55:02 |
| 2        | 4               | 01/10/2016 16:25:05 |
| 2        | 5               | 02/10/2016 08:02:28 |
| 3        | 6               | 03/10/2016 02:32:09 |
If I want to count how many Product_Bill_IDs each store sold in the first hour after it sold its first one, how could I do this?
This example should output:
| Store_ID | First_Payment_Date  | Sold_First_Hour |
|----------|---------------------|-----------------|
| 1        | 01/10/2016 11:49:33 | 2               |
| 2        | 01/10/2016 16:25:05 | 1               |
| 3        | 03/10/2016 02:32:09 | 1               |

You need each store's first payment date. That is easy enough using window functions:
select s.*,
       min(payment_date) over (partition by store_id) as first_payment_date
from sales s
Then, you need to do the date filtering and aggregation:
select store_id, count(*)
from (select s.*,
             min(payment_date) over (partition by store_id) as first_payment_date
      from sales s
     ) s
where payment_date <= first_payment_date + interval '1 hour'
group by store_id;
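If you want to sanity-check either answer, here is a minimal sketch of the sample data. The table name sales and the column types are assumptions (the question never creates the table), and the question's dates are read as DD/MM/YYYY:

create temp table sales (
    store_id        int,
    product_bill_id int,
    payment_date    timestamp
);

insert into sales values
    (1, 1, '2016-10-01 11:49:33'),
    (1, 2, '2016-10-01 12:38:56'),
    (1, 3, '2016-10-01 12:55:02'),
    (2, 4, '2016-10-01 16:25:05'),
    (2, 5, '2016-10-02 08:02:28'),
    (3, 6, '2016-10-03 02:32:09');

-- the query above should then return:
--   store_id | count
--   ---------+------
--   1        | 2
--   2        | 1
--   3        | 1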

SELECT
    store_id,
    first_payment_date,
    SUM(CASE WHEN payment_date < DATEADD(hour, 1, first_payment_date)
             THEN 1 END) AS sold_first_hour
FROM (
    SELECT
        *,
        MIN(payment_date) OVER (PARTITION BY store_id) AS first_payment_date
    FROM yourtable
) parsed_table
GROUP BY
    store_id,
    first_payment_date
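One nuance between the two answers: SUM over a CASE with no ELSE yields NULL when no row matches, while COUNT of the same CASE yields 0. Here every store's first bill falls inside its own first hour, so the difference is cosmetic, but a COUNT variant (same subquery, only the aggregate swapped) avoids NULLs in the general case:

SELECT
    store_id,
    first_payment_date,
    COUNT(CASE WHEN payment_date < DATEADD(hour, 1, first_payment_date)
               THEN 1 END) AS sold_first_hour
FROM (
    SELECT *,
           MIN(payment_date) OVER (PARTITION BY store_id) AS first_payment_date
    FROM yourtable
) parsed_table
GROUP BY store_id, first_payment_date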

Related

SQL query grouping by range

I have a table A with the following data:
+------+-------+----+--------+
| YEAR | MONTH | PA | AMOUNT |
+------+-------+----+--------+
| 2020 |     1 | N  |    100 |
| 2020 |     2 | N  |    100 |
| 2020 |     3 | O  |    100 |
| 2020 |     4 | N  |    100 |
| 2020 |     5 | N  |    100 |
| 2020 |     6 | O  |    100 |
+------+-------+----+--------+
I'd like to have the following result:
+---------+---------+--------+
| FROM    | TO      | AMOUNT |
+---------+---------+--------+
| 2020-01 | 2020-02 |    200 |
| 2020-03 | 2020-03 |    100 |
| 2020-04 | 2020-05 |    200 |
| 2020-06 | 2020-06 |    100 |
+---------+---------+--------+
My DB is DB2/400.
I have tried ROW_NUMBER with partitioning and subqueries, but I can't figure out how to solve this.
I understand this as a gaps-and-islands problem, where you want to group together adjacent rows that have the same PA.
Here is an approach using the difference between row numbers to build the groups:
select min(year_month) year_month_start, max(year_month) year_month_end, sum(amount) amount
from (
    select a.*,
        year * 100 + month year_month,
        row_number() over(order by year, month) rn1,
        row_number() over(partition by pa order by year, month) rn2
    from a
) a
group by pa, rn1 - rn2  -- include pa so islands of different PA values never merge
order by year_month_start
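To see why this works, here is the difference hand-computed on the sample data; within an island the two row numbers advance in lockstep, so their difference is constant:

-- year_month | pa | rn1 | rn2 | rn1 - rn2
-- 202001     | N  | 1   | 1   | 0    \ island 1
-- 202002     | N  | 2   | 2   | 0    /
-- 202003     | O  | 3   | 1   | 2      island 2
-- 202004     | N  | 4   | 3   | 1    \ island 3
-- 202005     | N  | 5   | 4   | 1    /
-- 202006     | O  | 6   | 2   | 4      island 4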
You can try the query below:
select min(year) || '-' || min(month) as from_date,
       max(year) || '-' || max(month) as to_date,
       sum(amount) as amount
from (
    select *,
        row_number() over(order by month) -
        row_number() over(partition by pa order by month) as grprn
    from t1
) A
group by grprn, pa
order by grprn
This works in T-SQL; I guess you can adapt it to DB2/400?
SELECT MIN(Dte) [From]
     , MAX(Dte) [To]
     -- , PA
     , SUM(Amount)
FROM (
    SELECT year * 100 + month Dte
         , Pa
         , Amount
         , ROW_NUMBER() OVER (PARTITION BY pa ORDER BY year * 100 + month)
           + 10000 - (Year * 100 + Month) rn
    FROM tabA a
) b
GROUP BY Pa
       , rn
ORDER BY [From]
       , [To]
The trick is the row number function, partitioned by PA and ordered by date: it counts up by one for each month. When you add that to a descending function of the month (here 10000 - (year * 100 + month)), consecutive months with the same PA land on the same number. Group by PA and the computed grouping, rn, to get the islands, and Bob's your uncle.
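A hand-computed trace on the sample data makes this concrete (note that the year * 100 + month encoding does not treat December and the following January as adjacent, so runs crossing a year boundary will split):

-- pa | year*100+month | row_number | rn = row_number + 10000 - (year*100+month)
-- N  | 202001         | 1          | -192000  \ group 1
-- N  | 202002         | 2          | -192000  /
-- N  | 202004         | 3          | -192001  \ group 2
-- N  | 202005         | 4          | -192001  /
-- O  | 202003         | 1          | -192002    group 3
-- O  | 202006         | 2          | -192004    group 4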

Grouping consecutive sequences of rows

I'm trying to group consecutive rows where a boolean value is true on SQL Server. For example, here's what some source data looks like:
AccountID | ID | IsTrue | Date
-------------------------------
1         | 1  | 1      | 1/1/2013
1         | 2  | 1      | 1/2/2013
1         | 3  | 1      | 1/3/2013
1         | 4  | 0      | 1/4/2013
1         | 5  | 1      | 1/5/2013
1         | 6  | 0      | 1/6/2013
1         | 7  | 1      | 1/7/2013
1         | 8  | 1      | 1/8/2013
1         | 9  | 1      | 1/9/2013
And here's what I'd like as the output
AccountID | Start    | End
-------------------------------
1         | 1/1/2013 | 1/3/2013
1         | 1/7/2013 | 1/9/2013
I have a hunch that there's some trick with grouping by partitions that will make this work but I've been unable to figure it out. I've made some progress using LAG but haven't been able to put it all together.
Thanks for the help!
This is an example of a gaps and islands problem. For this version, you just need a sequential number for each isTrue value. Subtracting that number of days from each date gives a constant for adjacent rows with the same value:
select accountId, isTrue, min(date), max(date)
from (select t.*,
row_number() over (partition by accountId, isTrue order by date) as seqnum
from t
) t
group by accountId, isTrue, dateadd(day, -seqnum, date);
This defines all groups. If I assume that you just want runs of "1"s that last more than one day, then:
select accountId, isTrue, min(date), max(date)
from (select t.*,
row_number() over (partition by accountId, isTrue order by date) as seqnum
from t
where isTrue = 1
) t
group by accountId, isTrue, dateadd(day, -seqnum, date)
having count(*) > 1;
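Hand-computing the grouping key on the question's sample (isTrue = 1 rows only) shows it staying constant within each island:

-- date       | seqnum | dateadd(day, -seqnum, date)
-- 2013-01-01 | 1      | 2012-12-31  \
-- 2013-01-02 | 2      | 2012-12-31   island 1
-- 2013-01-03 | 3      | 2012-12-31  /
-- 2013-01-05 | 4      | 2013-01-01    island 2 (length 1, dropped by HAVING)
-- 2013-01-07 | 5      | 2013-01-02  \
-- 2013-01-08 | 6      | 2013-01-02   island 3
-- 2013-01-09 | 7      | 2013-01-02  /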
You can try the following; it assumes that id always has consecutive values.
with cte as
(
    select *,
        count(*) over (partition by IsTrue, rnk) as total
    from
    (
        select *,
            id - row_number() over (partition by IsTrue order by id, date) as rnk
        from myTable
    ) val
)
select
    accountId,
    min(date) as [start],
    max(date) as [end]
from cte
where IsTrue = 1 and total > 1  -- keep only runs of true rows
group by
    accountId,
    rnk
Output:
| accountid | start      | end        |
|-----------|------------|------------|
| 1         | 2013-01-01 | 2013-01-03 |
| 1         | 2013-01-07 | 2013-01-09 |

Get users who took rides on 3 or more consecutive dates

I have the table below; it shows user_id and ride_date.
+---------+------------+
| user_id | ride_date  |
+---------+------------+
| 1       | 2019-11-01 |
| 1       | 2019-11-03 |
| 1       | 2019-11-05 |
| 2       | 2019-11-03 |
| 2       | 2019-11-04 |
| 2       | 2019-11-05 |
| 2       | 2019-11-06 |
| 3       | 2019-11-03 |
| 3       | 2019-11-04 |
| 3       | 2019-11-05 |
| 3       | 2019-11-06 |
| 4       | 2019-11-05 |
| 4       | 2019-11-07 |
| 4       | 2019-11-08 |
| 4       | 2019-11-09 |
| 5       | 2019-11-11 |
| 5       | 2019-11-13 |
+---------+------------+
I want the user_ids who took rides on 3 or more consecutive days, along with the dates of those consecutive rides.
The desired result is as below:
+---------+-----------------------+
| user_id | consecutive_ride_date |
+---------+-----------------------+
| 2       | 2019-11-03            |
| 2       | 2019-11-04            |
| 2       | 2019-11-05            |
| 2       | 2019-11-06            |
| 3       | 2019-11-03            |
| 3       | 2019-11-04            |
| 3       | 2019-11-05            |
| 3       | 2019-11-06            |
| 4       | 2019-11-07            |
| 4       | 2019-11-08            |
| 4       | 2019-11-09            |
+---------+-----------------------+
With LAG() and LEAD() window functions:
with cte as (
    select *,
        datediff(day,
                 lag([ride_date]) over (partition by [user_id] order by [ride_date]),
                 [ride_date]) prev1,
        datediff(day,
                 lag([ride_date], 2) over (partition by [user_id] order by [ride_date]),
                 [ride_date]) prev2,
        datediff(day,
                 [ride_date],
                 lead([ride_date]) over (partition by [user_id] order by [ride_date])) next1,
        datediff(day,
                 [ride_date],
                 lead([ride_date], 2) over (partition by [user_id] order by [ride_date])) next2
    from Table1
)
select [user_id], [ride_date]
from cte
where
    (prev1 = 1 and prev2 = 2) or  -- third day of a 3-day run
    (prev1 = 1 and next1 = 1) or  -- middle day of a run
    (next1 = 1 and next2 = 2)     -- first day of a 3-day run
Results:
user_id | ride_date
--------------------
2       | 03/11/2019
2       | 04/11/2019
2       | 05/11/2019
2       | 06/11/2019
3       | 03/11/2019
3       | 04/11/2019
3       | 05/11/2019
3       | 06/11/2019
4       | 07/11/2019
4       | 08/11/2019
4       | 09/11/2019
Here is one way to address this gaps-and-islands problem:
- first, assign a rank to each user's ride with row_number(), and recover the previous ride_date (aliased lag_ride_date)
- then, compare the previous ride date to the current one in a conditional sum that only increases when the dates are consecutive; comparing this sum with the ride's rank yields groups (aliased grp) that represent runs of rides one day apart
- do a window count of how many records belong to each group (aliased cnt)
- finally, filter on records whose window count is 3 or more
Query:
select user_id, ride_date
from (
    select t.*,
           count(*) over(partition by user_id, grp) cnt
    from (
        select t.*,
               rn1
               - sum(case when ride_date = dateadd(day, 1, lag_ride_date) then 1 else 0 end)
                 over(partition by user_id order by ride_date) grp
        from (
            select t.*,
                   row_number() over(partition by user_id order by ride_date) rn1,
                   lag(ride_date) over(partition by user_id order by ride_date) lag_ride_date
            from Table1 t
        ) t
    ) t
) t
where cnt >= 3
This is a typical gaps-and-islands problem.
We can solve it as follows:
with data as (
    select user_id,
           ride_date,
           dateadd(day,
                   -row_number() over(partition by user_id order by ride_date asc),
                   ride_date) as grp_field
    from Table1
),
consecutive_days as (
    select user_id,
           ride_date,
           count(*) over(partition by user_id, grp_field) as cnt
    from data
)
select *
from consecutive_days
where cnt >= 3
order by user_id, ride_date
https://dbfiddle.uk/?rdbms=sqlserver_2017&fiddle=7bb851d9a12966b54afb4d8b144f3d46
There is no need to apply gaps-and-islands methodologies to this problem. The problem is much simpler to solve.
You can return the users and first date just by using LEAD():
SELECT t1.*
FROM (SELECT t1.*,
LEAD(ride_date, 2) OVER (PARTITION BY user_id ORDER BY ride_date) as ride_date_2
FROM table1 t1
) t1
WHERE ride_date_2 = DATEADD(day, 2, ride_date);
If you want the actual dates, you can unpivot the results:
SELECT DISTINCT t1.user_id, v.ride_date
FROM (SELECT t1.*,
LEAD(ride_date, 2) OVER (PARTITION BY user_id ORDER BY ride_date) as ride_date_2
FROM table1 t1
) t1 CROSS APPLY
(VALUES (t1.ride_date),
(DATEADD(day, 1, t1.ride_date)),
(DATEADD(day, 2, t1.ride_date))
) v(ride_date)
WHERE t1.ride_date_2 = DATEADD(day, 2, t1.ride_date)
ORDER BY t1.user_id, v.ride_date;
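For runs longer than three days, more than one start row qualifies and the generated date triples overlap; the DISTINCT collapses them. A hand-trace of user 2 (rides on 03, 04, 05, 06):

-- start 2019-11-03: LEAD(ride_date, 2) = 2019-11-05 = start + 2 days -> emits 03, 04, 05
-- start 2019-11-04: LEAD(ride_date, 2) = 2019-11-06 = start + 2 days -> emits 04, 05, 06
-- SELECT DISTINCT merges the two triples into 03, 04, 05, 06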

Rolling 90 days active users in BigQuery, improving performance (DAU/MAU/WAU)

I'm trying to get the number of unique events on a specific date, rolling 90/30/7 days back. I've got this working on a limited number of rows with the query below, but for large data sets I get memory errors because the aggregated string becomes massive.
I'm looking for a more effective way of achieving the same result.
Table looks something like this:
+----+------------+--------+
|    | date       | userid |
+----+------------+--------+
| 1  | 2013-05-14 | xxxxx  |
| 2  | 2017-03-14 | xxxxx  |
| 3  | 2018-01-24 | xxxxx  |
| 4  | 2013-03-21 | xxxxx  |
| 5  | 2014-03-19 | xxxxx  |
| 6  | 2015-09-03 | xxxxx  |
| 7  | 2014-02-06 | xxxxx  |
| 8  | 2014-10-30 | xxxxx  |
| .. | ...        | ...    |
+----+------------+--------+
Format of the desired result:
+----+------------+---------------------+----------------------+
|    | date       | active_users_7_days | active_users_90_days |
+----+------------+---------------------+----------------------+
| 1  | 2013-05-14 | 1240                | 34339                |
| 2  | 2017-03-14 | 4334                | 54343                |
| 3  | 2018-01-24 | .....               | .....                |
| 4  | 2013-03-21 | .....               | .....                |
| 5  | 2014-03-19 | .....               | .....                |
| 6  | 2015-09-03 | .....               | .....                |
| 7  | 2014-02-06 | .....               | .....                |
| 8  | 2014-10-30 | .....               | .....                |
| .. | ...        | .....               | .....                |
+----+------------+---------------------+----------------------+
My query looks like this:
#standardSQL
WITH T1 AS (
  SELECT
    date,
    STRING_AGG(DISTINCT userid) AS IDs
  FROM `consumer.events`
  GROUP BY date
),
T2 AS (
  SELECT
    date,
    STRING_AGG(IDs) OVER (ORDER BY UNIX_DATE(date) RANGE BETWEEN 90 PRECEDING AND CURRENT ROW) AS IDs
  FROM T1
)
SELECT
  date,
  (SELECT COUNT(DISTINCT userid) FROM UNNEST(SPLIT(IDs)) AS userid) AS NinetyDays
FROM T2
Counting unique users requires a lot of resources, even more if you want results over a rolling window. For a scalable solution, look into approximate algorithms like HLL++:
https://medium.freecodecamp.org/counting-uniques-faster-in-bigquery-with-hyperloglog-5d3764493a5a
For an exact count, this would work (but gets slower as the window gets larger):
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, COUNT(DISTINCT owner_user_id) unique_90_day_users
, COUNT(DISTINCT IF(i<31,owner_user_id,null)) unique_30_day_users
, COUNT(DISTINCT IF(i<8,owner_user_id,null)) unique_7_day_users
FROM (
  -- one row per (day, user) pair
  SELECT DATE(creation_date) date, owner_user_id
  FROM `bigquery-public-data.stackoverflow.posts_questions`
  WHERE EXTRACT(YEAR FROM creation_date)=2017
  GROUP BY 1, 2
), UNNEST(GENERATE_ARRAY(1, 90)) i  -- fan each pair out over 90 one-day offsets
GROUP BY 1
ORDER BY date_grp
The approximate solution produces results way faster (14s vs 366s), but the results are approximate:
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
ORDER BY date_grp
Updated query that gives correct results by removing rows with fewer than 90 days of history (works when no dates are missing):
#standardSQL
SELECT DATE_SUB(date, INTERVAL i DAY) date_grp
, HLL_COUNT.MERGE(sketch) unique_90_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<31,sketch,null)) unique_30_day_users
, HLL_COUNT.MERGE(DISTINCT IF(i<8,sketch,null)) unique_7_day_users
, COUNT(*) window_days
FROM (
SELECT DATE(creation_date) date, HLL_COUNT.INIT(owner_user_id) sketch
FROM `bigquery-public-data.stackoverflow.posts_questions`
WHERE EXTRACT(YEAR FROM creation_date)=2017
GROUP BY 1
), UNNEST(GENERATE_ARRAY(1, 90)) i
GROUP BY 1
HAVING window_days=90
ORDER BY date_grp
You can aggregate down to one row per user, keeping each user's most recent date as the aggregation, and then sum over that:
select count(*) as num_users,
       sum(case when max_date > date_sub(current_date, interval 30 day) then 1 else 0 end) as num_users_30days,
       sum(case when max_date > date_sub(current_date, interval 60 day) then 1 else 0 end) as num_users_60days,
       sum(case when max_date > date_sub(current_date, interval 90 day) then 1 else 0 end) as num_users_90days
from (select user_id, max(date) as max_date  -- one row per user: their most recent event date
      from `consumer.events` e
      group by user_id
     ) e;
If the most recent date for the user is in the period, then the user should be counted.
You can get this "as-of" a particular date by using a where clause in the subquery.
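For instance, a minimal sketch of that "as-of" variant; the cutoff date 2017-06-30 is made up for illustration, and the table and column names follow the query above:

select count(*) as num_users,
       sum(case when max_date > date_sub(date '2017-06-30', interval 30 day) then 1 else 0 end) as num_users_30days,
       sum(case when max_date > date_sub(date '2017-06-30', interval 90 day) then 1 else 0 end) as num_users_90days
from (select user_id, max(date) as max_date
      from `consumer.events` e
      where date <= date '2017-06-30'  -- the "as-of" cutoff
      group by user_id
     ) e;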

Aggregate/Windowed Function To Find Min and Max of Sequential Rows

I've got a SQL table where I want to find the first and last dates of a group of records, provided they're sequential.
Patient | TestType | Result | Date
------------------------------------------
1       | 1        | A      | 2012-03-04
1       | 1        | A      | 2012-08-19
1       | 1        | B      | 2013-05-27
1       | 1        | A      | 2013-06-20
1       | 2        | X      | 2012-08-19
1       | 2        | X      | 2013-06-20
2       | 1        | B      | 2014-09-09
2       | 1        | B      | 2015-04-19
Should be returned as
Patient | TestType | Result | StartDate  | EndDate
--------------------------------------------------------
1       | 1        | A      | 2012-03-04 | 2012-08-19
1       | 1        | B      | 2013-05-27 | 2013-05-27
1       | 1        | A      | 2013-06-20 | 2013-06-20
1       | 2        | X      | 2012-08-19 | 2013-06-20
2       | 1        | B      | 2014-09-09 | 2015-04-19
The problem is that if I just group by Patient, TestType, and Result,
then the first and third rows in the example above would become a single row.
Patient | TestType | Result | StartDate  | EndDate
--------------------------------------------------------
1       | 1        | A      | 2012-03-04 | 2013-06-20
1       | 1        | B      | 2013-05-27 | 2013-05-27
1       | 2        | X      | 2012-08-19 | 2013-06-20
2       | 1        | B      | 2014-09-09 | 2015-04-19
I feel like there's got to be something clever I can do with a partition, but I can't quite figure out what it is.
There are several ways to approach this. I like identifying the groups using the difference of row number values:
select patient, testtype, result,
min(date) as startdate, max(date) as enddate
from (select t.*,
(row_number() over (partition by patient, testtype order by date) -
row_number() over (partition by patient, testtype, result order by date)
) as grp
from table t
) t
group by patient, testtype, result, grp
order by patient, startdate;
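To see how the difference of the two row numbers (grp) isolates islands, here is a hand-computed trace; note that grp values can repeat across different (testtype, result) combinations, which is why both appear in the GROUP BY:

-- patient 1, ordered by date within each partition:
-- testtype | date       | result | rn(patient,testtype) | rn(patient,testtype,result) | grp
-- 1        | 2012-03-04 | A      | 1                    | 1                           | 0   \ island A1
-- 1        | 2012-08-19 | A      | 2                    | 2                           | 0   /
-- 1        | 2013-05-27 | B      | 3                    | 1                           | 2     island B
-- 1        | 2013-06-20 | A      | 4                    | 3                           | 1     island A2
-- 2        | 2012-08-19 | X      | 1                    | 1                           | 0   \ island X
-- 2        | 2013-06-20 | X      | 2                    | 2                           | 0   /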
select patient, testtype, result, date as startdate,
isnull(lead(date) over(partition by patient, testtype, result order by date), date) as enddate
from tablename;
You can use the lead function to get the date value (as enddate) from the next row in each group.
See if this gives you what you need.
with T1 as (
    select *,
        -- flag a new group whenever (Patient, TestType, Result) differs from the
        -- previous row in date order; assumes the date column is named dt
        case when lag(Patient) over (order by Patient, TestType, dt) = Patient
              and lag(TestType) over (order by Patient, TestType, dt) = TestType
              and lag(Result) over (order by Patient, TestType, dt) = Result
             then null else 1 end as Changes
    from t
), T2 as (
    select
        Patient,
        TestType,
        Result,
        dt,
        sum(Changes) over (
            order by Patient, TestType, dt
        ) as seq
    from T1
)
select
    Patient,
    TestType,
    Result,
    min(dt) as dtFrom,
    max(dt) as dtTo
from T2
group by Patient, TestType, Result, seq
order by Patient, TestType, Result
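To make the logic explicit: T1 flags a change whenever any of (Patient, TestType, Result) differs from the previous row in date order, and the running sum in T2 turns those flags into a group id. A hand-computed sketch on the sample data:

-- Patient | TestType | Result | dt         | Changes | seq
-- 1       | 1        | A      | 2012-03-04 | 1       | 1
-- 1       | 1        | A      | 2012-08-19 | null    | 1
-- 1       | 1        | B      | 2013-05-27 | 1       | 2
-- 1       | 1        | A      | 2013-06-20 | 1       | 3
-- 1       | 2        | X      | 2012-08-19 | 1       | 4
-- 1       | 2        | X      | 2013-06-20 | null    | 4
-- 2       | 1        | B      | 2014-09-09 | 1       | 5
-- 2       | 1        | B      | 2015-04-19 | null    | 5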