My table contains answers from repeatable questionnaires that can be filled in over a range of 30 days and are scheduled every 60 days.
Therefore, the answers from a single instance of a questionnaire are spread over a date range that is always smaller than 30 days, and the first answer to the following repeatable questionnaire is at least 31 days after the last answer of the previous one.
How do I create a view that calculates a score (which is basically the sum of the answers of a single questionnaire) over the values whose dates are within 30 days of the start date (min date)?
Table raw_data
------------------------------------------------
user_name | question_id | answer | answer_date |
------------------------------------------------
user001 | 1 | 2 | 2019-02-04 |
user001 | 2 | 1 | 2019-02-04 |
user001 | 3 | 2 | 2019-02-05 |
user001 | 4 | 2 | 2019-02-05 |
user001 | 5 | 2 | 2019-02-09 |
user002 | 1 | 2 | 2019-01-09 |
user002 | 2 | 2 | 2019-01-10 |
user002 | 3 | 1 | 2019-02-01 |
user002 | 4 | 2 | 2019-02-01 |
user002 | 5 | 1 | 2019-02-01 |
user002 | 1 | 2 | 2019-03-11 |
user002 | 2 | 2 | 2019-03-11 |
user002 | 3 | 1 | 2019-03-12 |
user002 | 4 | 1 | 2019-03-13 |
user002 | 5 | 1 | 2019-03-14 |
Expected result
------------------------------
user_name | sum | start_date |
------------------------------
user001 | 9 | 2019-02-04 |
user002 | 8 | 2019-01-09 |
user002 | 7 | 2019-03-11 |
The solution I tried works for the first group only:
SELECT user_name, SUM(answer::int),
CASE
WHEN answer_date - MIN(answer_date) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) < 30
THEN MIN(answer_date) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC)
ELSE answer_date END AS start_date
FROM public.raw_data
GROUP BY user_name, answer_date
It's a classical gaps-and-islands problem. You'll find a lot under the tag I added.
An optimized query for your case could look like:
SELECT user_name
, sum(answer)
, min(answer_date) AS start_date
FROM (
SELECT user_name, answer, answer_date
, count(*) FILTER (WHERE step) OVER (PARTITION BY user_name ORDER BY answer_date) AS grp
FROM (
SELECT user_name, answer, answer_date
, lag(answer_date) OVER (PARTITION BY user_name ORDER BY answer_date) < answer_date - 30 AS step
FROM raw_data
) sub1
) sub2
GROUP BY user_name, grp
ORDER BY user_name, start_date; -- ORDER BY optional
db<>fiddle here
Closely related, with more explanation:
How to group timestamps into islands (based on arbitrary gap)?
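Since the question asks for a view, the query above can simply be wrapped in CREATE VIEW. A minimal sketch, assuming the table and columns shown in the question (the view name is illustrative, not part of the original answer):
-- Sketch: the gaps-and-islands aggregation above, exposed as a view.
CREATE VIEW questionnaire_score AS
SELECT user_name
     , sum(answer) AS sum
     , min(answer_date) AS start_date
FROM (
   SELECT user_name, answer, answer_date
        , count(*) FILTER (WHERE step) OVER (PARTITION BY user_name ORDER BY answer_date) AS grp
   FROM (
      SELECT user_name, answer, answer_date
           , lag(answer_date) OVER (PARTITION BY user_name ORDER BY answer_date) < answer_date - 30 AS step
      FROM raw_data
      ) sub1
   ) sub2
GROUP BY user_name, grp;  -- order the result when querying the view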
Use lag() to find the gaps. Then a cumulative sum to assign a "question period" and then summarize:
select user_name, min(answer_date) as start_date, sum(answer)
from (select rd.*,
             count(*) filter (where prev_ad is null or prev_ad < answer_date - interval '30 day')
                 over (partition by user_name order by answer_date) as period
      from (select rd.*,
                   lag(answer_date) over (partition by user_name order by answer_date) as prev_ad
            from raw_data rd
           ) rd
     ) rd
group by user_name, period;
Thanks to @Gordon and to this answer, I eventually found the missing step to determine my groups on a date-range basis.
I will use the following query to create a view and SUM the answers grouping by grp2:
WITH query AS (
SELECT r.*,
SUM(CASE WHEN answer_date < prev_date + 30 THEN 0 ELSE 1 END) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) AS grp
FROM (SELECT r.*,
LAG(answer_date) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) AS prev_date
FROM raw_data r
) r
)
SELECT user_name, question_id, answer_date, answer, DENSE_RANK() OVER (ORDER BY user_name, grp) AS grp2
FROM query
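For reference, a minimal sketch of that last step, assuming the table from the question (the view name questionnaire_groups and the final aggregation are illustrative, not from the original post):
-- Sketch: wrap the grp2 query above in a view, then sum per island.
CREATE VIEW questionnaire_groups AS
WITH query AS (
    SELECT r.*,
           SUM(CASE WHEN answer_date < prev_date + 30 THEN 0 ELSE 1 END)
               OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) AS grp
    FROM (SELECT r.*,
                 LAG(answer_date) OVER (PARTITION BY user_name ORDER BY user_name ASC, answer_date ASC) AS prev_date
          FROM raw_data r
         ) r
)
SELECT user_name, question_id, answer_date, answer,
       DENSE_RANK() OVER (ORDER BY user_name, grp) AS grp2
FROM query;
-- Score per questionnaire instance: one row per (user_name, grp2).
SELECT user_name, SUM(answer) AS sum, MIN(answer_date) AS start_date
FROM questionnaire_groups
GROUP BY user_name, grp2
ORDER BY user_name, start_date;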
You can use a query with the row_number() window (analytic) function as below:
with raw_data( user_name, question_id, answer, answer_date ) as
(
select 'user001',1,2, '2019-02-04' union all
select 'user001',2,1, '2019-02-04' union all
select 'user001',3,2, '2019-02-05' union all
select 'user001',4,2, '2019-02-05' union all
select 'user001',5,2, '2019-02-09' union all
select 'user002',1,2, '2019-01-09' union all
select 'user002',2,2, '2019-01-10' union all
select 'user002',3,1, '2019-02-01' union all
select 'user002',4,2, '2019-02-01' union all
select 'user002',5,1, '2019-02-01' union all
select 'user002',1,2, '2019-03-11' union all
select 'user002',2,2, '2019-03-11' union all
select 'user002',3,1, '2019-03-12' union all
select 'user002',4,1, '2019-03-13' union all
select 'user002',5,1, '2019-03-14'
)
select user_name, sum(answer) as sum, min(answer_date) as start_date
from
(
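-- assumes each questionnaire instance answers every question_id exactly once;
-- rows sharing the same (user_name, rn) then belong to the same questionnaire instance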
select row_number() over (partition by question_id order by user_name, answer_date) as rn,
t.*
from raw_data t
) t
group by user_name, rn
order by rn;
user_name sum start_date
--------- --- ----------
user001 9 2019-02-04
user002 8 2019-01-09
user002 7 2019-03-11
Demo
I have this table
| date | id | number |
|------------|----|--------|
| 2021/05/01 | 1 | 10 |
| 2021/05/02 | 2 | 20 |
| 2021/05/03 | 3 | 30 |
| 2021/05/04 | 1 | 20 |
I am trying to write a query to have this other table
| date | id | number |
|------------|----|--------|
| 2021/05/01 | 1 | 10 |
| 2021/05/02 | 1 | 10 |
| 2021/05/02 | 2 | 20 |
| 2021/05/03 | 1 | 10 |
| 2021/05/03 | 2 | 20 |
| 2021/05/03 | 3 | 30 |
| 2021/05/04 | 1 | 20 |
| 2021/05/04 | 2 | 20 |
| 2021/05/04 | 3 | 30 |
The idea is that each date should have all the previously seen distinct ids with their numbers, and if an id is repeated then only the latest value should be considered.
One way is to expand out all the rows for each date. Then take the most recent value using qualify:
with t as (
select date '2021-05-01' as date, 1 as id, 10 as number union all
select date '2021-05-02' as date, 2 as id, 20 as number union all
select date '2021-05-03' as date, 3 as id, 30 as number union all
select date '2021-05-04' as date, 1 as id, 20 as number
)
select d.date, t.id, t.number
from t join
(select date
from (select min(date) as min_date, max(date) as max_date
from t
) tt cross join
unnest(generate_date_array(min_date, max_date, interval 1 day)) date
) d
on t.date <= d.date
where 1=1
qualify row_number() over (partition by d.date, t.id order by t.date desc) = 1
order by 1, 2, 3;
A more efficient method doesn't generate all the rows and then filter them. Instead, it just generates the rows that are needed by generating the appropriate dates within each row. That requires a couple of window functions to get the "next" date for each id and the maximum date in the data:
with t as (
select date '2021-05-01' as date, 1 as id, 10 as number union all
select date '2021-05-02' as date, 2 as id, 20 as number union all
select date '2021-05-03' as date, 3 as id, 30 as number union all
select date '2021-05-04' as date, 1 as id, 20 as number
)
select date, t.id, t.number
from (select t.*,
date_add(lead(date) over (partition by id order by date), interval -1 day) as next_date,
max(date) over () as max_date
from t
) t cross join
unnest(generate_date_array(date, coalesce(next_date, max_date))) date
order by 1, 2, 3;
Consider below [less verbose] approach
select t1.date, t2.id, t2.number
from (
select *, array_agg(struct(date, id,number)) over(order by date) arr
from `project.dataset.table`
) t1, unnest(arr) t2
where true
qualify row_number() over (partition by t1.date, t2.id order by t2.date desc) = 1
# order by date, id
If applied to the sample data in your question, the output matches the expected result shown above.
The following code
SELECT distinct DATE_PART('year',date) as year_date,
DATE_PART('month',date) as month_date,
count(prepare_first_buyer.person_id) as no_of_customers_month
FROM
(
SELECT DATE(bestelldatum) ,person_id
,ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY person_id)
FROM ani.bestellung
) prepare_first_buyer
WHERE row_number=1
GROUP BY DATE_PART('year',date),DATE_PART('month',date)
ORDER BY DATE_PART('year',date),DATE_PART('month',date)
gives this table back:
| year_date | month_date | no_of_customers_month |
|:--------- |:----------:| ---------------------:|
| 2017 | 1 | 2 |
| 2017 | 2 | 5 |
| 2017 | 3 | 4 |
| 2017 | 4 | 8 |
| 2017 | 5 | 1 |
| . | . | . |
| . | . | . |
where all three are numeric values.
I now need a new column where I sum up all the values from 'no_of_customers_month' for the previous 12 months.
e.g.
| year_date | month_date | no_of_customers_month | sum_12mon |
|:--------- |:----------:| :--------------------:|----------:|
| 2019 | 1 | 2 | 23 |
where 23 is the sum of 'no_of_customers_month' from 2019-1 back to 2018-1.
Thanks for the help.
You can use window functions:
SELECT DATE_TRUNC('month', date) as yyyymm,
COUNT(*) as no_of_customers_month,
       SUM(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', date) RANGE BETWEEN INTERVAL '11 month' PRECEDING AND CURRENT ROW) AS sum_12mon
FROM (SELECT DATE(bestelldatum), person_id,
ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY person_id)
FROM ani.bestellung
) b
WHERE row_number = 1
GROUP BY yyyymm
ORDER BY yyyymm;
Note: This uses date_trunc() to retrieve the year/month as a single date, which allows the use of a RANGE window frame. I also find one date column more convenient than having the year and month in separate columns.
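If you still want separate year and month columns in the output, you can extract them from the truncated date afterwards. A minimal sketch that simply wraps the query above in one more derived table (the column aliases are illustrative):
-- Sketch: recover year/month columns from the truncated month date.
SELECT DATE_PART('year', yyyymm) AS year_date,
       DATE_PART('month', yyyymm) AS month_date,
       no_of_customers_month,
       sum_12mon
FROM (
    SELECT DATE_TRUNC('month', date) AS yyyymm,
           COUNT(*) AS no_of_customers_month,
           SUM(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', date) RANGE BETWEEN INTERVAL '11 month' PRECEDING AND CURRENT ROW) AS sum_12mon
    FROM (SELECT DATE(bestelldatum), person_id,
                 ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY person_id)
          FROM ani.bestellung
         ) b
    WHERE row_number = 1
    GROUP BY yyyymm
) m
ORDER BY yyyymm;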
Older versions of Postgres (before 11) don't support RANGE frames with an offset. Assuming you have data for each month, you can use ROWS instead:
SELECT DATE_TRUNC('month', date) as yyyymm,
COUNT(*) as no_of_customers_month,
       SUM(COUNT(*)) OVER (ORDER BY DATE_TRUNC('month', date) ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS sum_12mon
FROM (SELECT DATE(bestelldatum), person_id,
ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY person_id)
FROM ani.bestellung
) b
WHERE row_number = 1
GROUP BY yyyymm
ORDER BY yyyymm;
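If some months can have no orders at all, the ROWS frame would silently span more than 12 calendar months. One workaround is to left join the monthly counts to a generated month series before applying the frame. A sketch only; the series bounds below are placeholders you would adapt to your data:
-- Sketch: pad missing months with 0 so the 12-row frame always covers 12 calendar months.
SELECT m.month_start AS yyyymm,
       COALESCE(c.no_of_customers_month, 0) AS no_of_customers_month,
       SUM(COALESCE(c.no_of_customers_month, 0))
           OVER (ORDER BY m.month_start ROWS BETWEEN 11 PRECEDING AND CURRENT ROW) AS sum_12mon
FROM generate_series(DATE '2017-01-01', DATE '2019-12-01', INTERVAL '1 month') AS m(month_start)
LEFT JOIN (
    SELECT DATE_TRUNC('month', date) AS yyyymm, COUNT(*) AS no_of_customers_month
    FROM (SELECT DATE(bestelldatum), person_id,
                 ROW_NUMBER() OVER (PARTITION BY person_id ORDER BY person_id)
          FROM ani.bestellung
         ) b
    WHERE row_number = 1
    GROUP BY yyyymm
) c ON c.yyyymm = m.month_start
ORDER BY m.month_start;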
Suppose I have a table of Events that lists a userId and the time the Event occurred:
+----+--------+----------------------------+
| id | userId | time |
+----+--------+----------------------------+
| 1 | 46 | 2020-07-22 11:22:55.307+00 |
| 2 | 190 | 2020-07-13 20:57:07.138+00 |
| 3 | 17 | 2020-07-11 11:33:21.919+00 |
| 4 | 46 | 2020-07-22 10:17:11.104+00 |
| 5 | 97 | 2020-07-13 20:57:07.138+00 |
| 6 | 17 | 2020-07-04 11:33:21.919+00 |
| 6 | 17 | 2020-07-11 09:23:21.919+00 |
+----+--------+----------------------------+
I want to get the list of events that had a previous event on the same day, by the same user. The result for the above table would be:
+----+--------+----------------------------+
| id | userId | time |
+----+--------+----------------------------+
| 1 | 46 | 2020-07-22 11:22:55.307+00 |
| 3 | 17 | 2020-07-11 11:33:21.919+00 |
+----+--------+----------------------------+
How can I perform a select query that filters results by evaluating them against other rows in the table?
This can be done using an EXISTS condition:
select t1.*
from the_table t1
where exists (select *
from the_table t2
where t2.userid = t1.userid -- for the same user
                and t2.time::date = t1.time::date -- on the same day
and t2.time < t1.time); -- but previously on that day
You can use lag():
select t.*
from (select t.*,
lag(time) over (partition by userid, time::date order by time) as prev_time
from t
) t
where prev_time is not null;
Here is a db<>fiddle.
Or row_number():
select t.*
from (select t.*,
row_number() over (partition by userid, time::date order by time) as seqnum
from t
) t
where seqnum >= 2;
You can use LAG() to find the previous row for a user. Then a simple comparison will tell whether it occurred on the same day or not.
For example:
select *
from (
select
*,
lag(time) over(partition by userId order by time) as prev_time
from t
) x
where time::date = prev_time::date
You can use the ROW_NUMBER() analytic function:
SELECT id , userId , time
FROM
(
SELECT ROW_NUMBER() OVER (PARTITION BY userId, date_trunc('day',time) ORDER BY time) AS rn,
       t.*
  FROM Events t
) q
WHERE rn > 1
in order to return every event that has an earlier event by the same userId on the same day.
I am trying to solve what at first looks like a simple task.
I have a transactions table.
| name |entity_id| amount | date |
|--------|---------|--------|------------|
| Github | 1 | 4.80 | 01/01/2014 |
| itunes | 2 | 2.80 | 22/01/2014 |
| Github | 1 | 4.80 | 01/02/2014 |
| Foods | 3 | 24.80 | 01/02/2014 |
| amazon | 4 | 14.20 | 01/03/2014 |
| amazon | 4 | 14.20 | 01/04/2014 |
I have to select the rows that repeat every month on the same day with the same amount for an entity_id (subscriptions). Thanks for the help.
If your date column is created as a date type, you could use a recursive CTE to collect the continuations, and after that eliminate duplicate rows with DISTINCT ON.
(You should also rename that column, because date is a reserved word in SQL.)
with recursive recurring as (
select name, entity_id, amount, date as first_date, date as last_date, 0 as lvl
from transactions
union all
select r.name, r.entity_id, r.amount, r.first_date, t.date, r.lvl + 1
from recurring r
join transactions t
on row(t.name, t.entity_id, t.amount, t.date - interval '1' month)
= row(r.name, r.entity_id, r.amount, r.last_date)
)
select distinct on (name, entity_id, amount) *
from recurring
order by name, entity_id, amount, lvl desc
SQLFiddle
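The query above also returns rows that never recur (lvl = 0). To keep only the actual subscriptions, you could restrict it to chains that continued for at least one more month; a sketch reusing the same CTE:
-- Sketch: same recursive CTE, filtered to chains with at least one monthly continuation.
with recursive recurring as (
  select name, entity_id, amount, date as first_date, date as last_date, 0 as lvl
  from transactions
  union all
  select r.name, r.entity_id, r.amount, r.first_date, t.date, r.lvl + 1
  from recurring r
  join transactions t
    on row(t.name, t.entity_id, t.amount, t.date - interval '1' month)
     = row(r.name, r.entity_id, r.amount, r.last_date)
)
select distinct on (name, entity_id, amount) *
from recurring
where lvl > 0
order by name, entity_id, amount, lvl desc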
Group it by the day of the month, for example:
select entity_id, amount, max(date), min(date), count(*)
from transactions
group by entity_id, amount, date_part('day', date)
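To keep only the entity/amount combinations that actually repeat, you could add a HAVING clause. A minimal sketch based on the query above (note it only checks that the same day of month occurs more than once, not that the months are consecutive):
-- Sketch: combinations that occur in more than one month on the same day of the month.
select entity_id, amount, date_part('day', date) as day_of_month,
       min(date) as first_date, max(date) as last_date, count(*) as occurrences
from transactions
group by entity_id, amount, date_part('day', date)
having count(*) > 1;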
I am having some issues with ranking some columns in Oracle. I have two columns I need to rank--a group id and a date.
I want to rank the table in two ways:
Rank the records in each GROUP_ID by DATETIME (RANK_1)
Rank the GROUP_IDs by their DATETIME, GROUP_ID (RANK_2)
It should look like this:
GROUP_ID | DATE | RANK_1 | RANK_2
----------|------------|-----------|----------
2 | 1/1/2012 | 1 | 1
2 | 1/2/2012 | 2 | 1
2 | 1/4/2012 | 3 | 1
3 | 1/1/2012 | 1 | 2
1 | 1/3/2012 | 1 | 3
I have been able to do the former, but have been unable to figure out the latter.
SELECT group_id,
datetime,
ROW_NUMBER() OVER (PARTITION BY group_id ORDER BY datetime) AS rn,
DENSE_RANK() OVER (ORDER BY group_id) AS rn2
FROM table_1
ORDER BY group_id;
This incorrectly orders the RANK_2 field:
GROUP_ID | DATE | RANK_1 | RANK_2
----------|------------|-----------|----------
1 | 1/3/2012 | 1 | 1
2 | 1/1/2012 | 1 | 2
2 | 1/2/2012 | 2 | 2
2 | 1/4/2012 | 3 | 2
3 | 1/1/2012 | 1 | 3
Assuming you don't have an actual id column in the table, it appears that you want to do the second rank by the earliest date in each group. This will require a nested subquery:
select group_id, datetime, rn,
dense_rank() over (order by EarliestDate, group_id) as rn2
from (SELECT group_id, datetime,
ROW_NUMBER() OVER (PARTITION BY group_id ORDER BY datetime) AS rn,
min(datetime) OVER (partition by group_id) as EarliestDate
FROM table_1
) t
ORDER BY group_id;