SQL Oracle - get min row with a sequence of equal values

I have a table similar to:
MY_DAT    | STATUS
==========|=======
1.1.2017  | A
2.1.2017  | A
3.1.2017  | A
4.1.2017  | B
5.1.2017  | B
6.1.2017  | A
7.1.2017  | C
8.1.2017  | A
9.1.2017  | A
10.1.2017 | A
I want a SQL query that, for a given date (MY_DAT), returns the minimum date of the uninterrupted run of rows with the same STATUS.
Example
MY_DAT = '1.1.2017' -> '1.1.2017',A
MY_DAT = '3.1.2017' -> '1.1.2017',A
MY_DAT = '10.1.2017' -> '8.1.2017',A
MY_DAT = '5.1.2017' -> '4.1.2017',B
I don't know what this SQL should look like.
EDIT
I need the result for every date. In this example the result has to be:
MY_DAT    | STATUS | BEGIN
==========|========|==========
1.1.2017  | A      | 1.1.2017
2.1.2017  | A      | 1.1.2017
3.1.2017  | A      | 1.1.2017
4.1.2017  | B      | 4.1.2017
5.1.2017  | B      | 4.1.2017
6.1.2017  | A      | 6.1.2017
7.1.2017  | C      | 7.1.2017
8.1.2017  | A      | 8.1.2017
9.1.2017  | A      | 8.1.2017
10.1.2017 | A      | 8.1.2017
ANSWER
select my_date, status,
       min(my_date) over (partition by grp, status) as begin
from (select my_date, status,
             row_number() over (order by my_date)
             - row_number() over (partition by status order by my_date) as grp
      from tbl) t
Thanks to Vamsi Prabhala

Use a difference-of-row-numbers approach to assign groups to consecutive rows with the same status (run the inner query to see this). After that, it is just a GROUP BY operation to get the min date.
select status, min(my_date)
from (select my_date, status,
             row_number() over (order by my_date)
             - row_number() over (partition by status order by my_date) as grp
      from tbl) t
group by grp, status
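If you want to experiment with this, here is a minimal self-contained sketch (schema and column types assumed from the question; Oracle syntax) that exposes the grp values the inner query computes:
create table tbl (my_date date, status varchar2(1));

insert into tbl values (date '2017-01-01', 'A');
insert into tbl values (date '2017-01-02', 'A');
insert into tbl values (date '2017-01-03', 'A');
insert into tbl values (date '2017-01-04', 'B');
insert into tbl values (date '2017-01-05', 'B');
insert into tbl values (date '2017-01-06', 'A');
insert into tbl values (date '2017-01-07', 'C');
insert into tbl values (date '2017-01-08', 'A');
insert into tbl values (date '2017-01-09', 'A');
insert into tbl values (date '2017-01-10', 'A');

-- grp is constant within each uninterrupted run of the same status:
-- 0 for 1..3 Jan (A), 3 for 4..5 Jan (B), 2 for 6 Jan (A), 6 for 7 Jan (C),
-- and 3 again for 8..10 Jan (A). Note that the B run and the last A run
-- share grp = 3, which is why the outer query groups by both grp and status.
select my_date, status,
       row_number() over (order by my_date)
       - row_number() over (partition by status order by my_date) as grp
from tbl
order by my_date;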

Please try this.
SELECT status, min(my_dat)
FROM dates
GROUP BY status
OK, then what about this?
SELECT *
FROM dates
INNER JOIN
(
  SELECT status, min(my_dat) AS min_dat
  FROM dates
  GROUP BY status
) sub
ON dates.status = sub.status

Related

How to combine Cross Join and String Agg in Bigquery with date time difference

I am trying to go from the following table
| user_id | touch      | Date       | Purchase Amount |
| 1       | Impression | 2020-09-12 | 0               |
| 1       | Impression | 2020-10-12 | 0               |
| 1       | Purchase   | 2020-10-13 | $125            |
| 1       | Email      | 2020-10-14 | 0               |
| 1       | Impression | 2020-10-15 | 0               |
| 1       | Purchase   | 2020-10-30 | 122             |
| 2       | Impression | 2020-10-15 | 0               |
| 2       | Impression | 2020-10-16 | 0               |
| 2       | Email      | 2020-10-17 | 0               |
to
| user_id | path                           | Number of days between First Touch and Purchase   | Purchase Amount |
| 1       | Impression,Impression,Purchase | 2020-10-13 (Purchase) - 2020-09-12 (Impression)   | $125            |
| 1       | Email,Impression,Purchase      | 2020-10-30 (Purchase) - 2020-10-14 (Email)        | $122            |
| 2       | Impression,Impression,Email    | 2020-12-31 (fixed date) - 2020-10-15 (Impression) | $0              |
In essence, I am trying to create a new row for each unique user in the table every time a 'Purchase' is encountered in the comma-separated string.
Also, I take the difference between the first touch and the first purchase for each unique user. When a new row is created we do the same for the same user, as shown in the example above.
From the little I have gathered, I need a mixture of CROSS JOIN and STRING_AGG, but I tried using a CASE statement within STRING_AGG and was not able to get the required result.
Is there a better way to do this in SQL (BigQuery)?
Thank you
Below is for BigQuery Standard SQL
#standardSQL
select user_id,
       string_agg(touch order by date) path,
       date_diff(max(date), min(date), day) days,
       sum(amount) amount
from (
  select user_id, touch, date, amount,
         countif(touch = 'Purchase') over win grp
  from `project.dataset.table`
  window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
If applied to the sample data from your question, the output is:
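Worked out by hand from the sample rows above, the grouping should produce:
| user_id | path                           | days | amount |
| 1       | Impression,Impression,Purchase | 31   | 125    |
| 1       | Email,Impression,Purchase      | 16   | 122    |
| 2       | Impression,Impression,Email    | 2    | 0      |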
Another change: in case there is no Purchase among the touches, we calculate the number of days up to a fixed date we have set. How can I add this to the query above?
select user_id,
       string_agg(touch order by date) path,
       -- both IF branches must be of type DATE, hence the DATE literal
       date_diff(if(countif(touch = 'Purchase') = 0, date '2020-12-31', max(date)), min(date), day) days,
       sum(amount) amount
from (
  select user_id, touch, date, amount,
         countif(touch = 'Purchase') over win grp
  from `project.dataset.table`
  window win as (partition by user_id order by date rows between unbounded preceding and 1 preceding)
)
group by user_id, grp
with this output:
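Again worked out by hand, only user 2's group changes, now measured against the fixed date:
| user_id | path                           | days | amount |
| 1       | Impression,Impression,Purchase | 31   | 125    |
| 1       | Email,Impression,Purchase      | 16   | 122    |
| 2       | Impression,Impression,Email    | 77   | 0      |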
This means you need a solution which splits a user's rows wherever there is a Purchase touch.
Use the following query:
select user_id,
       -- aggregation function according to your requirement,
       sum(purchase_amount)
from (select t.*,
             -- the frame excludes the current row so that each Purchase
             -- closes its own group instead of opening the next one
             sum(case when touch = 'Purchase' then 1 else 0 end)
               over (partition by user_id order by date
                     rows between unbounded preceding and 1 preceding) as sm
      from t) t
group by user_id, sm
We could approach this as a gaps-and-islands problem, where every island ends with a purchase. How do we define the groups? By counting how many purchases we have ahead (current row included), hence the descending sort in the query.
select user_id, string_agg(touch order by date),
       min(date) as first_date, max(date) as max_date,
       date_diff(max(date), min(date), day) as cnt_days
from (
  select t.*,
         countif(touch = 'Purchase') over (partition by user_id order by date desc) as grp
  from mytable t
) t
group by user_id, grp
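To see how the descending count forms the islands, here are the grp values this window assigns for user 1, worked out by hand (the count runs over rows with an equal or later date, current row included):
| touch      | date       | grp |
| Impression | 2020-09-12 | 2   |
| Impression | 2020-10-12 | 2   |
| Purchase   | 2020-10-13 | 2   |
| Email      | 2020-10-14 | 1   |
| Impression | 2020-10-15 | 1   |
| Purchase   | 2020-10-30 | 1   |
Each island ends with its own purchase, and user 2's rows, having no purchase at all, share grp = 0.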
You can create a value for each row that corresponds to the number of 'Purchase' rows seen so far, which can then be used to group on:
with r as (
  select row_number() over (order by t1.user_id, t1.date) rid, t1.*
  from mytable t1
)
select t3.user_id, group_concat(t3.touch), sum(t3.amount),
       date_diff(max(t3.date), min(t3.date))
from (select (select sum(r1.touch = 'Purchase' and r1.rid < r2.rid)
              from r r1) c1,
             r2.*
      from r r2
     ) t3
group by t3.user_id, t3.c1;

How can I filter duplicates/repeated fields in BigQuery?

I have a table without a primary key, and I am trying to get the events of the earliest date, grouped by id.
This is what a small piece of mytable looks like:
| id | date               | events      |
|----|--------------------|-------------|
| 1  | 2020-04-11 3:44:20 | call        |
| 3  | 2020-04-21 7:59:06 | appointment |
| 1  | 2020-04-17 1:14:32 | appointment |
| 2  | 2020-04-10 3:41:17 | feedback    |
| 1  | 2020-04-23 1:36:13 | appointment |
| 3  | 2020-04-12 4:55:38 | call        |
This is the result I am looking for:
| id | date               | events   |
|----|--------------------|----------|
| 1  | 2020-04-11 3:44:20 | call     |
| 2  | 2020-04-10 3:41:17 | feedback |
| 3  | 2020-04-12 4:55:38 | call     |
I am trying to get each id's events for its MIN(date) only, but the problem is that I have to SELECT events, which then forces me to add events to the GROUP BY, so I can't group by id alone as I would like to.
I have tried a lot of different versions, but here is one:
SELECT id, MIN(date), events
FROM mydataset.mytable
GROUP BY id, events
Please keep in mind that my table is much larger than this.
Any help would be very much appreciated.
You can use aggregation:
select array_agg(t order by date asc limit 1)[ordinal(1)].*
from mydataset.mytable t
group by t.id;
Or the more traditional method of using row_number():
select t.* except (seqnum)
from (select t.*, row_number() over (partition by id order by date) as seqnum
      from mydataset.mytable t
     ) t
where seqnum = 1;
You could modify what you have into an uncorrelated subquery:
select *
from mytable
where (id, date) in (select id, min(date)
                     from mytable
                     group by id);
If your DB supports window functions you could also do:
select distinct id,
       min(date) over (partition by id) date,
       first_value(events) over (partition by id order by date asc) events
from mytable;
Outputs
+----+---------------------+----------+
| id | date | events |
+----+---------------------+----------+
| 1 | 2020-04-11 03:44:20 | call |
| 2 | 2020-04-10 03:41:17 | feedback |
| 3 | 2020-04-12 04:55:38 | call |
+----+---------------------+----------+
A join to a derived table might perform better, especially if id and date are indexed:
select m.*
from mytable m
join (select id, min(date) date
      from mytable
      group by id) x
  on m.id = x.id
 and m.date = x.date;
To build on Gordon's answer, with Jones's comment:
the version below does not require using an alias and allows use of just id in the GROUP BY.
#standardSQL
SELECT AS VALUE ARRAY_AGG(t ORDER BY date LIMIT 1)[ORDINAL(1)]
FROM `project.dataset.table` t
GROUP BY id
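A self-contained way to try this, with the sample rows inlined in a CTE (column names and types assumed from the question):
#standardSQL
with mytable as (
  select 1 as id, timestamp '2020-04-11 03:44:20' as date, 'call' as events union all
  select 3, timestamp '2020-04-21 07:59:06', 'appointment' union all
  select 1, timestamp '2020-04-17 01:14:32', 'appointment' union all
  select 2, timestamp '2020-04-10 03:41:17', 'feedback' union all
  select 1, timestamp '2020-04-23 01:36:13', 'appointment' union all
  select 3, timestamp '2020-04-12 04:55:38', 'call'
)
-- SELECT AS VALUE unpacks the struct built by ARRAY_AGG(t), so the
-- result keeps the original three columns without any aliasing.
select as value array_agg(t order by date limit 1)[ordinal(1)]
from mytable t
group by id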

How to SELECT in SQL based on a value from the same table column?

I have the following table
| id | date | team |
|----|------------|------|
| 1 | 2019-01-05 | A |
| 2 | 2019-01-05 | A |
| 3 | 2019-01-01 | A |
| 4 | 2019-01-04 | B |
| 5 | 2019-01-01 | B |
How can I query the table to receive the most recent values for the teams?
For example, the result for the above table would be ids 1,2,4.
In this case, you can use window functions:
select t.*
from (select t.*, rank() over (partition by team order by date desc) as seqnum
      from t
     ) t
where seqnum = 1;
In some databases a correlated subquery is faster with the right indexes (I haven't tested this with Postgres):
select t.*
from t
where t.date = (select max(t2.date) from t t2 where t2.team = t.team);
And if you wanted only one row per team, then the canonical answer is:
select distinct on (t.team) t.*
from t
order by t.team, t.date desc;
However, that doesn't work in this case because you want all rows from the most recent date.
If your dataset is large, consider the max analytic function in a subquery:
with cte as (
  select id, date, team,
         max(date) over (partition by team) as max_date
  from t
)
select id
from cte
where date = max_date
Notionally, max is O(n), so it should be pretty efficient. I don't pretend to know the actual implementation on PostgreSQL, but my guess is it's O(n).
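A quick self-contained check (sample data assumed from the question; PostgreSQL syntax):
create table t (id int, date date, team text);

insert into t values
  (1, '2019-01-05', 'A'),
  (2, '2019-01-05', 'A'),
  (3, '2019-01-01', 'A'),
  (4, '2019-01-04', 'B'),
  (5, '2019-01-01', 'B');

-- max(date) over (partition by team) tags every row with its team's
-- latest date; keeping rows where date = max_date yields ids 1, 2 and 4.
with cte as (
  select id, date, team,
         max(date) over (partition by team) as max_date
  from t
)
select id
from cte
where date = max_date;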
One more possibility, generic:
select *
from t
join (select max(date) date, team
      from t
      group by team) tt
using (date, team)
A window function is the best solution for you:
select id
from (
  select team, id, rank() over (partition by team order by date desc) as row_num
  from t
) t
where row_num = 1
That query will return this table:
| id |
|----|
| 1 |
| 2 |
| 4 |
If you want to get one row per team, you need to use the array_agg function:
select team, array_agg(id) ids
from (
  select team, id, rank() over (partition by team order by date desc) as row_num
  from t
) t
where row_num = 1
group by team
That query will return this table:
| team | ids |
|------|--------|
| A | [1, 2] |
| B | [4] |

Count values checking if consecutive

This is my table:
Event     | Order           | Timestamp
----------|-----------------|-------------------------
delFailed | 281475031393706 | 2018-07-24T15:48:08.000Z
reopen    | 281475031393706 | 2018-07-24T15:54:36.000Z
reopen    | 281475031393706 | 2018-07-24T15:54:51.000Z
I need to count the 'delFailed' and 'reopen' events to calculate #delFailed - #reopen.
The difficulty is that two identical consecutive events must not both be counted, so in this case the result should be 0, not -1.
This is what I have achieved so far (which is wrong, because it gives me -1 instead of 0 due to the two consecutive "reopen" events):
with events as (
  select event as events,
         orders,
         "timestamp"
  from main_source_execevent
  where orders = '281475031393706'
    and event in ('reopen', 'delFailed')
  order by "timestamp"
),
count_events as (
  select count(events) as CEvents,
         events,
         orders
  from events
  group by orders, events
)
select ((select cevents from count_events where events = 'delFailed')
        - (select cevents from count_events where events = 'reopen')) as nAttempts,
       orders
from count_events
group by orders
How can I count two identical consecutive events only once?
This is a gaps-and-islands problem; you can use two row numbers to detect runs of identical consecutive events.
Explanation:
one row number is computed over all rows;
another row number is computed per Event.
SELECT *
FROM (
  SELECT *,
         ROW_NUMBER() OVER (ORDER BY Timestamp) grp,
         ROW_NUMBER() OVER (PARTITION BY Event ORDER BY Timestamp) rn
  FROM T
) t1
| event | Order | timestamp | grp | rn |
|-----------|-----------------|----------------------|-----|----|
| delFailed | 281475031393706 | 2018-07-24T15:48:08Z | 1 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:36Z | 2 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:51Z | 3 | 2 |
Once you create those two row numbers you get the result above; then grp - rn tells you whether consecutive rows are the same event.
SELECT *, grp - rn
FROM (
  SELECT *,
         ROW_NUMBER() OVER (ORDER BY Timestamp) grp,
         ROW_NUMBER() OVER (PARTITION BY Event ORDER BY Timestamp) rn
  FROM T
) t1
| event | Order | timestamp | grp | rn | grp-rn |
|-----------|-----------------|----------------------|-----|----|----------|
| delFailed | 281475031393706 | 2018-07-24T15:48:08Z | 1 | 1 | 0 |
| reopen | 281475031393706 | 2018-07-24T15:54:36Z | 2 | 1 | 1 |
| reopen | 281475031393706 | 2018-07-24T15:54:51Z | 3 | 2 | 1 |
You can see that when there are two identical consecutive events, the grp - rn value is the same, so we can group by the grp - rn column (and Event) and count each run once.
Final query:
CREATE TABLE T(
Event VARCHAR(50),
"Order" VARCHAR(50),
Timestamp Timestamp
);
INSERT INTO T VALUES ('delFailed',281475031393706,'2018-07-24T15:48:08.000Z');
INSERT INTO T VALUES ('reopen',281475031393706,'2018-07-24T15:54:36.000Z');
INSERT INTO T VALUES ('reopen',281475031393706,'2018-07-24T15:54:51.000Z');
Query 1:
SELECT SUM(CASE WHEN event = 'delFailed' THEN 1 END) -
       SUM(CASE WHEN event = 'reopen' THEN 1 END) result
FROM (
  SELECT Event, COUNT(DISTINCT Event)
  FROM (
    SELECT *,
           ROW_NUMBER() OVER (ORDER BY Timestamp) grp,
           ROW_NUMBER() OVER (PARTITION BY Event ORDER BY Timestamp) rn
    FROM T
  ) t1
  GROUP BY grp - rn, Event
) t1
Results:
| result |
|--------|
| 0 |
I would just use lag() to get the first event in any sequence of similar values. Then do the calculation:
select sum( (event = 'reopen')::int ) as num_reopens,
       sum( (event = 'delFailed')::int ) as num_delFailed
from (select mse.*,
             lag(event) over (partition by orders order by "timestamp") as prev_event
      from main_source_execevent mse
      where orders = '281475031393706' and
            event in ('reopen', 'delFailed')
     ) e
where prev_event <> event or prev_event is null;
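To finish the calculation from the question, subtract the two sums. A sketch (PostgreSQL, adapted to the T table created above, so the column names differ from the lag() answer) that keeps only the first event of each run:
select sum((event = 'delFailed')::int)
       - sum((event = 'reopen')::int) as nattempts
from (select t.*,
             lag(event) over (partition by "Order" order by "timestamp") as prev_event
      from T t
     ) e
where prev_event is distinct from event;
-- returns 0 for the sample rows: the second consecutive 'reopen' is dropped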

Select from results of query?

I have a query like this:
SELECT Weighings.Member, MIN(Sessions.DateTime) AS FirstDate, MAX(Sessions.DateTime) AS LastDate
FROM Weighings AS Weighings INNER JOIN
Sessions ON Sessions.SessionGUID = Weighings.Session
WHERE (Sessions.DateTime >= '01/01/2011')
GROUP BY Weighings.Member
ORDER BY Weighings.Member
It returns this:
Member | FirstDate | LastDate
-------|-----------|---------
Blah   | 01/01/11  | 06/07/11
Blah2  | 02/03/11  | 05/07/11
I need to get the value of the Weight_kg cell in the Weighings table for the returned FirstDate and LastDate values, to give results like so:
Member | FirstWeight | LastWeight
-------|-------------|-----------
Blah   | 150 kg      | 60 kg
Blah2  | 70 kg       | 72 kg
I have tried all combinations of things but just can't work it out. Any ideas?
EDIT
Tables:
Sessions
______________________
SessionGUID | DateTime
----------------------
12432524325 | 01/01/11
12432524324 | 01/08/11
12432524323 | 01/15/11
34257473563 | 03/05/11
43634574545 | 06/07/11
Weighings
_____________________________________
Member | Session | Weight_kg
-------------------------------------
vffd8fdg87f | 12432524325 | 150
vffd8fdg87f | 12432524324 | 120
vffd8fdg87f | 12432524323 | 110
ddffv89sdv8 | 34257473563 | 124
32878vfdsv8 | 43634574545 | 75
;with C as
(
  select W.Member,
         W.Weight_kg,
         row_number() over(partition by W.Member order by S.datetime desc) as rnLast,
         row_number() over(partition by W.Member order by S.datetime asc) as rnFirst
  from Weighings as W
  inner join Sessions as S
    on S.sessionguid = W.Session and
       S.DateTime >= '20110101'
)
select CF.Member,
       CF.Weight_kg as FirstWeight,
       CL.Weight_kg as LastWeight
from C as CF
inner join C as CL
  on CF.Member = CL.Member
where CF.rnFirst = 1 and
      CL.rnLast = 1
Try here: https://data.stackexchange.com/stackoverflow/q/118518/
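To try it against the sample tables from the EDIT (T-SQL; column types assumed for illustration):
create table Sessions (SessionGUID varchar(20), [DateTime] datetime);
create table Weighings (Member varchar(20), Session varchar(20), Weight_kg int);

insert into Sessions values
  ('12432524325', '20110101'), ('12432524324', '20110108'),
  ('12432524323', '20110115'), ('34257473563', '20110305'),
  ('43634574545', '20110607');

insert into Weighings values
  ('vffd8fdg87f', '12432524325', 150), ('vffd8fdg87f', '12432524324', 120),
  ('vffd8fdg87f', '12432524323', 110), ('ddffv89sdv8', '34257473563', 124),
  ('32878vfdsv8', '43634574545', 75);

-- Running the CTE query above should then give, worked out by hand:
-- vffd8fdg87f 150/110, ddffv89sdv8 124/124, 32878vfdsv8 75/75.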
You can use the RANK() ... OVER statement (works only on SQL Server 2005+):
select st.Member, st.Weight_kg, en.Weight_kg
from
(
  select Member, Weight_kg, RANK() OVER (PARTITION BY Member ORDER BY Weight_kg) rnk
  from Weighings
) st
inner join
(
  select Member, Weight_kg, RANK() OVER (PARTITION BY Member ORDER BY Weight_kg DESC) rnk
  from Weighings
) en on en.Member = st.Member and st.rnk = 1 and en.rnk = 1
You have two possibilities.
If you want to reuse the first SELECT several times, I'd suggest creating a temporary table:
CREATE TEMPORARY TABLE `tmpTable` AS SELECT /*the first select*/ ;
/*and then*/
SELECT * FROM `tmpTable` /*the second select from the first select*/
If you require the first select only once:
SELECT first.*
FROM (SELECT /*the first select*/) AS first