SQL / BigQuery: aggregate all entries between two strings in a column

I have to solve this problem within BigQuery. I have this data in my table:
event            | time
-----------------|-----
start            | 1
end              | 2
random_event_X   | 3
start            | 4
error_X          | 5
error_Y          | 6
end              | 7
start            | 8
error_F          | 9
start            | 10
random_event_Y   | 11
error_z          | 12
end              | 13
Working backwards from each end event, I would like to record everything until the matching start appears, and then count each resulting sequence. Anything can happen between start and end, and outside of them. If there is an end, there is a start; but if there is a start, there is not necessarily an end.
The desired output would be:
string_agg | count
"start, end" | 1
"start, error_X, error_Y, end" | 1
"start, random_event_Y, error_z, end" | 1
So: everything between each start and end, but only when the start has a matching end. That excludes the random_event_X at time 3, the start at time 8, and the error_F at time 9.
I was not able to find a solution and have struggled to understand how to approach this problem. Any help or advice is welcome.

Below is for BigQuery Standard SQL
#standardSQL
SELECT agg_events, COUNT(1) cnt
FROM (
  SELECT STRING_AGG(event ORDER BY time) agg_events,
         COUNTIF(event IN ('start', 'end')) flag
  FROM (
    SELECT *, COUNTIF(event = 'start') OVER(PARTITION BY grp1 ORDER BY time) grp2
    FROM (
      SELECT *, COUNTIF(event = 'end') OVER(ORDER BY time DESC) grp1
      FROM `project.dataset.table`
    )
  )
  GROUP BY grp1, grp2
)
WHERE flag = 2
GROUP BY agg_events
Applied to the sample data from your question, the result is:
Row  agg_events                        cnt
1    start,random_event_Y,error_z,end  1
2    start,error_X,error_Y,end         1
3    start,end                         1
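For readers who prefer to trace the logic procedurally, here is a minimal Python sketch of the same idea, using the sample data from the question. It discards a start that never reaches an end (a later start restarts the run) and ignores events outside any start/end pair; the function name and list layout are illustrative, not from the answer above.

```python
from collections import Counter

# (event, time) rows from the question
events = [
    ("start", 1), ("end", 2), ("random_event_X", 3),
    ("start", 4), ("error_X", 5), ("error_Y", 6), ("end", 7),
    ("start", 8), ("error_F", 9),
    ("start", 10), ("random_event_Y", 11), ("error_z", 12), ("end", 13),
]

def sessions(rows):
    """Count every completed start..end run; unfinished runs are dropped."""
    out, current = [], None
    for event, _time in sorted(rows, key=lambda r: r[1]):
        if event == "start":
            current = ["start"]        # a new start discards any unfinished run
        elif current is not None:
            current.append(event)
            if event == "end":
                out.append(",".join(current))
                current = None
    return Counter(out)

print(sessions(events))
```

Running this reproduces the three sequences above, each with count 1.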

SQL tables represent unordered sets -- this is particularly true in massively parallel, columnar databases such as BigQuery.
So, I have to assume that you have some other column that specifies the ordering. If so, you can use a cumulative sum to identify the groups and then aggregation:
select grp,
       string_agg(event, ',' order by time)
from (select t.*,
             countif(event = 'start') over (order by time) as grp
      from t
     ) t
group by grp
order by min(time);
Note: I would also advise you to use array_agg() instead of string_agg(). Arrays are generally easier to work with than strings.
EDIT:
I see, you only want events up to the end. In that case, add another level of window functions:
select grp,
       string_agg(event, ',' order by <ordering col>)
from (select t.*,
             max(case when event = 'end' then time end) over (partition by grp) as max_end_time
      from (select t.*,
                   countif(event = 'start') over (order by <ordering col>) as grp
            from t
           ) t
     ) t
where max_end_time is null or time <= max_end_time
group by grp
order by min(<ordering col>);


How would I extract only the latest week from a select over statement in Hiveql?

I need some help. I've created a query that keeps a running total of whether an element returns a 1 or 0 against a specific measure, with the running total resetting to 0 whenever the measure returns a 0. Example below:
year_week  element  measure  running_total
2020_40    A        1        1
2020_41    A        1        2
2020_42    A        1        3
2020_43    A        0        0
2020_44    A        1        1
2020_45    A        1        2
2020_40    B        1        1
2020_41    B        1        2
2020_42    B        1        3
2020_43    B        1        4
2020_44    B        1        5
2020_45    B        1        6
The above is achieved using this query:
SELECT element,
       year_week,
       measure,
       SUM(measure) OVER (PARTITION BY element, flag_sum
                          ORDER BY year_week
                          ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS running_total
FROM (
  SELECT *,
         SUM(measure_flag) OVER (PARTITION BY element
                                 ORDER BY year_week
                                 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS flag_sum
  FROM (
    SELECT *,
           CASE WHEN measure = 1 THEN 0 ELSE 1 END AS measure_flag
    FROM database.table
  ) x
) y
This is great and works - but I'd like to return only the latest week's data for each element. So in the above example it would be:
year_week  element  measure  running_total
2020_45    A        1        2
2020_45    B        1        6
Essentially I need to keep the logic the same but limit the returned data set. I've attempted this; however, it changes the result from the correct running total to a 1 or 0.
Any help is greatly appreciated!
You can add another level of nesting, and filter the latest record per element with row_number().
I would suggest:
select element, year_week, measure, running_total
from (
  select t.*,
         row_number() over(partition by element, grp order by year_week) - 1 as running_total
  from (
    select t.*,
           sum(1 - measure) over(partition by element order by year_week) as grp,
           row_number() over(partition by element order by year_week desc) as rn
    from mytable t
  ) t
) t
where rn = 1
I simplified the query a little, given that measure has only the values 0 and 1, as shown in your sample data. If that's not the case, then:
select element, year_week, measure, running_total
from (
  select t.*,
         sum(measure) over(partition by element, grp order by year_week) as running_total
  from (
    select t.*,
           sum(case when measure = 0 then 1 else 0 end) over(partition by element order by year_week) as grp,
           row_number() over(partition by element order by year_week desc) as rn
    from mytable t
  ) t
) t
where rn = 1
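The general version of the query uses only standard window functions, so it can be sanity-checked from Python with SQLite (the table name mytable comes from the answer; the rows are the sample data from the question):

```python
import sqlite3

# build the sample table in an in-memory SQLite database
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE mytable (year_week TEXT, element TEXT, measure INT);
INSERT INTO mytable VALUES
 ('2020_40','A',1),('2020_41','A',1),('2020_42','A',1),
 ('2020_43','A',0),('2020_44','A',1),('2020_45','A',1),
 ('2020_40','B',1),('2020_41','B',1),('2020_42','B',1),
 ('2020_43','B',1),('2020_44','B',1),('2020_45','B',1);
""")

rows = con.execute("""
SELECT element, year_week, measure, running_total
FROM (
  SELECT t.*,
         SUM(measure) OVER (PARTITION BY element, grp ORDER BY year_week) AS running_total
  FROM (
    SELECT t.*,
           SUM(CASE WHEN measure = 0 THEN 1 ELSE 0 END)
             OVER (PARTITION BY element ORDER BY year_week) AS grp,
           ROW_NUMBER() OVER (PARTITION BY element ORDER BY year_week DESC) AS rn
    FROM mytable t
  ) t
) t
WHERE rn = 1
ORDER BY element
""").fetchall()

print(rows)  # [('A', '2020_45', 1, 2), ('B', '2020_45', 1, 6)]
```

This matches the latest-week output in the question (A ends the streak at 2, B at 6). SQLite 3.25+ is required for window functions.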

increment if not same value of next column in SQL

I am trying to use ROW_NUMBER in SQL. However, it's not giving the desired output.
Data :
ID   Name  Output should be
111  A     1
111  B     2
111  C     3
111  C     3
111  A     4
222  A     1
222  A     1
222  B     2
222  C     3
222  B     4
222  B     4
This is a gaps-and-islands problem. As a starter: for the question to make sense, you need a column that defines the ordering of the rows - I assumed ordering_id. Then, I would recommend lag() to get the "previous" name, and a cumulative sum() that increases every time the name changes between adjacent rows:
select id, name,
       sum(case when name = lag_name then 0 else 1 end) over(partition by id order by ordering_id) as rn
from (
  select t.*, lag(name) over(partition by id order by ordering_id) lag_name
  from mytable t
) t
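Since this uses only lag() and a windowed sum, it can be checked in SQLite from Python. The ordering_id values and the table name mytable are assumptions carried over from the answer:

```python
import sqlite3

# sample data from the question, with an assumed ordering_id per id
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE mytable (id INT, ordering_id INT, name TEXT);
INSERT INTO mytable VALUES
 (111,1,'A'),(111,2,'B'),(111,3,'C'),(111,4,'C'),(111,5,'A'),
 (222,1,'A'),(222,2,'A'),(222,3,'B'),(222,4,'C'),(222,5,'B'),(222,6,'B');
""")

rows = con.execute("""
SELECT id, name,
       SUM(CASE WHEN name = lag_name THEN 0 ELSE 1 END)
         OVER (PARTITION BY id ORDER BY ordering_id) AS rn
FROM (
  SELECT t.*, LAG(name) OVER (PARTITION BY id ORDER BY ordering_id) AS lag_name
  FROM mytable t
) t
ORDER BY id, ordering_id
""").fetchall()

print([r[2] for r in rows])  # [1, 2, 3, 3, 4, 1, 1, 2, 3, 4, 4]
```

Note that the first row of each id gets 1 because name = NULL is never true, so the CASE falls through to the else branch.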
SQL Server 2008 makes this much trickier. You can identify the adjacent rows using a difference of rows numbers. Then you can assign the minimum id in each island and use dense_rank():
select t.*,
       dense_rank() over (partition by name order by min_ordcol) as output
from (select t.*,
             min(<ordcol>) over (partition by name, seqnum - seqnum_2) as min_ordcol
      from (select t.*,
                   row_number() over (partition by name order by <ordcol>) as seqnum,
                   row_number() over (partition by name, id order by <ordcol>) as seqnum_2
            from t
           ) t
     ) t;

How to count repeating values in a column in PostgreSQL?

Hi, I have a table like the one below, and I want to count the repeating values in the status column. I don't want to count overall duplicates; for example, I just want to count how many times "Offline" appears until the value changes to "Idle".
This is the result I wanted. Thank you.
This is often called gaps-and-islands.
One way to do it is with two sequences of row numbers.
Examine each intermediate result of the query to understand how it works.
WITH CTE_rn AS
(
  SELECT
    status
    ,dt
    ,ROW_NUMBER() OVER (ORDER BY dt) AS rn1
    ,ROW_NUMBER() OVER (PARTITION BY status ORDER BY dt) AS rn2
  FROM T
)
SELECT
  status
  ,COUNT(*) AS cnt
FROM CTE_rn
GROUP BY
  status
  ,rn1 - rn2
ORDER BY
  min(dt)
;
Result
| status  | cnt |
|---------|-----|
| offline | 2   |
| idle    | 1   |
| offline | 2   |
| idle    | 1   |
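The difference-of-row-numbers trick is easy to verify on a small dataset. Here is a minimal Python/SQLite reproduction; the dt values are invented to produce the island pattern shown in the result above:

```python
import sqlite3

# four islands: offline x2, idle, offline x2, idle
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE T (status TEXT, dt INT);
INSERT INTO T VALUES
 ('offline',1),('offline',2),('idle',3),('offline',4),('offline',5),('idle',6);
""")

rows = con.execute("""
WITH CTE_rn AS (
  SELECT status, dt,
         ROW_NUMBER() OVER (ORDER BY dt) AS rn1,
         ROW_NUMBER() OVER (PARTITION BY status ORDER BY dt) AS rn2
  FROM T
)
SELECT status, COUNT(*) AS cnt
FROM CTE_rn
GROUP BY status, rn1 - rn2
ORDER BY MIN(dt)
""").fetchall()

print(rows)  # [('offline', 2), ('idle', 1), ('offline', 2), ('idle', 1)]
```

rn1 - rn2 is constant within each island of identical consecutive statuses, which is what makes the GROUP BY split them apart.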
WITH
cte1 AS (
  SELECT status,
         "date",
         workstation,
         CASE WHEN status = LAG(status) OVER (PARTITION BY workstation ORDER BY "date")
              THEN 0
              ELSE 1 END AS changed
  FROM test
),
cte2 AS (
  SELECT status,
         "date",
         workstation,
         SUM(changed) OVER (PARTITION BY workstation ORDER BY "date") AS group_num
  FROM cte1
)
SELECT status, COUNT(*) AS "count", workstation, MIN("date") AS "from", MAX("date") AS "till"
FROM cte2
GROUP BY group_num, status, workstation;

SQL Window Function - Number of Rows since last Max

I am trying to create a SQL query that will return the number of rows since the last maximum value, within a window function over the last 5 rows. In the example below it would return 2 for row 8: the max value is 12, which is 2 rows from row 8.
For row 6 it would return 5, because the max value of 7 is 5 rows away.
| ID | Date     | Amount |
| 1  | 1/1/2019 | 7      |
| 2  | 1/2/2019 | 3      |
| 3  | 1/3/2019 | 4      |
| 4  | 1/4/2019 | 1      |
| 5  | 1/5/2019 | 1      |
| 6  | 1/6/2019 | 12     |
| 7  | 1/7/2019 | 2      |
| 8  | 1/8/2019 | 4      |
I tried the following:
SELECT ID, date,
       MAX(amount) OVER (ORDER BY date ASC ROWS 5 PRECEDING) mymax
FROM tbl
This gets me to the max values but I am unable to efficiently determine how many rows away it is. I was able to get close using multiple variables within the SELECT but this did not seem efficient or scalable.
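Reading the expected values in the question (5 for row 6, 2 for row 8) as the distance back to the most recent maximum among the five rows preceding the current one, the target calculation can be written down directly in Python. This is a reference sketch of the requirement, not a SQL solution; the function name and window size are assumptions:

```python
amounts = [7, 3, 4, 1, 1, 12, 2, 4]  # Amount column, in date order

def rows_since_window_max(values, window=5):
    """For each row, how many rows back the max of the previous `window` rows sits."""
    out = []
    for i in range(len(values)):
        lo = max(0, i - window)
        win = values[lo:i]                # the rows strictly before row i
        if not win:
            out.append(None)              # first row has no preceding rows
            continue
        m = max(win)
        # most recent occurrence of that max, so ties resolve to the nearest row
        j = max(k for k in range(lo, i) if values[k] == m)
        out.append(i - j)
    return out

result = rows_since_window_max(amounts)
print(result)  # row 6 (index 5) -> 5, row 8 (index 7) -> 2
```

Having a plain-language reference like this makes it easier to judge whether a candidate SQL query matches the intent, since the SQL answers below differ slightly in how they treat the current row and ties.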
You can calculate the cumulative maximum and then use row_number() on that.
So:
select t.*,
row_number() over (partition by running_max order by date) as rows_since_last_max
from (select t.*,
max(amount) over (order by date rows between 5 preceding and current row) as running_max
from tbl t
) t;
I think this works for your sample data. It might not work if you have duplicates.
In that case, you can use date arithmetic:
select t.*,
datediff(day,
max(date) over (partition by running_max order by date),
date
) as days_since_most_recent_max5
from (select t.*,
max(amount) over (order by date rows between 5 preceding and current row) as running_max
from tbl t
) t;
EDIT:
Here is an example using row number:
select t.*,
(seqnum - max(case when amount = running_max then seqnum end) over (partition by running_max order by date)) as rows_since_most_recent_max5
from (select t.*,
max(amount) over (order by date rows between 5 preceding and current row) as running_max,
row_number() over (order by date) as seqnum
from tbl t
) t;
It would be:
select *,
       ID - (
         SELECT ID
         FROM (
           SELECT ID, amount, Maxamount = q.mymax
           FROM Table_4
         ) AS derived
         WHERE amount = Maxamount
       ) as result
from (
  SELECT ID, date,
         MAX(amount) OVER (ORDER BY date ASC ROWS 5 PRECEDING) mymax
  FROM Table_4
) as q

SQL: count last equal values

I need to solve this problem in pure SQL. I have to count all the records with a specific value: in my table there is a column flag with values 0 or 1. I need to count all the 1s after the last 0, and sum the amount column values of those records.
Example:
Flag | Amount
0 | 5
1 | 8
0 | 10
1 | 20
1 | 30
Output:
2 | 50
If the last value is 0 I don't need to do anything.
I should add that I need the query to be fast (ideally scanning the table only once).
I assumed that your example table is logically ordered by Amount. Then you can do this:
select count(*) as cnt,
       sum(Amount) as Amount
from yourTable
where Amount > (select max(Amount) from yourTable where Flag = 0)
If the biggest value is from a row where Flag = 0 then nothing will be returned.
If your table may not contain any zeros, then you are safer with:
select count(*) as cnt, sum(Amount) as Amount
from t
where Amount > all (select Amount from t where Flag = 0)
Or, using window functions:
select count(*) as cnt, sum(amount) as amount
from (select t.*, max(case when flag = 0 then amount end) over () as flag0_amount
      from t
     ) t
where flag0_amount is null or amount > flag0_amount
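With the sample data from the question, the window-function variant can be checked in SQLite. Note the inner max(...) needs an OVER () clause to be evaluated as a window function rather than an aggregate; that is included below:

```python
import sqlite3

# sample data: the last 0 is at amount 10, so the 1s after it are 20 and 30
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE t (Flag INT, Amount INT);
INSERT INTO t VALUES (0,5),(1,8),(0,10),(1,20),(1,30);
""")

row = con.execute("""
SELECT COUNT(*) AS cnt, SUM(Amount) AS Amount
FROM (
  SELECT t.*, MAX(CASE WHEN Flag = 0 THEN Amount END) OVER () AS flag0_amount
  FROM t
) t
WHERE flag0_amount IS NULL OR Amount > flag0_amount
""").fetchone()

print(row)  # (2, 50)
```

This matches the expected output "2 | 50", and the IS NULL branch covers a table with no zeros at all.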
I found the solution myself:
select decode(lv, 0, 0, tot - prog) somma, decode(lv, 0, 0, cnt - myrow) count
from (
  select * from (
    select pan, dt, flag, am,
           last_value(flag) over() lv,
           row_number() over (order by dt) as myrow,
           count(*) over() cnt,
           case when lead(flag) over (order by dt) != flag then rownum end as change,
           sum(am) over() tot,
           sum(am) over(order by dt) prog
    from test
    where pan = :pan and dt > :dt and flag is not null
    order by dt
  ) t
  where change is not null
  order by change desc
)
where rownum = 1