I am working with a data set that looks like this:
MTD | ID | Active
-----------------------
01-APR-16 | A | y
01-MAY-16 | A | y
01-JUN-16 | A | n
01-JUL-16 | A | y
01-AUG-16 | A | n
01-APR-16 | B | n
01-MAY-16 | B | y
01-JUN-16 | B | y
01-JUL-16 | B | y
01-AUG-16 | B | y
I would like to add a count column to the data set that counts the number of times an ID has been active ('y') AFTER the current MTD. The desired output is:
MTD | ID | Active | COUNT
-------------------------------
01-APR-16 | A | y | 2
01-MAY-16 | A | y | 1
01-JUN-16 | A | n | 1
01-JUL-16 | A | y | 0
01-AUG-16 | A | n | 0
01-APR-16 | B | n | 4
01-MAY-16 | B | y | 3
01-JUN-16 | B | y | 2
01-JUL-16 | B | y | 1
01-AUG-16 | B | y | 0
The query I am thinking of is:
SELECT
MTD,
ID,
ACTIVE,
SUM(CASE WHEN MTD > (current record's MTD)
AND ACTIVE = 'y' THEN 1 ELSE 0 END)
OVER (PARTITION BY ID)
as COUNT
I'm not sure how to compare each record's MTD to the current record's MTD in the window sum. How can I amend the first line of the case statement?
Thank you,
Ryan Barker
Use count() over() with a range specification so that, for each id, you look only at the rows following the current row and count the ones whose active flag is 'y'. This assumes mtd is a date column, so the ordering works (and the range offset of 1 is interpreted as one day).
SELECT
MTD,
ID,
ACTIVE,
COUNT(case when active = 'y' then 1 end)
    OVER (partition by ID
          order by mtd
          range between 1 following and unbounded following) as cnt
FROM your_table
To me, it looks like you want to sum the number of rows with a "y" in reverse and then subtract the current row's own flag. Something like this:
select t.*,
       (sum(case when active = 'y' then 1 else 0 end) over (partition by id order by mtd desc) -
        case when active = 'y' then 1 else 0 end) as cnt
from t;
Your idea is quite close; you just need an order by in the over() clause so the sum becomes a running sum.
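If you would rather keep the shape of your original query, the same SUM(CASE ...) works once you add an explicit frame. This is only a sketch of that amendment, assuming the table is called your_table and MTD is a DATE column:
SELECT
    MTD,
    ID,
    ACTIVE,
    COALESCE(
        SUM(CASE WHEN ACTIVE = 'y' THEN 1 ELSE 0 END)
            OVER (PARTITION BY ID
                  ORDER BY MTD
                  ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING),
        0) AS cnt   -- the frame is empty for the last row per ID, hence the COALESCE
FROM your_table;
ROWS BETWEEN 1 FOLLOWING AND UNBOUNDED FOLLOWING restricts the sum to the rows strictly after the current one, which is exactly the comparison your CASE expression was trying to express.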
Having the following fields in a table...
+---------+---+---+
| myTime | x | y |
+---------+---+---+
| 13:00 | 0 | 0 |
| 13:05 | 2 | 1 |
| 13:10 | 4 | 2 |
| 13:15 | 1 | 3 |
+---------+---+---+
I need to generate a third one (z) as follows...
+---------+---+---+---+
| myTime | x | y | z |
+---------+---+---+---+
| 13:00 | 0 | 0 | 0 |
| 13:05 | 2 | 1 | 1 |
| 13:10 | 4 | 2 | 3 |
| 13:15 | 1 | 3 | 1 |
+---------+---+---+---+
In the first row z will have a value of 0 and in the next ones, z will be calculated as x-y + (previous row's) z.
I've tried using the row number for each record and LAG to try reading values from previous rows...
WITH rows_sorted AS
(SELECT *, ROW_NUMBER() OVER (ORDER BY myTime) AS row_num
FROM table)
SELECT myTime, x, y,
IF(row_num = 1, 0, x - y + LAG(z, 1) OVER (ORDER BY row_num)) AS z
FROM rows_sorted
ORDER BY row_num
...but evidently it wouldn't work, since in LAG(z, 1) the column z has not been generated yet.
Any suggestion on how such a thing can be done? I'm using standard SQL in Google BigQuery.
Thanks in advance
Since the example above oversimplifies the real calculation, here is a closer version of what I need to achieve:
+---------+----+----+----+
| myTime | x | y | z |
+---------+----+----+----+
| 13:00 | 15 | 22 | 0 |
| 13:05 | 7 | 21 | 0 |
| 13:10 | 7 | 5 | 2 |
| 13:15 | 9 | 16 | 0 |
| 13:20 | 14 | 5 | 9 |
+---------+----+----+----+
Where z for each row is calculated as follows:
WHEN row_number() = 1 THEN z = 0 (already achieved thanks to the answer below)
WHEN x + (previous row's) z < y THEN z = 0
WHEN x + (previous row's) z >= y THEN z = x + (previous row's) z - y
Hmmm . . . You can get what you want using:
select t.*,
sum(x - y) over (order by mytime) as z
from t;
The first row has values of 0 for all the columns, so this works for your sample data. If you wanted to explicitly set it to 0, then:
select t.*,
       (case when row_number() over (order by mytime) = 1
then 0
else sum(x - y) over (order by mytime) - first_value(x - y) over (order by mytime)
end) as z
from t;
This subtracts the first row's value from the cumulative sum. However, for your sample data that seems unnecessary.
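The clamped rule from your edit (z falls back to 0 whenever x plus the previous z is less than y) cannot be expressed as a plain window sum, because each z depends on the already-clamped previous z. One option is a recursive CTE. The following is only a sketch, assuming a source table named my_table with the columns from your example and that recursive CTEs are available in your BigQuery project:
WITH RECURSIVE rows_sorted AS (
  SELECT myTime, x, y, ROW_NUMBER() OVER (ORDER BY myTime) AS row_num
  FROM my_table
),
calc AS (
  -- anchor: the first row always gets z = 0
  SELECT myTime, x, y, row_num, 0 AS z
  FROM rows_sorted
  WHERE row_num = 1
  UNION ALL
  -- recursive step: z = x + previous z - y, clamped at 0
  SELECT r.myTime, r.x, r.y, r.row_num,
         GREATEST(r.x + c.z - r.y, 0) AS z
  FROM rows_sorted r
  JOIN calc c
    ON r.row_num = c.row_num + 1
)
SELECT myTime, x, y, z
FROM calc
ORDER BY row_num;
On the second sample this walks the rows in order and reproduces z = 0, 0, 2, 0, 9.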
I've got a table in a Redshift database that contains intervals which are grouped and that potentially overlap, like so:
| interval_id | l | u | group |
| ----------- | -- | -- | ----- |
| 1 | 1 | 10 | A |
| 2 | 2 | 5 | A |
| 3 | 5 | 15 | A |
| 4 | 26 | 30 | B |
| 5 | 28 | 35 | B |
| 6 | 30 | 31 | B |
| 7 | 44 | 45 | B |
| 8 | 56 | 58 | C |
What I would like to do is to determine the length of the union of the intervals within group. That is, for each interval take u - l, sum over all group members and then subtract off the length of the overlaps between the intervals.
Desired result:
| group | length |
| ----- | ------ |
| A | 14 |
| B | 10 |
| C | 2 |
This question has been asked before, alas it seems that all of the solutions in that thread use features that Redshift doesn't support.
This is not difficult but requires multiple steps. The key is to define the "islands" within each group and then aggregate over those. Lots of subqueries, aggregations, and window functions.
-- "group" is a reserved word, so the column is assumed to be named groupId here
select groupId, sum(ul) as length
from (select groupId, (max(u) - min(l)) as ul   -- island length = max(u) - min(l), matching the desired output
      from (select t.*,
                   sum(case when prev_max_u < l then 1 else 0 end)
                       over (partition by groupId
                             order by l
                             rows between unbounded preceding and current row) as grp
            from (select t.*,
                         -- running max of u over all earlier intervals in the same group
                         max(u) over (partition by groupId
                                      order by l
                                      rows between unbounded preceding and 1 preceding) as prev_max_u
                  from t
                 ) t
           ) t
      group by groupId, grp
     ) g
group by groupId;
The idea is to determine whether each record overlaps something before it in its group. For that, it takes the cumulative max of u over all preceding records within the group and compares that previous max with the current l: when the previous max is smaller than l, a new island starts. The cumulative sum of those island starts is what defines the islands (grp).
The rest is just aggregation. And more aggregation.
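To make the island step concrete, here is a hand-traced view of the intermediate columns for group B of the sample data (not query output):
| l  | u  | prev_max_u | island start | grp |
| -- | -- | ---------- | ------------ | --- |
| 26 | 30 |            | 0            | 0   |
| 28 | 35 | 30         | 0            | 0   |
| 30 | 31 | 35         | 0            | 0   |
| 44 | 45 | 35         | 1            | 1   |
Island 0 spans 26 to 35 (length 9) and island 1 spans 44 to 45 (length 1), which add up to the expected 10 for group B.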
Assuming I have a data table
date | user_id | user_last_name | order_id | is_new_session
------------+------------+----------------+-----------+---------------
2014-09-01 | A | B | 1 | t
2014-09-01 | A | B | 5 | f
2014-09-02 | A | B | 8 | t
2014-09-01 | B | B | 2 | t
2014-09-02 | B | test | 3 | t
2014-09-03 | B | test | 4 | t
2014-09-04 | B | test | 6 | t
2014-09-04 | B | test | 7 | f
2014-09-05 | B | test | 9 | t
2014-09-05 | B | test | 10 | f
I want to add another column in Redshift that assigns a session number to each user's sessions. It starts at 1 for each user's first record; moving down the table, it increments whenever it encounters a true in the "is_new_session" column, stays the same on a false, and resets to 1 when it hits a new user. The ideal output for this table would be:
1
1
2
1
2
3
4
4
5
5
In my mind it's kind of the opposite of a SUM(1) over (Partition BY user_id, is_new_session ORDER BY user_id, date ASC)
Any ideas?
Thanks!
I think you want an incremental sum:
select t.*,
sum(case when is_new_session then 1 else 0 end) over (partition by user_id order by date) as session_number
from t;
In Redshift, an aggregate window function with an ORDER BY generally needs an explicit frame clause:
select t.*,
sum(case when is_new_session then 1 else 0 end) over
(partition by user_id
order by date
rows between unbounded preceding and current row
) as session_number
from t;
I have a theoretical question, so I'm not interested in alternative solutions. Sorry.
Q: Is it possible to get the window running function values for all previous rows, except current?
For example:
with
t(i,x,y) as (
values
(1,1,1),(2,1,3),(3,1,2),
(4,2,4),(5,2,2),(6,2,8)
)
select
t.*,
sum(y) over (partition by x order by i) - y as sum,
max(y) over (partition by x order by i) as max,
count(*) filter (where y > 2) over (partition by x order by i) as cnt
from
t;
Actual result is
i | x | y | sum | max | cnt
---+---+---+-----+-----+-----
1 | 1 | 1 | 0 | 1 | 0
2 | 1 | 3 | 1 | 3 | 1
3 | 1 | 2 | 4 | 3 | 1
4 | 2 | 4 | 0 | 4 | 1
5 | 2 | 2 | 4 | 4 | 1
6 | 2 | 8 | 6 | 8 | 2
(6 rows)
I want to have max and cnt columns behavior like sum column, so, result should be:
i | x | y | sum | max | cnt
---+---+---+-----+-----+-----
1 | 1 | 1 | 0 | | 0
2 | 1 | 3 | 1 | 1 | 0
3 | 1 | 2 | 4 | 3 | 1
4 | 2 | 4 | 0 | | 0
5 | 2 | 2 | 4 | 4 | 1
6 | 2 | 8 | 6 | 4 | 1
(6 rows)
It can be achieved using a simple subquery, like
select t.*, lag(y,1) over (partition by x order by i) as yy from t
but is it possible using only window function syntax, without subqueries?
Yes, you can. This does the trick:
with
t(i,x,y) as (
values
(1,1,1),(2,1,3),(3,1,2),
(4,2,4),(5,2,2),(6,2,8)
)
select
t.*,
sum(y) over w as sum,
max(y) over w as max,
count(*) filter (where y > 2) over w as cnt
from t
window w as (partition by x order by i
rows between unbounded preceding and 1 preceding);
The frame_clause selects just those rows from the window frame that you are interested in.
Note that in the sum column you'll get null rather than 0 because of the frame clause: the first row of each partition has no rows before it, so its frame is empty. You can coalesce() this away if needed.
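For instance, the same query with only the sum column coalesced (a small variation, nothing else changed):
with
  t(i,x,y) as (
    values
      (1,1,1),(2,1,3),(3,1,2),
      (4,2,4),(5,2,2),(6,2,8)
  )
select
  t.*,
  coalesce(sum(y) over w, 0) as sum,  -- 0 instead of null on each partition's first row
  max(y) over w as max,
  count(*) filter (where y > 2) over w as cnt
from t
window w as (partition by x order by i
             rows between unbounded preceding and 1 preceding);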
My objective is to make dynamic groups of lines (of products, by TYPE & COLOR in fact).
I don't know if it's possible with just one select query.
But: I want to create groups of lines (a PRODUCT is a TYPE and a COLOR) according to the NB_PER_GROUP column, and I want to do this grouping in date order (ORDER BY DATE).
A single leftover product (for example one row of a TYPE/COLOR whose NB_PER_GROUP is 2) is excluded from the final result.
Table:
-----------------------------------------------
NUM | TYPE | COLOR | NB_PER_GROUP | DATE
-----------------------------------------------
0 | 1 | 1 | 2 | ...
1 | 1 | 1 | 2 |
2 | 1 | 2 | 2 |
3 | 1 | 2 | 2 |
4 | 1 | 1 | 2 |
5 | 1 | 1 | 2 |
6 | 4 | 1 | 3 |
7 | 1 | 1 | 2 |
8 | 4 | 1 | 3 |
9 | 4 | 1 | 3 |
10 | 5 | 1 | 2 |
Results:
------------------------
GROUP_NUMBER | NUM |
------------------------
0 | 0 |
0 | 1 |
~~~~~~~~~~~~~~~~~~~~~~~~
1 | 2 |
1 | 3 |
~~~~~~~~~~~~~~~~~~~~~~~~
2 | 4 |
2 | 5 |
~~~~~~~~~~~~~~~~~~~~~~~~
3 | 6 |
3 | 8 |
3 | 9 |
If you have another way to solve this problem, I will accept it.
What about something like this?
select max(gn.group_number) as group_number, ip.num
from products ip
join (
    select date, type, color,
           row_number() over (order by date) - 1 as group_number
    from (
        select op.num, op.type, op.color, op.nb_per_group, op.date,
               (row_number() over (partition by op.type, op.color
                                   order by op.date) - 1) % op.nb_per_group as group_order
        from products op
    ) sq
    where sq.group_order = 0
) gn
  on ip.type = gn.type
 and ip.color = gn.color
 and ip.date >= gn.date
group by ip.num
order by group_number, ip.num
This may only work if your nb_per_group values are the same for each combination of type and color. It may also require unique dates, but that could probably be worked around if required. Note that it also keeps the leftover rows that don't fill a complete group (NUM 7 and 10 in your sample), so an extra filtering step is needed to drop them; see the sketch after the explanation below.
The innermost subquery partitions the rows by type and color, orders them by date, and takes the 0-based row number modulo nb_per_group; this gives a counter within each type/color combination that returns to 0 every time a group fills up.
The next-level subquery keeps only the rows where that counter is 0 (the group starters) and numbers them in date order.
Finally, the outermost query ties each row in the products table to a group number, calculated as the highest group number among starters of the same type and color whose date is at or before the product's date.
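As noted above, this keeps the leftover rows (NUM 7 and 10). A hedged sketch of one way to drop them, wrapping the query above in a CTE and comparing each group's size with its nb_per_group (the CTE names grouped and group_sizes are just illustrative):
with grouped as (
    -- the query from above, producing (group_number, num)
    select max(gn.group_number) as group_number, ip.num
    from products ip
    join (
        select date, type, color,
               row_number() over (order by date) - 1 as group_number
        from (
            select op.num, op.type, op.color, op.nb_per_group, op.date,
                   (row_number() over (partition by op.type, op.color
                                       order by op.date) - 1) % op.nb_per_group as group_order
            from products op
        ) sq
        where sq.group_order = 0
    ) gn
      on ip.type = gn.type
     and ip.color = gn.color
     and ip.date >= gn.date
    group by ip.num
),
group_sizes as (
    -- actual size of each group vs. the size it should have
    select g.group_number,
           count(*) as group_size,
           max(p.nb_per_group) as expected_size
    from grouped g
    join products p on p.num = g.num
    group by g.group_number
)
select g.group_number, g.num
from grouped g
join group_sizes s on s.group_number = g.group_number
where s.group_size = s.expected_size
order by g.group_number, g.num;
If you need the surviving group numbers to stay contiguous after the incomplete ones are removed, a final dense_rank() over (order by group_number) can renumber them.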