How to aggregate data into this form using SQL or Redshift?

For example, ATM machine fault data:
s_id: bank branch
atm_id: ATM identifier (multiple ATMs per branch)
start_time: when a ticket is created for a fault
end_time: when the ticket is closed
I want to aggregate the overlapping rows, grouping by s_id and atm_id.
Raw Data
Output Required

This looks like a gaps-and-islands problem, because you want to identify "islands" of s_id, atm_id, and status that are on adjacent rows.
That suggests the difference of row numbers for this incarnation:
select s_id, atm_id, status_code,
min(start_time), max(end_time)
from (select t.*,
row_number() over (partition by s_id, atm_id order by start_time) as seqnum,
row_number() over (partition by s_id, atm_id, status_code order by start_time) as seqnum_s
from t
) t
group by s_id, atm_id, status_code, (seqnum - seqnum_s);
Why this finds adjacent rows with the same status is a little tricky to explain. However, if you look at the results of the subquery, I think you will see how the difference identifies the adjacent rows that you want to combine together.
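To make the subquery a bit more concrete, here is a minimal Python sketch of what the two row_number() calls compute. The sample rows and the helper name island_keys are made up for illustration and are not from the question; the point is only that adjacent rows with the same status end up with the same (status, difference) pair.

```python
from collections import defaultdict

# Hypothetical sample rows: (s_id, atm_id, status_code, start_time)
rows = [
    (1, 10, "DOWN", 1),
    (1, 10, "DOWN", 2),
    (1, 10, "UP", 3),
    (1, 10, "DOWN", 4),
    (1, 10, "DOWN", 5),
]

def island_keys(rows):
    """Return (status_code, seqnum - seqnum_s) for each row in time order."""
    seqnum = defaultdict(int)    # row_number() per (s_id, atm_id)
    seqnum_s = defaultdict(int)  # row_number() per (s_id, atm_id, status_code)
    keys = []
    for s_id, atm_id, status, _start in sorted(rows, key=lambda r: (r[0], r[1], r[3])):
        seqnum[(s_id, atm_id)] += 1
        seqnum_s[(s_id, atm_id, status)] += 1
        diff = seqnum[(s_id, atm_id)] - seqnum_s[(s_id, atm_id, status)]
        keys.append((status, diff))
    return keys

for key in island_keys(rows):
    print(key)
```

The difference stays constant while the status repeats on adjacent rows and changes as soon as another status interrupts the run, which is why grouping by it separates the islands.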

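Perhaps a self-join with some aggregation:
-- For each row t1, take the widest span among the rows that overlap it.
-- Note: this only merges rows that overlap t1 directly, not longer chains of overlaps.
SELECT DISTINCT t1.s_id, t1.atm_id,
MIN(t2.start_time) as start_time,
MAX(t2.end_time) as end_time
FROM YourTable t1
LEFT JOIN YourTable t2
ON t2.s_id = t1.s_id
AND t2.atm_id = t1.atm_id
AND t2.start_time <= t1.end_time
AND t2.end_time >= t1.start_time
GROUP BY t1.s_id, t1.atm_id, t1.start_time, t1.end_time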

Related

How can I group rows in SQL based on a condition

I am using Redshift SQL and would like to combine users' overlapping voucher periods into a single row (showing the minimum start date and maximum end date).
For example, if I have these records,
I would like to achieve this result using Redshift.
The explanation is that since rows 1 and 2 have overlapping dates, I would like to combine them and get the min(start_date) and max(end_date).
I do not really know where to start. I tried using row_number to partition them, but it does not seem to work well. This is what I tried:
select
id,
start_date,
end_date,
lag(end_date, 1) over (partition by id order by start_date) as prev_end_date,
row_number() over (partition by id, (case when prev_end_date >= start_date then 1 else 0 end) order by start_date) as rn
from users
Are there any suggestions out there? Thank you kind sirs.
This is a type of gaps-and-islands problem. Because the dates are arbitrary, let me suggest the following approach:
Use a cumulative max to get the maximum end_date among the rows before the current one.
Use logic to determine when there is no overlap (i.e. a new period starts).
A cumulative sum of the starts provides an identifier for the group.
Then aggregate.
As SQL:
select id, min(start_date), max(end_date)
from (select u.*,
sum(case when prev_end_date >= start_date then 0 else 1
end) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and current row
) as grp
from (select u.*,
max(end_date) over (partition by id
order by start_date, voucher_code
rows between unbounded preceding and 1 preceding
) as prev_end_date
from users u
) u
) u
group by id, grp;
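The same steps can be sketched outside the database. This is a minimal Python illustration, assuming rows of (id, start_date, end_date); the function name merge_periods and the sample data are hypothetical. It keeps a running maximum of end_date, starts a new group whenever that maximum falls before the current start_date, and otherwise extends the current group.

```python
from datetime import date
from itertools import groupby

def merge_periods(rows):
    """rows: iterable of (id, start_date, end_date). Returns merged (id, start, end) tuples."""
    merged = []
    for key, grp in groupby(sorted(rows), key=lambda r: r[0]):
        prev_end = None  # running max of end_date over the earlier rows of this id
        for _id, start, end in grp:
            if prev_end is None or prev_end < start:
                merged.append([key, start, end])         # no overlap: a new period starts
            else:
                merged[-1][2] = max(merged[-1][2], end)  # overlap: extend the current period
            prev_end = end if prev_end is None else max(prev_end, end)
    return [tuple(m) for m in merged]

# Hypothetical sample data
sample = [
    (1, date(2020, 1, 1), date(2020, 1, 10)),
    (1, date(2020, 1, 5), date(2020, 1, 20)),
    (1, date(2020, 2, 1), date(2020, 2, 5)),
    (2, date(2020, 1, 1), date(2020, 1, 3)),
]
print(merge_periods(sample))
```

Using the running maximum rather than only the previous row's end_date matters when one long period contains several short ones.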
Another approach would be using recursive CTE:
Divide all rows into numbered partitions grouped by id and ordered by start_date and end_date
Iterate over them calculating group_start_date for each row (rows which have to be merged in final result would have the same group_start_date)
Finally you need to group the CTE by id and group_start_date taking max end_date from each group.
Here is corresponding sqlfiddle: http://sqlfiddle.com/#!18/7059b/2
And the SQL, just in case:
WITH cteSequencing AS (
-- Get Values Order
SELECT *, start_date AS group_start_date,
ROW_NUMBER() OVER (PARTITION BY id ORDER BY start_date, end_date) AS iSequence
FROM users),
Recursion AS (
-- Anchor - the first value in groups
SELECT *
FROM cteSequencing
WHERE iSequence = 1
UNION ALL
-- Remaining items
SELECT b.id, b.start_date, b.end_date,
CASE WHEN a.end_date > b.start_date THEN a.group_start_date
ELSE b.start_date
END
AS groupStartDate,
b.iSequence
FROM Recursion AS a
INNER JOIN cteSequencing AS b ON a.iSequence + 1 = b.iSequence AND a.id = b.id)
SELECT id, group_start_date AS start_date, MAX(end_date) AS end_date
FROM Recursion
GROUP BY id, group_start_date
ORDER BY id, group_start_date
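For readers who find the recursion hard to follow, here is a rough Python sketch of the same iteration for a single id. The function names and the integer "dates" are made up for illustration. Like the CTE above, it compares each row only with the row immediately before it, so it assumes no period is fully contained in an earlier, longer one.

```python
def assign_group_starts(rows):
    """rows: (start, end) pairs for ONE id. Returns (group_start, start, end) per row."""
    out = []
    prev_end = prev_group = None
    for start, end in sorted(rows):
        if prev_end is not None and prev_end > start:
            group = prev_group  # overlaps the previous row: inherit its group start
        else:
            group = start       # first row, or no overlap: start a new group
        out.append((group, start, end))
        prev_end, prev_group = end, group
    return out

def collapse(groups):
    """Final GROUP BY step: max end per group_start."""
    result = {}
    for group, _start, end in groups:
        result[group] = max(result.get(group, end), end)
    return result

# Hypothetical sample, using integers in place of dates
print(collapse(assign_group_starts([(1, 5), (4, 8), (10, 12)])))
```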

Redshift - Group Table based on consecutive rows

I am working right now with this table:
What I want to do is clean up this table a little by grouping some consecutive rows together.
Is there any way to achieve this kind of result?
The first table is already working fine; I just want to get rid of some rows to free up some disk space.
One method is to peek at the previous row to see when the value changes. Assuming that valid_to and valid_from are really dates:
select id, class, min(valid_from), max(valid_to)
from (select t.*,
sum(case when prev_valid_to >= valid_from + interval '-1 day' then 0 else 1 end) over (partition by id order by valid_to rows between unbounded preceding and current row) as grp
from (select t.*,
lag(valid_to) over (partition by id, class order by valid_to) as prev_valid_to
from t
) t
) t
group by id, class, grp;
If they are not dates, then this gets trickier. You could convert to dates. Or, you could use the difference of row_numbers:
select id, class, min(valid_from), max(valid_to)
from (select t.*,
row_number() over (partition by id order by valid_from) as seqnum,
row_number() over (partition by id, class order by valid_from) as seqnum_2
from t
) t
group by id, class, (seqnum - seqnum_2)

Count occurrences in a row using aggregate functions

Consider the following relation
Column measured_at holds thousands of different timestamps, and column cell_id holds the number of the cell tower used at each timestamp. For each day in measured_at, I want to find which cell tower has the most occurrences (i.e. was used the most on that day; the time of day is irrelevant, only the date matters). This can probably be done using window functions, but I want to do it using only aggregate functions and simple queries.
The output should look like this, for example:
cell_id measured_at
27997442 2015-12-22
because on 2015-12-22 tower number 27997442 was used the most.
You can use aggregation and distinct on. To get the counts:
select date_trunc('day', measured_at) as dte, cell_id, count(*) as cnt
from t
group by dte, cell_id
And then extend this for only one value:
select distinct on (date_trunc('day', measured_at)) date_trunc('day', measured_at) as dte, cell_id, count(*) as cnt
from t
group by dte, cell_id
order by date_trunc('day', measured_at), count(*) desc;
Of course, you can use window functions as well -- and that is a better approach if you want to get ties as well:
select dte, cell_id, cnt
from (select date_trunc('day', measured_at) as dte, cell_id, count(*) as cnt,
rank() over (partition by date_trunc('day', measured_at) order by count(*) desc) as seqnum
from t
group by dte, cell_id
) dc
where seqnum = 1;
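As a cross-check of the logic (count per day and tower, then keep the top count per day, ties included), here is a small Python sketch. The function name busiest_cell_per_day and the sample rows are hypothetical; only the tower number 27997442 and the date come from the question.

```python
from collections import Counter
from datetime import datetime, date

def busiest_cell_per_day(rows):
    """rows: iterable of (measured_at: datetime, cell_id). Returns {day: [top cell_ids]}."""
    counts = Counter((ts.date(), cell) for ts, cell in rows)  # group by day, cell_id
    best = {}
    for (day, cell), cnt in counts.items():
        top_cnt, cells = best.get(day, (0, []))
        if cnt > top_cnt:
            best[day] = (cnt, [cell])  # new leader for this day
        elif cnt == top_cnt:
            cells.append(cell)         # tie with the current leader
    return {day: sorted(cells) for day, (_cnt, cells) in best.items()}

# Hypothetical sample data
sample = [
    (datetime(2015, 12, 22, 8, 0), 27997442),
    (datetime(2015, 12, 22, 9, 30), 27997442),
    (datetime(2015, 12, 22, 10, 0), 111),
    (datetime(2015, 12, 23, 8, 0), 5),
    (datetime(2015, 12, 23, 9, 0), 6),
]
print(busiest_cell_per_day(sample))
```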

Count consecutive records with a timestamp interval requirement

Referring to this post (link), I used the answer provided by @Gordon Linoff:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
from (select t.*,
row_number() over (partition by taxi order by time) as seqnum,
row_number() over (partition by taxi, client order by time) as seqnum_c
from t
) t
group by t.taxi, t.client, (seqnum - seqnum_c)
having count(*) >= 2
)
group by taxi;
and got my answer perfectly like this:
Tom 3 (AA count as 1, AAA count as 1 and BB count as 1, so total of 3 count)
Bob 1
But now I would like to add one more condition: the time between two consecutive clients for the same taxi should not be longer than 2 hours.
I know that I should probably use row_number() again and calculate the time difference with datediff, but I have no idea where to add it or how to do it.
Any suggestions?
This requires a bit more logic. In this case, I would use lag() to calculate the groups:
select taxi, count(*)
from (select t.taxi, t.client, count(*) as num_times
from (select t.*,
sum(case when prev_client = client and
prev_time > time - interval '2 hour'
then 0 -- same client within 2 hours: stay in the current group
else 1 -- otherwise start a new group
end) over (partition by taxi order by time) as grp
from (select t.*,
lag(client) over (partition by taxi order by time) as prev_client,
lag(time) over (partition by taxi order by time) as prev_time
from t
) t
) t
group by t.taxi, t.client, grp
having count(*) >= 2
)
group by taxi;
Note: You don't specify the database, so this uses ISO/ANSI standard syntax for date/time comparisons. You can adjust this for your actual database.
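To sanity-check the grouping rule, here is a minimal Python sketch of the same idea, assuming rows of (taxi, client, time). The function name count_runs and the sample rides are made up. A run continues only while the client is unchanged and the ride starts less than 2 hours after the previous one (strictly less, mirroring the comparison in the query); runs with at least 2 rides are counted per taxi.

```python
from datetime import datetime, timedelta

def count_runs(rows, max_gap=timedelta(hours=2), min_len=2):
    """rows: iterable of (taxi, client, time). Returns {taxi: number of qualifying runs}."""
    by_taxi = {}
    for taxi, client, time in sorted(rows, key=lambda r: (r[0], r[2])):
        by_taxi.setdefault(taxi, []).append((client, time))

    result = {}
    for taxi, rides in by_taxi.items():
        runs, run_len = 0, 0
        prev_client = prev_time = None
        for client, time in rides:
            same_run = (prev_time is not None
                        and client == prev_client
                        and time - prev_time < max_gap)
            if same_run:
                run_len += 1
            else:
                if run_len >= min_len:  # close the previous run
                    runs += 1
                run_len = 1
            prev_client, prev_time = client, time
        if run_len >= min_len:          # close the last run
            runs += 1
        if runs:                        # taxis without a qualifying run are omitted
            result[taxi] = runs
    return result

# Hypothetical sample data
d = lambda h, m=0: datetime(2020, 1, 1, h, m)
sample = [
    ("Tom", "A", d(10)), ("Tom", "A", d(10, 30)), ("Tom", "B", d(11)),
    ("Tom", "A", d(12)), ("Tom", "A", d(15)),  # 3-hour gap breaks this pair
    ("Tom", "B", d(16)), ("Tom", "B", d(16, 30)),
    ("Bob", "C", d(9)), ("Bob", "D", d(10)),
]
print(count_runs(sample))
```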

Running count distinct

I am trying to see how the cumulative number of subscribers changed over time based on unique email addresses and date they were created. Below is an example of a table I am working with.
I am trying to turn it into the table below. The email 1@gmail.com was created twice and I would like to count it only once. I cannot figure out how to generate the running count distinct column.
Thanks for the help.
I would usually do this using row_number():
select date, count(*),
sum(count(*)) over (order by date),
sum(sum(case when seqnum = 1 then 1 else 0 end)) over (order by date)
from (select t.*,
row_number() over (partition by email order by date) as seqnum
from t
) t
group by date
order by date;
This is similar to the version using lag(). However, I get nervous using lag if the same email appears multiple times on the same date.
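The seqnum = 1 idea (count each email only on the date it first appears) can be sketched in Python as follows. The function name running_counts and the sample rows are hypothetical, and dates are plain ISO strings so that they sort correctly.

```python
from collections import Counter

def running_counts(rows):
    """rows: iterable of (date, email).
    Returns [(date, day_total, cumulative_total, cumulative_distinct)]."""
    per_day = Counter()     # count(*) per date
    first_seen = Counter()  # rows where seqnum = 1, per date
    seen = set()
    for day, email in sorted(rows):
        per_day[day] += 1
        if email not in seen:  # first appearance of this email
            seen.add(email)
            first_seen[day] += 1

    out, total, distinct = [], 0, 0
    for day in sorted(per_day):
        total += per_day[day]
        distinct += first_seen[day]
        out.append((day, per_day[day], total, distinct))
    return out

# Hypothetical sample data
sample = [
    ("2021-01-01", "1@gmail.com"),
    ("2021-01-01", "2@gmail.com"),
    ("2021-01-02", "1@gmail.com"),
    ("2021-01-02", "3@gmail.com"),
    ("2021-01-03", "1@gmail.com"),
]
for row in running_counts(sample):
    print(row)
```

Because an email is flagged only once, on its first row, a repeat on the same date does not inflate the distinct count.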
Getting the total count and cumulative count is straightforward. To get the cumulative distinct count, use lag to check whether the email had a row with a previous date, and if so set the flag to 0 so that it is ignored in the running sum.
select distinct dt
,count(*) over(partition by dt) as day_total
,count(*) over(order by dt) as cumsum
,sum(flag) over(order by dt) as cumdist
from (select t.*
,case when lag(dt) over(partition by email order by dt) is not null then 0 else 1 end as flag
from tbl t
) t
Here is a solution that uses neither sum over nor lag, and still produces the correct results.
Hence it may be simpler to read and to maintain.
select
t1.date_created,
(select count(*) from my_table where date_created = t1.date_created) emails_created,
(select count(*) from my_table where date_created <= t1.date_created) cumulative_sum,
(select count( distinct email) from my_table where date_created <= t1.date_created) running_count_distinct
from
(select distinct date_created from my_table) t1
order by 1