Full outer join on a table with itself and run some window functions - sql

Background
I have an ETL job that processes real-time log files hourly. Whenever the system generates a new event, it takes a snapshot of the historical event summary (if one exists) and records it together with the current event. The data is then loaded into Redshift.
Example
The table looks something like this:
+------------+--------------+---------+-----------+-------+-------+
| current_id | current_time | past_id | past_time | freq1 | freq2 |
+------------+--------------+---------+-----------+-------+-------+
| 2 | time2 | 1 | time1 | 13 | 5 |
| 3 | time3 | 1 | time1 | 13 | 5 |
| 3 | time3 | 2 | time2 | 2 | 1 |
| 4 | time4 | 1 | time1 | 13 | 5 |
| 4 | time4 | 2 | time2 | 2 | 1 |
| 4 | time4 | 3 | time3 | 1 | 1 |
+------------+--------------+---------+-----------+-------+-------+
This is what happened for the above table:
time1: event 1 happened. The system took a snapshot, but nothing was recorded.
time2: event 2 happened. The system took a snapshot and recorded event 1.
time3: event 3 happened. The system took a snapshot and recorded events 1 & 2.
time4: event 4 happened. The system took a snapshot and recorded events 1, 2 & 3.
Desired Outcome
I will need to transform the data into the following format in order to do some analysis:
+----+------------+-------+-------+
| id | event_time | freq1 | freq2 |
+----+------------+-------+-------+
| 1 | time1 | 0 | 0 |
| 2 | time2 | 13 | 5 | -- 13 | 5
| 3 | time3 | 15 | 6 | -- 13 + 2 | 5 + 1
| 4 | time4 | 16 | 7 | -- 15 + 1 | 6 + 1
+----+------------+-------+-------+
Basically, the new freq1 and freq2 are the cumulative sums of the lagged freq1 and freq2 values.
My Idea
I am thinking of a self full outer join on current_id = past_id to first achieve the following intermediate result:
+----+------------+-------+-------+
| id | event_time | freq1 | freq2 |
+----+------------+-------+-------+
| 1 | time1 | 13 | 5 |
| 2 | time2 | 2 | 1 |
| 3 | time3 | 1 | 1 |
| 4 | time4 | null | null |
+----+------------+-------+-------+
Then I can apply a lag() over () window function followed by a sum() over ().
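A minimal sketch of that idea, reusing the example table name t and the columns above; the distinct subqueries and the coalesce are my own additions to collapse the repeated past rows (which is likely what produces the duplicates mentioned in the Question section below):

with events as (
    select coalesce(c.current_id, p.past_id)     as id,
           coalesce(c.current_time, p.past_time) as event_time,
           p.freq1, p.freq2
    from (select distinct current_id, current_time from t) c
    full outer join (select distinct past_id, past_time, freq1, freq2 from t) p
      on c.current_id = p.past_id
),
lagged as (
    select id, event_time,
           lag(freq1) over (order by event_time) as prev_freq1,
           lag(freq2) over (order by event_time) as prev_freq2
    from events
)
select id, event_time,
       coalesce(sum(prev_freq1) over (order by event_time rows unbounded preceding), 0) as freq1,
       coalesce(sum(prev_freq2) over (order by event_time rows unbounded preceding), 0) as freq2
from lagged;

On the example data this yields 0, 13, 15 and 16 for freq1 (and 0, 5, 6, 7 for freq2), matching the desired outcome.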
Question
Is this the correct approach? Is there a more efficient way to do this? This is just a small sample of the actual data, so performance could be a concern.
My query is always returning a lot of duplicated values, so I am not sure what went wrong.
Solution
The answer from @GordonLinoff is correct for the above use case. I am adding some minor updates in order to get it working on my actual table. The only difference is that my event IDs are 36-character Java UUIDs and the event_time values are timestamps.
select distinct past_id, past_time, 0 as freq1, 0 as freq2
from (
      select past_id, past_time,
             row_number() over (partition by current_id order by current_time desc) as seqnum
      from t
     ) a
where a.seqnum = 1
union all
select current_id, current_time,
       sum(freq1) over (order by current_time rows unbounded preceding) as freq1,
       sum(freq2) over (order by current_time rows unbounded preceding) as freq2
from (
      select current_id, current_time, freq1, freq2,
             row_number() over (partition by current_id order by past_id desc) as seqnum
      from t
     ) b
where b.seqnum = 1;

I'm thinking you want union all along with window functions. Here is an example:
select min(past_id) as id, min(past_time) as event_time, 0 as freq1, 0 as freq2
from t
union all
(select current_id, current_time,
        sum(freq1) over (order by current_time),
        sum(freq2) over (order by current_time)
 from (select current_id, current_time, freq1, freq2,
              row_number() over (partition by current_id order by past_id desc) as seqnum
       from t
      ) t
 where seqnum = 1
);

Given the way your data sits in your snapshot table, I think the following SQL should give you what you are looking for in the desired outcome you posted:
SELECT 1 AS id
      ,'time1' AS event_time
      ,0 AS freq1
      ,0 AS freq2
UNION
SELECT T.current_id AS id
      ,T.current_time AS event_time
      ,SUM(T.freq1) AS freq1
      ,SUM(T.freq2) AS freq2
FROM snapshot AS T
GROUP BY T.current_id
        ,T.current_time
The first SELECT in the above UNION is there so that you can get the first record for time1, since it does not really have an entry in your base table which holds all the snapshots. It does not have a FROM clause in it since we are only selecting constants; if Redshift does not support that, you might need to look for something equivalent to the DUAL table in Oracle.
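For what it's worth, Redshift does accept a SELECT with no FROM clause for constant expressions, so the first branch should run as written; a quick sanity check (the literal is just a placeholder for the real timestamp):

-- Redshift allows a FROM-less SELECT of constants:
SELECT 1 AS id, 'time1' AS event_time, 0 AS freq1, 0 AS freq2;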
Hope this helps.

Related

BigQuery: delete duplicated rows that are not fully duplicated (delete the desired row)

I have a table recording customer steps on a daily basis. The table has Id, Date and Step columns. Some rows contain different steps on the same day for the same Id, as shown below on 5/3/2020 and 5/4/2020 for Id 1:
| Id | Date | Step |
|:-----|:---------|:-----|
| 1 | 5/1/2020 | 1 |
| 1 | 5/2/2020 | 1 |
| 1 | 5/3/2020 | 0 |
| 1 | 5/3/2020 | 5 |
| 1 | 5/4/2020 | 2 |
| 1 | 5/4/2020 | 10 |
| 1 | 5/5/2020 | 1 |
| 2 | 5/1/2020 | 1 |
| 2 | 5/2/2020 | 2 |
| 2 | 5/3/2020 | 0 |
I want to delete the rows that contain the lower step, which is the 0-step row on 5/3/2020 and the 2-step row on 5/4/2020 for Id 1.
I had tried using row_number() like this:
SELECT
    Id,
    Date,
    step,
    ROW_NUMBER() OVER (PARTITION BY Id, Date ORDER BY Id, Date) AS rn
FROM
    `dataset.step`
WHERE rn > 1
But that will give me the rows with the higher step, which is not what I want.
I am also able to select the rows with the lower step like this:
SELECT *
FROM `dataset.step` AS A
INNER JOIN `dataset.step` AS B
    ON A.Id = B.Id
   AND A.Date = B.Date
WHERE A.step < B.step
But I have found no way to use it for a delete.
Use the approach below:
select *
from your_table
qualify 1 = row_number() over win
window win as (partition by id, date order by step desc)
If applied to the sample data in your question, this keeps only the row with the highest step for each (Id, Date).
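Since the end goal is a delete rather than a select, a hedged DELETE variant of the same "keep the highest step per (Id, Date)" rule, assuming `dataset.step` is the actual table, could look like this:

-- Hypothetical DELETE: removes every row for which a higher step exists
-- on the same Id and Date (mirrors the self-join in the question)
DELETE FROM `dataset.step` t
WHERE EXISTS (
  SELECT 1
  FROM `dataset.step` s
  WHERE s.Id = t.Id
    AND s.Date = t.Date
    AND s.Step > t.Step
);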

How to add records for each user based on another existing row in BigQuery?

Posting here in case someone with more knowledge than me may be able to help me with some direction.
I have a table like this:
| Row | date     | user id | score |
|-----|----------|---------|-------|
| 1   | 20201120 | 1       | 26    |
| 2   | 20201121 | 1       | 14    |
| 3   | 20201125 | 1       | 0     |
| 4   | 20201114 | 2       | 32    |
| 5   | 20201116 | 2       | 0     |
| 6   | 20201120 | 2       | 23    |
However, from this I need to have a record for each user for each day: if a day is missing for a user, the last recorded score should be carried forward, so I would end up with something like this:
| Row | date     | user id | score |
|-----|----------|---------|-------|
| 1   | 20201120 | 1       | 26    |
| 2   | 20201121 | 1       | 14    |
| 3   | 20201122 | 1       | 14    |
| 4   | 20201123 | 1       | 14    |
| 5   | 20201124 | 1       | 14    |
| 6   | 20201125 | 1       | 0     |
| 7   | 20201114 | 2       | 32    |
| 8   | 20201115 | 2       | 32    |
| 9   | 20201116 | 2       | 0     |
| 10  | 20201117 | 2       | 0     |
| 11  | 20201118 | 2       | 0     |
| 12  | 20201119 | 2       | 0     |
| 13  | 20201120 | 2       | 23    |
I'm trying to do this in BigQuery using Standard SQL. I have an idea of how to keep the same score across the following empty dates, but I really don't know how to add new rows for the missing dates for each user. Also, just to keep in mind, this example only has 2 users, but in my data I have more than 1,500.
My end goal is to show something like the average score per day. For background, because of our logic, if the score wasn't recorded on a specific day, it means the user is still at the last score recorded, which is why I need a score for every user every day.
I'd really appreciate any help I could get! I've been trying different options without success.
Below is for BigQuery Standard SQL
#standardSQL
select date, user_id,
       last_value(score ignore nulls) over(partition by user_id order by date) as score
from (
      select user_id, format_date('%Y%m%d', day) date
      from (
            select user_id,
                   min(parse_date('%Y%m%d', date)) min_date,
                   max(parse_date('%Y%m%d', date)) max_date
            from `project.dataset.table`
            group by user_id
           ) a, unnest(generate_date_array(min_date, max_date)) day
     )
left join `project.dataset.table` b
using(date, user_id)
-- order by user_id, date
If applied to the sample data from your question, this produces the filled-in daily rows shown in your desired output.
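As a follow-up for the stated end goal (the average score per day), the filled result can simply be aggregated; a hedged sketch, assuming the query above is wrapped in (or materialized as) something called filled_scores, a name introduced here purely for illustration:

-- filled_scores is assumed to hold the output of the fill-forward query above:
-- one row per user per day, with the carried-forward score
select date, avg(score) as avg_score
from filled_scores
group by date
order by date;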
One option uses generate_date_array() to create the series of dates for each user, then brings in the table with a left join.
select d.date, d.user_id,
       last_value(t.score ignore nulls) over(partition by d.user_id order by d.date) as score
from (
      -- assumes date is a DATE column
      select u.user_id, day as date
      from (
            select user_id, min(date) as min_date, max(date) as max_date
            from mytable
            group by user_id
           ) u
      cross join unnest(generate_date_array(u.min_date, u.max_date, interval 1 day)) as day
     ) d
left join mytable t on t.user_id = d.user_id and t.date = d.date
I think the most efficient method is to use generate_date_array() but in a very particular way:
with tt as (
      select t.*,
             date_sub(lead(date) over (partition by user_id order by date), interval 1 day) as next_date
      from t
     )
select row_number() over (order by tt.user_id, dte) as id,
       tt.user_id, dte, tt.score
from tt cross join
     unnest(generate_date_array(date,
                                coalesce(next_date, date),
                                interval 1 day)
           ) dte;

PARTITION BY in CASE doesn't work with several AND statements

I have a table with 4 columns: hitId, userId, timestamp and Camp.
I need to classify whether a hit is the start of a new session (1 or 0) using two parameters: 1. the time difference between hits, and 2. whether the source of the hit is a new campaign.
I need a standard SQL query in BigQuery.
A hit is considered as a start of a new session if one of the following is true:
it's the first hit from its userId
the time difference between the timestamp of the previous hit from
the same userId is more than 30 mins.
the time difference between the timestamp of the previous hit from the same userId is less than 30 mins, but the Camp (ad campaign) value is not NULL and occurs for the first time for the same userId within the previous 30 min.
So if hit1 from user1 has a Camp equal to Campaign1, and hit2 from user1 has a Camp equal to Campaign1, and time difference between hit1 and hit2 is less than 30 mins, hit1 will be considered as a start of a session, and hit2 won't be considered as a start.
I am having trouble with the campaign part. I tried this code:
WITH timeDifference AS (
    SELECT *,
           TIMESTAMP_DIFF(timestamp, LAG(timestamp, 1) OVER
               (PARTITION BY userId ORDER BY timestamp), SECOND) AS difference
    FROM hitTable
    ORDER BY timestamp)
SELECT *,
       CASE
           WHEN difference >= 30 * 60 THEN 1
           WHEN difference IS NULL THEN 1
           WHEN difference <= 30 * 60 AND Camp IS NOT NULL
                AND RANK() OVER (PARTITION BY userId ORDER BY Camp) = 1 THEN 1
           ELSE 0
       END AS sess
FROM timeDifference
ORDER BY timestamp;
The condition RANK() OVER (PARTITION BY userId ORDER BY Camp) does not seem to work, as I receive this table:
hitId | userId | timestamp | Camp | difference | sess
_______________________________________________________________________
00150 | 858201 | 00:48:35.315 | NULL | NULL | 1
00151 | 858201 | 00:49:35.315 | NULL | 5 | 0
00152 | 858201 | 00:50:35.315 | Search-Ads-US | 10 | 0
00153 | 858201 | 00:53:35.315 | Search-Ads-US | 15 | 0
00154 | 858202 | 00:54:35.315 | Facebook-Ads | NULL | 1
00155 | 858202 | 00:54:55.315 | Facebook-Ads | 9 | 0
00156 | 858202 | 00:57:20.315 | Facebook-Ads | 12 | 0
While I expect the sess column to be 1 for hitId = 00152:
hitId | userId | timestamp | Camp | difference | sess
_______________________________________________________________________
00150 | 858201 | 00:48:35.315 | NULL | NULL | 1
00151 | 858201 | 00:49:35.315 | NULL | 5 | 0
00152 | 858201 | 00:50:35.315 | Search-Ads-US | 10 | 1
00153 | 858201 | 00:53:35.315 | Search-Ads-US | 15 | 0
00154 | 858202 | 00:54:35.315 | Facebook-Ads | NULL | 1
00155 | 858202 | 00:54:55.315 | Facebook-Ads | 9 | 0
00156 | 858202 | 00:57:20.315 | Facebook-Ads | 12 | 0
This RANK() OVER (PARTITION BY userId ORDER BY Camp) returns incorrect results in cases where a user had multiple Camps.
Notice your PARTITION BY uses userId while you want to mark sessions within each Camp.
The actual "rank 1" of the RANK() (...) statement for userId 00150 is where the Camp is NULL (hitId 00150) therefore it misses your CASE condition at hitId 00152.
You could try and add 'Camp' to your PARTITION BY as follows:
RANK() OVER (PARTITION BY userId, Camp ORDER BY timestamp)
Alternatively, you could replace the RANK() (...) and use LAG(Camp) (... order by timestamp) in addition to the LAG(timestamp) (...) you are calculating.
This will retrieve the Camp value for the row before (call it 'PreviousCampValue'). Then you could add something like WHEN PreviousCampValue != Camp THEN 1
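A hedged sketch of that LAG(Camp) alternative, keeping the structure of the query in the question (PreviousCampValue is just the illustrative name from above):

WITH timeDifference AS (
    SELECT *,
           TIMESTAMP_DIFF(timestamp, LAG(timestamp) OVER
               (PARTITION BY userId ORDER BY timestamp), SECOND) AS difference,
           LAG(Camp) OVER (PARTITION BY userId ORDER BY timestamp) AS PreviousCampValue
    FROM hitTable
)
SELECT *,
       CASE
           WHEN difference IS NULL THEN 1
           WHEN difference >= 30 * 60 THEN 1
           WHEN Camp IS NOT NULL
                AND (PreviousCampValue IS NULL OR PreviousCampValue != Camp) THEN 1
           ELSE 0
       END AS sess
FROM timeDifference
ORDER BY timestamp;

On the sample data this marks hitId 00152 (as well as 00150 and 00154) with sess = 1, matching the expected output.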
Hope that's helpful

In Redshift, how do I run the opposite of a SUM function

Assuming I have a data table
date | user_id | user_last_name | order_id | is_new_session
------------+------------+----------------+-----------+---------------
2014-09-01 | A | B | 1 | t
2014-09-01 | A | B | 5 | f
2014-09-02 | A | B | 8 | t
2014-09-01 | B | B | 2 | t
2014-09-02 | B | test | 3 | t
2014-09-03 | B | test | 4 | t
2014-09-04 | B | test | 6 | t
2014-09-04 | B | test | 7 | f
2014-09-05 | B | test | 9 | t
2014-09-05 | B | test | 10 | f
I want to get another column in Redshift which basically assigns session numbers to each user's sessions. It starts at 1 for the first record for each user, and as you move further down, it increments whenever it encounters a true in the "is_new_session" column and stays the same when it encounters a false. When it hits a new user, the value resets to 1. The ideal output for this table would be:
1
1
2
1
2
3
4
4
5
5
In my mind it's kind of the opposite of a SUM(1) over (Partition BY user_id, is_new_session ORDER BY user_id, date ASC)
Any ideas?
Thanks!
I think you want an incremental sum:
select t.*,
sum(case when is_new_session then 1 else 0 end) over (partition by user_id order by date) as session_number
from t;
In Redshift, you might need the windowing clause:
select t.*,
sum(case when is_new_session then 1 else 0 end) over
(partition by user_id
order by date
rows between unbounded preceding and current row
) as session_number
from t;
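One small caveat, assuming order_id reflects the order of hits within a day: several rows share the same date, so it may be safer to add a tiebreaker to the ORDER BY to keep the running sum deterministic; a hedged variant:

select t.*,
       sum(case when is_new_session then 1 else 0 end) over
           (partition by user_id
            order by date, order_id
            rows between unbounded preceding and current row
           ) as session_number
from t;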

Select latest values for group of related records

I have a table that accommodates data that is logically groupable by multiple properties (a foreign key, for example). The data is sequential over a continuous time interval, i.e. it is time series data. What I am trying to achieve is to select only the latest values for each combination of groups.
Here is example data:
+-----------------------------------------+
| code | value | date | relation_id |
+-----------------------------------------+
| A | 1 | 01.01.2016 | 1 |
| A | 2 | 02.01.2016 | 1 |
| A | 3 | 03.01.2016 | 1 |
| A | 4 | 01.01.2016 | 2 |
| A | 5 | 02.01.2016 | 2 |
| A | 6 | 03.01.2016 | 2 |
| B | 1 | 01.01.2016 | 1 |
| B | 2 | 02.01.2016 | 1 |
| B | 3 | 03.01.2016 | 1 |
| B | 4 | 01.01.2016 | 2 |
| B | 5 | 02.01.2016 | 2 |
| B | 6 | 03.01.2016 | 2 |
+-----------------------------------------+
And here is example of desired output:
+-----------------------------------------+
| code | value | date | relation_id |
+-----------------------------------------+
| A | 3 | 03.01.2016 | 1 |
| A | 6 | 03.01.2016 | 2 |
| B | 3 | 03.01.2016 | 1 |
| B | 6 | 03.01.2016 | 2 |
+-----------------------------------------+
To put this in perspective: for every related object, I want to select each code with its latest date.
Here is the select I came up with. I've used the ROW_NUMBER() OVER (PARTITION BY ...) approach:
SELECT indicators.code, indicators.dimension, indicators.unit, x.value, x.date, x.ticker, x.name
FROM (
      SELECT
          ROW_NUMBER() OVER (PARTITION BY indicator_id ORDER BY date DESC) AS r,
          t.indicator_id, t.value, t.date, t.company_id, companies.sic_id,
          companies.ticker, companies.name
      FROM fundamentals t
      INNER JOIN companies ON companies.id = t.company_id
      WHERE companies.sic_id = 89
     ) x
INNER JOIN indicators ON indicators.id = x.indicator_id
WHERE x.r <= (SELECT count(*) FROM companies WHERE sic_id = 89)
It works, but the problem is that it is painfully slow; when working with about 5% of the production data, which equals roughly 3 million fundamentals records, this select takes about 10 seconds to finish. My guess is that this happens because the subselect pulls a huge number of records first.
Is there any way to speed this query up, or am I digging in the wrong direction by trying to do it the way I do?
Postgres offers the convenient distinct on for this purpose:
select distinct on (relation_id, code) t.*
from t
order by relation_id, code, date desc;
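Given the performance concern, it may also help to back this with a matching index so Postgres can satisfy the DISTINCT ON ordering from the index; a hedged suggestion using the simplified table and column names above:

-- Hypothetical index matching the DISTINCT ON / ORDER BY columns:
create index on t (relation_id, code, date desc);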
Your query uses different column names than your sample data, so it's hard to tell, but it looks like you just want to group by everything except the date. Assuming you don't have multiple rows sharing the most recent date, something like this should work. Basically, don't use the window function; use a proper GROUP BY, and your engine should optimize the query better.
SELECT mytable.code,
       mytable.value,
       mytable.date,
       mytable.relation_id
FROM mytable
JOIN (
      SELECT code,
             max(date) AS date,
             relation_id
      FROM mytable
      GROUP BY code, relation_id
     ) Q1
  ON Q1.code = mytable.code
 AND Q1.date = mytable.date
 AND Q1.relation_id = mytable.relation_id
Other option:
SELECT DISTINCT Code,
       Relation_ID,
       FIRST_VALUE(Value) OVER (PARTITION BY Code, Relation_ID ORDER BY Date DESC) Value,
       FIRST_VALUE(Date) OVER (PARTITION BY Code, Relation_ID ORDER BY Date DESC) Date
FROM mytable
This will return the top value for whatever you partition by, and for whatever you order by.
I believe we can try something like this
SELECT CODE, Relation_ID, Date, MAX(value) AS value
FROM mytable
GROUP BY CODE, Relation_ID, Date