Removing duplicate rows with condition - google-bigquery

User sessions are tracked on the system and stored in the following format. Sometimes I get multiple records for the same session id.
Row session_id user_actions
1 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141
2 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141,142,143,146
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
4 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117
I used to run a SQL query with DISTINCT(session_id) to keep only one of the multiple records for each session id. BUT I just realized that my query picks the row on top even if the bottom row recorded more actions for the same session. So if you look at the following table, my query keeps rows 1 & 3, like this:
Row session_id user_actions
1 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
Whereas I would like to keep rows 2 and 3, like this:
Row session_id user_actions
2 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141,142,143,146
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
Is there any way to do it with a SQL query? Thank you!

Below is one option for BigQuery Standard SQL:
#standardSQL
SELECT row, session_id, user_actions
FROM (
  SELECT
    row, session_id, user_actions,
    ROW_NUMBER() OVER(PARTITION BY session_id
      ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC
    ) = 1 win
  FROM `project.dataset.table`
)
WHERE win
You can test / play with the above using the dummy data from your question, as below:
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
  SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
  SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
  SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117'
)
SELECT row, session_id, user_actions
FROM (
  SELECT
    row, session_id, user_actions,
    ROW_NUMBER() OVER(PARTITION BY session_id
      ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC
    ) = 1 win
  FROM `project.dataset.table`
)
WHERE win
ORDER BY row
The result is:
row session_id user_actions
2 8a88d75c-6385-4e36-8d10-e22ac4d976a3 118,139,141,142,143,146
3 e85731b6-4472-40fb-ab2b-33ebd1278ba9 211,114,117,118,141,142,143,146
Another option would be as below
#standardSQL
WITH `project.dataset.table` AS (
  SELECT 1 row, '8a88d75c-6385-4e36-8d10-e22ac4d976a3' session_id, '118,139,141' user_actions UNION ALL
  SELECT 2, '8a88d75c-6385-4e36-8d10-e22ac4d976a3', '118,139,141,142,143,146' UNION ALL
  SELECT 3, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117,118,141,142,143,146' UNION ALL
  SELECT 4, 'e85731b6-4472-40fb-ab2b-33ebd1278ba9', '211,114,117'
)
SELECT session_id,
  ARRAY_AGG(user_actions ORDER BY ARRAY_LENGTH(SPLIT(user_actions)) DESC LIMIT 1)[SAFE_OFFSET(0)] user_actions
FROM `project.dataset.table`
GROUP BY session_id
This one looks a little cleaner :o)
You can extend the above by, for example, combining the distinct action codes from the duplicate entries, in case some actions are present in one row but missing in another, as sketched below.
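For example, a rough sketch of that idea (assuming the same dummy table and that the order of the combined codes does not matter; the action alias is just for illustration) would be to split each row into individual codes and rebuild one deduplicated list per session:
#standardSQL
SELECT
  session_id,
  -- combine the distinct action codes across all rows for the session
  STRING_AGG(DISTINCT action) AS user_actions
FROM `project.dataset.table`,
  UNNEST(SPLIT(user_actions)) AS action
GROUP BY session_id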
Update:
Try the below to separate the expense of calculating the array length from the ordering within the partition:
#standardSQL
SELECT row, session_id, user_actions
FROM (
  SELECT
    row, session_id, user_actions,
    ROW_NUMBER() OVER(PARTITION BY session_id ORDER BY len DESC) = 1 win
  FROM (
    SELECT *, ARRAY_LENGTH(SPLIT(user_actions)) len
    FROM `project.dataset.table`
  )
)
WHERE win

Related

How do I select 1 [oldest] row per group of rows, given multiple groups?

Let's say we have the database table below, called USER_JOBS.
I'd like to write an SQL query that reflects this algorithm:
Divide the whole table into groups of rows defined by a common USER_ID (in the example table, the 2 resulting groups are colored yellow & green)
From each group, select the oldest row (according to SCHEDULE_TIME)
From this example table, the desired SQL query would return these 2 rows:
You can use a ranking function (supported in most RDBMSs):
SELECT *
FROM
(
    SELECT *
        ,ROW_NUMBER() OVER (PARTITION BY USER_ID ORDER BY SCHEDULE_TIME ASC) AS RowID -- ASC so the oldest row per USER_ID gets RowID = 1
    FROM [table]
) AS ranked
WHERE RowID = 1
WITH Ranked AS (
    SELECT
        RANK() OVER (PARTITION BY User_ID ORDER BY ScheduleTime ASC) AS Ranking, -- ASC so the oldest row ranks first
        *
    FROM [table_name]
)
SELECT Status, Sob_Type, User_ID, TimeStamp FROM Ranked WHERE Ranking = 1;

Select duplicate rows based on time difference and occurrence count

I have a table like this:
As you can see, some records with the same farsi_pelak field have been added (detected) more than once within a few seconds.
That happened because of an application bug which has since been fixed.
Now I need to select and then delete the duplicate rows which were added at the same time (± a few seconds).
And this is my query:
SELECT TOP 100 PERCENT
y.id, y.farsi_pelak , y.detection_date_p , y.detection_time
FROM dbo._tbl_detection y
INNER JOIN
(SELECT TOP 100 PERCENT
farsi_pelak , detection_date_p
FROM dbo._tbl_detection WHERE camera_id = 2
GROUP BY farsi_pelak , detection_date_p
HAVING COUNT(farsi_pelak)>1) dt
ON
y.farsi_pelak=dt.farsi_pelak AND y.detection_date_p =dt.detection_date_p
ORDER BY farsi_pelak , detection_date_p DESC
But I can't calculate the time difference, because my detection_time field must not be included in the GROUP BY.
If you use SQL Server 2012 or later, you can use the LAG function to get values from the "previous" row.
Then calculate the difference between adjacent timestamps and find the rows where this difference is small.
WITH
CTE
AS
(
SELECT
id
,farsi_pelak
,detection_date_p
,detection_time
,LAG(detection_time) OVER (PARTITION BY farsi_pelak
ORDER BY detection_date_p, detection_time) AS prev_detection_time
FROM dbo._tbl_detection
)
,CTE_Diff
AS
(
SELECT
id
,farsi_pelak
,detection_date_p
,detection_time
,prev_detection_time
,DATEDIFF(second, prev_detection_time, detection_time) AS diff
FROM CTE
)
SELECT
id
,farsi_pelak
,detection_date_p
,detection_time
,prev_detection_time
,diff
FROM CTE_Diff
WHERE
diff <= 10
;
When you run this query and verify that it returns only rows that you want to delete, you can change the last SELECT to DELETE:
WITH
CTE
AS
(
SELECT
id
,farsi_pelak
,detection_date_p
,detection_time
,LAG(detection_time) OVER (PARTITION BY farsi_pelak
ORDER BY detection_date_p, detection_time) AS prev_detection_time
FROM dbo._tbl_detection
)
,CTE_Diff
AS
(
SELECT
id
,farsi_pelak
,detection_date_p
,detection_time
,prev_detection_time
,DATEDIFF(second, prev_detection_time, detection_time) AS diff
FROM CTE
)
DELETE
FROM CTE_Diff
WHERE
diff <= 10
;
I guess you need ROW_NUMBER() to check the time, as below, keeping the row with the earliest detection time and discarding the rest (the rows whose row number is greater than 1):
select *
from (
    select y.id, y.farsi_pelak,
        y.detection_date_p, y.detection_time,
        row_number() over (partition by y.farsi_pelak, y.detection_date_p
                           order by y.detection_time) rn
    from ( /* the above query */ ) y
) t
where rn > 1

SQL group rows into pairs

I'm trying to add some sort of unique identifier (uid) to partitions made of pairs of rows, i.e. generate some uid/tag for each two rows of (identifier1,identifier2) in a window partition with size = 2 rows.
So, for example, the first 2 rows for ID X would get uid A, the next two rows for the same ID would get uid B and, if there is only one single row left in the partition for ID X, it would get id C.
Here's what I'm trying to accomplish; the picture illustrates the table's structure, and I manually added the expectedIdentifier to illustrate the goal:
This is my current SQL; ntile doesn't solve it because the partition size varies:
select
rowId
, ntile(2) over (partition by firstIdentifier, secondIdentifier order by timestamp asc) as ntile
, *
from log;
Already tried ntile( (count(*) over partition...) / 2), but that doesn't work.
Generating the UID can be done with md5() or similar, but I'm having trouble tagging the rows as illustrated above (so I can md5 the generated tag/uid)
While count(*) is not supported within a Snowflake window function, count(1) is supported and can be used to create the unique identifier. Below is an example of an integer unique ID matching pairs of rows and handling "odd" row groups:
select
ntile(2) over (partition by firstIdentifier, secondIdentifier order by timestamp asc) as ntile
,ceil(count(1) over( partition by firstIdentifier, secondIdentifier order by timestamp asc) / 2) as id
, *
from log;
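If you then want an opaque UID rather than that integer, one hedged option (a sketch assuming the same log table, Snowflake's MD5 / TO_VARCHAR functions, and the md5() idea mentioned in the question; pair_uid and the '-' separator are just illustrative) is to hash the identifiers together with that pair number:
select t.*,
       -- one hash per (firstIdentifier, secondIdentifier, pair number) combination
       md5(to_varchar(firstIdentifier) || '-' ||
           to_varchar(secondIdentifier) || '-' ||
           to_varchar(id)) as pair_uid
from (
    select *,
           ceil(count(1) over (partition by firstIdentifier, secondIdentifier
                               order by timestamp asc) / 2) as id
    from log
) t;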
select *, char(65 + (row_number() over (partition by firstidentifier, secondidentifier
                                         order by timestamp) - 1) / 2) expectedidentifier
from log
order by firstidentifier, timestamp
Here is the SQL Server version:
with log (firstidentifier,secondidentifier, timestamp)
as (
select 15396, 14460, 1 union all
select 15396, 14460, 1 union all
select 19744, 14451, 1 union all
select 19744, 14451, 1 union all
select 19744, 14451, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1 union all
select 15590, 12404, 1
)
select *, char(65 + (row_number() over (partition by firstidentifier, secondidentifier
                                         order by timestamp) - 1) / 2) expectedidentifier
from log
order by firstidentifier, secondidentifier, timestamp

Query Hive table using ROWNUM

How can I query a Hive table by row number?
For example:
Let's say I want to print out all records of a Hive table from row number 2 to 5.
I actually recently updated the documentation regarding the offset option
... order by ... limit 1,4
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause
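For instance, a minimal sketch of that syntax (assuming a hypothetical table my_table with a sortable column id, and a Hive version whose LIMIT clause accepts an offset):
SELECT *
FROM my_table
ORDER BY id
LIMIT 1, 4;  -- skip the first row, then return the next 4 (rows 2 through 5)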
This answer seems like what you're asking:
SQL most recent using row_number() over partition
In other words:
SELECT user_id, page_name, recent_click
FROM (
SELECT user_id,
page_name,
row_number() over (partition by session_id order by ts desc) as recent_click
from clicks_data
) T
WHERE recent_click between 2 and 5

"Group" some rows together before sorting (Oracle)

I'm using Oracle Database 11g.
I have a query that selects, among other things, an ID and a date from a table. Basically, what I want to do is keep the rows that have the same ID together, and then sort those "groups" of rows by the most recent date in the "group".
So if my original result was this:
ID Date
3 11/26/11
1 1/5/12
2 6/3/13
2 10/15/13
1 7/5/13
The output I'm hoping for is:
ID Date
3 11/26/11 <-- (Using this date for "group" ID = 3)
1 1/5/12
1 7/5/13 <-- (Using this date for "group" ID = 1)
2 6/3/13
2 10/15/13 <-- (Using this date for "group" ID = 2)
Is there any way to do this?
This is another way to get the specified result, without using an analytic function (this is ordering first by the most_recent_date for each ID, then by ID, then by Date):
SELECT t.ID
, t.Date
FROM mytable t
JOIN ( SELECT s.ID
, MAX(s.Date) AS most_recent_date
FROM mytable s
WHERE s.Date IS NOT NULL
GROUP BY s.ID
) r
ON r.ID = t.ID
ORDER
BY r.most_recent_date
, t.ID
, t.Date
The "trick" here is to return "most_recent_date" for each ID, and then join that to each row. The result can be ordered by that first, then by whatever else.
(I also think there's a way to get this same ordering using Analytic functions, but I don't have an example of that handy.)
You can use the MAX ... KEEP function with your aggregate to create your sort key:
with
sample_data as
(select 3 id, to_date('11/26/11','MM/DD/RR') date_col from dual union all
select 1, to_date('1/5/12','MM/DD/RR') date_col from dual union all
select 2, to_date('6/3/13','MM/DD/RR') date_col from dual union all
select 2, to_date('10/15/13','MM/DD/RR') date_col from dual union all
select 1, to_date('7/5/13','MM/DD/RR') date_col from dual)
select
id,
date_col,
-- For illustration purposes, does not need to be selected:
max(date_col) keep (dense_rank last order by date_col) over (partition by id) sort_key
from sample_data
order by max(date_col) keep (dense_rank last order by date_col) over (partition by id);
Here is the query using analytic functions:
select
id
, date_
, max(date_) over (partition by id) as max_date
from table_name
order by max_date, id
;