BigQuery Standard SQL: Delete Duplicates from Table

I am using the query below to delete duplicate records from BigQuery using Standard SQL, but it throws an error:
with cte as (
select * ,row_number()over (partition by CallRailCallId order by CallRailCallId) as rn
from `encoremarketingtest.EncoreMarketingTest.CallRailCall2` )
delete
from cte
where rn>1
Query Failed
Error: Syntax error: Expected "(" or keyword SELECT but got keyword DELETE at [5:5]
Could anyone help me with the correct approach in BigQuery?

Option #1
CREATE OR REPLACE TABLE `project.dataset.your_table` AS
SELECT * EXCEPT(rn)
FROM (
  SELECT *, ROW_NUMBER() OVER(PARTITION BY CallRailCallId ORDER BY CallRailCallId) rn
  FROM `project.dataset.your_table`
)
WHERE rn = 1
Option #2
CREATE OR REPLACE TABLE `project.dataset.your_table` AS
SELECT row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY CallRailCallId LIMIT 1)[OFFSET(0)] row
  FROM `project.dataset.your_table` t
  GROUP BY CallRailCallId
)
As you may have noticed, the options above use the DDL (CREATE TABLE) approach, which makes it possible to rely on just the one column known from your question: CallRailCallId.
Also note that ORDER BY CallRailCallId plays no real role there, because GROUP BY and PARTITION BY use exactly the same field. If you change the ORDER BY field, it controls which row (out of several duplicates) "survives". For example, ORDER BY ts DESC (see the option below for what ts might be).
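To make the effect of the ORDER BY choice concrete, here is a minimal sketch that runs the Option #1 pattern against SQLite from Python instead of BigQuery. That swap is an assumption made only so the example can run locally; the table name calls, the ts and note columns, and the sample rows are all hypothetical.

```python
import sqlite3

# Hypothetical stand-in table; SQLite is used here, not BigQuery.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (CallRailCallId TEXT, ts INTEGER, note TEXT)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?, ?)",
    [("a", 1, "old"), ("a", 2, "new"), ("b", 5, "only")],
)

# One row per CallRailCallId; ORDER BY ts DESC makes the latest row get rn = 1.
survivors = conn.execute(
    """
    SELECT CallRailCallId, ts, note
    FROM (
        SELECT *, ROW_NUMBER() OVER (
            PARTITION BY CallRailCallId ORDER BY ts DESC
        ) AS rn
        FROM calls
    )
    WHERE rn = 1
    ORDER BY CallRailCallId
    """
).fetchall()
print(survivors)
```

Switching ts DESC to ts ASC in this sketch would keep the oldest row of each group instead.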
Option #3
This option uses DML (DELETE FROM) but requires an extra column to serve as a tie-breaker.
For example, suppose you have a ts TIMESTAMP field and you want the most recent row (based on ts) to survive:
DELETE FROM `project.dataset.your_table`
WHERE STRUCT(CallRailCallId, ts) NOT IN (
  SELECT AS STRUCT CallRailCallId, MAX(ts) ts
  FROM `project.dataset.your_table`
  GROUP BY CallRailCallId
)
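The same delete-by-tie-breaker idea can be tried locally. The sketch below is not BigQuery: it assumes SQLite (via Python), where a row value (CallRailCallId, ts) plays the part of the STRUCT. The table name calls and the sample data are hypothetical.

```python
import sqlite3

# Hypothetical stand-in table; SQLite row values replace BigQuery's STRUCT.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE calls (CallRailCallId TEXT, ts INTEGER)")
conn.executemany(
    "INSERT INTO calls VALUES (?, ?)",
    [("a", 1), ("a", 2), ("a", 3), ("b", 7)],
)

# Delete every row that is not the (id, latest ts) pair of its group.
deleted = conn.execute(
    """
    DELETE FROM calls
    WHERE (CallRailCallId, ts) NOT IN (
        SELECT CallRailCallId, MAX(ts)
        FROM calls
        GROUP BY CallRailCallId
    )
    """
).rowcount

remaining = conn.execute(
    "SELECT CallRailCallId, ts FROM calls ORDER BY CallRailCallId"
).fetchall()
print(deleted, remaining)
```

Note that rows identical in both CallRailCallId and ts all match the surviving pair, so this pattern does not remove exact duplicates of the latest row.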

Related

MATCH_RECOGNIZE with CTE in Snowflake

I am using the MATCH_RECOGNIZE function in a query with a few CTEs. When I run the query, I get the following error:
SQL compilation error: MATCH_RECOGNIZE not supported in this context.
In my query, there are several CTEs before and after the MATCH_RECOGNIZE, shown partially below.
WITH cte1 AS (
SELECT *
FROM dataset
WHERE ID IS NOT NULL AND STATUS IS NOT NULL ),
cte2 AS (
SELECT *
FROM cte1
QUALIFY FIRST_VALUE(STATUS) OVER (PARTITION BY ID ORDER BY CREATED_AT) = 'created' )
mr as (
SELECT *
FROM cte2
MATCH_RECOGNIZE (
PARTITION BY ID
ORDER BY CREATED_AT
MEASURES MATCH_NUMBER() AS mn,
MATCH_SEQUENCE_NUMBER AS msn
ALL ROWS PER MATCH
PATTERN (c+m+)
DEFINE
c AS status='created'
,m AS status='missing_info'
,p AS status='pending'
) m1
QUALIFY (ROW_NUMBER() OVER(PARTITION BY mn, ID ORDER BY msn) = 1)
OR(ROW_NUMBER() OVER(PARTITION BY mn, ID ORDER BY msn DESC)=1)
ORDER BY ID, CREATED_AT ),
cte3 as (
SELECT *
FROM mr
-- some other operations
)
What would be the ideal approach to solve this? e.g. creating a regular view, a materialized view, or a temp table, etc. I tried to create a view but got an error, not sure if it is supported either.
How can I use the result of the MATCH_RECOGNIZE in other later CTEs?
When I add the following, it gives this error:
syntax error line xx at position 0 unexpected 'create'.
create view filtered_idents AS
SELECT *
FROM cte2
MATCH_RECOGNIZE (
)

This seems to be an undocumented limitation (I asked our awesome docs team to fix this).
In the meantime, I suggest dividing the process into steps so that you can use the match_recognize results.
Reproducing error:
with data as (
select $1 company, $2 price_date, $3 price
from values('a',1,10), ('a',2,15)
), cte as (
select *
from data match_recognize(
partition by company
order by price_date
measures match_number() as "MATCH_NUMBER"
all rows per match omit empty matches
pattern(overavg*)
define
overavg as price > avg(price) over (rows between unbounded
preceding and unbounded following)
)
)
select * from cte
-- 002362 (0A000): SQL compilation error: MATCH_RECOGNIZE not supported in this context.
Two-step solution:
with data as (
select $1 company, $2 price_date, $3 price
from values('a',1,10), ('a',2,15)
)
select *
from data match_recognize(
partition by company
order by price_date
measures match_number() as "MATCH_NUMBER"
all rows per match omit empty matches
pattern(overavg*)
define
overavg as price > avg(price) over (rows between unbounded
preceding and unbounded following)
)
;
with previous_results as (
select *
from table(result_scan(last_query_id()))
)
select *
from previous_results
;

Kimi, trying out your snippet I'm getting:
SQL compilation error: syntax error line 11 at position 0 unexpected 'mr'. syntax error line 17 at position 6 unexpected 'MEASURES'.
Line 9 seems to be missing a terminating comma.
When I add one and then complete the whole with a simple select statement then I don't get syntax errors anymore, I only get name lookup errors (expected of course).

Avoid duplicate records from a particular column of a table

I have a table as shown in the image. In the Number column, some values appear more than once (for example, 63 appears twice). I would like to keep only one row per value. Please see my code:
delete from t1 where
(SELECT *,row_number() OVER (
PARTITION BY
Number
ORDER BY
Date) as rn from t1 where rn > 1)
It shows an error. Can anyone please assist?

The column created by row_number() is not accessible to your main query. To make it accessible, wrap the query in a subquery and apply the desired filter outside it:
SELECT *
FROM
(
SELECT *,
row_number() OVER (PARTITION BY Number ORDER BY Date) as rn
FROM t1 ) T
where rn = 1;

The partition by determines how row numbers repeat. The row numbers are assigned per group of partition by keys. So, you can get duplicates.
If you want a unique row number over all rows, just leave out the partition by:
select t1.*
from (select t1.*,
row_number() over (order by date) as rn
from t1
) t1
where rn > 1
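As a rough illustration of the difference between the two numbering schemes, the sketch below runs both on a tiny table. It assumes SQLite driven from Python (the question does not say which database is in use), and the sample rows are hypothetical.

```python
import sqlite3

# Hypothetical sample data: 63 appears twice, 70 once.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (number INTEGER, date TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [(63, "2020-01-01"), (63, "2020-01-02"), (70, "2020-01-03")],
)

# With PARTITION BY, numbering restarts at 1 for every distinct number.
partitioned = conn.execute(
    """
    SELECT number, date,
           ROW_NUMBER() OVER (PARTITION BY number ORDER BY date) AS rn
    FROM t1
    ORDER BY number, date
    """
).fetchall()

# Without PARTITION BY, every row gets its own unique number.
global_rn = conn.execute(
    """
    SELECT number, date, ROW_NUMBER() OVER (ORDER BY date) AS rn
    FROM t1
    ORDER BY date
    """
).fetchall()

# Keeping rn = 1 from the partitioned numbering leaves one row per number.
kept = [(n, d) for n, d, rn in partitioned if rn == 1]
print(partitioned, global_rn, kept)
```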

If you want to keep only one value, use rn = 1 instead of "> 1".

Get the last time a value has changed in Google BigQuery

I have an employee database which contains records about employees. The fields are :
employee_identifier
employee_salary
date_of_the_record
I would like to get, for each record, the date of the last change in employee_salary. Which SQL query could do this?
I have tried with multiple sub-queries, but it does not work.

Below is for BigQuery Standard SQL
#standardSQL
SELECT * EXCEPT(arr),
(SELECT MAX(date_of_the_record) FROM UNNEST(arr)
WHERE employee_salary != t.employee_salary
) AS last_change_in_employee_salary
FROM (
SELECT *, ARRAY_AGG(STRUCT(employee_salary, date_of_the_record)) OVER(win) arr
FROM `project.dataset.employee_database`
WINDOW win AS (PARTITION BY employee_identifier ORDER BY date_of_the_record)
) t
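For anyone who wants to check the logic without BigQuery, here is a sketch of what I understand the query above to compute, rewritten as a correlated subquery and run on SQLite from Python. The rewrite and the sample rows are assumptions for illustration, not the original query.

```python
import sqlite3

# Hypothetical data: one employee whose salary changes once.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE employee_database ("
    "employee_identifier INTEGER, employee_salary INTEGER, "
    "date_of_the_record TEXT)"
)
conn.executemany(
    "INSERT INTO employee_database VALUES (?, ?, ?)",
    [
        (1, 100, "2020-01-01"),
        (1, 100, "2020-02-01"),
        (1, 120, "2020-03-01"),
        (1, 120, "2020-04-01"),
    ],
)

# For each record: the latest date up to that record (same employee)
# on which the salary differed from the record's own salary.
rows = conn.execute(
    """
    SELECT t.date_of_the_record,
           t.employee_salary,
           (SELECT MAX(p.date_of_the_record)
            FROM employee_database p
            WHERE p.employee_identifier = t.employee_identifier
              AND p.date_of_the_record <= t.date_of_the_record
              AND p.employee_salary != t.employee_salary
           ) AS last_change_in_employee_salary
    FROM employee_database t
    ORDER BY t.date_of_the_record
    """
).fetchall()
print(rows)
```

In this sketch the returned date is that of the last record carrying a different salary, so it is NULL until the salary has changed at least once.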

Use row_number():
with cte as
(
select *,
row_number()over(partition by employee_identifier order by date_of_the_record desc) rn from table_name
) select * from cte where rn=1

You can also do this without a subquery. If you want all the columns:
SELECT as value ARRAY_AGG(t ORDER BY date_of_the_record DESC LIMIT 1)[ordinal(1)]
FROM t t
GROUP BY employee_identifier;
If you just want the date, use GROUP BY:
SELECT employee_identifier, MAX(date_of_the_record)
FROM t t
GROUP BY employee_identifier;

Query Hive table using ROWNUM

How can I query a Hive table for a specific range of row numbers?
For example:
Let's say I want to print all records of a Hive table from row number 2 to 5.

I actually recently updated the documentation regarding the offset option
... order by ... limit 1,4
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+Select#LanguageManualSelect-LIMITClause
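A quick way to see what the offset form returns is the sketch below. It runs on SQLite from Python, not on Hive, on the assumption that the two-argument LIMIT behaves the same way there (first number is the offset, second is the row count). The events table is hypothetical.

```python
import sqlite3

# Hypothetical table with ids 1..7; SQLite stands in for Hive here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (id INTEGER)")
conn.executemany("INSERT INTO events VALUES (?)", [(i,) for i in range(1, 8)])

# LIMIT 1, 4: skip the first row, then return the next four (rows 2 to 5).
rows = [r[0] for r in conn.execute("SELECT id FROM events ORDER BY id LIMIT 1, 4")]
print(rows)
```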

This answer seems like what you're asking:
SQL most recent using row_number() over partition
In other words:
SELECT user_id, page_name, recent_click
FROM (
SELECT user_id,
page_name,
row_number() over (partition by session_id order by ts desc) as recent_click
from clicks_data
) T
WHERE recent_click between 2 and 5

row_number() over() combined with order by

How can I add a sequential row number to a query that uses ORDER BY?
Let's say I have a query of this form:
SELECT row_number() over(), data
FROM myTable
ORDER BY data
This will produce the desired result as rows are ordered by "data", but the row numbers are also ordered by data. I understand this is normal as my row number is generated before the order by, but how can I generate this row number after the order by?
I did try to use a subquery like this :
SELECT row_number() over(ORDER BY data), *
FROM
(
SELECT data
FROM myTable
ORDER BY data
) As t1
As shown here, but DB2 doesn't seem to support this syntax SELECT ..., * FROM
Thanks !

You also need to put the alias name before '*':
SELECT row_number() over(ORDER BY data), t1.*
FROM
(
SELECT data
FROM myTable
ORDER BY data
) As t1
You don't need a subquery to do this:
SELECT data , row_number() over(ORDER BY data) as rn
FROM myTable
ORDER BY data
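The subquery-free form is easy to verify on a small table. The sketch below assumes SQLite via Python rather than DB2, and the sample values are hypothetical.

```python
import sqlite3

# Hypothetical unsorted data; SQLite stands in for DB2 here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE myTable (data TEXT)")
conn.executemany("INSERT INTO myTable VALUES (?)", [("pear",), ("apple",), ("fig",)])

# The window's ORDER BY numbers the rows; the outer ORDER BY sorts the output.
numbered = conn.execute(
    """
    SELECT data, ROW_NUMBER() OVER (ORDER BY data) AS rn
    FROM myTable
    ORDER BY data
    """
).fetchall()
print(numbered)
```

Because both ORDER BY clauses use the same column, the row numbers follow the output order.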
