How to replace timestamp-partitioned table data in BigQuery? - google-bigquery

The problem I'm trying to solve is removing duplicates from a particular partition as referenced by a TIMESTAMP type column. My table is something like the schema below with the timestamp column partition having day-based granularity:
requestID:STRING, ts:TIMESTAMP, recordNo:INTEGER, recordData:STRING
Now I have millions and millions of these and sometimes there are duplicates like this:
'server1234', '2020-06-10', 1, apple
'server1234', '2020-06-10', 1, apple
'server1234', '2020-06-10', 2, orange
'server1234', '2020-06-10', 2, orange
The uniqueness of the records is determined by two fields: requestID and recordNo. I'd like to remove the duplicates in the partition where CAST(ts AS DATE) = '2020-06-10'. I can see the distinct records with a simple select:
SELECT DISTINCT * FROM mytable WHERE CAST(ts AS DATE) = '2020-06-10'
There must be a way to combine a delete/update/merge with the select distinct so that I can replace the partition with the de-duplicated data.
Thoughts?

The safest way to do this is to select only the data (de-duplicated) you need out into a new table, delete the data in your permanent table, then insert your de-duplicated data back into the permanent location. BigQuery does not make update/delete methods as easy as some OLTP databases.
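The three steps above (stage the de-duplicated rows, delete the originals, insert the staged rows back) can be sketched as follows. This is only an illustration under stated assumptions: it uses Python's built-in sqlite3 module as a stand-in for BigQuery, the staging table name `staging` and the extra 2020-06-11 rows are hypothetical, and SQLite's date() plays the role of CAST(ts AS DATE).

```python
import sqlite3

# Hypothetical stand-in for the partitioned table; SQLite stores ts as text.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE mytable "
    "(requestID TEXT, ts TEXT, recordNo INTEGER, recordData TEXT)"
)
conn.executemany(
    "INSERT INTO mytable VALUES (?, ?, ?, ?)",
    [
        ("server1234", "2020-06-10 00:00:00", 1, "apple"),
        ("server1234", "2020-06-10 00:00:00", 1, "apple"),
        ("server1234", "2020-06-10 00:00:00", 2, "orange"),
        ("server1234", "2020-06-10 00:00:00", 2, "orange"),
        # Rows from another day (made up) that must stay untouched.
        ("server1234", "2020-06-11 00:00:00", 1, "pear"),
        ("server1234", "2020-06-11 00:00:00", 1, "pear"),
    ],
)
conn.commit()


def replace_day_deduplicated(conn, day):
    """Replace one day's rows with their de-duplicated version."""
    with conn:  # commit if every statement succeeds, otherwise roll back
        # 1. Stage the distinct rows for the chosen day.
        conn.execute("DROP TABLE IF EXISTS staging")
        conn.execute(
            "CREATE TEMP TABLE staging "
            "(requestID TEXT, ts TEXT, recordNo INTEGER, recordData TEXT)"
        )
        conn.execute(
            "INSERT INTO staging "
            "SELECT DISTINCT * FROM mytable WHERE date(ts) = ?",
            (day,),
        )
        # 2. Delete that day's rows from the permanent table.
        conn.execute("DELETE FROM mytable WHERE date(ts) = ?", (day,))
        # 3. Insert the staged, de-duplicated rows back.
        conn.execute("INSERT INTO mytable SELECT * FROM staging")


replace_day_deduplicated(conn, "2020-06-10")
cleaned = conn.execute(
    "SELECT requestID, ts, recordNo, recordData FROM mytable "
    "ORDER BY ts, recordNo"
).fetchall()
```

Only the rows for the chosen day are rewritten; the other day's rows, duplicates included, are left as they were.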
If you would prefer a more one-shot approach, here is an example with the data you provided that does the trick.
-- SETUP
CREATE TABLE working.remove_dupes
(
requestID STRING,
ts TIMESTAMP,
recordNo INT64,
recordData STRING
)
PARTITION BY TIMESTAMP_TRUNC(ts, HOUR);
INSERT INTO working.remove_dupes(requestID, ts, recordNo, recordData)
VALUES
('server1234', '2020-06-10', 1, 'apple'),
('server1234', '2020-06-10', 1, 'apple'),
('server1234', '2020-06-10', 2, 'orange'),
('server1234', '2020-06-10', 2, 'orange');
------------------------------------------------------------------------------------
-- SELECTING ONLY ONE OF THE ENTRIES (NO DUPLICATES)
SELECT
requestID,
ts,
recordNo,
recordData
FROM (
SELECT
requestID,
ts,
recordNo,
recordData,
ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
FROM
working.remove_dupes
)
WHERE
instance_num = 1;
------------------------------------------------------------------------------------
-- REPLACE THE ORIGINAL TABLE, REMOVING DUPLICATES IN THE PROCESS
-- BACK UP YOUR TABLE FIRST!!!!! (MAKE A COPY)
CREATE OR REPLACE TABLE working.remove_dupes
PARTITION BY TIMESTAMP_TRUNC(ts, HOUR)
AS
(SELECT
requestID,
ts,
recordNo,
recordData
FROM (
SELECT
requestID,
ts,
recordNo,
recordData,
ROW_NUMBER() OVER (PARTITION BY requestID, recordNo ORDER BY ts) AS instance_num
FROM
working.remove_dupes
)
WHERE
instance_num = 1);
EDIT: Note that replacing the table can (in my experience) wipe out table metadata (descriptions) and possibly the table partition. I've updated the example to include a table partition setup.

Related

Bigquery - Delete duplicate rows in tables by rank()

We use ETL to synchronize data from Cloud Storage to BigQuery, simply appending the latest data to the table.
There might be updated data with the same attributes but a different processing timestamp. We want to keep only the latest record in the table.
Because there is no primary key concept in BigQuery, we cannot do an upsert. We want to delete the redundant data by applying a ranking window function.
We are able to use CREATE OR REPLACE to recreate the table with the latest information. However, this table holds over 200 GB of records, so we would like to know whether we can simply delete the stale data instead.
Here is our sample table schema and data:
create table `project.dataset.sample`
(
name string,
process_timestamp timestamp not null,
amount int
)
PARTITION BY
TIMESTAMP_TRUNC(process_timestamp, DAY);
insert into `project.dataset.sample`
values
('Zoe', timestamp('2022-07-09 05:04:13.439780+00'),1 ),
('Zoe', timestamp('2022-07-09 10:53:13.330751+00'),2 ),
('Zoe', timestamp('2022-07-09 18:48:01.089188+00'),3 ),
('Zoe', timestamp('2022-07-10 11:06:01.053347+00'),4 ),
('Zoe', timestamp('2022-07-10 19:11:17.731549+00'),5 ),
('Tess', timestamp('2022-07-10 11:06:01.053347+00'),1 ),
('Tess', timestamp('2022-07-10 19:11:17.731549+00'),2 )
We expected two records to be left after executing the DELETE statement below;
however, it deleted all records.
DELETE
FROM `project.dataset.sample` ori
WHERE EXISTS (
WITH dedup as (select *,
rank() over(partition by name order by process_timestamp desc) as rank
from `project.dataset.sample`
)
SELECT * FROM dedup
WHERE ori.name = dedup.name and dedup.rank > 1);
Is there any method to achieve this requirement?
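Fixed. The updated SQL is below; the subquery now also matches on process_timestamp, so only the rows ranked after the latest one are deleted: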
DELETE
FROM `project.dataset.sample` ori
WHERE EXISTS (
WITH dedup as (select *,
rank() over(partition by name order by process_timestamp desc) as rank
from `project.dataset.sample`
)
SELECT * except(rank)
FROM dedup
WHERE ori.name = dedup.name
and ori.process_timestamp = dedup.process_timestamp
and dedup.rank > 1
);
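The difference between the two DELETE statements can be traced in plain Python. This is a rough sketch, not BigQuery code: `ranks_by_name` and `surviving` are hypothetical helpers that mimic rank() over(partition by name order by process_timestamp desc) and the EXISTS predicate, using the sample rows from the question with timestamps kept as strings.

```python
from collections import defaultdict

rows = [
    ("Zoe", "2022-07-09 05:04:13.439780", 1),
    ("Zoe", "2022-07-09 10:53:13.330751", 2),
    ("Zoe", "2022-07-09 18:48:01.089188", 3),
    ("Zoe", "2022-07-10 11:06:01.053347", 4),
    ("Zoe", "2022-07-10 19:11:17.731549", 5),
    ("Tess", "2022-07-10 11:06:01.053347", 1),
    ("Tess", "2022-07-10 19:11:17.731549", 2),
]


def ranks_by_name(rows):
    """Rank each (name, timestamp) pair, newest first, within its name."""
    by_name = defaultdict(list)
    for name, ts, _amount in rows:
        by_name[name].append(ts)
    ranked = {}
    for name, stamps in by_name.items():
        newest_first = sorted(stamps, reverse=True)
        for ts in stamps:
            # index() returns the first position, so ties share a rank.
            ranked[(name, ts)] = newest_first.index(ts) + 1
    return ranked


def surviving(rows, match_timestamp):
    """Return the rows a DELETE ... WHERE EXISTS would leave behind."""
    stale = [key for key, rank in ranks_by_name(rows).items() if rank > 1]
    kept = []
    for name, ts, amount in rows:
        if match_timestamp:
            # Fixed predicate: correlate on name AND process_timestamp.
            doomed = any(n == name and t == ts for n, t in stale)
        else:
            # Original predicate: correlate on name only.
            doomed = any(n == name for n, _t in stale)
        if not doomed:
            kept.append((name, ts, amount))
    return kept


name_only = surviving(rows, match_timestamp=False)
name_and_timestamp = surviving(rows, match_timestamp=True)
```

With the name-only correlation, every name that has at least one stale row loses all of its rows, which is why the first statement emptied the table.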

Combining overlapping date ranges without using a cross join in BigQuery

If I have this dataset:
create schema if not exists dbo;
create table if not exists dbo.player_history(team_id INT, player_id INT, active_from TIMESTAMP, active_to TIMESTAMP);
truncate table dbo.player_history;
INSERT INTO dbo.player_history VALUES(1,1,'2020-01-01', '2020-01-08');
INSERT INTO dbo.player_history VALUES(1,2,'2020-06-01', '2020-09-08');
INSERT INTO dbo.player_history VALUES(1,3,'2020-06-10', '2020-10-01');
INSERT INTO dbo.player_history VALUES(1,4,'2020-02-01', '2020-02-15');
INSERT INTO dbo.player_history VALUES(1,5,'2021-01-01', '2021-01-08');
INSERT INTO dbo.player_history VALUES(1,6,'2021-01-02', '2022-06-08');
INSERT INTO dbo.player_history VALUES(1,7,'2021-01-03', '2021-06-08');
INSERT INTO dbo.player_history VALUES(1,8,'2021-01-04', '2021-06-08');
INSERT INTO dbo.player_history VALUES(1,9,'2020-01-02', '2021-02-05');
INSERT INTO dbo.player_history VALUES(1,10,'2020-10-01', '2021-04-08');
INSERT INTO dbo.player_history VALUES(1,11,'2020-11-01', '2021-05-08');
and I want to combine overlapping date ranges so that I can identify 'islands' where at least one player was active, then I can do a cross join and a correlated subquery to get the results, like so:
with data_set as (
SELECT
a.team_id
, a.active_from
, ARRAY_AGG(b.active_to ORDER BY b.active_to DESC LIMIT 1)[SAFE_OFFSET(0)] AS active_to
FROM dbo.player_history a
LEFT JOIN dbo.player_history b
on a.team_id = b.team_id
where a.active_from between b.active_from and b.active_to
group by 1,2
)
select team_id
, min(active_from) as active_from
, active_to
from data_set
group by 1,3
order by active_from, active_to
This gives me the desired results. However, with a larger data set this approach is not feasible, and BigQuery does not recommend doing joins in such a manner. Looking at the execution plan, it is mostly the join that causes the slowness. Are there any ways to achieve the desired output more efficiently?
You can use a partitioned table to get better performance with large amounts of data. A partitioned table divides a large table into smaller partitions, which can improve query performance when the query filters on the partitioning column. A table can be partitioned on a TIMESTAMP, DATE, or DATETIME column.
An option could be:
Create a partitioned table
Load the data in the partitioned table
Execute the query
You can see this example:
This query creates a partitioned table and loads the data at the same time. The initial load may take some time, but later queries against the partitioned table should be much faster.
CREATE TABLE
mydataset.newtable (transaction_id INT64, transaction_date DATE)
PARTITION BY
transaction_date
AS SELECT transaction_id, transaction_date FROM mydataset.mytable
Then execute the query
SELECT transaction_id, transaction_date FROM mydataset.newtable
Where transaction_date between start_date and finish_date
There are some limitations using a partitioned table because it uses the results saved on cache.
Also, you can see this documentation about some points you need to consider to get the best performance when you create a query.
A very fast query to obtain, for each team, a list of time periods with at least one player active:
create temporary function test(a array<date>,b array<date>)
returns array<struct<a date,b date>>
language js
as """
var out=[];
var start=a[0];
var end=a[0];
for(var i=0;i<a.length;i++)
{
if(a[i]<=end) {if(end<b[i]) end=b[i]}
else {
var tmp={"a":start,"b":end};
out.push(tmp);
start=a[i];
end=b[i];
}
}
out.push({"a":start,"b":end});
return out;
""";
select team_id, test(array_agg(active_from order by active_from),array_agg(active_to order by active_from))
from
dbo.player_history
group by 1
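The loop inside the JavaScript function is a single pass over intervals sorted by start date. A minimal Python sketch of the same idea follows, using the eleven ranges from the question for one team; `merge_intervals` is a hypothetical name, and treating ranges that touch at an endpoint as overlapping is an assumption that matches the `<=` comparison in the function above.

```python
from datetime import date


def merge_intervals(intervals):
    """Merge overlapping (start, end) pairs into islands, in one pass."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1]:
            # Starts inside the current island: extend its end if needed.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            # Gap before this interval: open a new island.
            merged.append([start, end])
    return [tuple(island) for island in merged]


def d(text):
    return date.fromisoformat(text)


# player_id -> (active_from, active_to), copied from the question.
history = {
    1: (d("2020-01-01"), d("2020-01-08")),
    2: (d("2020-06-01"), d("2020-09-08")),
    3: (d("2020-06-10"), d("2020-10-01")),
    4: (d("2020-02-01"), d("2020-02-15")),
    5: (d("2021-01-01"), d("2021-01-08")),
    6: (d("2021-01-02"), d("2022-06-08")),
    7: (d("2021-01-03"), d("2021-06-08")),
    8: (d("2021-01-04"), d("2021-06-08")),
    9: (d("2020-01-02"), d("2021-02-05")),
    10: (d("2020-10-01"), d("2021-04-08")),
    11: (d("2020-11-01"), d("2021-05-08")),
}

islands = merge_intervals(history.values())
# Player 9 bridges the early gaps; leaving that range out splits the islands.
islands_without_9 = merge_intervals(
    span for player, span in history.items() if player != 9
)
```

Sorting dominates the cost, so the work grows as n log n per team rather than with the size of a self-join.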
Your result shows:
It is confusing to show that the start date is within the previous time segment.
If your players are active for only a few years on average, this query gives a list of all dates on which a team consists of one player or fewer.
with tbl_lst as (
Select team_id,date_diff(active_to,active_from,day),
generate_date_array(active_from, active_to, INTERVAL 1 DAY) as day_list
from dbo.player_history )
SELECT team_id,day,sum(active_players) as active_players
FROM (
SELECT team_id,day,count(1) as active_players
from tbl_lst,unnest(tbl_lst.day_list) as day
group by 1,2
Union ALL
Select team_id, day,0 from
(Select team_id,min(active_from) as team_START,max(active_to) as team_END
from dbo.player_history
group by 1),
unnest(generate_date_array(team_START, team_END, INTERVAL 1 DAY)) day
)
group by 1,2
having active_players<2
The following query needs 16 stages and is slow, but it obtains the number of active players for every time interval. Two tables are joined; data_set contains only the distinct dates that bound the intervals, so with daily dates it has at most about 3650 rows for 10 years.
#generate a list of all dates
with dates as (
SELECT active_from as start_date from dbo.player_history
Union ALL SELECT active_to from dbo.player_history
#Union ALL Select timestamp("2050-01-01")
),
# add next date to this list
data_set as (
SELECT Distinct start_date, lead(start_date) over (order by start_date) as end_date
from dates
order by 1
)
# count player at each time
Select team_id, start_date,end_date,
count(player_id) as active_player,
string_agg(cast(player_id as string)) as player_list
from dbo.player_history
RIGHT JOIN
data_Set
on active_from<=start_date and active_to>=end_date
group by 1,2,3
having active_player<2
order by start_date

Pick Max Date from Table

I have a table variable defined thus
DECLARE @DatesTable TABLE
(
Id uniqueidentifier,
FooId uniqueidentifier,
Date date,
Value decimal (26, 10)
)
Id is always unique but FooId is duplicated throughout the table. What I would like to do is to select * from this table for each unique FooId having the max(date). So, if there are 20 rows with 4 unique FooIds then I'd like 4 rows, picking the row for each FooId where the date is the largest.
I've tried using group by but I kept getting errors about various fields not being in the select clause etc.
Use a common table expression with row_number():
;WITH cte AS
(
SELECT Id, FooId, Date, Value,
ROW_NUMBER() OVER(PARTITION BY FooId ORDER BY Date DESC) As rn
FROM @DatesTable
)
SELECT Id, FooId, Date, Value
FROM cte
WHERE rn = 1
Often the most efficient method is a correlated subquery:
select dt.*
from @DatesTable dt
where dt.date = (select max(dt2.date) from @DatesTable dt2 where dt2.fooid = dt.fooid);
However, for this to be efficient, you need an index on (fooid, date). In more recent versions of SQL Server, you can have indexes on table variables. In earlier versions, you can do this using a primary key.
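The correlated-subquery form and its supporting index can be tried out with Python's built-in sqlite3 module. This is a sketch under assumptions, not SQL Server code: SQLite stands in for the table variable, the table name `dates_table`, the index name, and the sample rows are hypothetical, and integer and text keys replace the uniqueidentifier columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE dates_table "
    "(Id INTEGER PRIMARY KEY, FooId TEXT, Date TEXT, Value REAL)"
)
# Index on (FooId, Date) so MAX(Date) per FooId can be read from the index.
conn.execute("CREATE INDEX ix_foo_date ON dates_table (FooId, Date)")
conn.executemany(
    "INSERT INTO dates_table VALUES (?, ?, ?, ?)",
    [
        (1, "A", "2020-01-01", 1.0),
        (2, "A", "2020-03-01", 2.0),
        (3, "B", "2020-02-01", 3.0),
        (4, "B", "2020-01-15", 4.0),
    ],
)

# Keep each row whose Date equals the latest Date for its FooId.
latest = conn.execute(
    """
    SELECT dt.Id, dt.FooId, dt.Date, dt.Value
    FROM dates_table AS dt
    WHERE dt.Date = (SELECT MAX(dt2.Date)
                     FROM dates_table AS dt2
                     WHERE dt2.FooId = dt.FooId)
    ORDER BY dt.FooId
    """
).fetchall()
```

If two rows for the same FooId share the maximum date, this form returns both, whereas the row_number() version returns exactly one row per FooId.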

Using ARRAY_AGG() with DISTINCT and ORDER BY with ORDINAL

I have some data that I am trying to aggregate (greatly simplified here). The raw data uses a schema similar to the following:
UserID - STRING
A - RECORD REPEATED
A.Action - STRING
A.Visit - INTEGER
A.Order - INTEGER
MISC - RECORD REPEATED
( other columns omitted here )
There are many actual records due to the "MISC" column, but I'm only trying to focus on the first 5 columns shown above. A sample of the raw data is shown below (note that the values shown are a sample only, many other values exist so these cannot be hard coded into the query) :
Table 0: (Raw data sample)
(empty values under UserID are as shown in BigQuery - "A" fields are part of a nested record)
My query produces the data shown in Table 1 below. I am trying to use ARRAY_AGG with ORDINAL to select only the first two "Action"s for each user and restructure as shown in TABLE 2.
SELECT
UserId, ARRAY_AGG( STRUCT(A.Action, A.Visit, A.Order)
ORDER BY A.Visit, A.Order, A.Action )
FROM
`table`
LEFT JOIN UNNEST(A) AS A
GROUP BY
UserId
Table 1: (Sample output of above query )
Table 2: (The format needed)
So I need to:
Get distinct "Action" values for each user
Preserve the order ( UserID, Visit, Order )
Show only the 1st and 2nd actions in one row
My attempted query strategy was to ORDER BY UserID, Visit, Order and get DISTINCT values of Action using something like:
UserId,
ARRAY_AGG(DISTINCT Action ORDER BY UserID, Visit, Order) FirstAction,
ARRAY_AGG(DISTINCT Action ORDER BY UserID, Visit, Order) SecondAction
However, that approach produces the following error:
Error: An aggregate function that has both DISTINCT and ORDER BY arguments can only ORDER BY columns that are arguments to the function
Any thoughts on how to correct this error (or an alternative approach?)
Not sure why the original query has DISTINCT, if the results shown in table 2 don't need de-duplication.
With that said:
#standardSQL
WITH sample AS (
SELECT actor.login userid, type action
, EXTRACT(HOUR FROM created_at) visit
, EXTRACT(MINUTE FROM created_at) `order`
FROM `githubarchive.day.20171005`
)
SELECT userid, actions[OFFSET(0)] firstaction, actions[SAFE_OFFSET(1)] secondaction
FROM (
SELECT userid, ARRAY_AGG(action ORDER BY visit, `order` LIMIT 2) actions
FROM sample
GROUP BY 1
ORDER BY 1
LIMIT 100
)
Try below.
#standardSQL
SELECT UserId,
ARRAY_AGG(Action ORDER BY Visit, `Order`, Action LIMIT 2)[SAFE_ORDINAL(1)] AS FirstAction,
ARRAY_AGG(Action ORDER BY Visit, `Order`, Action LIMIT 2)[SAFE_ORDINAL(2)] AS SecondAction
FROM `project.dataset.table`
LEFT JOIN UNNEST(A) AS A
GROUP BY UserId
-- ORDER BY UserId
You can test / play with it using dummy data from your question
#standardSQL
WITH `table` AS (
SELECT 'U001' AS UserId, [STRUCT<Action STRING, Visit INT64, `Order` INT64 >
('Register', 1, 1),('Upgrade', 1, 2),('Feedback', 1, 3),('Share', 1, 4),('Share', 2, 1)] AS A UNION ALL
SELECT 'U002', [STRUCT<Action STRING, Visit INT64, `Order` INT64 >
('Share', 7, 1),('Share', 7, 2),('Refer', 8, 1),('Feedback', 8, 2),('Feedback', 8, 3)] UNION ALL
SELECT 'U003', [STRUCT<Action STRING, Visit INT64, `Order` INT64 >
('Register', 1, 1),('Share', 1, 2),('Share', 1, 3),('Share', 2, 1),('Share', 2, 2),('Share', 3, 1),('Share', 3, 2)]
)
SELECT UserId,
ARRAY_AGG(Action ORDER BY Visit, `Order`, Action LIMIT 2)[SAFE_ORDINAL(1)] AS FirstAction,
ARRAY_AGG(Action ORDER BY Visit, `Order`, Action LIMIT 2)[SAFE_ORDINAL(2)] AS SecondAction
FROM `table`
LEFT JOIN UNNEST(A) AS A
GROUP BY UserId
ORDER BY UserId
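The ordering-then-LIMIT 2 logic of the queries above can be checked outside BigQuery with a short Python sketch over the same dummy data. `first_two_actions` is a hypothetical helper; its optional `distinct` flag is an assumption about what the question meant by distinct actions and is not something the queries above do.

```python
def first_two_actions(events, distinct=False):
    """Return (first, second) actions ordered by visit, order, action."""
    ordered = sorted(events, key=lambda e: (e[1], e[2], e[0]))
    actions = []
    for action, _visit, _order in ordered:
        if distinct and action in actions:
            continue  # skip an action that was already picked
        actions.append(action)
        if len(actions) == 2:
            break
    # Pad with None, as SAFE_ORDINAL returns NULL for a missing element.
    actions += [None] * (2 - len(actions))
    return tuple(actions)


# (Action, Visit, Order) tuples per user, copied from the dummy data above.
table = {
    "U001": [("Register", 1, 1), ("Upgrade", 1, 2), ("Feedback", 1, 3),
             ("Share", 1, 4), ("Share", 2, 1)],
    "U002": [("Share", 7, 1), ("Share", 7, 2), ("Refer", 8, 1),
             ("Feedback", 8, 2), ("Feedback", 8, 3)],
    "U003": [("Register", 1, 1), ("Share", 1, 2), ("Share", 1, 3),
             ("Share", 2, 1), ("Share", 2, 2), ("Share", 3, 1),
             ("Share", 3, 2)],
}

plain = {user: first_two_actions(ev) for user, ev in table.items()}
deduped = {user: first_two_actions(ev, distinct=True)
           for user, ev in table.items()}
```

The two variants differ only for users whose first two events repeat the same action, such as U002.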

remove duplicate records with a criteria

I am using a script that requires unique values only. I have a table with duplicates like the one below, and I need to keep only the unique values (the first occurrence), irrespective of what is inside the brackets.
Can I delete the duplicate records and keep the unique ones using a single query?
Input table
ID Name
1 (Del)testing
2 (Del)test
3 (Delete)testing
4 (Delete)tester
5 (Del)tst
6 (Delete)tst
So the output table should be something like
Output table
ID Name
1 (Del)testing
2 (Del)test
3 (Delete)tester
4 (Del)tst
SELECT DISTINCT * FROM FOO;
It depends on how much data you have to retrieve. If you only have to change Delete -> Del, you can try REPLACE:
http://technet.microsoft.com/en-us/library/ms186862.aspx
Grouping functions should also help you.
I don't think this would be an easy query.
Assumption: The name column always has all strings in the format given in the sample data.
Try this:
;with cte as
(select *, rank() over
(partition by substring(name, charindex(')',name)+1,len(name)+1 - charindex(')',name))
order by id) rn
from tbl
),
filtered_cte as
(select * from cte
where rn = 1
)
select rank() over (partition by getdate() order by id,getdate()) id , name
from filtered_cte
How this works:
The first CTE cte uses rank() to rank the occurrence of the string outside brackets in the name column.
The second CTE filtered_cte only returns the first row for each occurence of the specified string. In this step, we get the expected results, but not in the desired format.
In the final select, we partition by and order by the getdate() function. This function is chosen as a dummy to give us continuous values for the id column while using the rank function as we did in step 1.
Demo here.
Note that this solution will return filtered values, but not delete anything in the source table. If you wish, you can delete from the CTE created in step 1 to remove data from the source table.
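The same keep-the-first-occurrence logic, keyed on the text after the closing bracket, can be sketched in plain Python to check the expected output. `dedupe_ignoring_prefix` is a hypothetical helper, and like the query it assumes every name has the form "(prefix)text".

```python
def dedupe_ignoring_prefix(rows):
    """Keep the lowest-id row per bracket-less name, then renumber from 1."""
    seen = set()
    kept = []
    for _row_id, name in sorted(rows):
        # Text after the first ')' is the de-duplication key.
        _prefix, sep, tail = name.partition(")")
        key = tail if sep else name
        if key not in seen:
            seen.add(key)
            kept.append(name)
    # Renumber so the surviving ids are continuous, as in the final select.
    return list(enumerate(kept, start=1))


rows = [
    (1, "(Del)testing"),
    (2, "(Del)test"),
    (3, "(Delete)testing"),
    (4, "(Delete)tester"),
    (5, "(Del)tst"),
    (6, "(Delete)tst"),
]
result = dedupe_ignoring_prefix(rows)
```

Like the query, this only returns the filtered rows; it does not delete anything from the source.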
First use this update to make them uniform
Update table set name = replace(Name, '(Del)' , '(Delete)')
then delete the repetitive names
Delete from table where id in
(Select id from (Select Row_Number() over(Partition by Name order by id) as rn,* from table) x
where rn > 1)
First create the input data table
CREATE TABLE test
(ID int,Name varchar(20));
INSERT INTO test
(ID, Name)
VALUES
(1, '(Del)testing'),
(2, '(Del)test'),
(3, '(Delete)testing'),
(4, '(Delete)tester'),
(5, '(Del)tst'),
(6, '(Delete)tst');
Select Query
select id, name
from (
select id, name ,
ROW_NUMBER() OVER(PARTITION BY substring(name,PATINDEX('%)%',name)+1,20) ORDER BY name) rn
from test ) t
where rn= 1
order by 1
SQL Fiddle Link
http://www.sqlfiddle.com/#!6/a02b0/34