BigQuery - remove duplicate rows

I have a table with duplicate rows (sometimes 2, 3, or 4 copies) and I need to delete them, leaving only one row of each (the copies are identical, with no date differences).
Is there another way than CREATE OR REPLACE, as recommended by Google?
I've already tried a CTE, ROW_NUMBER() OVER (PARTITION BY ...), and so on, but haven't found a way so far.
Let's say the table looks like this:
id | name
1  | test
1  | test
1  | test
1  | test

You can delete the duplicate rows in a few steps without using CREATE OR REPLACE.
I’m using this example data:
select * from `items`
You can follow these steps:
1. Insert one copy of each row that you want to keep and mark it with ‘--’ (or any other marker that does not occur in your data).
insert into `items` (id, data)
select distinct id,concat(data,'--') from `items`
2. Delete all the rows that are not marked, in this case with ‘--’.
delete from `items` where STRPOS(data,"--")=0;
3. Update the remaining rows to remove the mark, in this case ‘--’.
update `items` set data = substring(data,1,LENGTH(data)-2) where 1=1 ;
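The three steps above can be tried end to end without a BigQuery project. This is only a rough sketch: it uses Python's built-in sqlite3 as a stand-in, so `||`, `instr` and `substr` replace `concat`, `STRPOS` and `substring`, and the second value ("other") is made up for illustration. It also assumes that no original value already contains the marker.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER, data TEXT)")
# Four copies of one row and two copies of a second, hypothetical row.
conn.executemany(
    "INSERT INTO items VALUES (?, ?)",
    [(1, "test")] * 4 + [(2, "other")] * 2,
)

# Step 1: insert one marked copy of each distinct row.
conn.execute(
    "INSERT INTO items (id, data) SELECT DISTINCT id, data || '--' FROM items"
)
# Step 2: delete every row that does not carry the marker.
conn.execute("DELETE FROM items WHERE instr(data, '--') = 0")
# Step 3: strip the two-character marker (substr is 1-based in SQLite).
conn.execute("UPDATE items SET data = substr(data, 1, length(data) - 2)")

rows = conn.execute("SELECT id, data FROM items ORDER BY id").fetchall()
print(rows)
```

If a real value could contain ‘--’, pick a marker that cannot appear in the data, otherwise step 2 will keep rows it should have removed.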

Related

Merge update records in a final table

I have a user table in Hive of the form:
User:
Id String,
Name String,
Col1 String,
UpdateTimestamp Timestamp
I'm inserting data in this table from a file which has the following format:
I/U,Timestamp when record was written to file, Id, Name, Col1, UpdateTimestamp
e.g. for inserting a user with Id 1:
I,2019-08-21 14:18:41.002947,1,Bob,stuff,123456
and updating col1 for the same user with Id 1:
U,2019-08-21 14:18:45.000000,1,,updatedstuff,123457
The columns which are not updated are returned as null.
Simple insertion is easy in Hive: use LOAD DATA INPATH to load the file into a staging table and then ignore the first two fields of the staging table.
However, how would I go about the update statements, so that my final row in Hive looks like the one below?
1,Bob,updatedstuff,123457
I was thinking to insert all rows in a staging table and then perform some sort of merge query. Any ideas?

Typically with a merge statement your "file" would still be unique on ID and the merge statement would determine whether it needs to insert this as a new record, or update values from that record.
However, if the file is non-negotiable and will always have the I/U format, you could break the process up into two steps, the insert, then the updates, as you suggested.
In order to perform updates in Hive, you will need the users table to be stored as ORC and have ACID enabled on your cluster. For my example, I would create the users table with a cluster key, and the transactional table property:
create table test.orc_acid_example_users
(
id int
,name string
,col1 string
,updatetimestamp timestamp
)
clustered by (id) into 5 buckets
stored as ORC
tblproperties('transactional'='true');
After your insert statements, your Bob record would say "stuff" in col1.
As for the updates, you could tackle these with an update or merge statement. I think the key here is the null values. It's important to keep the original name, or col1, or whatever, if the staging table from the file has a null value. Here's a merge example which coalesces the staging table's fields. Basically, if there is a value in the staging table, take that, or else fall back to the original value.
merge into test.orc_acid_example_users as t
using test.orc_acid_example_staging as s
on t.id = s.id
and s.type = 'U'
when matched
then update set name = coalesce(s.name,t.name), col1 = coalesce(s.col1, t.col1);
Now Bob will show "updatedstuff".
Quick disclaimer - if you have more than one update for Bob in the staging table, things will get messy. You will need to have a pre-processing step to get the latest non-null values of all the updates prior to doing the update/merge. Hive isn't really a complete transactional DB - it would be preferred for the source to send full user records any time there's an update, instead of just the changed fields only.
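To make the pre-processing step concrete, here is a small sketch in plain Python (not Hive) of what "latest non-null values of all the updates" means. The tuple layout, the function name `collapse_updates` and the second update row are assumptions made for illustration, not part of the original file format.

```python
# Each staged update is assumed to look like:
# (write_timestamp, id, name, col1, update_timestamp), with None for
# columns that were not changed.
def collapse_updates(rows):
    """Fold all updates per id into one record holding the latest non-null values."""
    latest = {}
    # The timestamp strings in this format sort chronologically as text.
    for _, user_id, name, col1, update_ts in sorted(rows, key=lambda r: r[0]):
        prev_name, prev_col1, _ = latest.get(user_id, (None, None, None))
        latest[user_id] = (
            name if name is not None else prev_name,
            col1 if col1 is not None else prev_col1,
            update_ts,
        )
    return latest


staged = [
    ("2019-08-21 14:18:45.000000", 1, None, "updatedstuff", 123457),
    # Hypothetical second update for the same user, changing only the name.
    ("2019-08-21 14:18:50.000000", 1, "Robert", None, 123458),
]
collapsed = collapse_updates(staged)
print(collapsed)
```

The collapsed result has at most one row per id, so it can be fed to a merge like the one above without the multiple-match problem.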

You can reconstruct each record in the table using last_value() with the ignore-nulls option:
select h.id,
coalesce(h.name, last_value(h.name, true) over (partition by h.id order by h.timestamp)) as name,
coalesce(h.col1, last_value(h.col1, true) over (partition by h.id order by h.timestamp)) as col1,
update_timestamp
from history h;
You can use row_number() and a subquery if you want the most recent record.

Which delete statement is better for deleting millions of rows

I have a table which contains millions of rows.
I want to delete all the data that is over a week old, based on the value of the column last_updated.
Here are my two queries:
Approach 1:
Delete from A where to_date(last_updated,'yyyy-mm-dd')< sysdate-7;
Approach 2:
l_lastupdated varchar2(255) := to_char(sysdate-nvl(p_days,7),'YYYY-MM-DD');
insert into B(ID) select ID from A where LASTUPDATED < l_lastupdated;
delete from A where id in (select id from B);
Which one is better considering performance, safety and locking?

Assuming the delete removes a significant fraction of the data & millions of rows, approach three is to copy the rows you want to keep and swap the tables:
create table tmp as
select * from a where to_date(last_updated,'yyyy-mm-dd') >= sysdate-7;
drop table a;
rename tmp to a;
https://asktom.oracle.com/pls/apex/f?p=100:11:0::::P11_QUESTION_ID:2345591157689
Obviously you'll need to copy over all the indexes, grants, etc. But online redefinition can help with this https://oracle-base.com/articles/11g/online-table-redefinition-enhancements-11gr1
When you get to 12.2, there's another simpler option: a filtered move.
This is an alter table move operation, with an extra clause stating which rows you want to keep:
create table t (
c1 int
);
insert into t values ( 1 );
insert into t values ( 2 );
commit;
alter table t
move including rows where c1 > 1;
select * from t;
C1
2
While you're waiting to upgrade to 12.2+ and if you don't want to use the create-as-select method for some reason then approach 1 is superior:
Both methods delete the same rows from A* => it's the same amount of work to do the delete
Option 1 has one statement; Option 2 has two statements; 2 > 1 => option 2 is more work
*Statement level consistency means you might get different results running the processes. Say another session tries to update an old row that your process will remove.
With just the delete, the update will be blocked until the delete finishes. At which point the row's gone, so the update does nothing.
Whereas if you do the insert first, the other session can update & commit the row before the insert completes. So the update "succeeds". But the delete will then remove it! Which can lead to some unhappy customers...

Your stored dateformat seems suitable for proper sorting, so you could go the other way round and convert sysdate to string:
--this is false today
select * from dual where '2019-06-05' < to_char(sysdate-7, 'YYYY-MM-DD');
--this is true today
select * from dual where '2019-05-05' < to_char(sysdate-7, 'YYYY-MM-DD');
So it would be:
Delete from A where last_updated < to_char(sysdate-7, 'yyyy-mm-dd');
It has the benefit that your default index (if there is one) will be used.
It has the disadvantage of relying on the string/varchar ordering, which might be changed, e.g. by NLS changes (if I remember correctly), so in any case you should do a little testing first.
In the long term, you should of course alter the column to a proper date datatype, but I guess that doesn't help you right now ;)
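As a quick sanity check of the string comparison above, here is a sketch in Python rather than Oracle. The function name `older_than_a_week` and the fixed "today" are assumptions chosen so the result does not depend on when you run it. It also shows why the trick only works for zero-padded YYYY-MM-DD values.

```python
from datetime import date, timedelta


def older_than_a_week(last_updated: str, today: date) -> bool:
    """Compare a stored 'YYYY-MM-DD' string against the cutoff as text."""
    cutoff = (today - timedelta(days=7)).strftime("%Y-%m-%d")
    return last_updated < cutoff


today = date(2019, 6, 10)  # fixed, hypothetical "today"; the cutoff is 2019-06-03

print(older_than_a_week("2019-06-05", today))  # within the last week
print(older_than_a_week("2019-05-05", today))  # more than a week old
# Without zero padding the text order no longer matches the date order:
print(older_than_a_week("2019-5-5", today))
```

The last call returns False even though the date is over a month old, which is one more reason to test against your real data before relying on the comparison.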

If you are trying to delete most of the rows in the table, I would advise you go with a different approach, namely:
create table <new table name> as
select *
from <old table name>
where <predicates for the data you want to keep>;
then
drop table <old table name>;
and finally you can rename the new table back to the old table.
You could always partition the new table (i.e. create the new table with a separate statement containing the partitioning clauses, and then have an insert as select into the new table from the old table).
That way, when you need to delete rows, it's a simple matter of dropping the relevant partition(s).
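The copy, drop and rename sequence described above can be sketched like this. It is only an illustration using Python's sqlite3, so the rename syntax differs from Oracle's, and the table name `a_keep`, the sample rows and the cutoff date are made up. As noted earlier on this page, indexes and grants are not carried over by this kind of copy.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE a (id INTEGER, last_updated TEXT)")
conn.executemany(
    "INSERT INTO a VALUES (?, ?)",
    [(1, "2019-05-05"), (2, "2019-06-05"), (3, "2019-06-09")],
)

# 1. Copy only the rows to keep (hypothetical cutoff: 2019-06-03).
conn.execute(
    "CREATE TABLE a_keep AS SELECT * FROM a WHERE last_updated >= '2019-06-03'"
)
# 2. Drop the old table.
conn.execute("DROP TABLE a")
# 3. Rename the new table back to the old name.
conn.execute("ALTER TABLE a_keep RENAME TO a")

kept_ids = [row[0] for row in conn.execute("SELECT id FROM a ORDER BY id")]
print(kept_ids)
```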

Postgres: SELECT or INSERT in high concurrent write load DB

We have a DB for which we need a "selsert" (not upsert) function.
The function should take a text value and return the id column of the existing row (SELECT), or insert the value and return the id of the new row (INSERT).
There are multiple processes that will need to perform this functionality (selsert)
I have been experimenting with pg_advisory_lock and ON CONFLICT clause for INSERT but am still not sure what approach would work best (even when looking at some of the other answers).
So far I have come up with following
WITH
selected AS (
SELECT id FROM test.body_parts WHERE (lower(trim(part))) = lower(trim('finger')) LIMIT 1
),
inserted AS (
INSERT INTO test.body_parts (part)
SELECT trim('finger')
WHERE NOT EXISTS ( SELECT * FROM selected )
-- ON CONFLICT (lower(trim(part))) DO NOTHING -- not sure if this is needed
RETURNING id
)
SELECT id, 'inserted' FROM inserted
UNION
SELECT id, 'selected' FROM selected
Will the above query (within a function) ensure consistency in high-concurrency write workloads?
Are there any other issues I must consider (locking, etc.)?
BTW, I can ensure that there are no duplicate values of (part) by creating a unique index. That is not an issue. What I am after is that the SELECT returns the existing value if another process does the INSERT (I hope I am explaining this right).
Unique index would have following definition
CREATE UNIQUE INDEX body_parts_part_ux
ON test.body_parts
USING btree
(lower(trim(part)));
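For readers who want to see the intended behaviour of such a "selsert" in isolation, here is a rough sketch. It is not taken from the question and it is not Postgres: it uses Python's sqlite3 on a single connection, with `INSERT OR IGNORE` as a stand-in for an insert that does nothing on a unique-index conflict, so it says nothing about how the query above behaves under concurrent writers. The function name `selsert` is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE body_parts (id INTEGER PRIMARY KEY, part TEXT)")
# Same normalisation as the unique index in the question.
conn.execute(
    "CREATE UNIQUE INDEX body_parts_part_ux ON body_parts (lower(trim(part)))"
)


def selsert(conn, value):
    """Return the id for value, inserting it first if it is not present."""
    # The insert is skipped when the unique index already holds the value.
    conn.execute("INSERT OR IGNORE INTO body_parts (part) VALUES (trim(?))", (value,))
    # Whether or not the insert happened, the row exists now, so select it.
    row = conn.execute(
        "SELECT id FROM body_parts WHERE lower(trim(part)) = lower(trim(?))",
        (value,),
    ).fetchone()
    return row[0]


first = selsert(conn, "finger")
second = selsert(conn, "  Finger ")  # same value after trim/lower
third = selsert(conn, "toe")
print(first, second, third)
```

The point of the sketch is only the order of operations: let the unique index decide whether the insert happens, then always read the id back.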

How to Update a Single record despite multiple Occurrences of the same ID Number?

I have a table that looks like the below table:
Every time the user loans a book, a new record is inserted.
The data in this table is derived from another table which has no dates.
I need to update this table based on the records in the other table, meaning I only need to update what has changed.
Example: Let's say the user returns the book Starship Troopers and the book return is set to Yes.
How do I update just that column?
What I have tried:
I tried using the MERGE statement, but it works only with unique rows of data, meaning you get an error if the same ID appears more than once.
I also tried using a basic UPDATE statement and a JOIN, but that's not going well.
I am asking because I have run out of ideas.
Thanks for reading

If you need to update BooksReturn in the target table based on the same column in the source table, join on both the user and the book:
UPDATE t
SET t.booksreturn = s.booksreturn
FROM target t JOIN source s
ON t.userid = s.userid
AND t.booksloaned = s.booksloaned
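To see why matching on both userid and booksloaned touches only the one record, here is a small runnable sketch. It uses Python's sqlite3 and, to stay portable across SQLite versions, a correlated subquery instead of the UPDATE ... FROM join shown above, so treat it as a swapped-in equivalent rather than the same statement. The "Dune" rows are invented sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (userid INTEGER, booksloaned TEXT, booksreturn TEXT)")
conn.execute("CREATE TABLE source (userid INTEGER, booksloaned TEXT, booksreturn TEXT)")
# The same user appears twice in target, once per loaned book.
conn.executemany(
    "INSERT INTO target VALUES (?, ?, ?)",
    [(1, "Starship Troopers", "No"), (1, "Dune", "No")],
)
conn.executemany(
    "INSERT INTO source VALUES (?, ?, ?)",
    [(1, "Starship Troopers", "Yes"), (1, "Dune", "No")],
)

# Match on user AND book so each target row picks up its own source row.
conn.execute(
    """
    UPDATE target
    SET booksreturn = (SELECT s.booksreturn FROM source s
                       WHERE s.userid = target.userid
                         AND s.booksloaned = target.booksloaned)
    WHERE EXISTS (SELECT 1 FROM source s
                  WHERE s.userid = target.userid
                    AND s.booksloaned = target.booksloaned)
    """
)

rows = conn.execute(
    "SELECT userid, booksloaned, booksreturn FROM target ORDER BY booksloaned"
).fetchall()
print(rows)
```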

You can do this with simple UPDATE and INSERT statements.
Say you have two tables, A and B.
From B, you want to insert rows into A if they do not exist there yet, and otherwise update the existing rows.
First, insert the missing rows into a temp table:
SELECT *
INTO #MYTEMP
FROM B
WHERE BOOKSLOANED NOT IN (SELECT BOOKSLOANED
FROM A)
Second, insert that data into A:
INSERT INTO A
SELECT *
FROM #MYTEMP
Finally, write one simple UPDATE statement that updates all rows of A from B. Rows that changed pick up the new values; the rest stay as they are.
You can also update from the #MYTEMP table.

Row number in Sybase tables

Sybase db tables do not have a concept of self-updating row numbers. However, for one of the modules, I require a row number for each row in the table such that max(Column) always tells me the number of rows in the table.
I thought I'd introduce an int column and keep updating it to track the row number. However, I'm having problems updating this column when rows are deleted. What SQL should I use in a delete trigger to update this column?

You can easily assign a unique number to each row by using an identity column. The identity can be a numeric or an integer (in ASE12+).
This will almost do what you require. There are certain circumstances in which you will get a gap in the identity sequence. (These are called "identity gaps", the best discussion on them is here). Also deletes will cause gaps in the sequence as you've identified.
Why do you need to use max(col) to get the number of rows in the table, when you could just use count(*)? If you're trying to get the last row from the table, then you can do
select * from table where column = (select max(column) from table).
Regarding the delete trigger to update a manually managed column, I think this would be a potential source of deadlocks, and many performance issues. Imagine you have 1 million rows in your table, and you delete row 1, that's 999999 rows you now have to update to subtract 1 from the id.
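The difference between max(col) and count(*) after a delete is easy to reproduce. This sketch uses Python's sqlite3, with an AUTOINCREMENT key standing in for a Sybase identity column, so it illustrates the gap problem in general rather than ASE's exact behaviour.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# AUTOINCREMENT plays the role of the identity column here.
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY AUTOINCREMENT, val TEXT)")
conn.executemany("INSERT INTO t (val) VALUES (?)", [("a",), ("b",), ("c",)])

# Deleting a row in the middle leaves a gap in the sequence.
conn.execute("DELETE FROM t WHERE id = 2")

max_id, row_count = conn.execute("SELECT max(id), count(*) FROM t").fetchone()
print(max_id, row_count)
```

After the delete, max(id) still reports 3 while count(*) reports 2, which is why count(*) is the safer way to get the number of rows.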

Delete trigger
CREATE TRIGGER tigger ON myTable FOR DELETE
AS
update myTable
set id = id - (select count(*) from deleted d where d.id < t.id)
from myTable t
To avoid locking problems
You could add an extra table (which joins to your primary table) like this:
CREATE TABLE rowCounter
(id int, -- foreign key to main table
rownum int)
... and use the rownum field from this table.
If you put the delete trigger on this table then you would hugely reduce the potential for locking problems.
Approximate solution?
Does the table need to keep its rownumbers up to date all the time?
If not, you could have a job which runs every minute or so, which checks for gaps in the rownum, and does an update.
Question: do the rownumbers have to reflect the order in which rows were inserted?
If not, you could do far fewer updates by only updating the most recent rows, "moving" them into gaps.
Leave a comment if you would like me to post any SQL for these ideas.
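The last idea, moving the most recent rows into gaps, can be sketched in a few lines of plain Python. This is only one possible reading of the idea, with a made-up function name `fill_gaps`. It assumes row numbers are unique positive integers and that their order does not need to reflect insertion order.

```python
def fill_gaps(rownums):
    """Return {old_rownum: new_rownum} moves that leave rownums equal to 1..n."""
    present = set(rownums)
    n = len(present)
    # Positions in 1..n that no row currently occupies.
    gaps = sorted(r for r in range(1, n + 1) if r not in present)
    # Rows numbered above n have to move; there are exactly as many as gaps.
    movers = sorted((r for r in present if r > n), reverse=True)
    return dict(zip(movers, gaps))


# Rows 3 and 6 were deleted from a table that had rownums 1..7.
moves = fill_gaps([1, 2, 4, 5, 7])
print(moves)
```

Here only one row has to be updated (7 becomes 3), instead of shifting rows 4, 5 and 7 down as the delete trigger would.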

I'm not sure why you would want to do this. You could experiment with using temporary tables and "select into" with an Identity column like below.
create table test
(
col1 int,
col2 varchar(3)
)
insert into test values (100, "abc")
insert into test values (111, "def")
insert into test values (222, "ghi")
insert into test values (300, "jkl")
insert into test values (400, "mno")
select rank = identity(10), col1 into #t1 from test
select * from #t1
delete from test where col2="ghi"
select rank = identity(10), col1 into #t2 from test
select * from #t2
drop table test
drop table #t1
drop table #t2
This would give you a dynamic id (of sorts)
