BigQuery SQL limit update to one row

I have a BigQuery table with three columns, and a row can have the same values as a previous one.
For example:
| col_a | col_b | col_c
+-------+-------+------------
| 123   | 3     | 2019-12-12
| 123   | 3     | 2019-12-12
| 234   | 11    | 2019-10-12
Now I want to add a new column named col_d with a UUID in it.
The problem is that when I try to execute an UPDATE command, I have no way to update only one row at a time (because some rows have the same values and I want a different UUID in each of them).
Things I tried with no luck :(
LIMIT
UPDATE table
SET col_d = GENERATE_UUID()
LIMIT 1
I thought I could fetch all rows and then traverse them with an update command, but there's no LIMIT on UPDATE commands in BigQuery.
ROW_NUMBER
UPDATE table
SET col_d = ROW_NUMBER() OVER()
But BigQuery doesn't allow analytic functions in an UPDATE command.
INSERT
I can query all rows, insert them with a UUID, and then delete all the old ones that have no UUID. That approach will work and it will be my fallback, but I believe there's a better way, so I'm asking here.
Any other idea or advice will be welcome.

Below is for BigQuery Standard SQL and produces a different UUID for each and every row, no matter how many duplicates there are:
UPDATE `project.dataset.table`
SET col_d = GENERATE_UUID()
WHERE TRUE
Note: based on your "Insert and then Delete" option - I assume that col_d already exists in your table - otherwise you would not be able to do DELETE FROM table WHERE col_d IS NULL as you mentioned in your comments

You can SELECT the data with a UUID as a fourth column (col_d) and then save that data as a new table.
SELECT col_a, col_b, col_c, GENERATE_UUID() AS col_d
FROM table
This will generate the output you desire:
| col_a | col_b | col_c      | col_d
+-------+-------+------------+--------------------------------------
| 123   | 3     | 2019-12-12 | e3784e4d-59bb-433b-a9ac-3df318e0f675
| 123   | 3     | 2019-12-12 | 430d034a-6292-4f5e-b1b0-0ee5550af3f6
| 234   | 11    | 2019-10-12 | 3e7e14d2-3077-4030-a704-5a2b7fc3c11e
Since BigQuery does not allow adding a column with data the way traditional SQL does, the following should create a new table with the UUID values added:
CREATE OR REPLACE TABLE table AS
SELECT *, GENERATE_UUID() AS col_d
FROM table
Be warned that the table's history may be lost, so back it up first. You should always back up data before running updates like this, since undesired outcomes do arise.
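For example, a minimal way to take such a backup beforehand (the _backup name is just an illustration):
CREATE TABLE `project.dataset.table_backup` AS
SELECT *
FROM `project.dataset.table`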

because some rows have the same values and I want a different UUID in each of them
This should do what you want:
UPDATE table
SET col_d = GENERATE_UUID()
I don't understand why you would use LIMIT if you want to update all rows.
That said, BigQuery has restrictions on UPDATEs, so the CREATE TABLE approach suggested by fromthehills seems more appropriate.

Is there a way to ensure WHERE clause happens after DISTINCT?

Imagine you have a table comments in your database.
The comments table has the columns id, comment_id_no, text, show, and inserted_at.
If a user enters a comment, it inserts a row into the database
| id | comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ---- | ----------- |
| 1 | 1 | hi | true | 1/1/2000 |
If a user wants to update that comment it inserts a new row into the db
| id | comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ---- | ----------- |
| 1 | 1 | hi | true | 1/1/2000 |
| 2 | 1 | hey | true | 1/1/2001 |
Notice it keeps the same comment_id_no. This is so we will be able to see the history of a comment.
Now the user decides that they no longer want to display their comment
| id | comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ----- | ----------- |
| 1 | 1 | hi | true | 1/1/2000 |
| 2 | 1 | hey | true | 1/1/2001 |
| 3 | 1 | hey | false | 1/1/2002 |
This hides the comment from the end users.
Now a second comment is made (not an update of the first)
| id | comment_id_no | text | show | inserted_at |
| -- | -------------- | ---- | ----- | ----------- |
| 1 | 1 | hi | true | 1/1/2000 |
| 2 | 1 | hey | true | 1/1/2001 |
| 3 | 1 | hey | false | 1/1/2002 |
| 4 | 2 | new | true | 1/1/2003 |
What I would like to be able to do is select the latest version of each unique comment_id_no, where show is equal to true. However, I do not want the query to return id=2.
Steps the query needs to take...
select all the most recent, distinct comment_id_nos. (should return id=3 and id=4)
select where show = true (should only return id=4)
Note: I am actually writing this query in Elixir using Ecto, and would like to be able to do this without using the subquery function. If anyone can answer this in SQL I can convert the answer myself. If anyone knows how to answer this in Elixir then also feel free to answer.
You can do this without using a subquery using LEFT JOIN:
SELECT c.id, c.comment_id_no, c.text, c.show, c.inserted_at
FROM Comments AS c
LEFT JOIN Comments AS c2
ON c2.comment_id_no = c.comment_id_no
AND c2.inserted_at > c.inserted_at
WHERE c2.id IS NULL
AND c.show = 'true';
I think all other approaches will require a subquery of some sort; this would usually be done with a ranking function:
SELECT c.id, c.comment_id_no, c.text, c.show, c.inserted_at
FROM ( SELECT c.id,
              c.comment_id_no,
              c.text,
              c.show,
              c.inserted_at,
              ROW_NUMBER() OVER(PARTITION BY c.comment_id_no
                                ORDER BY c.inserted_at DESC) AS RowNumber
       FROM Comments AS c
     ) AS c
WHERE c.RowNumber = 1
AND c.show = 'true';
Since you have tagged the question with PostgreSQL, you could also make use of DISTINCT ON ():
SELECT *
FROM ( SELECT DISTINCT ON (c.comment_id_no)
              c.id, c.comment_id_no, c.text, c.show, c.inserted_at
       FROM Comments AS c
       ORDER BY c.comment_id_no, inserted_at DESC
     ) x
WHERE show = 'true';
Examples on DB<>Fiddle
I think you want:
select c.*
from comments c
where c.inserted_at = (select max(c2.inserted_at)
                       from comments c2
                       where c2.comment_id_no = c.comment_id_no
                      ) and
      c.show = 'true';
I don't understand what this has to do with select distinct. You simply want the last version of a comment, and then to check if you can show that.
EDIT:
In Postgres, I would do:
select c.*
from (select distinct on (comment_id_no) c.*
      from comments c
      order by c.comment_id_no, c.inserted_at desc
     ) c
where c.show
distinct on usually has pretty good performance characteristics.
As I said in the comments, I don't advise polluting data tables with history/audit stuff.
And no: the "double versioning" suggested by #Josh_Eller in his comment isn't a good solution either, not only because it complicates queries unnecessarily, but also because it is much more expensive in terms of processing and tablespace fragmentation.
Keep in mind that UPDATE operations never update anything in place. They instead write a whole new version of the row and mark the old one as deleted. That's why vacuum processes are needed to defragment tablespaces and recover that space.
In any case, apart from being suboptimal, that approach forces you to implement more complex queries to read and write data, while in fact, I suppose, most of the time you will only need to select, insert, update or delete a single row, and only occasionally look its history up.
So the best solution (IMHO) is to simply implement the schema you actually need for your main task, and keep the audit trail in a separate table maintained by a trigger.
This would be much more:
Robust and simple: because you focus on a single thing at a time (Single Responsibility and KISS principles).
Fast: audit operations can be performed in an AFTER trigger, so every time you perform an INSERT, UPDATE, or DELETE, any possible lock within the transaction has already been released, because the database engine knows that its outcome won't change.
Efficient: an update will, of course, insert a new row and mark the old one as deleted, but this is done at a low level by the database engine. More than that, your audit data will stay fully unfragmented (because you only ever insert there, never update), so the overall fragmentation will always be much lower.
That being said, how to implement it?
Suppose this simple schema:
create table comments (
    text text,
    mtime timestamp not null default now(),
    id serial primary key
);

create table comments_audit ( -- Or audit.comments if using separate schema
    text text,
    mtime timestamp not null,
    id integer,
    rev integer not null,
    primary key (id, rev)
);
...and then this function and trigger:
create or replace function fn_comments_audit()
returns trigger
language plpgsql
security definer
-- SECURITY DEFINER allows you to restrict permissions on the audit table,
-- because the function is executed as the user who defined it rather than
-- the user who executed the statement that triggered it.
as $$
DECLARE
BEGIN
    if TG_OP = 'DELETE' then
        raise exception 'FATAL: Deletion is not allowed for %', TG_TABLE_NAME;
        -- If you want to allow deletion there are a few more decisions to take...
        -- So here I block it for the sake of simplicity ;-)
    end if;

    insert into comments_audit (
        text
        , mtime
        , id
        , rev
    ) values (
        NEW.text
        , NEW.mtime
        , NEW.id
        , coalesce (
            (select max(rev) + 1 from comments_audit where id = NEW.id)
            , 0
        )
    );

    return NULL;
END;
$$;

create trigger tg_comments_audit
    after insert or update or delete
    on public.comments
    for each row
    execute procedure fn_comments_audit()
;
And that's all.
Notice that with this approach you will always have your current comments data in comments_audit as well. You could instead have used the OLD record and defined the trigger only for UPDATE (and DELETE) operations to avoid that.
But I prefer this approach not only because it gives us extra redundancy (if an accidental deletion happened on the master table, in case it were ever allowed or the trigger were accidentally disabled, we would be able to recover all the data from the audit table), but also because it simplifies (and optimises) querying the history when it's needed.
Now you only need to insert, update or select (or even delete, if you develop this schema a little further, e.g. by inserting a row with nulls...) in a fully transparent manner, just as if there were no audit system at all. And when you need the historical data, you only need to query the audit table instead.
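For example, reading back the full history of a single comment is then a plain query on the audit table (the id value here is just an example):
select *
from comments_audit
where id = 42
order by rev;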
NOTE: You may additionally want to include a creation timestamp (ctime). In that case it would be a good idea to prevent it from being modified in a BEFORE trigger. I omitted it (again for the sake of simplicity) because you can already deduce it from the mtime values in the audit table, but if you are going to use it in your application it is very advisable to add it.
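If you do add such a ctime column to comments, a minimal sketch of that BEFORE trigger idea could look like this (the column and the function/trigger names here are hypothetical, not part of the schema above):
-- Hypothetical: assumes comments has gained a "ctime timestamp" column.
create or replace function fn_comments_protect_ctime()
returns trigger
language plpgsql
as $$
BEGIN
    NEW.ctime := OLD.ctime;  -- discard any attempt to change the creation timestamp
    return NEW;
END;
$$;

create trigger tg_comments_protect_ctime
    before update
    on public.comments
    for each row
    execute procedure fn_comments_protect_ctime();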
If you are running Postgres 8.4 or higher, ROW_NUMBER() is the most efficient solution:
SELECT *
FROM (
    SELECT c.*, ROW_NUMBER() OVER(PARTITION BY comment_id_no ORDER BY inserted_at DESC) rn
    FROM comments c
    WHERE c.show = 'true'
) x WHERE rn = 1
Otherwise, this can also be achieved using a NOT EXISTS condition that ensures you are showing the latest comment:
SELECT c.*
FROM comments c
WHERE
    c.show = 'true'
    AND NOT EXISTS (
        SELECT 1
        FROM comments c1
        WHERE c1.comment_id_no = c.comment_id_no AND c1.inserted_at > c.inserted_at
    )
You have to use GROUP BY to get the latest ids and then join to the comments table to filter out the rows where show = false:
select c.*
from comments c inner join (
    select comment_id_no, max(id) maxid
    from comments
    group by comment_id_no
) g on g.maxid = c.id
where c.show = 'true'
I assume that the column id is unique and auto-incrementing in the comments table.
See the demo

SQL query to get latest user to update record

I have a postgres database that contains an audit log table which holds a historical log of updates to documents. It contains which document was updated, which field was updated, which user made the change, and when the change was made. Some sample data looks like this:
doc_id | user_id | created_date | field | old_value | new_value
--------+---------+------------------------+-------------+---------------+------------
A | 1 | 2018-07-30 15:43:44-05 | Title | | War and Piece
A | 2 | 2018-07-30 15:45:13-05 | Title | War and Piece | War and Peas
A | 1 | 2018-07-30 16:05:59-05 | Title | War and Peas | War and Peace
B | 1 | 2018-07-30 15:43:44-05 | Description | test 1 | test 2
B | 2 | 2018-07-30 17:45:44-05 | Description | test 2 | test 3
You can see that the Title of document A was changed three times, first by user 1 then by user 2, then again by user 1.
Basically I need to know which user was the last one to update a field on a particular document. So for example, I need to know that User 1 was the last user to update the Title field on document A. I don't really care what time it happened, just the document, field, and user.
So sample output would be something like this:
doc_id | field | user_id
--------+-------------+---------
A | Title | 1
B | Description | 2
Seems like it should be fairly straightforward query to write but I'm having some trouble with it. I would think that group by would be in order but the problem is that if I group by doc_id I lose the user data:
select doc_id, max(created_date)
from document_history
group by doc_id;
doc_id | max
--------+------------------------
B | 2018-07-30 15:00:00-05
A | 2018-07-30 16:00:00-05
I could join these results back to the document_history table, but I would need to do so based on the doc_id and timestamp, which doesn't seem quite right. If two people edited a document at the exact same time, I would get multiple rows back for that document and field. Maybe that's so unlikely I shouldn't worry about it, but still...
Any thoughts on a way to do this in a single query?
You want to filter the records, so think where, not group by:
select dh.*
from document_history dh
where dh.created_date = (select max(dh2.created_date)
                         from document_history dh2
                         where dh2.doc_id = dh.doc_id
                        );
In most databases, this will have better performance than a group by, if you have an index on document_history(doc_id, created_date).
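For example (the index name is arbitrary):
create index idx_document_history_doc_created
    on document_history (doc_id, created_date);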
If your DBMS supports window functions (e.g. PostgreSQL, SQL Server; aka analytic functions in Oracle), you could do something like this (SQLFiddle with Postgres; other systems might differ slightly in syntax):
http://sqlfiddle.com/#!17/981af/4
SELECT DISTINCT
doc_id, field,
first_value(user_id) OVER (PARTITION BY doc_id, field ORDER BY created_date DESC) as last_user
FROM get_last_updated
first_value() OVER (... ORDER BY x DESC) orders the window frames/partitions descending and then takes the first value which is your latest time stamp.
I added the DISTINCT to get your expected result. The window function just adds a new column to your SELECT result, with the same value within each partition. If you do not need it, remove it, and then you can work with the original data plus the newly derived information.

How to delete Hive table records?

How do I delete records from a Hive table? We have 100 records and I need to delete only 10 of them.
When I use
dfs -rmr table_name
the whole table is deleted.
If there is any chance to delete in HBase instead, I can send the data to HBase.
You cannot delete directly from a Hive table.
However, you can work around it by overwriting the table with the rows you want to keep:
INSERT OVERWRITE TABLE table_name
SELECT * FROM table_name
WHERE id IN (1,2,3,...)
You can't delete data from Hive tables since it is already written to files in HDFS. You can only drop partitions, which deletes directories in HDFS. So the best practice is to have partitions if you want to delete data in the future.
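For example, if the table were partitioned by a date column (dt here is only an illustrative partition column, not one from the question):
ALTER TABLE table_name DROP IF EXISTS PARTITION (dt='2019-10-01');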
To delete records in a table, you can use the following SQL syntax from your Hive client:
DELETE FROM tablename [WHERE expression]
Try WHERE with your key in an IN clause:
DELETE FROM tablename where id in (select id from tablename limit 10);
Example:-
I have an ACID transactional table in Hive:
select * from trans;
+-----+-------+--+
| id | name |
+-----+-------+--+
| 2 | hcc |
| 1 | hi |
| 3 | hdp |
+-----+-------+--+
Now I want to delete only id 2, so my delete statement would be:
delete from trans where id in (select id from trans limit 1);
Result:-
select * from trans;
+-----+-------+--+
| id | name |
+-----+-------+--+
| 1 | hi |
| 3 | hdp |
+-----+-------+--+
So we have just deleted the first record; in the same way, you can specify LIMIT 10 and Hive will delete the first 10 records.
You can add ORDER BY or other clauses to the subquery if you need to delete only the first 10 rows in a specific order (for example, deleting ids 1 through 10).
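For example, a sketch against the same trans table that deletes the 10 lowest ids in a defined order:
delete from trans where id in (select id from trans order by id limit 10);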

Finding & updating duplicate rows

I need to implement a query (or maybe a stored procedure) that will perform soft de-duplication of data in one of my tables. If any two records are similar enough, I need to "squash" them: deactivate one and update another.
The similarity is based on a score, which is calculated the following way:
from both records, take the values of column A,
values equal? add A1 to the score,
values not equal? subtract A2 from the score,
move on to the next column.
As soon as all desired value pairs are checked:
is the resulting score more than X?
yes – the records are duplicates: mark the older record as "duplicate" and append its id to the duplicate_ids column of the newer record.
no – do nothing.
How would I approach solving this task in SQL?
The table in question is called people. People records are entered by different admins. The de-duplication process exists to make sure no two records for the same person exist in the system.
The motivation for the task is simple: performance.
Right now the solution is implemented in a scripting language via several sub-par SQL queries and logic on top of them. However, the volume of data is expected to grow to tens of millions of records, and the script will eventually become very slow (it should run via cron every night).
I'm using postgresql.
It appears that de-duplication is generally a tough problem.
I found this: https://github.com/dedupeio/dedupe. There's a good description of how this works: https://dedupe.io/documentation/how-it-works.html.
I'm going to explore dedupe. I'm not going to try to implement it in SQL.
If I get you correctly, this could help.
You can use PostgreSQL Window Functions to get all the duplicates and use "weights" to determine which records are duplicated so you can do whatever you like with them.
Here is an example:
-- Temporary table for the test; the primary key is id and
-- we have colA, colB, colC columns plus a creation date:
CREATE TEMP TABLE test
(id serial, "colA" text, "colB" text, "colC" text, creation_date date);
-- Insert test data:
INSERT INTO test ("colA", "colB", "colC",creation_date) VALUES
('A','B','C','2017-05-01'),('D','E','F','2017-06-01'),('A','B','D','2017-08-01'),
('A','B','R','2017-09-01'),('C','J','K','2017-09-01'),('A','C','J','2017-10-01'),
('C','W','K','2017-10-01'),('R','T','Y','2017-11-01');
-- SELECT * FROM test
-- id | colA | colB | colC | creation_date
-- ----+-------+-------+-------+---------------
-- 1 | A | B | C | 2017-05-01
-- 2 | D | E | F | 2017-06-01
-- 3 | A | B | D | 2017-08-01 <-- Duplicate A,B
-- 4 | A | B | R | 2017-09-01 <-- Duplicate A,B
-- 5 | C | J | K | 2017-09-01
-- 6 | A | C | J | 2017-10-01
-- 7 | C | W | K | 2017-10-01 <-- Duplicate C,K
-- 8 | R | T | Y | 2017-11-01
-- Here is the query you can use to get the ids of the duplicate records
-- (the numbered comments read from the inside out):

-- Third, select the id of the duplicates
SELECT id
FROM
(
    -- Second, select all the columns needed and weight the duplicates.
    -- You don't need to select every column; if only the id is needed
    -- then you can select just the id.
    -- Run this SQL on its own to see the intermediate results:
    SELECT
        id, "colA", "colB", "colC", creation_date,
        -- The weights are simple: if the row count is more than 1 then assign 1,
        -- if the row count is 1 then assign 0; sum them all and you have a
        -- total weight of 'duplicity'.
        CASE WHEN "num_colA">1 THEN 1 ELSE 0 END +
        CASE WHEN "num_colB">1 THEN 1 ELSE 0 END +
        CASE WHEN "num_colC">1 THEN 1 ELSE 0 END as weight
    FROM
    (
        -- First, select using window functions and assign a row number.
        -- You can run this query separately to see its results.
        SELECT *,
            -- NOTE that it is ordered by id; if needed you can order by creation_date instead
            row_number() OVER(PARTITION BY "colA" ORDER BY id) as "num_colA",
            row_number() OVER(PARTITION BY "colB" ORDER BY id) as "num_colB",
            row_number() OVER(PARTITION BY "colC" ORDER BY id) as "num_colC"
        FROM test ORDER BY id
    ) count_column_duplicates
) duplicates
-- HERE YOU DEFINE WHICH WEIGHT TO SELECT; for the test,
-- I defined it as the ones with a weight greater than 1
WHERE weight>1
-- The full SQL returns all the duplicates according to the selected weight:
-- id
-- ----
-- 3
-- 4
-- 7
You can add this query to a stored procedure so you can run it whenever you like. Hope it helps.
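If you then want to perform the "squash" step described in the question, here is a rough sketch on the same test table; the is_duplicate and duplicate_ids column names are hypothetical (the question only mentions a duplicate flag and a duplicate_ids column), so adapt them to your people table:
-- Add the hypothetical columns to the demo table.
ALTER TABLE test ADD COLUMN is_duplicate boolean DEFAULT false;
ALTER TABLE test ADD COLUMN duplicate_ids integer[];

-- Squash one pair found above: ids 3 and 4 share colA/colB, and 3 is the older record.
UPDATE test SET is_duplicate = true WHERE id = 3;
UPDATE test SET duplicate_ids = array_append(duplicate_ids, 3) WHERE id = 4;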

Increasing the id by +1 without changing the content of a column

I have this random table with random contents.
id | name     | mission
1  | aaaa     | kitr
2  | bbbb     | etre
3  | ccccc    | qwqw
4  | dddd     | qwert
5  | eeee     | potentials
6  | ffffffff | toto
What I want is to insert into the above table a new row with id=3 that has a different name and mission, BUT the OLD id=3 row should become id=4 while keeping the name and mission it had when it was id=3, the OLD id=4 should become id=5 with its previous name and mission, and so on.
It's like I want to insert a row in between the existing rows, and for the rows below it I want to increase their ids by +1 while their contents stay the same. Example below:
id | name     | mission
1  | aaaa     | kitr
2  | bbbb     | etre
3  | zzzzzz   | zzzzz
4  | ccccc    | qwqw
5  | dddd     | qwert
6  | eeee     | potentials
7  | ffffffff | toto
Why do I want to do this? I have a table that has 2 CLOBs. Inside those CLOBs there are different queries, e.g. id=1 has the CLOB for creating a table, id=2 the inserts for its columns, id=3 the creation of another table, id=4 the functions.
If you put all of these ids together in one text (or CLOB), they have to run as create, then inserts, then create, then functions. That table is like one huge script.
Why am I doing this? The developers are building their application and they want the SQL to run in a specific order. I have 6 developers, and I am organizing the data modeling, the performance, and how the scripts are run. So the above table is there to organize the order in which the scripts they want are called.
Simply put, don't do it.
This case highlights why you should never use any business value, i.e. any 'real world values' for a Primary Key.
In your case I would recommend primary keys not be used for any other purposes.
I recommend you add an extra column 'order' and then change THAT column in order to re-order the rows. That way your primary key and all the other records will not need to be touched.
This avoids the issue that your approach would need to change ALL the database records below the current record, which seems like a really bad approach. Just imagine trying to undo that update ;)
Some more info here: https://stackoverflow.com/a/8777574/631619
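For example, a rough sketch of that idea (PostgreSQL-flavoured syntax; seq_order and the inserted values are only illustrative):
-- Add an explicit ordering column instead of renumbering primary keys.
ALTER TABLE random_table ADD COLUMN seq_order numeric;

-- Start from the current order.
UPDATE random_table SET seq_order = id;

-- Slot a new row between the rows currently ordered 2 and 3
-- by giving it an in-between value.
INSERT INTO random_table (id, name, mission, seq_order)
VALUES (7, 'zzzzzz', 'zzzzz', 2.5);

-- Read the rows back in the desired order.
SELECT * FROM random_table ORDER BY seq_order;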
UPDATE random_table r1
SET id = (SELECT CASE WHEN id > 2 THEN id+1 ELSE id END id
          FROM random_table r2
          WHERE r1.mission = r2.mission
         )
Then insert the new value.