How to delete Hive table records? - hive

How can I delete Hive table records? We have 100 records and I need to delete only 10 of them.
When I use
dfs -rmr table_name
the whole table is deleted.
If it is easier to delete records in HBase, I could move the data to HBase instead.

You cannot delete rows directly from a Hive table.
However, you can use a workaround: overwrite the table with only the rows you want to keep.
INSERT OVERWRITE TABLE table_name
SELECT * FROM table_name
WHERE id NOT IN (1,2,3,...)
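For the record, the overwrite-to-delete pattern is easy to try outside a cluster. A minimal sketch using Python's built-in sqlite3 as a stand-in for Hive (SQLite has no INSERT OVERWRITE, so the rewrite-and-swap is spelled out; the table name t and the 100-row dataset are made up for illustration):

```python
import sqlite3

# In-memory database; table "t" stands in for the Hive table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?)",
                 [(i, f"row{i}") for i in range(1, 101)])  # 100 records

# Emulate INSERT OVERWRITE: write only the rows we want to KEEP into a
# fresh table, then swap it in place of the original.
to_delete = list(range(1, 11))                 # the 10 ids to remove
placeholders = ",".join("?" * len(to_delete))
conn.execute("CREATE TABLE t_new (id INTEGER, name TEXT)")
conn.execute(f"INSERT INTO t_new SELECT * FROM t "
             f"WHERE id NOT IN ({placeholders})", to_delete)
conn.execute("DROP TABLE t")
conn.execute("ALTER TABLE t_new RENAME TO t")

remaining = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
print(remaining)  # 90
```

The net effect is the same as Hive's overwrite: the unwanted rows simply don't survive the rewrite.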

You can't delete individual rows from (non-ACID) Hive tables, since the data is already written to files in HDFS. You can only drop partitions, which deletes the corresponding directories in HDFS. So the best practice is to partition the table if you expect to delete data in the future.

To delete records in a table, you can use the standard SQL syntax from your Hive client (this requires a transactional/ACID table):
DELETE FROM tablename [WHERE expression]

Try a WHERE clause matching your key with an IN subquery:
DELETE FROM tablename where id in (select id from tablename limit 10);
Example:
I had an ACID transactional table in Hive:
select * from trans;
+-----+-------+--+
| id | name |
+-----+-------+--+
| 2 | hcc |
| 1 | hi |
| 3 | hdp |
+-----+-------+--+
Now I want to delete only the first record (id 2), so my delete statement would be:
delete from trans where id in (select id from trans limit 1);
Result:
select * from trans;
+-----+-------+--+
| id | name |
+-----+-------+--+
| 1 | hi |
| 3 | hdp |
+-----+-------+--+
So we have deleted just the first record; in the same way, you can specify LIMIT 10 and Hive will delete the first 10 records.
You can add ORDER BY and other clauses in the subquery if you need to delete the first 10 rows in a specific order (e.g. delete the ids from 1 to 10).
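If you want to check the subquery-with-LIMIT behaviour without an ACID Hive table, the same idea can be sketched with Python's sqlite3 (same trans data as above; ORDER BY id is added so "first" is deterministic, which a bare LIMIT does not guarantee):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE trans (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO trans VALUES (?, ?)",
                 [(2, "hcc"), (1, "hi"), (3, "hdp")])

# Delete the first N rows via an IN subquery with LIMIT; the ORDER BY
# makes "first" deterministic (here: the smallest id).
conn.execute("DELETE FROM trans WHERE id IN "
             "(SELECT id FROM trans ORDER BY id LIMIT 1)")

print(conn.execute("SELECT id, name FROM trans ORDER BY id").fetchall())
# [(2, 'hcc'), (3, 'hdp')]
```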

Related

Find the difference between two tables in SQL, append the result back to the source table, and update column values only for newly inserted rows

I am new to SQL Server. I want to create a procedure that checks the difference between the master table and the quarterly table, inserts the differing rows back into the master table, and updates the corresponding column values.
Master table is like:
| PID   | Release_date | Retired_date
| loc12 | 202108       |
| loc34 | 202108       |
Quaterly table is like:
| PID   | Address   | Post_code
| loc12 | Srinagar  | 5678
| loc34 | Girinagar | 6789
| loc45 | RRnagar   | 7890
| loc56 | Bnagar    | 9012
Resultant Master table should be like:
| PID   | Release_date | Retired_date
| loc12 | 202108       |
| loc34 | 202108       |
| loc45 | 202111       |
| loc56 | 202111       |
I have tried EXCEPT, but I'm not able to update the master table after inserting the difference. My code is:
INSERT INTO Master (PID)
SELECT PID FROM Master
EXCEPT
SELECT PID FROM Quaterly;

UPDATE Master
SET Release_date = '202111'
WHERE PID IN (SELECT PID FROM Master
              EXCEPT
              SELECT PID FROM Quaterly);
TIA
You could do everything in one query, no need to use UPDATE:
INSERT INTO Master(PID, Release_date)
SELECT q.PID, '202111'
FROM Quaterly q
WHERE q.PID NOT IN (SELECT PID FROM Master)
Another approach uses a SQL JOIN, anti-joining Quaterly against Master:
INSERT INTO Master (PID, Release_date)
SELECT q.PID, '202111'
FROM Quaterly q
LEFT JOIN Master m
ON q.PID = m.PID
WHERE m.PID IS NULL
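Either answer boils down to an anti-join insert. A quick way to sanity-check the logic is SQLite via Python's sqlite3 (schema trimmed to the columns shown in the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Master (PID TEXT, Release_date TEXT, Retired_date TEXT)")
conn.execute("CREATE TABLE Quaterly (PID TEXT, Address TEXT, Post_code TEXT)")
conn.executemany("INSERT INTO Master (PID, Release_date) VALUES (?, ?)",
                 [("loc12", "202108"), ("loc34", "202108")])
conn.executemany("INSERT INTO Quaterly VALUES (?, ?, ?)",
                 [("loc12", "Srinagar", "5678"), ("loc34", "Girinagar", "6789"),
                  ("loc45", "RRnagar", "7890"), ("loc56", "Bnagar", "9012")])

# Insert quarterly PIDs missing from Master, stamping the new release date.
conn.execute("""
    INSERT INTO Master (PID, Release_date)
    SELECT q.PID, '202111'
    FROM Quaterly q
    WHERE q.PID NOT IN (SELECT PID FROM Master)
""")

print(conn.execute("SELECT PID, Release_date FROM Master ORDER BY PID").fetchall())
# [('loc12', '202108'), ('loc34', '202108'), ('loc45', '202111'), ('loc56', '202111')]
```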

Deleting a row from a table based on the existence

This might have a trivial solution; I have searched similar posts but couldn't find a proper answer.
I'm trying to delete a row if it exists in a table.
I have a table say
Table1
----------------------------
|Database| Schema | Number |
----------------------------
| DB1 | S1 | 1 |
| DB2 | S2 | 2 |
| DB3 | S3 | 3 | <--- Want to delete this row
| DB4 | S4 | 4 |
----------------------------
Here is my query
DELETE FROM Table1
WHERE EXISTS
(SELECT * FROM Table1 WHERE Database = 'DB3' and Schema = 'S3');
When I run the above SQL, I end up with an empty table, and I don't understand why.
There are similar posts on Stack Overflow, but I couldn't find an explanation for the empty table.
Why are you using a subquery? Just use a where clause:
DELETE FROM Table1
WHERE Database = 'DB3' and Schema = 'S3';
Your code will delete either all rows or no rows. The where condition is saying "delete all rows from this table where this subquery returns at least one row". So, if the subquery returns one row, everything is deleted. Otherwise, nothing is deleted.
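The all-or-nothing behaviour described above is easy to reproduce. A small sketch with Python's sqlite3 (columns renamed db/sch because Database and Schema are reserved words in many dialects):

```python
import sqlite3

def fresh():
    # Rebuild the question's Table1 from scratch for each experiment.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Table1 (db TEXT, sch TEXT, num INTEGER)")
    conn.executemany("INSERT INTO Table1 VALUES (?, ?, ?)",
                     [("DB1", "S1", 1), ("DB2", "S2", 2),
                      ("DB3", "S3", 3), ("DB4", "S4", 4)])
    return conn

# The question's pattern: the subquery matches at least one row, so
# EXISTS is true for EVERY candidate row and the whole table empties.
c1 = fresh()
c1.execute("DELETE FROM Table1 WHERE EXISTS "
           "(SELECT * FROM Table1 WHERE db = 'DB3' AND sch = 'S3')")
print(c1.execute("SELECT COUNT(*) FROM Table1").fetchone()[0])  # 0

# The plain WHERE clause removes only the targeted row.
c2 = fresh()
c2.execute("DELETE FROM Table1 WHERE db = 'DB3' AND sch = 'S3'")
print(c2.execute("SELECT COUNT(*) FROM Table1").fetchone()[0])  # 3
```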

How to delete duplicates but keep one when all tuples are identical in duplicates and original? In PostgreSQL

Suppose we have below table.
How to delete 2 duplicates and keep one? My code deletes all of them.
+----+-------+
| ID | NAME |
+----+-------+
| 2 | ARK |
| 3 | CAR |
| 9 | PAR |
| 9 | PAR |
| 9 | PAR |
+----+-------+
Ideally, your table should have a unique ID. If not, you can use ctid as a stand-in unique row identifier, as in the query below.
ctid represents the physical location of the row version within its table. Note that although the ctid can be used to locate the row version very quickly, a row’s ctid will change if it is updated or moved by VACUUM FULL. Therefore ctid is useless as a long-term row identifier. But it does the job here.
delete from my_table a using my_table b where a=b and a.ctid < b.ctid;
DB fiddle link - https://dbfiddle.uk/?rdbms=postgres_10&fiddle=4888d519e125dc095496a57477a60b9f
You could also do it by deleting on row_number():
delete from my_table
where ctid in (
  select ctid
  from (select ctid,
               row_number() over (partition by id, name order by id) as rn
        from my_table) t
  where rn > 1
);
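SQLite's rowid plays a similar role to Postgres's ctid for this trick, so the keep-one pattern can be tested locally with Python's sqlite3 (data from the question; here using the equivalent keep-the-smallest-rowid formulation):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO my_table VALUES (?, ?)",
                 [(2, "ARK"), (3, "CAR"), (9, "PAR"), (9, "PAR"), (9, "PAR")])

# Keep the copy with the smallest rowid in each (id, name) group and
# delete the rest; rowid here stands in for Postgres's ctid.
conn.execute("DELETE FROM my_table WHERE rowid NOT IN "
             "(SELECT MIN(rowid) FROM my_table GROUP BY id, name)")

print(conn.execute("SELECT id, name FROM my_table ORDER BY id").fetchall())
# [(2, 'ARK'), (3, 'CAR'), (9, 'PAR')]
```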

BigQuery SQL limit update to one row

I have a BigQuery table with three columns, where several rows can carry exactly the same values.
For example:
| col_a | col_b | col_c
+-------+-------+------------
| 123 | 3 | 2019-12-12
| 123 | 3 | 2019-12-12
| 234 | 11 | 2019-10-12
Now I want to add a new column named col_d with a UUID in it.
The problem is that when I try to execute an UPDATE command, I have no way to update only one row at a time (some rows have identical values, and I want a different UUID in each of them).
Things I tried with no luck :(
LIMIT
UPDATE table
SET col_d = GENERATE_UUID()
LIMIT 1
I thought of getting all rows and then traversing them with an UPDATE command, but there is no LIMIT on UPDATE commands in BigQuery.
ROW_NUMBER
UPDATE table
SET col_d = ROW_NUMBER() OVER()
But BigQuery doesn't allow analytic functions in an UPDATE statement.
INSERT
I can query all rows, insert them with a UUID, and then delete all the old ones that have no UUID. That approach would work and will be my fallback, but I believe there's a better way, so I'm asking here.
Any other idea or advice will be welcome.
Below is for BigQuery Standard SQL; it produces a different UUID for each and every row, no matter how duplicated they are:
UPDATE `project.dataset.table`
SET col_d = GENERATE_UUID()
WHERE TRUE
Note: based on your "Insert and then Delete" option - I assume that col_d already exists in your table - otherwise you would not be able to do DELETE FROM table WHERE col_d IS NULL as you mentioned in your comments
You can SELECT the data with a UUID as a fourth column (col_d) and then save that data as a new table.
SELECT col_a, col_b, col_c, GENERATE_UUID() AS col_d
FROM table
This will generate the output you desire:
| col_a | col_b | col_c | col_d
+-------+-------+-------------+------------------------------------------
| 123 | 3 | 2019-12-12 | e3784e4d-59bb-433b-a9ac-3df318e0f675
| 123 | 3 | 2019-12-12 | 430d034a-6292-4f5e-b1b0-0ee5550af3f6
| 234 | 11 | 2019-10-12 | 3e7e14d2-3077-4030-a704-5a2b7fc3c11e
Since BigQuery does not allow adding an already-populated column the way traditional SQL does, the following creates (or replaces) the table with the UUID values added:
CREATE OR REPLACE TABLE table AS
SELECT *, GENERATE_UUID() AS col_d
FROM table
Be warned that the table history may be lost, so back the table up first. One should always back up data before updates like this, as undesired outcomes do arise.
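One point worth verifying is the question's core requirement: duplicate rows must still end up with different UUIDs. GENERATE_UUID() is evaluated per row, and the same property is trivial to sketch in Python with the standard uuid module (rows mirror the question's table):

```python
import uuid

# Three rows, two of them identical -- mirroring the question's table.
rows = [(123, 3, "2019-12-12"),
        (123, 3, "2019-12-12"),
        (234, 11, "2019-10-12")]

# uuid.uuid4() is evaluated once per row, so even identical rows receive
# different identifiers -- the per-row behaviour the answers rely on.
with_ids = [row + (str(uuid.uuid4()),) for row in rows]

assert len({col_d for *_, col_d in with_ids}) == len(rows)  # all distinct
```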
because some rows have the same values and I want different UUID in each one of them
This should do what you want:
UPDATE table
SET col_d = GENERATE_UUID()
I don't understand why you would be using limit, if you want to update all rows.
That said, BigQuery has restrictions on UPDATEs, so the CREATE TABLE approach suggested by fromthehills seems more appropriate.

Delete duplicate equal rows in BigQuery

There is a table with duplicate rows in which all column values are equal:
+------+---------+------------+
| id | value | timestamp |
+------+---------+------------+
| 1 | 500 | 2019-10-12 |
| 2 | 400 | 2019-10-11 |
| 1 | 500 | 2019-10-12 |
+------+---------+------------+
I want to keep one of those equal rows and delete the others. I came up with:
DELETE
FROM `table` t1
WHERE (
  SELECT ROW_NUMBER() OVER (PARTITION BY id)
  FROM `table` t2
  WHERE t1.id = t2.id
) > 1
However this does not work:
Correlated subqueries that reference other tables are not supported
unless they can be de-correlated, such as by transforming them into an
efficient JOIN.
Any ideas how to remove duplicate rows?
Below is for BigQuery Standard SQL
... where all column values are equal
So you can use a simple SELECT DISTINCT * and, instead of DELETE, use CREATE OR REPLACE TABLE to write the result back to the same table:
#standardSQL
CREATE OR REPLACE TABLE `project.dataset.table`
PARTITION BY date
SELECT DISTINCT *
FROM `project.dataset.table`
In the PARTITION BY clause, list the field(s) used to partition the original table.
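The DISTINCT rewrite is easy to dry-run with Python's sqlite3 (no PARTITION BY there, so the deduplicated rows are just materialised and swapped in; data from the question):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER, value INTEGER, timestamp TEXT)")
conn.executemany("INSERT INTO t VALUES (?, ?, ?)",
                 [(1, 500, "2019-10-12"),
                  (2, 400, "2019-10-11"),
                  (1, 500, "2019-10-12")])

# Emulate CREATE OR REPLACE TABLE ... AS SELECT DISTINCT *:
# materialise the distinct rows, then swap the table in place.
conn.execute("CREATE TABLE t_dedup AS SELECT DISTINCT * FROM t")
conn.execute("DROP TABLE t")
conn.execute("ALTER TABLE t_dedup RENAME TO t")

print(conn.execute("SELECT COUNT(*) FROM t").fetchone()[0])  # 2
```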