I need to drop specific rows from a partitioned Hive table. The rows to delete match certain conditions, so entire partitions cannot simply be dropped. Let's say the table Table has three columns: partner, date and source_key, and it is partitioned by date and source_key.
It is known that Hive supports no operation for deleting or updating a particular set of records (see How to delete and update a record in Hive).
Following this solution, I successfully executed the following queries in order to keep only the records that match some given conditions, say: belong to a given date range, have source_key = 'heaven' and partner <> 'angel'...
Create a temporary empty copy of the table Table.
CREATE TABLE IF NOT EXISTS tmpTable LIKE Table;
Fill it with the current rows.
INSERT OVERWRITE TABLE tmpTable
PARTITION (date,source_key)
SELECT * FROM Table
WHERE
date >= '2020-05-01' AND date < '2020-11-30' AND
source_key = 'heaven';
Delete target partitions.
ALTER TABLE Table DROP IF EXISTS
PARTITION (source_key = 'heaven' , date >= '2020-05-01' , date < '2020-11-30' );
Insert the edited partitions into the target table. (I couldn't manage to use INSERT OVERWRITE because of syntax errors.)
INSERT INTO Table
PARTITION (source_key,date)
SELECT * FROM tmpTable
WHERE
partner <> 'angel';
Drop temporary table.
DROP TABLE IF EXISTS tmpTable;
The queries run fine. Because the table Table is managed, the HDFS files should be deleted when the partitions are dropped, but something is wrong (perhaps in the last INSERT INTO statement): after executing all these queries, the target table Table still keeps all records with partner = 'angel' in the given date range; basically it stays the same.
Where is the fault? What is missing? How can I accurately delete specific rows matching certain conditions from such a Hive table?
Table partitions can be overwritten directly by selecting from the table itself with a WHERE filter. The scenario is quite simple and you do not need any temporary table. Make a backup table if you are not sure what will happen.
If you want to drop entire partitions (not overwrite), execute
ALTER TABLE TableName DROP IF EXISTS
PARTITION (<partition spec to be dropped>); --check partition spec to be dropped carefully
Skip this step if no partitions need to be dropped.
Overwrite other partitions with filtered rows:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.allow.move.on.s3=true; --If you are on Qubole/S3
insert overwrite table TableName partition (date, source_key ) --partition spec should match table DDL
select * from TableName
where <condition>; -- condition should be true for the rows which should NOT be deleted (i.e. the rows you keep)
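Applied to the table from the question (column names taken from the question; treat this as an untested sketch), the in-place overwrite could look like:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Rows with partner = 'angel' are simply not re-selected, so they disappear
-- from the partitions being overwritten. Partition columns (date, source_key)
-- come last, in the order of the table's partition spec.
insert overwrite table Table partition (date, source_key)
select partner, date, source_key
from Table
where source_key = 'heaven'
  and date >= '2020-05-01' and date < '2020-11-30'
  and partner <> 'angel';
Note that a dynamic-partition overwrite only rewrites the partitions that appear in the query result; a partition in the range whose rows are all partner = 'angel' would stay untouched, which is what the DROP PARTITION step above is for.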
Your code is rather confusing because you created the temporary table using LIKE but then used a different partition specification and selected * (which returns columns in the same order as in the original table). The column order should match exactly: the partition columns are the last ones, in the same order as in the partition spec.
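If you do stay with the temp-table route, a sketch of the final insert with an explicit column list (again using the column names from the question, untested):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Partition columns must be selected last, in the order of the PARTITION clause,
-- which itself should match the table DDL (date, source_key).
INSERT OVERWRITE TABLE Table PARTITION (date, source_key)
SELECT partner, date, source_key
FROM tmpTable
WHERE partner <> 'angel';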
Related
We have a table in our db with copied data that has completely duplicated many rows. Because the id is also duplicated, there is nothing we can use to select just the duplicates. I tried using a LIMIT to delete only one, but Redshift gave a syntax error when I tried to use LIMIT in a DELETE.
Any ideas how we can delete just one of two rows that have completely identical information?
Use select distinct to create a new table. Then either truncate & copy the data, or drop the original table and rename the new table to the original name:
create table t2 as select distinct * from t;
truncate t;
insert into t select * from t2;
drop table t2;
Alternatively, add a column with unique values; IDENTITY(seed, step) looks interesting.
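A hedged sketch of that idea with made-up column names (id, col1); note that, as far as I know, Redshift does not let ALTER TABLE ADD COLUMN create an IDENTITY column, so the table has to be rebuilt:
-- Rebuild the table with a surrogate key (hypothetical column list).
create table t_with_id (
    row_id bigint identity(1, 1),
    id     int,
    col1   varchar(100)
);

insert into t_with_id (id, col1)
select id, col1 from t;

-- Keep the lowest row_id of each duplicate group and delete the rest.
delete from t_with_id
where row_id not in (
    select min(row_id)
    from t_with_id
    group by id, col1
);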
I am working with a temporary table in Netezza that contains the columns id, gender, start_date, and end_date. I want to add a new column to this table that contains a default date of 2019-01-01 for all rows. The table to which I want to add this column is a local temp table, so ALTER TABLE does not work ("Error: Operation not allowed on a temp table"). To get around this, I created a new temp table as follows:
DROP TABLE new_temp_table IF EXISTS;
GO
SELECT id, gender, start_date, end_date, '2019-01-01' default_date
INTO TEMP TABLE new_temp_table
FROM old_temp_table;
GO
This new table is limited to 1000 rows by the SELECT...INTO syntax. My old table has 36 million rows. Is there a solution that would allow me to directly modify the old table to add the new default date column, or some other way to get around the 1000-row limit with SELECT...INTO?
I have a Hive table that's partitioned:
CREATE EXTERNAL TABLE IF NOT EXISTS CUSTOMER_PART (
NAME string ,
AGE int ,
YEAR INT)
PARTITIONED BY (CUSTOMER_ID decimal(15,0))
STORED AS PARQUET LOCATION 'HDFS LOCATION'
The first load is done from Oracle to Hive via PySpark using:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID) SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM CUSTOMER;
This works fine and creates partitions dynamically during the run. But loading data incrementally every day creates an individual file for each single record under the partition:
INSERT INTO TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3; --Assume this gives me the latest record in the database
Is there a way to have the values appended to the existing Parquet file under the partition until it reaches the block size, without having small files created for each insert?
Rewriting the whole partition is one option, but I would prefer not to do this:
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID = 3) SELECT NAME, AGE, YEAR FROM CUSTOMER WHERE CUSTOMER_ID = 3;
The following Hive properties are set:
set hive.execution.engine=tez; -- TEZ execution engine
set hive.merge.tezfiles=true; -- Notifying that merge step is required
set hive.merge.smallfiles.avgsize=128000000; --128MB
set hive.merge.size.per.task=128000000; -- 128MB
These still don't help with daily inserts. Any alternate approach that can be followed would be really helpful.
As per my knowledge, we can't store a single file for daily partition data, since the data will be stored in different part files for each daily insert into the partition.
Since you mention that you are importing the data from an Oracle DB, you can import the entire data set each time and overwrite it into HDFS. This way you can maintain a single part file.
Also, HDFS is not recommended for small amounts of data.
I could think of the following approaches for this case:
Approach 1:
Recreate the Hive table, i.e. after loading the incremental data into the CUSTOMER_PART table:
Create a temp_CUSTOMER_PART table with an entire snapshot of the CUSTOMER_PART table data.
Overwrite the final CUSTOMER_PART table by selecting from the temp_CUSTOMER_PART table (a sketch of these steps follows below).
In this case you are going to have a final table without small files in it.
NOTE: you need to make sure that no new data is being inserted into the CUSTOMER_PART table after the temp table has been created.
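A rough sketch of Approach 1 (untested; table names from the question, settings assumed):
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- Snapshot the current data.
CREATE TABLE temp_CUSTOMER_PART STORED AS PARQUET AS
SELECT * FROM CUSTOMER_PART;

-- Rewrite every partition from the snapshot; each partition is written in one
-- pass, so the many small daily files get compacted.
INSERT OVERWRITE TABLE CUSTOMER_PART PARTITION (CUSTOMER_ID)
SELECT NAME, AGE, YEAR, CUSTOMER_ID FROM temp_CUSTOMER_PART;

DROP TABLE temp_CUSTOMER_PART;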
Approach 2:
Use the input_file_name() function:
Check how many distinct file names there are in each partition, then select only the partitions that have more than, say, 10 files.
Create a temporary table with those partitions and overwrite only the selected partitions in the final table (see the sketch after this note).
NOTE: you need to make sure that no new data is being inserted into the CUSTOMER_PART table after the temp table has been created, because we are going to overwrite the final table.
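For the file count in Approach 2, something along these lines should work (in Hive the virtual column is INPUT__FILE__NAME, while input_file_name() is the Spark SQL equivalent; the threshold of 10 is arbitrary):
SELECT CUSTOMER_ID,
       COUNT(DISTINCT INPUT__FILE__NAME) AS file_cnt -- number of files in the partition
FROM CUSTOMER_PART
GROUP BY CUSTOMER_ID
HAVING COUNT(DISTINCT INPUT__FILE__NAME) > 10;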
Approach 3:
Hive (not Spark) allows overwriting a table while selecting from that same table, i.e.
insert overwrite table default.t1 partition(partition_column)
select * from default.t1; -- overwrite and select from the same t1 table
If you follow this way, then a Hive job needs to be triggered once your Spark job finishes.
Hive will acquire a lock while running the overwrite/select on the same table, so any job writing to the table will have to wait.
In addition: the ORC format offers CONCATENATE, which merges small ORC files to create a new larger file.
alter table <db_name>.<orc_table_name> [partition_column="val"] concatenate;
I have this table, where every column is a VARCHAR (or equivalent):
field001 field002 field003 field004 field005 .... field500
500 VARCHAR columns. No primary keys. And no column is guaranteed to be unique. So the only way to know for sure if two rows are the same is to compare the values of all columns.
(Yes, this should be in TheDailyWTF. No, it's not my fault. Bear with me here).
I inserted a duplicate set of rows by mistake, and I need to find them and remove them.
There are 12 million rows in this table, so I'd rather not recreate it.
However, I do know what rows were mistakenly inserted (I have the .sql file).
So I figured I'd create another table and load it with those. And then I'd do some sort of join that would compare all columns on both tables and then delete the rows that are equal from the first table. I tried a NATURAL JOIN as that looked promising, but nothing was returned.
What are my options?
I'm using Amazon Redshift (so PostgreSQL 8.4 if I recall), but I think this is a general SQL question.
You can treat the whole row as a single record in Postgres (and thus I think in Redshift).
The following works in Postgres, and will keep one of the duplicates
delete from the_table
where ctid not in (select min(ctid)
from the_table
group by the_table); --<< Yes, the group by is correct!
This is going to be slow!
Grouping over so many columns and then deleting with a NOT IN will take quite some time, especially if a lot of rows are going to be deleted.
If you want to delete all duplicate rows (not keeping any of them), you can use the following:
delete from the_table
where the_table in (select the_table
from the_table
group by the_table
having count(*) > 1);
You should be able to identify all the mistakenly inserted rows using CREATEXID. If you group by CREATEXID on your table as below and get the count, you should be able to see how many rows were inserted in each transaction and then remove them with a DELETE command (a sketch of that follows the query).
SELECT CREATEXID,COUNT(1)
FROM yourtable
GROUP BY 1;
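Once the offending transaction id shows up in that output, the removal would be along these lines (a sketch, assuming CREATEXID can be referenced in a WHERE clause the same way it is in the SELECT above; the id value is a placeholder):
DELETE FROM yourtable
WHERE CREATEXID = 1234567; -- the transaction id identified above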
One simplistic solution is to recreate the table, e.g.
CREATE TABLE my_temp_table (
-- add column definitions here, just like the original table
);
INSERT INTO my_temp_table SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
or even
CREATE TABLE my_temp_table AS SELECT DISTINCT * FROM original_table;
DROP TABLE original_table;
ALTER TABLE my_temp_table RENAME TO original_table;
It is a trick, but it may help.
Each row in the table contains the ID of the transaction in which that row was inserted/updated (see System Columns): the xmin column. Using it you can find the transaction ID in which you inserted the wrong data, and then just delete those rows using
delete from my_table where xmin = <the_wrong_transaction_id>;
PS: Be careful and try it on a test table first.
We have a table with 10 billion rows. This table is interval partitioned on a date column. In a subpartition we need to update that date to a new value for the 500 million rows that match the criteria. This will definitely trigger the creation of new partitions, because the table is partitioned on the same date. Could anyone give me pointers to the best approach to follow?
Thanks in advance!
If you are going to update the partitioning key and the source rows are in a single (sub)partition, then a reasonable approach would be to:
Create a temporary table for the updated rows. If possible, perform the update on the fly:
CREATE TABLE updated_rows
AS
SELECT add_months(partition_key, 1), other_columns...
FROM original_table PARTITION (xxx)
WHERE ...;
Drop original (sub)partition
ALTER TABLE original_table DROP PARTITION xxx;
Reinsert the updated rows back
INSERT /*+append*/ INTO original_table
SELECT * FROM updated_rows;
In case you have issues with CTAS or INSERT INTO SELECT for 500M rows, consider partitioning the temporary table and moving the data in batches, for example:
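A hedged sketch of that batching idea (partition_key and the xxx partition name are placeholders from the answer above; id and amount stand in for the remaining columns):
-- Stage the updated rows into a table partitioned the same way as the original.
CREATE TABLE updated_rows
PARTITION BY RANGE (partition_key) INTERVAL (NUMTOYMINTERVAL(1, 'MONTH'))
(PARTITION p_first VALUES LESS THAN (DATE '2020-01-01'))
AS
SELECT add_months(partition_key, 1) AS partition_key, id, amount
FROM original_table PARTITION (xxx)
WHERE ...;

-- Then move the rows back one month at a time, so each transaction stays bounded.
INSERT /*+ append */ INTO original_table
SELECT * FROM updated_rows PARTITION FOR (DATE '2020-02-01');
COMMIT;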
Hmmm... if you have enough space, I would create a "copy" of the source table with the good, updated rows, then check the results, drop the source table, and finally rename the "copy" to the source name. Yes, this has a long execution time, but it can be a painless way; of course a parallel hint is needed.
You may consider adding to your table a new flag column, 'updated' (a bit that defaults to NULL, or 0; I prefer NULL). Using the date criteria that you need to update, you can then update the data group by group in the same way described by Kombajn; once a group of data is updated, you set the 'updated' flag to 1 for that group.
For example, let's start by making groups of data; consider that the grouping criterion is the year, so let's treat the data year by year.
1. Create a temporary table for the first year (2001):
CREATE TABLE updated_rows
AS
SELECT columns...
FROM original_table PARTITION (2001)
WHERE YEAR = 2001
...;
2. Drop the original (sub)partition:
ALTER TABLE original_table DROP PARTITION 2001;
3. Reinsert the updated rows:
INSERT /*+append*/ INTO original_table(columns....,updated)
SELECT columns...,1 FROM updated_rows;
Hopefully this helps you treat the data step by step, so you don't have to wait for all of the table's data to be updated at once. You may also consider a cursor or loop over the years, for example:
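A minimal sketch of such a loop in PL/SQL, assuming the three steps above are wrapped in a hypothetical procedure process_year:
BEGIN
  FOR y IN 2001 .. 2010 LOOP
    process_year(y); -- hypothetical procedure running steps 1-3 for year y
    COMMIT;
  END LOOP;
END;
/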