I have a Hive table partitioned on column 'part'. The table has two partition values, part='good' and part='bad'.
I need to move a record from the 'bad' partition into the 'good' partition and overwrite the 'bad' partition to remove the moved record. To complicate matters, I am looking for a way to do it in a single query, as exception handling would be difficult otherwise.
I tried to do it with a multi-table insert that has two insert clauses on the same table, as below:
from tbl_partition
insert into tbl_partition partition (part='good') select a,b,c where a='a' and part='bad' -- this is where a record is moved from bad to good
insert overwrite table tbl_partition partition (part='bad') select a,b,c where part='bad' and a not in ('a'); -- Overwrite the bad partition excluding already moved record
But the above query always does an insert into for both clauses, rather than one insert into and the other insert overwrite!
I even tried a common table expression and used it to insert into this table from both clauses simultaneously, with no luck!
Is there any other way this can be achieved in a single query, or am I doing something wrong in the above step?
Please note that I am doing this on an HDP cluster with Hive 1.2.
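For reference, a variation that is sometimes suggested for this kind of repartitioning is to write both clauses of the multi-insert as insert overwrite, rebuilding the 'good' partition from its existing rows plus the moved record. This is only a hedged sketch based on the attempt above, not a confirmed fix; whether Hive 1.2 accepts it (in particular, overwriting partitions of the table being read) would need testing:
from tbl_partition
insert overwrite table tbl_partition partition (part='good')
select a,b,c where part='good' or (part='bad' and a='a') -- keep existing good rows and add the moved record
insert overwrite table tbl_partition partition (part='bad')
select a,b,c where part='bad' and a not in ('a'); -- rewrite bad without the moved record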
Related
I have a table with hundreds of millions of rows, and I need to create a "duplicate" of each existing row, doubling the row count. I'm currently using a single insert operation (and setting the table to unlogged prior to inserting), which still takes a long while as one transaction. I'm looking for guidance on whether there is a more efficient way to execute the query below.
INSERT INTO costs (parent_record, is_deleted)
SELECT id, is_deleted
FROM costs;
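One commonly suggested variation, shown here only as a hedged sketch, is to split the copy into id ranges so that each statement is a smaller transaction; whether it is faster overall depends on the setup, and it assumes the newly inserted rows get ids above the current maximum so they are not re-copied by later batches:
-- Copy one id range at a time instead of one big statement
-- (the range bounds are placeholders; adjust to the actual id distribution).
INSERT INTO costs (parent_record, is_deleted)
SELECT id, is_deleted
FROM costs
WHERE id >= 0 AND id < 10000000;
-- Commit, then repeat with the next id range up to the original maximum id.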
I have to write a merge statement to insert data from a temporary table into a partitioned table, and I'm getting the error below:
Error report -
SQL Error: ORA-14400: inserted partition key does not map to any partition
I have to do it session-wise and, as a result, have to use a temporary table, which cannot be partitioned.
If you insert data into the partitioned table, Oracle has to place each row into the correct partition. You must create partitions covering the complete period of time, as in this example, which adds date-range partitions for each quarter:
ALTER TABLE sales ADD
PARTITION sales_q1_2007 VALUES LESS THAN (TO_DATE('01-APR-2007','dd-MON-yyyy')),
PARTITION sales_q2_2007 VALUES LESS THAN (TO_DATE('01-JUL-2007','dd-MON-yyyy')),
PARTITION sales_q3_2007 VALUES LESS THAN (TO_DATE('01-OCT-2007','dd-MON-yyyy')),
PARTITION sales_q4_2007 VALUES LESS THAN (TO_DATE('01-JAN-2008','dd-MON-yyyy'))
;
If you have done this, you can insert the data as needed.
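For illustration, a hedged example of an insert that would now map to one of the partitions added above (the column names are placeholders, since the original table definition is not shown):
INSERT INTO sales (sale_date, amount) -- hypothetical columns
VALUES (TO_DATE('15-FEB-2007','dd-MON-yyyy'), 100);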
Good luck,
I'd like to cluster our ingestion-time partitioned tables without having to change the ETL scripts we use to update them. All of our tables are partitioned on the pseudo-field _PARTITIONTIME, but when I try to cluster a table with DML I get the following error:
Invalid field name "_PARTITIONTIME". Field names are not allowed to start with the (case-insensitive) prefixes _PARTITION, TABLE, FILE and _ROW_TIMESTAMP
Here's what the DML script looks like:
CREATE TABLE `table_target`
PARTITION BY DATE(_PARTITIONTIME)
CLUSTER BY a, b, c
AS
SELECT
*, _PARTITIONTIME
FROM
`table_source`
How should I go about this? Is there a way to keep the same pseudo-field as the partition field, should I re-work the partition field, or am I missing something here?
It is a known limitation that:
It is not possible to create an ingestion-time partitioned table from the result of a query. Instead, use a CREATE TABLE DDL statement to create the table, and then use an INSERT DML statement to insert data into it.
In your case, you need to use CREATE TABLE to create table_target with CLUSTER BY first, then migrate the data over.
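A hedged sketch of that two-step approach is below. The column types are assumptions, and whether _PARTITIONTIME can be written directly in the INSERT depends on BigQuery's DML support for ingestion-time partitioned tables, so verify it against the documentation:
-- Step 1: create the ingestion-time partitioned, clustered table with DDL (no AS SELECT).
CREATE TABLE `table_target`
(
  a STRING,  -- placeholder column definitions
  b STRING,
  c STRING
)
PARTITION BY _PARTITIONDATE
CLUSTER BY a, b, c;

-- Step 2: migrate the data with DML, carrying the original partition time across.
INSERT INTO `table_target` (_PARTITIONTIME, a, b, c)
SELECT _PARTITIONTIME, a, b, c
FROM `table_source`;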
We have a table with 10 billion rows. This table is interval partitioned on a date column. In one subpartition we need to update that date to a new value for the 500 million rows that match our criteria. This will definitely affect the creation of new partitions, because the table is partitioned on that same date. Could anyone give me pointers to the best approach to follow?
Thanks in advance!
If you are going to update the partitioning key and the source rows are in a single (sub)partition, then a reasonable approach would be to:
1. Create a temporary table for the updated rows. If possible, perform the update on the fly:
CREATE TABLE updated_rows
AS
SELECT add_months(partition_key, 1), other_columns...
FROM original_table PARTITION (xxx)
WHERE ...;
2. Drop the original (sub)partition:
ALTER TABLE original_table DROP PARTITION xxx;
3. Reinsert the updated rows:
INSERT /*+append*/ INTO original_table
SELECT * FROM updated_rows;
In case you have issues with CTAS or INSERT INTO SELECT for 500M rows, consider partitioning the temporary table and moving the data in batches.
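As a hedged illustration of the batching idea (the column name and date range are placeholders):
-- Reinsert one date range at a time instead of all 500M rows in one statement.
INSERT /*+ append */ INTO original_table
SELECT * FROM updated_rows
WHERE partition_key >= DATE '2007-01-01'
  AND partition_key <  DATE '2007-02-01';
COMMIT;
-- Repeat with the next range until updated_rows has been fully copied back.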
Hmmm... if you have enough space, I would create a "copy" of the source table with the good, updated rows, then check the results and drop the source table, and in the end rename the "copy" to the source name. Yes, this has a long execution time, but it can be a painless way; of course, a parallel hint is needed.
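A hedged sketch of that copy-and-swap approach, assuming a date column partition_key and the one-month shift used in the previous answer (the filter and column list are placeholders, and in practice the copy would also need the original partitioning, constraints and indexes re-created before the rename):
-- Build the corrected copy in parallel, updating the date on the fly.
CREATE TABLE original_table_copy PARALLEL NOLOGGING AS
SELECT /*+ parallel(t) */
       CASE
         WHEN t.partition_key = DATE '2007-01-01'  -- placeholder for the real update criteria
         THEN add_months(t.partition_key, 1)       -- corrected value for matching rows
         ELSE t.partition_key                      -- everything else unchanged
       END AS partition_key,
       t.other_column                              -- remaining columns
FROM   original_table t;

-- Check the results, then swap the tables.
DROP TABLE original_table;
ALTER TABLE original_table_copy RENAME TO original_table;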
You may consider adding a new flag column, 'updated', which has the default value NULL (or 0; I prefer NULL), to your table. Using the date criteria you need to update, you can then update the data group by group in the same way described by Kombajn, and once a group of data is updated you set the 'updated' flag to 1 for that group.
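A hedged example of adding such a flag column (the column type is an assumption; new columns default to NULL in Oracle, matching the suggestion above):
ALTER TABLE original_table ADD (updated NUMBER(1));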
For example, let's make groups of data and take the year as the grouping criterion, so we treat the data year by year.
1. Create a temporary table for the first year (2001):
CREATE TABLE updated_rows
AS
SELECT columns...
FROM original_table PARTITION (2001)
WHERE YEAR = 2001
...;
2. Drop the original (sub)partition:
ALTER TABLE original_table DROP PARTITION 2001;
3. Reinsert the updated rows:
INSERT /*+append*/ INTO original_table(columns....,updated)
SELECT columns...,1 FROM updated_rows;
I hope this helps you treat the data step by step, rather than waiting for all the data in the table to be updated at once. You may also consider a cursor that loops over the years.
I have a table which holds ~1M rows. My application has a list of ~100K IDs which belong to that table (the list being generated by the application layer).
Is there a common method of querying all of these IDs? ~100K SELECT queries? A temporary table into which I insert the ~100K IDs, and then a SELECT query that joins it with the required table?
Thanks,
Doori Bar
You could do it in one query, something like
SELECT * FROM large_table WHERE id IN (...)
Insert a comma-separated list of IDs where I put the ...
Unfortunately, there is no easy way that I know of to parametrize this, so you need to be extra-super careful to avoid SQL injection vulnerabilities.
A temporary table which holds the 100k IDs seems like a good solution. Don't insert them one by one, though; the INSERT ... VALUES syntax in MySQL accepts the insertion of multiple rows.
By the way, where do you get your 100k IDs, if not from the database? If they come from a preceding request, I'd suggest having that request fill the temporary table.
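A hedged sketch of the temporary-table approach (the table and column names are placeholders):
-- Hold the ~100k IDs in a temporary table.
CREATE TEMPORARY TABLE wanted_ids (id INT NOT NULL PRIMARY KEY);

-- Multi-row insert; send the IDs a few thousand per statement rather than one by one.
INSERT INTO wanted_ids (id) VALUES (1), (2), (3);

-- Fetch all matching rows with a single join.
SELECT t.*
FROM large_table t
JOIN wanted_ids w ON w.id = t.id;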
Edit: For a more portable way of doing a multiple-row insert:
INSERT INTO mytable (col1, col2) SELECT 'foo', 0 UNION SELECT 'bar', 1
Do those id's actually reference the table with 1M rows?
If so, you could use SELECT * ids FROM <1M table>
where ids is the ID column and where "1M table" is the name of the table which holds the 1M rows.
but I don't think I really understand your question...