I'm writing an automated script for a few queries in Hive, and we found that from time to time we need to clear the data from a table and insert new data. We are wondering which approach would be faster:
INSERT OVERWRITE TABLE SOME_TABLE
SELECT * FROM OTHER_TABLE;
or is it faster to do it like this:
DROP TABLE SOME_TABLE;
CREATE TABLE SOME_TABLE (STUFFS);
INSERT INTO TABLE SOME_TABLE
SELECT * FROM OTHER_TABLE;
The overhead of running the queries is not an issue, since we have the table-creation script as well. The question is: with billions of rows, is INSERT OVERWRITE faster than DROP + CREATE + INSERT INTO?
For maximum speed I would suggest to 1) issue hadoop fs -rm -r -skipTrash table_dir/* first to remove the old data quickly, without moving the files to the Trash, because INSERT OVERWRITE will move all the old files to the Trash, and for a very big table this will take a lot of time. Then 2) run the INSERT OVERWRITE command. This will also be faster because you do not need to drop and re-create the table.
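A minimal sketch of the two steps (the table directory path is an assumption; substitute your warehouse location). First, from the shell:
hadoop fs -rm -r -skipTrash /user/hive/warehouse/some_db.db/some_table/*
then, in Hive:
INSERT OVERWRITE TABLE SOME_TABLE
SELECT * FROM OTHER_TABLE;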
UPDATE:
As of Hive 2.3.0 (HIVE-15880), if the table has TBLPROPERTIES ("auto.purge"="true"), the previous data of the table is not moved to the Trash when an INSERT OVERWRITE query is run against the table. This functionality is applicable only to managed tables. So INSERT OVERWRITE with auto-purge will work faster than rm -skipTrash + INSERT OVERWRITE or DROP + CREATE + INSERT, because it is a single Hive-only command.
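For example, a sketch of enabling the property before the overwrite (table names are the hypothetical ones from the question):

ALTER TABLE SOME_TABLE SET TBLPROPERTIES ("auto.purge"="true");

INSERT OVERWRITE TABLE SOME_TABLE
SELECT * FROM OTHER_TABLE;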
One edge consideration is that if your schema changes, INSERT OVERWRITE will fail, while DROP+CREATE+INSERT will not. While this is unlikely to apply in most scenarios, if you're prototyping workflow/table schemas then it might be worth considering.
Related
I need to perform an INSERT operation into a table which is created on top of a snapshottable location. Is that possible?
(I've been working with Hive 1.1)
CREATE TABLE tablename (x string, y string)
LOCATION '/tmp/snapshots_test/';
INSERT INTO TABLE tablename VALUES('x','y');
where /tmp/snapshots_test/ is set as snapshottable:
hdfs dfsadmin -allowSnapshot /tmp/snapshots_test
I've found out that, if the table is partitioned, it is possible to perform an insert operation even if the location is a snapshottable directory; however, it might not work depending on the Hive version.
Anyway, it is always possible to delete just the files inside the snapshottable directory, so as to avoid deleting the whole directory through a Hive command.
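A minimal sketch of the partitioned variant described above (the partition column and value are assumptions):

CREATE TABLE tablename_part (x string, y string)
PARTITIONED BY (dt string)
LOCATION '/tmp/snapshots_test/';

INSERT INTO TABLE tablename_part PARTITION (dt='d1') VALUES('x','y');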
I have a web app running off a database table that is generated from a csv file.
This table needs to update frequently from the csv. The table needs to match the csv exactly, i.e. if a record no longer exists in the csv it should no longer exist in the table or it should be soft deleted.
What is the proper way to do this?
It seems the easiest way would be:
create temp table
import csv to temp table
drop live table
rename temp table to live table name
This will be scripted inside the app, so I don't think the downtime will be much, since dropping and renaming a table shouldn't take too long. But it doesn't seem like the safest way to do things, because there is a moment where no table exists.
I tried to instead do:
create temp table
import csv to temp table
update records in live table with data from temp table
delete records in live table that don't exist in temp table
In theory that sounded better, but it is extremely slow. The first method takes just a few seconds. With the second method, the update takes a really long time; I let it run for 10 minutes before cancelling it, as it hadn't finished.
I did the update like this:
update table_name as t
set
    column1 = s.column1,
    column2 = s.column2
    -- etc..
from table_name_temp as s
where s.id = t.id;  -- join on the key column (name assumed)
What is the proper way to handle this situation?
What you want to do is wrap your simple solution within a transaction. This will ensure that your steps are executed atomically. See: https://www.tutorialspoint.com/sql/sql-transactions.htm for more info.
Postgres supports ALTER TABLE .. RENAME.
http://www.postgresqltutorial.com/postgresql-rename-table/
https://dba.stackexchange.com/questions/100779/how-to-atomically-replace-table-data-in-postgresql
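For example, a minimal sketch of that swap wrapped in a transaction (table names are assumptions; in Postgres these DDL statements are transactional, so both take effect together at COMMIT):

BEGIN;
DROP TABLE IF EXISTS live_table;
ALTER TABLE temp_table RENAME TO live_table;
COMMIT;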
The rename table method only works if there are no constraints or triggers involved.
In most cases the new table's contents will not differ much from the old version; the trick is to suppress updates that don't change anything.
In steps:
create temp table
import csv to temp table
delete records from live table that don't exist in temp table # deletes
delete records from temp table that are EXACTLY THE SAME in live table # idempotent updates
update records in live table with data from temp table # actual updates
insert records into live table from temp table that don't yet exist # inserts
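A minimal sketch of these steps in Postgres, assuming both tables have identical column lists and share a primary key id (all names here are hypothetical):

BEGIN;

-- deletes: rows that no longer exist in the CSV
DELETE FROM live_table AS l
WHERE NOT EXISTS (SELECT 1 FROM temp_table AS t WHERE t.id = l.id);

-- idempotent updates: drop temp rows that are identical to their live counterparts
DELETE FROM temp_table AS t
USING live_table AS l
WHERE t.id = l.id
  AND (t.*) IS NOT DISTINCT FROM (l.*);  -- NULL-safe whole-row comparison

-- actual updates
UPDATE live_table AS l
SET column1 = t.column1,
    column2 = t.column2                  -- list every non-key column
FROM temp_table AS t
WHERE t.id = l.id;

-- inserts: rows that are new in the CSV
INSERT INTO live_table
SELECT t.*
FROM temp_table AS t
WHERE NOT EXISTS (SELECT 1 FROM live_table AS l WHERE l.id = t.id);

COMMIT;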
I ran a query with an INNER JOIN and the result was 12 million rows.
I'd like to put this into a table.
I did some tests, and creating the table with an AS SELECT clause was much faster than creating the table first and running an INSERT with a SELECT afterwards.
I don't understand why.
Can somebody explain this to me?
Thanks
If you use 'create table as select' (CTAS)
CREATE TABLE new_table AS
SELECT *
FROM old_table
you automatically do a direct-path insert of the data. If you do an
INSERT INTO new_table
SELECT *
FROM old_table
you do a conventional insert. You have to use the APPEND hint if you want to do a direct-path insert instead. So you have to do
INSERT /*+ APPEND */ INTO new_table
SELECT *
FROM old_table
to get performance similar to 'CREATE TABLE AS SELECT'.
How does the usual conventional insert work?
Oracle checks the free list of the table for an already used block of the table segment that still has free space. If the block isn't in the buffer cache, it is read into the buffer cache. Eventually this block is written back to disk.
During this process, undo for the block is written (only a small amount of data is necessary here) and data structures are updated, e.g., if necessary, the free list, which resides in the segment header; all these changes are written to the redo buffer, too.
How does a direct-path insert work?
The process allocates space above the high water mark of the table, that is, beyond the already used space. It writes the data directly to disk, without using the buffer cache. It is also written to the redo buffer. When the session commits, the high water mark is raised beyond the newly written data, and this data becomes visible to other sessions.
How can I improve CTAS and direct-path inserts?
You can create the table in NOLOGGING mode; then no redo information is written. If you do this, you should make a backup of the tablespace that contains the table after the insert, otherwise you cannot recover the table should you need to.
You can do the select in parallel
You can do the insert in parallel
If you have to maintain indexes, constraints or even triggers during an insert operation, this can slow down your insert drastically. So you should avoid this: create indexes after the insert, and maybe create constraints with NOVALIDATE.
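A rough sketch combining these points (table names, degree of parallelism, and the decision to use NOLOGGING are assumptions to adapt to your environment):

-- CTAS variant: nologging + parallel
CREATE TABLE new_table NOLOGGING PARALLEL 4 AS
SELECT /*+ PARALLEL(old_table, 4) */ *
FROM old_table;

-- insert variant: direct-path, parallel DML into an existing table
ALTER TABLE new_table NOLOGGING;
ALTER SESSION ENABLE PARALLEL DML;
INSERT /*+ APPEND PARALLEL(new_table, 4) */ INTO new_table
SELECT /*+ PARALLEL(old_table, 4) */ * FROM old_table;
COMMIT;  -- required before the session can read the direct-path inserted rows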
With CREATE TABLE ... AS SELECT, the table you create has no primary key, indexes or identity columns, and the columns always allow NULL.
It also does not have to be written to the transaction log (and therefore does not roll back). It behaves like a "naked" table.
With INSERT ... SELECT, the table must be created beforehand, so when you create the table you can define keys, indexes, identity columns and so on. And it will use the transaction log.
Applied to large amounts of data, that is very slow.
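For example, a sketch of the usual workaround: create the table with AS SELECT for speed, then add the keys and indexes afterwards (the column names here are hypothetical):

CREATE TABLE new_table AS
SELECT *
FROM old_table;

ALTER TABLE new_table ADD CONSTRAINT new_table_pk PRIMARY KEY (id);
CREATE INDEX new_table_ix1 ON new_table (created_at);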
Is there any way to write a COPY command directly that will copy data from one table and populate another table (with some condition would be even better)?
From what I have observed, COPY command performance is far better than INSERT INTO in Vertica, so I am trying to replace the INSERT INTO with a COPY command.
Thanks!!
What you want to do is an INSERT /*+ DIRECT */ INTO table2 SELECT ... FROM table1 WHERE .... The direct hint will make it do a direct load to ROS containers instead of through WOS. If you are doing large bulk loads, this would be fastest. If you are doing many small insert/selects like this, then it would be best to use WOS and leave out the DIRECT.
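For example, a sketch of that statement with a filter condition (table and column names are assumptions):

INSERT /*+ DIRECT */ INTO table2
SELECT id, amount, load_ts
FROM table1
WHERE load_ts >= '2017-01-01';
COMMIT;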
Another possibility would be to do a CREATE TABLE table2 AS SELECT ... FROM table1 WHERE ....
Finally, if you are really just copying all the data and not filtering (which I know isn't your question, but I'm including this for completeness)... and the tables are partitioned, you can do a COPY_PARTITIONS_TO_TABLE, which will just create references from the source table's ROS containers to the target table. Any changes to the new table would reorganize the ROS containers (over time, using the tuple mover, etc.; containers wouldn't get cleaned up unless both tables reorganized them).
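A sketch of that call (the partition key range values here are hypothetical):

SELECT COPY_PARTITIONS_TO_TABLE('table1', '2016-01-01', '2016-12-31', 'table2');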
We use a DB2 database. Some datawarehouse tables are TRUNCATEd and reloaded every day. We run into deadlock issues when another process is running an INSERT statement against that same table.
Scenario
TRUNCATE is executed on a table.
At the same time another process INSERTs some data into the same table. (The process is based on a trigger and can start at any time.)
Is there a workaround?
What we have thought of so far is to prioritize the TRUNCATE and then go through with the INSERT. Is there any way to implement this? Any help would be appreciated.
You should request a table lock before you execute the truncate.
If you do this you can't get a deadlock -- the table lock won't be granted before the insert finishes and once you have the lock another insert can't occur.
Update from comment:
You can use the LOCK TABLE command. The details depend on your situation, but you should be able to get away with SHARE mode. This will allow reads but not inserts (which is the issue you are having, I believe).
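A minimal sketch of that sequence (the table names are assumptions; adjust the lock mode and the reload statements to your setup):

LOCK TABLE DWH.FACT_SALES IN SHARE MODE;
-- the concurrent trigger-driven INSERT now waits until this unit of work ends
DELETE FROM DWH.FACT_SALES;              -- or TRUNCATE, subject to your DB2 version's restrictions
INSERT INTO DWH.FACT_SALES SELECT * FROM DWH.FACT_SALES_STAGE;
COMMIT;                                  -- the table lock is released here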
It is possible this won't fix your problem. That probably means your insert statement is too complicated -- maybe it is reading from a bunch of other tables or from a federated table. If this is the case, re-architect your solution to include a staging table (first insert into the staging table .. slowly .. then insert into the target table from the staging table).