How to overwrite certain records of a hive table?
e.g.:
I have a table t1 with 100 records. Now I have to overwrite only 30 of those 100 records.
How can it be done?
Hive supports ACID properties starting with Hive 0.14. Using Hive transactions, you can update or delete your existing records.
There are a few rules to follow for performing update and delete operations. You can refer to this blog on Hive transactions for more details and implementation.
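For example, a minimal sketch, assuming Hive 0.14+ with the transaction manager enabled (the column names and predicate here are made up for illustration):

-- session settings typically required for Hive transactions
SET hive.support.concurrency = true;
SET hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;

-- ACID tables must be bucketed, stored as ORC, and flagged as transactional
CREATE TABLE t1 (id INT, val STRING)
CLUSTERED BY (id) INTO 4 BUCKETS
STORED AS ORC
TBLPROPERTIES ('transactional' = 'true');

-- update only the records that match a predicate, e.g. 30 of the 100 rows
UPDATE t1 SET val = 'new_value' WHERE id <= 30;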
My reputation is too low to directly answer to your comment, but here's a way to accomplish it:
INSERT OVERWRITE TABLE table_a
SELECT * FROM table_a a WHERE a.id NOT IN (SELECT id FROM table_b)
UNION ALL
SELECT * FROM table_b;
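If your Hive version only allows UNION ALL inside a subquery (older releases did), the same idea can be wrapped like this; a sketch using the same table_a/table_b/id names as above:

INSERT OVERWRITE TABLE table_a
SELECT * FROM (
    SELECT a.* FROM table_a a WHERE a.id NOT IN (SELECT id FROM table_b)
    UNION ALL
    SELECT b.* FROM table_b b
) unioned;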
I have around 270 million rows indexed on month||id_nr, and the update/delete queries below take around 4 hours to complete.
I was wondering if there is any other way to do update/delete which will be faster.
Update Query:-
update table_A
set STATUS='Y'
where
month||id_nr in (select distinct month||id_nr from table_A where STATUS='Y');
Delete Query:-
Delete from table_B
where
month||id_nr in (select distinct month||id_nr from table_A where STATUS='Y');
Why the string concatenation? And never try to force the DBMS to make rows distinct in an IN clause. Let the DBMS decide what it considers the best approach to look up the data.
So, just:
where (month, id_nr) in (select month, id_nr from table_A where status = 'Y');
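Applied to the statements from the question (assuming a DBMS that supports row-value comparisons, such as Oracle), that gives roughly:

UPDATE table_A
SET status = 'Y'
WHERE (month, id_nr) IN (SELECT month, id_nr FROM table_A WHERE status = 'Y');

DELETE FROM table_B
WHERE (month, id_nr) IN (SELECT month, id_nr FROM table_A WHERE status = 'Y');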
I suppose that id_nr is not a unique ID in table_b, for otherwise you wouldn't have to look at it combined with the month. An appropriate index would hence be:
create index idx_b on table_b (id_nr, month);
Or maybe, if you work a lot with the month, it may be a good idea to even partition the table by month. This could speed up queries, updates, and deletes immensely.
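A rough sketch of month partitioning, assuming Oracle and a numeric YYYYMM month column (the column list is a placeholder, not the real schema):

CREATE TABLE table_b (
    id_nr      NUMBER,
    month      NUMBER,
    other_data VARCHAR2(100)
)
PARTITION BY RANGE (month) (
    PARTITION p_2024_01 VALUES LESS THAN (202402),
    PARTITION p_2024_02 VALUES LESS THAN (202403),
    PARTITION p_max     VALUES LESS THAN (MAXVALUE)
);

-- removing a whole month then becomes a metadata operation:
-- ALTER TABLE table_b TRUNCATE PARTITION p_2024_01;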
For table_a I suggest
create index idx_a on table_a (status, id_nr, month);
which is a covering index. The first column will help find the desired rows quickly; the other two columns will be available without having to read the table row.
Here is the scenario:
TableA - schema does not really matter
TableB - same schema as TableA
CREATE SYNONYM ROTable FOR TableA
CREATE SYNONYM InsertTable FOR TableB
SELECT * FROM ROTable ... (read only)
TRUNCATE TABLE InsertTable
BULK INSERT INTO InsertTable (...) ... some data
DROP SYNONYM ROTable
DROP SYNONYM InsertTable
CREATE SYNONYM ROTable FOR TableB
CREATE SYNONYM InsertTable FOR TableA
I am not even sure what to call this technique, but I have always called it DOUBLE BUFFERED BULK INSERTS. You have two identical tables: you read from one while you are inserting records into the second. Once you are done inserting into the second, you switch to reading from it, truncate the first table, and start your bulk inserts again.
This would be used when you have large numbers of UPSERTs to do but don't want to run them against your active table.
The question is: Is this a viable technique, why or why not and does it have a common name?
Yes, this is a viable way to alternate between two sets of tables for ETL and reporting.
In Oracle, the same can be achieved using partition exchange.
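A rough sketch of the Oracle variant (the table and partition names are placeholders): load the fresh data into a standalone staging table, then swap it with a partition of the reporting table in one dictionary operation:

ALTER TABLE reporting_table
  EXCHANGE PARTITION p_current
  WITH TABLE staging_table
  INCLUDING INDEXES
  WITHOUT VALIDATION;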
Consider two very large tables: Table A with 20 million rows, and Table B, which has 10 million rows and a large overlap with Table A. Both have an identifier column and a bunch of other data. I need to move all items from Table B into Table A, updating where they already exist.
Both table structures:
- Identifier int
- Date datetime
- Identifier A
- Identifier B
- General decimal data (maybe 10 columns)
I can get the items in Table B that are new, and the items in Table B that need to be updated in Table A, very quickly, but I can't get an update or a delete/insert to work quickly. What options are available to merge the contents of TableB into TableA (i.e. updating existing records instead of inserting) in the shortest time?
I've tried pulling out existing records in TableB and running a large update on table A to update just those rows (i.e. an update statement per row), and performance is pretty bad, even with a good index on it.
I've also tried doing a one-shot delete from TableA of the rows that exist in TableB, and performance of the delete is also poor, even with the indexes dropped.
I appreciate that this may be difficult to perform quickly, but I'm looking for other options that are available to achieve this.
Since you deal with two large tables, in-place updates/inserts/merges can be time-consuming operations. I would recommend a bulk-logged technique: load the desired content into a new table and then perform a table swap.
Example using SELECT INTO:
SELECT *
INTO NewTableA
FROM (
    -- every row from TableB wins (new rows and the updated versions of existing rows)
    SELECT b.* FROM dbo.TableB b
    UNION ALL
    -- plus the TableA rows that TableB does not replace
    SELECT a.* FROM dbo.TableA a
    WHERE NOT EXISTS (SELECT * FROM dbo.TableB b WHERE b.id = a.id)
) d;

EXEC sp_rename 'TableA', 'BackupTableA';
EXEC sp_rename 'NewTableA', 'TableA';
Simple or at least Bulk-Logged recovery is highly recommended for such an approach. Also, I assume it has to be done outside business hours, since plenty of objects have to be recreated on the new table: indexes, default constraints, the primary key, etc.
A MERGE is probably your best bet if you want to do both inserts and updates.
MERGE TableA AS Tgt
USING (SELECT * FROM TableB) AS Src
    ON (Tgt.Identifier = Src.Identifier)
WHEN MATCHED THEN
    UPDATE SET Date = Src.Date, ...
WHEN NOT MATCHED THEN
    INSERT (Identifier, Date, ...)
    VALUES (Src.Identifier, Src.Date, ...);
Note that the MERGE statement must be terminated with a semicolon (;).
I created a query in which I insert into an empty table the result of a SELECT from another table.
The SELECT itself takes ~20 minutes (30M rows, 120 columns, and WHERE conditions, which is fine), but the INSERT INTO takes ~1 hour.
Do you have any suggestions for how to improve it?
What I've done is like the example below.
Insert Into tableA
Select *
From TableB
Appreciate your help!
Drop all the indexes on TableA, then insert again:
INSERT INTO tableA
SELECT * FROM TableB
Indexes are known to slow down insert statements.
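For example (the index name and column are placeholders, since the question doesn't show the actual indexes):

-- drop the index before the load
DROP INDEX IX_tableA_col1 ON tableA;

INSERT INTO tableA
SELECT * FROM TableB;

-- recreate it once the data is in
CREATE INDEX IX_tableA_col1 ON tableA (col1);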
Besides the already-mentioned indexes, you could check whether TableA has constraints defined (primary key, foreign key, etc.), since constraints are often implemented by indexes.
Furthermore, you could check whether there are triggers on TableA.
Another test would be to export/unload TableB to a file, e.g. TableB.txt, and then import/load the file TableB.txt into TableA. (Sorry, I don't know the syntax for SQL Server.)
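For SQL Server, a sketch of that export/import idea could look like this (the database name, file path, and terminators are assumptions):

-- export from the command line with bcp (character mode, trusted connection):
--   bcp YourDb.dbo.TableB out C:\data\TableB.txt -c -T
-- then bulk load the file into the target table:
BULK INSERT dbo.tableA
FROM 'C:\data\TableB.txt'
WITH (FIELDTERMINATOR = '\t', ROWTERMINATOR = '\n', TABLOCK);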
Another item to check may be the transaction log. It is probably possible to switch the recovery model to BULK_LOGGED; see:
Disable Transaction Log
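A minimal sketch of that idea, assuming SQL Server and a placeholder database name:

ALTER DATABASE YourDb SET RECOVERY BULK_LOGGED;
-- run the big INSERT ... SELECT here
ALTER DATABASE YourDb SET RECOVERY FULL;  -- switch back (and take a log backup) afterwards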
The best way is to drop the table and use SELECT ... INTO, which will be faster:
SELECT * INTO tableA FROM TableB
This approach is also mentioned here.
I have a Hive statement as below:
INSERT INTO TABLE myTable PARTITION (myDate) SELECT * FROM myOthertable
myOthertable contains 1 million records and, while executing the above INSERT, not all rows are inserted into myTable. As it is a SELECT * query without any WHERE clause, ideally the INSERT should copy all the rows from myOthertable into myTable, yet it ignores some of the rows while inserting.
Can anyone suggest why this is happening?
The issue may be that, if the table is large enough, the above query won't work correctly, seemingly due to the large number of files created by the initial map tasks.
In that case, group the records in your Hive query on the map side and process them on the reduce side. You can implement this in the Hive query itself with DISTRIBUTE BY. Below is the query:
FROM myOthertable
INSERT OVERWRITE TABLE myTable PARTITION (myDate)
SELECT other1, other2, myDate
DISTRIBUTE BY myDate;
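A dynamic-partition insert like this usually also needs dynamic partitioning enabled for the session; these are standard Hive settings, though the exact values depend on your configuration:

SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;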
This link may help