I have a data set with around 13,000 records and I am trying to insert another data set with around 13,000 records into the first table. I am not receiving any error messages, but the resulting table has fewer than 13,000 records instead of the expected 26,000.
My query looks like this
insert into table base_table
select
*
from second_table
A proper query would be:
insert into tablename (field1, field2, field3) select field1, field2, field3 from tablename
Do not include the ID (auto-increment) column when inserting from the same table, and if you still face issues, kindly share a screenshot of the error.
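Applied to the tables in the question, a minimal sketch might look like this (field1, field2 and field3 are placeholder column names; the auto-increment ID is deliberately left out so the target table generates new IDs):
-- Sketch only: list the real columns explicitly and omit the auto-increment ID.
INSERT INTO base_table (field1, field2, field3)
SELECT field1, field2, field3
FROM second_table;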
Turns out I was making a mistake earlier on in my code that was affecting my results. After correcting that issue, my original query worked. Thanks to everyone who looked at this.
I've got a SELECT statement which takes around 500-600 ms to execute. If I use the same SELECT in an INSERT INTO ... SELECT ... or a SELECT ... INTO, it takes up to 30 seconds.
The table is essentially a copy of a view, kept for performance reasons, which gets truncated and refilled with the data from time to time. So my SQL looks like:
TRUNCATE myTable
INSERT INTO myTable (col, col, col) SELECT col, col, col FROM otherTable INNER JOIN ...
I tried multiple things, like inserting the data into a temp table so there are no indexes etc. on the table (I also tried dropping the indexes from the original table), but nothing seems to help. If I insert the data into the temp table first (which also takes 30 seconds) and then copy it to the real table, the copy itself is pretty fast (< 1 second).
The query returns ~3,800 rows with around 30-40 columns.
The second time, executing the TRUNCATE plus INSERT INTO ... SELECT / SELECT ... INTO takes less than a second (until I clear all caches). The execution plans look the same, except for the Table Insert, which has a cost of 90%.
I also tried to get rid of any implicit conversions, but that didn't help either.
Does anyone know how this is possible, or how I could track down the problem? The problem exists on multiple systems running SQL Server 2014/2016.
Edit: I just saw that the execution plan of my SELECT shows an "Excessive Grant" warning, as it estimated ~11,000 rows but the result is only ~3,800 rows. Could that be a reason for the slow insert?
I've just had the same problem. All the data types, sizes and allow-NULLs were the same in my SELECT and the target table. I tried changing the table to a heap, then to a clustered table, but it made no difference. The SELECT took around 15 seconds, but with the INSERT it took around 4 minutes.
In my case, I ended up using SELECT INTO a temp table, then SELECTing from that into my real table, and it reverted back to 15 seconds or so.
The OP said they tried this and it didn't work, but it may work for some people.
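A minimal sketch of that two-step workaround, with illustrative table and column names:
-- Step 1: run the expensive SELECT into a temp table (no target indexes involved).
SELECT col1, col2, col3
INTO #staging
FROM otherTable;

-- Step 2: copy from the temp table into the real table; this part is fast.
TRUNCATE TABLE myTable;
INSERT INTO myTable (col1, col2, col3)
SELECT col1, col2, col3
FROM #staging;

DROP TABLE #staging;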
I had an identical problem.
The SELECT took around 900 ms to execute; the INSERT ... SELECT / SELECT ... INTO took more than 2 minutes.
I rewrote the SELECT to improve performance: it only gained a few milliseconds on the SELECT itself, but it greatly improved the INSERT.
Try to simplify the query plan as much as possible.
For example, if you have multiple joins, try to prepare a multi-step solution, as in the sketch below.
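A hedged sketch of what "multi-step" can mean in practice, using hypothetical tables (orders, customers, report): instead of one large joined SELECT feeding the INSERT, stage an intermediate result first.
-- Step 1: materialize part of the work into a temp table (all names are hypothetical).
SELECT o.orderId, o.customerId, o.amount
INTO #orderSubset
FROM orders AS o
WHERE o.orderDate >= '2016-01-01';

-- Step 2: the final insert now only has to join the small staged set.
INSERT INTO report (orderId, customerName, amount)
SELECT s.orderId, c.customerName, s.amount
FROM #orderSubset AS s
INNER JOIN customers AS c ON c.customerId = s.customerId;

DROP TABLE #orderSubset;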
For what it's worth now, I had a similar problem just today. It turned out that the table I was inserting into had INT types, and the table I was selecting from had SMALLINT types. Thus, a type conversion was going on (several times) for each row.
Once I changed the target table to have the same types as the source table, the insert and the plain SELECT took times of the same order of magnitude.
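As a rough sketch of that fix on SQL Server (targetTable and someColumn are placeholders; repeat for each mismatched column and keep the column's original NULL/NOT NULL setting):
-- Align the target column's type with the source so no per-row conversion happens.
ALTER TABLE targetTable ALTER COLUMN someColumn smallint;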
Let's say we have a query like this (my actual query is similar to this but pretty long)
insert into t1(id1,c1,c2)
select id1,c1,c2 from t2
where not exists(select * from t1 where t1.id1=t2.id1-1)
Does this query select first and insert all, or insert each selected item one by one?
It matters because I'm trying to insert a record depending on the previously inserted records, and it doesn't seem to work.
First the SELECT query is run, so it will select all the rows that match your filter. After that the insert is performed. There is no row-by-row insertion when you use a single statement.
Still, if you want to do something recursive that will check after each insert, you can use CTEs (Common Table Expressions): http://msdn.microsoft.com/en-us/library/ms190766(v=sql.105).aspx
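For illustration only, here is a minimal recursive CTE in the spirit of the t1/t2 example; the seed value (id1 = 1) and the chain condition are assumptions, not the asker's actual logic:
-- Walk a chain of rows in t2, starting from an assumed seed row,
-- then insert the whole chain in one statement.
WITH chain AS (
    SELECT id1, c1, c2
    FROM t2
    WHERE id1 = 1                      -- hypothetical starting row
    UNION ALL
    SELECT t2.id1, t2.c1, t2.c2
    FROM t2
    INNER JOIN chain ON t2.id1 = chain.id1 + 1
)
INSERT INTO t1 (id1, c1, c2)
SELECT id1, c1, c2
FROM chain;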
An INSERT ... SELECT like the one in the question runs the SELECT one time and then inserts based on that result. It is much more efficient that way.
Since you already know what you will be inserting, you should be able to handle this in your select query rather than looking at what you have already inserted.
I've been trying to find out why my SQLite database is performing relatively slowly (4 seconds to insert 1500 records) and I think I've narrowed it down to this query. Is there a way to optimise this?
"INSERT OR REPLACE INTO MainFrame(WID,PName,PAlias,PModel,FriendID, UniverseID, GalaxyID) VALUES
((SELECT WID FROM Worlds WHERE WName= ?),
#pname,
#palias,
#pmodel,
(SELECT FriendID FROM Friend WHERE FriendName = #eFriend),
(SELECT UniverseID FROM Universes WHERE UniverseName = #eUniverse),
(SELECT GalaxyID FROM Galaxies WHERE GalaxyName = #eGalaxy ))";
As you can see, there are a few Selects being used in an insert query. The reason for this is because the loop inserts data into other tables (WID, FriendID, UniverseID, GalaxyID) so I don't have that data until it's been inserted. I need this data to insert into the MainFrame table but this feels like a brute force approach. Any advice?
Have you narrowed it down to which part of the query is the problem? I.e., have you run the SELECTs on their own to see how quickly they return? If the SELECTs are slow, maybe look at indexes on the lookup tables. If they are quick, maybe it's the indexes on the MainFrame table that are slowing the insertion.
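If the lookups turn out to be the slow part, a sketch of the kind of indexes that could help (assuming the name columns are not already indexed or declared unique):
-- Indexes on the columns used by the sub-selects, so each lookup is a seek.
CREATE INDEX idx_worlds_wname      ON Worlds (WName);
CREATE INDEX idx_friend_friendname ON Friend (FriendName);
CREATE INDEX idx_universes_name    ON Universes (UniverseName);
CREATE INDEX idx_galaxies_name     ON Galaxies (GalaxyName);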
If your ID fields are autoincrementing, you can get their value right after inserting the respective record by calling sqlite3_last_insert_rowid() in the C API, or the corresponding function in your language.
(Also use one transaction for all inserts.)
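A rough sketch of both ideas combined, in plain SQLite SQL: the last_insert_rowid() SQL function mirrors the sqlite3_last_insert_rowid() C API call, the values are placeholders, and only two MainFrame columns are shown for brevity:
BEGIN TRANSACTION;

-- Insert the parent row first, then reuse its new rowid directly
-- instead of re-selecting it by name.
INSERT INTO Worlds (WName) VALUES ('SomeWorld');
INSERT INTO MainFrame (WID, PName) VALUES (last_insert_rowid(), 'SomePlayer');

COMMIT;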
I have a Hive statement as below:
INSERT INTO TABLE myTable PARTITION (myDate) SELECT * FROM myOthertable
myOthertable contains 1 million records, and while executing the above insert, not all rows are inserted into myTable. As it is a SELECT * query without any WHERE clause, ideally the insert should copy all the rows from myOthertable into myTable, but it skips some of the rows while inserting.
Can anyone suggest why this is happening?
The issue may be that, if the table is large enough, the above query won't work, seemingly due to the large number of files created by the initial map tasks.
In that case, group the records in your Hive query on the map side and process them on the reduce side. You can implement this in the Hive query itself by using DISTRIBUTE BY. Below is the query.
FROM myOthertable
INSERT OVERWRITE TABLE myTable PARTITION (myDate)
-- the partition column (myDate) must be selected last for dynamic partitioning
SELECT other1, other2, myDate
DISTRIBUTE BY myDate;
This link may help
Hi, I am having a problem when trying to update a table using an IN clause. I have a big list of clients that should be updated, 4,500+.
Update table
set columnA = 'value'
where ID in ( biglistofids ) -- biglistofids > 4500 ids
I am getting this error
"String or binary data would be truncated."
I tried the same script with fewer ids, let's say 2,000, and it worked fine.
I have also tried using a temporary table, but I got the same error.
SELECT Id INTO tmpTable FROM dbo.table WHERE id IN (biglistofids) -- creates the temporary table successfully
Update table set columnA = 'value' FROM table INNER JOIN tmpTable ON table.ID = tmpTable.ID
Is there any way to handle this without repeating the code for every 2,000 records?
Thanks in advance
The "String or binary data would be truncated." has nothing to do with the IN clause.
It means in this line:
set columnA = 'value'
you are setting columnA to something that is too long to be held in columnA.
Maybe certain ids have corresponding data that is too long, and these are not among the first 2000 you have tried.
It looks to me, based on your error, that the actual problem is with one or more of the values you're updating. I'd try validating the input first. I've done this in many ways depending on the number of records I had, the size of the value, the type of value, etc., so that will depend on your specific scenario.
The most straightforward one (not necessarily the best) is the one you describe. Try to do 2,000. If that works, try the next 2,000, and so on. That is time-intensive and clunky and may not be the best for your situation, but I've never seen it fail to identify my problem.
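One hedged way to validate the input up front, assuming the real update pulls the new value from a source column rather than a fixed literal (sourceTable, sourceColumn and the length 50 are placeholders for your actual schema):
-- 1) Check how long columnA is allowed to be (table/column names from the question).
SELECT COLUMN_NAME, CHARACTER_MAXIMUM_LENGTH
FROM INFORMATION_SCHEMA.COLUMNS
WHERE TABLE_NAME = 'table' AND COLUMN_NAME = 'columnA';

-- 2) Find the source rows whose values exceed that length, so they can be
--    fixed or excluded before running the UPDATE.
SELECT ID, LEN(sourceColumn) AS valueLength
FROM dbo.sourceTable
WHERE LEN(sourceColumn) > 50;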