I am trying to insert rows into a table while keeping the existing data, but Hive overwrites whatever is already there. After executing the following, I expect 2 rows:
1 2
3 4
but only see 1 row (3 4) in the table.
CREATE TABLE `my_db.test_table`
(
`x1` STRING
,`x2` STRING
)
LOCATION '/.../test_table'
;
INSERT INTO TABLE `my_db.test_table`
SELECT '1', '2'
;
INSERT INTO TABLE `my_db.test_table`
SELECT '3', '4'
;
According to the Hive Language Manual, an overwrite should only happen with INSERT OVERWRITE, not with INSERT INTO.
What could cause this overwrite?
I found the culprit: it's the backtick / backquote (`). The issue is noted here.
This will perform an overwrite:
INSERT INTO TABLE `my_db.test_table`
SELECT '3', '4'
while this will append:
INSERT INTO TABLE my_db.test_table
SELECT '3', '4'
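If the table name has to be quoted (for example because the statement is generated by a tool), one option is to backtick each part of the qualified name on its own instead of wrapping the whole db.table string in one pair. The helper below is a minimal sketch of that idea; `quote_hive_table` is a hypothetical name, and it is an assumption that the separately quoted form appends like the unquoted one, since this thread only confirms the unquoted form.

```python
def quote_hive_table(qualified_name: str) -> str:
    """Backtick-quote each dot-separated part of a table name separately."""
    parts = qualified_name.split(".")
    for part in parts:
        # Keep the sketch simple: refuse empty parts and embedded backticks
        # instead of guessing at escaping rules.
        if not part or "`" in part:
            raise ValueError(f"unsupported identifier part: {part!r}")
    return ".".join(f"`{part}`" for part in parts)


# Build the statement from the question with the parts quoted individually.
statement = (
    f"INSERT INTO TABLE {quote_hive_table('my_db.test_table')} SELECT '3', '4'"
)
print(statement)
```

Run it against a scratch table first and check that the row count grows before relying on it.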
I tried the same on my end and the data was appended, one row after another.
As you stated, Hive deletes the previously available data only when OVERWRITE is specified.
You can also see that the directory now holds two files; each consecutive INSERT INTO statement creates a new file under the directory.
CREATE TABLE IF NOT EXISTS Test_Table (X1 STRING, X2 STRING) LOCATION '/hive1';
Could you please try the same again and let us know if you are still facing the issue?
I got this error in Oracle SQL with an INSERT ... SELECT query and don't know where the error comes from.
The SQL query is:
insert into GroupScenarioAction (ID, ID_UUID, GPSCENARIO_UUID, ACTION, VERSION)
(select DEFAULT , '0', ACTION.ID_UUID, '5310AFAA......', '1', ACTION_ID, '0'
from ACTION where ACTION.id not in (select ACTION FROM GroupScenarioAction where
GPSCENARIO = '1'));
The error is ORA-00936: missing expression, at position 129.
It is difficult to assist because:
you posted relevant data as images (why do you expect us to type all of that so that we could try it?) instead of code (which can easily be copy/pasted and used afterwards)
the code you posted (the insert statement itself) uses columns that don't exist in any of the tables whose description you posted
for example, the insert targets GroupScenarioAction, but there's no such table there; maybe it is goroohscenarioaction? Also, there's no action_id column in the action table
you're inserting values into 5 columns, but the select statement contains 7 columns; that raises the ORA-00913: too many values error, so you never even reach the missing expression error
In short, it is as if you tried to do everything you could to prevent us from helping you.
One of comments you posted says
It's the primary key so where are those values supposed to come from?
That's the default keyword in
insert into GroupScenarioAction (ID, ...)
(select DEFAULT, ...
-------
this
It looks like the ID column was created as an identity column whose value is autogenerated (i.e. Oracle takes care of it), which also means that you're on Oracle 12c or above (there was no such option in lower versions). On the other hand, the create table goroohscenarioaction statement doesn't suggest anything like that.
Anyway: if you do it right, it works. I created two sample tables with a minimum column set, just to make the insert work. Also, as I'm on 11gXE (which doesn't support identity columns), I'm inserting a sequence value which is, basically, what an identity column uses in the background anyway:
SQL> create table groupscenarioaction
2 (id number,
3 id_uuid raw(255),
4 gpscenario_uuid raw(255),
5 action number,
6 version number
7 );
Table created.
SQL> create table action
2 (id_uuid raw(255),
3 id number
4 );
Table created.
SQL> create sequence seq;
Sequence created.
Here is the insert you posted; I commented out columns that either don't exist or are superfluous. It works: it didn't insert anything because my tables are empty, but it doesn't raise any error:
SQL> insert into GroupScenarioAction
2 (ID, ID_UUID, GPSCENARIO_UUID, ACTION, VERSION)
3 (select 1 /*DEFAULT*/ , '0', ACTION.ID_UUID, '5310AFAA......', '1' --, id /*ACTION_ID*/, '0'
4 from ACTION
5 where ACTION.id not in (select ACTION FROM GroupScenarioAction
6 where gpscenario_uuid/*GPSCENARIO*/ = '1'));
0 rows created.
Beautified:
SQL> insert into groupscenarioaction
2 (id, id_uuid, gpscenario_uuid, action, version)
3 (select seq.nextval, '0', a.id_uuid, '5310AFAA......', '1'
4 from action a
5 where a.id not in (select g.action
6 from groupscenarioaction g
7 where g.gpscenario_uuid = '1'));
0 rows created.
SQL>
Now that you know a little bit more about what's keeping us from helping you, and if what I wrote isn't enough, consider editing the original question you posted (simply remove everything that's wrong and write something that is true and that we can use).
We are loading data into a fact table. The original temporary table on Snowflake looks like the following:
Where the indicator_nbr fields are questions asked within a survey.
We are using data modelling techniques to build our warehouse database, so the data will be added to a fact table like so:
Then the same for indicators 2 and 3, and so on if there are other questions.
Each field with its value will be a single row. Of course there is other metadata to be added, like load_dt and record_src, but those are not a problem.
The current script is doing the following:
Get the fields into an array => fields_array = ['indicator_1', 'indicator_2', 'indicator_3']
A loop runs over the array and adds each field with its value for each row. So if we have 100 rows, we will run 300 inserts, one at a time:
for (var col_num = 0; col_num<fields_array.length; col_num = col_num+1) {
var COL_NAME = fields_array[col_num];
var field_value_query = "INSERT INTO SAT_FIELD_VALUE SELECT md5(id), CURRENT_TIMESTAMP(), NULL, 'SRC', "+COL_NAME+", md5(foreign_key_field) FROM "+TEMP_TABLE_NAME+"";
}
As mentioned in the comment on this post showing the full script, it would be better to loop over the fields and concatenate the values for each one into a single insert query.
There are two issues with the suggested solution:
There is a size limit on a query in Snowflake (it should be less than 1 MB);
if we loop over each field and concatenate the values, we also have to run a select query against the temp table to get the value of each column, so there will be no optimization, or we will reduce the time a little but not by much.
EDIT: A possible solution
I was thinking of running an SQL query that selects everything from the temp table, does the hashing and everything else, and saves the result into an array after transposing, but I have no idea how to do it.
Not sure if this is what you're looking for, but it seems as though you just want to do an unpivot:
Setup example scenario
create or replace transient table source_table
(
id number,
indicator_1 varchar,
indicator_2 number,
indicator_3 varchar
);
insert overwrite into source_table
values (1, 'Test', 2, 'DATA'),
(2, 'Prod', 3, 'DATA'),
(3, 'Test', 1, 'METADATA'),
(4, 'Test', 1, 'DATA')
;
create or replace transient table target_table
(
hash_key varchar,
md5 varchar
);
Run insert
insert into target_table
select
name_col as hash_key,
md5(id)
from (select
id,
indicator_1,
indicator_2::varchar as indicator_2,
indicator_3
from source_table) unpivot ( val_col for name_col in (indicator_1, indicator_2, indicator_3))
;
This results in a target_table that looks like this:
+-----------+--------------------------------+
|HASH_KEY |MD5 |
+-----------+--------------------------------+
|INDICATOR_1|c4ca4238a0b923820dcc509a6f75849b|
|INDICATOR_2|c4ca4238a0b923820dcc509a6f75849b|
|INDICATOR_3|c4ca4238a0b923820dcc509a6f75849b|
|INDICATOR_1|c81e728d9d4c2f636f067f89cc14862c|
|INDICATOR_2|c81e728d9d4c2f636f067f89cc14862c|
|INDICATOR_3|c81e728d9d4c2f636f067f89cc14862c|
|INDICATOR_1|eccbc87e4b5ce2fe28308fd9f2a7baf3|
|INDICATOR_2|eccbc87e4b5ce2fe28308fd9f2a7baf3|
|INDICATOR_3|eccbc87e4b5ce2fe28308fd9f2a7baf3|
|INDICATOR_1|a87ff679a2f3e71d9181a67b7542122c|
|INDICATOR_2|a87ff679a2f3e71d9181a67b7542122c|
|INDICATOR_3|a87ff679a2f3e71d9181a67b7542122c|
+-----------+--------------------------------+
This is a great scenario for INSERT ALL:
INSERT ALL
INTO dst_tab(hash_key, md5) VALUES (indicator_1, md5)
INTO dst_tab(hash_key, md5) VALUES (indicator_2, md5)
INTO dst_tab(hash_key, md5) VALUES (indicator_3, md5)
SELECT MD5(id) AS md5, indicator_1, indicator_2::STRING AS indicator_2, indicator_3
FROM src_tab;
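When the list of indicator columns varies, as with fields_array in the question, the INSERT ALL statement could be generated rather than written by hand. The following is a minimal sketch, not a tested Snowflake procedure: `build_insert_all` is a hypothetical helper, the table names default to src_tab and dst_tab from the statement above, and every column is cast with ::STRING on the assumption that hash_key is a varchar.

```python
import re

# Plain, unquoted identifiers only; anything else is rejected.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def build_insert_all(fields, src_table="src_tab", dst_table="dst_tab"):
    """Build one INSERT ALL statement with an INTO clause per field."""
    for name in [*fields, src_table, dst_table]:
        # Names are interpolated into SQL text, so validate them first.
        if not _IDENTIFIER.fullmatch(name):
            raise ValueError(f"not a plain identifier: {name!r}")
    into_lines = [
        f"  INTO {dst_table}(hash_key, md5) VALUES ({field}, md5)"
        for field in fields
    ]
    select_cols = ", ".join(f"{field}::STRING AS {field}" for field in fields)
    return (
        "INSERT ALL\n"
        + "\n".join(into_lines)
        + f"\nSELECT MD5(id) AS md5, {select_cols}\nFROM {src_table};"
    )


sql = build_insert_all(["indicator_1", "indicator_2", "indicator_3"])
print(sql)
```

The result is a single statement that reads the source table once, no matter how many fields are in the list.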
I have a table partitioned on year, month, day and hour. If I use the following INSERT OVERWRITE to a specific partition, it places a file under the appropriate directory structure. This file contains the string abc:
INSERT OVERWRITE TABLE testtable PARTITION(year = 2017, month = 7, day=29, hour=18)
SELECT tbl.c1 FROM
(
select 'abc' as c1
) as tbl;
But if I use the following statement, Hive surprisingly creates three new folders under the folder "hour=18".
And there is a file inside each of these three subfolders.
INSERT OVERWRITE TABLE testtable PARTITION(year = 2017, month = 7, day=29, hour=18)
SELECT tbl.c1 FROM
(
select 'abc' as c1
union ALL
select 'xyz' as c1
union ALL
select 'mno' as c1
) as tbl;
When I query the data, it shows the data as expected. But why did it create these 3 new folders? Since the partitioning scheme is only for year,month,day and hour I wouldn't expect Hive to create folders for anything other than these.
Actually, it has nothing to do with INSERT OVERWRITE or partitioning.
It's the UNION ALL statement that adds the additional directories.
Why does it bother you?
You can do some DISTRIBUTE BY shenanigans or set the number of reducers to 1 to put this into one file.
I had the same issue and thought I would share.
UNION ALL adds extra subfolders in the table.
A count(*) on the table will give 0 records, and MSCK REPAIR will error out with the default properties.
After using set hive.msck.path.validator=ignore; MSCK will not error out, but it will report "Partitions not in metastore".
Only after setting the properties mentioned above by DogBoneBlues
(SET hive.mapred.supports.subdirectories=TRUE;
SET mapred.input.dir.recursive=TRUE;) does the table return values for count(*).
You can use just "union" instead of "union all" if you don't care about duplicates. "union" should not create sub-folders.
I need to check if a row exists or not. If it does not exist, it should be inserted.
This is in PostgreSQL, and I am trying to insert the row through a shell script. When I run the script it does not show an error, but it does not insert into the table even though no matching row is present.
I like the solution they mention here
INSERT INTO table (id, field, field2)
SELECT 3, 'C', 'Z'
WHERE NOT EXISTS (SELECT 1 FROM table WHERE id=3);
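The behaviour of this pattern can be checked quickly without a server. The sketch below uses Python's built-in sqlite3 module purely as a stand-in for PostgreSQL (an assumption made for illustration; the same INSERT ... SELECT ... WHERE NOT EXISTS shape is used), and the table name items and the helper ensure_row are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE items (id INTEGER PRIMARY KEY, field TEXT, field2 TEXT)"
)

# The SELECT produces a row only when no row with that id exists yet.
INSERT_IF_MISSING = """
    INSERT INTO items (id, field, field2)
    SELECT ?, ?, ?
    WHERE NOT EXISTS (SELECT 1 FROM items WHERE id = ?)
"""


def ensure_row(row_id, field, field2):
    """Insert the row unless one with the same id is already present."""
    with conn:  # commits on success, rolls back on error
        conn.execute(INSERT_IF_MISSING, (row_id, field, field2, row_id))


ensure_row(3, "C", "Z")
ensure_row(3, "C", "Z")  # second call finds the row and inserts nothing
row_count = conn.execute("SELECT COUNT(*) FROM items").fetchone()[0]
print(row_count)
```

Note that a check like this does not by itself protect against two sessions inserting the same id at the same moment; a primary key or unique constraint is still what enforces that.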
From my code (Java) I want to ensure that a row exists in the database (DB2) after my code is executed.
My code now does a select and if no result is returned it does an insert. I really don't like this code since it exposes me to concurrency issues when running in a multi-threaded environment.
What I would like to do is to put this logic in DB2 instead of in my Java code.
Does DB2 have an insert-or-update statement? Or anything like it that I can use?
For example:
insertupdate into mytable values ('myid')
Another way of doing it would probably be to always do the insert and catch "SQL-code -803 primary key already exists", but I would like to avoid that if possible.
Yes, DB2 has the MERGE statement, which will do an UPSERT (update or insert).
MERGE INTO target_table USING source_table ON match-condition
{WHEN [NOT] MATCHED
THEN [UPDATE SET ...|DELETE|INSERT VALUES ....|SIGNAL ...]}
[ELSE IGNORE]
See:
http://publib.boulder.ibm.com/infocenter/db2luw/v9/index.jsp?topic=/com.ibm.db2.udb.admin.doc/doc/r0010873.htm
https://www.ibm.com/support/knowledgecenter/en/SS6NHC/com.ibm.swg.im.dashdb.sql.ref.doc/doc/r0010873.html
https://www.ibm.com/developerworks/community/blogs/SQLTips4DB2LUW/entry/merge?lang=en
I found this thread because I really needed a one-liner for DB2 INSERT OR UPDATE.
The following syntax seems to work, without requiring a separate temp table.
It works by using VALUES() to create a table structure. The SELECT * seems surplus IMHO, but without it I get syntax errors.
MERGE INTO mytable AS mt USING (
SELECT * FROM TABLE (
VALUES
(123, 'text')
)
) AS vt(id, val) ON (mt.id = vt.id)
WHEN MATCHED THEN
UPDATE SET val = vt.val
WHEN NOT MATCHED THEN
INSERT (id, val) VALUES (vt.id, vt.val)
;
If you have to insert more than one row, the VALUES part can be repeated without having to duplicate the rest.
VALUES
(123, 'text'),
(456, 'more')
The result is a single statement that can INSERT OR UPDATE one or many rows presumably as an atomic operation.
This response is to hopefully fully answer the query MrSimpleMind had in use-update-and-insert-in-same-query and to provide a working simple example of the DB2 MERGE statement with a scenario of inserting AND updating in one go (record with ID 2 is updated and record ID 3 inserted).
CREATE TABLE STAGE.TEST_TAB ( ID INTEGER, DATE DATE, STATUS VARCHAR(10) );
COMMIT;
INSERT INTO TEST_TAB VALUES (1, '2013-04-14', NULL), (2, '2013-04-15', NULL); COMMIT;
MERGE INTO TEST_TAB T USING (
SELECT
3 NEW_ID,
CURRENT_DATE NEW_DATE,
'NEW' NEW_STATUS
FROM
SYSIBM.DUAL
UNION ALL
SELECT
2 NEW_ID,
NULL NEW_DATE,
'OLD' NEW_STATUS
FROM
SYSIBM.DUAL
) AS S
ON
S.NEW_ID = T.ID
WHEN MATCHED THEN
UPDATE SET
(T.STATUS) = (S.NEW_STATUS)
WHEN NOT MATCHED THEN
INSERT
(T.ID, T.DATE, T.STATUS) VALUES (S.NEW_ID, S.NEW_DATE, S.NEW_STATUS);
COMMIT;
Another way is to execute these two queries. It's simpler than writing a MERGE statement:
update TABLE_NAME set FIELD_NAME=xxxxx where MyID=XXX;
INSERT INTO TABLE_NAME (MyField1,MyField2) SELECT xxx,xxxxx FROM SYSIBM.SYSDUMMY1
WHERE NOT EXISTS(select 1 from TABLE_NAME where MyId=xxxx);
The first query just updates the field you need, if the MyId exists.
The second inserts the row into the db if the MyId does not exist.
The result is that both queries run, but only one of them actually changes a row in your db.
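The update-then-insert pattern can be tried end to end with a small script. This is a minimal sketch under stated assumptions: Python's built-in sqlite3 module stands in for DB2, the table settings and its columns are hypothetical, and both statements run in one transaction. It shows the effect of the pattern on a single connection, not its behaviour under concurrent writers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE settings (my_id INTEGER PRIMARY KEY, field_name TEXT)"
)


def upsert(my_id, value):
    """Update the row if it exists, otherwise insert it."""
    with conn:  # one transaction: commit on success, roll back on error
        # Changes a row only when my_id is already present.
        conn.execute(
            "UPDATE settings SET field_name = ? WHERE my_id = ?",
            (value, my_id),
        )
        # Inserts a row only when my_id is still missing.
        conn.execute(
            "INSERT INTO settings (my_id, field_name) "
            "SELECT ?, ? "
            "WHERE NOT EXISTS (SELECT 1 FROM settings WHERE my_id = ?)",
            (my_id, value, my_id),
        )


upsert(1, "first")   # row is missing, so the INSERT takes effect
upsert(1, "second")  # row exists, so the UPDATE takes effect
upsert(2, "other")
rows = conn.execute(
    "SELECT my_id, field_name FROM settings ORDER BY my_id"
).fetchall()
print(rows)
```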
I started with a Hibernate project, where Hibernate allows you to call saveOrUpdate().
I converted that project into a JDBC project, and the problem was with save and update.
I wanted to save and update at the same time using JDBC.
So I did some research and came across ON DUPLICATE KEY UPDATE:
String sql = "Insert into tblstudent (firstName,lastName,gender) values (?,?,?) "
    + "ON DUPLICATE KEY UPDATE "
    + "firstName= VALUES(firstName), "
    + "lastName= VALUES(lastName), "
    + "gender= VALUES(gender)";
The issue with the above code was that it seemed to update the primary key twice, which is expected per the MySQL documentation:
The affected rows is just a return code. 1 row means you inserted, 2 means you updated, 0 means nothing happened.
I introduced id and incremented it by 1, so now I was incrementing the value of id myself instead of MySQL.
String sql = "Insert into tblstudent (id,firstName,lastName,gender) values (?,?,?,?) "
    + "ON DUPLICATE KEY UPDATE "
    + "id=id+1, "
    + "firstName= VALUES(firstName), "
    + "lastName= VALUES(lastName), "
    + "gender= VALUES(gender)";
The above code worked for me for both insert and update.
Hope it works for you as well.