Table to table insert w/o duplicates in hive - sql

Table A is truncated and reloaded from each monthly file, and table B is append-only.
So table A is a file-to-table load in Hive.
Table B is populated by inserting (appending) the data from table A.
The issue is that the insert into table B is a plain SELECT from table A, so the same data can be inserted more than once.
How should I write the SELECT query that inserts data from table A?
Both tables have a file date column (File_date).
A LEFT JOIN of A and B gives wrong counts in this insert.
NOT EXISTS is not working for me in Hive.
Issue Is:
Append table script : partitioned by yearmonth
Insert into table dist.t2
Select
Person_sk,
Np_id,
Yearmonth,
Insert_date,
File_date
From raw.ma
Data in table raw.ma (this table is truncated and reloaded):
File1 data: 201902
File2 data: 201903
File3 data: 201904
File4 data: if 201902 data is loaded into the table again, it should not duplicate the File1 data. It should either not be inserted or it should overwrite that partition.
I need a filter or WHERE condition for appending data into dist.t2.
Can you please help with this?
I tried ALTER TABLE ... DROP PARTITION in Hive, but it fails in the Spark framework.
Please help me avoid inserting duplicate entries.
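The two behaviours asked for above (overwrite the month's partition, or skip months that are already loaded) could be sketched as follows. This is only a sketch under assumptions: that dist.t2 is partitioned by yearmonth, and that its remaining columns are person_sk, np_id, insert_date, file_date in that order. Run one option or the other, not both.

```sql
-- Assumption: dist.t2 is PARTITIONED BY (yearmonth); the column order is a guess.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;

-- Option 1: overwrite. With a dynamic-partition INSERT OVERWRITE, Hive replaces
-- only the partitions that occur in the SELECT result; other months are untouched.
INSERT OVERWRITE TABLE dist.t2 PARTITION (yearmonth)
SELECT person_sk, np_id, insert_date, file_date, yearmonth
FROM raw.ma;

-- Option 2: skip. Append only the months that dist.t2 does not contain yet.
-- Joining against DISTINCT yearmonth (one row per month) keeps the join from
-- multiplying rows, which is a common cause of wrong counts with LEFT JOIN.
INSERT INTO TABLE dist.t2 PARTITION (yearmonth)
SELECT a.person_sk, a.np_id, a.insert_date, a.file_date, a.yearmonth
FROM raw.ma a
LEFT JOIN (SELECT DISTINCT yearmonth FROM dist.t2) b
  ON a.yearmonth = b.yearmonth
WHERE b.yearmonth IS NULL;
```

If the statement is run through Spark SQL rather than Hive, the overwrite behaviour may depend on the table type and on spark.sql.sources.partitionOverwriteMode, so it is worth checking on a test table first.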

Related

Hive mismatched counts after table migration

I need to migrate 2 tables (table A and B) to a new cluster.
I applied the same query on the 2 tables. Table A works fine, but Table B has mismatched counts. There are more counts in the new cluster. After some investigation, I found the extra counts are Null rows. But I can't find the cause of this extra-count issue.
My procedure is as below:
Export Hive table
INSERT OVERWRITE LOCAL DIRECTORY
'/path/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0007' null defined as '' stored as textfile
SELECT * FROM export_table_name
WHERE file_date between '2021-01-01' and '2022-01-31'
LIMIT 2100000000;
*One difference between Table A and B: Table B is much bigger than A. When I exported Table B, I split it in half and exported it twice. The queries used WHERE date between '2021-01-01' and '2021-06-30' and WHERE date between '2021-07-01' and '2021-12-31'
SCP the exported files to the new cluster
Create table schema with
CREATE TABLE myTable_temp(
columns
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0007'
stored as textfile;
Import the files to the temp table (non-partitioned)
load data inpath 'myPath' overwrite into table myTable_temp;
*For table B, I imported twice. The query for the second import was load data inpath 'myPath' into table myTable_temp;
Create table schema + one extra column "partition_key" for the actual table
Inject data from the temp table to the actual table (partitioned)
insert into table myTable partition(partition_key) select *, concat(year(file_date)) partition_key from myTable_temp;
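To narrow down where the extra NULL rows appear in the procedure above, counts could be compared at each stage. A rough sketch, using only the table and column names from the steps above; it assumes file_date is the column whose NULLs matter:

```sql
-- Rows in the staging table whose file_date is NULL (e.g. empty or unparsable lines).
SELECT COUNT(*) AS null_file_date_rows
FROM myTable_temp
WHERE file_date IS NULL;

-- Rows per partition in the final table. Rows whose partition_key evaluates to NULL
-- are placed in Hive's default partition (__HIVE_DEFAULT_PARTITION__).
SELECT partition_key, COUNT(*) AS row_count
FROM myTable
GROUP BY partition_key;
```

Running the first query after each load step shows whether the NULL rows come in with the first or the second file.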

Get actual target table insert count

I'm inserting data into hive external table in append mode. Every time I insert some records in a table, I want to get the count of actual records which are inserted into the hive external table. Is there any way I could find this information in any hive log file?
There is a possible workaround for this; I am not aware of a Hive property that provides it.
Add a timestamp column to your table.
Self-join the table on the timestamp column.
Count the most recently inserted records. A sample query:
SELECT count(1) from (
SELECT tbl_alias.* FROM test_table tbl_alias JOIN
( select max(timestamp_date) as max_timestamp_date FROM test_table) max_timestamp_date_table ON
tbl_alias.timestamp_date=max_timestamp_date_table.max_timestamp_date ) outer_table;

Revert backup table data to original table SQL

I have created a backup for my country table.
create table country_bkp as select * from country;
What SQL should I use to restore the country table to its original state?
I can do
insert into country select * from country_bkp;
but that would just create duplicate entries, and it would probably fail because the primary keys would be the same.
Is there an SQL command to merge data back?
Last alternative would be
DROP TABLE country;
create table country as select * from country_bkp;
but I want to avoid this as all the grants/permissions would get lost by this.
Other cleaner way would be
delete from country ;
insert into country select * from country_bkp;
But I am looking for more of a merge approach without having to clear data from original table.
Instead of dropping the table, which, as you noted, would lose all the permission definitions, you could truncate it to remove all the data and then insert-select the old data:
TRUNCATE TABLE country;
INSERT INTO country SELECT * FROM country_bkp;
In my case, INSERT INTO country SELECT * FROM country_bkp; didn't work because:
It wouldn't let me insert into the primary key column, because identity_insert is off by default.
My table had timestamp columns.
In that case:
Turn identity_insert on for the OriginalTable.
Write an insert query that lists all the columns of OriginalTable (excluding timestamp columns) and selects the same columns from BackupTable (excluding timestamp columns).
Turn identity_insert off for the OriginalTable at the end.
EXAMPLE:
Set Identity_insert OriginalTable ON
insert into OriginalTable (a,b,c,d,e, ....) --[Exclude TimeStamp Columns here]
Select a,b,c,d,e, .... from BackupTable --[Exclude TimeStamp Columns here]
Set Identity_insert OriginalTable Off
If Identity Insert is ON for the original table, one solution for recovering data from the backup table is to rename the original table to some other name and then rename the backup table to the original table's name.
For example:
Original Table - Invoice
Back Up Table - Invoice_back
Now Rename these tables :
Original Table - Invoice_xxx
Back Up Table - Invoice
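For the merge approach the question asks about, one option is to insert only the rows that are missing from the original table. A minimal sketch, assuming the primary key is a single column named country_id (a hypothetical name, the question does not give the key):

```sql
-- Assumption: country_id is the primary key of both tables (hypothetical name).
-- Inserts only the backup rows whose key is no longer present in country.
INSERT INTO country
SELECT b.*
FROM country_bkp b
WHERE NOT EXISTS (
    SELECT 1
    FROM country c
    WHERE c.country_id = b.country_id
);
```

This restores deleted rows only. Rows that still exist but were modified are left as they are, so a full revert would additionally need an UPDATE or the database's own MERGE statement.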

Best way of selecting 8k+ rows from a table

I have an Excel sheet that contains more than 8k IDs. I have a table in SQL Server that contains those IDs and related entries. What would be the best way to get those rows? The way I am doing it right now is to use the export data function for the specific table with the query:
select * from table_name where uID in (ALL 8K IDs)
Since this has to be done multiple times I suggest using bulk insert from the csv file to a temporary sql table and then use inner join with this table.
Assuming your csv file contains the ids in a single row, (i.e 1,34,345,....), something like this should do the trick:
-- create the temporary table
CREATE TABLE #CSVData
(
IdValue int
)
-- create a clustered index for this table (Note: this doesn't need to be unique)
CREATE CLUSTERED INDEX IX_CSVData on #CSVData (IdValue )
-- insert the csv data to the table
BULK INSERT #CSVData
FROM 'c:\csvData.txt'
WITH
(
ROWTERMINATOR = ','
)
-- select the data
SELECT T.*
FROM table_name T
INNER JOIN #CSVData ON(T.uId = IdValue)
-- cleanup (the index will be dropped with the table)
DROP TABLE #CSVData
One more link to look at is this article by Pinal Dave on sqlauthority.

Hive insert query like SQL

I am new to hive, and want to know if there is anyway to insert data into Hive table like we do in SQL. I want to insert my data into hive like
INSERT INTO tablename VALUES (value1,value2..)
I have read that you can load the data from a file to hive table or you can import data from one table to hive table but is there any way to append the data as in SQL?
Some of the answers here are out of date as of Hive 0.14
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DML#LanguageManualDML-InsertingvaluesintotablesfromSQL
It is now possible to insert using syntax such as:
CREATE TABLE students (name VARCHAR(64), age INT, gpa DECIMAL(3, 2));
INSERT INTO TABLE students
VALUES ('fred flintstone', 35, 1.28), ('barney rubble', 32, 2.32);
You can use the table generating function stack to insert literal values into a table.
First you need a dummy table which contains only one line. You can generate it with the help of limit.
CREATE TABLE one AS
SELECT 1 AS one
FROM any_table_in_your_database
LIMIT 1;
Now you can create a new table with literal values like this:
CREATE TABLE my_table AS
SELECT stack(3
, "row1", 1
, "row2", 2
, "row3", 3
) AS (column1, column2)
FROM one
;
The first argument of stack is the number of rows you are generating.
You can also add values to an existing table:
INSERT INTO TABLE my_table
SELECT stack(2
, "row4", 1
, "row5", 2
) AS (column1, column2)
FROM one
;
A slightly better version of unique2's suggestion is below:
insert overwrite table target_table
select * from
(
select stack(
3, -- generating new table with 3 records
'John', 80, -- record_1
'Bill', 61, -- record_2
'Martha', 101 -- record_3
)
) s;
This does not require the hack of using an already existing table.
You can use the approach below. With it, you don't need to create a temp table or a txt/csv file to select from or load.
INSERT INTO TABLE tablename SELECT value1,value2 FROM tempTable_with_atleast_one_records LIMIT 1;
Here tempTable_with_atleast_one_records is any table with at least one record.
The problem with this approach appears when you have an INSERT statement that inserts multiple rows, like the one below.
INSERT INTO yourTable values (1 , 'value1') , (2 , 'value2') , (3 , 'value3') ;
Then you need a separate Hive INSERT statement for each row. See below.
INSERT INTO TABLE yourTable SELECT 1 , 'value1' FROM tempTable_with_atleast_one_records LIMIT 1;
INSERT INTO TABLE yourTable SELECT 2 , 'value2' FROM tempTable_with_atleast_one_records LIMIT 1;
INSERT INTO TABLE yourTable SELECT 3 , 'value3' FROM tempTable_with_atleast_one_records LIMIT 1;
No. The INSERT INTO tablename VALUES (x,y,z) syntax is not supported in Hive versions before 0.14.
You could definitely append data into an existing table. (But it is actually not an append at the HDFS level). It's just that whenever you do a LOAD or INSERT operation on an existing Hive table without OVERWRITE clause the new data will be put without replacing the old data. A new file will be created for this newly inserted data inside the directory corresponding to that table. For example :
I have a file named demo.txt which has 2 lines :
ABC
XYZ
Create a table and load this file into it
hive> create table demo(foo string);
hive> load data inpath '/demo.txt' into table demo;
Now,if I do a SELECT on this table it'll give me :
hive> select * from demo;
OK
ABC
XYZ
Suppose, I have one more file named demo2.txt which has :
PQR
And I do a LOAD again on this table without using overwrite,
hive> load data inpath '/demo2.txt' into table demo;
Now, if I do a SELECT now, it'll give me,
hive> select * from demo;
OK
ABC
XYZ
PQR
HTH
Ways to insert data into Hive table:
for demonstration, I am using table name as table1 and table2
create table table2 as select * from table1 where 1=1;
or
create table table2 as select * from table1;
insert overwrite table table2 select * from table1;
--it will insert data from one to another. Note: It will refresh the target.
insert into table table2 select * from table1;
--it will insert data from one to another. Note: It will append into the target.
load data local inpath 'local_path' overwrite into table table1;
--it will load data from local into the target table and also refresh the target table.
load data inpath 'hdfs_path' overwrite into table table1;
--it will load data from hdfs location and also refresh the target table.
or
create table table2(
col1 string,
col2 string,
col3 string)
row format delimited fields terminated by ','
location 'hdfs_location';
load data local inpath 'local_path' into table table1;
--it will load data from local and also append into the target table.
load data inpath 'hdfs_path' into table table1;
--it will load data from hdfs location and also append into the target table.
insert into table2 values('aa','bb','cc');
--Lets say table2 have 3 columns only.
Multiple insertion into hive table
Yes, you can insert, but not in quite the same way as in SQL.
In SQL we can insert row-level data, but here you insert by fields (columns).
When doing this, make sure the target table and the query have the same data types and the same number of columns.
eg:
CREATE TABLE test(stu_name STRING,stu_id INT,stu_marks INT)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;
INSERT OVERWRITE TABLE test SELECT lang_name, lang_id, lang_legacy_id FROM export_table;
To insert entire data of table2 in table1. Below is a query:
INSERT INTO TABLE table1 SELECT * FROM table2;
In Hive versions before 0.14 you can't use INSERT INTO to insert a single record; it is not supported there. You can place all the new records that you want to insert in a file and load that file into a temp table in Hive. Then, using an insert overwrite..select command, insert those rows into a new partition of your main Hive table. The constraint here is that your main table has to be partitioned beforehand. If you don't use a partition, your whole table will be replaced with these new records.
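The file, temp table, and partition steps described above might look like this. It is only a sketch: every table name, column name, path, and partition value below is hypothetical.

```sql
-- All names and paths are hypothetical.
-- Staging table matching the layout of the file with the new records.
CREATE TABLE new_rows_staging (id INT, name STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

-- Load the file with the new records into the staging table.
LOAD DATA LOCAL INPATH '/tmp/new_rows.csv' OVERWRITE INTO TABLE new_rows_staging;

-- main_table is assumed to be (id INT, name STRING) PARTITIONED BY (load_date STRING).
-- Only the named partition is overwritten; the other partitions keep their data.
INSERT OVERWRITE TABLE main_table PARTITION (load_date = '2019-02-01')
SELECT id, name
FROM new_rows_staging;
```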
Enter the following command to insert data into the testlog table with some condition:
INSERT INTO TABLE testlog SELECT * FROM table1 WHERE some condition;
I think in such scenarios you should be using HBase, which facilitates this kind of insertion but does not provide a SQL-like query language. You need to use the HBase Java API, such as the put method, for this kind of insertion. Moreover, HBase is a column-oriented NoSQL database.
You can still insert into a complex type in Hive; it works
(id is int, colleagues is an array)
insert into emp (id,colleagues) select 11, array('Alex','Jian') from (select '1') t;
You can add values to specific columns as well; just specify the names of the columns to which you want to add the corresponding values:
Insert into Table (Col1, Col2, Col4,col5,Col7) Values ('Val1','Va2','Val4','Val5','Val7');
Make sure the columns you skip do not have a NOT NULL constraint.
A few properties need to be set for a Hive table to support ACID properties and to allow inserting values as in SQL.
Conditions for creating an ACID table in Hive:
The table should be stored as an ORC file. Only the ORC format supports ACID properties for now.
The table must be bucketed
Properties to set to create ACID table:
set hive.support.concurrency =true;
set hive.enforce.bucketing =true;
set hive.exec.dynamic.partition.mode =nonstrict;
set hive.compactor.initiator.on = true;
set hive.compactor.worker.threads= 1;
set hive.txn.manager = org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
Set the property hive.in.test to true in hive-site.xml.
After setting all these properties, the table should be created with the table property 'transactional'='true'. The table should be bucketed and stored as ORC.
CREATE TABLE table_name (col1 int,col2 string, col3 int) CLUSTERED BY (col1) INTO 4
BUCKETS STORED AS orc tblproperties('transactional' ='true');
Now it's possible to insert values into the table as in a SQL query.
INSERT INTO TABLE table_name VALUES (1,'a',100),(2,'b',200),(3,'c',300);
Yes we can use Insert query in Hive.
hive> create table test (id int, name string);
INSERT: INSERT...VALUES is available starting in version 0.14.
hive> insert into table test values (1,'mytest');
This is going to work for insert. We have to use values keyword.
Note: User cannot insert data into a complex datatype column (array, map, struct, union) using the INSERT INTO...VALUES clause.