Hive mismatched counts after table migration

I need to migrate 2 tables (table A and B) to a new cluster.
I applied the same query to both tables. Table A works fine, but Table B has mismatched counts: the new cluster ends up with more rows. After some investigation, I found the extra rows are all NULL, but I can't find the cause of this extra-count issue.
My procedure is as below:
Export Hive table
INSERT OVERWRITE LOCAL DIRECTORY
'/path/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0007' NULL DEFINED AS '' STORED AS TEXTFILE
SELECT * FROM export_table_name
WHERE file_date between '2021-01-01' and '2022-01-31'
LIMIT 2100000000;
*One difference between Table A and B: Table B is a lot bigger than A. When I exported Table B, I split it in half and exported it twice. The queries were WHERE date between '2021-01-01' and '2021-06-30' and WHERE date between '2021-07-01' and '2021-12-31'
SCP the exported files to the new cluster
Create table schema with
CREATE TABLE myTable_temp(
columns
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0007'
STORED AS TEXTFILE;
Import the files to the temp table (non-partitioned)
load data inpath 'myPath' overwrite into table myTable_temp;
*For Table B, I imported twice. The query for the second import omitted OVERWRITE so that it would append to the first batch: load data inpath 'myPath' into table myTable_temp;
Create table schema + one extra column "partition_key" for the actual table
Insert data from the temp table into the actual table (partitioned)
insert into table myTable partition(partition_key) select *, concat(year(file_date)) partition_key from myTable_temp;
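To narrow down where the extra rows appear, one option (a sketch only; the table names come from the procedure above, and col1 is a placeholder for a column that should never be NULL) is to compare the row counts at each stage and count the all-NULL rows in the staging table. Embedded newlines inside string columns are a common culprit when exporting to a delimited text file, since each embedded newline becomes an extra line that then parses as a NULL row.

-- Count on the source cluster
SELECT count(*) FROM export_table_name
WHERE file_date between '2021-01-01' and '2022-01-31';

-- Counts on the new cluster
SELECT count(*) FROM myTable_temp;
SELECT count(*) FROM myTable;

-- Rows that arrived as NULL in the staging table
-- (col1 is a placeholder: use any column that should never be NULL)
SELECT count(*) FROM myTable_temp WHERE col1 IS NULL;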

Related

Table to table insert w/o duplicates in hive

Table A is truncated and reloaded from a file every month, and table B is append-only.
So table A is a file-to-table load in Hive,
and table B is loaded by inserting/appending data from table A.
The issue is that table B is a straight SELECT from table A, so there is a chance the same data gets inserted twice.
How should I write the SELECT query that inserts data from Table A?
Both tables have file-date as a column.
A LEFT JOIN between A and B gives wrong counts in this insert,
and Hive is not working with my NOT EXISTS code.
The issue is:
Append-table script (partitioned by yearmonth):
Insert into table dist.t2
Select
Person_sk,
Np_id,
Yearmonth,
Insert_date,
File_date
From raw.ma
Data in table raw.ma (this table is truncate-and-reload):
File1 data: 201902
File2 data: 201903
File3 data: 201904
File4 data: if 201902 data gets loaded again, it should not duplicate the File1 data; it should either not get inserted or should overwrite that partition.
Here I need a filter or WHERE condition to append data into dist.t2. Can you please help with this?
I tried ALTER TABLE ... DROP PARTITION in Hive, but it fails in the Spark framework.
Please help with avoiding inserting duplicate entries.
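One common pattern (a sketch only, not tested against this schema; dist.t2, raw.ma and the column names are taken from the question, the column order must match your target table, and it assumes dynamic partitioning is allowed in your environment) is to overwrite only the partitions present in the staging table, so re-loading 201902 replaces that partition instead of appending duplicates:

set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

-- With dynamic partitions, INSERT OVERWRITE only replaces the partitions
-- that actually appear in the SELECT output; other partitions are untouched.
-- Note that the partition column (yearmonth) must come last in the SELECT list.
insert overwrite table dist.t2 partition (yearmonth)
select
  person_sk,
  np_id,
  insert_date,
  file_date,
  yearmonth
from raw.ma;

If overwriting is not acceptable, the usual append-only alternative is an anti-join: left join the distinct yearmonth values already in dist.t2 and keep only the rows where the joined value is NULL.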

Hive - Create Table statement with 'select query' and 'fields terminated by' commands

I want to create a table in Hive using a SELECT statement which takes a subset of the data from another table. I used the following query to do so:
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada';
When I looked into the HDFS location of this table, there were no field separators.
But I need to create a table with filtered data from another table along with a field separator. For example, I am trying to do something like:
create table sample_db.out_table as
select * from sample_db.in_table where country = 'Canada'
ROW FORMAT SERDE
FIELDS TERMINATED BY '|';
This is not working though. I know the alternate way is to create a table structure with field names and the "FIELDS TERMINATED BY '|'" command and then load the data.
But is there any other way to combine the two into a single query that enables me to create a table with filtered data from another table and also with a field separator ?
Put ROW FORMAT DELIMITED ... in front of AS SELECT.
Do it like this (change the query to match yours):
hive> CREATE TABLE ttt row format delimited fields terminated by '|' AS select *,count(1) from t1 group by id ,name ;
Query ID = root_20180702153737_37802c0e-525a-4b00-b8ec-9fac4a6d895b
Here is the result:
[root@hadoop1 ~]# hadoop fs -cat /user/hive/warehouse/ttt/**
2|\N|1
3|\N|1
4|\N|1
As you can see in the documentation, when using the CTAS (Create Table As Select) statement, the ROW FORMAT statement (in fact, all the settings related to the new table) goes before the SELECT statement.
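Applied to the tables from the question (sample_db.out_table and sample_db.in_table are the asker's names; this is a sketch along the same lines, not a tested statement), the CTAS would look like:

create table sample_db.out_table
row format delimited fields terminated by '|'
stored as textfile
as
select * from sample_db.in_table
where country = 'Canada';

The files under the new table's HDFS location should then contain '|' between fields.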

Drop Table Hive

I need to remove a few rows from a Parquet table (table_a) in Hive. If I create a new table (table_b) and insert into it:
Insert Overwrite table table_b
select * from table_a
Where (my conditions to exclude the right fields here)
Are both tables now using the same HDFS files? If I drop table_a with PURGE, will both tables' data disappear?
You can run describe formatted <table name> to check the HDFS path of a table.
To answer your question: if you have not specified a location while creating the tables, the HDFS path of table_a and the HDFS path of table_b will be different.
And if you drop table_a after loading the data into table_b, you will not lose the data in table_b.
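A quick way to confirm (a sketch; the table names are the ones from the question) is to compare the Location line reported for each table before dropping anything:

-- Look at the "Location:" line in each output; if the paths differ,
-- the tables own separate directories and dropping one leaves the other intact.
describe formatted table_a;
describe formatted table_b;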

Inserting system timestamp into a timestamp field in hive table

I am using Hive 0.8.0. I want to insert the system timestamp into a timestamp field while loading data into a Hive table.
In Detail:
I have a file with 2 fields like below:
id name
1 John
2 Merry
3 Sam
Now I want to load this file into a Hive table along with the extra column "created_date". So I created the Hive table with the extra field like below:
CREATE table mytable(id int,name string, created_date timestamp) row format delimited fields terminated by ',' stored as textfile;
To load the data file I used the query below:
LOAD DATA INPATH '/user/user/data/' INTO TABLE mytable;
If I run the above query, the "created_date" field will be NULL. But I want that field to be filled with the system timestamp instead of NULL while loading the data into the Hive table. Is this possible in Hive? How can I do it?
You can do this in two steps. First load data from the file into a temporary table without the timestamp. Then insert from the temp table into the actual table, and generate the timestamp with the unix_timestamp() UDF:
create table temptable(id int, name string)
row format delimited fields terminated by ','
stored as textfile;
create table mytable(id int, name string, created_date timestamp)
row format delimited fields terminated by ','
stored as textfile;
load data inpath '/user/user/data/' into table temptable;
insert into table mytable
select id, name, unix_timestamp()
from temptable;
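If the bare unix_timestamp() value (a bigint) does not convert cleanly into the timestamp column in your Hive version (an assumption; this has not been verified on 0.8.0), an explicit conversion through from_unixtime() is a variant of the same idea:

-- from_unixtime() formats the epoch seconds as 'yyyy-MM-dd HH:mm:ss',
-- which is then cast into the timestamp column explicitly
insert into table mytable
select id, name, cast(from_unixtime(unix_timestamp()) as timestamp)
from temptable;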

Insertion in a table stored as sequenceFile

I am loading data into one table, say abc, in Hive, stored as a sequence file. Then I wrote an insert statement which inserts data into another table, also stored as a sequence file. The data in the second table does not come out in sequence file format.
Updated queries:
load data local inpath '/../f.out' into table abc deliminated by '\034';
insert into table new_table partition(x = 2, y = 9) select * from abc where col like '%prq%';
P.S. Neither table is external.
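A common pattern (a sketch only; abc, new_table, the single col column, the '\034' separator, and the partition values are taken from, or assumed from, the question) is to keep the raw file in a text-format staging table and let the INSERT ... SELECT do the conversion, because it is the target table's STORED AS clause that decides which file format the insert writes:

-- Staging table matching the raw delimited file; LOAD DATA only moves the file
-- into the table directory and never changes its format, so this stays TEXTFILE.
create table abc (
  col string
)
row format delimited fields terminated by '\034'
stored as textfile;

load data local inpath '/../f.out' into table abc;

-- Target table declared STORED AS SEQUENCEFILE; the INSERT ... SELECT below
-- is the step that actually rewrites the data as sequence files.
create table new_table (
  col string
)
partitioned by (x int, y int)
stored as sequencefile;

insert into table new_table partition (x = 2, y = 9)
select col from abc where col like '%prq%';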