Hive: Add partition column data in subquery

I have two Hive tables which have the exact same schema, except for a date column. One of them has the date column, which is what it's partitioned by; the other has no date column and is not partitioned.
The two tables are:
staging (no date column, not partitioned)
main (date column present; partitioned by date)
I want to copy the data from staging to main. I am trying this query:
INSERT OVERWRITE TABLE main
PARTITION (dt='2019-04-30')
SELECT col_a,
col_b,
col_c,
col_d,
col_e,
'2019-04-30' FROM staging
Both staging and main have col_a, col_b, col_c, col_d and col_e; dt is the only field that main has and staging does not. But the query throws this error:
main requires that the data to be inserted have the same number of columns as the target table: target table has 6 column(s) but the inserted data has 7 column(s), including 1 partition column(s) having constant value(s).;'
Any idea how I can fix this?

Well, turns out all I had to do was this -
INSERT OVERWRITE TABLE main
PARTITION (dt='2019-04-30')
SELECT * FROM staging
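This works because a static partition spec like PARTITION (dt='2019-04-30') supplies the partition value itself, so the SELECT must list only the five data columns. The explicit-column equivalent (a sketch using the column names from the question) is:
INSERT OVERWRITE TABLE main
PARTITION (dt='2019-04-30')
SELECT col_a, col_b, col_c, col_d, col_e --no constant date here; the PARTITION clause provides dt
FROM staging;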

Look at this part of the error:
target table has 6 column(s) but the inserted data has 7 column(s),
including 1 partition column(s) having constant value(s).;'
It says your target table has 6 columns but the insert supplies 7. Check the target table with SHOW CREATE TABLE to confirm it has the correct number of columns. This often happens when a change to the table structure was not applied successfully.
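For example, a quick check (using the table names from the question):
SHOW CREATE TABLE main; --full DDL; partition columns appear under PARTITIONED BY
DESCRIBE main; --data columns, then the partition column(s) in a separate section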

Related

Hive mismatched counts after table migration

I need to migrate two tables (table A and table B) to a new cluster.
I applied the same procedure to both tables. Table A works fine, but table B has mismatched counts: there are more rows in the new cluster. After some investigation I found that the extra rows are NULL rows, but I can't find the cause of this extra-count issue.
My procedure is as below:
Export Hive table
INSERT OVERWRITE LOCAL DIRECTORY '/path/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\u0007'
NULL DEFINED AS '' STORED AS TEXTFILE
SELECT * FROM export_table_name
WHERE file_date BETWEEN '2021-01-01' AND '2022-01-31'
LIMIT 2100000000;
*One difference between table A and B: table B is much bigger than A, so I exported table B in two halves, using WHERE date BETWEEN '2021-01-01' AND '2021-06-30' and WHERE date BETWEEN '2021-07-01' AND '2021-12-31'.
SCP the exported files to the new cluster
Create table schema with
CREATE TABLE myTable_temp(
columns
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\u0007'
stored as textfile;
Import the files to the temp table (non-partitioned)
load data inpath 'myPath' overwrite into table myTable_temp;
*For table B, I imported twice. The query for the second import was load data inpath 'myPath' into table myTable_temp;
Create table schema + one extra column "partition_key" for the actual table
Inject data from the temp table to the actual table (partitioned)
insert into table myTable partition(partition_key) select *, concat(year(file_date)) partition_key from myTable_temp;
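Since the extra rows are NULL rows, a hedged diagnostic (assuming file_date is never legitimately NULL) is to count NULL rows at each stage: if they already exist in myTable_temp, the TEXTFILE export/import broke them (for example, '\u0007' or embedded newlines inside string values splitting rows apart); if they only appear in myTable, the partition insert is the culprit.
SELECT COUNT(*) FROM myTable_temp WHERE file_date IS NULL;
SELECT COUNT(*) FROM myTable WHERE file_date IS NULL;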

How to create partitions (year, month, day) in Hive from a date column in MM/dd/yyyy format

Data is loaded on a daily basis. I need to create a partition from the date column. Sample values:
Date
3/15/2021 8:02:32 AM
12/21/2020 12:20:41 PM
You need to convert the table into a partitioned table, then change the loading SQL so that it inserts into the table properly.
Create a new table identical to the original table, but exclude the partition column from the list of columns and add it under partitioned by, like below:
create table new_tab (
... --columns here, excluding the partition column
)
partitioned by (partition_dt string);
Load data into new_tab from the original table. Make sure the last column in your select clause is the partition column:
set hive.exec.dynamic.partition.mode=nonstrict;
insert into new_tab partition(partition_dt)
select src.*, from_unixtime(unix_timestamp(dttm_column, 'M/d/yyyy h:mm:ss a'), 'MM/dd/yyyy') as partition_dt --parse the source format explicitly; one-argument unix_timestamp only accepts 'yyyy-MM-dd HH:mm:ss'
from original_table src;
Drop the original table and rename new_tab to the original name.
drop table original_table;
alter table new_tab rename to original_table;
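If you need the separate year/month/day partitions the title asks for, the same approach works with three partition columns (a sketch; yr, mth and dy are hypothetical names, and new_tab would be declared with partitioned by (yr int, mth int, dy int)):
set hive.exec.dynamic.partition.mode=nonstrict;
insert into new_tab partition(yr, mth, dy)
select src.*,
       year(from_unixtime(unix_timestamp(dttm_column, 'M/d/yyyy h:mm:ss a'))) as yr,
       month(from_unixtime(unix_timestamp(dttm_column, 'M/d/yyyy h:mm:ss a'))) as mth,
       day(from_unixtime(unix_timestamp(dttm_column, 'M/d/yyyy h:mm:ss a'))) as dy
from original_table src;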

Table to table insert w/o duplicates in hive

I have table A as truncate and load for every month file and table B will be append
So table A will be file to table in hive
Table B will be tableA Insert and append data
Issue here is table B is straight move select stmt from table A , and chances are it can be inserted with duplicate/ same data
How should I write a select query to insert data from Table A
Both tables will have file-date as the column
Left join A and B is giving wrong counts in this insert tables
And hive is not working for not exists code
The issue is:
Append table script (partitioned by yearmonth):
Insert into table dist.t2
Select
Person_sk,
Np_id,
Yearmonth,
Insert_date,
File_date
From raw.ma
Data in table raw.ma (this is truncate-and-reload):
File 1 data: 201902
File 2 data: 201903
File 3 data: 201904
File 4 data: if 201902 data gets loaded to the table again, this should not duplicate the file 1 data; it should either not get inserted or should overwrite that partition.
Here I need a filter or WHERE condition to append data into dist.t2.
Can you please help with this?
I tried ALTER TABLE ... DROP PARTITION in Hive, but it fails in the Spark framework.
Please help with avoiding duplicate inserts; one possible approach is sketched below.
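One way to get the overwrite-that-partition behaviour described above, as a sketch assuming dist.t2 is partitioned by yearmonth: INSERT OVERWRITE with dynamic partitioning replaces only the partitions that appear in the SELECT, so reloading 201902 overwrites that partition instead of duplicating it.
set hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE dist.t2 PARTITION (yearmonth)
SELECT
Person_sk,
Np_id,
Insert_date,
File_date,
Yearmonth --dynamic partition column must come last
FROM raw.ma;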

Split Hive table on subtables by field value

I have a Hive table foo. There are several fields in this table, one of which is some_id. The number of unique values in this field is in the range 5,000-10,000. For each value (in this example, 10385) I need to perform CTAS queries like
CREATE TABLE bar_10385 AS
SELECT * FROM foo WHERE some_id=10385 AND other_id=10385;
What is the best way to perform this bunch of queries?
You can store all these tables in a single partitioned one. This approach allows you to load all the data in a single query, and query performance will not be compromised.
Create table T (
... --columns here
)
partitioned by (id int); --new calculated partition key
Load data using one query, it will read source table only once:
insert overwrite table T partition(id)
select ..., --columns
case when some_id=10385 AND other_id=10385 then 10385
when some_id=10386 AND other_id=10386 then 10386
...
--and so on
else 0 --default partition for records not attributed
end as id --partition column
from foo
where some_id in (10385,10386) AND other_id in (10385,10386) --filter
Then you can use this table in queries specifying the partition; partition pruning makes this fast:
select * from T where id = 10385;
You can also create a view named bar_10385 over this query, and it will act the same as your table.
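A minimal sketch of that view, using the name from the question:
CREATE VIEW bar_10385 AS
SELECT * FROM T WHERE id = 10385;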

How to insert the columns of an unpartitioned table into a partitioned table in Hive?

There is a table 'A' which is partitioned. Another table 'B' is not partitioned. How can I insert the values of B into A? Will an error be thrown?
Yes, you can insert from a non-partitioned table to a partitioned table. You will either have to define the partition you want to insert into or have Hive do it dynamically.
For example, to dynamically insert into partitions, you could run something similar to:
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT INTO TABLE A PARTITION (partition) SELECT col1, col2, ..., colN, partition FROM B WHERE .... ;
More information about Hive partitions with dynamic inserts can be found here: https://cwiki.apache.org/confluence/display/Hive/DynamicPartitions. Take note, the last column in your SELECT is what is used for the partition value. Another thing to note is that the number of non-partition columns needs to match between the two tables; otherwise you will have to fill in NULLs.
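The static-partition variant (hypothetical partition name dt; the value is fixed in the PARTITION clause, so the SELECT lists only the data columns):
INSERT INTO TABLE A PARTITION (dt='2019-04-30')
SELECT col1, col2, ..., colN FROM B;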