copy data from one table to another partitioning table - hive

%hive
INSERT INTO NEWPARTITIONING partition(year(L_SHIPDATE)) select * from LINEITEM;
I want to copy the data from line item to the partitioning table NEWPARTITIONING but I got the following error:
line 1:54 cannot recognize input near ')' 'select' '*' in statement.
Don't understand why this error occurs. Can anyone give me some ideas

Hive supports DYNAMIC or STATIC partition loading.
Partition specification allows only column name or column list (for dynamic partition load), if you need function, then calculate it in the select, see example below:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
insert into table NEWPARTITIONING partition (partition_column)
select i.col1,
...
i.colN,
year(L_SHIPDATE) as partition_column --Partition should be the last in column list
from LINEITEM i
Or you can specify static partition in the form partition(partition_column='value'), in this case you do not need to select partition expression:
insert into table NEWPARTITIONING partition (partition_column='2020-01-01')
select i.col1,
...
i.colN
from LINEITEM i
where year(L_SHIPDATE) = '2020-01-01'
In both cases - STATIC and DYNAMIC, Hive does not support functions in partition specification. Functions can be calculated in the query (dynamic load) or calculated in a wrapper shell and passed as a parameter to the script (for static partition).

Related

How to modify CTAS query to append query results to table based on if new partition doesn't exist? - Athena

I have a query that I want to execute daily that's to be partitioned by the date it's executed. The results of this query should be appended to a the same table.
My idea was ideally having something similar to the CREATE TABLE IF NOT EXISTS command for adding data by a new partition every day to the existing table if the partition doesn't already exist, but I can't figure out how I'd be able to integrate this in my query.
My query:
CREATE TABLE IF NOT EXISTS db_name.table_name
WITH (
external_location = 's3://my-query-results-location/',
format = 'PARQUET',
parquet_compression = 'SNAPPY',
partitioned_by = ARRAY['date_executed'])
AS
SELECT
{columns_that_I_am_selecting_here_including_'date_executed'}
What this does is create a new table for the first day it's executed but nothing happens for subsequent days, I'm assuming because of the CREATE TABLE IF NOT EXISTS validating that the table already exists and not proceeding with the logic.
Is there a way to modify my query to create a table for the first day executed and append the results by a new partition for each subsequent day?
I'm quite sure ALTER TABLE table_name ADD [IF NOT EXISTS] PARTITION would not apply to my use case here as I'm running a CTAS query.
You can simply use INSERT INTO existing_table SELECT....
Presumably your table is already partitioned, so include that partition column in the SELECT and Amazon Athena will automatically put the data in the correct directory.
For example, you might include hte column like this: SELECT ... CURRENT_DATE as date_executed
See: INSERT INTO - Amazon Athena

Insert into hive table with dynamic partition only writing first partition to disk and not all

I am trying to write data into a hive table and failing. I get a error at the end of Cycle_dt =null and only one partition being writing. It is the first day's.
set hive.auto.convert.join=true;
set hive.optimize.mapjoin.mapreduce=true;
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.dynamic.partition=true;
set mapred.map.tasks = 100;
Insert into table dynamic.dynamic_test_avro_v1 partition(cycle_dt)
Select date_time as CYCLE_TS, case when evar1 is not null or length(trim(evar1)) > 0 then cast(unbase64(substring(evar1,6,12)) as string) end NRNM ,
prop14 as state, evar8 as FLOW_TYPE, prop25 as KEY, pagename PAGE_NM,
partition_dt as cycle_dt from source.std_avro_v1 WHERE
(partition_dt = '2016-10-02' AND partition_dt < '2016-10-07')
AND (
evar8='google');
I am unsure what is going on here. I have a date range setup to only get those date's as partitions.
From hive documentation:
In the dynamic partition inserts, users can give partial partition specifications, which means just specifying the list of partition column names in the PARTITION clause. The column values are optional. If a partition column value is given, we call this a static partition, otherwise it is a dynamic partition. Each dynamic partition column has a corresponding input column from the select statement. This means that the dynamic partition creation is determined by the value of the input column. The dynamic partition columns must be specified last among the columns in the SELECT statement and in the same order in which they appear in the PARTITION() clause.
So, in your query, partition_dt is the value of the dynamic partition. However, you impose the following constraints: (partition_dt = '2016-10-02' AND partition_dt < '2016-10-07') which translate to partition_dt = '2016-10-02' and eventually it creates a single partition.
You probably wanted a range of dates: (partition_dt >= '2016-10-02' AND partition_dt < '2016-10-07')

SemanticException Partition spec {col=null} contains non-partition columns

I am trying to create dynamic partitions in hive using following code.
SET hive.exec.dynamic.partition = true;
SET hive.exec.dynamic.partition.mode = nonstrict;
create external table if not exists report_ipsummary_hourwise(
ip_address string,imp_date string,imp_hour bigint,geo_country string)
PARTITIONED BY (imp_date_P string,imp_hour_P string,geo_coutry_P string)
row format delimited
fields terminated by '\t'
stored as textfile
location 's3://abc';
insert overwrite table report_ipsummary_hourwise PARTITION (imp_date_P,imp_hour_P,geo_country_P)
SELECT ip_address,imp_date,imp_hour,geo_country,
imp_date as imp_date_P,
imp_hour as imp_hour_P,
geo_country as geo_country_P
FROM report_ipsummary_hourwise_Temp;
Where report_ipsummary_hourwise_Temp table contains following columns,
ip_address,imp_date,imp_hour,geo_country.
I am getting this error
SemanticException Partition spec {imp_hour_p=null, imp_date_p=null,
geo_country_p=null} contains non-partition columns.
Can anybody suggest why this error is coming ?
You insert sql have the geo_country_P column but the target table column name is geo_coutry_P. miss a n in country
I was facing the same error. It's because of the extra characters present in the file.
Best solution is to remove all the blank characters and reinsert if you want.
It could also be https://issues.apache.org/jira/browse/HIVE-14032
INSERT OVERWRITE command failed with case sensitive partition key names
There is a bug in Hive which makes partition column names case-sensitive.
For me fix was that both column name has to be lower-case in the table
and PARTITION BY clause's in table definition has to be lower-case. (they can be both upper-case too; due to this Hive bug HIVE-14032 the case just has to match)
It says while copying the file from result to hdfs jobs could not recognize the partition location. What i can suspect you have table with partition (imp_date_P,imp_hour_P,geo_country_P) whereas job is trying to copy on imp_hour_p=null, imp_date_p=null, geo_country_p=null which doesn't match..try to check hdfs location...the other point what i can suggest not to duplicate column name and partition twice
insert overwrite table report_ipsummary_hourwise PARTITION (imp_date_P,imp_hour_P,geo_country_P)
SELECT ip_address,imp_date,imp_hour,geo_country,
imp_date as imp_date_P,
imp_hour as imp_hour_P,
geo_country as geo_country_P
FROM report_ipsummary_hourwise_Temp;
The highlighted fields should be the same name available in the report_ipsummary_hourwise file

How to insert generated id into a results table

I have the following query
SELECT q.pol_id
FROM quot q
,fgn_clm_hist fch
WHERE q.quot_id = fch.quot_id
UNION
SELECT q.pol_id
FROM tdb2wccu.quot q
WHERE q.nr_prr_ls_yr_cov IS NOT NULL
For every row in that result set, I want to create a new row in another table (call it table1) and update pol_id in the quot table (from the above result set) with the generated primary key from the inserted row in table1.
table1 has two columns. id and timestamp.
I'm using db2 10.1.
I've tried numerous things and have been unsuccessful for quite a while. Thanks!
Simple solution: create a new table for the result set of your query, which has an identity column in it. Then, after running your query, update the pol_id field with the newly generated ID in your result table.
Alteratively, you can do it more manually by using the the ROW_NUMBER() OLAP function, which I often found convenient for creating IDs. For this it is convenient to use a stored procedure which does the following:
get the maximum old id from Table1 and write it into a variable old_max_id.
after generating the result set, write the row-numbers into the table1, maybe by something like
INSERT INTO TABLE1
SELECT ROW_NUMBER() OVER (PARTITION BY <primary-key> ORDER BY <whatever-you-want>)
+ OLD_MAX_ID
, CURRENT TIMESTAMP
FROM (<here comes your SQL query>)
Either write the result set into a table or return a cursor to it. Here you should either use the same ROW_NUMBER statement as above or directly use the ID from Table1.

Hive - Partition Column Equal to Current Date

I am trying to insert into a Hive table from another table that does not have a column for todays date. The partition I am trying to create is at the date level. What I am trying to do is something like this:
INSERT OVERWRITE TABLE table_2_partition
PARTITION (p_date = from_unixtime(unix_timestamp() - (86400*2) , 'yyyy-MM-dd'))
SELECT * FROM table_1;
But when I run this I get the following error:
"cannot recognize input near 'from_unixtime' '(' 'unix_timestamp' in constant"
If I query a table and make one of the columns that it work just fine. Any idea how to set the partition date to current system date in HiveQL?
Thanks in advance,
Craig
What you want here is Hive dynamic partitioning. This allows the decision for which partition each record is inserted into be determined dynamically as the record is selected. In your case, that decision is based on the date when you run the query.
To use dynamic partitioning your partition clause has the partition field(s) but not the value. The value(s) that maps to the partition field(s) is the value(s) at the end of the SELECT, and in the same order.
When you use dynamic partitions for all partition fields you need to ensure that you are using nonstrict for your dynamic partition mode (hive.exec.dynamic.partition.mode).
In your case, your query would look something like:
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE table_2_partition
PARTITION (p_date)
SELECT
*
, from_unixtime(unix_timestamp() - (86400*2) , 'yyyy-MM-dd')
FROM table_1;
Instead of using unix_timestamp() and from_unixtime() functions, current_date() can used to get current date in 'yyyy-MM-dd' format.
current_date() is added in hive 1.2.0. official documentation
revised query will be :
SET hive.exec.dynamic.partition.mode=nonstrict;
INSERT OVERWRITE TABLE table_2_partition
PARTITION (p_date)
SELECT
*
, current_date()
FROM table_1;
I hope you are running a shell script and then you can store the current date in a variable. Then you create an empty table in Hive using beeline with just partition column. Once done then while inserting the data into partition table you can add that variable as your partition column and data will be inserted.