Hive: Creating an external table: Show partitions does nothing - hive

I am creating an external table in my own database:
create external table test1 (
event_uid string,
event_type_id int,
event_category_id int,
event_date string,
log_date string,
server string,
server_type string,
process_id int,
device_id string,
account_id string,
ip_address string,
category_id string,
content_id string,
entitlement_id string,
product_id string,
sku string,
title_id string,
service_id string,
order_id bigint,
transaction_id bigint,
company_code string,
product_code string,
key_value_pairs map<string,string>,
process_run_id string)
partitioned by (A string, B string, C string)
location '/data/a1/pnt/lte/formatted/evt';
When I run SHOW PARTITIONS TEST, I just get OK as output.
However, a table with the same DDL and the same location in another database returns results for SHOW PARTITIONS TEST. I have also tried MSCK REPAIR TABLE TEST, which displays partitions.
Please advise.

Creating a partitioned table does not create any partitions: the DDL only defines the partition columns. Partitions are created when you load data into the table (or register existing directories), so you need to do that first, and only then will SHOW PARTITIONS list them.
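As a minimal sketch of that sequence, assuming a staging directory /tmp/evt_20160719 already exists on HDFS (a hypothetical path, not from the question):

```sql
-- Right after CREATE, the metastore holds no partitions for test1,
-- so this prints only OK.
SHOW PARTITIONS test1;

-- Loading into a named partition registers it in the metastore.
-- LOAD DATA INPATH moves the files under the table's partition directory.
-- '/tmp/evt_20160719' is a hypothetical source path.
LOAD DATA INPATH '/tmp/evt_20160719'
INTO TABLE test1 PARTITION (A='2016', B='07', C='19');

-- The new partition should now be listed.
SHOW PARTITIONS test1;
```

For an external table whose files are already in place, registering the directory (rather than moving files with LOAD DATA) is usually the better fit.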

When we create an EXTERNAL TABLE with partitions, we have to ALTER the table to register the data location for each partition. That location need not be under the path we specified when creating the EXTERNAL TABLE. Because A, B and C are string columns, the partition values are quoted.
hive> ALTER TABLE test1 ADD PARTITION (A='2016', B='07', C='19')
    > LOCATION '/data/a1/pnt/lte/formatted/evt/somedatafor_20160719';
When we specify LOCATION '/data/a1/pnt/lte/formatted/evt' (though it is optional) while creating an EXTERNAL TABLE, we can take advantage of repair operations on that table. When some process such as an ETL job copies files into that directory, we can sync the new partitions into the EXTERNAL TABLE instead of writing an ALTER TABLE statement for each new partition.
If we already know the directory structure that Hive would create for the next data set (say, C=20 here), we can simply place the data file in that location, such as '/data/a1/pnt/lte/formatted/evt/A=2016/B=07/C=20/data.txt', and run the statement shown below:
hive> MSCK REPAIR TABLE test1;
The above statement syncs the partition into the Hive metastore for the table "test1".
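Applied to the original question, a minimal sketch might look like the following. It assumes the new table lives in a database called mydb (a hypothetical name) and uses test1, the name in the DDL. Partition metadata is stored per table in the metastore, so a table in another database that shares the same LOCATION does not share its partition list.

```sql
USE mydb;  -- hypothetical database name

-- Scan the table LOCATION for directories laid out as A=.../B=.../C=...
-- and register any that the metastore does not know about yet.
MSCK REPAIR TABLE test1;

-- Directories that do not follow the col=value layout are not picked up
-- by MSCK REPAIR and still need an explicit ALTER TABLE ... ADD PARTITION.
SHOW PARTITIONS test1;
```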

Related

Athena returns blank response for Partitioned data, what am I missing?

I have created a partitioned table. I tried the two versions below against my S3 bucket folders, but either way I get no records when I query with a WHERE clause on the partition columns.
My S3 bucket looks like the following. part*.csv is what I want to query in Athena. There are other folders at the same location alongside output, and within output.
s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/part-filename121231.csv
s3://bucket-rootname/XYZ-CASE/report/678d1234-2c3a-481b-a1eb-5169d2a97747/output/part-filename213123.csv
My table looks like the following.
Version 1:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`id` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1');
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",id="f78dea49-2c3a-481b-a1eb-5169d2a97747") location "s3://bucket-rootname/casename=ABC-CASE/report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
select * from mytable_trial1 where casename='ABC-CASE' and report='report' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747' and foldername='output';
Version 2:
CREATE EXTERNAL TABLE `mytable_trial1`(
`status` string,
`ref` string)
PARTITIONED BY (
`casename` string,
`report` string,
`id` string,
`foldername` string)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION
's3://bucket-rootname/'
TBLPROPERTIES (
'has_encrypted_data'='false',
'skip.header.line.count'='1');
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/casename=ABC-CASE/report=report/id=f78dea49-2c3a-481b-a1eb-5169d2a97747/foldername=output/";
select * from mytable_trial1 where casename='ABC-CASE' and id='f78dea49-2c3a-481b-a1eb-5169d2a97747'
SHOW PARTITIONS lists this partition, but queries with the WHERE clause return no records.
I worked with AWS Support and we were able to narrow down the issue. Version 2 was the right one to use, since it has four partition columns, matching the four folder levels in my S3 bucket. Also, the ALTER TABLE command had a problem with its location: I used a Hive-format location (key=value folder names), which was incorrect because my actual S3 layout is not in Hive format. Correcting the command to the following worked for me.
ALTER TABLE mytable_trial1 add partition (casename="ABC-CASE",report="report",id="f78dea49-2c3a-481b-a1eb-5169d2a97747",foldername="output") location "s3://bucket-rootname/ABC-CASE/report/f78dea49-2c3a-481b-a1eb-5169d2a97747/output/";
Preview table now shows my entries.
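The second folder from the bucket listing could presumably be registered the same way. This is a sketch that reuses only the paths shown in the question; since the layout is not key=value, MSCK REPAIR TABLE would not discover these folders, so each one needs its own ADD PARTITION.

```sql
-- Register the XYZ-CASE folder, pointing at its real (non key=value) prefix.
ALTER TABLE mytable_trial1 ADD IF NOT EXISTS
  PARTITION (casename='XYZ-CASE', report='report',
             id='678d1234-2c3a-481b-a1eb-5169d2a97747', foldername='output')
  LOCATION 's3://bucket-rootname/XYZ-CASE/report/678d1234-2c3a-481b-a1eb-5169d2a97747/output/';

-- Filter on partition columns; status and ref are the data columns.
SELECT status, ref
FROM mytable_trial1
WHERE casename = 'XYZ-CASE'
  AND id = '678d1234-2c3a-481b-a1eb-5169d2a97747';
```

In the Athena console these would be run as two separate statements.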

Create partitions using athena alter table statement

This "create table" statement is working correctly.
CREATE EXTERNAL TABLE default.no_details_2018_csv (
`id` string,
`client_id` string,
`client_id2` string,
`id_1` string,
`id_2` string,
`client_id3` string,
`code_1` string,
`code_2` string,
`code_3` string
)
STORED AS PARQUET
LOCATION 's3://some_bucket/athena-parquet/no_details/2018/'
tblproperties ("parquet.compress"="SNAPPY");
The data for the year 2018, in Parquet format, can be found in that bucket/folder.
1) How do I add partitions to this table? I need to add the 2019 data to the same table by referring to the new location s3://some_bucket/athena-parquet/no_details/2019/. The data for both years is available in Parquet (Snappy) format.
2) Is it possible to partition by month instead of by year? In other words, is it OK to have 24 partitions instead of 2? Will the new target table also be in Parquet format, just like the source data? The code_2 column mentioned above looks like "20181013133839". I need to use the first 4 characters for yearly (or 6 for monthly) partitions.
First, the table needs to be created as an EXTERNAL TABLE with a PARTITIONED BY clause.
Sample:
CREATE EXTERNAL TABLE default.no_details_table (
`id` string,
`client_id` string,
`client_id2` string,
`id_1` string,
`id_2` string,
`client_id3` string,
`code_1` string,
`code_2` string,
`code_3` string
)
PARTITIONED BY (year string)
STORED AS PARQUET
LOCATION 's3://some_bucket/athena-parquet/no_details/'
tblproperties ("parquet.compress"="SNAPPY");
You can add a partition as follows:
ALTER TABLE default.no_details_table ADD PARTITION (year='2018') LOCATION 's3://some_bucket/athena-parquet/no_details/2018/';
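The 2019 folder from the question can be attached the same way. This is a sketch that assumes the partitioned table above has already been created:

```sql
-- Attach the 2019 folder as a second partition of the same table.
ALTER TABLE default.no_details_table ADD IF NOT EXISTS
  PARTITION (year='2019')
  LOCATION 's3://some_bucket/athena-parquet/no_details/2019/';

-- Both years are now queryable; filtering on year reads only that folder.
SELECT count(*) FROM default.no_details_table WHERE year = '2019';
```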
If you want finer-grained partitions, one for each month or day, create the table with
PARTITIONED BY (day string)
But then you need to put each day's data under its own path, for example:
s3://some_bucket/athena-parquet/no_details/20181013/
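If moving files by hand is not practical, one alternative (not part of the answer above) is an Athena CTAS query that rewrites the data into monthly partitions derived from code_2. The following is a sketch under assumptions: the target table name, the yyyymm column name and the target S3 prefix are all hypothetical, and the target prefix must be empty before the query runs.

```sql
-- Hypothetical target table and S3 prefix; the output stays in Parquet.
CREATE TABLE default.no_details_by_month
WITH (
  format = 'PARQUET',
  external_location = 's3://some_bucket/athena-parquet/no_details_by_month/',
  partitioned_by = ARRAY['yyyymm']
) AS
SELECT id, client_id, client_id2, id_1, id_2, client_id3,
       code_1, code_2, code_3,
       -- Partition columns must come last in the SELECT list.
       substr(code_2, 1, 6) AS yyyymm   -- '201810' for '20181013133839'
FROM default.no_details_table;
```

A single CTAS query can write at most 100 partitions, so 24 monthly partitions fit in one run. Using substr(code_2, 1, 4) instead would give yearly partitions.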

Hive insert overwrite and Insert into are very slow with S3 external table

I am using AWS EMR. I have created external tables pointing to S3 location.
The "INSERT INTO TABLE" and "INSERT OVERWRITE" statements are very slow when the destination is an external table pointing to S3. The main issue is that Hive first writes data to a staging directory and then moves the data to the final location.
Does anyone have a better solution for this? Using S3 is really slowing down our jobs.
Cloudera recommends using the setting hive.mv.files.thread, but it looks like that setting is not available in the Hive provided in EMR or in Apache Hive.
OK, I am trying to provide more details.
Below is my source table structure:
CREATE EXTERNAL TABLE ORDERS (
O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE DOUBLE,
O_ORDERDATE DATE,
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY '|'
STORED AS TEXTFILE
LOCATION 's3://raw-tpch/orders/';
Below is the structure of the destination table (named orders_parq, as used in the insert statement):
CREATE EXTERNAL TABLE ORDERS_PARQ (
O_ORDERKEY INT,
O_CUSTKEY INT,
O_ORDERSTATUS STRING,
O_TOTALPRICE decimal(12,2),
O_ORDERPRIORITY STRING,
O_CLERK STRING,
O_SHIPPRIORITY INT,
O_COMMENT STRING)
partitioned by (O_ORDERDATE string)
STORED AS PARQUET
LOCATION 's3://parquet-tpch/orders/';
The source table contains orders data for 2400 days. The table size is 100 GB, so the destination table is expected to have 2400 partitions. I executed the insert statement below.
set hive.exec.dynamic.partition.mode=nonstrict;
set hive.exec.reducers.bytes.per.reducer=500000000;
set hive.optimize.sort.dynamic.partition=true;
set hive.exec.max.dynamic.partitions=10000;
set hive.exec.max.dynamic.partitions.pernode=2000;
set hive.load.dynamic.partitions.thread=20;
set hive.mv.files.thread=25;
set hive.blobstore.optimizations.enabled=false;
set parquet.compression=snappy;
INSERT into TABLE orders_parq partition(O_ORDERDATE)
SELECT O_ORDERKEY, O_CUSTKEY,
O_ORDERSTATUS, O_TOTALPRICE,
O_ORDERPRIORITY, O_CLERK,
O_SHIPPRIORITY, O_COMMENT,
O_ORDERDATE from orders;
The query completes its map and reduce phases in 10 minutes but takes a long time to move data from /tmp/hive/hadoop/b0eac2bb-7151-4e29-9640-3e7c15115b60/hive_2018-02-15_15-02-32_051_5904274475440081364-1/-mr-10001 to the destination S3 path.
With "set hive.blobstore.optimizations.enabled=false;" in effect,
it takes time to move data from the Hive staging directory to the destination table directory.
Surprisingly, I found one more issue: even though I set the compression to Snappy, the output table size is 108 GB, which is more than the raw input text file at 100 GB.

Filter Dynamic Partitioning in Apache Hive

I am trying to create a Hive table, but because of the folder structure it is going to take hours just to recover the partitions.
Below is an example of what I am currently using to create the table, but it would be really helpful if I could filter the partitioning.
In the example below I need every child_company, just one year, every month, and just one type of report.
Is there any way to do something like set hcat.dynamic.partitioning.custom.pattern = '${child_company}/year=${2016}/${month}/report=${inventory}'; when partitioning, to avoid the need to read through all folders (> 300k)?
Language: Hive
Version: 1.2
Interface: Qubole
use my_database;
set hcat.dynamic.partitioning.custom.pattern = '${child_company}/${year}/${month}/${report}';
drop table if exists table_1;
create external table table_1
(
Date_Date string,
Product string,
Quantity int,
Cost int
)
partitioned by
(
child_company string,
year int,
month int,
report string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
location 's3://mycompany-myreports/parent/partner_company-12345';
alter table table_1 recover partitions;
show partitions table_1;
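One possible way to avoid scanning every folder, offered as a sketch rather than a confirmed fix, is to skip the recover step and register only the needed partitions explicitly. The child_company value 'acme' and the month folder names (1, 2, ...) are hypothetical, and the locations assume the ${child_company}/${year}/${month}/${report} layout from the pattern above.

```sql
use my_database;

-- Register only year 2016 and report 'inventory'.
-- 'acme' and the month folder names are hypothetical placeholders.
alter table table_1 add if not exists
  partition (child_company='acme', year=2016, month=1, report='inventory')
    location 's3://mycompany-myreports/parent/partner_company-12345/acme/2016/1/inventory'
  partition (child_company='acme', year=2016, month=2, report='inventory')
    location 's3://mycompany-myreports/parent/partner_company-12345/acme/2016/2/inventory';

show partitions table_1;
```

The full list of partition clauses would typically be generated from a listing of the child_company folders rather than written by hand.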

Hive Partitioning Ineffective while using Hive Functions in Queries

I've been working on Hive partitioning for the past few days. Here is an example I created:
Table - Transactions (Non-Partitioned Managed Table):
CREATE TABLE TRANSACTIONS (
txnno INT,
txndate STRING,
custid INT,
amount DOUBLE,
category STRING,
product STRING,
city STRING,
state STRING,
spendby STRING)
row format delimited fields terminated by ',';
I loaded the data into this table using the LOAD command.
I then created another table as follows:
Table - Txnrecordsbycat (Partitioned Managed Table):
CREATE TABLE TXNRECORDSBYCAT(txnno INT, txndate STRING, custno INT, amount DOUBLE, product STRING, city STRING, state STRING, spendby STRING)
partitioned by (category STRING)
clustered by (state) INTO 10 buckets
row format delimited fields terminated by ',';
I used the following query to load the data from the Transactions table into the Txnrecordsbycat table.
FROM TRANSACTIONS txn INSERT OVERWRITE TABLE TXNRECORDSBYCAT PARTITION(category) select txn.txnno,txn.txndate,txn.custid, txn.amount, txn.product,txn.city, txn.state, txn.spendby,txn.category DISTRIBUTE BY CATEGORY;
Now, as long as I'm firing simple queries like select * from Transactions and select * from Txnrecordsbycat, my queries are more efficient (i.e. take less execution time) on the partitioned table than on the non-partitioned table.
However, as soon as my queries become a little more complex, something like select count(*) from table, the query becomes less efficient (i.e. takes more time) on the partitioned table.
Any idea why this might be happening?
Many thanks for your help.
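One thing that may be worth checking, offered as a sketch rather than a diagnosis: partitioning generally lets Hive skip data only when the query filters on the partition column. A count(*) with no predicate typically still reads every partition directory and every bucket file inside it. The category value 'Gymnastics' below is a hypothetical placeholder.

```sql
-- No partition predicate: all category partitions are candidates for reading.
SELECT count(*) FROM txnrecordsbycat;

-- Predicate on the partition column: only the matching partition is read.
SELECT count(*) FROM txnrecordsbycat WHERE category = 'Gymnastics';

-- Show which tables and partitions the query depends on.
EXPLAIN DEPENDENCY
SELECT count(*) FROM txnrecordsbycat WHERE category = 'Gymnastics';
```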