I've been working on Hive partitioning for the past few days. Here is an example I created:
Table - Transactions (Non-Partitioned Managed Table):
CREATE TABLE TRANSACTIONS (
txnno INT,
txndate STRING,
custid INT,
amount DOUBLE,
category STRING,
product STRING,
city STRING,
state STRING,
spendby STRING)
row format delimited fields terminated by ',';
Loaded the data into this table using the LOAD command.
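For reference, a minimal sketch of such a load, assuming the data sits in a local CSV file (the path is hypothetical):
LOAD DATA LOCAL INPATH '/home/user/transactions.csv' INTO TABLE TRANSACTIONS; -- hypothetical local path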
Created another table as follows:
Table - Txnrecordsbycat (Partitioned Managed Table):
CREATE TABLE TXNRECORDSBYCAT(txnno INT, txndate STRING, custno INT, amount DOUBLE, product STRING, city STRING, state STRING, spendby STRING)
partitioned by (category STRING)
clustered by (state) INTO 10 buckets
row format delimited fields terminated by ',';
Used the following query to load the data from the Transactions table into the Txnrecordsbycat table.
FROM TRANSACTIONS txn
INSERT OVERWRITE TABLE TXNRECORDSBYCAT PARTITION(category)
SELECT txn.txnno, txn.txndate, txn.custid, txn.amount, txn.product, txn.city, txn.state, txn.spendby, txn.category
DISTRIBUTE BY category;
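A dynamic-partition insert into a partitioned, bucketed table like this typically needs a few session settings first; a sketch, assuming defaults (property availability varies by Hive version):
SET hive.exec.dynamic.partition = true;           -- allow dynamic partitions
SET hive.exec.dynamic.partition.mode = nonstrict; -- category is purely dynamic here
SET hive.enforce.bucketing = true;                -- honor CLUSTERED BY ... INTO 10 BUCKETS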
Now, as long as I'm firing simple queries like select * from TRANSACTIONS and select * from TXNRECORDSBYCAT, I can see my queries being efficient (i.e. taking less execution time) on the partitioned table as compared to the non-partitioned table.
However, as soon as my queries become a little complex, something like select count(*) from table, the query becomes less efficient (i.e. takes more time) on the partitioned table.
Any idea why this might be happening?
Many thanks for the help.
Related
I have an external Hive table as follows:
CREATE external TABLE sales (
ItemNbr STRING,
itemShippedQty INT,
itemDeptNbr SMALLINT,
gateOutUserId STRING,
code VARCHAR(3),
trackingId STRING,
baseDivCode STRING
)
PARTITIONED BY (countryCode STRING, sourceNbr INT, date STRING)
STORED AS PARQUET
LOCATION '/user/sales/';
where the table is partitioned by 3 columns (countryCode, sourceNbr, date). I know that if I query based on these 3 partition columns, my query will be faster.
I have some questions about other query patterns:
If I add a non-partition column along with the partition columns (countryCode, sourceNbr, date, ItemNbr) in the WHERE condition, will the query scan the full table, or will it scan only inside the folders selected by countryCode, sourceNbr and date and look for the itemNbr value specified in the WHERE condition?
Is giving all partition columns necessary to filter the records, or does a sub-filter also work, e.g. if I give only the first 2 columns (countryCode, sourceNbr) in the WHERE condition? In that case, would it scan the full table, or would it search only inside the folders matching the 2-column condition (countryCode, sourceNbr)?
Partition pruning works in all your cases, no matter whether all partition columns are in the WHERE clause or only some of them; other filters do not affect partition pruning.
To check it, use the EXPLAIN EXTENDED command; see https://stackoverflow.com/a/50859735/2700344
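For example, a sketch against the sales table above (the literal filter values are hypothetical):
EXPLAIN EXTENDED
SELECT * FROM sales
WHERE countryCode = 'US' AND sourceNbr = 42 AND ItemNbr = '123';
The partition/path section of the plan should list only the matching countryCode=US/sourceNbr=42 directories, not the whole table, even though date is unfiltered and ItemNbr is a non-partition column.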
I have a bunch of gzipped files in HDFS under directories of the form /home/myuser/salesdata/some_date/ALL/<country>.gz, for instance /home/myuser/salesdata/20180925/ALL/us.gz
The data is of the form
<country> \t count1,count2,count3
So essentially it's tab-separated first, and then I need to extract the comma-separated values into separate columns.
I'd like to create an external table, partitioning this by country, year, month and day. The size of the data is pretty huge, potentially 100s of TB and so I'd like to have an external table itself, rather than having to duplicate the data by importing it into a standard table.
Is it possible to achieve this by using only an external table?
Considering your country is separated by a tab '\t' and the other fields are separated by ',', this is what you can do.
You can create a temporary table which has the first column as a string and the rest as an array.
create external table temp.test_csv (country string, count array<int>)
row format delimited
fields terminated by "\t"
collection items terminated by ','
stored as textfile
location '/apps/temp/table';
Now if you drop your files into the /apps/temp/table location, you should be able to select the data as shown below.
select country, count[0] as count_1, count[1] as count_2, count[2] as count_3 from temp.test_csv;
Now, to create partitions, create another table, as shown below.
drop table temp.test_csv_orc;
create table temp.test_csv_orc ( count_1 int, count_2 int, count_3 int)
partitioned by(year string, month string, day string, country string)
stored as orc;
And load the data from temporary table into this one.
insert into temp.test_csv_orc partition(year="2018", month="09", day="28", country)
select count[0] as count_1, count[1] count_2, count[2] count_3, country from temp.test_csv
I have taken country as a dynamic partition since it comes from the file; the others don't, so they are static.
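To verify the load, a quick check against the static partition values used above:
show partitions temp.test_csv_orc;
select * from temp.test_csv_orc where year='2018' and month='09' and day='28';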
Trying to create a Hive table, but due to the folder structure it's going to take hours just to partition.
Below is an example of what I am currently using to create the table, but it would be really helpful if I could filter the partitioning.
In the below I need every child_company, just one year, every month, and just one type of report.
Is there any way to do something like set hcat.dynamic.partitioning.custom.pattern = '${child_company}/year=${2016}/${month}/report=${inventory}'; when partitioning, to avoid the need to read through all the folders (> 300k)?
Language: Hive
Version: 1.2
Interface: Qubole
use my_database;
set hcat.dynamic.partitioning.custom.pattern = '${child_company}/${year}/${month}/${report}';
drop table if exists table_1;
create external table table_1
(
Date_Date string,
Product string,
Quantity int,
Cost int
)
partitioned by
(
child_company string,
year int,
month int,
report string
)
row format delimited fields terminated by '\t'
lines terminated by '\n'
location 's3://mycompany-myreports/parent/partner_company-12345';
alter table table_1 recover partitions;
show partitions table_1;
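If only a fixed slice of the data is needed, one possible workaround (not using hcat.dynamic.partitioning.custom.pattern) is to add just the required partitions by hand instead of recovering all 300k folders; a sketch where the child_company and report values are hypothetical placeholders:
alter table table_1 add if not exists
partition (child_company='some_company', year=2016, month=1, report='inventory')
location 's3://mycompany-myreports/parent/partner_company-12345/some_company/2016/1/inventory';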
Create Statement:
CREATE EXTERNAL TABLE tab1(usr string)
PARTITIONED BY (year string, month string, day string, hour string, min string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
LOCATION '/tmp/hive1';
Data:
select * from tab1;
jhon,2017,2,20,10,11
jhon,2017,2,20,10,12
jhon,2017,2,20,10,13
Now I need to alter the tab1 table to have only 3 partition columns (year string, month string, day string) without manually copying/modifying files. I have thousands of files, so I should alter only the table definition without touching the files?
Please let me know how to do this.
If this is something that you will do one time, I would suggest creating a new table with the expected partitions and inserting into it from the old table using dynamic partitioning (a sketch follows after the properties below). This will also avoid keeping small files in your partitions. The other option is to create a new table pointing to the old location with the expected partitions and use the following properties:
TBLPROPERTIES ("hive.input.dir.recursive" = "TRUE",
"hive.mapred.supports.subdirectories" = "TRUE",
"hive.supports.subdirectories" = "TRUE",
"mapred.input.dir.recursive" = "TRUE");
After that, you can run MSCK REPAIR TABLE to recognize the partitions.
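A minimal sketch of the first option, assuming a new table named tab2 at a new location /tmp/hive2 (both hypothetical) that demotes hour and min to regular columns:
set hive.exec.dynamic.partition = true;
set hive.exec.dynamic.partition.mode = nonstrict;
create external table tab2 (usr string, hour string, min string)
partitioned by (year string, month string, day string)
row format delimited fields terminated by ',' lines terminated by '\n'
location '/tmp/hive2';
insert overwrite table tab2 partition (year, month, day)
select usr, hour, min, year, month, day from tab1;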
I am creating an external table in my own database:
create external table test1 (
event_uid string,
event_type_id int,
event_category_id int,
event_date string,
log_date string,
server string,
server_type string,
process_id int,
device_id string,
account_id string,
ip_address string,
category_id string,
content_id string,
entitlement_id string,
product_id string,
sku string,
title_id string,
service_id string,
order_id bigint,
transaction_id bigint,
company_code string,
product_code string,
key_value_pairs map<string,string>,
process_run_id string)
partitioned by (A string, B string, C string)
location '/data/a1/pnt/lte/formatted/evt'
When I try SHOW PARTITIONS test1, I just get OK as the output.
However, there is a table with the same DDL and the same location in another database which gives results when I do SHOW PARTITIONS. I have also tried MSCK REPAIR TABLE test1, which displays the partitions.
Please suggest.
When using partitions, no actual partitions are created when you execute your DDL. The partitions are created when you load data into your table. So you need to load data first, and then you will be able to see the partitions with the SHOW PARTITIONS statement.
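For example, a hedged sketch (the file path is hypothetical; A, B and C are the string partition columns from the DDL above):
LOAD DATA INPATH '/tmp/evt_20160719.txt' INTO TABLE test1 PARTITION (A='2016', B='07', C='19');
SHOW PARTITIONS test1;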
When we create an EXTERNAL TABLE with a PARTITION clause, we have to ALTER the EXTERNAL TABLE with the data location for that given partition. However, it need not be the same path as we specify while creating the EXTERNAL TABLE.
hive> ALTER TABLE test1 ADD PARTITION (A='2016', B='07', C='19')
    > LOCATION '/data/a1/pnt/lte/formatted/evt/somedatafor_20160719';
When we specify LOCATION '/data/a1/pnt/lte/formatted/evt' (though it's optional) while creating an EXTERNAL TABLE, we can take advantage of doing repair operations on that table. So when we want to copy files through some process like ETL into that directory, we can sync up the partitions with the EXTERNAL TABLE instead of writing an ALTER TABLE statement to create another new partition.
If we already know the directory structure of the partition that Hive would create for the next data set (say, here, for C=20), we can simply place the data file in that location, like '/data/a1/pnt/lte/formatted/evt/A=2016/B=07/C=20/data.txt', and run the statement shown below:
hive> MSCK REPAIR TABLE test1;
The above statement will sync up the partitions to the Hive metastore for the table "test1".
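To confirm, SHOW PARTITIONS should now include the new directory (a sketch; exact output formatting varies by version):
hive> SHOW PARTITIONS test1;
a=2016/b=07/c=20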