Querying based on Partition and non-partition column in Hive - hive

I have an external Hive table as follows :-
CREATE external TABLE sales (
ItemNbr STRING,
itemShippedQty INT,
itemDeptNbr SMALLINT,
gateOutUserId STRING,
code VARCHAR(3),
trackingId STRING,
baseDivCode STRING
)
PARTITIONED BY (countryCode STRING, sourceNbr INT, date STRING)
STORED AS PARQUET
LOCATION '/user/sales/';
The table is partitioned by three columns (countryCode, sourceNbr, date). I know that a query filtering on all three partition columns will be faster.
I have some questions about other query patterns:
If I add a non-partition column to the WHERE clause along with the partition columns (countryCode, sourceNbr, date, ItemNbr), will the query scan the full table, or will it scan only the folders matching countryCode, sourceNbr and date and look for the specified ItemNbr value there?
Do I have to specify all partition columns for the filter to work, or does a partial filter also work? For example, if I give only the first two columns (countryCode, sourceNbr) in the WHERE clause, will the query scan the full table, or will it search only inside the folders matching those two columns?

Partition pruning works in all of your cases, whether all partition columns appear in the WHERE clause or only some of them. Filters on other columns do not affect partition pruning: they are applied to the rows read from the partitions that remain after pruning.
To check this, use the EXPLAIN EXTENDED command; see https://stackoverflow.com/a/50859735/2700344
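As a rough illustration of that check, the sketch below runs EXPLAIN EXTENDED against the sales table from the question, filtering on two of the three partition columns plus one ordinary column. The filter values ('US', 6001, 'ABC123') are made up for the example.

```sql
-- Hypothetical filter values; only the table and column names come from the question.
EXPLAIN EXTENDED
SELECT ItemNbr, itemShippedQty
FROM sales
WHERE countryCode = 'US'       -- partition column
  AND sourceNbr   = 6001       -- partition column
  AND ItemNbr     = 'ABC123';  -- ordinary column, evaluated on the rows that are read
```

If pruning applies, the input paths listed in the plan should be limited to directories of the form countrycode=US/sourcenbr=6001/date=... under the table location, rather than every partition of the table.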

Related

Split Hive table on subtables by field value

I have a Hive table foo with several fields, one of which is some_id. The number of unique values in this field is in the range 5,000-10,000. For each value (10385 in this example) I need to run a CTAS query like
CREATE TABLE bar_10385 AS
SELECT * FROM foo WHERE some_id=10385 AND other_id=10385;
What is the best way to perform this bunch of queries?
You can store all these tables in a single partitioned table. This approach lets you load all the data in a single query, and query performance will not be compromised.
Create table T (
... --columns here
)
partitioned by (id int); --new calculated partition key
Load data using one query, it will read source table only once:
insert overwrite table T partition(id)
select ..., --columns
case when some_id=10385 AND other_id=10385 then 10385
when some_id=10386 AND other_id=10386 then 10386
...
--and so on
else 0 --default partition for records not attributed
end as id --partition column
from foo
where some_id in (10385,10386) AND other_id in (10385,10386) --filter
Then you can use this table in queries specifying partition:
select * from T where id = 10385; --you can create a view named bar_10385; it will act the same as your table. Partition pruning is fast
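A minimal sketch of the pieces around that load, assuming the table T and partition column id from the answer. The SET statements enable fully dynamic partition inserts. The two limit values are assumptions sized above the 5,000-10,000 partitions expected here, since Hive's default limits are typically lower.

```sql
-- Enable dynamic partition inserts with no static partition value.
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Assumed limits, chosen to exceed the expected number of partitions.
SET hive.exec.max.dynamic.partitions=20000;
SET hive.exec.max.dynamic.partitions.pernode=20000;

-- Optional view standing in for one of the per-id tables.
CREATE VIEW bar_10385 AS
SELECT * FROM T WHERE id = 10385;
```

Unlike the original CTAS result, the view also exposes the id partition column. List the columns explicitly in the view if that matters.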

BigQuery - Append missing records from one table to another

I have two tables, 'todays_data' and 'full_data', with the same schema (Id string, name string, Age string). Records in 'todays_data' may or may not already be present in 'full_data'. I need to identify the new records (new Id) in 'todays_data' and append them to 'full_data' (Id is the reference key). How can I achieve this using 1) a Web UI SQL statement and 2) the bq command?
Below is a query you should run with the full_data table as the Destination Table and with Append to table as the Write Preference
SELECT id, name, age
FROM todays_data
WHERE NOT id IN (
SELECT id
FROM full_data
GROUP BY id
)
For more on how to do this in the Web UI and on the command line, see Storing results in a permanent table
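As an alternative sketch, assuming BigQuery standard SQL with DML is available, the same append can be written as a single INSERT statement without setting a destination table. The dataset name mydataset is hypothetical.

```sql
-- mydataset is a hypothetical dataset name; table and column names come from the question.
INSERT INTO mydataset.full_data (id, name, age)
SELECT t.id, t.name, t.age
FROM mydataset.todays_data AS t
WHERE NOT EXISTS (
  SELECT 1
  FROM mydataset.full_data AS f
  WHERE f.id = t.id
);
```

NOT EXISTS is used here instead of NOT IN because NOT IN returns no rows at all if the subquery produces a NULL id.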

Table partitioning with procedure input parameter

I'm trying to partition my table on an ID that I get from a procedure parameter.
For example my table ddl:
CREATE TABLE bigtable
( ID number )
As the procedure input parameter I get a number, e.g. 130, so I'm trying to create a partition:
Alter table bigtable
add partition part_random_number values(random number);
Of course, by random number I mean e.g. 120, 56, etc. :)
But I got an error saying that the object is not partitioned. So I first tried to define the partition clause in the create table statement:
CREATE TABLE bigtable
( ID number )
PARTITION BY list (ID)
But it doesn't work. It works only when I define some partition, e.g.
CREATE TABLE bigtable
( ID number )
PARTITION BY list (ID)
( partition type values(130)
);
But I would like to avoid it... Is there any other solution?
As a result, I would like to have the table partitioned by the procedure input parameters.
A partitioned table has to have at least one partition. Just create it with a dummy partition and add the ones you actually need using your procedure.
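A minimal sketch of that approach in Oracle SQL, assuming -1 never occurs as a real ID. The partition names part_dummy and part_130 are made up for the example.

```sql
-- Create the table with one placeholder partition (-1 is assumed to be unused).
CREATE TABLE bigtable
( ID NUMBER )
PARTITION BY LIST (ID)
( PARTITION part_dummy VALUES (-1) );

-- Later, add a partition for a value received by the procedure, e.g. 130.
ALTER TABLE bigtable ADD PARTITION part_130 VALUES (130);
```

Inside a PL/SQL procedure, DDL such as this ALTER TABLE has to be issued through EXECUTE IMMEDIATE, with the statement text built from the input parameter.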

Hive Partitioning Ineffective while using Hive Functions in Queries

I've been working on Hive partitioning for the past few days. Here is an example I created:
Table - Transactions (Non - Partitioned Managed Table):
CREATE TABLE TRANSACTIONS (
txnno INT,
txndate STRING,
custid INT,
amount DOUBLE,
category STRING,
product STRING,
city STRING,
state STRING,
spendby STRING)
row format delimited fields terminated by ',';
Loaded the data inside this table using the load command.
Created another table as follows :-
Table - Txnrecordsbycat (Partitioned Managed Table):
CREATE TABLE TXNRECORDSBYCAT(txnno INT, txndate STRING, custno INT, amount DOUBLE, product STRING, city STRING, state STRING, spendby STRING)
partitioned by (category STRING)
clustered by (state) INTO 10 buckets
row format delimited fields terminated by ',';
Used the following query to load the data from Transactions table to Txnrecordsbycat table.
FROM TRANSACTIONS txn INSERT OVERWRITE TABLE TXNRECORDSBYCAT PARTITION(category) select txn.txnno,txn.txndate,txn.custid, txn.amount, txn.product,txn.city, txn.state, txn.spendby,txn.category DISTRIBUTE BY CATEGORY;
Now, as long as I'm firing simple queries like select * from Transactions and select * from Txnrecordsbycat, my queries are efficient (i.e. take less execution time) on the partitioned table compared to the non-partitioned table.
However, as soon as my queries become a little more complex, something like select count(*) from table, the query becomes less efficient (i.e. takes more time) on the partitioned table.
Any idea why this might be happening?
Many Thanks for help.

Is there a way to partition an existing text file with Impala without pre-splitting the files into the partitioned directories?

Say I have a single file "fruitsbought.csv" that contains many records, each with a date field.
Is it possible to partition the data for better performance by creating the "fruits" table directly from that text file, so that the rows in fruitsbought.csv matching each partition (say, by year and month) are placed into that partition automatically?
Or do I have to create, as part of a separate process, a directory for each year and put the appropriate ".csv" files, filtered down to that year, into the directory structure on HDFS before creating the table in impala-shell?
I heard that you can create an empty table, set up partitions, and then use INSERT statements that specify the partition each record goes into. In my current case, though, I already have a single "fruitsbought.csv" containing every record I want, and I like that I can turn it into a table directly (although without partitioning).
Do I have to develop a separate process to pre-split the one file into multiple files placed under the right partitions? (The one file is very, very big.)
Create an external table over fruitsbought.csv (id is just an example column; ..... stands for the rest of the columns in the table):
CREATE EXTERNAL TABLE fruitsboughexternal
(
id INT,
.....
mydate STRING
) ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 'somelocationwithfruitsboughtfile/';
Create a table partitioned on the date:
CREATE TABLE fruitsbought(id INT, .....)
PARTITIONED BY (year INT, month INT, day INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
Import the data into the fruitsbought table. The partition columns have to come last in the SELECT list (and of course mydate has to be in a format understood by Impala, such as 2014-06-20 06:05:25):
INSERT INTO fruitsbought PARTITION(year, month, day) SELECT id, ..., year(mydate), month(mydate), day(mydate) FROM fruitsboughexternal;
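Once the insert has finished, a few follow-up statements can confirm that the partitions were created and are being used. This is only a sketch; the year and month values in the final query are hypothetical.

```sql
-- List the partitions created by the dynamic-partition INSERT.
SHOW PARTITIONS fruitsbought;

-- Gather table and column statistics for the query planner.
COMPUTE STATS fruitsbought;

-- A filter on the partition columns lets Impala read only the matching directories.
SELECT COUNT(*) FROM fruitsbought WHERE year = 2014 AND month = 6;
```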