Hive explain plan not showing partition

I have a table that contains 251M records and is 2.5 GB in size.
I partitioned it on the two columns I filter on in the predicate.
However, the explain plan does not show that only those partitions are being read, even though the table is partitioned. I am selecting from the partitioned table (filtering on the partition columns) and inserting the result into another table.
Is there a particular order in which I have to write the conditions in the predicate?
How should I improve performance?
explain
SELECT
'123' AS run_session_id
, tbl1.transaction_id
, tbl1.src_transaction_id
, tbl1.transaction_created_epoch_time
, tbl1.currency
, tbl1.event_type
, tbl1.event_sub_type
, tbl1.estimated_total_cost
, tbl1.actual_total_cost
, tbl1.tfc_export_created_epoch_time
, tbl1.authorizer
, tbl1.acquirer
, tbl1.processor
, tbl1.company_code
, tbl1.country_of_account
, tbl1.merchant_id
, tbl1.client_id
, tbl1.ft_id
, tbl1.transaction_created_date
, tbl1.event_pst_time
, tbl1.extract_id_seq
, tbl1.src_type
, ROW_NUMBER() OVER(PARTITION BY tbl1.transaction_id ORDER BY tbl1.event_pst_time DESC) AS seq_num -- while writing back to the pfit events table, write each event so that event_pst_time is populated in the right way
FROM db.xx_events tbl1 --<hiveFinalDB>-- -- DB variables won't work, so the DB needs to be changed accordingly for testing and PROD deployment
WHERE id_seq >= 215
AND id_seq <= 275
AND vent in('SPT','PNR','PNE','PNER','ACT','NTE');
Now, how do I improve performance? The partition columns are (id_seq, vent).

Use explain dependency select ... to see which partitions are actually being read.
Demo:
create table mytable (i int)
partitioned by (dt date)
;
alter table mytable add
partition (dt=date '2017-06-18')
partition (dt=date '2017-06-19')
partition (dt=date '2017-06-20')
partition (dt=date '2017-06-21')
partition (dt=date '2017-06-22')
;
explain dependency
select *
from mytable
where dt >= date '2017-06-20'
;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Explain |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"input_tables":[{"tablename":"local_db#mytable","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"local_db#mytable#dt=2017-06-20"},{"partitionName":"local_db#mytable#dt=2017-06-21"},{"partitionName":"local_db#mytable#dt=2017-06-22"}]} |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
"input_partitions":[
{"partitionName":"local_db#mytable#dt=2017-06-20"},{"partitionName":"local_db#mytable#dt=2017-06-21"},{"partitionName":"local_db#mytable#dt=2017-06-22"}]}

Related

How to remove duplicated records based on a column I cannot sort on in Hive

I have:
A table test containing:
unique_id string, file_name string, mount bigint
Sample of data:
unique_id , file_name , mount
1 , test.txt , 15
1 , test_R_file.txt , 50
3 , test_567.txt , 30
3 , test_567_R_file.txt , 100
What I want to do:
I need a query to INSERT OVERWRITE the table so that, for each duplicated unique_id, only one record is kept, and that record should be the one that has R in the file_name column.
The issue:
test is an external table in Hive (which means it does not support UPDATE and DELETE operations), so I want to use INSERT OVERWRITE to remove the duplicated records for each unique_id (when there are 2 records with the same unique_id, only the record that has R in the file_name should stay). I was thinking of using ranking, but I do not have a column to order on to decide which record to keep and which to remove; I only have the file_name column to check when two records share the same unique_id.
You can order by a boolean expression that checks whether R exists in the file name; sorting by a boolean works because true is greater than false. You can also wrap the check in a CASE expression that converts the boolean to an int, add more conditions to the CASE, and add more ORDER BY expressions, comma separated.
Demo:
with mytable as (--demo dataset, use your table instead of this CTE
select 1 unique_id , 'test.txt' file_name , 15 mount union all
select 1 , 'test_R_file.txt' , 50 union all
select 3 , 'test_567.txt' , 30 union all
select 3 , 'test_567_R_file.txt' , 100
)
select unique_id, file_name, mount
from
(
select unique_id, file_name, mount,
row_number() over(partition by unique_id
order by file_name rlike '_R_' desc --True is greater than False
--order by something else if necessary
) rn
from mytable t
) s
where rn=1
Result:
unique_id file_name mount
1 test_R_file.txt 50
3 test_567_R_file.txt 100
Use rank() instead of row_number() if there may be multiple records with R per unique_id and you want to keep all of them: rank() assigns 1 to every record with R, while row_number() assigns 1 to only one such record per unique_id.
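For example, the CASE variant mentioned above could look like this (a sketch against the same demo data; the mount tie-breaker is only an illustration, not something required by the question):
with mytable as (--demo dataset, use your table instead of this CTE
select 1 unique_id , 'test.txt' file_name , 15 mount union all
select 1 , 'test_R_file.txt' , 50 union all
select 3 , 'test_567.txt' , 30 union all
select 3 , 'test_567_R_file.txt' , 100
)
select unique_id, file_name, mount
from
(
select unique_id, file_name, mount,
row_number() over(partition by unique_id
order by case when file_name rlike '_R_' then 1 else 0 end desc, --R records first
mount desc --illustrative additional tie-breaker
) rn
from mytable t
) s
where rn=1;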

Combining Columns into One Table in SQL

I have two tables (Table 1 and Table 2) that include information on a company's insurance policies. There are thousands of rows and around 30 columns in each table. I need to create a table that combines certain columns from each table into a new table.
From Table 1 I need:
InvestmentCode, IndexType, Amount, FundID, PolicyNumber
From Table 2 I need:
PolicyNumber, FundValue, DepositDate, FundID
I want to merge the tables on FundID and PolicyNumber.
Actually, creating one more table would introduce data redundancy (the data is already present and you would just be copying it). You can always create a view for this; for your query, the view would be something like below:
CREATE OR REPLACE VIEW <view_name> AS
select T1.InvestmentCode , T1.IndexType , T1.Amount , T1.FundID , T1.PolicyNumber ,
T2.FundValue , T2.DepositDate from Table1 T1 , Table2 T2
where T1.FundID = T2.FundID
and T1.PolicyNumber = T2.PolicyNumber
WITH READ ONLY
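Equivalently, the same view with explicit JOIN syntax (a sketch reusing the table and column names above; if your database is not Oracle, the WITH READ ONLY clause may need to be dropped):
CREATE OR REPLACE VIEW <view_name> AS
SELECT T1.InvestmentCode, T1.IndexType, T1.Amount, T1.FundID, T1.PolicyNumber,
T2.FundValue, T2.DepositDate
FROM Table1 T1
JOIN Table2 T2
ON T1.FundID = T2.FundID
AND T1.PolicyNumber = T2.PolicyNumber
WITH READ ONLY;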

How do I add a partition to my Hive table?

I'm creating a table in Hive but unsure of the syntax to add a partition. Here is a simplified version of the create table statement:
CREATE TABLE sales.newtable AS
SELECT report_date
, SUM(cost_amt) AS cost_amt
, SUM(vendor_revenue_amt) AS vendor_revenue_amt
, SUM(gcr_amt) AS gcr_amt
, first_name
, last_name
, emailhash
FROM bi_reports.datorama_affiliate_mart AS orders
WHERE report_date >= '2019-01-01'
AND data_stream_name <> 'uds_order'
GROUP BY report_date
, first_name
, last_name
, emailhash
;
CREATE TABLE ... AS SELECT for partitioned tables is supported only since Hive 3.2.0, see HIVE-20241.
For previous Hive versions, create the table separately, then load the data using INSERT.
See the manual here: Create Table As Select (CTAS)
CREATE-TABLE-AS-SELECT does not support partitioning (not sure about the latest version):
FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does
not support partitioning in the target table
Instead you can create sales.newtable first, but keep in mind that the partition column should be the last column in your table definition and the last column in your insert query as well.
Let's say emailhash is your partition column in the table.
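A possible table definition could look like this (a sketch; the column types are assumptions, since the source table's types are not shown in the question):
CREATE TABLE sales.newtable (
report_date string, -- type assumed
cost_amt double, -- type assumed
vendor_revenue_amt double, -- type assumed
gcr_amt double, -- type assumed
first_name string,
last_name string
)
PARTITIONED BY (emailhash string);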
and then insert:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict; -- needed here because the only partition column is dynamic
insert overwrite table sales.newtable PARTITION(emailhash)
SELECT report_date
, SUM(cost_amt) AS cost_amt
, SUM(vendor_revenue_amt) AS vendor_revenue_amt
, SUM(gcr_amt) AS gcr_amt
, first_name
, last_name
, emailhash
FROM bi_reports.datorama_affiliate_mart AS orders
WHERE report_date >= '2019-01-01'
AND data_stream_name <> 'uds_order'
GROUP BY report_date
, first_name
, last_name
, emailhash;

Updating a 3 Million record table using LEAD() function

I need to update the EDW_END_DATE column in a dimension table using the LEAD() function. The table has 3 million records, and the Oracle query seems to run forever.
UPDATE Edwstu.Stu_Class_D A
SET EDW_END_DATE =
(
    SELECT Edw_New_End_Dt
    FROM
    (
        SELECT
            LEAD(Edw_Begin_Date - 1, 1, '31-DEC-2099') OVER (
                PARTITION BY Acad_Term_Code, Class_Number
                ORDER BY Edw_Begin_Date ASC
            ) AS Edw_New_End_Dt,
            STU_CLASS_KEY
        FROM Edwstu.Stu_Class_D
    ) B
    WHERE A.STU_CLASS_KEY = B.STU_CLASS_KEY
);
Try updating it using a MERGE statement:
MERGE INTO EDWSTU.STU_CLASS_D A
USING (
SELECT
LEAD(EDW_BEGIN_DATE - 1, 1, '31-DEC-2099') OVER(
PARTITION BY ACAD_TERM_CODE, CLASS_NUMBER
ORDER BY
EDW_BEGIN_DATE ASC
) AS EDW_NEW_END_DT,
STU_CLASS_KEY
FROM
EDWSTU.STU_CLASS_D
)
B ON ( A.STU_CLASS_KEY = B.STU_CLASS_KEY )
WHEN MATCHED THEN
UPDATE SET A.EDW_END_DATE = B.EDW_NEW_END_DT;
Cheers!!
You are updating all the rows in the table. This is generally an expensive operation due to locking and logging.
You might consider regenerating the table entirely. Note: before doing this, back up the table.
-- create the table with the results you want
create table temp_stu_class_d as
select d.*,
       lead(Edw_Begin_Date - 1, 1, date '2099-12-31')
         over (partition by Acad_Term_Code, Class_Number order by Edw_Begin_Date) as Edw_New_End_Dt
from Edwstu.Stu_Class_D d;
-- remove the contents of the current table
truncate table Edwstu.Stu_Class_D;
-- reload it with the recalculated end dates
insert into Edwstu.Stu_Class_D ( . . . , Edw_End_Date) -- list the columns here
select . . . , Edw_New_End_Dt -- and here
from temp_stu_class_d;
The insert is generally much more efficient than logging each update.

SQL Partition Elimination

I am currently testing a partitioning configuration, using the actual execution plan to inspect the RunTimePartitionSummary/PartitionsAccessed info.
When a query is run with a literal against the partitioning column, partition elimination works fine (using both = and <=). However, if the query is joined to a lookup table, with the partitioning column <= a column in the lookup table and the lookup table restricted by another criterion (so that only one row is returned, the same as with a literal), elimination does not occur.
This only seems to happen if the join criterion is <= rather than =, even though the result is the same. Reversing the logic and using BETWEEN does not work either, nor does using a cross-applied function.
Edit: (Repro Steps)
OK here you go!
--Create sample function
CREATE PARTITION FUNCTION pf_Test(date) AS RANGE RIGHT FOR VALUES ('20110101','20110102','20110103','20110104','20110105')
--Create sample scheme
CREATE PARTITION SCHEME ps_Test AS PARTITION pf_Test ALL TO ([PRIMARY])
--Create sample table
CREATE TABLE t_Test
(
RowID int identity(1,1)
,StartDate date NOT NULL
,EndDate date NULL
,Data varchar(50) NULL
)
ON ps_Test(StartDate)
--Insert some sample data
INSERT INTO t_Test(StartDate,EndDate,Data)
VALUES
('20110101','20110102','A')
,('20110103','20110104','A')
,('20110105',NULL,'A')
,('20110101',NULL,'B')
,('20110102','20110104','C')
,('20110105',NULL,'C')
,('20110104',NULL,'D')
--Check partition allocation
SELECT *,$PARTITION.pf_Test(StartDate) AS PartitionNumber FROM t_Test
--Run simple test (include actual execution plan)
SELECT
*
,$PARTITION.pf_Test(StartDate)
FROM t_Test
WHERE StartDate <= '20110103' AND ISNULL(EndDate,getdate()) >= '20110103'
--<PartitionRange Start="1" End="4" />
--Run test with join to a lookup (with CTE for simplicity, but doesn't work with a table either)
WITH testCTE AS
(
SELECT convert(date,'20110101') AS CalendarDate,'A' AS SomethingInteresting
UNION ALL
SELECT convert(date,'20110102') AS CalendarDate,'B' AS SomethingInteresting
UNION ALL
SELECT convert(date,'20110103') AS CalendarDate,'C' AS SomethingInteresting
UNION ALL
SELECT convert(date,'20110104') AS CalendarDate,'D' AS SomethingInteresting
UNION ALL
SELECT convert(date,'20110105') AS CalendarDate,'E' AS SomethingInteresting
UNION ALL
SELECT convert(date,'20110106') AS CalendarDate,'F' AS SomethingInteresting
UNION ALL
SELECT convert(date,'20110107') AS CalendarDate,'G' AS SomethingInteresting
UNION ALL
SELECT convert(date,'20110108') AS CalendarDate,'H' AS SomethingInteresting
UNION ALL
SELECT convert(date,'20110109') AS CalendarDate,'I' AS SomethingInteresting
)
SELECT
C.CalendarDate
,T.*
,$PARTITION.pf_Test(StartDate)
FROM t_Test T
INNER JOIN testCTE C
ON T.StartDate <= C.CalendarDate AND ISNULL(T.EndDate,getdate()) >= C.CalendarDate
WHERE C.SomethingInteresting = 'C' --<PartitionRange Start="1" End="6" />
--So all 6 partitions are scanned despite only 2,3,4 being required, as per the simple select.
--edited to make resultant ranges identical to ensure fair test
It makes sense for the query to scan all the partitions.
All partitions are involved in the predicate T.StartDate <= C.CalendarDate, because the query planner can't possibly know which values C.CalendarDate might take.
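One possible workaround (not from the original answer, just a sketch): resolve the lookup value into a variable first, so the predicate against the partitioning column compares against a single known value at execution time; whether this actually achieves elimination can still depend on the SQL Server version and plan shape. Here dbo.Calendar is a hypothetical permanent table with the same columns as the testCTE above.
DECLARE @d date;

SELECT @d = CalendarDate
FROM dbo.Calendar           -- hypothetical permanent lookup table
WHERE SomethingInteresting = 'C';

SELECT *, $PARTITION.pf_Test(StartDate)
FROM t_Test
WHERE StartDate <= @d
  AND ISNULL(EndDate, GETDATE()) >= @d;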