How do I add a partition to my Hive table?

I'm creating a table in Hive but unsure of the syntax to add a partition. Here is a simplified version of the create table statement:
CREATE TABLE sales.newtable AS
SELECT report_date
, SUM(cost_amt) AS cost_amt
, SUM(vendor_revenue_amt) AS vendor_revenue_amt
, SUM(gcr_amt) AS gcr_amt
, first_name
, last_name
, emailhash
FROM bi_reports.datorama_affiliate_mart AS orders
WHERE report_date >= '2019-01-01'
AND data_stream_name <> 'uds_order'
GROUP BY report_date
, first_name
, last_name
, emailhash
;

Creating a partitioned table with CREATE TABLE ... AS SELECT is supported only since Hive 3.2.0 (see HIVE-20241). For earlier Hive versions, create the table separately, then load the data using INSERT.
See the manual: Create Table As Select (CTAS)
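On Hive 3.2.0+, a partitioned CTAS might look like the sketch below. This assumes report_date is the column you want to partition on; in a partitioned CTAS the partition column is named in PARTITIONED BY (without a type) and must come last in the SELECT list:
-- dynamic partitioning settings, in case your configuration does not already allow this
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;

CREATE TABLE sales.newtable
PARTITIONED BY (report_date)
AS
SELECT SUM(cost_amt) AS cost_amt
, SUM(vendor_revenue_amt) AS vendor_revenue_amt
, SUM(gcr_amt) AS gcr_amt
, first_name
, last_name
, emailhash
, report_date -- partition column goes last
FROM bi_reports.datorama_affiliate_mart
WHERE report_date >= '2019-01-01'
AND data_stream_name <> 'uds_order'
GROUP BY report_date
, first_name
, last_name
, emailhash;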

CREATE-TABLE-AS-SELECT does not support partitioning (I'm not sure about the latest version):
FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does
not support partitioning in the target table
Instead, you can create sales.newtable yourself. Keep in mind that the partition column is declared in the PARTITIONED BY clause rather than in the regular column list, and it must be the last column in your insert query. Let's say emailhash is the partition column in your table; a sketch of the DDL follows.
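A minimal sketch of the CREATE TABLE, with assumed column types (adjust to your actual data):
CREATE TABLE sales.newtable (
report_date string
, cost_amt double
, vendor_revenue_amt double
, gcr_amt double
, first_name string
, last_name string
)
PARTITIONED BY (emailhash string);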
Then enable dynamic partitioning and insert:
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict; -- required when all partitions are dynamic
insert overwrite table sales.newtable PARTITION(emailhash)
SELECT report_date
, SUM(cost_amt) AS cost_amt
, SUM(vendor_revenue_amt) AS vendor_revenue_amt
, SUM(gcr_amt) AS gcr_amt
, first_name
, last_name
, emailhash
FROM bi_reports.datorama_affiliate_mart AS orders
WHERE report_date >= '2019-01-01'
AND data_stream_name <> 'uds_order'
GROUP BY report_date
, first_name
, last_name
, emailhash;

Related

Snowflake: Trying to insert into a table with values generated via multiple selects

I am trying to migrate a query from Redshift to Snowflake. This query is used to populate a row in a table, and the values are generated from various other tables. This worked fine in Redshift, but I have been checking the Snowflake documentation and can't find whether it is supported in any way in Snowflake. Does anyone have any idea how this could be rewritten?
insert into
etl_audit_table
(
batch_id,
batch_run_date,
table_name,
staging_row_count,
dwh_load_type
)
values
(
( select
batch_id
from
etl_cntrl.batch_history_table
where
batch_status='running'
)
,
current_date,
'tablename',
( select
count(*)
from
table
where
lastmodifieddate between '2021-07-31 00:00:00' and '2021-08-17 00:00:00'
),
'Slow Changing Dimension'
)
When trying to execute it on Snowflake I get the following error:
11:46:54 FAILED [INSERT - 0 rows, 0.658 secs] [Code: 2014, SQL State: 22000] SQL compilation error:
Invalid expression [(SELECT 1 AS "BATCH_ID" FROM TABLE (GENERATOR)ROWCOUNT => 1, rowCount => 1) GENERATOR)] in VALUES clause
I was able to make it work by rewriting the query as follows:
insert into
etl_audit_table
(
batch_id,
batch_run_date,
table_name,
staging_row_count,
dwh_load_type
)
select
( select
batch_id
from
etl_cntrl.batch_history_table
where
batch_status='running'
)
,
current_date,
'tablename',
( select
count(*)
from
table
where
lastmodifieddate between '2021-07-31 00:00:00' and '2021-08-17 00:00:00'
),
'Slow Changing Dimension'
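As far as I can tell, this works because Snowflake's VALUES clause only accepts constants and simple expressions, not subqueries, while INSERT ... SELECT places the same subqueries into an ordinary SELECT list, where they are allowed. A minimal illustration with hypothetical table names:
-- fails: subquery inside VALUES
-- insert into target_table (col_a, col_b)
-- values ((select max(id) from source_table), current_date);

-- works: the same expressions in a SELECT list
insert into target_table (col_a, col_b)
select (select max(id) from source_table), current_date;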

Combining Columns into One Table in SQL

I have two tables (Table 1 and Table 2) that include information on a company's insurance policies. There are thousands of rows and around 30 columns in each table. I need to create a table that combines certain columns from each table into a new table.
From Table 1 I need:
InvestmentCode, IndexType, Amount, FundID, PolicyNumber
From Table 2 I need:
PolicyNumber, FundValue, DepositDate, FundID
I want to merge the tables on FundID and PolicyNumber.
Actually, creating one more table would introduce data redundancy (the data is already present and you would just be copying it). You can always create a view for this. For your query, the view would be something like below:
CREATE OR REPLACE VIEW <view_name> AS
select T1.InvestmentCode, T1.IndexType, T1.Amount, T1.FundID, T1.PolicyNumber,
T2.FundValue, T2.DepositDate
from Table1 T1
join Table2 T2
on T1.FundID = T2.FundID
and T1.PolicyNumber = T2.PolicyNumber
WITH READ ONLY
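The view can then be queried like any table, for example (with a made-up policy number):
SELECT * FROM <view_name> WHERE PolicyNumber = 12345;
Note that WITH READ ONLY is Oracle syntax; on other databases you would simply omit that clause.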

Avoiding duplicates during insert

I am working on a stored procedure that currently rebuilds our fact table every hour. During the hourly refresh it truncates the table and inserts new data every time. I am trying to change that to only delete rows that are no longer needed and append new rows. I have the delete part working, but since the ID column (primary key) is generated upon insertion, I am not sure how to avoid inserting duplicate records, which is what I am currently seeing.
I've taken out the truncate table query and replaced it with a delete query. Now I need to work on avoiding duplicates during the insert.
--INSERT DATA FROM TEMP TABLE TO FACTBP
INSERT INTO dbo.FactBP
SELECT
[SOURCE]
,[DC_ORDER_NUMBER]
,[CUSTOMER_PURCHASE_ORDER_ID]
,[BILL_TO]
,[CUSTOMER_MASTER_RECORD_TYPE]
,[SHIP_TO]
,[CUSTOMER_NAME]
,[SALES_ORDER]
,[ORDER_CARRIER]
,[CARRIER_SERVICE_ID]
,[CREATE_DATE]
,[CREATE_TIME]
,[ALLOCATION_DATE]
,[REQUESTED_SHIP_DATE]
,[ADJ_REQ_SHIP]
,[CANCEL_DATE]
,[DISPATCH_DATE]
,[RELEASED_DATE]
,[RELEASED_TIME]
,[PRIORITY_ORDER]
,[SHIPPING_LOAD_NUMBER]
,[ORDER_HDR_STATUS]
,[ORDER_STATUS]
,[DELIVERY_NUMBER]
,[DCMS_ORDER_TYPE]
,[ORDER_TYPE]
,[MATERIAL]
,[QUALITY]
,[MERCHANDISE_SIZE_1]
,[SPECIAL_PROCESS_CODE_1]
,[SPECIAL_PROCESS_CODE_2]
,[SPECIAL_PROCESS_CODE_3]
,[DIVISION]
,[DIVISION_DESC]
,[ORDER_QTY]
,[ORDER_SELECTED_QTY]
,[CARTON_PARCEL_ID]
,[CARTON_ID]
,[SHIP_DATE]
,[SHIP_TIME]
,[PACKED_DATE]
,[PACKED_TIME]
,[ADJ_PACKED_DATE]
,[FULL_CASE_PULL_STATUS]
,[CARRIER_ID]
,[TRAILER_ID]
,[WAVE_NUMBER]
,[DISPATCH_RELEASE_PRIORITY]
,[CARTON_TOTE_COUNT]
,[PICK_PACK_METHOD]
,[RELEASED_QTY]
,[SHIP_QTY]
,[MERCHANDISE_STYLE]
,[PICK_WAREHOUSE]
,[PICK_AREA]
,[PICK_ZONE]
,[PICK_AISLE]
,EST_DEL_DATE
,null
--,[ID]
FROM #TEMP_FACT
--code for avoiding duplicates
--DELETE OLD DATA FROM FACTBP
DELETE FROM dbo.FactBP
WHERE SHIP_DATE < DATEADD(s, -1, DATEADD(mm, DATEDIFF(m, 0, GETDATE()) - 2, 0))
AND SHIP_DATE IS NOT NULL
You need to check against the natural key. Since you're talking about a fact table, the natural key is probably the combination of a lot of fields. If we assume SOURCE and DC_ORDER_NUMBER make up the natural key, this should work:
INSERT INTO dbo.FactBP
SELECT
t.[SOURCE]
, t.[DC_ORDER_NUMBER]
, t.[CUSTOMER_PURCHASE_ORDER_ID]
, t.[BILL_TO]
, t.[CUSTOMER_MASTER_RECORD_TYPE]
, t.[SHIP_TO]
, t.[CUSTOMER_NAME]
, t.[SALES_ORDER]
, t.[ORDER_CARRIER]
, t.[CARRIER_SERVICE_ID]
, t.[CREATE_DATE]
, t.[CREATE_TIME]
, t.[ALLOCATION_DATE]
, t.[REQUESTED_SHIP_DATE]
, t.[ADJ_REQ_SHIP]
, t.[CANCEL_DATE]
, t.[DISPATCH_DATE]
, t.[RELEASED_DATE]
, t.[RELEASED_TIME]
, t.[PRIORITY_ORDER]
, t.[SHIPPING_LOAD_NUMBER]
, t.[ORDER_HDR_STATUS]
, t.[ORDER_STATUS]
, t.[DELIVERY_NUMBER]
, t.[DCMS_ORDER_TYPE]
, t.[ORDER_TYPE]
, t.[MATERIAL]
, t.[QUALITY]
, t.[MERCHANDISE_SIZE_1]
, t.[SPECIAL_PROCESS_CODE_1]
, t.[SPECIAL_PROCESS_CODE_2]
, t.[SPECIAL_PROCESS_CODE_3]
, t.[DIVISION]
, t.[DIVISION_DESC]
, t.[ORDER_QTY]
, t.[ORDER_SELECTED_QTY]
, t.[CARTON_PARCEL_ID]
, t.[CARTON_ID]
, t.[SHIP_DATE]
, t.[SHIP_TIME]
, t.[PACKED_DATE]
, t.[PACKED_TIME]
, t.[ADJ_PACKED_DATE]
, t.[FULL_CASE_PULL_STATUS]
, t.[CARRIER_ID]
, t.[TRAILER_ID]
, t.[WAVE_NUMBER]
, t.[DISPATCH_RELEASE_PRIORITY]
, t.[CARTON_TOTE_COUNT]
, t.[PICK_PACK_METHOD]
, t.[RELEASED_QTY]
, t.[SHIP_QTY]
, t.[MERCHANDISE_STYLE]
, t.[PICK_WAREHOUSE]
, t.[PICK_AREA]
, t.[PICK_ZONE]
, t.[PICK_AISLE]
, t.EST_DEL_DATE
, null
--,[ID]
FROM #TEMP_FACT t
left outer join dbo.FactBP f on f.[SOURCE] = t.[SOURCE]
and f.[DC_ORDER_NUMBER] = t.[DC_ORDER_NUMBER]
where f.[SOURCE] is null
Adjust the join and the WHERE clause to match the natural key of the table.
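If you prefer, the same anti-join can be expressed with NOT EXISTS. A sketch under the same natural-key assumption, with the column list abbreviated:
INSERT INTO dbo.FactBP
SELECT
t.[SOURCE]
, t.[DC_ORDER_NUMBER]
-- ... remaining columns as in the query above ...
FROM #TEMP_FACT t
WHERE NOT EXISTS (
SELECT 1
FROM dbo.FactBP f
WHERE f.[SOURCE] = t.[SOURCE]
AND f.[DC_ORDER_NUMBER] = t.[DC_ORDER_NUMBER]
);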
You should also take another look at your DELETE script. Do you really want to delete all records with a SHIP_DATE < 2019-07-31 23:59:59.000? Or should that be <=? Maybe this will work better (and is simpler):
DELETE FROM dbo.FactBP
WHERE SHIP_DATE < cast(dateadd(day, 1, EOMONTH(getdate(), -3)) as datetime2)
and SHIP_DATE IS NOT NULL

hive explain plan not showing partition

I have a table which contains 251M records and is 2.5 GB in size.
I partitioned it on the two columns that I filter on in the predicate.
But the explain plan does not show that it reads the partitions, even though the table is partitioned. I am selecting on the partition columns and inserting the result into another table.
Is there a particular order in which I have to list the conditions in the predicate?
How can I improve performance?
explain
SELECT
'123' AS run_session_id
, tbl1.transaction_id
, tbl1.src_transaction_id
, tbl1.transaction_created_epoch_time
, tbl1.currency
, tbl1.event_type
, tbl1.event_sub_type
, tbl1.estimated_total_cost
, tbl1.actual_total_cost
, tbl1.tfc_export_created_epoch_time
, tbl1.authorizer
, tbl1.acquirer
, tbl1.processor
, tbl1.company_code
, tbl1.country_of_account
, tbl1.merchant_id
, tbl1.client_id
, tbl1.ft_id
, tbl1.transaction_created_date
, tbl1.event_pst_time
, tbl1.extract_id_seq
, tbl1.src_type
, ROW_NUMBER() OVER(PARTITION BY tbl1.transaction_id ORDER BY tbl1.event_pst_time DESC) AS seq_num -- while writing back to the pfit events table, write each event so that event_pst_time is populated the right way
FROM db.xx_events tbl1 --<hiveFinalDB>-- -- DB variables won't work, so the DB needs to be changed accordingly for testing and PROD deployment
WHERE id_seq >= 215
AND id_seq <= 275
AND vent in('SPT','PNR','PNE','PNER','ACT','NTE');
Now how do I improve performance? The partition columns are (id_seq, vent).
Run explain dependency select ... to see which partitions will actually be read.
Demo:
create table mytable (i int)
partitioned by (dt date)
;
alter table mytable add
partition (dt=date '2017-06-18')
partition (dt=date '2017-06-19')
partition (dt=date '2017-06-20')
partition (dt=date '2017-06-21')
partition (dt=date '2017-06-22')
;
explain dependency
select *
from mytable
where dt >= date '2017-06-20'
;
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| Explain |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| {"input_tables":[{"tablename":"local_db#mytable","tabletype":"MANAGED_TABLE"}],"input_partitions":[{"partitionName":"local_db#mytable#dt=2017-06-20"},{"partitionName":"local_db#mytable#dt=2017-06-21"},{"partitionName":"local_db#mytable#dt=2017-06-22"}]} |
+---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
"input_partitions":[
{"partitionName":"local_db#mytable#dt=2017-06-20"},{"partitionName":"local_db#mytable#dt=2017-06-21"},{"partitionName":"local_db#mytable#dt=2017-06-22"}]}

How to query newest item from table with duplicate items?

I have to deal with data that is being dumped into a "log" table within SQL Server. Unfortunately, I can't make changes. Basically, a process runs daily and dumps some duplicate items into a table.
Table 1:
import_id: guid
import_at: datetime
Table 2:
item_id: guid
import_id: guid (foreign key)
item_url: varchar(1000)
item_name: varchar(50)
item_description: varchar(1000)
Sometimes Table 2 will have a duplicate item_url. I only want to get the list of item_id and item_url from the newest import.
The query below will return one row per item_url, the one with the latest import_at value:
WITH all_items AS (
SELECT
t1.import_id
, t1.import_at
, t2.item_id
, t2.item_url
, t2.item_name
, t2.item_description
, ROW_NUMBER() OVER(PARTITION BY item_url ORDER BY t1.import_at DESC) AS item_url_rank
FROM dbo.table1 AS t1
JOIN dbo.table2 AS t2 ON
t2.import_id = t1.import_id
)
SELECT
import_id
, import_at
, item_id
, item_url
, item_name
, item_description
FROM all_items
WHERE
item_url_rank = 1;
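If instead you want only the rows from the single newest import (rather than the latest occurrence of each item_url), a simpler sketch in SQL Server:
SELECT t2.item_id, t2.item_url
FROM dbo.table2 AS t2
WHERE t2.import_id = (
SELECT TOP 1 import_id
FROM dbo.table1
ORDER BY import_at DESC
);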