Avoiding duplicates during insert - sql

I am working on a stored procedure that rebuilds our fact table every hour. At the moment, each hourly refresh truncates the table and re-inserts all of the data. I am trying to change that so it only deletes rows that are no longer needed and appends new rows.
I have replaced the truncate with a delete query, but because the ID column (the primary key) is generated on insert, I am not sure how to keep the insert from adding rows that already exist in the table, which is what I am currently seeing.
--INSERT DATA FROM TEMP TABLE TO FACTBP
INSERT INTO dbo.FactBP
SELECT
[SOURCE]
,[DC_ORDER_NUMBER]
,[CUSTOMER_PURCHASE_ORDER_ID]
,[BILL_TO]
,[CUSTOMER_MASTER_RECORD_TYPE]
,[SHIP_TO]
,[CUSTOMER_NAME]
,[SALES_ORDER]
,[ORDER_CARRIER]
,[CARRIER_SERVICE_ID]
,[CREATE_DATE]
,[CREATE_TIME]
,[ALLOCATION_DATE]
,[REQUESTED_SHIP_DATE]
,[ADJ_REQ_SHIP]
,[CANCEL_DATE]
,[DISPATCH_DATE]
,[RELEASED_DATE]
,[RELEASED_TIME]
,[PRIORITY_ORDER]
,[SHIPPING_LOAD_NUMBER]
,[ORDER_HDR_STATUS]
,[ORDER_STATUS]
,[DELIVERY_NUMBER]
,[DCMS_ORDER_TYPE]
,[ORDER_TYPE]
,[MATERIAL]
,[QUALITY]
,[MERCHANDISE_SIZE_1]
,[SPECIAL_PROCESS_CODE_1]
,[SPECIAL_PROCESS_CODE_2]
,[SPECIAL_PROCESS_CODE_3]
,[DIVISION]
,[DIVISION_DESC]
,[ORDER_QTY]
,[ORDER_SELECTED_QTY]
,[CARTON_PARCEL_ID]
,[CARTON_ID]
,[SHIP_DATE]
,[SHIP_TIME]
,[PACKED_DATE]
,[PACKED_TIME]
,[ADJ_PACKED_DATE]
,[FULL_CASE_PULL_STATUS]
,[CARRIER_ID]
,[TRAILER_ID]
,[WAVE_NUMBER]
,[DISPATCH_RELEASE_PRIORITY]
,[CARTON_TOTE_COUNT]
,[PICK_PACK_METHOD]
,[RELEASED_QTY]
,[SHIP_QTY]
,[MERCHANDISE_STYLE]
,[PICK_WAREHOUSE]
,[PICK_AREA]
,[PICK_ZONE]
,[PICK_AISLE]
,EST_DEL_DATE
,null
--,[ID]
FROM #TEMP_FACT
--code for avoiding duplicates
--CLEAR ALL DATA FROM FACTBP
DELETE FROM dbo.FactBP
WHERE SHIP_DATE < DATEADD(s,-1,DATEADD(mm,
DATEDIFF(m,0,GETDATE())-2,0)) and SHIP_DATE IS NOT NULL

You need to check against the natural key. Since you're talking about a fact table, the natural key is probably the combination of a lot of fields. If we assume SOURCE and DC_ORDER_NUMBER make up the natural key, this should work:
INSERT INTO dbo.FactBP
SELECT
t.[SOURCE]
, t.[DC_ORDER_NUMBER]
, t.[CUSTOMER_PURCHASE_ORDER_ID]
, t.[BILL_TO]
, t.[CUSTOMER_MASTER_RECORD_TYPE]
, t.[SHIP_TO]
, t.[CUSTOMER_NAME]
, t.[SALES_ORDER]
, t.[ORDER_CARRIER]
, t.[CARRIER_SERVICE_ID]
, t.[CREATE_DATE]
, t.[CREATE_TIME]
, t.[ALLOCATION_DATE]
, t.[REQUESTED_SHIP_DATE]
, t.[ADJ_REQ_SHIP]
, t.[CANCEL_DATE]
, t.[DISPATCH_DATE]
, t.[RELEASED_DATE]
, t.[RELEASED_TIME]
, t.[PRIORITY_ORDER]
, t.[SHIPPING_LOAD_NUMBER]
, t.[ORDER_HDR_STATUS]
, t.[ORDER_STATUS]
, t.[DELIVERY_NUMBER]
, t.[DCMS_ORDER_TYPE]
, t.[ORDER_TYPE]
, t.[MATERIAL]
, t.[QUALITY]
, t.[MERCHANDISE_SIZE_1]
, t.[SPECIAL_PROCESS_CODE_1]
, t.[SPECIAL_PROCESS_CODE_2]
, t.[SPECIAL_PROCESS_CODE_3]
, t.[DIVISION]
, t.[DIVISION_DESC]
, t.[ORDER_QTY]
, t.[ORDER_SELECTED_QTY]
, t.[CARTON_PARCEL_ID]
, t.[CARTON_ID]
, t.[SHIP_DATE]
, t.[SHIP_TIME]
, t.[PACKED_DATE]
, t.[PACKED_TIME]
, t.[ADJ_PACKED_DATE]
, t.[FULL_CASE_PULL_STATUS]
, t.[CARRIER_ID]
, t.[TRAILER_ID]
, t.[WAVE_NUMBER]
, t.[DISPATCH_RELEASE_PRIORITY]
, t.[CARTON_TOTE_COUNT]
, t.[PICK_PACK_METHOD]
, t.[RELEASED_QTY]
, t.[SHIP_QTY]
, t.[MERCHANDISE_STYLE]
, t.[PICK_WAREHOUSE]
, t.[PICK_AREA]
, t.[PICK_ZONE]
, t.[PICK_AISLE]
, t.EST_DEL_DATE
, null
--,[ID]
FROM #TEMP_FACT t
left outer join dbo.FactBP f on f.[SOURCE] = t.[SOURCE]
and f.[DC_ORDER_NUMBER] = t.[DC_ORDER_NUMBER]
where f.[SOURCE] is null
Adjust the join and the WHERE clause to match the natural key of the table.
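As one possible variation, the same check can be written with NOT EXISTS, which some people find easier to extend when the natural key spans more columns. The sketch below is self-contained and uses two hypothetical temp tables, #FactDemo and #StageDemo. CARTON_ID is assumed, purely for illustration, to be a third key column; substitute whatever columns really identify a row in FactBP.

```sql
-- Hypothetical miniature of the real tables: assumed key columns plus one measure.
CREATE TABLE #FactDemo (
    ID              INT IDENTITY(1,1) PRIMARY KEY,  -- generated on insert, like FactBP.ID
    [SOURCE]        VARCHAR(10),
    DC_ORDER_NUMBER INT,
    CARTON_ID       INT,
    SHIP_QTY        INT
);
CREATE TABLE #StageDemo (
    [SOURCE]        VARCHAR(10),
    DC_ORDER_NUMBER INT,
    CARTON_ID       INT,
    SHIP_QTY        INT
);

-- One row is already in the fact table; the staging table holds it again plus one new row.
INSERT INTO #FactDemo ([SOURCE], DC_ORDER_NUMBER, CARTON_ID, SHIP_QTY)
VALUES ('DC1', 100, 1, 5);
INSERT INTO #StageDemo ([SOURCE], DC_ORDER_NUMBER, CARTON_ID, SHIP_QTY)
VALUES ('DC1', 100, 1, 5), ('DC1', 100, 2, 7);

-- Insert only the staging rows whose assumed natural key is not in the fact table yet.
INSERT INTO #FactDemo ([SOURCE], DC_ORDER_NUMBER, CARTON_ID, SHIP_QTY)
SELECT t.[SOURCE], t.DC_ORDER_NUMBER, t.CARTON_ID, t.SHIP_QTY
FROM #StageDemo AS t
WHERE NOT EXISTS (
    SELECT 1
    FROM #FactDemo AS f
    WHERE f.[SOURCE] = t.[SOURCE]
      AND f.DC_ORDER_NUMBER = t.DC_ORDER_NUMBER
      AND f.CARTON_ID = t.CARTON_ID
);

-- Two rows: the existing carton 1 and the newly added carton 2.
SELECT * FROM #FactDemo;

DROP TABLE #StageDemo;
DROP TABLE #FactDemo;
```

One caveat applies to both the LEFT JOIN and the NOT EXISTS form: an equality comparison never matches when a key column is NULL, so rows with NULL in a key column would be inserted again on every run.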
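You should also take another look at your DELETE statement. Do you really want to delete all records with a SHIP_DATE < 2019-07-31 23:59:59.000? Or should that be <=? As written, a row shipped in the last second of the month is kept. Comparing against the first day of the following month is simpler and avoids that boundary problem: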
DELETE FROM dbo.FactBP
WHERE SHIP_DATE < cast(dateadd(day, 1, EOMONTH(getdate(), -3)) as datetime2)
and SHIP_DATE IS NOT NULL
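To compare the two cutoffs side by side, both expressions can be evaluated for a fixed date instead of GETDATE(). The variable @now and the date 15 October 2019 are assumptions, chosen to match the 2019-07-31 value mentioned above. EOMONTH requires SQL Server 2012 or later.

```sql
-- Fixed "current" date so the result is reproducible (an assumption for illustration).
DECLARE @now DATETIME = '20191015';

SELECT
    -- cutoff from the original DELETE: last second of the month three months back
    DATEADD(s, -1, DATEADD(mm, DATEDIFF(m, 0, @now) - 2, 0)) AS original_cutoff,
    -- cutoff from the suggested DELETE: first instant of the month two months back
    CAST(DATEADD(day, 1, EOMONTH(@now, -3)) AS datetime2)    AS suggested_cutoff;

-- original_cutoff : 2019-07-31 23:59:59.000
-- suggested_cutoff: 2019-08-01 00:00:00.0000000
```

With the suggested cutoff, `SHIP_DATE < suggested_cutoff` removes everything through the end of July, including rows stamped in the final second of the month.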

Related

How to display (recursive) data-set in a particular manner?

my brain may not be working today... but I'm trying to get a dataset to be arranged in a particular way. It's easier to show what I mean.
I have a dataset like this:
CREATE TABLE #EXAMPLE (
ID CHAR(11)
, ORDER_ID INT
, PARENT_ORDER_ID INT
);
INSERT INTO #EXAMPLE VALUES
('27KJKR8K3TP', 19517, 0)
, ('27KJKR8K3TP', 10615, 0)
, ('27KJKR8K3TP', 83364, 19517)
, ('27KJKR8K3TP', 96671, 10615)
, ('TXCMK9757JT', 92645, 0)
, ('TXCMK9757JT', 60924, 92645);
SELECT * FROM #EXAMPLE;
DROP TABLE #EXAMPLE;
The PARENT_ORDER_ID field refers back to other orders on the given ID. E.g. ID TXCMK9757JT has order 60924 which is a child order of 92645, which is a separate order on the ID. The way I need this dataset to be arranged is like this:
CREATE TABLE #EXAMPLE (
ID CHAR(11)
, ORDER_ID INT
, CHILD_ORDER_ID INT
);
INSERT INTO #EXAMPLE VALUES
('27KJKR8K3TP', 19517, 19517)
, ('27KJKR8K3TP', 19517, 83364)
, ('27KJKR8K3TP', 10615, 10615)
, ('27KJKR8K3TP', 10615, 96671)
--, ('27KJKR8K3TP', 83364, 83364)
--, ('27KJKR8K3TP', 96671, 96671)
, ('TXCMK9757JT', 92645, 92645)
, ('TXCMK9757JT', 92645, 60924)
--, ('TXCMK9757JT', 60924, 60924)
;
SELECT * FROM #EXAMPLE;
DROP TABLE #EXAMPLE;
In this arrangement of the data set, instead of PARENT_ORDER_ID field there is CHILD_ORDER_ID, which basically lists every single ORDER_ID falling under a given ORDER_ID, including itself. I ultimately would like to have the CHILD_ORDER_ID field be the key for the data set, having only unique values (so that's why I've commented out the CHILD_ORDER_IDs that would only contain themselves, because they have a parent order ID which already contains them).
Any advice on how to achieve the described transformation of the data set would be greatly appreciated! I've tried recursive CTEs and different join statements but I'm not quite getting what I want. Thank you!
You can use a recursive CTE first to walk the hierarchy for each ID, then use a CASE expression to decide which order to show as the parent of each row.
;WITH CTE AS (
SELECT ID,ORDER_ID,PARENT_ORDER_ID
FROM #EXAMPLE
WHERE PARENT_ORDER_ID = 0
UNION ALL
SELECT c.Id,e.ORDER_ID,e.PARENT_ORDER_ID
FROM CTE c
INNER JOIN #EXAMPLE e
ON c.ORDER_ID = e.PARENT_ORDER_ID AND c.Id = e.Id
)
SELECT ID,
(CASE WHEN PARENT_ORDER_ID = 0 THEN ORDER_ID ELSE PARENT_ORDER_ID END) ORDER_ID,
ORDER_ID CHILD_ORDER_ID
FROM CTE
ORDER BY ID
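The query above resolves the parent correctly when the hierarchy is one level deep, as in the sample data. If orders can be nested deeper, and if grandchildren should also be listed under the top-level order (an assumption, the question does not say), one option is to carry the root order through the recursion. The table #EXAMPLE_DEEP and the order 11111 are hypothetical.

```sql
CREATE TABLE #EXAMPLE_DEEP (
      ID CHAR(11)
    , ORDER_ID INT
    , PARENT_ORDER_ID INT
);
INSERT INTO #EXAMPLE_DEEP VALUES
      ('TXCMK9757JT', 92645, 0)
    , ('TXCMK9757JT', 60924, 92645)
    , ('TXCMK9757JT', 11111, 60924);  -- hypothetical grandchild, not in the original data

WITH CTE AS (
    -- anchor: top-level orders are their own root
    SELECT ID, ORDER_ID AS ROOT_ORDER_ID, ORDER_ID
    FROM #EXAMPLE_DEEP
    WHERE PARENT_ORDER_ID = 0
    UNION ALL
    -- recursive step: each child inherits the root of its parent
    SELECT c.ID, c.ROOT_ORDER_ID, e.ORDER_ID
    FROM CTE AS c
    INNER JOIN #EXAMPLE_DEEP AS e
        ON e.PARENT_ORDER_ID = c.ORDER_ID
       AND e.ID = c.ID
)
SELECT ID
     , ROOT_ORDER_ID AS ORDER_ID
     , ORDER_ID      AS CHILD_ORDER_ID
FROM CTE
ORDER BY ID, ROOT_ORDER_ID;

DROP TABLE #EXAMPLE_DEEP;
```

For this data all three rows come back with ORDER_ID = 92645, so CHILD_ORDER_ID stays unique and every order is attached to its top-level ancestor.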

How do I add a partition to my Hive table?

I'm creating a table in Hive but unsure of the syntax to add a partition. Here is a simplified version of the create table statement:
CREATE TABLE sales.newtable AS
SELECT report_date
, SUM(cost_amt) AS cost_amt
, SUM(vendor_revenue_amt) AS vendor_revenue_amt
, SUM(gcr_amt) AS gcr_amt
, first_name
, last_name
, emailhash
FROM bi_reports.datorama_affiliate_mart AS orders
WHERE report_date >= '2019-01-01'
AND data_stream_name <> 'uds_order'
GROUP BY report_date
, first_name
, last_name
, emailhash
;
CREATE TABLE AS SELECT with a partitioned target table is supported only since Hive 3.2.0 (see HIVE-20241).
For earlier Hive versions, create the table separately, then load the data using INSERT.
See the manual: Create Table As Select (CTAS)
CREATE-TABLE-AS-SELECT does not support partitioning (not sure about the latest version).
FAILED: SemanticException [Error 10068]: CREATE-TABLE-AS-SELECT does
not support partitioning in the target table
Instead, you can create sales.newtable first. Keep in mind that the partition column must be the last column in your table definition and the last column in your insert query as well.
Let's say emailhash is the partition column of the table.
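The table definition that this step refers to is not shown, so the following is only a sketch of what it could look like. The column types are assumptions (the post does not state them); the point is that emailhash appears only in the PARTITIONED BY clause, not in the regular column list.

```sql
-- Column types are assumptions; adjust them to match the source table.
CREATE TABLE sales.newtable (
      report_date        STRING
    , cost_amt           DOUBLE
    , vendor_revenue_amt DOUBLE
    , gcr_amt            DOUBLE
    , first_name         STRING
    , last_name          STRING
)
PARTITIONED BY (emailhash STRING);  -- partition column is declared here, so it comes last
```

Because Hive appends partition columns after the regular columns, the SELECT that feeds the table has to list emailhash last as well.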
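Then insert the data with dynamic partitioning enabled:
set hive.exec.dynamic.partition=true;
-- needed when every partition column is dynamic, unless nonstrict is already the default
set hive.exec.dynamic.partition.mode=nonstrict;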
insert overwrite table sales.newtable PARTITION(emailhash)
SELECT report_date
, SUM(cost_amt) AS cost_amt
, SUM(vendor_revenue_amt) AS vendor_revenue_amt
, SUM(gcr_amt) AS gcr_amt
, first_name
, last_name
, emailhash
FROM bi_reports.datorama_affiliate_mart AS orders
WHERE report_date >= '2019-01-01'
AND data_stream_name <> 'uds_order'
GROUP BY report_date
, first_name
, last_name
, emailhash;

How to get last data in double data and delete the other with query

Table A contains rows that are duplicated two or more times. I want to keep the most recently inserted row and delete the older ones. How can I do that?
I tried SELECT with DISTINCT and an inner join, but when I run the delete, both the old rows and the latest row are removed, so this does not work.
Select * from A where po in (select max(po) from B)
The result is invalid data.
First, I created a new table from the distinct rows of table A.
create table c AS
select distinct po, plan_ref from A group by po, plan_ref
After that, copy table A. Let's call the copy table B.
Then delete all the data in table A and re-insert it from the copy:
INSERT INTO A (id, po
, plan_ref
, cust_order
, cust_code
, cust_name
, destination
, art_no
, art_name
, cust_reqdate
, posdd
, podd
, ship_date
, container
, ship
, plant_cell
, cbm
, remark
, upload_by
, upload_date)
select MAX(a.id), MAX(a.po)
, MAX(a.plan_ref)
, MAX(a.cust_order)
, MAX(a.cust_code)
, MAX(a.cust_name)
, MAX(a.destination)
, MAX(a.art_no)
, MAX(a.art_name)
, MAX(a.cust_reqdate)
, MAX(a.posdd)
, MAX(a.podd)
, MAX(a.ship_date)
, MAX(a.container)
, MAX(a.ship)
, MAX(a.plant_cell)
, MAX(a.cbm)
, MAX(a.remark)
, MAX(a.upload_by)
, MAX(a.upload_date) from C b
inner join B a on a.plan_ref = b.plan_ref
AND a.po = b.po GROUP BY a.po
If you want to delete all but the maximum po (as suggested by your code), then you can do:
delete from a
where po < (select max(a2.po) from a a2);
More commonly, you would want to keep the maximum po based on some other column. For that, use a correlated subquery:
delete from a
where po < (select max(a2.po) from a a2 where a2.? = a.?); -- ? is for the grouping column
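As a concrete illustration of the correlated form, the sketch below assumes that id grows with each upload and that po plus plan_ref identify a duplicate group; neither is confirmed by the question. The table a_demo and its rows are hypothetical and trimmed to a few columns.

```sql
-- Hypothetical, trimmed-down copy of table A.
CREATE TABLE a_demo (
      id       INT PRIMARY KEY
    , po       VARCHAR(20)
    , plan_ref VARCHAR(20)
    , remark   VARCHAR(50)
);
INSERT INTO a_demo (id, po, plan_ref, remark) VALUES
      (1, 'PO-1', 'R1', 'first upload')
    , (2, 'PO-1', 'R1', 'second upload')
    , (3, 'PO-2', 'R1', 'only upload');

-- Delete every row that has a newer row (higher id) with the same po and plan_ref.
DELETE FROM a_demo
WHERE id < (SELECT MAX(a2.id)
            FROM a_demo a2
            WHERE a2.po = a_demo.po
              AND a2.plan_ref = a_demo.plan_ref);

-- Rows 2 and 3 remain: the latest row of each (po, plan_ref) group.
SELECT id, po, plan_ref, remark FROM a_demo ORDER BY id;

DROP TABLE a_demo;
```

Unlike the INSERT ... GROUP BY approach, this keeps each surviving row intact rather than combining the maximum of every column from different rows. Note that MySQL rejects a DELETE whose subquery reads from the table being deleted from (error 1093), so there the subquery would need to be wrapped in a derived table or replaced with a join.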

Why SQL Job Scheduling is not moving all the records some time?

I scheduled the job to run daily. When I check the records the next day, on some days not all of them have been moved, and I don't know why.
Here is my query
INSERT INTO [HQWebMatajer].[dbo].[F_ItemDailySalesParent]
(
[ItemID]
,[StoreID]
,[ItemLookupCode]
,[DepartmentID]
,[CategoryID]
,[SupplierID]
,[Time]
,[Qty]
,[ExtendedPrice]
,[ExtendedCost]
)
SELECT
[ItemID]
,[StoreID]
,[ItemLookupCode]
,[DepartmentID]
,[CategoryID]
,[SupplierID]
,[Time]
,[Qty]
,[ExtendedPrice]
,[ExtendedCost]
FROM
[HQMatajer].[dbo].[JC_ItemDailySalesParent]
where time=convert(Date,getdate()-1)
Total records found in [JC_ItemDailySalesParent] = 21027 and total records in [F_ItemDailySalesParent] = 18741 on 06-March-2017.
In case you think I might have missed a column or something else: when I run the same query with the condition changed to where time=convert(Date,getdate()), it copies every record without missing any.
Note: both tables are loaded by scheduled jobs. The [JC_ItemDailySalesParent] job runs at 2 am. The F_ItemDailySalesParent job runs in the early morning at 6 o'clock.
Don't ask me why there are two tables with the same records. They serve different purposes.
Thanks,
Always try to have a unique identifier in the table. Assuming the combination of ItemID, StoreID, ItemLookupCode, DepartmentID, CategoryID, SupplierID and Time is unique, you can insert only the rows that are not yet in the target table:
INSERT INTO HQWebMatajer.dbo.F_ItemDailySalesParent
( ItemID
, StoreID
, ItemLookupCode
, DepartmentID
, CategoryID
, SupplierID
, Time
, Qty
, ExtendedPrice
, ExtendedCost
)
SELECT src.ItemID
, src.StoreID
, src.ItemLookupCode
, src.DepartmentID
, src.CategoryID
, src.SupplierID
, src.Time
, src.Qty
, src.ExtendedPrice
, src.ExtendedCost
FROM HQMatajer.dbo.JC_ItemDailySalesParent src
LEFT JOIN HQWebMatajer.dbo.F_ItemDailySalesParent tgt
ON tgt.ItemID = src.ItemID
AND tgt.StoreID = src.StoreID
AND tgt.ItemLookupCode = src.ItemLookupCode
AND tgt.DepartmentID = src.DepartmentID
AND tgt.CategoryID = src.CategoryID
AND tgt.SupplierID = src.SupplierID
AND tgt.Time = src.Time
WHERE tgt.ItemID IS NULL;
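To find out which days are affected, it can help to compare row counts per day between the two tables. The query below is a sketch that uses only the table and column names from the question, and it assumes [Time] holds one date value per day; it returns the days on which the counts differ.

```sql
SELECT s.[Time]
     , s.source_rows
     , ISNULL(t.target_rows, 0) AS target_rows
FROM (
    -- rows per day in the source table
    SELECT [Time], COUNT(*) AS source_rows
    FROM HQMatajer.dbo.JC_ItemDailySalesParent
    GROUP BY [Time]
) AS s
LEFT JOIN (
    -- rows per day in the target table
    SELECT [Time], COUNT(*) AS target_rows
    FROM HQWebMatajer.dbo.F_ItemDailySalesParent
    GROUP BY [Time]
) AS t
    ON t.[Time] = s.[Time]
WHERE s.source_rows <> ISNULL(t.target_rows, 0)
ORDER BY s.[Time];
```

A mismatch limited to certain days would be consistent with the source still being loaded when the copy ran, but that is only a guess that the counts can help confirm or rule out.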

How to query newest item from table with duplicate items?

I have to deal with data that is being dumped to a "log" table within SQL Server. Unfortunately, I can't make changes to that process. Basically, it runs daily and dumps some duplicate items into a table.
Table 1:
import_id: guid
import_at: datetime
Table 2:
item_id: guid
import_id: guid (foreign key)
item_url: varchar(1000)
item_name: varchar(50)
item_description: varchar(1000)
Sometimes Table 2 will have a duplicate item_url. I only want to get the list of item_id and item_url from the newest import.
The query below will return one row per item_url, the one with the latest import_at value:
WITH all_items AS (
SELECT
t1.import_id
, t1.import_at
, t2.item_id
, t2.item_url
, t2.item_name
, t2.item_description
, ROW_NUMBER() OVER(PARTITION BY item_url ORDER BY t1.import_at DESC) AS item_url_rank
FROM dbo.table1 AS t1
JOIN dbo.table2 AS t2 ON
t2.import_id = t1.import_id
)
SELECT
import_id
, import_at
, item_id
, item_url
, item_name
, item_description
FROM all_items
WHERE
item_url_rank = 1;
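If "the newest import" is meant literally, i.e. only the items that belong to the single most recent row in Table 1, a simpler query may be enough. This is a sketch under that reading; dbo.table1 and dbo.table2 are placeholder names standing for Table 1 and Table 2.

```sql
SELECT
    t2.item_id
  , t2.item_url
FROM dbo.table2 AS t2
WHERE t2.import_id = (
    -- the import with the latest timestamp; ties are broken arbitrarily
    SELECT TOP (1) t1.import_id
    FROM dbo.table1 AS t1
    ORDER BY t1.import_at DESC
);
```

This returns nothing for URLs that were only present in earlier imports, whereas the ROW_NUMBER() approach returns the latest known row for every item_url.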