Delete duplicates in the I$ table in ODI

We have a load plan in ODI. Some of our scenarios repeatedly fail because of duplicate records in the I$ table. Every time the load plan fails, we manually run this script:
DELETE FROM adw12_dw.I$_1558911580_4
WHERE (EFFECTIVE_FROM_DT, DATASOURCE_NUM_ID, INTEGRATION_ID) IN
      ( SELECT EFFECTIVE_FROM_DT,
               DATASOURCE_NUM_ID,
               INTEGRATION_ID
        FROM adw12_dw.I$_1558911580_4
        GROUP BY EFFECTIVE_FROM_DT,
                 DATASOURCE_NUM_ID,
                 INTEGRATION_ID
        HAVING COUNT (1) > 1)
  AND ROWID NOT IN
      ( SELECT MIN (ROWID)
        FROM adw12_dw.I$_1558911580_4
        GROUP BY EFFECTIVE_FROM_DT,
                 DATASOURCE_NUM_ID,
                 INTEGRATION_ID
        HAVING COUNT (1) > 1);
commit;
Is there a way to automate the deletion of duplicate records in the Integration table?

If the source contains duplicates, the best approach is to handle them in the logic of the mapping.
One option is to add an expression component with a row_rank column that ranks the duplicates using an analytic function: row_number() over (partition by EFFECTIVE_FROM_DT, DATASOURCE_NUM_ID, INTEGRATION_ID order by ROWID).
You can then add a filter with the condition row_rank = 1.
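A minimal sketch of the ranking idea, run against an in-memory SQLite database through Python's sqlite3 module rather than through an ODI mapping. The table src, its payload column and the sample rows are made up for illustration; only the three key columns come from the question. It assumes SQLite 3.25 or later, which is when window functions were added.

```python
import sqlite3

# In-memory SQLite stands in for the Oracle source (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE src (effective_from_dt TEXT, datasource_num_id INTEGER,"
    " integration_id TEXT, payload TEXT)"
)
conn.executemany(
    "INSERT INTO src VALUES (?, ?, ?, ?)",
    [
        ("2019-01-01", 1, "A", "first"),
        ("2019-01-01", 1, "A", "second"),  # same key as the row above
        ("2019-01-01", 1, "B", "only"),
    ],
)

# Rank the rows inside each key group, then keep only rank 1.
DEDUP_SQL = """
SELECT effective_from_dt, datasource_num_id, integration_id, payload
FROM (
    SELECT s.*,
           ROW_NUMBER() OVER (
               PARTITION BY effective_from_dt, datasource_num_id, integration_id
               ORDER BY rowid
           ) AS row_rank
    FROM src s
)
WHERE row_rank = 1
ORDER BY integration_id
"""
deduped = conn.execute(DEDUP_SQL).fetchall()
```

In ODI the inner SELECT would correspond to the expression component and the outer WHERE to the filter component.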
If you prefer to delete the duplicates after the rows have been inserted into the I$ table, you can edit the IKM and add the delete step before the step that loads the target table.
You could also split the integration into three steps:
a mapping that loads a staging table instead of your final target table, duplicates included
an ODI procedure that deletes the duplicates from the staging table
a mapping that loads the data from the staging table into the target table
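The three steps can be sketched outside ODI as follows. This is only an illustration on an in-memory SQLite database; the tables stg and tgt, their columns and the sample rows are hypothetical, and in ODI each step would be a mapping or a procedure rather than a Python call.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE stg (integration_id TEXT, amount INTEGER);
CREATE TABLE tgt (integration_id TEXT PRIMARY KEY, amount INTEGER);
""")

# Step 1: load the staging table, duplicates included.
conn.executemany(
    "INSERT INTO stg VALUES (?, ?)",
    [("A", 10), ("A", 10), ("B", 20)],
)

# Step 2: delete every row except the one with the lowest rowid per key.
conn.execute("""
DELETE FROM stg
WHERE rowid NOT IN (SELECT MIN(rowid) FROM stg GROUP BY integration_id)
""")

# Step 3: load the target from the cleaned staging table.
conn.execute("INSERT INTO tgt SELECT integration_id, amount FROM stg")
conn.commit()

target_rows = conn.execute("SELECT * FROM tgt ORDER BY integration_id").fetchall()
```

Without step 2, the insert in step 3 would fail on the primary key of tgt, which is the same kind of failure the load plan runs into.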

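If a key can occur more than twice, a delete that removes only one row per group has to be executed repeatedly until no duplicates remain. For example:
CREATE OR REPLACE PROCEDURE delete_duplicates
IS
BEGIN
  -- each pass removes the row with the highest ID from every duplicated group
  DELETE FROM TABLE1 WHERE ID IN
  (
    SELECT max(ID) FROM TABLE1
    GROUP BY USER_ID, TYPE_ID
    HAVING count(*) > 1
  );
  -- if anything was deleted, some groups may still hold duplicates: run again
  IF (SQL%ROWCOUNT > 0) THEN
    delete_duplicates;
  END IF;
END delete_duplicates;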
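To see why more than one pass is needed, here is a small sketch of the same idea written as a loop instead of a recursive procedure. It runs on an in-memory SQLite database; the sample rows are invented, and the function name is only illustrative.

```python
import sqlite3


def delete_duplicates(conn):
    """Repeat the max(id)-per-group delete until a pass removes nothing.

    Returns the number of passes that actually deleted rows.
    """
    passes = 0
    while True:
        cur = conn.execute("""
            DELETE FROM table1 WHERE id IN (
                SELECT MAX(id) FROM table1
                GROUP BY user_id, type_id
                HAVING COUNT(*) > 1
            )
        """)
        if cur.rowcount == 0:
            return passes
        passes += 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE table1 (id INTEGER PRIMARY KEY, user_id INTEGER, type_id INTEGER)"
)
# Three copies of (7, 1) and a single (8, 1).
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(1, 7, 1), (2, 7, 1), (3, 7, 1), (4, 8, 1)],
)
passes_needed = delete_duplicates(conn)
remaining = [row[0] for row in conn.execute("SELECT id FROM table1 ORDER BY id")]
```

With three copies of a key, the first pass removes one copy and the second pass removes another, so a single execution of the delete would still leave a duplicate behind.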

Update a single row in a table in SQL

So, I am creating a new table that gets populated from another table. NewTableA.ColA is getting populated from an existing OldTableB.ColB
Source query that populates NewTableA.ColA:
SELECT TOP (1) EXEC_END_TIME
FROM CR_STAT_EXECUTION AS cse
WHERE (EXEC_NAME = 'ETL')
ORDER BY EXEC_END_TIME DESC
Destination Table (NewTableA.ColA) When scripted out:
SELECT TOP 1 [EXEC_END_TIME]
FROM [SSISHelper].[dbo].[ETLTimeCheck]
ORDER BY EXEC_END_TIME DESC
The problem is that I want NewTableA to hold only one row, whose ColA is updated with the current value from the other table. I have already set up an SSIS job that populates the table every day from OldTableB.ColB, but I can't figure out how to update that single row instead of adding a new one each time.
Thanks.
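Use an IF condition in SQL.
Example:
IF EXISTS (SELECT 1 FROM [SSISHelper].[dbo].[ETLTimeCheck])
BEGIN
    -- a row already exists: overwrite it with the latest end time
    UPDATE [SSISHelper].[dbo].[ETLTimeCheck]
    SET EXEC_END_TIME = (SELECT TOP (1) EXEC_END_TIME FROM CR_STAT_EXECUTION
                         WHERE EXEC_NAME = 'ETL' ORDER BY EXEC_END_TIME DESC);
END
ELSE
BEGIN
    -- the table is still empty: insert the first row
    INSERT INTO [SSISHelper].[dbo].[ETLTimeCheck] (EXEC_END_TIME)
    SELECT TOP (1) EXEC_END_TIME FROM CR_STAT_EXECUTION
    WHERE EXEC_NAME = 'ETL' ORDER BY EXEC_END_TIME DESC;
END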
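The update-or-insert pattern can be tried out in a few lines. This sketch uses an in-memory SQLite database instead of SQL Server; the table name etl_time_check, the helper store_latest and the timestamps are assumptions made for the example.

```python
import sqlite3


def store_latest(conn, end_time):
    """Keep exactly one row in etl_time_check holding the latest end time."""
    exists = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM etl_time_check)"
    ).fetchone()[0]
    if exists:
        # A row is already there: overwrite it.
        conn.execute("UPDATE etl_time_check SET exec_end_time = ?", (end_time,))
    else:
        # First run: create the single row.
        conn.execute(
            "INSERT INTO etl_time_check (exec_end_time) VALUES (?)", (end_time,)
        )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE etl_time_check (exec_end_time TEXT)")
store_latest(conn, "2020-01-01 02:00:00")  # first daily run inserts
store_latest(conn, "2020-01-02 02:00:00")  # later runs update in place
rows = conn.execute("SELECT exec_end_time FROM etl_time_check").fetchall()
```

However many times the job runs, the table keeps a single row with the most recent value.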

IF Conditional to Run Schedule Query

I'm using BigQuery. I have a query-scheduler to generate a table (RESULT TABLE) that depends on another table (SOURCE TABLE). The case is, this source table doesn't always have data, there's a possibility that this source table is empty.
I want to Schedule the Query to make the RESULT TABLE only if there's data in SOURCE TABLE.
The example would be:
IF COUNT(1) FROM data.source_table > 0 THEN RUN:
SELECT *
FROM data.source_table
LEFT JOIN data.other_source_table
ELSE [Don't Run]
Thanks in Advance
The syntax is
IF condition THEN [sql_statement_list]
[ELSEIF condition THEN sql_statement_list]
[ELSEIF condition THEN sql_statement_list]...
[ELSE sql_statement_list]
END IF;
So for your case it's
IF (SELECT COUNT(1) FROM data.source_table) > 0
THEN
  SELECT *
  FROM data.source_table
  LEFT JOIN data.other_source_table;
END IF;
For more details, you can read https://cloud.google.com/bigquery/docs/reference/standard-sql/scripting#if
At the moment you can't set a destination table when using BigQuery scripting, so solutions based on the IF statement will not work for your case.
Besides that, it seems that when you set a destination table, BigQuery creates the table before the query runs, so the table is created regardless of the results.
The query below is plain SQL; it doesn't contain any scripting. If you use it to create a scheduled query and set a destination table, you will see that an empty table is created even when the subquery is not run.
SELECT
  *
FROM
  UNNEST(
    (SELECT
      (
        CASE (SELECT COUNT(1) FROM data.source_table) > 0
          WHEN TRUE
          THEN (
            SELECT ARRAY(
              SELECT AS STRUCT *
              FROM data.source_table
              LEFT JOIN data.other_source_table)
          )
        END
      )
    )
  )
As a workaround, you could keep your existing scheduled query and create another scheduled query just like below to run some minutes after the first one:
IF (SELECT count(1) FROM `dataset.destination_table`) = 0
THEN DROP TABLE `dataset.destination_table`;
END IF;
To summarize, your solution would be:
Run a scheduled query that creates the destination table.
A few minutes later, run a second scheduled query that checks whether the created table is empty and, if so, drops it.
I hope it helps
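The create-then-drop workaround can be sketched locally. This is not BigQuery code: an in-memory SQLite database plays the role of the dataset, and the helpers build_result and table_exists as well as the table names are hypothetical. In BigQuery the two halves of build_result would be two separate scheduled queries.

```python
import sqlite3


def table_exists(conn, name):
    """Return True if a table with this name is present in the database."""
    rows = conn.execute(
        "SELECT 1 FROM sqlite_master WHERE type = 'table' AND name = ?", (name,)
    ).fetchall()
    return len(rows) > 0


def build_result(conn):
    """Create result_table from source_table, then drop it again if it is empty."""
    # First "scheduled query": always creates the destination table.
    conn.execute("DROP TABLE IF EXISTS result_table")
    conn.execute("CREATE TABLE result_table AS SELECT * FROM source_table")
    # Second "scheduled query": remove the table when nothing was loaded.
    count = conn.execute("SELECT COUNT(1) FROM result_table").fetchall()[0][0]
    if count == 0:
        conn.execute("DROP TABLE result_table")
    return count > 0


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE source_table (id INTEGER)")

kept_when_empty = build_result(conn)
exists_when_empty = table_exists(conn, "result_table")

conn.execute("INSERT INTO source_table VALUES (1)")
kept_when_filled = build_result(conn)
exists_when_filled = table_exists(conn, "result_table")
```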

Removing duplicates and keeping one copy

I have been going through the threads about removing duplicates from a table while keeping one copy. I am looking for an illustration of the case where the table has a composite key. Does anyone have an idea?
The table contr has the composite key checkno, salary_month, salary_year.
delete (select * from CONTR t1
INNER JOIN
(select CHECKNO, SALARY_YEAR,SALARY_MONTH FROM CONTR
group by CHECKNO, SALARY_YEAR,SALARY_MONTH HAVING COUNT(*) > 1) dupes
ON
t1.CHECKNO = dupes.CHECKNO AND
t1.SALARY_YEAR= dupes.SALARY_YEAR AND
t1.SALARY_MONTH=dupes.SALARY_MONTH);
I expected one duplicate to be removed and one maintained.
You can use the query below to remove the duplicates, using rowid as a column that is unique for every row:
delete contr t1
where rowid <
(
select max(rowid)
from contr t2
where t2.checkno = t1.checkno
and t2.salary_year = t1.salary_year
and t2.salary_month = t1.salary_month
);
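A quick way to check the behaviour of this delete is to run it on a small table. The sketch below uses an in-memory SQLite database, which also exposes a rowid for ordinary tables; the sample rows are invented. The outer table is referenced by name here instead of through the alias t1, only to stay within syntax that SQLite accepts across versions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contr (checkno INTEGER, salary_year INTEGER, salary_month INTEGER)"
)
# Three copies of one key and a single copy of another.
conn.executemany(
    "INSERT INTO contr VALUES (?, ?, ?)",
    [(1, 2019, 1), (1, 2019, 1), (1, 2019, 1), (2, 2019, 1)],
)

# Delete every row that is not the highest rowid within its key group.
conn.execute("""
DELETE FROM contr
WHERE rowid < (
    SELECT MAX(t2.rowid)
    FROM contr t2
    WHERE t2.checkno = contr.checkno
      AND t2.salary_year = contr.salary_year
      AND t2.salary_month = contr.salary_month
)
""")
remaining = conn.execute(
    "SELECT checkno, salary_year, salary_month FROM contr ORDER BY checkno"
).fetchall()
```

Unlike a delete that removes one row per group, this form handles any number of copies in a single statement.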
Another way to achieve this, assuming the duplicates are defined by the three columns you mentioned:
Create a temp table with distinct values
Drop your table
Rename the temp table
Especially if you are dealing with a huge volume of data, this would be a lot faster than a delete.
If the duplicated data is only a subset of your main table, the steps would be:
Create a temp table with distinct values
Delete all duplicated rows from the main table
Insert the data from the temp table into the main table
The SQL for the first step would be:
create table tmp_CONTR AS
select distinct CHECKNO, SALARY_YEAR,SALARY_MONTH -- this part can be modified to match your needs
from CONTR t1;
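The create, drop and rename sequence could look like the following sketch. It runs on an in-memory SQLite database with invented rows; on Oracle the rename syntax differs, and indexes, constraints and grants on the original table would have to be recreated, which this example does not cover.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE contr (checkno INTEGER, salary_year INTEGER, salary_month INTEGER)"
)
conn.executemany(
    "INSERT INTO contr VALUES (?, ?, ?)",
    [(1, 2019, 1), (1, 2019, 1), (2, 2019, 1)],
)
conn.commit()

conn.executescript("""
-- 1. create a temp table with the distinct rows
CREATE TABLE tmp_contr AS
    SELECT DISTINCT checkno, salary_year, salary_month FROM contr;
-- 2. drop the original table
DROP TABLE contr;
-- 3. rename the temp table to the original name
ALTER TABLE tmp_contr RENAME TO contr;
""")
row_count = conn.execute("SELECT COUNT(*) FROM contr").fetchone()[0]
```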

sql server to delete a record and add sum of value on trigger

I have two tables in my database, bill_detail and bill_log. I want to delete one record from bill_log and then trigger an action that updates bill_detail. My delete statement is the following:
DELETE FROM [mydatabase].[dbo].[Bill_Log]
WHERE [mydatabase].[dbo].[Bill_Log].[CU_BILL_ID] in
(SELECT
FROM [mydatabase].[dbo].[Bill_Log],[mydatabase].[dbo].[Bill_Detail]
where [mydatabase].[dbo].[Bill_Log].bill_id=37
and [mydatabase].[dbo].[Bill_Log].bill_id=[mydatabase].[dbo].[CU_Bill_Detail].cu_bill_id
and [mydatabase].[dbo].[Bill_Detail].Pay_date>20130206
and [CL_Com_Rec_Description] like '%اoffpage%'
and [mydatabase].[dbo].[Bill_Log].amount<0
and [mydatabase].[dbo].[Bill_Log].[Com_Act_Date]='2013/02/07')
go
CREATE TRIGGER [mydatabase].[dbo].[Bill_Log]
ON [mydatabase].[dbo].[Bill_Log]]
AFTER Delete
AS
---
BEGIN
-- get 'amount' from deleted record and sum it to field 'amount' of bill detail
END
But the delete statement fails with the following error:
'Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.'
I don't know how to fix the error or how to do the second part.
You only need to get a list of CU_BILL_ID to search. So remove all other fields from inner query and just select CU_BILL_ID.
DELETE FROM [mydatabase].[dbo].[CU_Bill_Log]
WHERE [CU_BILL_ID] IN
      -- the subquery returns a single column, as IN requires;
      -- it also takes the place of the join to CU_Bill_Detail
      (SELECT cu_bill_id
       FROM [mydatabase].[dbo].[CU_Bill_Detail]
       WHERE Pay_date > 13930206)
  AND cu_bill_id = 37
  AND [CL_Com_Rec_Description] LIKE '%اoffpage%'
  AND amount < 0
  AND [CL_Com_Act_Date] = '2013/02/07'
go
Try this please.
If you want to use the "in" keyword in your main query, the subquery must return just one column:
Select ID,F_Name,L_Name
From Clients
Where ID in(
Select ClientID
From Orders
Where OrderNo > 120
)
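For the second part of the question, adding the deleted amount to bill_detail, a trigger along these lines might work. This is a sketch in SQLite, not SQL Server: SQLite fires the trigger once per deleted row and exposes that row as OLD, whereas a SQL Server AFTER DELETE trigger fires once per statement and reads the removed rows from the deleted pseudo-table. The simplified table layouts and sample values are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE bill_detail (cu_bill_id INTEGER PRIMARY KEY, amount INTEGER);
CREATE TABLE bill_log (log_id INTEGER PRIMARY KEY, cu_bill_id INTEGER, amount INTEGER);

-- After each deleted log row, add its amount to the matching bill_detail row.
CREATE TRIGGER bill_log_after_delete AFTER DELETE ON bill_log
BEGIN
    UPDATE bill_detail
    SET amount = amount + OLD.amount
    WHERE cu_bill_id = OLD.cu_bill_id;
END;

INSERT INTO bill_detail VALUES (37, 100);
INSERT INTO bill_log VALUES (1, 37, -30), (2, 37, -20);
""")

# The IN subquery returns exactly one column.
conn.execute("""
DELETE FROM bill_log
WHERE cu_bill_id IN (SELECT cu_bill_id FROM bill_detail)
  AND amount < 0
""")
detail_amount = conn.execute(
    "SELECT amount FROM bill_detail WHERE cu_bill_id = 37"
).fetchone()[0]
```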

Fastest check if row exists in PostgreSQL

I have a bunch of rows that I need to insert into table, but these inserts are always done in batches. So I want to check if a single row from the batch exists in the table because then I know they all were inserted.
So it's not a primary key check, but that shouldn't matter too much. I only want to check for a single row, so count(*) probably isn't a good fit; I guess it should be something like exists.
But since I'm fairly new to PostgreSQL I'd rather ask people who know.
My batch contains rows with following structure:
userid | rightid | remaining_count
So if table contains any rows with provided userid it means they all are present there.
Use the EXISTS keyword for a TRUE / FALSE return:
select exists(select 1 from contact where id=12)
How about simply:
select 1 from tbl where userid = 123 limit 1;
where 123 is the userid of the batch that you're about to insert.
The above query will return either an empty set or a single row, depending on whether there are records with the given userid.
If this turns out to be too slow, you could look into creating an index on tbl.userid.
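A small sketch of the existence check together with the suggested index. It uses an in-memory SQLite database through Python's sqlite3 module as a stand-in for PostgreSQL; the helper batch_already_inserted, the index name and the sample row are made up.

```python
import sqlite3


def batch_already_inserted(conn, userid):
    """Return True if at least one row for this userid is already stored."""
    row = conn.execute(
        "SELECT EXISTS (SELECT 1 FROM tbl WHERE userid = ?)", (userid,)
    ).fetchone()
    return bool(row[0])


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE tbl (userid INTEGER, rightid INTEGER, remaining_count INTEGER)"
)
# The index lets the lookup stop at the first matching entry.
conn.execute("CREATE INDEX tbl_userid_idx ON tbl (userid)")
conn.execute("INSERT INTO tbl VALUES (123, 1, 5)")

found = batch_already_inserted(conn, 123)
missing = batch_already_inserted(conn, 999)
```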
"if even a single row from batch exists in table, in that case I don't have to insert my rows because I know for sure they all were inserted."
For this to remain true even if your program gets interrupted mid-batch, I'd recommend that you make sure you manage database transactions appropriately (i.e. that the entire batch gets inserted within a single transaction).
INSERT INTO target( userid, rightid, count )
SELECT userid, rightid, count
FROM batch
WHERE NOT EXISTS (
SELECT * FROM target t2, batch b2
WHERE t2.userid = b2.userid
-- ... other keyfields ...
)
;
BTW: if you want the whole batch to fail in case of a duplicate, then (given a primary key constraint)
INSERT INTO target( userid, rightid, count )
SELECT userid, rightid, count
FROM batch
;
will do exactly what you want: either it succeeds, or it fails.
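The all-or-nothing behaviour depends on the batch running inside one transaction. Here is a minimal sketch of that, using an in-memory SQLite database in place of PostgreSQL; the helper insert_batch, the composite primary key and the sample batches are assumptions for the example.

```python
import sqlite3


def insert_batch(conn, rows):
    """Insert the whole batch in one transaction; return False if it was rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO target (userid, rightid, remaining_count)"
                " VALUES (?, ?, ?)",
                rows,
            )
        return True
    except sqlite3.IntegrityError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE target (userid INTEGER, rightid INTEGER,"
    " remaining_count INTEGER, PRIMARY KEY (userid, rightid))"
)
first = insert_batch(conn, [(1, 10, 3), (1, 11, 5)])
# The second row of this batch collides with an existing key,
# so the first row of the batch is rolled back as well.
second = insert_batch(conn, [(2, 10, 1), (1, 10, 9)])
total = conn.execute("SELECT COUNT(*) FROM target").fetchone()[0]
```

Because a failed batch leaves no rows behind, finding a single row of a batch is enough to conclude that the whole batch is present.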
If performance is a concern, you could use PERFORM inside a PL/pgSQL function, like this:
PERFORM 1 FROM skytf.test_2 WHERE id=i LIMIT 1;
IF FOUND THEN
RAISE NOTICE ' found record id=%', i;
ELSE
RAISE NOTICE ' not found record id=%', i;
END IF;
As @MikeM pointed out:
select exists(select 1 from contact where id=12)
With an index on contact(id), this can usually bring the lookup time down to about 1 ms.
CREATE INDEX index_contact on contact(id);
SELECT 1 FROM user_right where userid = ? LIMIT 1
If your resultset contains a row then you do not have to insert. Otherwise insert your records.
select true from tablename where condition limit 1;
I believe that this is the query that postgres uses for checking foreign keys.
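In your case, you could do this in one go too. Note that a scalar subquery that finds no row yields NULL, and NOT NULL is not true, so the check has to be written with NOT EXISTS:
insert into yourtable select $userid, $rightid, $count where not exists (select 1 from yourtable where userid = $userid);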
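A sketch of the check and the insert combined into one statement, written with NOT EXISTS and bound parameters. It runs on an in-memory SQLite database rather than PostgreSQL; the helper insert_if_new_user and the sample values are hypothetical.

```python
import sqlite3


def insert_if_new_user(conn, userid, rightid, count):
    """Insert the row only when no row for this userid exists; return True if inserted."""
    cur = conn.execute(
        "INSERT INTO yourtable (userid, rightid, remaining_count) "
        "SELECT ?, ?, ? "
        "WHERE NOT EXISTS (SELECT 1 FROM yourtable WHERE userid = ?)",
        (userid, rightid, count, userid),
    )
    return cur.rowcount == 1


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE yourtable (userid INTEGER, rightid INTEGER, remaining_count INTEGER)"
)
inserted_first = insert_if_new_user(conn, 123, 1, 5)   # no row yet for user 123
inserted_again = insert_if_new_user(conn, 123, 2, 7)   # user 123 is already present
row_count = conn.execute("SELECT COUNT(*) FROM yourtable").fetchone()[0]
```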