SQL: pointing a duplicate row to the ID of its original

I have a table, tableA:
ID col1 col2 status
1 ABC 123 NULL
2 ABC 214 NULL
3 BCA 001 NULL
4 ABC 123 NULL
5 BWE 765 NULL
6 ABC 123 NULL
7 BCA 001 NULL
I want to flag the duplicate rows on (col1, col2) and populate the status column with a message referring to the ID of the row being duplicated.
For example, ID = 4 is a duplicate of ID = 1, ID = 6 is a duplicate of ID = 1, and ID = 7 is a duplicate of ID = 3.
status = "Duplicate of ID = (ID here)"
Expected result:
ID col1 col2 status
1 ABC 123 NULL
2 ABC 214 NULL
3 BCA 001 NULL
4 ABC 123 Duplicate of ID = 1
5 BWE 765 NULL
6 ABC 123 Duplicate of ID = 1
7 BCA 001 Duplicate of ID = 3
I can flag the duplicates but can't point them to the ID numbers. The script I used is:
WITH CTE_Duplicates1 AS
(SELECT ROW_NUMBER() OVER (PARTITION BY col1,col2
ORDER BY (SELECT 0)) RN,Status
FROM tableA
)
UPDATE CTE_Duplicates1
SET qnxtStatus = 'Duplicate of ID ='
WHERE RN<>1
Please correct. Thanks

;WITH CTE_Duplicates1 AS
(
    SELECT MIN(ID) OVER (PARTITION BY col1, col2) Mn,
           ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY ID) Rn,
           *
    FROM tableA
)
UPDATE CTE_Duplicates1
SET status = 'Duplicate of ID = ' + CAST(Mn AS VARCHAR(11))
WHERE Rn > 1
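Not part of the thread, but as a sanity check on the MIN(...) OVER / ROW_NUMBER() logic: SQLite cannot UPDATE through a CTE the way SQL Server can, so the sketch below (assuming an SQLite build with window-function support, 3.25+) computes the status column in a SELECT instead, using the question's table and sample rows.

```python
import sqlite3

# Build the sample table from the question in an in-memory database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tableA (ID INTEGER, col1 TEXT, col2 TEXT, status TEXT)")
con.executemany(
    "INSERT INTO tableA (ID, col1, col2) VALUES (?, ?, ?)",
    [(1, "ABC", "123"), (2, "ABC", "214"), (3, "BCA", "001"),
     (4, "ABC", "123"), (5, "BWE", "765"), (6, "ABC", "123"), (7, "BCA", "001")],
)

# Flag every row after the first within each (col1, col2) group, pointing at
# the lowest ID of the group -- the same idea as the UPDATE in the answer.
rows = con.execute("""
    SELECT ID,
           CASE WHEN ROW_NUMBER() OVER (PARTITION BY col1, col2 ORDER BY ID) > 1
                THEN 'Duplicate of ID = ' || MIN(ID) OVER (PARTITION BY col1, col2)
           END AS status
    FROM tableA
    ORDER BY ID
""").fetchall()
for r in rows:
    print(r)
```

Rows 4 and 6 come back flagged as duplicates of ID 1, and row 7 as a duplicate of ID 3, matching the expected result.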


SQL - Order Data on a Column without including it in ranking

So I have a scenario where I need to order data on a column without including it in dense_rank(). Here is my sample data set:
This is the table:
create table temp
(
id integer,
prod_name varchar(max),
source_system integer,
source_date date,
col1 integer,
col2 integer);
This is the dataset:
insert into temp
(id,prod_name,source_system,source_date,col1,col2)
values
(1,'ABC',123,'01/01/2021',50,60),
(2,'ABC',123,'01/15/2021',50,60),
(3,'ABC',123,'01/30/2021',40,60),
(4,'ABC',123,'01/30/2021',40,70),
(5,'XYZ',456,'01/10/2021',80,30),
(6,'XYZ',456,'01/12/2021',75,30),
(7,'XYZ',456,'01/20/2021',75,30),
(8,'XYZ',456,'01/20/2021',99,30);
Now, I want to apply dense_rank() so that, for a given combination of prod_name and source_system, the rank increments only when col1 or col2 changes, while the rows remain in ascending order of source_date.
Here is the expected result:
id prod_name source_system source_date col1 col2 Dense_Rank
1  ABC       123           01-01-21    50   60   1
2  ABC       123           15-01-21    50   60   1
3  ABC       123           30-01-21    40   60   2
4  ABC       123           30-01-21    40   70   3
5  XYZ       456           10-01-21    80   30   1
6  XYZ       456           12-01-21    75   30   2
7  XYZ       456           20-01-21    75   30   2
8  XYZ       456           20-01-21    99   30   3
As you can see above, the dates are changing but the expectation is that rank should only change if there is any change in either col1 or col2.
If I use this query
select id,prod_name,source_system,source_date,col1,col2,
dense_rank() over(partition by prod_name,source_system order by source_date,col1,col2) as rnk
from temp;
Then the result would come as:
id prod_name source_system source_date col1 col2 rnk
1  ABC       123           01-01-21    50   60   1
2  ABC       123           15-01-21    50   60   2
3  ABC       123           30-01-21    40   60   3
4  ABC       123           30-01-21    40   70   4
5  XYZ       456           10-01-21    80   30   1
6  XYZ       456           12-01-21    75   30   2
7  XYZ       456           20-01-21    75   30   3
8  XYZ       456           20-01-21    99   30   4
And, if I exclude source_date from order by in rank function i.e.
select id,prod_name,source_system,source_date,col1,col2,
dense_rank() over(partition by prod_name,source_system order by col1,col2) as rnk
from temp;
Then my result is coming as:
id prod_name source_system source_date col1 col2 rnk
3  ABC       123           30-01-21    40   60   1
4  ABC       123           30-01-21    40   70   2
1  ABC       123           01-01-21    50   60   3
2  ABC       123           15-01-21    50   60   3
6  XYZ       456           12-01-21    75   30   1
7  XYZ       456           20-01-21    75   30   1
5  XYZ       456           10-01-21    80   30   2
8  XYZ       456           20-01-21    99   30   3
Both the results are incorrect. How can I get the expected result? Any guidance would be helpful.
WITH cte AS (
SELECT *,
LAG(col1) OVER (PARTITION BY prod_name, source_system ORDER BY source_date, id) lag1,
LAG(col2) OVER (PARTITION BY prod_name, source_system ORDER BY source_date, id) lag2
FROM temp
)
SELECT *,
SUM(CASE WHEN (col1, col2) = (lag1, lag2)
THEN 0
ELSE 1
END) OVER (PARTITION BY prod_name, source_system ORDER BY source_date, id) AS `Dense_Rank`
FROM cte
ORDER BY id;
https://dbfiddle.uk/?rdbms=mysql_8.0&fiddle=ac70104c7c5dfb49c75a8635c25716e6
When comparing multiple columns, I like to look at the previous values of the ordering column, rather than the individual columns. This makes it much simpler to add more and more columns.
The basic idea is to do a cumulative sum of changes for each prod/source system. In Redshift, I would phrase this as:
select t.*,
sum(case when prev_date = prev_date_2 then 0 else 1 end) over (
partition by prod_name, source_system
order by source_date
rows between unbounded preceding and current row
)
from (select t.*,
lag(source_date) over (partition by prod_name, source_system order by source_date, id) as prev_date,
lag(source_date) over (partition by prod_name, source_system, col1, col2 order by source_date, id) as prev_date_2
from temp t
) t
order by id;
I think I have the syntax right for Redshift. Here is a db<>fiddle using Postgres.
Note that ties on the date can cause a problem -- regardless of the solution. This uses the id to break the ties. Perhaps id can just be used in general, but your code is using the date, so this uses the date with the id.
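The LAG-plus-running-SUM (gaps-and-islands) approach above can also be checked outside MySQL. This is a sketch, not the answers' exact code: it reruns the logic in SQLite (window functions need 3.25+, and the (col1, col2) = (lag1, lag2) row-value comparison needs 3.15+), with the table renamed to temp2 to avoid SQLite's TEMP keyword and the dates rewritten in ISO format so text ordering matches date ordering.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE temp2 (id INTEGER, prod_name TEXT, source_system INTEGER,
                                   source_date TEXT, col1 INTEGER, col2 INTEGER)""")
con.executemany(
    "INSERT INTO temp2 VALUES (?, ?, ?, ?, ?, ?)",
    [(1, "ABC", 123, "2021-01-01", 50, 60), (2, "ABC", 123, "2021-01-15", 50, 60),
     (3, "ABC", 123, "2021-01-30", 40, 60), (4, "ABC", 123, "2021-01-30", 40, 70),
     (5, "XYZ", 456, "2021-01-10", 80, 30), (6, "XYZ", 456, "2021-01-12", 75, 30),
     (7, "XYZ", 456, "2021-01-20", 75, 30), (8, "XYZ", 456, "2021-01-20", 99, 30)],
)

# Gaps-and-islands: mark a row 1 when (col1, col2) differs from the previous
# row of its (prod_name, source_system) partition (or there is no previous
# row), then take a running sum of those markers.
ranks = dict(con.execute("""
    WITH cte AS (
        SELECT *,
               LAG(col1) OVER (PARTITION BY prod_name, source_system
                               ORDER BY source_date, id) AS lag1,
               LAG(col2) OVER (PARTITION BY prod_name, source_system
                               ORDER BY source_date, id) AS lag2
        FROM temp2
    )
    SELECT id,
           SUM(CASE WHEN (col1, col2) = (lag1, lag2) THEN 0 ELSE 1 END)
               OVER (PARTITION BY prod_name, source_system ORDER BY source_date, id)
    FROM cte
""").fetchall())
print(ranks)
```

The per-id ranks come out as 1, 1, 2, 3 for ABC and 1, 2, 2, 3 for XYZ, matching the expected result.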

Counting total rows and rows under condition

I have table1 looking like:
docid val1 val2 val3 value
----------------------------------
001 1 1 null 10
001 null null null 5
001 1 null 1 20
001 1 null null 7
001 null null null 15
002 null null 1 30
002 null null null 2
I need as output:
Per docid
the total number of rows that exist for that docid
and the sum of value for those rows
the number of rows that hit on the condition: val1 = 1 or val2 = 1 or val3 = 1
and the sum of value for those rows
As follows:
docid total_rows total_rows_value rows_with_val val_rows_value
001 5 57 3 37
002 2 32 1 30
What I have until now:
select [docid],
count(1) as [rows_with_val],
sum([value]) as [val_rows_value]
from table1
where val1 = 1 or val2 = 1 or val3 = 1
group by [docid]
;
This only gives the hits, though. How can I account for both? I understand I can delete the WHERE clause, but where does the condition go then? I have been reading about CASE expressions (in my SELECT) but don't know how to apply one here.
You can use conditional aggregation:
select docid, count(*) total_rows, sum(value) as sum_value,
sum(case when 1 in (val1, val2, val3) then 1 else 0 end) as cnt_val1,
sum(case when 1 in (val1, val2, val3) then value else 0 end) as sum_val1
from mytable
group by docid
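The conditional-aggregation answer can be checked with a quick sqlite3 script; a sketch using the question's sample data (note that 1 IN (val1, val2, val3) evaluates to NULL, hence the ELSE branch, when all three columns are NULL):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (docid TEXT, val1 INTEGER, val2 INTEGER, val3 INTEGER, value INTEGER)")
con.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?, ?, ?)",
    [("001", 1, 1, None, 10), ("001", None, None, None, 5),
     ("001", 1, None, 1, 20), ("001", 1, None, None, 7),
     ("001", None, None, None, 15), ("002", None, None, 1, 30),
     ("002", None, None, None, 2)],
)

# Plain COUNT/SUM give the per-docid totals; CASE inside SUM restricts the
# count and sum to rows matching val1 = 1 OR val2 = 1 OR val3 = 1.
rows = con.execute("""
    SELECT docid,
           COUNT(*)                                                     AS total_rows,
           SUM(value)                                                   AS total_rows_value,
           SUM(CASE WHEN 1 IN (val1, val2, val3) THEN 1 ELSE 0 END)     AS rows_with_val,
           SUM(CASE WHEN 1 IN (val1, val2, val3) THEN value ELSE 0 END) AS val_rows_value
    FROM table1
    GROUP BY docid
    ORDER BY docid
""").fetchall()
print(rows)
```

This reproduces the expected output: docid 001 has 5 rows summing to 57, 3 of which hit the condition, summing to 37; docid 002 has 2 rows summing to 32, 1 hit, summing to 30.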

SQLite: Use subquery result in another subquery

I have the following table with data:
id | COL1
=========
1 | b
2 | z
3 | b
4 | c
5 | b
6 | a
7 | b
8 | c
9 | a
So I know the ID of 'z' (ID = 2) in the table; I will call it Z_ID.
I need to retrieve the rows between 'a' and 'c' (including 'a' and 'c').
It must be the first 'a' that comes after Z_ID.
'c' must come after Z_ID and after the 'a' I found previously.
Result that i am seeking is:
id | COL1
=========
6 | a
7 | b
8 | c
My SELECT looks like this
SELECT *
FROM table
WHERE id >= (
SELECT MIN(ID)
FROM table
WHERE COL1 = 'a' AND ID > 2
)
AND id <= (
SELECT MIN(ID)
FROM table
WHERE COL1 = 'c' AND ID > 2 and ID > (
SELECT MIN(ID)
FROM table
WHERE COL1 = 'a' AND ID > 2
)
)
I am getting the result I want, but I am concerned about performance because I use the same subquery twice. Is there a way to reuse the result of the first subquery?
Maybe there is a cleaner way to get the result I need?
Use a CTE, which evaluates the subquery you use twice only once:
WITH cte AS (
SELECT MIN(ID) minid
FROM tablename
WHERE COL1 = 'a' AND ID > 2
)
SELECT t.*
FROM tablename t CROSS JOIN cte c
WHERE t.id >= c.minid
AND t.id <= (
SELECT MIN(ID)
FROM tablename
WHERE COL1 = 'c' and ID > c.minid
)
In your 2nd query's WHERE clause:
WHERE COL1 = 'c' AND ID > 2 and ID > (...
the condition AND ID > 2 is not needed, because the next condition, and ID > (..., already guarantees that ID is greater than 2, so I don't use it in my code either.
See the demo.
Results:
| id | COL1 |
| --- | ---- |
| 6 | a |
| 7 | b |
| 8 | c |
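As a runnable restatement (not part of the answer), the CTE query can be checked with Python's bundled sqlite3; tablename and the hard-coded Z_ID = 2 follow the answer.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tablename (id INTEGER, COL1 TEXT)")
con.executemany(
    "INSERT INTO tablename VALUES (?, ?)",
    [(1, "b"), (2, "z"), (3, "b"), (4, "c"), (5, "b"),
     (6, "a"), (7, "b"), (8, "c"), (9, "a")],
)

# The CTE computes the first 'a' after Z_ID once; the correlated subquery
# then finds the first 'c' after that 'a'.
rows = con.execute("""
    WITH cte AS (
        SELECT MIN(ID) AS minid
        FROM tablename
        WHERE COL1 = 'a' AND ID > 2
    )
    SELECT t.*
    FROM tablename t CROSS JOIN cte c
    WHERE t.id >= c.minid
      AND t.id <= (SELECT MIN(ID) FROM tablename
                   WHERE COL1 = 'c' AND ID > c.minid)
    ORDER BY t.id
""").fetchall()
print(rows)
```

This returns rows 6 ('a'), 7 ('b'), and 8 ('c'), as in the demo.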
You can use window functions for this:
select t.*
from (select t.*,
             min(case when id > min_a_id and col1 = 'c' then id end) over () as min_c_id
      from (select t.*,
                   min(case when col1 = 'a' then id end) over () as min_a_id
            from (select t.*,
                         min(case when col1 = 'z' then id end) over () as z_id
                  from tablename t
                 ) t
            where id > z_id
           ) t
     ) t
where id >= min_a_id and id <= min_c_id;

Customer Dimension

I am writing SQL code to create a Customer Dimension.
ID Name File Import Date
1 XXX 12/30/2018
1 XXX 12/31/2018
1 XXX 1/1/2019
1 YYY 2/2/2019
1 YYY 3/2/2019
1 YYY 4/2/2019
2 AAA 1/1/2019
I want to create a query that captures the distinct Name values along with their history.
New table
ID Name Active
1 XXX 0
1 YYY 1
2 AAA 1
The query below gives me the latest record:
SELECT Distinct a.[ID] as CustID
,a.[Name] as CustName
FROM X as a
inner join
(select ID,[MaxDate] = MAX(FileImportDate) from X group by ID ) b
on a.ID = b.ID
and a.FileImportDate = b.MaxDate
I'll bite...
Going by comments, this is a guess
Example
Select Top 1 with ties
ID
,Name
,Active = case when [FileImportDate] = max([FileImportDate]) over (Partition By ID) then 1 else 0 end
From YourTable
Order By Row_Number() over (Partition By Name Order by [FileImportDate] Desc)
Returns
ID Name Active
2 AAA 1
1 XXX 0
1 YYY 1
Here is a dbFiddle
With distinct and case:
select
distinct t.id, t.name,
case
when exists (
select 1 from tablename
where
id = t.id
and name <> t.name
and fileimportdate > t.fileimportdate) then 0
else 1
end active
from tablename t
See the demo
Results:
id name active
1 XXX 0
1 YYY 1
2 AAA 1
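Not from the answers, but the DISTINCT + EXISTS variant can be verified with sqlite3; hypothetical ISO-formatted dates stand in for the File Import Date values so that plain text comparison orders them correctly.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE tablename (id INTEGER, name TEXT, fileimportdate TEXT)")
con.executemany(
    "INSERT INTO tablename VALUES (?, ?, ?)",
    [(1, "XXX", "2018-12-30"), (1, "XXX", "2018-12-31"), (1, "XXX", "2019-01-01"),
     (1, "YYY", "2019-02-02"), (1, "YYY", "2019-03-02"), (1, "YYY", "2019-04-02"),
     (2, "AAA", "2019-01-01")],
)

# A name is inactive (0) if the same id later imported under a different name.
rows = set(con.execute("""
    SELECT DISTINCT t.id, t.name,
           CASE WHEN EXISTS (SELECT 1 FROM tablename
                             WHERE id = t.id
                               AND name <> t.name
                               AND fileimportdate > t.fileimportdate) THEN 0
                ELSE 1 END AS active
    FROM tablename t
""").fetchall())
print(rows)
```

The output matches the demo: (1, XXX) is inactive, while (1, YYY) and (2, AAA) are active.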

SQL Server convert a table column values to names and assign 0 to the new columns if no value for a column

I would like to transform a table from rows to columns in SQL Server.
Given table1:
id value1 value2
1 name1 9
1 name1 26
1 name1 15
2 name2 20
2 name2 18
2 name2 61
I need a table like:
id name1 name2
1 9 0
1 26 0
1 15 0
2 0 20
2 0 18
2 0 61
Can PIVOT help here? An efficient way to do the conversion is preferred because the table is large.
I have tried:
select
id, name1, name2
from
(select
id, value1, value2
from table1) d
pivot
(
max(value2)
for value1 in (name1, name2)
) piv;
But it does not return all rows for the same ID, combined with 0s.
Thanks
The 'secret' is to add a column to give uniqueness to each row within your 'nameX' groups. I've used ROW_NUMBER. Although PIVOT requires an aggregate, with our 'faked uniqueness' MAX, MIN etc will suffice. The final piece is to replace any NULLs with 0 in the outer select.
(BTW, we're on 2014 so I can't test this on 2008 - apologies)
SELECT * INTO #Demo FROM (VALUES
(1,'name1',9),
(1,'name1',26),
(1,'name1',15),
(2,'name2',20),
(2,'name2',18),
(2,'name2',61)) A (Id,Value1,Value2)
SELECT
Id
,ISNULL(Name1, 0) Name1
,ISNULL(Name2, 0) Name2
FROM
( SELECT
ROW_NUMBER() OVER ( PARTITION BY Id ORDER BY Id ) rn
,Id
,Value1
,Value2
FROM
#Demo ) A
PIVOT ( MAX(Value2) FOR Value1 IN ( [Name1], [Name2] ) ) AS P;
Id Name1 Name2
----------- ----------- -----------
1 9 0
1 26 0
1 15 0
2 0 20
2 0 18
2 0 61
You can do case-based aggregation with GROUP BY:
SQL Fiddle
select id,
max(case when value1 ='name1' then value2 else 0 end) as name1,
max(case when value1 ='name2' then value2 else 0 end) as name2
from Table1
group by id, value1, value2
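The case-based aggregation is also a portable way to check the result, since engines such as SQLite have no PIVOT at all. A sketch (not the answer's exact environment):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE table1 (id INTEGER, value1 TEXT, value2 INTEGER)")
con.executemany(
    "INSERT INTO table1 VALUES (?, ?, ?)",
    [(1, "name1", 9), (1, "name1", 26), (1, "name1", 15),
     (2, "name2", 20), (2, "name2", 18), (2, "name2", 61)],
)

# Grouping by all three columns keeps one output row per distinct input row,
# while the CASE expressions move each value2 into its name1/name2 column
# and fill the other column with 0.
rows = con.execute("""
    SELECT id,
           MAX(CASE WHEN value1 = 'name1' THEN value2 ELSE 0 END) AS name1,
           MAX(CASE WHEN value1 = 'name2' THEN value2 ELSE 0 END) AS name2
    FROM table1
    GROUP BY id, value1, value2
""").fetchall()
print(rows)
```

One caveat of this approach: because the output is grouped by (id, value1, value2), exact duplicate input rows would collapse into one output row, whereas the ROW_NUMBER + PIVOT answer preserves them.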