SQL Server duplicate row - sql

I have a table with duplicate records. I want to mark whether each record is a duplicate in another column, say a column named Flag: if the record is a duplicate, set Flag to 1, otherwise 0.
How can I do this?
I can already select the duplicate records with this query:
select
o.clientid, oc.dupeCount, o.pannodesc, o.CustNo
from
CustomerMaster1 o
inner join
(SELECT clientid, COUNT(*) AS dupeCount
FROM CustomerMaster1
WHERE ISNULL(PanNoDesc, '') <> ''
GROUP BY clientid
HAVING COUNT(*) > 1) oc ON o.clientid = oc.clientid
Put simply: if there are two similar records, mark the second one with 1; if there are three similar records, mark two of them with 1, leaving the original record as 0.

Just use count(*) as a window function to calculate the flag (this marks every row of a duplicated clientid, including the first):
select o.clientid, o.pannodesc, o.CustNo,
count(*) over (partition by o.clientid) as dupeCount,
(case when count(*) over (partition by o.clientid) > 1
then 1 else 0
end) as IsDuplicate
from CustomerMaster1 o;
If you only care about certain records, then you can count just those instead:
select o.clientid, o.pannodesc, o.CustNo,
(case when sum(case when PanNoDesc <> '' and PanNoDesc is not null
then 1 else 0
end) over (partition by clientId) > 1
then 1 else 0
end) as IsDuplicate
from CustomerMaster1 o;
EDIT:
If you want to modify the data, assuming the table has a Flag column, you can use the same logic in an updatable CTE:
with toupdate as (
select o.*, -- o.* so that Flag is visible to the update below
(case when sum(case when PanNoDesc <> '' and PanNoDesc is not null
then 1 else 0
end) over (partition by clientId) > 1
then 1 else 0
end) as NewIsDuplicate
from CustomerMaster1 o
)
update toupdate
set Flag = NewIsDuplicate;

You can write it as:
CREATE TABLE CustomerMaster1 (clientid INT, PanNoDesc VARCHAR(10), DupFlag bit);
INSERT INTO CustomerMaster1 VALUES (1, 'A', NULL), (1, 'B', NULL);
SELECT clientid, PanNoDesc, DupFlag FROM CustomerMaster1;
-- Number the rows within each clientid; every row after the first is a duplicate.
-- ORDER BY PanNoDesc only makes the choice of "original" row repeatable.
WITH CTE AS (
SELECT DupFlag,
ROW_NUMBER() OVER (PARTITION BY clientid ORDER BY PanNoDesc ASC) AS rownum
FROM CustomerMaster1
WHERE ISNULL(PanNoDesc, '') <> ''
)
-- Update through the CTE so each row uses its own rownum; joining back on
-- clientid alone would match every row of the group.
UPDATE CTE
SET DupFlag = (CASE WHEN rownum > 1 THEN 1 ELSE 0 END);
SELECT clientid, PanNoDesc, DupFlag FROM CustomerMaster1;
Edit: Demo based on sample fields provided:
http://sqlfiddle.com/#!3/4592f/1

Related

SQL Function for updating column with values

As those who have helped me before know, I use SAS 9.4 for most of my day-to-day work, but there are times when I need to use SQL Server.
I have an output table (attached output.csv) with these columns:
ID, GROUP, DATE
The table has 830 rows:
330 have a "C" group
150 have a "A" group
50 have a "B" group
the remaining 300 have group as "TEMP"
Within SQL I do not know how to programmatically work out the total volume of A+B+C. The aim is to update the "TEMP" rows so that there is an equal number of "A" and "B" rows, 250 of each (the remainder of the total count once "C" is set aside),
so that the table totals become:
330 have a "C" group
250 have a "A" group
250 have a "B" group
You want to apportion the "TEMP" rows so that "A" and "B" end up with equal counts.
So, the idea is to count everything in A, B, and TEMP and divide by 2; that is the final group size. Then you can use arithmetic to allocate the TEMP rows to the two groups:
select t.*,
(case when seqnum + cnt_a <= final_group_size then 'A' else 'B' end) as allocated_group
from (select t.*, row_number() over (order by newid()) as seqnum
from t
where [group] = 'TEMP' -- group is a reserved word, so it needs brackets
) t cross join
(select (cnt_a + cnt_b + cnt_temp) / 2 as final_group_size,
g.*
from (select sum(case when [group] = 'A' then 1 else 0 end) as cnt_a,
sum(case when [group] = 'B' then 1 else 0 end) as cnt_b,
sum(case when [group] = 'TEMP' then 1 else 0 end) as cnt_temp
from t
) g
) g;
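To make the arithmetic concrete, here is a small sketch that plugs in the counts quoted in the question (A = 150, B = 50, TEMP = 300). The literals and column aliases are only illustrative; the real query derives the counts from the table.

```sql
-- Counts taken from the question: A = 150, B = 50, TEMP = 300.
select (150 + 50 + 300) / 2       as final_group_size, -- 250
       (150 + 50 + 300) / 2 - 150 as temp_rows_to_a,   -- 100
       (150 + 50 + 300) / 2 - 50  as temp_rows_to_b;   -- 200
```

So the first 100 randomly numbered TEMP rows become A and the other 200 become B, which is what the seqnum + cnt_a <= final_group_size test expresses.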
SQL Server makes it easy to put this into an update:
with toupdate as (
select t.*,
(case when seqnum + cnt_a <= final_group_size then 'A' else 'B' end) as allocated_group
from (select t.*, row_number() over (order by newid()) as seqnum
from t
where [group] = 'TEMP'
) t cross join
(select (cnt_a + cnt_b + cnt_temp) / 2 as final_group_size,
g.*
from (select sum(case when [group] = 'A' then 1 else 0 end) as cnt_a,
sum(case when [group] = 'B' then 1 else 0 end) as cnt_b,
sum(case when [group] = 'TEMP' then 1 else 0 end) as cnt_temp
from t
) g
) g
)
update toupdate
set [group] = allocated_group;
I'd go with a TOP-based update. Only 100 of the TEMP rows are needed to bring A from 150 to 250, so move 100 random TEMP rows to A and then the remaining 200 to B:
with pick as (select top (100) * from [TableName] where [Group] = 'TEMP' order by newid())
update pick set [Group] = 'A';
update [TableName] set [Group] = 'B' where [Group] = 'TEMP';

In CTE query, count records in another table based on first table retrieved ID

I am working on a CTE-based query, which I have never used before. The following query gets records from the user_detail table:
with cte as (
select cust_ID, parentid, name, joinside, regdate, package, null lnode, null rnode
from user_detail
where cust_ID = @nodeid
union all
select t.cust_ID, t.parentid, t.name, t.joinside, t.regdate, t.package,
ISNULL(cte.lnode, CASE WHEN t.joinside = 0 THEN 1 ELSE 0 END) lnode,
ISNULL(cte.rnode, CASE WHEN t.joinside = 1 THEN 1 ELSE 0 END) rnode
from user_detail t
inner join cte on cte.cust_ID = t.parentid )
select @nodeid nodeid, name, cust_ID, parentid, regdate, package
from cte
where rnode = '0'
order by cust_id asc option (maxrecursion 0)
The query above gives me 6 columns (nodeid, name, cust_ID, parentid, regdate, package).
What I actually want is a 7th column that counts the rows for each cust_id in another table, installments.
I tried the following, but when I add GROUP BY to the query it gives me an error:
declare @nodeid int = 1;
with cte as (
select cust_ID, parentid, name, joinside, regdate, package, null lnode, null rnode
from user_detail
where cust_ID = @nodeid
union all
select t.cust_ID, t.parentid, t.name, t.joinside, t.regdate, t.package,
ISNULL(cte.lnode, CASE WHEN t.joinside = 0 THEN 1 ELSE 0 END) lnode,
ISNULL(cte.rnode, CASE WHEN t.joinside = 1 THEN 1 ELSE 0 END) rnode
from user_detail t
inner join cte on cte.cust_ID = t.parentid )
select @nodeid nodeid, name, ctttt.cust_ID, parentid, regdate, package, insttt.cust_id
from cte as ctttt
left join installments as insttt on ctttt.cust_id = insttt.cust_id
where rnode = '0'
order by ctttt.cust_id asc option (maxrecursion 0)
Use a correlated subquery in the select list instead of the join:
(select count(*) from installments as insttt where ctttt.cust_id = insttt.cust_id ) cnt
Full query:
declare @nodeid int = 1;
with cte as (
select cust_ID, parentid, name, joinside, regdate, package, null lnode, null rnode
from user_detail
where cust_ID = @nodeid
union all
select t.cust_ID, t.parentid, t.name, t.joinside, t.regdate, t.package,
ISNULL(cte.lnode, CASE WHEN t.joinside = 0 THEN 1 ELSE 0 END) lnode,
ISNULL(cte.rnode, CASE WHEN t.joinside = 1 THEN 1 ELSE 0 END) rnode
from user_detail t
inner join cte on cte.cust_ID = t.parentid )
select @nodeid nodeid, name, ctttt.cust_ID, parentid, regdate, package,
-- one count per cte row, so no join or GROUP BY is needed
(select count(*) from installments as insttt where ctttt.cust_id = insttt.cust_id ) cnt
from cte as ctttt
where rnode = '0'
order by ctttt.cust_id asc option (maxrecursion 0)

Counting users who don't make a certain event

Hi, from the following table:
id event
1 unknown
1 unknown
1 unknown
2 unknown
2 X
2 Y
3 unknown
3 unknown
4 X
5 Y
I want to count the users for which every row has the value unknown.
In this case that should be 2 ids out of 5.
My attempt was:
select
count(distinct case when event != 'unknown' then id else null end) as loggeds,
count(distinct case when event = 'unknown' then id else null end) as not_log_android,
count(distinct event) as session_long
from table
but it is completely wrong.
With NOT EXISTS:
select t.id
from tablename as t
where not exists (
select 1 from tablename where id = t.id and event <> 'unknown'
)
group by t.id
For the number of distinct ids:
select count(distinct t.id)
from tablename as t
where not exists (
select 1 from tablename where id = t.id and event <> 'unknown'
)
You can check this question: How to check if value exists in each group (after group by)
SELECT COUNT(DISTINCT t1.id)
FROM theTable t1
WHERE NOT EXISTS (SELECT 1 FROM theTable t2 WHERE t1.id = t2.id AND t2.event != 'unknown')
or:
SELECT COUNT(*)
FROM (SELECT t.id
FROM theTable t
GROUP BY t.id
HAVING MAX(CASE WHEN t.event = 'unknown' THEN 0 ELSE 1 END) = 0
) u
SELECT id
FROM YourTable
GROUP BY id
HAVING COUNT(*) = COUNT ( CASE WHEN event = 'unknown' THEN 1 END )
I would use aggregation:
SELECT id
FROM tablename t
GROUP BY id
HAVING MIN(event) = MAX(event) AND MIN(event) = 'unknown';
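As a quick check, the sketch below runs the conditional-count HAVING test against the sample rows from the question, inlined through a VALUES list. The CTE name event_rows is only an illustrative assumption. Ids 1 and 3 are the only ones whose rows are all unknown, so the count should be 2.

```sql
-- Sample rows copied from the question; event_rows is a hypothetical name.
with event_rows (id, event) as (
    select id, event
    from (values (1, 'unknown'), (1, 'unknown'), (1, 'unknown'),
                 (2, 'unknown'), (2, 'X'), (2, 'Y'),
                 (3, 'unknown'), (3, 'unknown'),
                 (4, 'X'), (5, 'Y')) as v (id, event)
)
select count(*) as users_all_unknown   -- 2 (ids 1 and 3)
from (
    select id
    from event_rows
    group by id
    -- keep ids whose total row count equals their count of 'unknown' rows
    having count(*) = count(case when event = 'unknown' then 1 end)
) as u;
```

Wrapping the grouped query in an outer count(*) is what turns "one row per qualifying id" into a single number.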

How to improve performance in Hive

I am running the query below in Hive on around 2 million records. Is there any way to improve the performance? The source Hive table is partitioned by created_date.
select t.id,
case when t.amt_1_rank < 0.3*f.amt_1_count then t.amt_1 else null end as amt_1,
case when t.amt_2_rank < 0.3*f.amt_2_count then t.amt_2 else null end as amt_2,
..
..
.. -- Like wise 30 columns e.g. amt_3,amt_3...
from (
select a.id,
a.amt_1,
row_number() over (ORDER BY cast(a.amt_1 AS DECIMAL(8,7)) DESC) AS amt_1_rank,
a.amt_2,
row_number() over (ORDER BY cast(a.amt_2 AS DECIMAL(8,7)) DESC) AS amt_2_rank
from source_table a WHERE created_date='2017-10-15' )t
join
(
SELECT count(case when amt_1='.' then null else 1 end) AS amt_1_count,
count(case when amt_2='.' then null else 1 end) AS amt_2_count,
..
..
FROM source_table
WHERE created_date='2017-10-15'
) f
You can do it without the join, by computing the counts as analytic functions over the same rows:
select t.id,
case when t.amt_1_rank < 0.3*t.amt_1_count then t.amt_1 else null end as amt_1,
case when t.amt_2_rank < 0.3*t.amt_2_count then t.amt_2 else null end as amt_2,
..
..
.. -- Like wise 30 columns e.g. amt_3,amt_3...
from (
select a.id,
a.amt_1,
row_number() over (ORDER BY cast(a.amt_1 AS DECIMAL(8,7)) DESC) AS amt_1_rank,
a.amt_2,
row_number() over (ORDER BY cast(a.amt_2 AS DECIMAL(8,7)) DESC) AS amt_2_rank,
count(amt_1_flag) over() AS amt_1_count,
count(amt_2_flag) over() AS amt_2_count
from
(select a.*,
case when amt_1='.' then null else 1 end as amt_1_flag,
case when amt_2='.' then null else 1 end as amt_2_flag
from source_table a WHERE created_date='2017-10-15'
)a
)t

Count based on group with filter

I have a query that displays 2 columns: "Device_ID" and "Status". Device_ID is the name of all computers and status contains either "reboot" or "success" as values. I would like a third column that would count how many "success" there are for that specific Device_ID.
How could I go about doing this?
SELECT tgt.Device_ID, tgt.Status, src.cnt
FROM [TableName] tgt
INNER JOIN
(
Select Device_ID, count(CASE WHEN Status = 'SUCCESS' THEN 1 END) cnt
from [TableName]
GROUP BY Device_ID
) src
ON tgt.Device_ID= src.Device_ID;
SELECT A.Device_ID,A.Status,B.Count_of_Success_per_Device_ID
FROM Yourtable A
INNER JOIN
(
SELECT Device_ID,
SUM( CASE WHEN Status = 'Success' THEN 1 ELSE 0 END ) AS Count_of_Success_per_Device_ID
FROM Yourtable
GROUP BY Device_ID
) B
ON A.Device_ID = B.Device_ID ;
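The same per-device count can probably also be written without the self-join, using a windowed SUM. This is a sketch of that alternative, not what either answer above uses. The device_status CTE and its rows are hypothetical sample data standing in for the real table.

```sql
-- device_status and its rows are hypothetical sample data.
with device_status (Device_ID, Status) as (
    select Device_ID, Status
    from (values ('PC1', 'Success'), ('PC1', 'Reboot'), ('PC1', 'Success'),
                 ('PC2', 'Reboot')) as v (Device_ID, Status)
)
select Device_ID, Status,
       -- the window repeats the per-device total on every row of that device
       sum(case when Status = 'Success' then 1 else 0 end)
           over (partition by Device_ID) as Count_of_Success_per_Device_ID
from device_status;
-- the three PC1 rows show 2, the PC2 row shows 0
```

Every original row is kept, and devices with no successes show 0 rather than disappearing.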