Count lead duplicate rows - sql

I have the below table
Table A:
row_number id start_dt end_dt cust_dt cust_id
1 101 4/8/19 4/20/19 4/10/19 725
2 101 4/21/19 5/20/19 4/10/19 456
3 101 5/1/19 6/30/19 4/10/19 725
4 101 7/1/19 8/20/19 4/10/19 725
I need to count "duplicates" in a table for testing purposes.
Criteria:
Need to exclude the start_dt and end_dt from my calculation.
A row only counts as a duplicate if the next (lead) row repeats it. For example, rows 1, 3 and 4 have the same values, but only rows 3 and 4 would be considered duplicates.
What I have tried:
rank() with lead() and a self join, but that doesn't seem to work on my end.
How can I count the id to determine if there are duplicates?
Output: (something like below)
count id
2 101
The end result I need is a count of 1 for the table:
count id
1 101

Use the row_number analytic function as follows (this is a gaps-and-islands problem):
Select count(1), id from
(Select t.*,
row_number() over (order by row_number) as rn,
row_number() over (partition by id, cust_dt, cust_id order by row_number) as part_rn
From your_table t)
Group by id, cust_dt, cust_id, (rn-part_rn)
Having count(1) > 1
Cheers!!
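To make the rn - part_rn trick easier to follow, here is a minimal sketch that runs the same idea against an in-memory SQLite database from Python. It assumes SQLite 3.25 or later (for window functions); the table name table_a and the column name row_num (standing in for row_number) are placeholders chosen for this illustration. The HAVING filter is left out on purpose so every island is visible.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# table_a / row_num are hypothetical names; start_dt and end_dt are omitted
# because the question excludes them from the comparison.
conn.execute(
    "CREATE TABLE table_a (row_num INTEGER, id INTEGER, cust_dt TEXT, cust_id INTEGER)"
)
conn.executemany(
    "INSERT INTO table_a VALUES (?, ?, ?, ?)",
    [
        (1, 101, "2019-04-10", 725),
        (2, 101, "2019-04-10", 456),
        (3, 101, "2019-04-10", 725),
        (4, 101, "2019-04-10", 725),
    ],
)

# rn numbers all rows; part_rn numbers rows within each (id, cust_dt, cust_id).
# Consecutive rows with the same values share the same rn - part_rn difference.
islands = conn.execute(
    """
    SELECT id, cust_id, rn - part_rn AS island, COUNT(*) AS rows_in_island
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (ORDER BY row_num) AS rn,
               ROW_NUMBER() OVER (PARTITION BY id, cust_dt, cust_id
                                  ORDER BY row_num) AS part_rn
        FROM table_a t
    )
    GROUP BY id, cust_dt, cust_id, rn - part_rn
    ORDER BY MIN(row_num)
    """
).fetchall()

for row in islands:
    print(row)
```

Only the last island (rows 3 and 4) contains more than one row, which is what the HAVING count(1) > 1 filter in the answer keeps.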

If your definition of a duplicated row is: the CUST_ID in the lead row (with the same id, ordered by row_number) equals the current CUST_ID,
you can write it simply using the LEAD analytic function.
select ID, ROW_NUMBER, CUST_ID,
case when CUST_ID = lead(CUST_ID) over (partition by id order by ROW_NUMBER) then 1 end is_dup
from tab
ID ROW_NUMBER CUST_ID IS_DUP
---------- ---------- ---------- ----------
101 1 725
101 2 456
101 3 725 1
101 4 725
The aggregated query to get the number of duplicated rows would then be
with dup as (
select ID, ROW_NUMBER, CUST_ID,
case when CUST_ID = lead(CUST_ID) over (partition by id order by ROW_NUMBER) then 1 end is_dup
from tab)
select ID, sum(is_dup) dup_cnt
from dup
group by ID
ID DUP_CNT
---------- ----------
101 1
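As a hedged illustration of the LEAD approach, the sketch below runs the aggregated query in an in-memory SQLite database (3.25 or later assumed). The column name row_num and the extra id 102 are invented for the example. It also shows one detail worth knowing: SUM over a group with no flagged rows returns NULL, while COUNT(is_dup) returns 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (row_num INTEGER, id INTEGER, cust_id INTEGER)")
conn.executemany(
    "INSERT INTO tab VALUES (?, ?, ?)",
    [
        (1, 101, 725), (2, 101, 456), (3, 101, 725), (4, 101, 725),
        (1, 102, 111), (2, 102, 222),  # invented id with no consecutive repeats
    ],
)

dup_counts = conn.execute(
    """
    WITH dup AS (
        SELECT id,
               -- flag a row when the next row of the same id has the same cust_id
               CASE WHEN cust_id = LEAD(cust_id) OVER (PARTITION BY id ORDER BY row_num)
                    THEN 1 END AS is_dup
        FROM tab
    )
    SELECT id, SUM(is_dup) AS dup_sum, COUNT(is_dup) AS dup_cnt
    FROM dup
    GROUP BY id
    ORDER BY id
    """
).fetchall()

print(dup_counts)
```

If a zero is preferred over NULL for ids without duplicates, COUNT(is_dup) or COALESCE(SUM(is_dup), 0) are two possible choices.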

Related

Select earliest date and count rows in table with duplicate IDs

I have a table called table1:
id created_date
1001 2020-06-01
1001 2020-01-01
1001 2020-07-01
1002 2020-02-01
1002 2020-04-01
1003 2020-09-01
I'm trying to write a query that provides me a list of distinct IDs with the earliest created_date they have, along with the count of rows each id has:
id created_date count
1001 2020-01-01 3
1002 2020-02-01 2
1003 2020-09-01 1
I managed to write a window function to grab the earliest date, but I'm having trouble figuring out where to fit the count statement in one:
SELECT
id,
created_date
FROM ( SELECT
id,
created_date,
row_number() OVER(PARTITION BY id ORDER BY created_date) as row_num
FROM table1
) AS a
WHERE row_num = 1
You would use aggregation:
select id, min(created_date) as created_date, count(*) as count
from table1
group by id;
I find it amusing that you want to use window functions -- which are considered more advanced -- when lowly aggregation suffices.
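For comparison, here is a small sketch (SQLite 3.25 or later assumed, driven from Python) that runs the aggregation next to a window-function variant of the asker's query, with COUNT(*) OVER added as one possible way to fit the count in. Both return the same rows on the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, created_date TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [
        (1001, "2020-06-01"), (1001, "2020-01-01"), (1001, "2020-07-01"),
        (1002, "2020-02-01"), (1002, "2020-04-01"),
        (1003, "2020-09-01"),
    ],
)

# Plain aggregation, as in the answer.
aggregated = conn.execute(
    """
    SELECT id, MIN(created_date), COUNT(*)
    FROM table1
    GROUP BY id
    ORDER BY id
    """
).fetchall()

# Window-function variant: keep the earliest row per id and carry the
# per-id row count along with it.
windowed = conn.execute(
    """
    SELECT id, created_date, cnt
    FROM (
        SELECT id,
               created_date,
               ROW_NUMBER() OVER (PARTITION BY id ORDER BY created_date) AS row_num,
               COUNT(*) OVER (PARTITION BY id) AS cnt
        FROM table1
    ) AS a
    WHERE row_num = 1
    ORDER BY id
    """
).fetchall()

print(aggregated)
```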

Select max and group by only one column [duplicate]

I'm struggling to select multiple columns while using a max function because I only want it to group by one column.
Here is my dataset:
UPDATED_DATE ACCOUNT_NUMBER LIMIT
------------ -------------- -----
2020-02-01 ABC123 100
2020-02-06 ABC123 300
2020-03-04 XYZ987 500
2020-05-19 XYZ987 100
Here are the results I'm hoping to see:
UPDATED_DATE ACCOUNT_NUMBER LIMIT
------------ -------------- -----
2020-02-06 ABC123 300
2020-05-19 XYZ987 100
I appreciate the help.
You can use window functions:
select t.*
from (select t.*, row_number() over (partition by account_number order by updated_date desc) as seqnum
from t
) t
where seqnum = 1;
Or -- a method that typically has slightly better performance with the right indexes --:
select t.*
from t
where t.updated_date = (select max(t2.updated_date) from t t2 where t2.account_number = t.account_number);
Or, if you don't like subqueries and don't care so much about performance:
select top (1) with ties t.*
from t
order by row_number() over (partition by account_number order by updated_date desc);
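The first two variants can be checked side by side with a short sketch like the one below, run against in-memory SQLite (3.25 or later assumed). The column is named credit_limit here because LIMIT is a keyword; that name is an assumption for the example. The TOP (1) WITH TIES form is SQL Server syntax and is not reproduced.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# credit_limit is a stand-in for the LIMIT column in the question.
conn.execute(
    "CREATE TABLE t (updated_date TEXT, account_number TEXT, credit_limit INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2020-02-01", "ABC123", 100),
        ("2020-02-06", "ABC123", 300),
        ("2020-03-04", "XYZ987", 500),
        ("2020-05-19", "XYZ987", 100),
    ],
)

# Variant 1: number rows per account, newest first, and keep row 1.
latest_by_window = conn.execute(
    """
    SELECT updated_date, account_number, credit_limit
    FROM (
        SELECT t.*,
               ROW_NUMBER() OVER (PARTITION BY account_number
                                  ORDER BY updated_date DESC) AS seqnum
        FROM t
    ) AS ranked
    WHERE seqnum = 1
    ORDER BY account_number
    """
).fetchall()

# Variant 2: correlated subquery on the maximum date per account.
latest_by_subquery = conn.execute(
    """
    SELECT t.updated_date, t.account_number, t.credit_limit
    FROM t
    WHERE t.updated_date = (SELECT MAX(t2.updated_date)
                            FROM t AS t2
                            WHERE t2.account_number = t.account_number)
    ORDER BY t.account_number
    """
).fetchall()

print(latest_by_window)
```

One difference to keep in mind: if two rows of an account share the same latest updated_date, the subquery variant returns both, while row_number() keeps only one of them.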

Add temporary column with number in sequence in BigQuery

I have two tables: customers and orders. orders has a customer_id column, so a customer can have many orders. I need to find each order's number in sequence (by date), so the result should be something like this:
customer_id order_date number_in_sequence
----------- ---------- ------------------
1 2020-01-01 1
1 2020-01-02 2
1 2020-01-03 3
2 2019-01-01 1
2 2019-01-02 2
I am going to use it in a WITH clause, so I don't need to add it to the table.
You need row_number():
select t.*,
row_number() over (partition by customer_id order by order_date) as number_in_sequence
from orders t;
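Since the asker plans to use this inside a WITH clause, a rough sketch of that shape follows. It runs against in-memory SQLite (3.25 or later assumed) rather than BigQuery, and the CTE name numbered is made up for the example; the window function itself is standard SQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer_id INTEGER, order_date TEXT)")
# Rows are inserted out of order to show that the numbering follows order_date.
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [
        (1, "2020-01-03"), (2, "2019-01-02"), (1, "2020-01-01"),
        (2, "2019-01-01"), (1, "2020-01-02"),
    ],
)

numbered_orders = conn.execute(
    """
    WITH numbered AS (
        SELECT customer_id,
               order_date,
               ROW_NUMBER() OVER (PARTITION BY customer_id
                                  ORDER BY order_date) AS number_in_sequence
        FROM orders
    )
    SELECT customer_id, order_date, number_in_sequence
    FROM numbered
    ORDER BY customer_id, order_date
    """
).fetchall()

for row in numbered_orders:
    print(row)
```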

How to select rows where values changed for an ID

I have a table that looks like the following
id effective_date number_of_int_customers
123 10/01/19 0
123 02/01/20 3
456 10/01/19 6
456 02/01/20 6
789 10/01/19 5
789 02/01/20 4
999 10/01/19 0
999 02/01/20 1
I want to write a query that looks at each ID to see if the salespeople have newly started working internationally between October 1st and February 1st.
The result I am looking for is the following:
id effective_date number_of_int_customers
123 02/01/20 3
999 02/01/20 1
The result would return only the salespeople who originally had 0 international customers and now have at least 1.
I have seen similar posts here that use nested queries to pull records where the first date and last have different values. But I only want to pull records where the original value was 0. Is there a way to do this in one query in SQL?
In your case, a simple aggregation would do -- assuming that 0 is the earliest value:
select id, max(number_of_int_customers)
from t
where effective_date in ('2019-10-01', '2020-02-01')
group by id
having min(number_of_int_customers) = 0;
Obviously, this is not correct if the values can decrease to zero. But this having clause fixes that problem:
having min(case when number_of_int_customers = 0 then effective_date end) = min(effective_date)
An alternative is to use window functions, such as first_value():
select distinct id, last_noic
from (select t.*,
first_value(number_of_int_customers) over (partition by id order by effective_date) as first_noic,
first_value(number_of_int_customers) over (partition by id order by effective_date desc) as last_noic
from t
where effective_date in ('2019-10-01', '2020-02-01')
) t
where first_noic = 0;
Hmmm, on second thought, I like lag() better:
select id, number_of_int_customers
from (select t.*,
lag(number_of_int_customers) over (partition by id order by effective_date) as prev_noic
from t
where effective_date in ('2019-10-01', '2020-02-01')
) t
where prev_noic = 0;
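A minimal sketch of the lag() version, run against in-memory SQLite (3.25 or later assumed) with the dates stored as ISO strings. The extra condition number_of_int_customers > 0 is an addition made for this example, so that an id going from 0 to 0 would not be returned; it is not part of the query above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (id INTEGER, effective_date TEXT, number_of_int_customers INTEGER)"
)
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        (123, "2019-10-01", 0), (123, "2020-02-01", 3),
        (456, "2019-10-01", 6), (456, "2020-02-01", 6),
        (789, "2019-10-01", 5), (789, "2020-02-01", 4),
        (999, "2019-10-01", 0), (999, "2020-02-01", 1),
    ],
)

newly_international = conn.execute(
    """
    SELECT id, effective_date, number_of_int_customers
    FROM (
        SELECT t.*,
               LAG(number_of_int_customers) OVER (PARTITION BY id
                                                  ORDER BY effective_date) AS prev_noic
        FROM t
        WHERE effective_date IN ('2019-10-01', '2020-02-01')
    ) AS x
    WHERE prev_noic = 0
      AND number_of_int_customers > 0  -- added assumption: must now have at least 1
    ORDER BY id
    """
).fetchall()

print(newly_international)
```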

fill in a null cell with cell from previous record

Hi I am using DB2 sql to fill in some missing data in the following table:
Person House From To
------ ----- ---- --
1 586 2000-04-16 2010-12-03
2 123 2001-01-01 2012-09-27
2 NULL NULL NULL
2 104 2004-01-01 2012-11-24
3 987 1999-12-31 2009-08-01
3 NULL NULL NULL
Person 2 has lived in 3 houses, but for the middle address it is not known where or when. I can't do anything about which house they were in, but I would like to use the previous address's To date to replace the NULL From date, and the next address's From date to replace the NULL To date, i.e.
Person House From To
------ ----- ---- --
1 586 2000-04-16 2010-12-03
2 123 2001-01-01 2012-09-27
2 NULL 2012-09-27 2004-01-01
2 104 2004-01-01 2012-11-24
3 987 1999-12-31 2009-08-01
3 NULL 2009-08-01 9999-01-01
I understand that if there is no previous address before a null address, that will have to stay null, but if a null address is the last known address I would like to change the To date to 9999-01-01, as for person 3.
This seems to me like the type of problem where a set-based approach is no longer a good fit; however, I am required to find a DB2 solution because that's what my boss uses!
Any pointers/suggestions welcome.
Thanks.
It might look something like this:
select
person,
house,
coalesce(from_date, prev_to_date) from_date,
case when rn = 1 then coalesce (to_date, '9999-01-01')
else coalesce(to_date, next_from_date) end to_date
from
(select person, house, from_date, to_date,
lag(to_date) over (partition by person order by from_date nulls last) prev_to_date,
lead(from_date) over (partition by person order by from_date nulls last) next_from_date,
row_number() over (partition by person order by from_date desc nulls first) rn -- reverse of "from_date nulls last", so rn = 1 is the last row
from temp
) t
The above is not tested but it might give you an idea.
I hope in your actual table you have a column other than to_date and from_date that allows you to order rows for each person, otherwise you'll have trouble sorting NULL dates, as you have no way of knowing the actual sequence.
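To illustrate the point about needing a separate ordering column, here is a sketch that assumes such a column exists. The table name residence and the column seq are hypothetical, and the query runs against in-memory SQLite (3.25 or later assumed) rather than DB2. With a reliable order, lag() and lead() reproduce the output the question asks for on the sample data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# residence and seq are invented for this example; seq records the true
# order of each person's addresses, which the dates alone cannot provide.
conn.execute(
    "CREATE TABLE residence (person INTEGER, seq INTEGER, house INTEGER,"
    " from_date TEXT, to_date TEXT)"
)
conn.executemany(
    "INSERT INTO residence VALUES (?, ?, ?, ?, ?)",
    [
        (1, 1, 586, "2000-04-16", "2010-12-03"),
        (2, 1, 123, "2001-01-01", "2012-09-27"),
        (2, 2, None, None, None),
        (2, 3, 104, "2004-01-01", "2012-11-24"),
        (3, 1, 987, "1999-12-31", "2009-08-01"),
        (3, 2, None, None, None),
    ],
)

filled = conn.execute(
    """
    SELECT person,
           house,
           COALESCE(from_date, prev_to_date) AS from_date,
           -- fall back to the next From date, then to the open-ended marker
           COALESCE(to_date, next_from_date, '9999-01-01') AS to_date
    FROM (
        SELECT person, seq, house, from_date, to_date,
               LAG(to_date)    OVER (PARTITION BY person ORDER BY seq) AS prev_to_date,
               LEAD(from_date) OVER (PARTITION BY person ORDER BY seq) AS next_from_date
        FROM residence
    ) AS r
    ORDER BY person, seq
    """
).fetchall()

for row in filled:
    print(row)
```

Note that the three-argument COALESCE also writes 9999-01-01 when the next row exists but its From date is itself NULL; whether that is acceptable depends on the real data.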
create table Temp
(
person varchar(2),
house int,
from_date date,
to_date date
);
insert into temp values
(1,586,'2000-04-16','2010-12-03'),
(2,123,'2001-01-01','2012-09-27'),
(2,NULL,NULL,NULL),
(2,104,'2004-01-01','2012-11-24'),
(3,987,'1999-12-31','2009-08-01'),
(3,NULL,NULL,NULL);
select A.person,
A.house,
isnull(A.from_date,BF.to_date) From_date,
isnull(A.to_date,isnull(CT.From_date,'9999-01-01')) To_date
from
((select *,ROW_NUMBER() over (order by (select 0)) rownum from Temp) A left join
(select *,ROW_NUMBER() over (order by (select 0)) rownum from Temp) BF
on A.person = BF.person and
A.rownum = BF.rownum + 1)left join
(select *,ROW_NUMBER() over (order by (select 0)) rownum from Temp) CT
on A.person = CT.person and
A.rownum = CT.rownum - 1