How can I count duplicates that fall within a date range? (SQL) - sql

I have a table that contains Applicant ID, Application Date and Job Description.
I am trying to identify duplicates, defined as when the same Applicant ID applies for the same Job Description within 3 days of their other application.
I have already done this for the same date, this way:
CREATE TABLE Duplicates
SELECT
COUNT (ApplicantID) as ApplicantCount
ApplicantID
ApplicationDate
JobDescription
FROM Applications
GROUP BY ApplicantID,ApplicationDate,JobDescription
-
DELETE FROM Duplicates WHERE ApplicantCount <2
SELECT COUNT(*) FROM Duplicates
I'm now trying to make it so it doesn't have to match exactly on the ApplicationDate, but falls within a range. How do you do this?

You can use lead()/lag(). Here is an example that returns the first application when there is a duplicate:
SELECT a.*
FROM (SELECT a.*,
LEAD(ApplicationDate) OVER (PARTITION BY ApplicantID, JobDescription) as next_ad
FROM Applications a
) a
WHERE next_ad <= ApplicationDate + INTERVAL 3 DAY;
You can also phrase this using exists:
select a.*
from applications a
where exists (select 1
from applications a2
where a2.ApplicantID = a.ApplicantID and
a2.JobDescription = a.JobDescription and
a2.ApplicationDate > a.ApplicationDate and
a2.ApplicationDate <= a.ApplicationDate + interval 3 day
);

Related

SQL - Count new entries based on last date

I have a table with the follow structure
ID ReportDate Object_id
What I need to know, is the count of new and count of old (Object id's)
For example: If I have the data below:
I want the following output grouped by ReportDate:
I thought a way doing it using a Where clause based on date, however i need the data for all the dates I have in the table. To see the count of what already existed in the previous report and what is new at that report. Any Ideas?
Edit: New/Old definition- New would be the records that never appeared before that report run date and appeared on this one, whereas old is the number of records that had at least one match in previous dates. I'll edit the post to include this info.
managed to do it using a left join. Below is my solution in case it helps anyone in the future :)
SELECT table.ReportRunDate,
-1*sum(table.ReportRunDate = new_table.init_date) as count_new,
-1*sum(table.ReportRunDate <> new_table.init_date) as count_old,
count(*) as count_total
FROM table LEFT JOIN
((SELECT Object_ID, min(ReportRunDate) as init_date
FROM table
GROUP By OBJECT_ID) as new_table)
ON table.Object_ID = new_table.Object_ID
GROUP BY ReportRunDate
This would work in Oracle, not sure about ms-access:
SELECT ReportDate
,COUNT(CASE WHEN rnk = 1 THEN 1 ELSE NULL END) count_of_new
,COUNT(CASE WHEN rnk <> 1 THEN 1 ELSE NULL END)count_of_old
FROM (SELECT ID
,ReportDate
,Object_id
,RANK() OVER (PARTITION BY Object_id ORDER BY ReportDate) rnk
FROM table_name)
GROUP BY ReportDate
Inner query should rank each occurence of object_id based on the ReportDate so the 1st occurrence of certain object_id will have rank = 1, the next one rank = 2 etc.
Then the outer query counts how many records with rank equal/not equal 1 are the within each group.
I assumed that 1 object_id can appear only once within each reportDate.

Recursive subtraction from two separate tables to fill in historical data

I have two datasets hosted in Snowflake with social media follower counts by day. The main table we will be using going forward (follower_counts) shows follower counts by day:
This table is live as of 4/4/2020 and will be updated daily. Unfortunately, I am unable to get historical data in this format. Instead, I have a table with historical data (follower_gains) that shows net follower gains by day for several accounts:
Ideally - I want to take the follower_count value from the minimum date in the current table (follower_counts) and subtract the sum of gains (organic + paid gains) for each day, until the minimum date of the follower_gains table, to fill in the follower_count historically. In addition, there are several accounts with data in these tables, so it would need to be grouped by account. It should look like this:
I've only gotten as far as unioning these two tables together, but don't even know where to start with looping through these rows:
WITH a AS (
SELECT
account_id,
date,
organizational_entity,
organizational_entity_type,
vanity_name,
localized_name,
localized_website,
organization_type,
total_followers_count,
null AS paid_follower_gain,
null AS organic_follower_gain,
account_name,
last_update
FROM follower_counts
UNION ALL
SELECT
account_id,
date,
organizational_entity,
organizational_entity_type,
vanity_name,
localized_name,
localized_website,
organization_type,
null AS total_followers_count,
organic_follower_gain,
paid_follower_gain,
account_name,
last_update
FROM follower_gains)
SELECT
a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
a.total_followers_count,
a.organic_follower_gain,
a.paid_follower_gain,
a.account_name,
a.last_update
FROM a
ORDER BY date desc LIMIT 100
UPDATE: Changed union to union all and added not exists to remove duplicates. Made changes per the comments.
NOTE: Please make sure you don't post images of the tables. It's difficult to recreate your scenario to write a correct query. Test this solution and update so that I can make modifications if necessary.
You don't loop through in SQL because its not a procedural language. The operation you define in the query is performed for all the rows in a table.
with cte as (SELECT a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
(a.follower_count - (b.organic_gain+b.paid_gain)) AS follower_count,
a.account_name,
a.last_update,
b.organic_gain,
b.paid_gain
FROM follower_counts a
JOIN follower_gains b ON a.account_id = b.account_id
AND b.date < (select min(date) from
follower_counts c where a.account.id = c.account_id)
)
SELECT b.account_id,
b.date,
b.organizational_entity,
b.organizational_entity_type,
b.vanity_name,
b.localized_name,
b.localized_website,
b.organization_type,
b.follower_count,
b.account_name,
b.last_update,
b.organic_gain,
b.paid_gain
FROM cte b
UNION ALL
SELECT a.account_id,
a.date,
a.organizational_entity,
a.organizational_entity_type,
a.vanity_name,
a.localized_name,
a.localized_website,
a.organization_type,
a.follower_count,
a.account_name,
a.last_update,
NULL as organic_gain,
NULL as paid_gain
FROM follower_counts a where not exists (select 1 from
follower_gains c where a.account_id = c.account_id AND a.date = c.date)
You could do something like this, instead of using the variable you can just wrap it another bracket and write at end ) AS FollowerGrowth
DECLARE #FollowerGrowth INT =
( SELECT total_followers_count
FROM follower_gains
WHERE AccountID = xx )
-
( SELECT TOP 1 follower_count
FROM follower_counts
WHERE AccountID = xx
ORDER BY date ASCENDING )

How show the last status of a mobile number and old data in the same row ? using SQL

I'm working in a telecom and part of work is to check the last status for a specific mobile number along with that last de-active status,it's easy to get the active number by using the condition ACTIVE int the statement ,but it's not easy to pick the last de-active status because each number might have more than one de-active status or only one status ACTIVE, I use the EXP_DATE as an indicator for the last de-active status,I want to show both new data and old data in one row,but I'm struggling with that ,below my table and my expected result :-
my expected result
query that I use on daily basis
select * from test where exp_date>sysdate; to get the active numbers , to get the de-active number select * from test where exp_date<sysdate;
You just need to do outer join with one subquery containing ACTIVE records and one with latest DE-ACTIVE record as following:
SELECT A.MSISDN,
A.NAME,
A.SUB_STATUS,
A.CREATED_DATE,
A.EXP_DATE,
D.MSISDN AS MSISDN_,
D.NAME AS OLD_NAME,
D.SUB_STATUS OLD_STATUS,
D.CREATED_DATE AS OLD_CREATED_DATE,
D.EXP_DATE AS OLD_EXP_DATE
FROM
(SELECT * FROM TEST
WHERE EXP_DATE > SYSDATE
AND SUB_STATUS = 'ACTIVE') A -- ACTIVE RECORD
-- USE CONDITION TO FETCH ACTIVE RECORD AS PER YOUR REQUIREMENT
FULL OUTER JOIN
(SELECT * FROM
(SELECT T.*,
ROW_NUMBER() OVER (PARTITION BY T.MSISDN ORDER BY EXP_DATE DESC NULLS LAST) AS RN
FROM TEST T
WHERE T.EXP_DATE < SYSDATE
AND T.SUB_STATUS='DE-ACTIVE')
-- USE CONDITION TO FETCH DEACTIVE RECORD AS PER YOUR REQUIREMENT
WHERE RN = 1
) D
ON (A.MSISDN = D.MSISDN)
Cheers!!
Here is an overview of how to do this -- one query to get a distinct list of all the phone numbers, left join to a list of the most recent active on that phone number,left join to a list of the most recent de-active on the phone number
How about conditional aggregation?
select msidn,
max(case when status = 'DE-ACTIVE' then create_date end) as deactive_date,
max(case when status = 'ACTIVE' then exp_date end) as active_date
from test
group by msisdn

SQL Server iterating through time series data

I am using SQL Server and wondering if it is possible to iterate through time series data until specific condition is met and based on that label my data in other table?
For example, let's say I have a table like this:
Id Date Some_kind_of_event
+--+----------+------------------
1 |2018-01-01|dsdf...
1 |2018-01-06|sdfs...
1 |2018-01-29|fsdfs...
2 |2018-05-10|sdfs...
2 |2018-05-11|fgdf...
2 |2018-05-12|asda...
3 |2018-02-15|sgsd...
3 |2018-02-16|rgw...
3 |2018-02-17|sgs...
3 |2018-02-28|sgs...
What I want to get, is to calculate for each key the difference between two adjacent events and find out if there exists difference > 10 days between these two adjacent events. In case yes, I want to stop iterating for that specific key and put label 'inactive', otherwise 'active' in my other table. After we finish with one key, we start with another.
So for example id = 1 would get label 'inactive' because there exists two dates which have difference bigger that 10 days. The final result would be like that:
Id Label
+--+----------+
1 |inactive
2 |active
3 |inactive
Any ideas how to do that? Is it possible to do it with SQL?
When working with a DBMS you need to get away from the idea of thinking iteratively. Instead you need to try and think in sets. "Instead of thinking about what you want to do to a row, think about what you want to do to a column."
If I understand correctly, is this what you're after?
CREATE TABLE SomeEvent (ID int, EventDate date, EventName varchar(10));
INSERT INTO SomeEvent
VALUES (1,'20180101','dsdf...'),
(1,'20180106','sdfs...'),
(1,'20180129','fsdfs..'),
(2,'20180510','sdfs...'),
(2,'20180511','fgdf...'),
(2,'20180512','asda...'),
(3,'20180215','sgsd...'),
(3,'20180216','rgw....'),
(3,'20180217','sgs....'),
(3,'20180228','sgs....');
GO
WITH Gaps AS(
SELECT *,
DATEDIFF(DAY,LAG(EventDate) OVER (PARTITION BY ID ORDER BY EventDate),EventDate) AS EventGap
FROM SomeEvent)
SELECT ID,
CASE WHEN MAX(EventGap) > 10 THEN 'inactive' ELSE 'active' END AS Label
FROM Gaps
GROUP BY ID
ORDER BY ID;
GO
DROP TABLE SomeEvent;
GO
This assumes you are using SQL Server 2012+, as it uses the LAG function, and SQL Server 2008 has less than 12 months of any kind of support.
Try this. Note, replace #MyTable with your actual table.
WITH Diffs AS (
SELECT
Id
,DATEDIFF(DAY,[Date],LEAD([Date],1,0) OVER (ORDER BY [Id], [Date])) Diff
FROM #MyTable)
SELECT
Id
,CASE WHEN MAX(Diff) > 10 THEN 'Inactive' ELSE 'Active' END
FROM Diffs
GROUP BY Id
Just to share another approach (without a CTE).
SELECT
ID
, CASE WHEN SUM(TotalDays) = (MAX(CNT) - 1) THEN 'Active' ELSE 'Inactive' END Label
FROM (
SELECT
ID
, EventDate
, CASE WHEN DATEDIFF(DAY, EventDate, LEAD(EventDate) OVER(PARTITION BY ID ORDER BY EventDate)) < 10 THEN 1 ELSE 0 END TotalDays
, COUNT(ID) OVER(PARTITION BY ID) CNT
FROM EventsTable
) D
GROUP BY ID
The method is counting how many records each ID has, and getting the TotalDays by date differences (in days) between the current the next date, if the difference is less than 10 days, then give me 1, else give me 0.
Then compare, if the total days equal the number of records that each ID has (minus one) would print Active, else Inactive.
This is just another approach that doesn't use CTE.

ORACLE: How to get earliest record of certain value when value alternates?

I'll simplify what I'm looking for here.
I have a table that stores an asset name, the date (job runs daily), and a value that is either 1 or 0 that indicates whether the asset is out of compliance.
I need to get the earliest date where the value is 0.
The issue I run into is that the issue can be intermittent, such that the same asset may show as in compliance, then out, and then in again. I want to retrieve the earliest date it was out of compliance this time.
Asset Date Compliant
NAME 2-FEB-18 0
NAME 1-FEB-18 0
NAME 31-JAN-18 1
NAME 30-JAN-18 0
In this example, I want to retrieve 1-FEB-18, and not 30-JAN-18.
I'm using a subquery into a temp table that retrieves the MIN(date) which would return 30-JAN-18. Thoughts?
Anonymized current subquery:
least_recent_created AS
(
SELECT t.date,t.ASSET, t.DATABASE_NAME FROM table t
WHERE t.date =
(
SELECT MIN(date)
FROM table2 t2
WHERE t.ASSET_ID = t2.ASSET_ID
AND t.DATABASE_NAME = t2.DATABASE_NAME
AND t2.compliant = 0
)
)
You want the earliest out-of-compliance date since the last in compliance. If the asset was never in compliance, I assume you want the earliest date.
select t.asset, min(date)
from (select t.*,
max(case when t.complaint = 1 then date end) over (partition by asset) as max_compliant1_date
from t
) t
where complaint = 0 and
(date > max_complaint1_date or max_complaint1_date is null)
group by t.asset;
You can use the following query:
SELECT "Asset", MAX("Date")
FROM (
SELECT "Asset", "Date", "Compliant",
CASE
WHEN "Compliant" = 0 AND
LAG("Compliant") OVER (PARTITION BY "Asset"
ORDER BY "Date") = 1 THEN "Date"
END AS OutOfComplianceDate
FROM mytable) t
WHERE OutOfComplianceDate IS NOT NULL
GROUP BY "Asset"
The inner query identifies 'Out-of-Compliance' dates, that is dates where the current record has "Compliant" = 0 whereas the immediately preceding record has "Compliant" = 1.
The outer query returns the latest 'Out-of-Compliance' date per "Asset".
Demo here