Count duplicate records using a common table expression

The data returned as dataset in the CTE below looks like:
| date       | rows_added |
|------------|------------|
| 2022-04-18 | 100        |
| 2022-04-17 | 200        |
| 2022-04-17 | 600        |
How can I incorporate a count of the duplicate records, by date, in the following CTE?
with dataset as (
SELECT
date,
COUNT(*) as rows_added
FROM
my_table
WHERE
date between '2022-01-01 00:00:00'
AND '2022-04-18 00:00:00'
GROUP BY
date
)
SELECT
COUNT(*) as total_days_in_result_set,
COUNT(DISTINCT rows_added) as total_days_w_distinct_record_counts,
COUNT(*) - COUNT(DISTINCT rows_added) as toal_days_w_duplicate_record_counts
FROM dataset
If I only wanted to count the duplicate dates, I would use the following query, but I can't work out how to incorporate it into the CTE above:
SELECT date, COUNT(date)
FROM my_table
GROUP BY date
HAVING COUNT(date) >1
Desired output given the example above:
| total_days_in_result_set | total_days_w_distinct_record_counts | toal_days_w_duplicate_record_counts | duplicate_dates |
|--------------------------|-------------------------------------|-------------------------------------|-----------------|
| 3                        | 3                                   | 0                                   | 2               |

Here's my solution:
WITH DATA AS (with dataset as (
SELECT
CAST(date as date),
COUNT(*) as rows_added
FROM
my_table
WHERE
date between '2022-04-17 00:00:00'
AND '2022-04-18 00:00:00'
GROUP BY
date
)
SELECT
COUNT(*) as total_days_in_result_set,
COUNT(DISTINCT rows_added) as total_days_w_distinct_record_counts,
COUNT(*) - COUNT(DISTINCT rows_added) as toal_days_w_duplicate_record_counts,
CASE WHEN COUNT(date) > 1 THEN 'YES' ELSE 'NO' END AS duplicate_dates
FROM dataset
)
SELECT
total_days_in_result_set,
total_days_w_distinct_record_counts,
toal_days_w_duplicate_record_counts,
COUNT(*) FILTER(WHERE duplicate_dates = 'YES') as count_of_duplicate_dates
FROM DATA
GROUP BY
total_days_in_result_set,
total_days_w_distinct_record_counts,
toal_days_w_duplicate_record_counts

I think if you use the filter in your final query, you can achieve that:
SELECT
COUNT(*) as total_days_in_result_set,
COUNT(DISTINCT rows_added) as total_days_w_distinct_record_counts,
COUNT(*) - COUNT(DISTINCT rows_added) as toal_days_w_duplicate_record_counts,
count (*) filter (where rows_added > 1) as duplicate_Dates
FROM dataset
The final column, duplicate_Dates, is the addition: it counts the dates that have more than one record.
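
As a rough check of the idea above, here is a minimal sketch that runs the query against an in-memory SQLite database from Python. The sample rows are made up for illustration, and SUM(CASE ...) is swapped in for the FILTER clause because older SQLite builds do not accept FILTER on plain aggregates; on PostgreSQL the FILTER form shown above should behave the same way.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (date TEXT)")

# Hypothetical sample rows: two records on the 16th, three on the 17th, one on the 18th.
conn.executemany(
    "INSERT INTO my_table (date) VALUES (?)",
    [
        ("2022-04-16",), ("2022-04-16",),
        ("2022-04-17",), ("2022-04-17",), ("2022-04-17",),
        ("2022-04-18",),
    ],
)

query = """
WITH dataset AS (
    SELECT date, COUNT(*) AS rows_added
    FROM my_table
    WHERE date BETWEEN '2022-01-01' AND '2022-04-18'
    GROUP BY date
)
SELECT
    COUNT(*) AS total_days_in_result_set,
    COUNT(DISTINCT rows_added) AS total_days_w_distinct_record_counts,
    COUNT(*) - COUNT(DISTINCT rows_added) AS toal_days_w_duplicate_record_counts,
    -- Portable stand-in for: COUNT(*) FILTER (WHERE rows_added > 1)
    SUM(CASE WHEN rows_added > 1 THEN 1 ELSE 0 END) AS duplicate_dates
FROM dataset
"""

result = conn.execute(query).fetchone()
print(result)  # (3, 3, 0, 2)
```

With this sample, two of the three dates hold more than one record, so duplicate_dates comes out as 2.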

Related

Count distinct values for day in oracle sql

I have a table that has the following values:
sta_datetime | calling_number |called_number
01/08/2019 | 999999 | 9345435
01/08/2019 | 999999 | 5657657
02/08/2019 | 999999 | 5657657
03/08/2019 | 999999 | 9844566
I want a query that counts, for each date in the month, the called numbers that have not already appeared on an earlier date. For example:
sta_datetime | calling_number | quantity_calls
01/08/2019 | 999999 | 2
02/08/2019 | 999999 | 0
03/08/2019 | 999999 | 1
The count for 02/08/2019 is 0 because its called_number already appears on 01/08/2019.

Assuming you have records on each day, you can just count the first in a series of days with a given called number by using lag():
select sta_datetime, calling_number,
sum(case when prev_sta_datetime = sta_datetime - 1 then 0 else 1 end) as cnt
from (select t.*,
lag(sta_datetime) over (partition by calling_number, called_number order by sta_datetime) as prev_sta_datetime
from t
) t
group by sta_datetime, calling_number
order by sta_datetime, calling_number;
If you only want to count the first date called_number was called, then:
select sta_datetime, calling_number,
sum(case when first_sta_datetime = sta_datetime then 1 else 0 end) as cnt
from (select t.*,
min(sta_datetime) over (partition by calling_number, called_number) as first_sta_datetime
from t
) t
group by sta_datetime, calling_number
order by sta_datetime, calling_number;
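
To see what the second query returns on the question's sample, here is a small sketch using SQLite from Python. The dates are rewritten as ISO strings so they compare correctly as text, and the table name t follows the answer; both are assumptions for the sake of a runnable example rather than a statement about the asker's Oracle schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE t (sta_datetime TEXT, calling_number TEXT, called_number TEXT)"
)
# Sample rows from the question, with dates converted to ISO format.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("2019-08-01", "999999", "9345435"),
        ("2019-08-01", "999999", "5657657"),
        ("2019-08-02", "999999", "5657657"),
        ("2019-08-03", "999999", "9844566"),
    ],
)

query = """
SELECT sta_datetime, calling_number,
       -- A row counts only on the first date its called_number was seen.
       SUM(CASE WHEN first_sta_datetime = sta_datetime THEN 1 ELSE 0 END) AS cnt
FROM (SELECT t.*,
             MIN(sta_datetime) OVER (PARTITION BY calling_number, called_number)
                 AS first_sta_datetime
      FROM t) AS s
GROUP BY sta_datetime, calling_number
ORDER BY sta_datetime, calling_number
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The three dates come back with counts 2, 0 and 1, which matches the quantity_calls column the question asks for.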

I think you can use NOT EXISTS and then GROUP BY, as follows:
Select t1.sta_datetime, t1.calling_number, count(1) as quantity_calls
from your_table t1
Where not exists
(select 1 from
your_table t2
Where t2.sta_datetime < t1.sta_datetime
and t1.calling_number = t2.calling_number
and t1.called_number = t2.called_number
and trunc(t1.sta_datetime, 'month') = trunc(t2.sta_datetime, 'month'))
Group by t1.sta_datetime, t1.calling_number
Order by t1.calling_number, t1.sta_datetime;
Cheers!!

SQL Server - find absence date occurrences [duplicate]

This question already has an answer here:
SQL: Gaps and Islands, Grouped dates
I have the following dataset, generated by this script:
;with dataset AS (
select 'EMP01' AS EMP_ID,CAST('2018-01-01' AS DATE) AS PERIOD_START,CAST('2018-01-31' AS DATE) AS PERIOD_END,CAST('2018-01-07' AS DATE) AS CUT_DATE
UNION
select 'EMP01' AS EMP_ID,CAST('2018-01-01' AS DATE) AS PERIOD_START,CAST('2018-01-31' AS DATE) AS PERIOD_END,CAST('2018-01-15' AS DATE) AS CUT_DATE
UNION
select 'EMP02' AS EMP_ID,CAST('2018-01-01' AS DATE) AS PERIOD_START,CAST('2018-01-31' AS DATE) AS PERIOD_END,CAST('2018-01-09' AS DATE) AS CUT_DATE
)
select *
from dataset
I need to split each period (PERIOD_START to PERIOD_END) at its CUT_DATE values, excluding the cut dates themselves from the resulting periods. There can be any number of cut dates (3, 5, 8, etc.).
The expected result for the dataset above is:

If your version of SQL Server supports LAG, you can use this.
SELECT EMPLOYEE_ID,
ITEM_TYPE,
MIN(APPLY_DATE) AS STARTDATE,
MAX(APPLY_DATE) AS ENDDATE
FROM
(SELECT T.*,
SUM(CASE WHEN PREV_TYPE=ITEM_TYPE THEN 0 ELSE 1 END)
OVER(PARTITION BY EMPLOYEE_ID ORDER BY APPLY_DATE) AS GRP
FROM (SELECT D.*,
LAG(ITEM_TYPE) OVER(PARTITION BY EMPLOYEE_ID ORDER BY APPLY_DATE) AS PREV_TYPE
FROM DATA D
) T
) T
WHERE ITEM_TYPE IN ('Sickness','Vacation')
GROUP BY EMPLOYEE_ID,ITEM_TYPE,GRP
The logic is to get the previous row's item_type (based on ascending order of apply_date) and compare it with the current row's value. If they are equal, they belong to the same group. Else you start a new group. This is done in the sum window function. After groups are assigned, you just need to get the max and min date for an employee_id,item_type.
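
The grouping logic described above can be tried out with a short sketch in Python and SQLite. The table name absences and its rows are hypothetical, chosen only to show one employee switching between item types; the query itself follows the LAG plus running SUM pattern from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE absences (employee_id INTEGER, apply_date TEXT, item_type TEXT)"
)
# Hypothetical rows: two sick days, one worked day, two vacation days, one sick day.
conn.executemany(
    "INSERT INTO absences VALUES (?, ?, ?)",
    [
        (1, "2017-05-23", "Sickness"),
        (1, "2017-05-24", "Sickness"),
        (1, "2017-05-25", "Worked"),
        (1, "2017-05-26", "Vacation"),
        (1, "2017-05-27", "Vacation"),
        (1, "2017-05-28", "Sickness"),
    ],
)

query = """
SELECT employee_id, item_type,
       MIN(apply_date) AS startdate,
       MAX(apply_date) AS enddate
FROM (SELECT t.*,
             -- Running total of "type changed" flags: each change opens a new group.
             SUM(CASE WHEN prev_type = item_type THEN 0 ELSE 1 END)
                 OVER (PARTITION BY employee_id ORDER BY apply_date) AS grp
      FROM (SELECT d.*,
                   LAG(item_type) OVER (PARTITION BY employee_id
                                        ORDER BY apply_date) AS prev_type
            FROM absences d) t) g
WHERE item_type IN ('Sickness', 'Vacation')
GROUP BY employee_id, item_type, grp
ORDER BY employee_id, startdate
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

Note that this only starts a new group when the item type changes; it does not check that the dates inside a group are consecutive.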

You would use the LAG function.
If you order by something, the LAG function gives the previous value;
a full description can be found at: http://www.sqlservercentral.com/articles/T-SQL/106783/
Take a look at vkp's answer for a full query

This is another way, if LAG is supported.
with tbl as
(select d.*
,case when (item_type = lag(item_type) over (partition by employee_id order by apply_date))
then 0
else 1
end grp_tmp
from DATA2 d
where
item_type <> 'Worked'
)
,tbl2 as
(select t.*
,sum(grp_tmp) over (order by employee_id,apply_date
rows between unbounded preceding and current row
)
as grp
from tbl t
)
select
EMPLOYEE_ID
,ITEM_TYPE
,(CONVERT(VARCHAR(24),min(apply_date),103)
+' - '
+CONVERT(VARCHAR(24),max(apply_date),103)
) as range
from tbl2
group by EMPLOYEE_ID,
ITEM_TYPE
,grp
order by
employee_id
,min(apply_date);
Output
+-------------+-----------+-------------------------+
| EMPLOYEE_ID | ITEM_TYPE | range |
+-------------+-----------+-------------------------+
| 1 | Sickness | 23/05/2017 - 24/05/2017 |
| 1 | Vacation | 26/05/2017 - 29/05/2017 |
| 1 | Sickness | 01/06/2017 - 01/06/2017 |
| 2 | Sickness | 25/05/2017 - 30/05/2017 |
+-------------+-----------+-------------------------+

Querying for an ID that has the most number of reads

Suppose I have a table like the one below:
+----+-----------+
| ID | TIME |
+----+-----------+
| 1 | 12-MAR-15 |
| 2 | 23-APR-14 |
| 2 | 01-DEC-14 |
| 1 | 01-DEC-15 |
| 3 | 05-NOV-15 |
+----+-----------+
What I want to do is, for each year (TIME is a DATE column), list the ID that has the highest count in that year. So for example, ID 1 occurs the most in 2015, ID 2 occurs the most in 2014, etc.
What I have for a query is:
SELECT EXTRACT(year from time) "YEAR", COUNT(ID) "ID"
FROM table
GROUP BY EXTRACT(year from time)
ORDER BY COUNT(ID) DESC;
But this query just counts how many times each year occurs. How do I fix it to return the ID with the highest count in each year?
Output:
+------+----+
| YEAR | ID |
+------+----+
| 2015 | 2 |
| 2012 | 2 |
+------+----+
Expected Output:
+------+----+
| YEAR | ID |
+------+----+
| 2015 | 1 |
| 2014 | 2 |
+------+----+

Starting with your sample query, the first change is simply to group by the ID as well as by the year.
SELECT EXTRACT(year from time) "YEAR" , id, COUNT(*) "TOTAL"
FROM table
GROUP BY EXTRACT(year from time), id
ORDER BY EXTRACT(year from time) DESC, COUNT(*) DESC
With that, you could find the rows you want by visual inspection (the first row for each year is the ID with the most rows).
To have the query just return the rows with the highest totals, there are several different ways to do it. You need to consider what you want to do if there are ties - do you want to see all IDs tied for highest in a year, or just an arbitrary one?
Here is one approach - if there is a tie, this should return just the lowest of the tied IDs:
WITH groups AS (
SELECT EXTRACT(year from time) "YEAR" , id, COUNT(*) "TOTAL"
FROM table
GROUP BY EXTRACT(year from time), id
)
SELECT year, MIN(id) KEEP (DENSE_RANK FIRST ORDER BY total DESC)
FROM groups
GROUP BY year
ORDER BY year DESC

You need to count per id and then apply a RANK on that count:
SELECT *
FROM
(
SELECT EXTRACT(year from time) "YEAR" , ID, COUNT(*) AS cnt
, RANK() OVER (PARTITION BY EXTRACT(year from time) ORDER BY COUNT(*) DESC) AS rnk
FROM table
GROUP BY EXTRACT(year from time), ID
) dt
WHERE rnk = 1
If this returns multiple rows with the same high count per year and you want just one of them, chosen arbitrarily, you can switch to ROW_NUMBER.
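
Here is a minimal sketch of the count-then-rank approach, run in SQLite from Python on the question's five rows. The table name reads and column name read_time are placeholders, strftime stands in for Oracle's EXTRACT, and the counting and ranking are split into two CTEs purely to keep each step easy to read.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE reads (id INTEGER, read_time TEXT)")
# The question's sample rows, with dates in ISO format.
conn.executemany(
    "INSERT INTO reads VALUES (?, ?)",
    [
        (1, "2015-03-12"),
        (2, "2014-04-23"),
        (2, "2014-12-01"),
        (1, "2015-12-01"),
        (3, "2015-11-05"),
    ],
)

query = """
WITH counts AS (
    -- Step 1: one row per (year, id) with its number of reads.
    SELECT strftime('%Y', read_time) AS yr, id, COUNT(*) AS cnt
    FROM reads
    GROUP BY strftime('%Y', read_time), id
),
ranked AS (
    -- Step 2: rank the ids inside each year by that count.
    SELECT yr, id, cnt,
           RANK() OVER (PARTITION BY yr ORDER BY cnt DESC) AS rnk
    FROM counts
)
SELECT yr, id, cnt
FROM ranked
WHERE rnk = 1
ORDER BY yr
"""

rows = conn.execute(query).fetchall()
print(rows)  # [('2014', 2, 2), ('2015', 1, 2)]
```

Because RANK gives tied ids the same rank, a year with a tie would return every tied id; swapping in ROW_NUMBER would keep one of them.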

This should do what you're after, I think:
with sample_data as (select 1 id, to_date('12/03/2015', 'dd/mm/yyyy') time from dual union all
select 2 id, to_date('23/04/2014', 'dd/mm/yyyy') time from dual union all
select 2 id, to_date('01/12/2014', 'dd/mm/yyyy') time from dual union all
select 1 id, to_date('01/12/2015', 'dd/mm/yyyy') time from dual union all
select 3 id, to_date('05/11/2015', 'dd/mm/yyyy') time from dual)
-- End of creating a subquery to mimic a table called "sample_data" containing your input data.
-- See SQL below:
select yr,
id most_frequent_id,
cnt_id_yr cnt_of_most_freq_id
from (select to_char(time, 'yyyy') yr,
id,
count(*) cnt_id_yr,
dense_rank() over (partition by to_char(time, 'yyyy') order by count(*) desc) dr
from sample_data
group by to_char(time, 'yyyy'),
id)
where dr = 1;
YR MOST_FREQUENT_ID CNT_OF_MOST_FREQ_ID
---- ---------------- -------------------
2014 2 2
2015 1 2

Three-Column Group-By in Oracle?

I have two different result sets:
Result 1:
+--------------+--------------+
| YEAR_MONTH | UNIQUE_USERS |
+--------------+--------------+
| 2013-08 | 1111 |
+--------------+--------------+
| 2013-09 | 2222 |
+--------------+--------------+
Result 2:
+--------------+----------------+
| YEAR_MONTH | UNIQUE_ACTIONS |
+--------------+----------------+
| 2013-08 | 111111111 |
+--------------+----------------+
| 2013-09 | 222222222 |
+--------------+----------------+
The code for Result 1:
SELECT TO_CHAR(ACCESS_DATE, 'yyyy-mm') YEAR_MONTH, COUNT(DISTINCT EMPLOYEE_ID) UNIQUE_USERS
FROM CORE.DATE_TEST
GROUP BY TO_CHAR(ACCESS_DATE, 'yyyy-mm')
ORDER BY YEAR_MONTH ASC
The code for Result 2:
SELECT TO_CHAR(ACCESS_DATE, 'yyyy-mm') YEAR_MONTH, COUNT(DISTINCT EMPLOYEE_ACTION) UNIQUE_ACTIONS
FROM CORE.ACTION_TEST
GROUP BY TO_CHAR(ACCESS_DATE, 'yyyy-mm')
ORDER BY YEAR_MONTH ASC
However, I've tried to group them by simply doing this:
SELECT TO_CHAR(ACCESS_DATE, 'yyyy-mm') YEAR_MONTH, COUNT(DISTINCT EMPLOYEE_ID) UNIQUE_USERS, COUNT(DISTINCT EMPLOYEE_ACTION) UNIQUE_ACTIONS
FROM CORE.DATE_TEST, CORE.ACTION_TEST
GROUP BY TO_CHAR(ACCESS_DATE, 'yyyy-mm')
ORDER BY YEAR_MONTH ASC
And that doesn't work. I've also tried an INNER JOIN on the second result set (result set 1 had t1 as a variable name, and result set 2 had t2), and got the error, Invalid Identifier, on t2.
This is my desired output:
+--------------+--------------+----------------+
| YEAR_MONTH | UNIQUE_USERS | UNIQUE_ACTIONS |
+--------------+--------------+----------------+
| 2013-08 | 1111 | 111111111 |
+--------------+--------------+----------------+
| 2013-09 | 2222 | 222222222 |
+--------------+--------------+----------------+
How do I do that correctly? It doesn't necessarily need to be a three-column group by; it just needs to work.

Try:
select a.YEAR_MONTH, a.UNIQUE_USERS, b.UNIQUE_ACTIONS
from (
SELECT TO_CHAR(ACCESS_DATE, 'yyyy-mm') YEAR_MONTH,
COUNT(DISTINCT EMPLOYEE_ID) UNIQUE_USERS
FROM CORE.DATE_TEST
GROUP BY TO_CHAR(ACCESS_DATE, 'yyyy-mm')
) a
join (
SELECT TO_CHAR(ACCESS_DATE, 'yyyy-mm') YEAR_MONTH,
COUNT(DISTINCT EMPLOYEE_ACTION) UNIQUE_ACTIONS
FROM CORE.ACTION_TEST
GROUP BY TO_CHAR(ACCESS_DATE, 'yyyy-mm')
) b
on a.YEAR_MONTH = b.YEAR_MONTH
order by a.YEAR_MONTH ASC

If both tables have many records, a Cartesian join is a poor solution and may not actually provide the answer you want. I'd solve this problem something like this:
SELECT TO_CHAR (COALESCE (t1.year_month, t2.year_month), 'yyyy-mm')
AS year_month,
t1.unique_users,
t2.unique_actions
FROM (SELECT TRUNC (access_date, 'mm') AS year_month,
COUNT (DISTINCT employee_id) AS unique_users
FROM core.date_test
GROUP BY TRUNC (access_date, 'mm')) t1
FULL OUTER JOIN
(SELECT TRUNC (access_date, 'mm') AS year_month,
COUNT (DISTINCT employee_action) AS unique_actions
FROM core.action_test
GROUP BY TRUNC (access_date, 'mm')) t2
ON t1.year_month = t2.year_month
ORDER BY COALESCE (t1.year_month, t2.year_month) ASC
The reason a Cartesian join performs poorly is that every row in the first table must be matched with every row in the second table before the group by is applied. If each table has only 1000 rows, that's 1,000,000 values that the database has to construct.
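
The difference can be illustrated with a small sketch in Python and SQLite. The table contents are invented, the schema prefix is dropped, and an inner join replaces FULL OUTER JOIN because older SQLite versions lack it, so a month present in only one table would be missing from this version.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE date_test (access_date TEXT, employee_id INTEGER)")
conn.execute("CREATE TABLE action_test (access_date TEXT, employee_action TEXT)")
# Hypothetical sample rows.
conn.executemany(
    "INSERT INTO date_test VALUES (?, ?)",
    [("2013-08-01", 1), ("2013-08-15", 2), ("2013-09-03", 1)],
)
conn.executemany(
    "INSERT INTO action_test VALUES (?, ?)",
    [
        ("2013-08-02", "login"),
        ("2013-08-02", "view"),
        ("2013-08-20", "login"),
        ("2013-09-05", "edit"),
    ],
)

# Aggregate each table down to one row per month first, then join the small results.
query = """
SELECT t1.year_month, t1.unique_users, t2.unique_actions
FROM (SELECT strftime('%Y-%m', access_date) AS year_month,
             COUNT(DISTINCT employee_id) AS unique_users
      FROM date_test
      GROUP BY strftime('%Y-%m', access_date)) t1
JOIN (SELECT strftime('%Y-%m', access_date) AS year_month,
             COUNT(DISTINCT employee_action) AS unique_actions
      FROM action_test
      GROUP BY strftime('%Y-%m', access_date)) t2
  ON t1.year_month = t2.year_month
ORDER BY t1.year_month
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('2013-08', 2, 2), ('2013-09', 1, 1)]

# For comparison: the number of intermediate rows a Cartesian join builds.
cross_rows = conn.execute(
    "SELECT COUNT(*) FROM date_test, action_test"
).fetchone()[0]
print(cross_rows)  # 12, i.e. 3 rows x 4 rows
```

The pre-aggregated join only ever matches one row per month from each side, while the Cartesian form grows with the product of the two table sizes.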

SELECT TO_CHAR(d.ACCESS_DATE, 'yyyy-mm') YEAR_MONTH, COUNT(DISTINCT d.EMPLOYEE_ID) UNIQUE_USERS, COUNT(DISTINCT act.EMPLOYEE_ACTION) UNIQUE_ACTIONS
FROM CORE.DATE_TEST d, CORE.ACTION_TEST act
WHERE TO_CHAR(d.ACCESS_DATE, 'yyyy-mm') = TO_CHAR(act.ACCESS_DATE, 'yyyy-mm')
GROUP BY TO_CHAR(d.ACCESS_DATE, 'yyyy-mm')
ORDER BY YEAR_MONTH ASC
Hopefully this works: each column needs to be qualified with the alias of the table it comes from, since both tables have an ACCESS_DATE column.

Select distinct users group by time range

I have a table with the following info
|date | user_id | week_beg | month_beg|
SQL to create table with test values:
CREATE TABLE uniques
(
date DATE,
user_id INT,
week_beg DATE,
month_beg DATE
)
INSERT INTO uniques VALUES ('2013-01-01', 1, '2012-12-30', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-03', 3, '2012-12-30', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-06', 4, '2013-01-06', '2013-01-01')
INSERT INTO uniques VALUES ('2013-01-07', 4, '2013-01-06', '2013-01-01')
INPUT TABLE:
| date | user_id | week_beg | month_beg |
| 2013-01-01 | 1 | 2012-12-30 | 2013-01-01 |
| 2013-01-03 | 3 | 2012-12-30 | 2013-01-01 |
| 2013-01-06 | 4 | 2013-01-06 | 2013-01-01 |
| 2013-01-07 | 4 | 2013-01-06 | 2013-01-01 |
OUTPUT TABLE:
| date | time_series | cnt |
| 2013-01-01 | D | 1 |
| 2013-01-01 | W | 1 |
| 2013-01-01 | M | 1 |
| 2013-01-03 | D | 1 |
| 2013-01-03 | W | 2 |
| 2013-01-03 | M | 2 |
| 2013-01-06 | D | 1 |
| 2013-01-06 | W | 1 |
| 2013-01-06 | M | 3 |
| 2013-01-07 | D | 1 |
| 2013-01-07 | W | 1 |
| 2013-01-07 | M | 3 |
I want to calculate the number of distinct user_id's for a date:
1. For that date
2. For that week up to that date (week to date)
3. For the month up to that date (month to date)
1 is easy to calculate.
For 2 and 3 I am trying to use queries such as these:
SELECT
date,
'W' AS "time_series",
(COUNT DISTINCT user_id) COUNT (user_id) OVER (PARTITION BY week_beg) AS "cnt"
FROM user_subtitles
SELECT
date,
'M' AS "time_series",
(COUNT DISTINCT user_id) COUNT (user_id) OVER (PARTITION BY month_beg) AS "cnt"
FROM user_subtitles
Postgres does not allow window functions for DISTINCT calculation, so this approach does not work.
I have also tried out a GROUP BY approach, but it does not work as it gives me numbers for whole week/months.
What's the best way to approach this problem?

Count all rows
SELECT date, '1_D' AS time_series, count(DISTINCT user_id) AS cnt
FROM uniques
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W', count(*) OVER (PARTITION BY week_beg ORDER BY date)
FROM uniques
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M', count(*) OVER (PARTITION BY month_beg ORDER BY date)
FROM uniques
ORDER BY 1, time_series
Your columns week_beg and month_beg are 100 % redundant and can easily be replaced by
date_trunc('week', date + 1) - 1 and date_trunc('month', date) respectively.
Your week seems to start on Sunday (off by one), therefore the + 1 .. - 1.
The default frame of a window function with ORDER BY in the OVER clause is RANGE BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW. That's exactly what you need.
Use UNION ALL, not UNION.
Your unfortunate choice for time_series (D, W, M) does not sort well, so I renamed the values to make the final ORDER BY easier.
This query can deal with multiple rows per day. Counts include all peers for a day.
More about DISTINCT ON:
Select first row in each GROUP BY group?
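
The default-frame behaviour mentioned above can be observed with a short sketch in Python and SQLite, using the question's test rows. DISTINCT ON is PostgreSQL-specific, so plain DISTINCT is used here instead; that substitution is an assumption made only so the example runs on SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE uniques ("date" TEXT, user_id INTEGER, week_beg TEXT, month_beg TEXT)'
)
# Test rows from the question.
conn.executemany(
    "INSERT INTO uniques VALUES (?, ?, ?, ?)",
    [
        ("2013-01-01", 1, "2012-12-30", "2013-01-01"),
        ("2013-01-03", 3, "2012-12-30", "2013-01-01"),
        ("2013-01-06", 4, "2013-01-06", "2013-01-01"),
        ("2013-01-07", 4, "2013-01-06", "2013-01-01"),
    ],
)

# With ORDER BY and no explicit frame, count(*) runs from the start of the
# partition up to the current row and its peers: a week-to-date row count.
query = """
SELECT DISTINCT "date",
       COUNT(*) OVER (PARTITION BY week_beg ORDER BY "date") AS week_to_date_rows
FROM uniques
ORDER BY "date"
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

On 2013-01-07 this yields 2 rather than 1, because user 4 appears on both days of that week: the query counts rows, not distinct users.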
DISTINCT users per day
To count every user only once per day, use a CTE with DISTINCT ON:
WITH x AS (SELECT DISTINCT ON (1,2) date, user_id FROM uniques)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM x
GROUP BY 1
UNION ALL
SELECT DISTINCT ON (1)
date, '2_W'
,count(*) OVER (PARTITION BY (date_trunc('week', date + 1)::date - 1)
ORDER BY date)
FROM x
UNION ALL
SELECT DISTINCT ON (1)
date, '3_M'
,count(*) OVER (PARTITION BY date_trunc('month', date) ORDER BY date)
FROM x
ORDER BY 1, 2
DISTINCT users over dynamic period of time
You can always resort to correlated subqueries. They tend to be slow with big tables!
Building on the previous queries:
WITH du AS (SELECT date, user_id FROM uniques GROUP BY 1,2)
,d AS (
SELECT date
,(date_trunc('week', date + 1)::date - 1) AS week_beg
,date_trunc('month', date)::date AS month_beg
FROM uniques
GROUP BY 1
)
SELECT date, '1_D' AS time_series, count(user_id) AS cnt
FROM du
GROUP BY 1
UNION ALL
SELECT date, '2_W', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.week_beg AND d.date )
FROM d
GROUP BY date, week_beg
UNION ALL
SELECT date, '3_M', (SELECT count(DISTINCT user_id) FROM du
WHERE du.date BETWEEN d.month_beg AND d.date)
FROM d
GROUP BY date, month_beg
ORDER BY 1,2;
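
As a rough illustration of the correlated-subquery idea, here is a sketch in Python and SQLite on the question's test rows. It relies on the stored week_beg and month_beg columns instead of date_trunc, which SQLite does not have, and it returns the three counts side by side as columns rather than as D/W/M rows; both choices are simplifications for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE uniques ("date" TEXT, user_id INTEGER, week_beg TEXT, month_beg TEXT)'
)
# Test rows from the question.
conn.executemany(
    "INSERT INTO uniques VALUES (?, ?, ?, ?)",
    [
        ("2013-01-01", 1, "2012-12-30", "2013-01-01"),
        ("2013-01-03", 3, "2012-12-30", "2013-01-01"),
        ("2013-01-06", 4, "2013-01-06", "2013-01-01"),
        ("2013-01-07", 4, "2013-01-06", "2013-01-01"),
    ],
)

query = """
WITH d AS (SELECT DISTINCT "date", week_beg, month_beg FROM uniques)
SELECT d."date",
       -- Distinct users on the day itself.
       (SELECT COUNT(DISTINCT u.user_id) FROM uniques u
         WHERE u."date" = d."date") AS day_cnt,
       -- Distinct users from the start of the week up to the day.
       (SELECT COUNT(DISTINCT u.user_id) FROM uniques u
         WHERE u."date" BETWEEN d.week_beg AND d."date") AS week_cnt,
       -- Distinct users from the start of the month up to the day.
       (SELECT COUNT(DISTINCT u.user_id) FROM uniques u
         WHERE u."date" BETWEEN d.month_beg AND d."date") AS month_cnt
FROM d
ORDER BY d."date"
"""

rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The week and month columns agree with the W and M values in the question's expected output table.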
Faster with dense_rank()
@Clodoaldo came up with a major improvement: use the window function dense_rank(). Here is another idea for an optimized version. It should be even faster to exclude daily duplicates right away. The performance gain grows with the number of rows per day.
Building on a simplified and sanitized data model:
- without the redundant columns
- with day as the column name instead of date
date is a reserved word in standard SQL and a basic type name in PostgreSQL, and shouldn't be used as an identifier.
CREATE TABLE uniques(
day date -- instead of "date"
,user_id int
);
Improved query:
WITH du AS (
SELECT DISTINCT ON (1, 2)
day, user_id
,date_trunc('week', day + 1)::date - 1 AS week_beg
,date_trunc('month', day)::date AS month_beg
FROM uniques
)
SELECT day, count(user_id) AS d, max(w) AS w, max(m) AS m
FROM (
SELECT user_id, day
,dense_rank() OVER(PARTITION BY week_beg ORDER BY user_id) AS w
,dense_rank() OVER(PARTITION BY month_beg ORDER BY user_id) AS m
FROM du
) s
GROUP BY day
ORDER BY day;
SQL Fiddle demonstrating the performance of 4 faster variants. It depends on your data distribution which is fastest for you.
All of them are about 10x as fast as the correlated subqueries version (which isn't bad for correlated subqueries).

Without correlated subqueries:
with u as (
select
"date", user_id,
date_trunc('week', "date" + 1)::date - 1 week_beg,
date_trunc('month', "date")::date month_beg
from uniques
)
select
"date", count(distinct user_id) D,
max(week_dr) W, max(month_dr) M
from (
select
user_id, "date",
dense_rank() over(partition by week_beg order by user_id) week_dr,
dense_rank() over(partition by month_beg order by user_id) month_dr
from u
) s
group by "date"
order by "date"

Try
SELECT
*
FROM
(
SELECT dates, count(user_id), 'D' as timesereis FROM users_data GROUP BY dates
UNION
SELECT max(dates), count(user_id), 'W' FROM users_data GROUP BY date_part('year',dates), date_part('week',dates)
UNION
SELECT max(dates), count(user_id), 'M' FROM users_data GROUP BY date_part('year',dates), date_part('month',dates)
) tEMP order by dates, timesereis

Try queries like this
SELECT count(distinct user_id), date_format(date, '%Y-%m-%d') as date_period
FROM uniques
GROUP By date_period
