Count the total (N) of duplicates in a column - sql

I'm attempting to count the total number of duplicates in a column (not the individual duplicates).
from outputs
GROUP BY journal_id
HAVING ( COUNT(doi) > 1 )
WHERE journal_id = 1
SQL TABLE
doi journal_id
123 1
123 2
123 1
124 1
The expected answer is 2

The number of entire row duplicates can be calculated by taking the total number of rows and subtracting the number of distinct rows:
select a.cnt_all - d.cnt_individual
from (select count(*) as cnt_all
from outputs
) a cross join
(select count(*) as cnt_individual
from (select distinct *
from outputs
) d
) d;
If you know your columns and your database supports multiple arguments to count(distinct), this can be radically simplified to:
select count(*) - count(distinct doi, journal_id)
from outputs;
Or, if your database doesn't support this:
select sum(cnt - 1)
from (select doi, journal_id, count(*) as cnt
from outputs
group by doi, journal_id
) o;
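To see what these queries return on the sample table from the question, here is a minimal sketch using Python's built-in sqlite3 module (an assumption made only for illustration; the thread does not name a database). SQLite does not accept multiple arguments to count(distinct), so the subtraction variant uses a derived table. Note that both forms count only the surplus copies of each row, which gives 1 for the sample data rather than the 2 the question expects.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outputs (doi INTEGER, journal_id INTEGER)")
conn.executemany(
    "INSERT INTO outputs VALUES (?, ?)",
    [(123, 1), (123, 2), (123, 1), (124, 1)],  # sample rows from the question
)

# Total rows minus distinct rows: counts the surplus copies only.
surplus_by_subtraction = conn.execute(
    """
    SELECT (SELECT COUNT(*) FROM outputs)
         - (SELECT COUNT(*)
            FROM (SELECT DISTINCT doi, journal_id FROM outputs))
    """
).fetchone()[0]

# Same number via GROUP BY: each group contributes (size - 1) surplus rows.
surplus_by_grouping = conn.execute(
    """
    SELECT SUM(cnt - 1)
    FROM (SELECT doi, journal_id, COUNT(*) AS cnt
          FROM outputs
          GROUP BY doi, journal_id) o
    """
).fetchone()[0]

print(surplus_by_subtraction, surplus_by_grouping)  # prints "1 1"
```

Only the pair (123, 1) appears twice, so one row is surplus across the whole table.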

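Just sum up the counts of the individual duplicated values for the journal. Most databases do not allow one aggregate nested directly inside another, so the per-doi counts go in a derived table:
SELECT
SUM(cnt) AS total_duplicates
FROM
(SELECT COUNT(*) AS cnt
FROM outputs
WHERE journal_id = 1
GROUP BY doi
HAVING COUNT(*) > 1
) d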
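As a check against the expected answer of 2, the sketch below counts every row that belongs to a duplicated doi within one journal. It assumes Python's sqlite3 module, and the helper name count_duplicate_rows is hypothetical; the per-doi counting is done in a derived table because aggregates cannot be nested directly in SQLite.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE outputs (doi INTEGER, journal_id INTEGER)")
conn.executemany(
    "INSERT INTO outputs VALUES (?, ?)",
    [(123, 1), (123, 2), (123, 1), (124, 1)],  # sample rows from the question
)


def count_duplicate_rows(conn, journal_id):
    """Hypothetical helper: rows in the journal whose doi occurs more than once."""
    row = conn.execute(
        """
        SELECT COALESCE(SUM(cnt), 0)
        FROM (SELECT COUNT(*) AS cnt
              FROM outputs
              WHERE journal_id = ?
              GROUP BY doi
              HAVING COUNT(*) > 1) d
        """,
        (journal_id,),
    ).fetchone()
    return row[0]


print(count_duplicate_rows(conn, 1))  # prints 2
print(count_duplicate_rows(conn, 2))  # prints 0
```

COALESCE turns the NULL that SUM returns for an empty set into 0, so a journal without duplicates reports 0.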

Related

Finding unique combination of columns associated with 1 non-unique column

Here's my table:
ItemID ItemName ItemBatch TrackingNumber
a bag 1 498239
a bag 1 498239
a bag 1 958103
b paper 2 123444
b paper 2 123444
I'm trying to find occurrences of ItemID + ItemName + ItemBatch that have a non-unique TrackingNumber. So in the example above, there are 3 occurrences of a bag 1 and at least 1 of those rows has a different TrackingNumber from any of the other rows. In this case 958103 is different from 498239 so it should be a hit.
For b paper 2 the TrackingNumber is unique for all the respective rows so we ignore this. Is there a query that can pull this combination of columns with 3 identical fields and 1 non-unique field?
Yet another option:
SELECT *
FROM tab
WHERE ItemBatch IN (SELECT ItemBatch
FROM tab
GROUP BY ItemBatch, TrackingNumber
HAVING COUNT(TrackingNumber) = 1)
This query finds the combinations of (ItemBatch, TrackingNumber) that occur only once, then gets all rows with the corresponding ItemBatch values.
You can use GROUP BY and HAVING
SELECT
t.ItemID,
t.ItemName,
t.ItemBatch,
COUNT(*)
FROM YourTable t
GROUP BY
t.ItemID,
t.ItemName,
t.ItemBatch
HAVING COUNT(DISTINCT TrackingNumber) > 1;
Or, if you want each individual row, you can use a window function. You cannot use COUNT(DISTINCT ...) in a window function, but you can simulate it with DENSE_RANK and MAX:
SELECT
t.*
FROM (
SELECT *,
Count = MAX(dr) OVER (PARTITION BY t.ItemID, t.ItemName, t.ItemBatch)
FROM (
SELECT *,
dr = DENSE_RANK() OVER (PARTITION BY t.ItemID, t.ItemName, t.ItemBatch ORDER BY t.TrackingNumber)
FROM YourTable t
) t
) t
WHERE t.Count > 1;
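Both queries can be tried on the question's sample rows. The sketch below assumes Python's sqlite3 module with SQLite 3.25 or later (needed for window functions). The "name = expression" alias form used above is T-SQL syntax, so the aliases are written with AS here, and distinct_cnt is a hypothetical replacement for the Count alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE YourTable "
    "(ItemID TEXT, ItemName TEXT, ItemBatch INTEGER, TrackingNumber INTEGER)"
)
conn.executemany(
    "INSERT INTO YourTable VALUES (?, ?, ?, ?)",
    [
        ("a", "bag", 1, 498239),
        ("a", "bag", 1, 498239),
        ("a", "bag", 1, 958103),
        ("b", "paper", 2, 123444),
        ("b", "paper", 2, 123444),
    ],
)

# One row per combination that has more than one distinct TrackingNumber.
groups = conn.execute(
    """
    SELECT ItemID, ItemName, ItemBatch, COUNT(*)
    FROM YourTable
    GROUP BY ItemID, ItemName, ItemBatch
    HAVING COUNT(DISTINCT TrackingNumber) > 1
    """
).fetchall()

# Every individual row of those combinations: the highest DENSE_RANK in a
# partition equals the number of distinct TrackingNumber values in it.
detail = conn.execute(
    """
    SELECT ItemID, ItemName, ItemBatch, TrackingNumber
    FROM (SELECT *,
                 MAX(dr) OVER (PARTITION BY ItemID, ItemName, ItemBatch)
                     AS distinct_cnt
          FROM (SELECT *,
                       DENSE_RANK() OVER (
                           PARTITION BY ItemID, ItemName, ItemBatch
                           ORDER BY TrackingNumber) AS dr
                FROM YourTable))
    WHERE distinct_cnt > 1
    ORDER BY TrackingNumber
    """
).fetchall()

print(groups)  # [('a', 'bag', 1, 3)]
print(detail)
```

Only the "a bag 1" combination is reported, since "b paper 2" has a single TrackingNumber across all of its rows.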

Total of group by divided by total of group by after where statement

I get the total number after grouping by and the total of the filled rows after grouping by.
SELECT X, COUNT(*) as total
FROM table
WHERE Y IS NULL
GROUP BY X
For example, this returns 1000 values for row 1.
SELECT X, COUNT(*) as total
FROM table
GROUP BY X
For example, this returns 500 values for row 1.
Now what I want to get is the proportion. Meaning I want to get the total of my group by statement divided by the total of my group by statement with the WHERE filter. Is this possible?
You can use a CASE expression to count the filtered rows in the same GROUP BY query, e.g.:
SELECT
X,
COUNT(
CASE WHEN Y IS NULL THEN 1 END
) as total_when_y_is_null,
COUNT(*) as total,
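COUNT(
CASE WHEN Y IS NULL THEN 1 END
) * 1.0 / COUNT(*) as desired_value
FROM table
GROUP BY X
Multiplying by 1.0 avoids integer division in databases where dividing two integers truncates the result. Feel free to change the operations as desired.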
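A small runnable sketch of the conditional count, assuming Python's sqlite3 module. The table name measurements and its rows are hypothetical (table is a reserved word, so it cannot be used unquoted). It also shows why the ratio needs a non-integer operand: SQLite, like several other databases, truncates integer division.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE measurements (X TEXT, Y INTEGER)")
conn.executemany(
    "INSERT INTO measurements VALUES (?, ?)",
    [
        ("a", None), ("a", None), ("a", 1), ("a", 2),  # 2 of 4 rows have Y NULL
        ("b", 5), ("b", None), ("b", 6), ("b", 7),     # 1 of 4 rows has Y NULL
    ],
)

rows = conn.execute(
    """
    SELECT X,
           COUNT(CASE WHEN Y IS NULL THEN 1 END)                  AS total_when_y_is_null,
           COUNT(*)                                               AS total,
           COUNT(CASE WHEN Y IS NULL THEN 1 END) / COUNT(*)       AS int_ratio,
           COUNT(CASE WHEN Y IS NULL THEN 1 END) * 1.0 / COUNT(*) AS ratio
    FROM measurements
    GROUP BY X
    ORDER BY X
    """
).fetchall()

for row in rows:
    print(row)
# ('a', 2, 4, 0, 0.5)
# ('b', 1, 4, 0, 0.25)
```

COUNT ignores NULLs, and the CASE expression yields NULL whenever Y is not NULL, so only the filtered rows are counted.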
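Join the two aggregated subqueries so that both totals are available in the same row (the column X must be qualified, and multiplying by 1.0 avoids integer division):
SELECT t1.X, (total1 * 1.0 / total2) As proportion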
FROM
(SELECT X, COUNT(*) as total1
FROM table
WHERE Y IS NULL
GROUP BY X) As t2
JOIN
(SELECT X, COUNT(*) as total2
FROM table
GROUP BY X) As t1 ON t1.X = t2.X

find records with matching values in a column

I want to return all the records that have matching phoneBoxRecordIDs in the phoneBox DB.
SELECT * FROM phoneBox where phoneBoxRecordIDs MATCH
would return:
Id phoneBoxRecordIDs colour
4 492948 Blue
9 492948 Brown
27 492948 Pink
You could group by the field and keep the groups where the count is greater than 1, but this would only return the phoneBoxRecordIDs and the number of records with that id:
SELECT Count(*) [Count]
, phoneBoxRecordIDs
FROM phoneBox
Group By phoneBoxRecordIDs
Having Count(*) > 1
If you want the rows where phoneBoxRecordIDs appear more than once, then an ANSI-standard method uses window functions:
select pb.*
from (select pb.*, count(*) over (partition by phoneBoxRecordIDs) as cnt
from phoneBox pb
) pb
where cnt > 1
order by phoneBoxRecordIDs;
You could also do this by returning records only when a matching record exists:
select pb.*
from phoneBox pb
where exists (select 1
from phoneBox pb2
where pb2.phoneBoxRecordIDs = pb.phoneBoxRecordIDs and
pb2.id <> pb.id
);
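The window-function and EXISTS variants can be compared on the rows from the question. This is a sketch assuming Python's sqlite3 module (SQLite 3.25 or later for the window function); the row with id 12 is a hypothetical non-matching record added so that the filter has something to exclude.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE phoneBox (id INTEGER, phoneBoxRecordIDs INTEGER, colour TEXT)"
)
conn.executemany(
    "INSERT INTO phoneBox VALUES (?, ?, ?)",
    [
        (4, 492948, "Blue"),
        (9, 492948, "Brown"),
        (12, 100001, "Green"),  # hypothetical row with a unique record id
        (27, 492948, "Pink"),
    ],
)

# Window function: attach the per-value count to every row, then filter.
by_window = conn.execute(
    """
    SELECT id, phoneBoxRecordIDs, colour
    FROM (SELECT pb.*,
                 COUNT(*) OVER (PARTITION BY phoneBoxRecordIDs) AS cnt
          FROM phoneBox pb)
    WHERE cnt > 1
    ORDER BY id
    """
).fetchall()

# EXISTS: keep a row only if another row shares its phoneBoxRecordIDs.
by_exists = conn.execute(
    """
    SELECT pb.id, pb.phoneBoxRecordIDs, pb.colour
    FROM phoneBox pb
    WHERE EXISTS (SELECT 1
                  FROM phoneBox pb2
                  WHERE pb2.phoneBoxRecordIDs = pb.phoneBoxRecordIDs
                    AND pb2.id <> pb.id)
    ORDER BY pb.id
    """
).fetchall()

print(by_window)
print(by_exists)
```

Both queries return the rows with ids 4, 9 and 27. The EXISTS form relies on id being unique per row.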

Aggregate values by range

I have a table of users with profit and number of transactions columns:
...
I want to average the profit of users in three groups: those with a relatively large number of transactions, an average number of transactions, and a small number of transactions.
To get range series I use generate_series:
SELECT generate_series(
max(transactions_year)/3,
max(transactions_year),
max(transactions_year)/3
)
FROM portfolios_static
And I do get three categories:
I need a table like this one:
How do I get average profit of users which belong to each category and count number of users that belong to each category?
This can be simpler and faster. Assuming no entry has 0 deals:
SELECT y.max_deals AS deals
, avg(profit_perc_year) AS avg_profit
, count(*) AS users
FROM (
SELECT (generate_series (0,2) * x.max_t)/3 AS min_deals
,(generate_series (1,3) * x.max_t)/3 AS max_deals
FROM (SELECT max(transactions_year) AS max_t FROM portfolios_static) x
) y
JOIN portfolios_static p ON p.transactions_year > min_deals
AND p.transactions_year <= max_deals
GROUP BY 1
ORDER BY 1;
This will do:
with s as
(SELECT max(transactions_year)/3 series FROM portfolios_static
UNION ALL
SELECT max(transactions_year)/3 * 2 series FROM portfolios_static
UNION ALL
SELECT max(transactions_year) series FROM portfolios_static
),
s1 as
(SELECT generate_series(
max(transactions_year)/3,
max(transactions_year),
max(transactions_year)/3
) AS series
FROM portfolios_static
),
srn as
(SELECT series,
row_number() over (order by series) rn
from s),
prepost as
(select coalesce(pre.series,0) as pre,
post.series as post
from srn post
left join srn pre on pre.rn = post.rn-1)
select pp.post number_of_deals_or_less,
avg(profit_perc_year) average_profit,
count(*) number_of_users
from portfolios_static p INNER JOIN prepost pp
ON p.transactions_year > pp.pre AND p.transactions_year <= pp.post
GROUP by pp.post
order by pp.post;
BTW, I had to ditch generate_series and use plain UNION ALL, because generate_series does not return the proper MAX() value when the max value is not divisible by 3. For example, if you replace the srn CTE with
srn as
(SELECT series,
row_number() over (order by series) rn
from s1), -- use generate_series
You will notice that in some cases the last value in the series is less than max(transactions_year).
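The arithmetic behind this can be sketched in plain Python. The function below only mimics generate_series for positive integer arguments (inclusive stop), and max_t = 100 is a hypothetical maximum chosen because it is not divisible by 3. The last two lists mirror the UNION ALL boundaries and the scaled generate_series(1,3) boundaries from the other answer.

```python
def generate_series(start, stop, step):
    """Rough stand-in for generate_series with positive integers (stop inclusive)."""
    return list(range(start, stop + 1, step))


max_t = 100        # hypothetical max(transactions_year), not divisible by 3
step = max_t // 3  # integer division truncates, as it does for integer columns

# Stepping by max/3 from max/3: the last boundary falls short of the maximum.
series_bounds = generate_series(step, max_t, step)

# Boundaries built explicitly, as in the UNION ALL CTE.
union_bounds = [max_t // 3, max_t // 3 * 2, max_t]

# Boundaries built by scaling 1..3, as in (generate_series(1,3) * max_t) / 3.
scaled_bounds = [(i * max_t) // 3 for i in (1, 2, 3)]

print(series_bounds)  # [33, 66, 99]
print(union_bounds)   # [33, 66, 100]
print(scaled_bounds)  # [33, 66, 100]
```

With series_bounds, a user with exactly 100 transactions would match no range at all, which is why the upper boundary has to be the maximum itself.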

grouping and aggregates with subqueries

I have a query that is designed to find the number of people who went to a hospital more than once. What I have works, but is there a way to do it without the subquery?
SELECT count(*) as counts, hospitals.hospitalname
FROM Patient INNER JOIN
hospitals ON Patient.hospitalnpi = hospitals.npi
WHERE (hospitals.hospitalname = 'X')
group by patientid, hospitalname
having count(patient.patientid) >1
order by count(*) desc
This will always return the number of correct rows (30), but not the number 30. If I remove the group by patientid then I get the entire result set returned.
I solved this problem by doing
select COUNT(*),hospitalname
from
(
SELECT count(*) as counts,hospitals.hospitalname
FROM hospitals INNER JOIN
Patient ON hospitals.npi = Patient.hospitalnpi
group by patientid, hospitals.hospitalname
having count(patient.patientid) >1
) t
group by t.hospitalname
order by t.hospitalname desc
I feel that there has to be a more elegant solution than using subqueries all the time. How could this be improved?
sample data from first query
row # revisits
1 2
2 2
3 2
4 2
sample data from second, working query
row# hosp. name revisitAggregate
1 x 30
2 y 15
3 z 5
Simple one-to-many relationship between patient and hospitals
It's super hacky, but here you are:
SELECT TOP 1
ROW_NUMBER() OVER (order by patient.patientid) as Count
FROM
Patient
INNER JOIN hospitals
ON Patient.hospitalnpi = hospitals.npi
WHERE
(hospitals.hospitalname = 'X')
GROUP BY
patientid,
hospitalname
HAVING
count(patient.patientid) >1
ORDER BY
Count desc
select distinct hospitalname, count(*) over (partition by hospitalname) from (
SELECT hospitalname, count(*) over (partition by patientid,
hospitals.hospitalname) as counter
FROM hospitals INNER JOIN
Patient ON hospitals.npi = Patient.hospitalnpi
WHERE (hospitals.hospitalname = 'X')
) Z
where counter > 1
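For comparison, here is a sketch that runs the asker's working derived-table query next to a variant without a derived table. The variant is not taken from the answers above: it relies on window functions being evaluated after GROUP BY and HAVING, so COUNT(*) OVER counts the surviving patient groups per hospital. It assumes Python's sqlite3 module (SQLite 3.25 or later), and the sample rows are hypothetical, with one Patient row per visit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hospitals (npi INTEGER, hospitalname TEXT)")
conn.execute("CREATE TABLE Patient (patientid INTEGER, hospitalnpi INTEGER)")
conn.executemany("INSERT INTO hospitals VALUES (?, ?)", [(1, "X"), (2, "Y")])
conn.executemany(
    "INSERT INTO Patient VALUES (?, ?)",
    [
        (1, 1), (1, 1),          # patient 1 visited X twice
        (2, 1),                  # patient 2 visited X once
        (3, 1), (3, 1), (3, 1),  # patient 3 visited X three times
        (4, 2), (4, 2),          # patient 4 visited Y twice
        (5, 2),                  # patient 5 visited Y once
    ],
)

# The asker's approach: count the groups produced by an inner GROUP BY.
with_subquery = sorted(conn.execute(
    """
    SELECT t.hospitalname, COUNT(*)
    FROM (SELECT h.hospitalname
          FROM hospitals h
          JOIN Patient p ON h.npi = p.hospitalnpi
          GROUP BY p.patientid, h.hospitalname
          HAVING COUNT(*) > 1) t
    GROUP BY t.hospitalname
    """
).fetchall())

# Variant without a derived table: the window count runs over the grouped rows.
with_window = sorted(conn.execute(
    """
    SELECT DISTINCT h.hospitalname,
           COUNT(*) OVER (PARTITION BY h.hospitalname) AS revisiting_patients
    FROM hospitals h
    JOIN Patient p ON h.npi = p.hospitalnpi
    GROUP BY p.patientid, h.hospitalname
    HAVING COUNT(*) > 1
    """
).fetchall())

print(with_subquery)  # [('X', 2), ('Y', 1)]
print(with_window)    # [('X', 2), ('Y', 1)]
```

Whether the single-level form is more elegant is a matter of taste; it is shorter, but the derived table arguably makes the two counting steps easier to read.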