Is there a way to use DISTINCT and COUNT(*) together to bulletproof your code against DUPLICATE entries? - sql

I got help with a function yesterday to correctly get the count of multiple items in a column based on multiple criteria/columns. However, if there is a way to get the DISTINCT count of all the entries in the table based on aggregated GROUP BY statement.
SELECT TIME = ap.day,
acms.tenantId,
acms.CallingService,
policyList = ltrim(sp.value),
policyInstanceList = ltrim(dp.value),
COUNT(*) AS DISTINCTCount
FROM dbo.acms_data acms
CROSS APPLY string_split(acms.policyList, ',') sp
CROSS APPLY string_split(acms.policyInstanceList, ',') dp
CROSS APPLY (select day = convert(date, acms.[Time])) ap
GROUP BY ap.day, acms.tenantId, sp.value, dp.value, acms.CallingService
I would just like to know if there would be a way to see if there is a workaround for using DISTINCT and Count(*) together and whether or not it would affect my results to make this algorithm potentially invulnerable to duplicate entries.
The reason why I have to use COUNT(*) is because I am aggregating based on every column in the table not just a specific column or multiple.

We can use DISTINCT with COUNT together like this example.
USE AdventureWorks2012
GO
-- This query shows 290 JobTitle
SELECT COUNT(JobTitle) Total_JobTitle
FROM [HumanResources].[Employee]
GO
-- This query shows only 67 JobTitle
SELECT COUNT( DISTINCT JobTitle) Total_Distinct_JobTitle
FROM [HumanResources].[Employee]
GO

Related

Postgres: Count multiple events for distinct dates

People of Stack Overflow!
Thanks for taking the time to read this question. What I am trying to accomplish is to pivot some data all from just one table.
The original table has multiple datetime entries of specific events (e.g. when the customer was added add_time and when the customer was lost lost_time).
This is one part of two rows of the deals table:
id
add_time
last_mail_time
lost_time
5
2020-03-24 09:29:24
2020-04-03 13:20:29
NULL
310
2020-03-24 09:29:24
NULL
2020-04-03 13:20:29
I want to create a view of this table. A view that has one row for each distinct date and counts the number of events at this specific time.
This is the goal (times do not match with the example!):
I have working code, like this:
SELECT DISTINCT
change_datetime,
(SELECT COUNT(add_time) as add_time_count FROM deals WHERE add_time::date = change_datetime),
(SELECT COUNT(lost_time) as lost_time_count FROM deals WHERE lost_time::date = change_datetime)
FROM (
SELECT
add_time::date AS change_datetime
FROM
deals
UNION ALL
SELECT
lost_time::date AS change_datetime
FROM
deals
) AS foo
WHERE change_datetime IS NOT NULL
ORDER BY
change_datetime;
but this has some ugly O(n2) queries and takes a lot of time.
Is there a better, more performant way to achieve this?
Thanks!!
You can use a lateral join to unpivot and then aggregate:
select t::date,
count(*) filter (where which = 'add'),
count(*) filter (where which = 'mail'),
count(*) filter (where which = 'lost')
from deals d cross join lateral
(values (add_time, 'add'),
(last_mail_time, 'mail'),
(lost_time, 'lost')
) v(t, which)
group by t::date;

Aggregating based on GROUPING of multiple columns

I am trying to subquery and aggregate in SQL after doing an initial query with multiple joins. My ultimate goal is to get a count (or a sum) of specimens tested based on a grouping of multiple columns. This is slightly different from SQL Server query - Selecting COUNT(*) with DISTINCT and SQL Server: aggregate error on grouping.
The three tables that I use (PERSON, SPECIMEN, TEST), have 1-many relationships. So PERSON has many SPECIMENS and those SPECIMENS have many TESTS. I did three inner joins to combine these tables plus an additional table (ANALYSIS).
WITH TALLY as (
SELECT PERSON.NAME, PERSON.PHASE, TEST.DATE_STARTED, TEST.ANALYSIS, SPECIMEN.GROUP, TEST.STATUS,
ANALYSIS.ANALYSIS_TYPE, SPECIMEN.SPECIMEN_NUMBER
FROM DB.TEST
INNER JOIN
DB.SAMPLE ON
TEST.SPECIMEN_NUMBER = SPECIMEN.SPECIMEN_NUMBER
INNER JOIN
DB.PRODUCT ON
SPECIMEN.PERSON = PERSON.NAME
INNER JOIN
DB.ANALYSIS ON
TEST.ANALYSIS = ANALYSIS.NAME
WHERE PERSON.NAME = 'Joe'
AND TEST.DATE_STARTED >= '20-DEC-16' AND TEST.DATE_STARTED <='01-APR-18'
AND PERSON.PHASE = 'PHASE1'
ORDER BY TEST.DATE_STARTED)
SELECT COUNT(DISTINCT ANALYSIS) as SPECIMEN_COUNT, DATE_STARTED, ANALYSIS, STATUS, GROUP, ANALYSIS_TYPE
FROM TALLY
GROUP BY DATE_STARTED, ANALYSIS, STATUS, GROUP, ANALYSIS_TYPE
ORDER BY DATE_STARTED;
This gives me the repeated columns: first grouping repeated 4 times
What I am trying to see is: aggregated first grouping with total count
Any thoughts as to what is missing? SUM instead of COUNT or in addition to COUNT creates an error. Thanks in advance!
9/17/2020 Update: I have tried adding a subquery because I also need to use a new column of metadata (ANALYSIS_TYPE_ALIAS) which is created in the first query through a CASE STATEMENT(...). I have also tried using another subquery with inner join to count based on those conditions to a temp table, but still cannot seem to aggregate to flatten the table. Here is my current attempt:
WITH TALLY as (
SELECT PERSON.NAME, PERSON.PHASE, TEST.DATE_STARTED, TEST.ANALYSIS, SPECIMEN.GROUP, TEST.STATUS,
ANALYSIS.ANALYSIS_TYPE...
FROM DB.TEST
INNER JOIN
DB.SAMPLE ON
TEST.SPECIMEN_NUMBER = SPECIMEN.SPECIMEN_NUMBER
INNER JOIN
DB.PRODUCT ON
SPECIMEN.PERSON = PERSON.NAME
INNER JOIN
DB.ANALYSIS ON
TEST.ANALYSIS = ANALYSIS.NAME
WHERE PERSON.NAME = 'Joe'
AND TEST.DATE_STARTED >= '20-DEC-16' AND TEST.DATE_STARTED <='01-APR-18'
AND PERSON.PHASE = 'PHASE1'
ORDER BY TEST.DATE_STARTED),
SUMMARY_COMBO AS (SELECT DISTINCT(CONCAT(CONCAT(CONCAT(CONCAT(ANALYSIS, DATE_STARTED),STATUS), GROUP), ANALYSIS_TYPE_ALIAS))AS UUID,
TALLY.NAME, TALLY.PHASE, TALLY.DATE_STARTED, TALLY.ANALYSIS, TALLY.GROUP, TALLY.STATUS, TALLY.ANALYSIS_TYPE_ALIAS
FROM TALLY)
SELECT SUMMARY_COMBO.NAME, SUMMARY_COMBO.PHASE, SUMMARY_COMBO.DATE_STARTED, SUMMARY_COMBO.ANALYSIS,SUMMARY_COMBO.GROUP, SUMMARY_COMBO.STATUS, SUMMARY_COMBO.ANALYSIS_TYPE_ALIAS,
COUNT(SUMMARY_COMBO.ANALYSIS) OVER (PARTITION BY SUMMARY_COMBO.UUID) AS SPECIMEN_COUNT
FROM SUMMARY_COMBO
ORDER BY SUMMARY_COMBO.DATE_STARTED;
This gave me the following table Shows aggregated counts, but doesn't aggregate based on unique UUID. Is there a way to take the sum of the count? I've tried to do this by storing count to a subquery and then referencing that count variable, but I am missing something in how to group the 8 columns of data that I want to show + the count of that combination of columns.
Thanks!
Just remove analysis from the group by clause, since that's the column whose distinct values you want to count. Otherwise, the query generates more groups than what you need (and the count of distinct analysis values in each group is always 1).
WITH TALLY as ( ...)
SELECT COUNT(DISTINCT ANALYSIS) as SPECIMEN_COUNT, DATE_STARTED, ANALYSIS, STATUS, GROUP, ANALYSIS_TYPE
FROM TALLY
GROUP BY DATE_STARTED, STATUS, GROUP, ANALYSIS_TYPE
ORDER BY DATE_STARTED;

Pivot in SQL without Aggregate function

I have a scenario Where I have a table like
Table View
and What Output I want is
If your argument is "I will only ever have one value or no values, therefore I don't want an aggregate", realise that there are several aggregates that, if they're only passed a single value to aggregate, will return that value back as their result. MIN and MAX come to mind. SUM also works for numeric data.
Therefore the solution to specifying a PIVOT without an aggregate is instead to specify such a "pass through" aggregate here.
Basically, PIVOT internally works a lot the same as GROUP BY. Except the grouping columns are all columns in the current result set other than the column mentioned in the aggregate part of the PIVOT specification. And just as with the rules for the SELECT clause when GROUP BY is used1, every column either needs to be a grouping column or contained in an aggregate.
1Grumble, grumble, older mysql grumble. Although the defaults are more sensible from 5.7.5 up.
Try this:
Demo
with cte1 as
(
select 'Web' as platformname,'abc' as productname,'A' as grade
union all
select 'Web' ,'cde' ,'B'
union all
select 'IOS' ,'xyz' ,'C'
union all
select 'MAX' ,'cde' ,'D'
)
select productname,[Web], [IOS], [Android],[Universal],[Mac],[Win32]
from cte1 t
pivot
(
max(grade)
for platformname in ([Web], [IOS], [Android],[Universal],[Mac],[Win32])
) p
You can "pivot" such data using joins:
select p.productname,
t_win32.grade as win32,
t_universal.grade as universal,
. . .
from products p left join -- assume you have such a table
t t_win32
on t_win32.product_name = p.productname and t_win32.platform = 'Win32' left join
t t_universal
on t_universal.product_name = p.productname and t_universal.platform = 'Universal' left join
. . .
If you don't have a table products, use a derived table instead:
from (select distinct product_name from t) p left join
. . .

Modify my SQL Server query -- returns too many rows sometimes

I need to update the following query so that it only returns one child record (remittance) per parent (claim).
Table Remit_To_Activate contains exactly one date/timestamp per claim, which is what I wanted.
But when I join the full Remittance table to it, since some claims have multiple remittances with the same date/timestamps, the outermost query returns more than 1 row per claim for those claim IDs.
SELECT * FROM REMITTANCE
WHERE BILLED_AMOUNT>0 AND ACTIVE=0
AND REMITTANCE_UUID IN (
SELECT REMITTANCE_UUID FROM Claims_Group2 G2
INNER JOIN Remit_To_Activate t ON (
(t.ClaimID = G2.CLAIM_ID) AND
(t.DATE_OF_LATEST_REGULAR_REMIT = G2.CREATE_DATETIME)
)
where ACTIVE=0 and BILLED_AMOUNT>0
)
I believe the problem would be resolved if I included REMITTANCE_UUID as a column in Remit_To_Activate. That's the REAL issue. This is how I created the Remit_To_Activate table (trying to get the most recent remittance for a claim):
SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
MAX(claim_id) AS ClaimID,
INTO Latest_Remit_To_Activate
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID
Claims_Group2 contains these fields:
REMITTANCE_UUID,
CLAIM_ID,
BILLED_AMOUNT,
CREATE_DATETIME
Here are the 2 rows that are currently giving me the problem--they're both remitts for the SAME CLAIM, with the SAME TIMESTAMP. I only want one of them in the Remits_To_Activate table, so only ONE remittance will be "activated" per Claim:
enter image description here
You can change your query like this:
SELECT
p.*, latest_remit.DATE_OF_LATEST_REMIT
FROM
Remittance AS p inner join
(SELECT MAX(create_datetime) as DATE_OF_LATEST_REMIT,
claim_id,
FROM Claims_Group2
WHERE BILLED_AMOUNT>0
GROUP BY Claim_ID
ORDER BY Claim_ID) as latest_remit
on latest_remit.claim_id = p.claim_id;
This will give you only one row. Untested (so please run and make changes).
Without having more information on the structure of your database -- especially the structure of Claims_Group2 and REMITTANCE, and the relationship between them, it's not really possible to advise you on how to introduce a remittance UUID into DATE_OF_LATEST_REMIT.
Since you are using SQL Server, however, it is possible to use a window function to introduce a synthetic means to choose among remittances having the same timestamp. For example, it looks like you could approach the problem something like this:
select *
from (
select
r.*,
row_number() over (partition by cg2.claim_id order by cg2.create_datetime desc) as rn
from
remittance r
join claims_group2 cg2
on r.remittance_uuid = cg2.remittance_uuid
where
r.active = 0
and r.billed_amount > 0
and cg2.active = 0
and cg2.billed_amount > 0
) t
where t.rn = 1
Note that that that does not depend on your DATE_OF_LATEST_REMIT table at all, it having been subsumed into the inline view. Note also that this will introduce one extra column into your results, though you could avoid that by enumerating the columns of table remittance in the outer select clause.
It also seems odd to be filtering on two sets of active and billed_amount columns, but that appears to follow from what you were doing in your original queries. In that vein, I urge you to check the results carefully, as lifting the filter conditions on cg2 columns up to the level of the join to remittance yields a result that may return rows that the original query did not (but never more than one per claim_id).
A co-worker offered me this elegant demonstration of a solution. I'd never used "over" or "partition" before. Works great! Thank you John and Gaurasvsa for your input.
if OBJECT_ID('tempdb..#t') is not null
drop table #t
select *, ROW_NUMBER() over (partition by CLAIM_ID order by CLAIM_ID) as ROW_NUM
into #t
from
(
select '2018-08-15 13:07:50.933' as CREATE_DATE, 1 as CLAIM_ID, NEWID() as
REMIT_UUID
union select '2018-08-15 13:07:50.933', 1, NEWID()
union select '2017-12-31 10:00:00.000', 2, NEWID()
) x
select *
from #t
order by CLAIM_ID, ROW_NUM
select CREATE_DATE, MAX(CLAIM_ID), MAX(REMIT_UUID)
from #t
where ROW_NUM = 1
group by CREATE_DATE

Count query giving wrong column name error

select COUNT(analysed) from Results where analysed="True"
I want to display count of rows in which analysed value is true.
However, my query gives the error: "The multi-part identifier "Results.runId" could not be bound.".
This is the actual query:
select ((SELECT COUNT(*) AS 'Count'
FROM Results
WHERE Analysed = 'True')/failCount) as PercentAnalysed
from Runs
where Runs.runId=Analysed.runId
My table schema is:
The value I want for a particular runId is: (the number of entries where analysed=true)/failCount
EDIT : How to merge these two queries?
i) select runId,Runs.prodId,prodDate,prodName,buildNumber,totalCount as TotalTestCases,(passCount*100)/(passCount+failCount) as PassPercent,
passCount,failCount,runOwner from Runs,Product where Runs.prodId=Product.prodId
ii) select (cast(counts.Count as decimal(10,4)) / cast(failCount as decimal(10,4))) as PercentAnalysed
from Runs
inner join
(
SELECT COUNT(*) AS 'Count', runId
FROM Results
WHERE Analysed = 'True'
GROUP BY runId
) counts
on counts.runId = Runs.runId
I tried this :
select runId,Runs.prodId,prodDate,prodName,buildNumber,totalCount as TotalTestCases,(passCount*100)/(passCount+failCount) as PassPercent,
passCount,failCount,runOwner,counts.runId,(cast(counts.Count as decimal(10,4)) / cast(failCount as decimal(10,4))) as PercentAnalysed
from Runs,Product
inner join
(
SELECT COUNT(*) AS 'Count', runId
FROM Results
WHERE Analysed = 'True'
GROUP BY runId
) counts
on counts.runId = Runs.runId
where Runs.prodId=Product.prodId
but it gives error.
Your problems are arising from improper joining of tables. You need information from both Runs and Results, but they aren't combined properly in your query. You have the right idea with a nested subquery, but it's in the wrong spot. You're also referencing the Analysed table in the outer where clause, but it hasn't been included in the from clause.
Try this instead:
select (cast(counts.Count as decimal(10,4)) / cast(failCount as decimal(10,4))) as PercentAnalysed
from Runs
inner join
(
SELECT COUNT(*) AS 'Count', runId
FROM Results
WHERE Analysed = 'True'
GROUP BY runId
) counts
on counts.runId = Runs.runId
I've set this up as an inner join to eliminate any runs which don't have analysed results; you can change it to a left join if you want those rows, but will need to add code to handle the null case. I've also added casts to the two numbers, because otherwise the query will perform integer division and truncate any fractional amounts.
I'd try the following query:
SELECT COUNT(*) AS 'Count'
FROM Results
WHERE Analysed = 'True'
This will count all of your rows where Analysed is 'True'. This should work if the datatype of your Analysed column is either BIT (Boolean) or STRING(VARCHAR, NVARCHAR).
Use CASE in Count
SELECT COUNT(CASE WHEN analysed='True' THEN analysed END) [COUNT]
FROM Results
Click here to view result
select COUNT(*) from Results where analysed="True"