COUNT DISTINCT with CONDITIONS - sql

I want to count the number of distinct items in a column subject to a certain condition, for example if the table is like this:
tag | entryID
----+---------
foo | 0
foo | 0
bar | 3
If I want to count the number of distinct tags as "tag count" and count the number of distinct tags with entry id > 0 as "positive tag count" in the same table, what should I do?
I'm now counting from two different tables where in the second table I've only selected those rows with entryID larger than zero. I think there should be a more compact way to solve this problem.

You can try this:
select
count(distinct tag) as tag_count,
count(distinct (case when entryId > 0 then tag end)) as positive_tag_count
from
your_table_name;
The first count(distinct...) is easy.
The second one looks somewhat complex, but it is actually the same as the first, except that it uses a case...when expression. The case...when expression keeps only rows with a positive entryId. Rows with zero or negative values evaluate to null and are not included in the count.
One thing to note here is that this can be done by reading the table once. When it seems that you have to read the same table twice or more, most of the time it can actually be done in a single pass. As a result, the task finishes a lot faster with less I/O.
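To make the NULL-skipping behaviour concrete, here is a minimal sketch that runs the same query against the three sample rows from the question. It assumes an in-memory SQLite database driven from Python; the question does not name a DBMS, so treat the engine as an assumption.

```python
import sqlite3

# Assumption: SQLite stands in for whatever DBMS the question is about.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table_name (tag TEXT, entryID INTEGER)")
conn.executemany(
    "INSERT INTO your_table_name VALUES (?, ?)",
    [("foo", 0), ("foo", 0), ("bar", 3)],  # sample rows from the question
)

# CASE without ELSE yields NULL for non-matching rows, and COUNT skips NULLs,
# so only tags that have at least one row with entryID > 0 are counted.
tag_count, positive_tag_count = conn.execute(
    """
    SELECT COUNT(DISTINCT tag),
           COUNT(DISTINCT CASE WHEN entryID > 0 THEN tag END)
    FROM your_table_name
    """
).fetchone()

print(tag_count, positive_tag_count)  # prints: 2 1
```

Both counts come out of a single scan of the table, which is the point made above.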

This may work:
SELECT Count(tag) AS 'Tag Count'
FROM Table
GROUP BY tag
and
SELECT Count(tag) AS 'Positive Tag Count'
FROM Table
WHERE entryID > 0
GROUP BY tag

Try the following statement:
select distinct A.[Tag],
count(A.[Tag]) as TAG_COUNT,
(SELECT count(*) FROM [TagTbl] AS B WHERE A.[Tag]=B.[Tag] AND B.[ID]>0)
from [TagTbl] AS A GROUP BY A.[Tag]
The first field will be the tag, the second will be the total count, and the third will be the count of positive ones.

I agree with @ntalbs's solution.
If you want to count one column's values only when a condition on another column holds, you can do this:
select
count(distinct tag) as tag_count,
count(distinct tag, case when entryId > 0 then tag end) as positive_tag_count
from
your_table_name;
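On the third line, I added the column name after distinct, so it counts the distinct tags only when entryId is greater than 0. Note that COUNT(DISTINCT ...) with more than one argument is MySQL syntax; other databases may reject it.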

Code counts the unique/distinct combination of Tag & Entry ID when [Entry Id]>0
select count(distinct(concat(tag,entryId)))
from customers
where id>0
In the output it will display the count of unique values
Hope this helps

This may also work:
SELECT
COUNT(DISTINCT T.tag) as DistinctTag,
COUNT(DISTINCT T2.tag) as DistinctPositiveTag
FROM Table T
LEFT JOIN Table T2 ON T.tag = T2.tag AND T.entryID = T2.entryID AND T2.entryID > 0
You need the entryID condition in the left join rather than in a where clause to make sure that any items that only have an entryID of 0 still get counted in the first DISTINCT.
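The difference between filtering in the ON clause and in a WHERE clause can be checked with a small sketch. It assumes SQLite run from Python and a hypothetical table name `tags` (the answer's `Table` is a reserved word), loaded with the sample rows from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "tags" is a hypothetical name standing in for the answer's "Table".
conn.execute("CREATE TABLE tags (tag TEXT, entryID INTEGER)")
conn.executemany(
    "INSERT INTO tags VALUES (?, ?)",
    [("foo", 0), ("foo", 0), ("bar", 3)],
)

# Condition inside the join: unmatched rows of T survive with NULLs from T2.
on_sql = """
SELECT COUNT(DISTINCT T.tag), COUNT(DISTINCT T2.tag)
FROM tags T
LEFT JOIN tags T2
  ON T.tag = T2.tag AND T.entryID = T2.entryID AND T2.entryID > 0
"""

# Same condition in WHERE: rows where T2 is NULL are filtered out afterwards.
where_sql = """
SELECT COUNT(DISTINCT T.tag), COUNT(DISTINCT T2.tag)
FROM tags T
LEFT JOIN tags T2
  ON T.tag = T2.tag AND T.entryID = T2.entryID
WHERE T2.entryID > 0
"""

on_result = conn.execute(on_sql).fetchone()
where_result = conn.execute(where_sql).fetchone()

print(on_result)     # prints: (2, 1)
print(where_result)  # prints: (1, 1)
```

With the condition in WHERE, the tag foo disappears from the first count because all of its rows are filtered out before aggregation.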

Related

In Snowflake, I want to count duplicates in a table based on all the columns in the table without typing out every column name

I have a table with 60 columns in it. I would like to identify how many duplicates there are in the table based on all the columns being identical.
I don't want to have to type out every field name in the SELECT or GROUP BY clauses. Is there a way to do that?
You can use an approach like this for each table:
SELECT
MD5(OBJECT_CONSTRUCT(SRC.*)::VARCHAR) DUP_MD5, SUM(1) AS TOTAL_COUNT
FROM <table> SRC
GROUP BY 1
HAVING SUM(1) > 1;

SQL query to return results from one to many table

I'm having difficulties trying to return some data from a poorly structured one to many table.
I've been provided with a data export where everything from 'Section Codes' onwards (in cat_fullxPath) relates to a 'skillID' in my clients database.
The results previously returned on one line but I've used a split function to break these out (from the cat_fullXPath column). You can see the relevant 'skillID' from my clients DB in the far right column:
From here, there are thousands of records that may have a mixture of these skillIDs (and many others, I've just provided this one example). I want to be able to find the records that match all 4 (or however many match from another example) skillIDs and ONLY those.
For example (I just happen to know this ID gives me the results I want):
SELECT
id,
skillID
FROM table1
WHERE skillID IN ( 1004464, 1006543, 1004605, 1006740 )
AND id = 69580;
This returns me:
Note that these are the only columns in that table.
So this is an ID I'd want to return.
These are results I would not want to return, as one of the skillIDs is missing:
I've created a temp table with a count of all the skills for each ID but I'm not sure if I'm going down the right path at this point
I'm pretty sure that there's a simple solution to this, however I'm hitting my head against the wall. Hope someone can help!
EDIT
This might be a clearer example of when there are different groups of skillIds that I need to align. I've partitioned these off by cat_fullxpath to see if this makes things clearer:
In this screenshot, for example I want to find the ids for everything in table1 where skillID IN (1003914,1005354,1004701) then repeat for (1004659,1004492,1004493,1004701). etc
We know that you need exactly 4 skills, so just make a subquery:
select id from
(
SELECT
id,
count(skillID) countSkill
FROM table1
WHERE skillID IN ( 1004464, 1006543, 1004605, 1006740 )
group by id
) t
where countSkill = 4;
This could also work with sum instead of count: instead of filtering on 4, you would filter on 4022352, which is the sum of the four skillIDs.
You can also remove the subquery and use HAVING. But you will obtain worse performance.
SELECT
id,
count(skillID) countSkill
FROM table1
WHERE skillID IN ( 1004464, 1006543, 1004605, 1006740 )
group by id
having count(skillID) = 4;
You haven't told us your DBMS. Here is a standard SQL approach:
select id
from table1
group by id
having count(case when skillid = 1004464 then 1 end) > 0
and count(case when skillid = 1006543 then 1 end) > 0
and count(case when skillid = 1004605 then 1 end) > 0
and count(case when skillid = 1006740 then 1 end) > 0
and count(case when skillid not in (1004464, 1006543, 1004605, 1006740) then 1 end) = 0;
Another option is to concatenate all skills and see if the resulting skill list matches the desired skill list. In SQL Server the string aggregation function is STRING_AGG.
select id
from table1
group by id
having string_agg(skillid, ',') within group (order by skillid) in
(
'1004464,1004605,1006543,1006740'
);
You can easily extend the IN clause with other combinations or even get the list from another table. Only make sure the skill IDs in the strings are sorted in order to make the strings comparable ('1004464,1004605,1006543,1006740' <> '1006740,1004464,1004605,1006543').
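The difference between "has all four skills" and "has all four skills and only those" can be seen in a small sketch. It assumes SQLite run from Python; apart from id 69580 and the four skillIDs from the question, the rows (ids 2 and 3, skillID 1009999) are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER, skillID INTEGER)")
wanted = (1004464, 1006543, 1004605, 1006740)
rows = [(69580, s) for s in wanted]          # exactly the four skills
rows += [(2, s) for s in wanted[:3]]         # hypothetical id: one skill missing
rows += [(3, s) for s in wanted]             # hypothetical id: all four ...
rows += [(3, 1009999)]                       # ... plus a made-up extra skill
conn.executemany("INSERT INTO table1 VALUES (?, ?)", rows)

# Filter first, then count: ids with extra skills still pass.
loose_sql = """
SELECT id FROM table1
WHERE skillID IN (1004464, 1006543, 1004605, 1006740)
GROUP BY id
HAVING COUNT(skillID) = 4
ORDER BY id
"""

# Conditional counts over all rows: every wanted skill present, nothing else.
strict_sql = """
SELECT id FROM table1
GROUP BY id
HAVING COUNT(CASE WHEN skillID = 1004464 THEN 1 END) > 0
   AND COUNT(CASE WHEN skillID = 1006543 THEN 1 END) > 0
   AND COUNT(CASE WHEN skillID = 1004605 THEN 1 END) > 0
   AND COUNT(CASE WHEN skillID = 1006740 THEN 1 END) > 0
   AND COUNT(CASE WHEN skillID NOT IN (1004464, 1006543, 1004605, 1006740)
                  THEN 1 END) = 0
ORDER BY id
"""

loose_ids = [r[0] for r in conn.execute(loose_sql)]
strict_ids = [r[0] for r in conn.execute(strict_sql)]

print(loose_ids)   # prints: [3, 69580]
print(strict_ids)  # prints: [69580]
```

The WHERE ... IN filter discards the extra skill before counting, so only the version that aggregates over all of an id's rows can enforce "ONLY those".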

SQL: At least one value exists in another table

I am trying to create a table that has columns called user_id and top5_foods (binary column). I currently have two tables, one has all of the user_ids and the foods associated with those user_ids and one table that only contains the top5 foods according to a type of calculation to select the top5 foods.
The table that I am trying to create should have a user_id column and, if at least one of the user's favorite foods is in the top_5_food table, a top5_foods value of 1; otherwise 0.
Something like the following:
user_id top5_foods
----------------------
34223 1
43225 0
34323 1
I have tried to use the CASE command, but it just duplicated the user_ids and marked 1 or 0 whenever it found a food that is in the top_5_foods table. I don't want the user_ids to be duplicated. Could you please help?
Thank you very much
If I understand correctly, a left join and aggregation:
select uf.user_id,
(count(t.food_id) > 0) as top5_foods
from user_foods uf left join
top5_foods t
on uf.food_id = t.food_id
group by uf.user_id;
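As a rough check of the left join plus aggregation idea, the sketch below runs it on made-up data. It assumes SQLite run from Python, where a comparison evaluates to 1 or 0; the user ids come from the question's example output, while the food ids are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE user_foods (user_id INTEGER, food_id INTEGER)")
conn.execute("CREATE TABLE top5_foods (food_id INTEGER)")
# Hypothetical food ids; only the user ids appear in the question.
conn.executemany(
    "INSERT INTO user_foods VALUES (?, ?)",
    [(34223, 1), (34223, 7), (43225, 8), (43225, 9), (34323, 2), (34323, 3)],
)
conn.executemany("INSERT INTO top5_foods VALUES (?)", [(i,) for i in range(1, 6)])

# COUNT(t.food_id) ignores the NULLs produced by unmatched left-join rows,
# so it is 0 exactly when none of the user's foods is in top5_foods.
flags = conn.execute(
    """
    SELECT uf.user_id, (COUNT(t.food_id) > 0) AS top5_foods
    FROM user_foods uf
    LEFT JOIN top5_foods t ON uf.food_id = t.food_id
    GROUP BY uf.user_id
    ORDER BY uf.user_id
    """
).fetchall()

print(flags)  # prints: [(34223, 1), (34323, 1), (43225, 0)]
```

Grouping by user_id is what collapses the duplicated rows into one row per user.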

Determine the number of times a null value occurs in column B for a distinct value in column A, SQL table

I have a SQL table with "name" as one column, date as another, and location as a third. The location column supports null values.
I am trying to write a query to determine the number of times a null value occurs in the location column for each distinct value in the name column.
Can someone please assist?
One method uses conditional aggregation:
select name, sum(case when location is null then 1 else 0 end)
from t
group by name;
Another method that involves slightly less typing is:
select name, count(*) - count(location)
from t
group by name;
Use count along with a filter, since you only need the NULL occurrences:
select name, count(*) occurrences
from mytable
where location is null
group by name
From your question, you'll want to get a distinct list of all different 'name' rows, and then you would like a count of how many NULLs there are per each name.
The following will achieve this:
SELECT name, count(*) as null_counts
FROM table
WHERE location IS NULL
GROUP BY name
The WHERE clause will only retrieve records where the records have NULL as their location.
The GROUP BY will pivot the data based on NAME.
The SELECT will give you the name, and the COUNT(*) of the number of records, per name.
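One behavioural difference between the answers above is worth seeing on data: filtering with WHERE drops names that have no NULL locations, while conditional aggregation reports them with a count of 0. The sketch below assumes SQLite run from Python and invented rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (name TEXT, date TEXT, location TEXT)")
# Invented sample rows: alice has two NULLs, bob none, carol one.
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?)",
    [
        ("alice", "2020-01-01", None),
        ("alice", "2020-01-02", None),
        ("alice", "2020-01-03", "NY"),
        ("bob", "2020-01-01", "LA"),
        ("bob", "2020-01-02", "SF"),
        ("carol", "2020-01-01", None),
    ],
)

# COUNT(*) counts all rows; COUNT(location) skips NULLs; the gap is the NULLs.
by_difference = conn.execute(
    "SELECT name, COUNT(*) - COUNT(location) FROM t GROUP BY name ORDER BY name"
).fetchall()

# WHERE removes non-NULL rows first, so names without NULLs never form a group.
by_filter = conn.execute(
    "SELECT name, COUNT(*) FROM t WHERE location IS NULL "
    "GROUP BY name ORDER BY name"
).fetchall()

print(by_difference)  # prints: [('alice', 2), ('bob', 0), ('carol', 1)]
print(by_filter)      # prints: [('alice', 2), ('carol', 1)]
```

Which form is preferable depends on whether names with zero NULLs should appear in the result.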

How do I check if all posts from a joined table have the same value in a column?

I'm building a BI report for a client where there is a 1-n related join involved.
The joined table has a field for employee ID (EmplId).
The query that I've built for this report is supposed to give a 1 in its field "OneEmployee" if all the related posts have the same employee in the EmplId field, null if it's different employees, i.e:
TaskTrans
TaskTransHours > EmplId: 'John'
TaskTransHours > EmplId: 'John'
This should give a 1 in the said field in the query
TaskTrans
TaskTransHours > EmplId: 'John'
TaskTransHours > EmplId: 'George'
This should leave the said field blank
The idea is to create a field where a case function checks this and returns the correct value. But my problem is whether there is a way to check for this through SQL.
select not count(*) from your_table
where employee_id = GIVEN_ID
and your_field not in ( select min(your_field)
from your_table
where employee_id = GIVEN_ID);
Note: my first idea was to use LIMIT 1 in the inner query, but MySQL didn't like it, so I used min instead. The point is to pick any one value, but only one. Min should work, but the field should be indexed; then this query will execute rather fast, as only indexes are used (obviously employee_id should also be indexed).
Note2: Do not get too confused with not in front of count(*), you want 1 when there is none that is different, I count different ones, and then give you the not count(*), which will be one if count is 0, otherwise 0.
Seems a job for a window COUNT():
SELECT
…,
CASE COUNT(DISTINCT TaskTransHours.EmplId) OVER () WHEN 1 THEN 1 END
AS OneEmployee
FROM …
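DISTINCT inside a window aggregate is not accepted by every engine (SQLite, for one, rejects it), so the sketch below swaps in a plain GROUP BY with COUNT(DISTINCT ...) to express the same "one employee per task" check. It assumes SQLite run from Python and a hypothetical linking column `TaskTransId` on TaskTransHours; the question does not show the real join key.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# TaskTransId is a hypothetical key linking hours rows to their TaskTrans row.
conn.execute("CREATE TABLE TaskTransHours (TaskTransId INTEGER, EmplId TEXT)")
conn.executemany(
    "INSERT INTO TaskTransHours VALUES (?, ?)",
    [(1, "John"), (1, "John"), (2, "John"), (2, "George")],
)

# One row per task: 1 when all related rows share an employee, else NULL.
one_employee = conn.execute(
    """
    SELECT TaskTransId,
           CASE WHEN COUNT(DISTINCT EmplId) = 1 THEN 1 END AS OneEmployee
    FROM TaskTransHours
    GROUP BY TaskTransId
    ORDER BY TaskTransId
    """
).fetchall()

print(one_employee)  # prints: [(1, 1), (2, None)]
```

The grouped result could then be joined back to the main report query on the task key, under the same assumption about how the tables are linked.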