How does the DISTINCT keyword operate in this query?

In a query,
SELECT modal_text,
COUNT(DISTINCT CASE
WHEN ab_group = 'control' THEN user_id
END) AS 'control_clicks'
FROM onboarding_modals
GROUP BY 1
ORDER BY 1;
I can't understand why DISTINCT is used in line 2. I thought the query would work fine without DISTINCT, and it does. What does DISTINCT do in this query?

I think it could serve a purpose, but the query would work without it if your data has a certain form.
Let's go over it step by step:
CASE WHEN ab_group = 'control' THEN user_id END
This returns the user_id for each record where ab_group equals 'control', and NULL for every other record.
DISTINCT
This removes duplicates from that list. If no user_id appears in more than one record of a modal_text group, or some do but never more than once with ab_group 'control', it is superfluous.
COUNT
This counts the non-NULL items in the list passed to it. Preceding the list with DISTINCT makes it output the number of unique items.
Hence: if onboarding_modals contains only unique user_ids, or a user_id can be listed more than once but never more than once in combination with ab_group = 'control', DISTINCT can safely be removed without changing the outcome.
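To see the difference on data where it matters, here is a minimal sketch using Python's built-in sqlite3 module. The rows are made up for illustration (one control user clicks twice), and the all_rows column is only added for comparison; neither is from the original question.

```python
import sqlite3

# Hypothetical sample data: user 1 appears twice in the control group.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE onboarding_modals (modal_text TEXT, user_id INTEGER, ab_group TEXT)"
)
conn.executemany(
    "INSERT INTO onboarding_modals VALUES (?, ?, ?)",
    [
        ("Welcome", 1, "control"),
        ("Welcome", 1, "control"),  # duplicate click by the same user
        ("Welcome", 2, "control"),
        ("Welcome", 3, "variant"),  # CASE yields NULL here, so COUNT skips it
    ],
)

query = """
SELECT modal_text,
       COUNT(DISTINCT CASE WHEN ab_group = 'control' THEN user_id END) AS distinct_users,
       COUNT(CASE WHEN ab_group = 'control' THEN user_id END) AS all_rows
FROM onboarding_modals
GROUP BY 1
ORDER BY 1
"""
rows = conn.execute(query).fetchall()
print(rows)  # [('Welcome', 2, 3)]
```

With DISTINCT the duplicate click is counted once (2 users); without it every control row is counted (3). If user 1 had clicked only once, both columns would agree.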

Related

Count Distinct in MS Access - How to count number of accounts with at least one activity of a certain type?

I have a list of activities where one field is AccountName indicating the name of the person. There is another field called ActivityType.
Basically I want to return the number of accounts that have at least one ActivityType that is equal to RET (indicating a returned item). Since a person may have multiple returns in a given year I don't want to count the number of returns total, just the number of accounts that have at least one RET.
I've tried various combinations of select statements, count statements, having statements, it's just not working right.
Here is what I tried:
Select DISTINCT Count(Activities.AccountName) AS CountOfAccountName
FROM Activities
GROUP BY Activities.ActivityType
HAVING (Activities.ActivityType = "RET")
But this seems to return a much larger number than if I just do the select statement:
SELECT DISTINCT Activities.AccountName
FROM Activities
WHERE (Activities.ActivityType = "RET")
I think you need a simple COUNT and GROUP BY. Since Access doesn't support COUNT(DISTINCT ...), you need to first select the distinct records and then count them:
SELECT Count(T.AccountName) AS CountOfAccountName
FROM (SELECT AccountName
FROM Activities
WHERE Activities.ActivityType = 'RET'
GROUP BY AccountName) T;
This keeps only the records having ActivityType = 'RET' and then counts the distinct AccountName values.
Use a subquery:
SELECT Count(*) AS ActiveAccounts
FROM
(SELECT DISTINCT AccountName
FROM Activities
WHERE ActivityType = "RET")
If an account cannot have the same activity type more than once, then you can simply use:
select count(*) AS CountOfAccountName
from Activities
where Activities.ActivityType = "RET";
If they can be, then you need a subquery as shown in other answers.
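As a rough check of the subquery approach, here is a small sketch run against SQLite through Python's sqlite3 module (not Access, so the syntax may need minor adjustment there). The sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Activities (AccountName TEXT, ActivityType TEXT)")
conn.executemany(
    "INSERT INTO Activities VALUES (?, ?)",
    [
        ("Alice", "RET"),
        ("Alice", "RET"),  # second return by the same account
        ("Bob", "RET"),
        ("Carol", "BUY"),  # no return at all
    ],
)

# Plain COUNT(*): counts every RET row, so Alice is counted twice.
total_returns = conn.execute(
    "SELECT COUNT(*) FROM Activities WHERE ActivityType = 'RET'"
).fetchone()[0]

# Subquery: de-duplicate the account names first, then count them.
accounts_with_return = conn.execute(
    """
    SELECT COUNT(*) AS ActiveAccounts
    FROM (SELECT DISTINCT AccountName
          FROM Activities
          WHERE ActivityType = 'RET') AS T
    """
).fetchone()[0]

print(total_returns, accounts_with_return)  # 3 2
```

The plain count matches the number of returns, while the subquery version matches the number of accounts with at least one return.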

Return All Historical Account Records for Accounts with Change in Corresponding Value

I am trying to select all records in a time-variant Account table for each account with a change in an associated value (e.g. the maturity date). A change in the value will result in the most recent record for an account being end-dated and a new record (containing a new effective date of the following day) being created. The most recent records for accounts in this table have an end-date of 12/31/9000.
For instance, in the below illustration, account 44444444 would not be included in my query result set since it hasn't had a change in the value (and thus also has no additional records aside from the original); however, the other accounts have multiple changes in values (and multiple records), so I would want to see those returned.
I've tried using the row_num function, as well as a reflexive join, but for some reason I'm not getting the expected results. What are some ways to obtain the results I need?
Note: The primary key for this table includes the acct_id and eff_dt. Also, I'm using PostgreSQL in a Greenplum environment.
Here are two types of queries I tried to use but which produced problematic results:
Query 1
Query 2
If you want only the accounts, use aggregation:
select acct_id
from t
group by acct_id
having min(value) <> max(value);
Based on your description, you could also use count(*) > 1, since an account with a change always has more than one record.
If you want the original records, you can use window functions:
select t.*
from (select t.*, count(*) over (partition by acct_id) as cnt
from t
) t
where cnt > 1;
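Here is a minimal sketch of the aggregation approach on a toy table, run in SQLite via Python's sqlite3 module rather than Greenplum. The column names eff_dt, end_dt and value and all rows are assumptions for illustration. To fetch the full history it joins back to the grouped result instead of using the window function, purely so the sketch also runs on older SQLite builds.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (acct_id INTEGER, eff_dt TEXT, end_dt TEXT, value TEXT)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?, ?, ?)",
    [
        # Account with a change: old record end-dated, new record opened.
        (11111111, "2019-01-01", "2019-06-30", "2024-01-01"),
        (11111111, "2019-07-01", "9000-12-31", "2025-01-01"),
        # Account without any change: only the original record.
        (44444444, "2019-01-01", "9000-12-31", "2024-01-01"),
    ],
)

# Accounts whose value differs between records.
changed_by_value = [r[0] for r in conn.execute(
    "SELECT acct_id FROM t GROUP BY acct_id HAVING MIN(value) <> MAX(value) ORDER BY acct_id"
)]

# Accounts with more than one record.
changed_by_count = [r[0] for r in conn.execute(
    "SELECT acct_id FROM t GROUP BY acct_id HAVING COUNT(*) > 1 ORDER BY acct_id"
)]

# Full history for those accounts (join-back variant, substituted for the window function).
history = conn.execute(
    """
    SELECT t.*
    FROM t
    JOIN (SELECT acct_id FROM t GROUP BY acct_id HAVING COUNT(*) > 1) c
      ON c.acct_id = t.acct_id
    ORDER BY t.acct_id, t.eff_dt
    """
).fetchall()

print(changed_by_value, changed_by_count, len(history))
```

Both conditions pick out account 11111111 here; they would only differ if an account could get a new record without the value changing.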

Is GROUP BY needed in the following correlated subquery?

Given scenario:
table fd
(cust_id, fd_id) primary-key and amount
table loan
(cust_id, l_id) primary-key and amount
I want to list all customers who have a fixed deposit with an amount less than the sum of all their loans.
Query:
SELECT cust_id
FROM fd
WHERE amount
<
(SELECT sum(amount)
FROM loan
WHERE fd.cust_id = loan.cust_id);
OR should we use
SELECT cust_id
FROM fd
WHERE amount
<
(SELECT sum(amount)
FROM loan
WHERE fd.cust_id = loan.cust_id group by cust_id);
A customer can have multiple loans but one FD is considered at a time.
GROUP BY can be omitted in this case, because the SELECT list contains only an aggregate function and all rows are guaranteed to belong to the same cust_id group (by the WHERE clause).
The aggregation will be over all rows with matching cust_id in both cases. So both queries are correct.
This would be another, cleaner way to implement the same thing:
SELECT fd.cust_id
FROM fd
JOIN loan USING (cust_id)
GROUP BY fd.cust_id, fd.amount
HAVING fd.amount < sum(loan.amount)
There is one difference: rows with identical (cust_id, amount) in fd only appear once in the result of my query, while they would appear multiple times in the original.
Either way, if there is no matching row with a non-null amount in table loan, you get no rows at all. I assume you are aware of that.
There is no need for GROUP BY since you have already filtered the data by cust_id. Either way, the inner query returns the same result.
No, it isn't, because you calculate sum(amount) for the customer with id = fd.cust_id, that is, for a single customer.
However, if your subquery somehow calculated the sum for more than one customer, the group by would cause the subquery to generate more than one row; this would cause the condition (<) to fail and thus the query to fail.
A query with an aggregate like sum but without a group by will output one group. The aggregates will be computed over all matching rows.
A subquery in a condition clause is only allowed to return one row. If the subquery returned multiple rows, what would the following expression mean?
where 1 > (... subquery ...)
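So the group by should be omitted. It is redundant in your second query, because the WHERE clause already restricts the subquery to one customer; you would only get an error if the grouped subquery returned more than one row.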
N.B. When you specify all, any, or some a subquery can return multiple rows:
where 1 > ALL (... subquery ...)
But it's easy to see why that doesn't make sense in your case; you'd compare one customer's data to that of another.
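A quick way to compare the two subquery forms is to run both on a toy data set. The sketch below does that in SQLite via Python's sqlite3 module; the rows are invented, and the GROUP BY column is qualified as loan.cust_id only to make the name resolution explicit.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fd (cust_id INTEGER, fd_id INTEGER, amount INTEGER, PRIMARY KEY (cust_id, fd_id))")
conn.execute("CREATE TABLE loan (cust_id INTEGER, l_id INTEGER, amount INTEGER, PRIMARY KEY (cust_id, l_id))")
conn.executemany("INSERT INTO fd VALUES (?, ?, ?)", [
    (1, 10, 100),  # loans total 150 -> qualifies
    (2, 20, 500),  # loans total 100 -> does not qualify
    (3, 30, 50),   # no loans -> subquery yields NULL, row is filtered out
])
conn.executemany("INSERT INTO loan VALUES (?, ?, ?)", [
    (1, 1, 80),
    (1, 2, 70),
    (2, 3, 100),
])

without_group_by = [r[0] for r in conn.execute(
    """
    SELECT cust_id FROM fd
    WHERE amount < (SELECT SUM(amount) FROM loan
                    WHERE fd.cust_id = loan.cust_id)
    ORDER BY cust_id
    """
)]

with_group_by = [r[0] for r in conn.execute(
    """
    SELECT cust_id FROM fd
    WHERE amount < (SELECT SUM(amount) FROM loan
                    WHERE fd.cust_id = loan.cust_id
                    GROUP BY loan.cust_id)
    ORDER BY cust_id
    """
)]

print(without_group_by, with_group_by)  # [1] [1]
```

Because the correlated WHERE clause leaves at most one cust_id in the subquery, the grouped version still returns at most one row and both forms give the same answer.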

How to produce a distinct count of records that are stored by day by month

I have a table with several "ticket" records in it. Each ticket is stored by day (e.g. 2011-07-30 00:00:00.000). I would like to count the unique records in each month of each year. I have used the following SQL statement:
SELECT DISTINCT
YEAR(TICKETDATE) as TICKETYEAR,
MONTH(TICKETDATE) AS TICKETMONTH,
COUNT(DISTINCT TICKETID) AS DAILYTICKETCOUNT
FROM
NAT_JOBLINE
GROUP BY
YEAR(TICKETDATE),
MONTH(TICKETDATE)
ORDER BY
YEAR(TICKETDATE),
MONTH(TICKETDATE)
This does produce a count but it is wrong as it picks up the unique tickets for every day. I just want a unique count by month.
Try combining Year and Month into one field, and grouping on that new field.
You may have to cast them to varchar to ensure that they don't simply get added together. Or you could multiply the year by 100 before adding the month:
SELECT
(YEAR(TICKETDATE) * 100) + MONTH(TICKETDATE),
count(*) AS DAILYTICKETCOUNT
FROM NAT_JOBLINE GROUP BY
(YEAR(TICKETDATE) * 100) + MONTH(TICKETDATE)
Presuming that TICKETID is not a primary or unique key, but does appear multiple times in table NAT_JOBLINE, that query should work. If it is unique (no value occurs in more than one row), you will need to select on a different column, one that uniquely identifies the "entity" that you want to count rather than each occurrence/instance/reference of that entity.
(As ever, it is hard to tell without working with the actual data.)
I think you need to remove the first DISTINCT. You already have the GROUP BY. If I were the first DISTINCT, I would be confused as to what I was supposed to do.
SELECT
YEAR(TICKETDATE) as TICKETYEAR,
MONTH(TICKETDATE) AS TICKETMONTH,
COUNT(DISTINCT TICKETID) AS DAILYTICKETCOUNT
FROM NAT_JOBLINE
GROUP BY YEAR(TICKETDATE), MONTH(TICKETDATE)
ORDER BY YEAR(TICKETDATE), MONTH(TICKETDATE)
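For what it's worth, here is a small sketch of the per-month distinct count on invented data, run in SQLite through Python's sqlite3 module. SQLite has no YEAR() or MONTH(), so strftime is used in their place; that substitution, the extra ROW_COUNT column, and the sample rows are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE NAT_JOBLINE (TICKETID INTEGER, TICKETDATE TEXT)")
conn.executemany("INSERT INTO NAT_JOBLINE VALUES (?, ?)", [
    (1, "2011-07-29 00:00:00.000"),
    (1, "2011-07-30 00:00:00.000"),  # same ticket on a second day in July
    (2, "2011-07-30 00:00:00.000"),
    (3, "2011-08-01 00:00:00.000"),
])

monthly = conn.execute(
    """
    SELECT CAST(strftime('%Y', TICKETDATE) AS INTEGER) AS TICKETYEAR,
           CAST(strftime('%m', TICKETDATE) AS INTEGER) AS TICKETMONTH,
           COUNT(DISTINCT TICKETID) AS TICKETCOUNT,  -- unique tickets in the month
           COUNT(*) AS ROW_COUNT                     -- every row, for comparison
    FROM NAT_JOBLINE
    GROUP BY 1, 2
    ORDER BY 1, 2
    """
).fetchall()

print(monthly)  # [(2011, 7, 2, 3), (2011, 8, 1, 1)]
```

Ticket 1 appears on two days in July but is counted once by COUNT(DISTINCT TICKETID), whereas COUNT(*) counts both rows.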
From what I understand from your comments to Phillip Kelley's solution:
SELECT TICKETDATE, COUNT(*) AS DAILYTICKETCOUNT
FROM NAT_JOBLINE
GROUP BY TICKETDATE
should do the trick, but I suggest you update your question.

Different faces of COUNT

I would like to know the difference between the following 4 simple queries in terms of result and functionality:
SELECT COUNT(*) FROM employees;
SELECT COUNT(0) FROM employees;
SELECT COUNT(1) FROM employees;
SELECT COUNT(2) FROM employees;
The four examples all evaluate to the same number - there is no difference.
What might give a different answer would be:
SELECT COUNT(middle_initial) FROM employees;
If there are any entries with a NULL in the middle_initial column, then the count returned will be different from COUNT(*) because it will be just the number of non-null values in the column.
No difference in terms of result, they all return the number of rows in employees.
COUNT(expression) simply means "for each row in this table, if expression evaluates to a non-null value, count this row".
But, * means count anything, while n is a constant numeric value and is therefore never null. Hence, both don't take into account the actual row data and thus return the total number of rows in a table.
SELECT COUNT(x) FROM employees will give you the number of rows where x is not null.
Count(*) :
Specifies that all rows should be counted to return the total number of rows in a table. COUNT(*) takes no parameters and cannot be used with DISTINCT. COUNT(*) does not require an expression parameter because, by definition, it does not use information about any particular column. COUNT(*) returns the number of rows in a specified table without getting rid of duplicates. It counts each row separately. This includes rows that contain null values.
Hence, COUNT(*) returns the number of items in a group. This includes NULL values and duplicates.
To sum up: COUNT(*) returns the number of rows in your query that match your WHERE clause.
So if you go
SELECT COUNT(*) FROM employees
the count returned will equal the number of rows you get back from
SELECT * FROM employees
which returns all rows of the table.
COUNT(1), COUNT(2), COUNT(3) and COUNT(4)
COUNT(*) counts the number of rows produced by the query, whereas COUNT(1) evaluates the literal 1 for every row and counts the non-null results. Note that when you include a literal such as a number or a string in a query, this literal is "appended" or attached to every row that is produced by the FROM clause. This also applies to literals in aggregate functions, such as COUNT(1). The same can be said for COUNT(2), COUNT(3) and COUNT(4): a constant is never null, so every row is counted.
So if you go
SELECT COUNT(1) FROM employees
it will return the same count as the number of rows returned by
SELECT first_name FROM employees
Note that the 1 is a literal, not a reference to column no. 1 of the table.
For the same reason,
SELECT COUNT(DISTINCT 1) FROM employees
does not count the unique values of a column: the only distinct value is the literal 1, so it returns 1 for any non-empty table. To count the unique values of a column, name the column, as in COUNT(DISTINCT first_name).
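The different COUNT forms are easy to compare side by side. Below is a minimal sketch in SQLite via Python's sqlite3 module; the three employees rows are invented for illustration, with one NULL middle_initial and one repeated first_name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (first_name TEXT, middle_initial TEXT)")
conn.executemany("INSERT INTO employees VALUES (?, ?)", [
    ("Ann", "B"),
    ("Ann", None),  # NULL middle initial, repeated first name
    ("Cy", "D"),
])

counts = conn.execute(
    """
    SELECT COUNT(*),                    -- all rows
           COUNT(0),                    -- constant, never NULL: all rows
           COUNT(1),
           COUNT(2),
           COUNT(middle_initial),       -- only rows with a non-NULL value
           COUNT(DISTINCT 1),           -- one distinct constant value
           COUNT(DISTINCT first_name)   -- distinct values of a real column
    FROM employees
    """
).fetchone()

print(counts)  # (3, 3, 3, 3, 2, 1, 2)
```

In this run the constant forms all match COUNT(*), COUNT(middle_initial) skips the NULL, and COUNT(DISTINCT 1) comes out as 1 because the literal 1 is the only distinct value.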