How can BigQuery SQL give different DISTINCT and GROUP BY results? - sql

We seem to be getting two different, mutually incompatible results from legacy SQL and standard SQL in Google Big Query.
Here is our standard SQL Query...which gives an answer with 218,529 rows.
SELECT DISTINCT(EID)
FROM test.ourBQtable
Here is our legacy SQL Query...
SELECT COUNT(EID) AS Total, EID
FROM [ourBQproject:test.ourBQtable]
GROUP BY EID
ORDER BY Total DESC
This shows results that look like the table below but yet also shows 218,529 rows of results:
Total EID
376 jb+qLvHMm5JrMkNybAi6uC75FzgsGcNQhJ19IeWFDcQ=
352 JGqNBgicm+mpcYBS4K7AI2WXI3xaSgMkktb+7oOjjnQ=
How is it possible to have what appears to be duplicate EIDs (376 of them as shown in one case in the table) - but when using the DISTINCT(EID) command - the number of rows doesn't decrease? Shouldn't DISTINCT be filtering out all the duplicate rows? Do we really have duplicate rows?
What are we missing in our understanding?

Your code appears to be working exactly correctly.
DISTINCT EID is saying that there are 218,529 different values of EID. This should be returning one row for each of the 218,529 different EIDs.
When you use GROUP BY, you are getting one row for each of the EIDs. In this case, you get the same number.
Try running this query:
SELECT COUNT(*) as num_rows, COUNT(DISTINCT EID) as num_eids
FROM test.ourBQtable;
This will show the number of rows in the table and the number of distinct values of EID (ignoring NULL values)`.

Below two query are equivalent and return same number or rows - one per each unique EID
SELECT DISTINCT EID
FROM test.ourBQtable
and
SELECT EID
FROM test.ourBQtable
GROUP BY EID
That explains why number of output rows are the same
Now, in second query you added COUNT(EID)
SELECT COUNT(EID) AS Total, EID
FROM test.ourBQtable
GROUP BY EID
this does not change the number of output rows, but rather adds count of rows in test.ourBQtable with respective EID (if you sum all these counts - you will get total rows in the original table)

Related

Why does MAX statement require a Group By?

I understand why the first query needs a GROUP BY, as it doesn't know which date to apply the sum to, but I don't understand why this is the case with the second query. The value that ultimately is the max amount is already contained in the table - it is not calculated like SUM is. thank you
-- First Query
select
sum(OrderSales),OrderDates
From Orders
-- Second Query
select
max(FilmOscarWins),FilmName
From tblFilm
It is not the SUM and MAX that require the GROUP BY, it is the unaggregated column.
If you just write this, you will get a single row, for the maximum value of the FilmOscarWins column across the whole table:
select
max(FilmOscarWins)
From
tblFilm
If the most Oscars any film won was 12, that one row will say 12. But there could be multiple films, all of which won 12 Oscars, so if we ask for the FilmName alongside that 12, there is no single answer.
By adding the Group By, we fundamentally change the query: instead of returning one number for the whole table, it will return one row for each group - which in this case, means one row for each film.
If you do want to get a list of all those films which had the maximum 12 Oscars, you have to do something more complicated, such as using a sub-query to first find that single number (12) and then find all the rows matching it:
select
FilmOscarWins,
FilmName
From
tblFilm
Where FilmOscarWins = (
select
max(FilmOscarWins)
From
tblFilm
)
If you want the film with the most Oscar wins, then use select top:
select top (1) f.*
From tblFilm f
order by FilmOscarWins desc;
In an aggregation query, the select columns need to be consistent with the group by columns -- the unaggregated columns in the select must match the group by.

Validate that only one value exists

I have a table with two relevant columns. I'll call them EID and MID. They are not unique.
In theory, if the data is set up correctly, there will be many records for each EID and every one of those records should have the same MID.
There are situations where someone may manually update data incorrectly and I need to be able to quickly identify if there is a second MID for any EID.
Ideally, I'd have a query that returns how many MIDs for each EID, but only showing results where there is more than 1 MID. Below is what I'd like the results to look like.
EID Count of Distinct MID values
200345 2
304334 3
I've tried several different forms of queries, but I can't seem to figure out how to reach this result. We're on SQL Server.
You can use the following using COUNT with DISTINCT and HAVING:
SELECT EID, COUNT(DISTINCT MID)
FROM table_name
GROUP BY EID
HAVING COUNT(DISTINCT MID) > 1
demo on dbfiddle.uk

MS Access Count unique values of one table appearing in second table which is related to a third table

I am working with my lab database and close to complete it. But i am stuck in a query and a few similar queries which all give back the similar results.
Here is the Query in design mode
and this is what it gives out
This query is counting the number of ID values in table PatientTestIDs whereas I want to count the number of unique PatientID values grouped by each department
I have even tried Unique Values and Unique Records properties but all the times it gives the same result.
What you want requires two queries.
Query1:
SELECT DISTINCT PatientID, DepartmentID FROM PatientTestIDs;
Query2:
SELECT Count(*) AS PatientsPerDept, DepartmentID FROM Query1 GROUP BY DepartmentID;
Nested all in one:
SELECT Count(*) AS PatientsPerDept, DepartmentID FROM (SELECT DISTINCT PatientID, DepartmentID FROM PatientTestIDs) AS Query1 GROUP BY DepartmentID;
You can include the Departments table in query 2 (or the nested version) to pull in descriptive fields but will have to include those additional fields in the GROUP BY.

How to insert a count column into a sql query

I need the second column of the table retrieved from a query to have a count of the number of rows, so row one would have a 1, row 2 would have a 2 and so on. I am not very proficient with sql so I am sorry if this is a simple task.
A basic example of what I am doing would be is:
SELECT [Name], [I_NEED_ROW_COUNT_HERE],[Age],[Gender]
FROM [customer]
The row count must be the second column and will act as an ID for each row. It must be the second row as the text file it is generating will be sent to the state and they require a specific format.
Thanks for any help.
With your edit, I see that you want a row ID (normally called row number rather than "count") which is best gathered from a unique ID in the database (person_id or some other unique field). If that isn't possible, you can make one for this report with ROW_NUMBER() OVER (ORDER BY EMPLOYEE_ID DESC) AS ID, in your select statement.
select Name, ROW_NUMBER() OVER (ORDER BY Name DESC) AS ID,
Age, Gender
from customer
This function adds a field to the output called ID (see my tips at the bottom to describe aliases). Since this isn't in the database, it needs a method to determine how it will increment. After the over keyword it orders by Name in descending order.
Information on Counting follows (won't be unique by row):
If each customer has multiple entries but the selected fields are the same for that user and you are counting that user's records (summed in one result record for the user) then you would write:
select Name, count(*), Age, Gender
from customer
group by name, age, gender
This will count (see MSDN) all the user's records as grouped by the name, age and gender (if they match, it's a single record).
However, if you are counting all records so that your whole report has the grand total on every line, then you want:
select Name, (select count(*) from customer) as "count", Age, Gender
from customer
TIP: If you're using something like SSMS to write a query, dragging in columns will put brackets around the columns. This is only necessary if you have spaces in column names, but a DBA will tend to avoid that like the plague. Also, if you need a column header to be something specific, you can use the as keyword like in my first example.
W3Schools has a good tutorial on count()
The COUNT(column_name) function returns
the number of values (NULL values will not be counted) of the
specified column:
SELECT COUNT(column_name) FROM table_name;
The COUNT(*) function returns the number of records in a table:
SELECT COUNT(*) FROM table_name;
The COUNT(DISTINCT column_name) function returns the number of
distinct values of the specified column:
SELECT COUNT(DISTINCT column_name) FROM table_name;
COUNT(DISTINCT) works with ORACLE and Microsoft SQL Server, but
not with Microsoft Access.
It's odd to repeat the same number in every row but it sounds like this is what you're asking for. And note that this might not work in your flavor of SQL. MS Access?
SELECT [Name], (select count(*) from [customer]), [Age], [Gender]
FROM [customer]

Different faces of COUNT

I would like to know the difference between the following 4 simple queries in terms of result and functionality:
SELECT COUNT(*) FROM employees;
SELECT COUNT(0) FROM employees;
SELECT COUNT(1) FROM employees;
SELECT COUNT(2) FROM employees;
The four examples all evaluate to the same number - there is no difference.
What might give a different answer would be:
SELECT COUNT(middle_initial) FROM employees;
If there are any entries with a NULL in the middle_initial column, then the count returned will be different from COUNT(*) because it will be just the number of non-null values in the column.
No difference in terms of result, they all return the number of rows in employees.
COUNT(expression) simply means "for each row in this table, if expression evaluates to a non-null value, count this row".
But, * means count anything, while n is a constant numeric value and is therefore never null. Hence, both don't take into account the actual row data and thus return the total number of rows in a table.
SELECT COUNT(x) FROM employees will give you the number of rows where x is not null.
Count(*) :
Specifies that all rows should be counted to return the total number of rows in a table. COUNT(*) takes no parameters and cannot be used with DISTINCT. COUNT(*) does not require an expression parameter because, by definition, it does not use information about any particular column. COUNT(*) returns the number of rows in a specified table without getting rid of duplicates. It counts each row separately. This includes rows that contain null values.
Hence, COUNT(*) returns the number of items in a group. This includes NULL values and duplicates.
To sum. Count(*) will return all rows in your query which match your where clause.
So if you go
SELECT COUNT(*) FROM EMPLOYEES
the row count will be returned the same as if you went
SELECT * FROM employees
All rows will be returned from the table.
Count 1,2,3 and 4
COUNT(*) counts the number of rows produced by the query, whereas COUNT(1) counts the number of 1 values. Note that when you include a literal such as a number or a string in a query, this literal is "appended" or attached to every row that is produced by the FROM clause. This also applies to literals in aggregate functions, such as COUNT(1). The same can be said for Count(2) , Count(3) and Count(4). It will evaluate the expression based on the number of Count(variable) values and return non-null results.
So if you go
SELECT COUNT(1) from emplyees
it will return the same row count as if you went
SELECT first_name from employees
(Where first_name is column no. 1 in the table)
However the advantage here is you can go
SELECT COUNT(Distinct 1) from employees
and then it would return the count of unique records for that column in the table.