Group by and having trouble understanding - sql

I was looking at some SQL queries in an Access database that I did not create.
One of the queries goes something like this:
select column1 from table1 group by column1 having count(*)>1
The purpose of this query is to find the value in column1 that appears more than once. I can verify that this query works correctly and returns the column value that appears more than once.
However, I do not understand why this query works. As I understand it, group by removes duplicate values. For instance, if column1 had
column1
apple
mango
mango
Doing group by (column1) will result in
column1
apple
mango
At this point, if we perform having count(*)>1 or having count(column1)>1, it should return no result because group by has already removed the duplicate value. But clearly I am wrong, as the above SQL statement does give the accurate result.
Would you please let me know the problem in my understanding?
Edit 1:
Besides the accepted answer, I found that this article, which deals with the order of SQL operations, really helped my understanding.

You are misunderstanding how HAVING works. You can think of it in terms of a subquery. Your query is equivalent to:
select column1
from (select column1, count(*) as cnt
from table1
group by column1
) as t
where cnt > 1;
That is, having filters an aggregation query after the aggregation has taken place. However, the aggregation functions are applied per group. So count(*) is counting the number of rows in each group. That is why it is identifying duplicates.
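To make this concrete, here is a minimal sketch that runs the question's data through SQLite using Python's built-in sqlite3 module. SQLite stands in for Access here (an assumption for illustration); the table and column names are the ones from the question.

```python
import sqlite3

# In-memory database holding the three rows from the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column1 TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?)",
    [("apple",), ("mango",), ("mango",)],
)

# GROUP BY returns one row per distinct value, but COUNT(*) is computed
# over all the original rows that fell into each group.
grouped = conn.execute(
    "SELECT column1, COUNT(*) FROM table1 GROUP BY column1 ORDER BY column1"
).fetchall()

# HAVING then filters the groups using that per-group count.
duplicates = conn.execute(
    "SELECT column1 FROM table1 GROUP BY column1 HAVING COUNT(*) > 1"
).fetchall()

print(grouped)     # [('apple', 1), ('mango', 2)]
print(duplicates)  # [('mango',)]
```

The duplicates are not thrown away before the count is taken; they are what the count is made of.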

group by doesn't just remove duplicate values - it returns one row per distinct value of the group by clause, and allows you to apply aggregate functions to the rows behind each distinct value.
In this query, you effectively ask for the values of column1 and the result of count(*) per value of column1; then you use the having clause to return only the values of column1 that have a count(*) greater than 1.

The GROUP BY clause groups the selected rows by the fields you mention, in this case column1, but it can also be a combination of columns (e.g. column1, column2).
By the way, try running:
SELECT column1, Count(*) AS [Count], MIN(column2) AS MinColumn2, MAX(column2) AS MaxColumn2
FROM table1
GROUP BY column1;
This will help you to understand how grouping works. To filter on a column directly, use the WHERE clause; to filter on a value calculated from the grouping, you need the HAVING clause.
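A small sketch of the WHERE versus HAVING difference, again using SQLite through Python's sqlite3 module. The rows and the numeric column2 values are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (column1 TEXT, column2 INTEGER)")
# Hypothetical sample rows: one apple, three mangoes.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [("apple", 5), ("mango", 1), ("mango", 9), ("mango", 4)],
)

# One row per group, with aggregates computed over the rows in that group.
summary = conn.execute(
    "SELECT column1, COUNT(*), MIN(column2), MAX(column2) "
    "FROM table1 GROUP BY column1 ORDER BY column1"
).fetchall()

# WHERE filters individual rows before they are grouped.
where_filtered = conn.execute(
    "SELECT column1, COUNT(*) FROM table1 "
    "WHERE column2 > 3 GROUP BY column1 ORDER BY column1"
).fetchall()

# HAVING filters whole groups after the aggregates are known.
having_filtered = conn.execute(
    "SELECT column1 FROM table1 GROUP BY column1 HAVING MAX(column2) > 5"
).fetchall()

print(summary)          # [('apple', 1, 5, 5), ('mango', 3, 1, 9)]
print(where_filtered)   # [('apple', 1), ('mango', 2)]
print(having_filtered)  # [('mango',)]
```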

Related

Oracle query mistake

I need to know where the mistake is in this oracle query?
SELECT(KEY1),COUNT(*) FROM TABLE1 GROUP BY AGE
SELECT KEY1,COUNT(*) FROM TABLE1 GROUP BY KEY1
There are two problems. First, the parentheses around KEY1 directly after SELECT should be removed. Second, every column in the select list that is not inside an aggregate function has to appear in the GROUP BY clause, in this case KEY1. If you want to group by age, you have to select AGE instead:
SELECT AGE,COUNT(*) FROM TABLE1 GROUP BY AGE
Your table naming is not very good. I suggest you have a look at group by tutorials like https://www.w3schools.com/sql/sql_groupby.asp or the sql tutorial https://www.w3schools.com/sql/
Your query has an issue. You have to modify it as below:
SELECT KEY1,COUNT(*) FROM TABLE1 GROUP BY KEY1
Observation:
All the columns that appear in the select statement alongside the aggregate functions should be included in the group by columns.
Your first column is wrapped in parentheses, which should be removed.

SQL: Mapping first (or random) result of element from specific column matching certain conditions in another column to variable

The problem is as follows: I want to query a database, and return the number of hits that have particular values in two different columns, grouped by timestamp window. I am currently doing this successfully with the following query:
SELECT shortdate,
sum(column1 like '%interestingthing1%') thing1count,
sum(column1 like '%interestingthing2%') thing2count,
FROM (
select LEFT(string(DATE), 8) shortdate,column1
from [database]
where column3 like '%thingthatcolumn3shouldbe%'
)
group by shortdate
ORDER BY 1 DESC
LIMIT 1000
I also want to return a variable that is a random element of column2, taken from a row that satisfies thingthatcolumn3shouldbe, one for each interestingthingn (n=1,2,...).
I have some intuition that this can be done either with some user-defined function (I am doing this in google bigquery, which allows javascript functions), or a whole mess of UNIONs within the FROM (...) statement above. But since I am far from a SQL expert, this question is partially an effort to start using some best/better practices early on. Naked opinions welcome.
Thanks in advance,
Samuel
Perhaps something like the query below. It picks the "first" value of column2 among the rows where column1 satisfies your condition; since the order of values going into an aggregate function is not deterministic, it is as good as random. And since the FIRST aggregate function ignores NULLs, we just need to make sure to convert column2 to NULL when the condition isn't met.
SELECT shortdate,
sum(column1 like '%interestingthing1%') thing1count,
sum(column1 like '%interestingthing2%') thing2count,
first(if(column1 like '%interestingthing1%', column2, NULL)) column2forthing1,
first(if(column1 like '%interestingthing2%', column2, NULL)) column2forthing2
FROM (
select LEFT(string(DATE), 8) shortdate, column1, column2
from [database]
where column3 like '%thingthatcolumn3shouldbe%'
)
group by shortdate
ORDER BY 1 DESC
LIMIT 1000
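The "turn non-matching rows into NULL, then aggregate" idea can be tried locally. The sketch below is not BigQuery: it uses SQLite through Python's sqlite3 module, and because SQLite has no FIRST aggregate it swaps in MIN over a CASE expression, which also ignores NULLs but picks the smallest value rather than an arbitrary one. The table name hits and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (shortdate TEXT, column1 TEXT, column2 TEXT)")
# Hypothetical rows that already passed the column3 filter.
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    [
        ("20240102", "a thing1", "x"),
        ("20240102", "thing2 b", "y"),
        ("20240102", "thing1 again", "z"),
        ("20240101", "thing2", "w"),
    ],
)

# CASE yields NULL when the condition is not met, and MIN skips NULLs,
# so each MIN only sees column2 values from matching rows.
rows = conn.execute(
    """
    SELECT shortdate,
           SUM(column1 LIKE '%thing1%') AS thing1count,
           SUM(column1 LIKE '%thing2%') AS thing2count,
           MIN(CASE WHEN column1 LIKE '%thing1%' THEN column2 END) AS column2forthing1,
           MIN(CASE WHEN column1 LIKE '%thing2%' THEN column2 END) AS column2forthing2
    FROM hits
    GROUP BY shortdate
    ORDER BY 1 DESC
    """
).fetchall()

for row in rows:
    print(row)
```

A group with no matching row gets NULL (None in Python) for the corresponding column2 value.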

Get ANY(col) instead of MIN(col) from a group

I have a SQL Query (simplified from real use):
SELECT MIN(cola), colb FROM tbl GROUP BY colb;
But actually, I don't need the minimum value- any cola value will do- it's only used to show an example value from the group.
At the moment PG has to do the group and then sort each group by cola to find the minimum value in the group, but this is slow because there's a lot of records in each group.
Does Postgres have some kind of FIRST(cola) or ANY(cola) that would just return whatever cola it sees first (like MySQL does when you don't use an aggregate function), without needing to sort / read cola from every row?
I think using DISTINCT ON() with no order by will achieve what you are after:
SELECT DISTINCT ON (ColB) ColA, ColB
FROM tbl;
Example on SQL Fiddle
The docs state
DISTINCT ON ( expression [, ...] ) keeps only the first row of each set of rows where the given expressions evaluate to equal. The DISTINCT ON expressions are interpreted using the same rules as for ORDER BY (see above). Note that the "first row" of each set is unpredictable unless ORDER BY is used to ensure that the desired row appears first.
However, with no example data to work on I can't actually compare if this will outperform using MIN or any other aggregate function.
This statement:
At the moment PG has to do the group and then sort each group by cola
to find the minimum value in the group, but this is slow because
there's a lot of records in each group.
May logically describe what Postgres does, but it does not explain what is actually going on.
Postgres -- as with any database that I'm familiar with -- will keep a "register" for the minimum value. As new data comes in, it will compare the value in the next row to the minimum. If the new value is smaller, then it will be copied in. This, incidentally, is why min(), max(), avg(), and count() are all faster than count(distinct). For the latter, the list of values within a group must be maintained.
The distinct on approach may be faster than the group by. The reason, however, is not because the database engine is sorting all values for a given colb to get the minimum.
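The "register" idea can be sketched in a few lines of Python. This only illustrates the bookkeeping described above and is not Postgres's actual implementation; the sample rows are made up.

```python
from collections import defaultdict

# Hypothetical (colb, cola) rows arriving in arbitrary order.
rows = [("x", 5), ("y", 2), ("x", 3), ("x", 5), ("y", 7)]

running_min = {}                    # one "register" per group: constant memory
distinct_values = defaultdict(set)  # count(distinct) must remember every value

for colb, cola in rows:
    # min(): compare the incoming value with the register, no sorting needed.
    if colb not in running_min or cola < running_min[colb]:
        running_min[colb] = cola
    # count(distinct): the set of values seen so far has to be kept per group.
    distinct_values[colb].add(cola)

count_distinct = {colb: len(values) for colb, values in distinct_values.items()}

print(running_min)     # {'x': 3, 'y': 2}
print(count_distinct)  # {'x': 2, 'y': 2}
```

Either way every row is still read once; what differs is how much state is kept per group.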
Try using fetch first row at the end of your sql:
http://www.postgresql.org/docs/8.1/static/sql-fetch.html
SELECT MIN(cola), colb
FROM tbl
GROUP BY colb
FETCH FIRST ROW only;
Inspired by Gareth's answer above:
SQL Fiddle
; WITH C as (SELECT *, ROW_NUMBER() OVER (PARTITION BY ColB) as rn FROM tbl)
SELECT *
FROM c
WHERE rn = 1
Not sure if it will perform any better or worse than MIN().

"group by" needed in count(*) SQL statement?

The following statement works in my database:
select column_a, count(*) from my_schema.my_table group by 1;
but this one doesn't:
select column_a, count(*) from my_schema.my_table;
I get the error:
ERROR: column "my_table.column_a" must appear in the GROUP BY clause
or be used in an aggregate function
Helpful note: This thread: What does SQL clause "GROUP BY 1" mean? discusses the meaning of "group by 1".
Update:
The reason why I am confused is because I have often seen count(*) as follows:
select count(*) from my_schema.my_table
where there is no group by statement. Is COUNT always required to be followed by group by? Is the group by statement implicit in this case?
This error makes perfect sense. COUNT is an "aggregate" function. So you need to tell it which field to aggregate by, which is done with the GROUP BY clause.
The one which probably makes most sense in your case would be:
SELECT column_a, COUNT(*) FROM my_schema.my_table GROUP BY column_a;
If you only use COUNT(*), you are asking for the total number of rows, instead of aggregating by another condition. Your question of whether GROUP BY is implicit in that case could be answered with "sort of": not specifying anything is a bit like asking to "group by nothing", which means you get one huge aggregate covering the whole table.
As an example, executing:
SELECT COUNT(*) FROM table;
will show you the number of rows in that table, whereas:
SELECT col_a, COUNT(*) FROM table GROUP BY col_a;
will show you the number of rows per value of col_a. Something like:
col_a | COUNT(*)
---------+----------------
value1 | 100
value2 | 10
value3 | 123
You should also take into account that * means counting every row, including rows that contain NULLs. If you want to count only the non-NULL values of an expression, use COUNT(expression). See the docs about aggregate functions for more details on this topic.
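A quick sketch of the "one huge aggregate" versus "one aggregate per value" difference, run against SQLite through Python's sqlite3 module. SQLite is an assumption here; note that, unlike PostgreSQL, it would not reject the ungrouped query from the question. The sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE my_table (column_a TEXT)")
conn.executemany(
    "INSERT INTO my_table VALUES (?)",
    [("value1",), ("value1",), ("value2",)],
)

# No GROUP BY: the whole table is treated as a single group.
total = conn.execute("SELECT COUNT(*) FROM my_table").fetchone()[0]

# GROUP BY column_a: one count per distinct value of column_a.
per_value = conn.execute(
    "SELECT column_a, COUNT(*) FROM my_table "
    "GROUP BY column_a ORDER BY column_a"
).fetchall()

print(total)      # 3
print(per_value)  # [('value1', 2), ('value2', 1)]
```

The per-value counts add up to the total, which is the sense in which the ungrouped query is "group by nothing".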
If you don't use the GROUP BY clause at all, count(*) collapses the whole table into a single row, so there is no single value of column_a to show next to it; that is why the second statement raises an error. By adding GROUP BY 1 you categorize the rows by column_a, so each distinct value gets its own count.
When you have a function like count, sum etc. you need to group the other columns. This would be equivalent to your query:
select column_a, count(*) from my_schema.my_table group by column_a;
When you use count(*) with no other column, you are counting all rows from SELECT * from the table. When you use count(*) alongside another column, you are counting the number of rows for each different value of that other column. So in this case you need to group the results, in order to show each value and its count only once.
group by 1 in this case refers to column_a, which is at position 1 in the select list of your query.
This is why it works on your server. However, it is not good practice in SQL.
You should mention the column name instead, because the order of columns in the select list may change, which makes this code hard to maintain.
The best solution is:
select column_a, count(*) from my_schema.my_table group by column_a;

Not getting the correct count in SQL

I am totally new to SQL. I have a simple select query similar to this:
SELECT COUNT(col1) FROM table1
There are some 120 records in the table and shown on the GUI.
For some reason, this query always returns a number which is less than the actual count.
Can somebody please help me?
Try
select count(*) from table1
Edit: To explain further, count(*) gives you the rowcount for a table, including duplicates and nulls. count(isnull(col1,0)) will do the same thing, but slightly slower, since isnull must be evaluated for each row.
You might have some null values in the col1 column. Aggregate functions ignore nulls.
Try this:
SELECT COUNT(ISNULL(col1,0)) FROM table1
Slightly tangential, but there's also the useful
SELECT count(distinct cola) from table1
which gives you the number of distinct non-null values of cola in the table.
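The different COUNT variants mentioned so far can be compared side by side. The sketch below uses SQLite through Python's sqlite3 module; since ISNULL(col1, 0) is SQL Server syntax, it uses COALESCE(col1, 0) as a stand-in. The sample data is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (col1 INTEGER)")
# Five hypothetical rows, two of them NULL and one value duplicated.
conn.executemany(
    "INSERT INTO table1 VALUES (?)",
    [(1,), (1,), (None,), (2,), (None,)],
)

counts = conn.execute(
    """
    SELECT COUNT(*),                  -- every row, NULLs included
           COUNT(col1),               -- only rows where col1 is not NULL
           COUNT(COALESCE(col1, 0)),  -- NULLs replaced first, so every row counts
           COUNT(DISTINCT col1)       -- distinct non-NULL values
    FROM table1
    """
).fetchone()

print(counts)  # (5, 3, 5, 2)
```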
You are getting the correct count
As per https://learn.microsoft.com
COUNT(*) returns the number of items in a group. This includes NULL values and duplicates.
COUNT(ALL expression) evaluates an expression for each row in a group and returns the number of nonnull values.
COUNT(DISTINCT expression) evaluates an expression for each row in a group and returns the number of unique, non null values.
In your case you passed the column name to COUNT, which is why you get the count of non-null records; your table data may have null values in the given column (col1).
Hope this helps!