I've seen answers to this question for other databases (MySQL, SQL Server, etc.) but not for PostgreSQL. So, is COUNT(1) or COUNT(*) faster/better for selecting the row count of a table?
Benchmarking the difference
The last time I benchmarked the difference between COUNT(*) and COUNT(1), on PostgreSQL 11.3, I found that COUNT(*) was about 10% faster. The explanation given by Vik Fearing at the time was that the constant expression 1 (or at least its nullability) is evaluated for every row in the count loop. I haven't checked whether this has been fixed in PostgreSQL 14.
Don't worry about this in real world queries
However, you shouldn't worry about such a performance difference. The difference of 10% was measurable in a benchmark, but I doubt you can consistently measure such a difference in an ordinary query. Also, ideally, all SQL vendors optimise the two things in the same way, given that 1 is a constant expression, and thus can be eliminated. As mentioned in the above article, I couldn't find any difference in any other RDBMS that I've tested (MySQL, Oracle, SQL Server), and I wouldn't expect there to be any difference.
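If you want to repeat this kind of measurement yourself, a minimal sketch could look like the following. It uses Python's built-in sqlite3 module purely as a stand-in so that it runs anywhere; it says nothing about PostgreSQL's behaviour, and the table t, its payload column, and the row count are hypothetical. For a real answer, run the same two statements against your own database.

```python
import sqlite3
import timeit

# Hypothetical table and row count, chosen only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO t (payload) VALUES (?)",
    (("row %d" % i,) for i in range(50_000)),
)
conn.commit()


def count_with(expr):
    """Return the row count of t using COUNT(<expr>)."""
    return conn.execute("SELECT COUNT(%s) FROM t" % expr).fetchone()[0]


# Both forms must agree on the result; only the timing may differ.
count_star = count_with("*")
count_one = count_with("1")

# Repeat each query several times and keep the best run to reduce noise.
time_star = min(timeit.repeat(lambda: count_with("*"), number=10, repeat=5))
time_one = min(timeit.repeat(lambda: count_with("1"), number=10, repeat=5))

print("COUNT(*):", count_star, "best run: %.4fs" % time_star)
print("COUNT(1):", count_one, "best run: %.4fs" % time_one)
```

Taking the best of several repeats is a deliberate choice: a single run is easily distorted by caching and background load, which is part of why a small difference is hard to see in ordinary queries.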
Related
I'm currently taking an SQL course and trying to understand efficiency of queries.
Given this query, what's the efficiency of it:
SELECT *
FROM Customers
WHERE Age = (SELECT MIN(Age)
FROM Customers)
What I'm trying to understand is whether the subquery runs once at the beginning, which would make the query O(n+n)?
Or does the subquery run every time you go through a customer's age, which makes it O(n^2)?
Thank you!
If you want to understand how the query optimizer interprets a query, you have to review the execution / explain plan, which almost every RDBMS makes available.
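As a rough illustration of what reviewing a plan looks like, here is a sketch that asks SQLite (via Python's sqlite3 module) for the plan of the query from the question. SQLite is only an assumed stand-in: the EXPLAIN syntax and the plan output differ between database engines, and the Id and Name columns and the sample rows are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only Age appears in the original question.
conn.execute(
    "CREATE TABLE Customers (Id INTEGER PRIMARY KEY, Name TEXT, Age INTEGER)"
)
conn.executemany(
    "INSERT INTO Customers (Name, Age) VALUES (?, ?)",
    [("Ann", 31), ("Bob", 19), ("Cid", 19), ("Dee", 42)],
)

query = """
SELECT *
FROM Customers
WHERE Age = (SELECT MIN(Age) FROM Customers)
"""

# SQLite-specific: prefix the statement with EXPLAIN QUERY PLAN.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
for step in plan:
    # The last column holds the human-readable description of the step.
    print(step[-1])

# Running the query itself returns every customer with the minimum age.
youngest = conn.execute(query).fetchall()
print(youngest)
```

The plan lists the scalar subquery as its own step, which is the kind of detail you would look at to see how a particular engine handles it.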
As noted in the comments you tell the RDBMS what you want, not how to get it.
Very often it helps to have a deeper understanding of the particular database engine being used in order to write a query in the most performant way, i.e., to be able to think like the query processor.
Like any language, there's more than one way to skin a cat, so to speak, and with SQL there is usually more than one way to write a query that results in the same output - very often many ways, depending on the complexity.
How a query execution plan gets built and executed is determined by the query optimizer at compile time and depends on many factors, depending on the RDBMS, such as data cardinality, table size, row size, estimated number of rows, sargability, indexes, available resources, current load, concurrency, isolation level - just to name a few.
It often helps to write queries in the most performant way by thinking what you would have to do to accomplish the same task.
In your example, you are looking for all the rows in a table where a particular value equals another value. You have chosen to find that value by first looking for the minimum age - you would only have to do this once as it's a single scalar value, so it's reasonable to assume (but not guaranteed) the database engine would do the same.
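To make the two cost models from the question concrete, here is a small sketch in plain Python (not SQL, and not how any particular engine is implemented). It counts how many rows each strategy visits: finding the minimum once and then filtering, versus recomputing the minimum for every row. The sample rows and function names are hypothetical.

```python
# Hypothetical sample data: (name, age) pairs.
customers = [("Ann", 31), ("Bob", 19), ("Cid", 19), ("Dee", 42)]


def youngest_two_pass(rows):
    """Find the minimum age once, then filter: about n + n row visits."""
    visited = 0
    min_age = None
    for _, age in rows:
        visited += 1
        if min_age is None or age < min_age:
            min_age = age
    result = []
    for row in rows:
        visited += 1
        if row[1] == min_age:
            result.append(row)
    return result, visited


def youngest_rescan(rows):
    """Recompute the minimum for every row: about n + n * n row visits."""
    visited = 0
    result = []
    for row in rows:
        visited += 1
        min_age = None
        for _, age in rows:
            visited += 1
            if min_age is None or age < min_age:
                min_age = age
        if row[1] == min_age:
            result.append(row)
    return result, visited


two_pass_result, two_pass_visits = youngest_two_pass(customers)
rescan_result, rescan_visits = youngest_rescan(customers)

print(two_pass_result, two_pass_visits)  # 8 visits for 4 rows
print(rescan_result, rescan_visits)      # 20 visits for 4 rows
```

Both strategies return the same rows; only the amount of work differs. Because the subquery does not depend on the outer row, nothing forces an engine to take the second route.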
You could also approach the problem by aggregating and limiting to the top qualifying row and including ties, if the syntax is supported by the RDBMS, and joining the results.
Ultimately there is no black and white answer.
In the table T, it is guaranteed that each value of column A is associated with exactly one value of column B (i.e. that there is a functional dependency A → B). Because of this both of the queries below return the same results. Which one will generally run faster?
Using GROUP BY on A and B
select
A
,B
,sum(C)
from
T
group by
A
,B
or using MAX/MIN on B?
select
A
,MAX(B)
,sum(C)
from
T
group by
A
I do know that the GROUP BY A and B version is better at not concealing data issues where an A arrives that is associated with more than one B, I'm just curious about whether one of the queries is generally more work for a DBMS to execute. If the answer depends entirely on the choice of DBMS and you still have interesting information to share then choose your favourite DBMS and answer only for it.
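The equivalence, and the data-issue point, can be checked with a small sketch. It uses SQLite through Python's sqlite3 module as an assumed stand-in for whatever DBMS you run, with made-up sample rows; it demonstrates the results only and says nothing about relative speed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE T (A INTEGER, B TEXT, C INTEGER)")
# Hypothetical rows in which every A maps to exactly one B.
conn.executemany(
    "INSERT INTO T VALUES (?, ?, ?)",
    [(1, "x", 10), (1, "x", 5), (2, "y", 7), (3, "z", 1), (3, "z", 2)],
)

group_by_both = "SELECT A, B, SUM(C) FROM T GROUP BY A, B ORDER BY A, B"
group_by_a = "SELECT A, MAX(B), SUM(C) FROM T GROUP BY A ORDER BY A"

rows_both = conn.execute(group_by_both).fetchall()
rows_max = conn.execute(group_by_a).fetchall()
print(rows_both)  # [(1, 'x', 15), (2, 'y', 7), (3, 'z', 3)]
print(rows_max)   # same rows while the dependency A -> B holds

# Break the dependency: A = 1 now arrives with a second B value.
conn.execute("INSERT INTO T VALUES (1, 'w', 100)")
broken_both = conn.execute(group_by_both).fetchall()
broken_max = conn.execute(group_by_a).fetchall()
print(broken_both)  # the extra group (1, 'w', 100) is visible
print(broken_max)   # the bad row is silently folded into (1, 'x', 115)
```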
Well I went ahead and ran a test on SQL Server 2016 even though I was interested in uncovering more general, theory-based information. I used four columns in the role of B above to accentuate any differences in run time and submitted a batch containing both types of query above. The execution plans generated by SQL Server were almost identical but the cost reported for the GROUP BY query was 53% of the batch while that of the MAX/MIN query was 47%.
The initial index seek step is identical for both queries. It is followed by a hash table building step in which the GROUP BY version incurs a higher cost than the MAX/MIN version. The steps after that have negligible cost for both versions.
Counter-intuitively, in spite of the GROUP BY version having a slightly higher cost, it runs in slightly less time. I guess it's still possible to consume more CPU cycles while running if parallelism is greater. At this point I've reached the end of my ability (and appetite) to scry DBMS execution plans so I'll leave it there.
Possible Duplicate:
Count(*) vs Count(1)
I remember anecdotally being told:
never use count(*) when count(1) will do
Recently I passed this advice on to another developer, and was challenged to prove this was true. My argument was what I was told along with when I was given the advice: that the database would only return the first column, which would then be counted. The counterargument was that the database wouldn't evaluate anything in the brackets.
From some (unscientific) testing on small tables, there certainly seems to be no difference. I don't currently have access to any large tables to experiment on.
I was given this advice when I was using Sybase, and tables had hundreds of millions of rows. I'm now working with Oracle and considerably less data.
So I guess in summary, my two questions are:
Which is faster, count(1) or count(*)?
Would this vary in different database vendors?
According to another similar question (Count(*) vs Count(1)), they are the same.
In Oracle, according to Ask Tom, count(*) is the correct way to count the number of rows because the optimizer changes count(1) to count(*). count(1) actually means to count rows with non-null 1's (all of them are non-null so optimizer will change it for you).
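The "non-null" point is easy to see with a column that does contain NULLs. The sketch below uses SQLite through Python's sqlite3 module rather than Oracle, on the assumption that the standard COUNT semantics are the same; the table and column names are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER PRIMARY KEY, nickname TEXT)")
# Two of the four rows have no nickname.
conn.executemany(
    "INSERT INTO people (id, nickname) VALUES (?, ?)",
    [(1, "a"), (2, None), (3, "c"), (4, None)],
)

count_star, count_one, count_col = conn.execute(
    "SELECT COUNT(*), COUNT(1), COUNT(nickname) FROM people"
).fetchone()

# COUNT(*) counts rows; COUNT(1) counts rows where the constant 1 is
# non-null, which is every row; COUNT(nickname) skips the NULL nicknames.
print(count_star, count_one, count_col)  # 4 4 2
```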
See
What is better in MYSQL count(*) or count(1)?
for MySQL (no difference between count(*) and count(1))
Count(*) vs Count(1)
http://beyondrelational.com/blogs/dave_ballantyne/archive/2010/07/27/count-or-count-1.aspx
for MS SQL Server (no difference)
http://dbaspot.com/sybase/349079-count-vs-count-1-a.html
for Sybase (no difference)
In reading books specifically on TSQL and Microsoft SQL Server, I have read that using * is better because it lets the optimizer decide what is best to do. I'll try to find the names of the specific books and post those here.
This is such a basic query pattern, and the meaning is identical. I've read more than once that the optimizer treats them identically - can't find a specific reference right now but put this in the category of "institutional knowledge".
(should have searched first... http://stackoverflow.com/questions/1221559/count-vs-count1)
I can only speak to SQL Server, but testing on a 5 GB table with 11 million records, both the number of reads and the execution plan were identical.
I'd say there is no difference.
As far as I know, using count(*) should be faster because when that function is called the engine counts only indexes. From another point of view, both count(*) and count(1) probably look very similar in binary code, so there should be no difference.
count(1)
No, generally speaking this will always have slightly better performance.
The difference would only matter when scaled up to a drastic amount of data, but it is good practice.
I want to know whether MySQL executes count queries in linear time or in log(n) time. I think that if the query parameter is indexed, it can do it by cubing.
MyISAM will return immediately.
InnoDB will do a PK scan, so the time will increase linearly with the number of records.
If you need to see approximately how many records InnoDB table holds, the fastest way is using
EXPLAIN select * from student;
(but InnoDB statistics may be wrong, so an error of 40% is quite possible)
It all depends on the query, or more precisely, on the query plan MySQL eventually selects to process the query.
It also depends on what we mean by 'n' in these big O expressions. For example, if 'n' is the count value eventually returned, and that count is produced by a query which requires iteratively scanning multiple tables, the complexity could be worse than linear.
The answer to this is complicated. Not only does it depend on the number of tables involved, but it can also depend on what storage engine you're using.
Having said that, this is what the manual says:
COUNT(*) is optimized to return very
quickly if the SELECT retrieves from
one table, no other columns are
retrieved, and there is no WHERE
clause. For example:
mysql> SELECT COUNT(*) FROM student;
This optimization applies only to
MyISAM tables only, because an exact
row count is stored for this storage
engine and can be accessed very
quickly. For transactional storage
engines such as InnoDB, storing an
exact row count is more problematic
because multiple transactions may be
occurring, each of which may affect
the count.
-- MySQL Manual
SELECT NR_DZIALU, COUNT (NR_DZIALU) AS LICZ_PRAC_DZIALU
FROM PRACOWNICY
GROUP BY NR_DZIALU
HAVING NR_DZIALU = 30
or
SELECT NR_DZIALU, COUNT (NR_DZIALU) AS LICZ_PRAC_DZIALU
FROM PRACOWNICY
WHERE NR_DZIALU = 30
GROUP BY NR_DZIALU
The theory (by theory I mean SQL Standard) says that WHERE restricts the result set before returning rows and HAVING restricts the result set after bringing all the rows. So WHERE is faster. On SQL Standard compliant DBMSs in this regard, only use HAVING where you cannot put the condition on a WHERE (like computed columns in some RDBMSs.)
You can just see the execution plan for both and check for yourself, nothing will beat that (measurement for your specific query in your specific environment with your data.)
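A sketch of that check, using SQLite through Python's sqlite3 module as an assumed stand-in for your own DBMS. The ID column and the sample rows are hypothetical, and the plans printed here are SQLite's, so they only illustrate the procedure, not what your engine will do.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema: only NR_DZIALU appears in the original queries.
conn.execute(
    "CREATE TABLE PRACOWNICY (ID INTEGER PRIMARY KEY, NR_DZIALU INTEGER)"
)
conn.executemany(
    "INSERT INTO PRACOWNICY (NR_DZIALU) VALUES (?)",
    [(10,), (20,), (30,), (30,), (30,), (20,)],
)

having_query = """
SELECT NR_DZIALU, COUNT(NR_DZIALU) AS LICZ_PRAC_DZIALU
FROM PRACOWNICY
GROUP BY NR_DZIALU
HAVING NR_DZIALU = 30
"""

where_query = """
SELECT NR_DZIALU, COUNT(NR_DZIALU) AS LICZ_PRAC_DZIALU
FROM PRACOWNICY
WHERE NR_DZIALU = 30
GROUP BY NR_DZIALU
"""

having_rows = conn.execute(having_query).fetchall()
where_rows = conn.execute(where_query).fetchall()
print(having_rows, where_rows)  # both are [(30, 3)]

# Print each plan so the two can be compared side by side.
for label, query in (("HAVING", having_query), ("WHERE", where_query)):
    print(label)
    for step in conn.execute("EXPLAIN QUERY PLAN " + query):
        print("  ", step[-1])
```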
It might depend on the engine. MySQL for example, applies HAVING almost last in the chain, meaning there is almost no room for optimization. From the manual:
The HAVING clause is applied nearly last, just before items are sent to the client, with no optimization. (LIMIT is applied after HAVING.)
I believe this behavior is the same in most SQL database engines, but I can't guarantee it.
The two queries are equivalent and your DBMS query optimizer should recognise this and produce the same query plan. It may not, but the situation is fairly simple to recognise, so I'd expect any modern system - even Sybase - to deal with it.
HAVING clauses should be used to apply conditions on group functions, otherwise they can be moved into the WHERE condition. For example, if you wanted to restrict your query to groups that have COUNT(NR_DZIALU) > 10, say, you would need to put the condition into a HAVING because it acts on the groups, not the individual rows.
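A condition on a group function might look like the sketch below. It again uses SQLite through Python's sqlite3 module as an assumed stand-in, with a hypothetical ID column, made-up rows, and a threshold of 1 instead of 10 so that the tiny sample produces output. The second statement shows SQLite rejecting the same aggregate condition in WHERE; other engines word the error differently.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE PRACOWNICY (ID INTEGER PRIMARY KEY, NR_DZIALU INTEGER)"
)
conn.executemany(
    "INSERT INTO PRACOWNICY (NR_DZIALU) VALUES (?)",
    [(10,), (20,), (20,), (30,), (30,), (30,)],
)

# The condition refers to the aggregate, so it belongs in HAVING.
big_groups = conn.execute(
    """
    SELECT NR_DZIALU, COUNT(NR_DZIALU) AS LICZ_PRAC_DZIALU
    FROM PRACOWNICY
    GROUP BY NR_DZIALU
    HAVING COUNT(NR_DZIALU) > 1
    ORDER BY NR_DZIALU
    """
).fetchall()
print(big_groups)  # [(20, 2), (30, 3)]

# The same condition in WHERE is rejected, because WHERE is evaluated
# per row, before any groups exist.
where_rejected = False
try:
    conn.execute(
        "SELECT NR_DZIALU FROM PRACOWNICY "
        "WHERE COUNT(NR_DZIALU) > 1 GROUP BY NR_DZIALU"
    )
except sqlite3.Error as exc:
    where_rejected = True
    print("rejected:", exc)
```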
I'd expect the WHERE clause would be faster, but it's possible they'd optimize to exactly the same.
Saying they would optimize is not really taking control and telling the computer what to do. I would agree that HAVING is not an alternative to a WHERE clause. HAVING has a special usage: it is applied to a GROUP BY where something like a sum() was used and you want to limit the result set to show only groups having a sum() greater than 100, say. HAVING works on groups, WHERE works on rows. They are apples and oranges. So really, they should not be compared, as they are two very different animals.
"WHERE" is faster than "HAVING"!
The more complex the grouping in the query, the slower "HAVING" will be by comparison, because the "HAVING" filter has to deal with a larger number of results and is also an additional filter loop
"HAVING" will also use more memory (RAM)
Although when working with small data, the difference is minor and can absolutely be ignored
"HAVING" is slower when we compare on a large amount of data, because it works on groups of records while "WHERE" works on individual rows.
"WHERE" restricts results before bringing all rows and "HAVING" restricts results after bringing all the rows
Both statements will have the same performance, as SQL Server is smart enough to parse both statements into a similar plan.
So, it does not matter if you use WHERE or HAVING in your query.
But ideally, syntactically speaking, you should use the WHERE clause.