Where should I do the rowcount when checking for existence: SQL or PHP?

In the case where I want to check whether a certain entry exists in the database, I have two options.
I can create an SQL query using COUNT() and then check if the result is > 0...
...or I can just retrieve the record(s) and then count the number of rows in the returned rowset, for example with $result->num_rows.
Which is better/faster, in MySQL and in general?

YMMV, but I suspect that if you are only checking for existence and don't need to use the retrieved data in any way, the COUNT() query will be faster. How much faster will depend on how much data there is.

The fastest is probably asking the database if something exists:
SELECT EXISTS ([your query here])

SELECT 1
FROM (SELECT 1) t
WHERE EXISTS( SELECT * FROM foo WHERE id = 42 )
Just tested, works fine on MySQL v5
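On MySQL specifically, the FROM (SELECT 1) t wrapper isn't required; here's a minimal sketch without it (the table foo and column id are the same hypothetical names as above):
SELECT EXISTS(SELECT 1 FROM foo WHERE id = 42);
It returns 1 or 0, and the engine stops scanning as soon as the first matching row is found.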
COUNT(*) is generally less efficient if you can have duplicates (because the DBMS will have to exhaustively search all of the records/indexes to give you the exact answer) or if you can have NULL entries (for the same reason).
If you are COUNT'ing based on a WHERE clause that is guaranteed to produce a single record (or 0) and the DBMS knows this (based upon UNIQUE indexes), then it ought to be just as efficient. But it is unlikely that you will always have this condition, and the optimizer may not always pick up on it, depending on the version and vendor.
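For instance, with a UNIQUE index on id (a hypothetical sketch), the count below can only ever be 0 or 1 and can be answered from the index alone:
SELECT COUNT(*) FROM foo WHERE id = 42;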
Counting in the application (when you don't need the row) is almost always guaranteed to be slower/worse because:
You have to send data to the client, the client has to buffer it and do some work
You may bump out things in the DBMS MRU/LRU data cache that are more important
Your DBMS will (generally) have to do more disk I/O to fetch record data that you will never use
You have more network activity
Of course, if you want to DO something with the row if it exists, then it is definitely faster/best to simply try and fetch the row to begin with!

If all you are doing is checking for existence, then
Select count(*) ...
But if you will retrieve the data if it exists, then just get the data and check it in PHP, otherwise you'll have two calls.

For me, it's in the database.
Doing a COUNT(1) is faster than $result->num_rows, because with $result->num_rows you perform two operations: a SELECT and then a count of the returned rows. If the SELECT itself does the counting, you get the result faster.
Except if you also want the information from the DB.

If you want raw speed, benchmark! In addition to the methods others have suggested:
SELECT 1 FROM table_name WHERE ... LIMIT 1
may be faster due to avoiding the subselect. Benchmark it.

SELECT COUNT(*) FROM table
is the best choice; this operation is extremely fast on both small tables and large tables. While it's possible that
SELECT id FROM table
is faster on small tables, the difference in speed will be microscopic. But if you have a large table, this operation can be very slow.
Therefore, your best bet is to always COUNT(*) the table (and it's faster to do * than it is to pick a specific column), as overall it will be the fastest operation.

I would definitely do it in PHP to decrease load on the database.
In order to get a count and get the returned rows in SQL you would have to do two queries: a COUNT and then a SELECT.
The PHP way gives you everything you need in one result object.

Related

When would two consecutive SELECT queries run in the same session produce different results?

I got an assignment to find a situation when two consecutive select queries produce different results.
My idea is that if we first run the first query, then we modify some records in a parallel session, and then we run the second query, the results will obviously be different.
I'm curious if there are other situations besides the one mentioned above.
The scenario you described would definitely work, although note you'd have to commit those changes (not sure if that was implied or not, but it's probably a good idea to be explicit).
Another idea that may or may not be a valid solution here is to play around with the ordering. E.g., consider a query like SELECT num_col FROM my_table. Since there is no order by clause, the database is free to return the rows in any way it chooses. Creating an index on num_col between the two queries would probably make the database prefer to query the data from it (full index scan vs full table scan), and chances are you'll get the result in a different order with and without the index.
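A sketch of that idea (table and column names are hypothetical, and the before/after ordering is plausible rather than guaranteed):
SELECT num_col FROM my_table;  -- no ORDER BY: rows may come back in heap/table order
CREATE INDEX my_table_num_idx ON my_table (num_col);
SELECT num_col FROM my_table;  -- may now come back in index order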
EDIT:
Another idea could be to query the current time (e.g., SELECT CURRENT_TIMESTAMP in PostgreSQL; other RDBMSs may have slightly different syntax) - no data in the database is changed, but consecutive calls to the same query will return different results as time moves forward.
There is no need to assume that underlying data changes.
The simplest solution is using a volatile function. For instance, this might return different results when run at different times -- even with no changes to the underlying data:
select t.*
from t
where created_at < current_timestamp - interval '1 year';
Or:
select t.*
from t
order by random()
fetch first 100 rows only;
Actually, a query as simple as:
select random()
also meets the requirements, without actually having to involve any tables.
As you've said, the same two consecutive selects would only produce different results if the data has changed between the two selects.
You could also run a select inside a transaction where data was changed, then roll back and run the same select again and get different results, but once again that implies the underlying data has changed.
If you use WITH (NOLOCK), you can query the same data twice and get different results, for example if a big update was running during one of your selects. It's an edge case, but it can happen.
Brent Ozar has a very good explanation of the NOLOCK issue: https://www.brentozar.com/archive/2019/08/but-nolock-is-okay-when-the-data-isnt-changing-right/
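A minimal T-SQL sketch of that edge case (table name hypothetical): under NOLOCK/READ UNCOMMITTED, two consecutive counts can disagree while an update is in flight, even though no change is ever committed:
SELECT COUNT(*) FROM dbo.Orders WITH (NOLOCK);
SELECT COUNT(*) FROM dbo.Orders WITH (NOLOCK);  -- may differ if a big update is running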

Speed of paged queries in Oracle

This is a never-ending topic for me and I'm wondering if I might be overlooking something. Essentially I use two types of SQL statements in an application:
Regular queries with a "fallback" limit
Sorted and paged queries
Now, we're talking about some queries against tables with several million records, joined to 5 more tables with several million records. Clearly, we hardly want to fetch all of them, that's why we have the above two methods to limit user queries.
Case 1 is really simple. We just add an additional ROWNUM filter:
WHERE ...
AND ROWNUM < ?
That's quite fast, as Oracle's CBO will take this filter into consideration for its execution plan and probably apply a FIRST_ROWS operation (similar to the one enforced by the /*+FIRST_ROWS*/ hint).
Case 2, however, is a bit more tricky with Oracle, as there is no LIMIT ... OFFSET clause as in other RDBMS. So we nest our "business" query in a technical wrapper like this:
SELECT outer.* FROM (
  SELECT * FROM (
    SELECT inner.*, ROWNUM AS RNUM, MAX(ROWNUM) OVER(PARTITION BY 1) AS TOTAL_ROWS
    FROM (
      [... USER SORTED business query ...]
    ) inner
  )
  WHERE ROWNUM < ?
) outer
WHERE outer.RNUM > ?
Note that the TOTAL_ROWS field is calculated so we know how many pages there will be, even without fetching all the data. Now, this paging query is usually quite satisfactory. But every now and then (as I said, when querying 5M+ records, possibly including non-indexed searches), it runs for 2-3 minutes.
EDIT: Please note that a potential bottleneck is not so easy to circumvent, because of the sorting that has to be applied before paging!
I'm wondering, is that state-of-the-art simulation of LIMIT ... OFFSET, including TOTAL_ROWS in Oracle, or is there a better solution that will be faster by design, e.g. by using the ROW_NUMBER() window function instead of the ROWNUM pseudo-column?
The main problem with Case 2 is that in many cases the whole query result set has to be obtained and then sorted before the first N rows can be returned - unless the ORDER BY columns are indexed and Oracle can use the index to avoid a sort. For a complex query and a large set of data this can take some time. However there may be some things you can do to improve the speed:
Try to ensure that no functions are called in the inner SQL - these may get called 5 million times just to return the first 20 rows. If you can move these function calls to the outer query, they will be called less often.
Use a FIRST_ROWS_n hint to nudge Oracle into optimising for the fact that you will never return all the data.
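For illustration, a sketch of where such a hint goes (the FIRST_ROWS(n) hint is standard Oracle syntax; table and column names are hypothetical):
SELECT /*+ FIRST_ROWS(20) */ *
FROM orders
ORDER BY order_date DESC;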
EDIT:
Another thought: you are currently presenting the user with a report that could return thousands or millions of rows, but the user is never realistically going to page through them all. Can you not force them to select a smaller amount of data e.g. by limiting the date range selected to 3 months (or whatever)?
You might want to trace the query that takes a lot of time and look at its explain plan. Most likely the performance bottleneck comes from the TOTAL_ROWS calculation. Oracle has to read all the data, even if you only fetch one row, this is a common problem that all RDBMS face with this type of query. No implementation of TOTAL_ROWS will get around that.
The radical way to speed up this type of query is to forgo the TOTAL_ROWS calculation. Just display that there are additional pages. Do your users really need to know that they can page through 52486 pages? An estimate may be sufficient. That's another solution, implemented by Google search for example: estimate the number of pages instead of actually counting them.
Designing an accurate and efficient estimation algorithm might not be trivial.
A "LIMIT ... OFFSET" is pretty much syntactic sugar. It might make the query look prettier, but if you still need to read the whole of a data set and sort it and get rows "50-60", then that's the work that has to be done.
If you have an index in the right order, then that can help.
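For reference, a minimal sketch of the ROW_NUMBER() variant the question asks about (Oracle syntax; table, columns, and bind names are hypothetical):
SELECT *
FROM (
  SELECT t.*,
         ROW_NUMBER() OVER (ORDER BY t.created_at DESC) AS rn,
         COUNT(*) OVER () AS total_rows
  FROM my_table t
)
WHERE rn BETWEEN :first_row AND :last_row
ORDER BY rn;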
It may perform better to run two queries instead of trying to count() and return the results in the same query. Oracle may be able to answer the count() without any sorting, or without joining to all of the tables (join table elimination based on declared foreign key constraints). This is what we generally do in our application. For performance-critical statements, we write a separate query that we know will return the correct count, as we can sometimes do better than Oracle.
Alternatively, you can make a tradeoff between performance and recency of the data. Bringing back the first 5 pages is going to be nearly as quick as bringing back the first page. So you could consider storing the results from 5 pages in a temporary table along with an expiry date for the information. Take the result from the temporary table if valid. Put a background task in to delete the expired data periodically.
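A rough sketch of that caching idea (all names are hypothetical; a background job deletes rows past expires_at):
CREATE TABLE page_cache (
  query_key  VARCHAR2(64),
  rnum       NUMBER,
  payload    CLOB,
  expires_at TIMESTAMP
);
-- serve from the cache while it is still valid:
SELECT payload
FROM page_cache
WHERE query_key = :key
  AND expires_at > SYSTIMESTAMP
  AND rnum BETWEEN :first_row AND :last_row
ORDER BY rnum;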

Is using COUNT(*) or SELECT * a good idea?

I've heard several times that you shouldn't perform COUNT(*) or SELECT * for performance reasons, but I wasn't able to dig up any further information about it.
I can imagine that the database is then using all columns for the action, which can be an impressive performance loss, but I'm not sure about that. Does somebody have further information about the topic?
1. On count(*) vs. count(something else)
SQL is declarative in that you specify what you want. This is different from specifying how to get what you want. That means the database engine is free to realize your query in whatever way it thinks is the most efficient. Many database optimizers rewrite your query to a less costly alternative (if such a plan is available).
Given the following table:
create table t (
  pk       int not null,
  color    varchar(10) not null,
  nullable varchar(10) null,
  unique (pk),
  index (color)
);
...all of the following are functionally equivalent (due to the mechanics of count and nulls):
1) select count(*) from t;
2) select count(1) from t;
3) select count(pk) from t;
4) select count(color) from t;
Regardless of which form you use, the optimizer is free to rewrite the query to another form if that is more efficient. (Again, not all optimizers are sophisticated enough to do this.) The unique index on pk would be smaller (in bytes occupied) than the entire table, so it would be more efficient to count the number of index entries rather than scan through the entire table. In Oracle we have bitmap indexes, which also compress repeating values. If we had used such an index on the color column, it would probably have been the smallest index to scan. Oracle also supports table compression, which in some cases makes the physical table smaller than a composite index.
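For example (Oracle syntax; the index name is made up), a bitmap index on the low-cardinality color column would look like this:
CREATE BITMAP INDEX t_color_bix ON t (color);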
1. TL;DR;
Your specific DBMS will have its own set of tools that enable different rewriting rules and, in turn, execution plans. That renders the question somewhat moot (unless we are talking about a specific release of a specific DBMS). I recommend COUNT(*) in all cases because it requires the least cognitive effort to grasp.
2. On select a,b,c vs. select *
There are very few valid uses of SELECT * in code you write and put into production. Imagine a table which contains Blu-ray movies (yes, the movie itself is stored as a BLOB in this table). So you slapped together your awesomesauce abstraction layer and put SELECT * FROM movies WHERE id = ? in the getMovies(movie_id) method. I will refrain from explaining why SELECT name FROM movies will be transported across the network just a tad faster. Of course, in most realistic cases it won't have a noticeable impact.
One last point on performance: when all the referenced columns (selected, filtered) in your query exist in an index (called a covering index), the database need not touch the table at all - the query can be fully resolved from scanning the index alone. By selecting all columns you remove this option from the optimizer.
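A sketch of a covering index, reusing the hypothetical movies table from above (column names made up): with the index in place, the query below never touches the table itself:
CREATE INDEX movies_title_year_idx ON movies (title, release_year);
SELECT release_year FROM movies WHERE title = 'Alien';  -- resolvable from the index alone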
Another thing about SELECT * that is far more serious than performance is that it creates an implicit dependency on a specific physical layout of the table. Let me explain. Consider the following tables:
table T1(name, id)
table T2(name, id)
The following statement...
insert into t1 select * from t2;
... will break or produce a different result if any of the following happens:
either table's columns are rearranged, for example to T1(id, name)
T1 gets an additional not-null column
T2 gets another column
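A minimal sketch of the robust form: explicit column lists on both sides make the statement immune to column reordering and to extra columns appearing on t2:
insert into t1 (name, id) select name, id from t2;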
2. TL;DR; When possible, explicitly specify the columns you want (eventually, you'll have to do that anyway). Also, selecting fewer columns is faster than selecting more columns. A positive side-effect of explicit selects is that they give greater freedom to the optimizer.
COUNT(*) is different from COUNT(column1) !
COUNT(*) returns the number of records and does NOT use more resources, while COUNT(column1) counts the number of records where column1 is non-null.
For SELECT, it is different. SELECT * will of course request more data.
When using count(*) the * doesn't mean "all fields". Using count(field) will count all non-null values in the field, but count(*) will always count all records even if all fields in all records are null, so it doesn't need to check the data in the fields at all.
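A quick illustration of the difference (hypothetical table):
CREATE TABLE demo (f INT NULL);
INSERT INTO demo VALUES (1), (NULL), (3);
SELECT COUNT(*) FROM demo;  -- 3: counts rows
SELECT COUNT(f) FROM demo;  -- 2: NULLs are not counted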
Using select * means that you almost always return more data than you are going to use, which of course is a waste. However, perhaps more serious is the maintenance problem: if you add fields to a table, your query will return these too. That might mean that the record becomes too large to fit in the buffer, resulting in an error message.
Don't confuse the * in "COUNT(*)" with the * in "SELECT * ". They are completely unrelated but sometimes confused because it's such an odd syntax. There is nothing wrong with using COUNT(*), which just means "count rows".
SELECT * on the other hand means "select all columns". That's generally poor practice because it tightly couples your code to the database schema. That means when you change the table you probably have to change the code even if it should have been unaffected. It increases the impact of any schema change.
SELECT * may also cause a sub-optimal query plan. Either because you didn't really need all columns or because it forces the DBMS to do an extra lookup at runtime to get the list of columns.
It's absolutely true that * means all columns, and you're right that if you have a table with a very large number of columns (say 100+), these kinds of queries can be bad in terms of efficiency.
I believe the best solution is to create database views that filter the number of records involved in the count operation beforehand; that way the performance impact isn't a big problem, because views can be cached.
On the other hand, the * operator should be avoided when returning records; it's far better to select only the fields you really need for the business at hand.
Using SELECT * can cause a performance hit. Applications which use the SELECT * syntax when they actually only need a handful of columns transfer more data across the network than they need to consume, which is wasteful.
Also, in Microsoft SQL Server at least, there's a strange problem when you use SELECT * in a view and then add a column to the underlying table: the column headings and data returned by the view no longer match each other following certain changes! See my blog post for further details of this particular problem.
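A minimal T-SQL sketch of that view problem (object names hypothetical); sp_refreshview is the usual fix:
CREATE VIEW v AS SELECT * FROM t;
-- later, the underlying table changes:
ALTER TABLE t ADD new_col INT;
-- v still exposes the old column list until its metadata is refreshed:
EXEC sp_refreshview 'v';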
How inefficient it becomes depends on the size of the database; the simplest way to describe it would be like so:
when you specifically do:
SELECT column1,column2,column3 FROM table1
MySQL knows exactly what columns it is looking for, but when you do
SELECT * FROM table1
MySQL does not know which columns you want. It knows you want all of them, but not their names, so it has to perform extra work to analyse the table and discover the columns, thus using more resources.
In case of COUNT(*) it depends on database and its version. For example in modern versions of MS SQL it doesn't matter [source needed].
So the best approach in case of COUNT(*) is to measure it.
Using SELECT * is a really bad idea. * means reading all columns, which can be a heavy I/O and network operation (especially for the various kinds of CHAR columns). Moreover, you rather rarely need all columns.

Fast check of existence of an entry in a SQL database

Nitpicker Question:
I'd like to have a function returning a boolean to check whether a table has an entry or not. And I need to call this a lot, so some optimizing is needed.
I use MySQL for now, but it should be fairly basic...
So should I use
select id from table where a=b limit 1;
or
select count(*) as cnt from table where a=b;
or something completely different?
I think the SELECT with LIMIT should stop after the first find, while COUNT(*) needs to check all entries. So the SELECT could be faster.
The simplest thing would be to do a few loops and test it, but my tests were not helpful. (My test system seemed to be in use by others too, which diluted my results.)
this "need" is often indicative of a situation where you are trying to INSERT or UPDATE. the two most common situations are bulk loading/updating of rows, or hit counting.
checking for existence of a row first can be avoided using the INSERT ... ON DUPLICATE KEY UPDATE statement. for a hit counter, just a single statement is needed. for bulk loading, load the data in to a temporary table, then use INSERT ... ON DUPLICATE KEY UPDATE using the temp table as the source.
but if you can't use this, then the fastest way will be select id from table where a=b limit 1; along with force index to make sure mysql looks ONLY at the index.
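Sketches of both suggestions (table, column, and index names are hypothetical; ON DUPLICATE KEY UPDATE assumes a unique key on page_id):
-- hit counter without a prior existence check:
INSERT INTO hits (page_id, cnt)
VALUES (42, 1)
ON DUPLICATE KEY UPDATE cnt = cnt + 1;
-- existence check pinned to an index:
SELECT id FROM t FORCE INDEX (idx_a) WHERE a = ? LIMIT 1;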
The LIMIT 1 will tell MySQL to stop searching after it finds one row. If there can be multiple rows that match the criteria, this is faster than COUNT(*).
There are more ways to optimize this, but the exact nature would depend on the amount of rows and the spread of a and b. I'd go with the "where a=b" approach until you actually encounter performance issues. Databases are often so fast that most queries are no performance issue at all.

When is the proper time to use the SQL statement "SELECT [results] FROM [tables] WHERE [conditions] FETCH FIRST [n] ROWS ONLY"

I'm not quite sure when selecting only a certain number of rows would be better than simply making a more specific select statement. I have a feeling I'm missing something pretty straightforward, but I can't figure it out. I have less than 6 months' experience with any SQL, and it's been cursory at that, so I'm sorry if this is a really simple question, but I couldn't find a clear answer.
I know of two common uses:
Paging: be sure to specify an ordering. If an ordering isn't specified, many DB implementations use whatever is convenient for executing the query. This "optimal" ordering behavior could give very unpredictable results.
SELECT TOP 10 CustomerName
FROM Customer
WHERE CustomerID > 200 --start of page
ORDER BY CustomerID
Subqueries : many places that a subquery can be issued require that the result is a single value. top 1 is just faster than max in many cases.
--give me some customer that ordered today
SELECT CustomerName
FROM Customer
WHERE CustomerID =
(
  SELECT TOP 1 CustomerID
  FROM Orders
  WHERE OrderDate = #Today
)
Custom paging, typically.
We're using the statement for the following reasons:
Show only the most relevant results (say the top 100) without having to transfer all rows from the DB to the client. In this case, we also use ORDER BY.
We just want to know if there are matching rows and see a few examples. In this case, we don't order the results, and again, FETCH FIRST is much cheaper than having the DB prepare to transfer lots of rows and then throw them away at the client. This is usually during software development, when we need to get a feel for whether a certain SQL statement is right.
When you want to display the values to a user, you are only likely to need N rows. Generally the database server can fetch the first N rows, faster than it can fetch all of the rows, so your screen redraw can go a little bit faster.
Oracle even has a hint, FIRST_ROWS, that suggests that getting data back quickly is more important than getting it all back efficiently.
The designers of SQL agreed with you for a long time, which is why standard SQL didn't include top/limit/etc. until FETCH FIRST was added in SQL:2008.
Think google search results and the number of pages there are typically for results.
Though obviously, there's much more to it in their case but that's the idea.
Aside from paging, any time you want the most or least [insert metric here] row from a table, ordering on [whatever metric] and limiting to 1 row is, IME, better than doing a subquery using MIN/MAX. Could vary by engine.