When would two consecutive SELECT queries run in the same session produce different results? - sql

I got an assignment to find a situation when two consecutive select queries produce different results.
My idea is that if we first run the first query, then we modify some records in a parallel session, and then we run the second query, the results will obviously be different.
I'm curious if there are other situations besides the one mentioned above.

The scenario you described would definitely work, although note that you'd have to commit those changes (not sure if that was implied or not, but it's probably a good idea to be explicit).
Another idea that may or may not be a valid solution here is to play around with the ordering. E.g., consider a query like SELECT num_col FROM my_table. Since there is no ORDER BY clause, the database is free to return the rows in any order it chooses. Creating an index on num_col between the two queries would probably make the database prefer to read the data from it (full index scan vs. full table scan), and chances are you'll get the result in a different order with and without the index.
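A minimal sketch of that idea, reusing the table and column names from above (the index name is made up, and whether the plan actually changes is entirely DB- and version-specific):
SELECT num_col FROM my_table;   -- likely returned in table (heap) order
CREATE INDEX idx_num_col ON my_table (num_col);
SELECT num_col FROM my_table;   -- may now come back in index order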
EDIT:
Another idea could be to query the current time (e.g., SELECT CURRENT_TIMESTAMP in PostgreSQL; other RDBMSs may have slightly different syntax) - no data in the database is changed, but consecutive calls to the same query will return different results as time moves forward.
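For example, in PostgreSQL (one caveat: CURRENT_TIMESTAMP is fixed at the start of the transaction, so the two statements must run as separate transactions - or use clock_timestamp(), which advances even within a single transaction):
SELECT clock_timestamp();
SELECT clock_timestamp();  -- a slightly later value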

There is no need to assume that underlying data changes.
The simplest solution is using a volatile function. For instance, this might return different results when run at different times -- even with no changes to the underlying data:
select t.*
from t
where created_at < current_timestamp - interval '1 year';
Or:
select t.*
from t
order by random()
fetch first 100 rows only;
Actually, a query as simple as:
select random()
also meets the requirements, without actually having to involve any tables.

As you've said, the same two consecutive selects would only produce different results if the data has changed between the two selects.
You could also run a select inside a transaction where data was changed, then roll back and run the same select again and get different results - but once again, that implies the underlying data has changed.
If you use the WITH (NOLOCK) hint, you can query the same data two times and get different results, for example if a big update was running during one of your selects. It's an edge case, but it can happen.
Brent Ozar has a very good explanation of the WITH (NOLOCK) issue: https://www.brentozar.com/archive/2019/08/but-nolock-is-okay-when-the-data-isnt-changing-right/
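A hedged sketch of that edge case (SQL Server syntax; the orders table is invented for illustration). Under NOLOCK, an allocation-ordered scan running during a concurrent update can read a row twice or skip it entirely:
SELECT COUNT(*) FROM orders WITH (NOLOCK);  -- run while a big UPDATE touches orders
SELECT COUNT(*) FROM orders WITH (NOLOCK);  -- may return a different count, even if no rows were inserted or deleted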

Related

Select without order by

It is my understanding that select is not guaranteed to always return the same result.
The following query is not guaranteed to return the same result every time:
select * from myTable offset 10000 limit 100
My question is: if myTable is not changed between executions of the select (no deletions or inserts), can I rely on it returning the same result set every time?
Or to put it another way: if my database is locked for changes, can I rely on the select returning the same result?
I am using PostgreSQL.
Tables and result sets (without order by) are simply not ordered. It really is that simple.
In some databases, under some circumstances, the order will be consistent. However, you should never depend on this. Subsequent releases, for instance, might invalidate the query.
For me, I think the simplest way to understand this is by thinking of parallel processing. When you execute a query, different threads might go out and start to fetch data; which values are returned first depends on non-reproducible factors.
Another way to think of it is to consider a page cache that already has pages in memory -- probably from the end of the table. The SQL engine could read the pages in any order (although in practice this doesn't really happen).
Or, some other query might have a row or page lock, so that page gets skipped when reading the records.
So, just accept that unordered means exactly that: unordered. Add an ORDER BY if you want data in a particular order. If you order by a clustered index key, then there is basically no performance hit.
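To make the paging query from the question deterministic, order by a unique key (assuming here that myTable has a unique id column):
select * from myTable order by id offset 10000 limit 100;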

Determine if a SQL Insert/Update statement affects the result from a stored Select Statement

Thought this would be a good place to ask for some "brainstorming." Apologies if it's a little broad/off subject.
I was wondering if anyone here had any ideas on how to approach the following problem:
First assume that I have a select statement stored somewhere as an object (this can be the tree form of the query). For example (for simplicity):
SELECT A, B FROM table_A WHERE A > 10;
It's easy to determine the below would change the result of the above query:
INSERT INTO table_A (A,B) VALUES (12,15);
But, given any possible Insert/Update/Whatever statement, as well as any possible starting Select (but we know the Selects and can analyze them all day) I'd like to determine if it would affect the result of the Select Statement.
It's fine to assume that there won't be any "outside" queries, and that we know about all the queries being sent to the DB. It is also assumed we know the DB schema.
No, this isn't for homework. Just a brain teaser I've been thinking about and started to get stuck on (obviously, SQL can get very complicated.)
Based on the reply to the comment, I'd say that without additional criteria, this ranges between very hard and impossible.
Very hard (leastways, it would be for me) because you'd have to write something to parse and interpret your SQL statements into a workable frame of reference for your goals. Doable, but would it be worth the effort?
Impossible because some queries transcend phrases like "Byzantinely complex". (Think nested queries, correlated subqueries, views, common table expressions, triggers, outer joins, and who knows what all.) Without setting criteria such as "no subqueries, no views or triggers, no more than X joins" and so forth, the problem becomes open-ended enough to be effectively intractable.
My first thought would be to put a trigger on table_A, where if any of the columns you're affecting (col A in this case) changes to meet (or no longer meet) the condition (> 10 here), then the trigger records that an "affecting" change has taken place.
E.g. have another little table to record a "last update timestamp", which the trigger could pop a getdate() into when it detects such a change.
Then, you could check that table to see if the timestamp has changed since the last time you ran the select query - if it has, then you know you need to re-run it, if it hasn't, then you know the results would be the same.
The table could hold many such timestamps (one per row, perhaps with the table/trigger name as a key value in another column) to service many such triggers.
Advantage? Being done in a trigger on the table means no risk of a change that could affect the select statement being missed.
Disadvantage? I guess depending on how your select statements come into existence, you might have an undesirable/unmanageable overhead in creating the trigger(s).
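A minimal sketch of the trigger idea in SQL Server syntax (the getdate() above suggests that dialect); the watch table, trigger name, and query key are all invented for illustration:
-- One row per watched select statement
CREATE TABLE query_watch (
    query_name  VARCHAR(100) PRIMARY KEY,
    last_change DATETIME NOT NULL
);
CREATE TRIGGER trg_table_A_watch
ON table_A
AFTER INSERT, UPDATE, DELETE
AS
BEGIN
    -- Record a change only if an affected row meets (or previously met) the predicate A > 10
    IF EXISTS (SELECT 1 FROM inserted WHERE A > 10)
       OR EXISTS (SELECT 1 FROM deleted WHERE A > 10)
        UPDATE query_watch
        SET last_change = GETDATE()
        WHERE query_name = 'select_A_gt_10';
END;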

Complexity comparison between temporary table + index creation vice multi-table group by without index

I have two potential roads to take on the following problem, the try it and see methodology won't pay off for this solution as the load on the server is constantly in flux. The two approaches I have are as follows:
select *
from
(
  select foo.a, bar.b, baz.c
  from foo, bar, baz
  -- updated for clarity's sake
  where foo.a = bar.b
  and bar.b = baz.c
)
group by a, b, c
vice
create table results as
select foo.a, bar.b, baz.c
from foo, bar, baz
where foo.a = bar.b
and bar.b = baz.c;
create index results_spanning on results(a, b, c);
select * from results group by a, b, c;
So, in case it isn't clear: the top query performs the GROUP BY outright against the multi-table select, thus preventing me from using an index. The second approach lets me create a new table that stores the results of the query, then create a spanning index, and finally run the GROUP BY query so it can utilize the index.
What is the complexity difference between these two approaches, i.e. how do they scale, and which is preferable for large quantities of data? The main issue is the performance of the overall select, so that is what I am attempting to fix here.
Comments
Are you really doing a CROSS JOIN on three tables? Are those three columns indexed in their own right? How often do you want to run the query which delivers the end result?
1) No.
2) Yes, where clause omitted for the sake of discussion as this is clearly a super trivial example
3) Doesn't matter.
2nd Update
This is a temporary table as it is only valid for a brief moment in time, so yes this table will only be queried against one time.
If your query is executed frequently and unacceptably slow, you could look into creating materialized views to pre-compute the results. This gives you the benefit of an indexable "table", without the overhead of creating a table every time.
You'll need to refresh the materialized view (preferably fast refresh, if the tables are large) either on commit or on demand. There are some restrictions on how you can create on-commit, fast-refreshable views, and they will add slightly to your commit-time processing, but they will always give the same result as running the base query. On-demand MVs will become stale as the underlying data changes, until they are refreshed. You'll need to determine whether this is acceptable or not.
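A hedged Oracle sketch of that suggestion, reusing the question's tables (an on-commit, fast-refreshable join MV has extra requirements such as materialized view logs and ROWIDs in the select list, so this sketch settles for on-demand refresh):
create materialized view results_mv
build immediate
refresh force on demand
as
select foo.a, bar.b, baz.c
from foo, bar, baz
where foo.a = bar.b
and bar.b = baz.c;
create index results_mv_spanning on results_mv (a, b, c);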
So the question is, which is quicker?
Run a query once and sort the result set?
Run a query once to build a table, then build an index, then run the query again and sort the result set?
Hmmm. Tricky one.
The use cases for temporary tables are pretty rare in Oracle. They normally only apply when we need to freeze a result set which we are then going to query repeatedly. That is apparently not the case here.
So, take the first option and just tune the query if necessary.
The answer is, as is so often the case with tuning questions, it depends.
Why are you doing a GROUP BY in the first place? The query as you posted it doesn't do any aggregation, so the only reason for doing a GROUP BY would be to eliminate duplicate rows, i.e. a DISTINCT operation. If that is actually the case, then you are doing some form of cartesian join, and one way of tuning the query would be to fix the WHERE clause so that it only returns discrete records.
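If deduplication really is the goal, saying so explicitly is clearer than a bare GROUP BY; a sketch against the question's tables:
select distinct foo.a, bar.b, baz.c
from foo, bar, baz
where foo.a = bar.b
and bar.b = baz.c;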

Speed of paged queries in Oracle

This is a never-ending topic for me and I'm wondering if I might be overlooking something. Essentially I use two types of SQL statements in an application:
Regular queries with a "fallback" limit
Sorted and paged queries
Now, we're talking about some queries against tables with several million records, joined to 5 more tables with several million records. Clearly, we hardly want to fetch all of them, that's why we have the above two methods to limit user queries.
Case 1 is really simple. We just add an additional ROWNUM filter:
WHERE ...
AND ROWNUM < ?
That's quite fast, as Oracle's CBO will take this filter into consideration for its execution plan and probably apply a FIRST_ROWS operation (similar to the one enforced by the /*+FIRST_ROWS*/ hint).
Case 2, however is a bit more tricky with Oracle, as there is no LIMIT ... OFFSET clause as in other RDBMS. So we nest our "business" query in a technical wrapper as such:
SELECT outer.* FROM (
  SELECT * FROM (
    SELECT inner.*, ROWNUM AS RNUM, MAX(ROWNUM) OVER (PARTITION BY 1) AS TOTAL_ROWS
    FROM (
      [... USER SORTED business query ...]
    ) inner
  )
  WHERE ROWNUM < ?
) outer
WHERE outer.RNUM > ?
Note that the TOTAL_ROWS field is calculated so we know how many pages there will be, even without fetching all the data. Now, this paging query usually performs quite satisfactorily. But every now and then (as I said, when querying 5M+ records, possibly including non-indexed searches), it runs for 2-3 minutes.
EDIT: Please note, that a potential bottleneck is not so easy to circumvent, because of sorting that has to be applied before paging!
I'm wondering, is that state-of-the-art simulation of LIMIT ... OFFSET, including TOTAL_ROWS in Oracle, or is there a better solution that will be faster by design, e.g. by using the ROW_NUMBER() window function instead of the ROWNUM pseudo-column?
The main problem with Case 2 is that in many cases the whole query result set has to be obtained and then sorted before the first N rows can be returned - unless the ORDER BY columns are indexed and Oracle can use the index to avoid a sort. For a complex query and a large set of data this can take some time. However there may be some things you can do to improve the speed:
Try to ensure that no functions are called in the inner SQL - these may get called 5 million times just to return the first 20 rows. If you can move these function calls to the outer query, they will be called far less often.
Use a FIRST_ROWS_n hint to nudge Oracle into optimising for the fact that you will never return all the data.
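A sketch of that hint, assuming a page size of 20 (FIRST_ROWS(n) is the hint form; the FIRST_ROWS_n optimizer modes come in the fixed sizes 1, 10, 100, and 1000):
SELECT /*+ FIRST_ROWS(20) */ inner.*, ROWNUM AS RNUM
FROM ( [... USER SORTED business query ...] ) inner
WHERE ROWNUM < 21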
EDIT:
Another thought: you are currently presenting the user with a report that could return thousands or millions of rows, but the user is never realistically going to page through them all. Can you not force them to select a smaller amount of data e.g. by limiting the date range selected to 3 months (or whatever)?
You might want to trace the query that takes a lot of time and look at its explain plan. Most likely the performance bottleneck comes from the TOTAL_ROWS calculation: Oracle has to read all the data, even if you only fetch one row. This is a common problem that all RDBMSs face with this type of query, and no implementation of TOTAL_ROWS will get around it.
The radical way to speed up this type of query is to forego the TOTAL_ROWS calculation and just display that there are additional pages. Do your users really need to know that they can page through 52486 pages? An estimation may be sufficient - that's the solution implemented by Google search, for example: estimate the number of pages instead of actually counting them.
Designing an accurate and efficient estimation algorithm might not be trivial, though.
A "LIMIT ... OFFSET" is pretty much syntactic sugar. It might make the query look prettier, but if you still need to read the whole of a data set and sort it and get rows "50-60", then that's the work that has to be done.
If you have an index in the right order, then that can help.
It may perform better to run two queries instead of trying to count() and return the results in the same query. Oracle may be able to answer the count() without any sorting, or without joining all the tables (join table elimination based on declared foreign key constraints). This is what we generally do in our application. For performance-critical statements, we write a separate query that we know will return the correct count, as we can sometimes do better than Oracle - see the sketch below.
Alternatively, you can make a tradeoff between performance and recency of the data. Bringing back the first 5 pages is going to be nearly as quick as bringing back the first page. So you could consider storing the results from 5 pages in a temporary table along with an expiry date for the information. Take the result from the temporary table if valid. Put a background task in to delete the expired data periodically.
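A sketch of the two-statement approach, assuming a page size of 20 and an invented orders table (the count query can often skip the sort, and sometimes some joins, entirely):
-- 1) count only
SELECT COUNT(*) AS total_rows
FROM orders
WHERE order_date >= DATE '2009-01-01';
-- 2) fetch one sorted page (rows 21-40) with the classic ROWNUM wrapper
SELECT *
FROM (
  SELECT o.*, ROWNUM AS rnum
  FROM (
    SELECT *
    FROM orders
    WHERE order_date >= DATE '2009-01-01'
    ORDER BY order_date
  ) o
  WHERE ROWNUM <= 40
)
WHERE rnum > 20;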

Where should I do the rowcount when checking for existence: sql or php?

In the case when I want to check, if a certain entry in the database exists I have two options.
I can create an SQL query using COUNT() and then check if the result is > 0...
...or I can just retrieve the record(s) and then count the number of rows in the returned rowset, for example with $result->num_rows.
What's better/faster? In MySQL? In general?
YMMV, but I suspect that if you are only checking for existence, and don't need to use the retrieved data in any way, the COUNT() query will be faster. How much faster will depend on how much data.
The fastest is probably asking the database whether something exists:
SELECT EXISTS ([your query here])
For instance:
SELECT 1
FROM (SELECT 1) t
WHERE EXISTS( SELECT * FROM foo WHERE id = 42 )
Just tested - works fine on MySQL v5.
COUNT(*) is generally less efficient if:
- you can have duplicates (because the DBMS will have to exhaustively search all of the records/indexes to give you the exact answer), or
- you can have NULL entries (for the same reason).
If you are COUNT'ing based on a WHERE clause that is guaranteed to produce a single record (or 0) and the DBMS knows this (based upon UNIQUE indexes), then it ought to be just as efficient. But, it is unlikely that you will always have this condition. Also, the DBMS may not always pick up on this depending on the version and DBMS.
Counting in the application (when you don't need the row) is almost always guaranteed to be slower/worse because:
You have to send data to the client, the client has to buffer it and do some work
You may bump out things in the DBMS MRU/LRU data cache that are more important
Your DBMS will (generally) have to do more disk I/O to fetch record data that you will never use
You have more network activity
Of course, if you want to DO something with the row if it exists, then it is definitely faster/best to simply try and fetch the row to begin with!
If all you are doing is checking for existence, then
Select count(*) ...
But if you will retrieve the data anyway when it exists, then just get the data and check it in PHP; otherwise you'll make two calls.
For me, it's better in the database.
Doing a COUNT(1) is faster than $result->num_rows, because with $result->num_rows you perform two operations: a SELECT and then a count. If the SELECT itself does the counting, you get the result faster.
Except if you also want the data from the DB.
If you want raw speed, benchmark! In addition to the methods others have suggested:
SELECT 1 FROM table_name WHERE ... LIMIT 1
may be faster due to avoiding the subselect. Benchmark it.
SELECT COUNT(*) FROM table
is the best choice; this operation is extremely fast on both small tables and large tables. While it's possible that
SELECT id FROM table
is faster on small tables, the difference in speed will be microscopic. But if you have a large table, that operation can be very slow.
Therefore, your best bet is to always COUNT(*) the table (and it's faster to do * than it is to pick a specific column), as overall it will be the fastest operation.
I would definitely do it in PHP to decrease load on the database.
In order to get a count and also get the returned rows in SQL, you would have to do two queries: a COUNT and then a SELECT.
The PHP way gives you everything you need in one result object.