Strange postgres SELECT ordering issue - sql

I have this postgres sql query:
select * from stats where athlete_id = 5
That will return, as you'd imagine, all the stats for each athlete of id 5
So far so normal.
However I just noticed that for athlete_id above 110, it returns the stats rows in the opposite order.
So for athlete_id = 110 the 'id' column comes out like this which is what i'm used to:
2325
2401
2482
2537
2592
2647
...
etc. Each stats table row in order by id.
Then if you select 111, it comes out like this:
5652
5610
5569
5528
5487
5437
5387
5336
...
How is that even possible? This is all doing the queries inside the pgadmin interface. There's no additional where clauses, only the one I've stated, i.e. WHERE athlete_id = 111
What? I've been making all sorts of code changes but what on earth would cause this within pgaadmin /pgsql ?
What happened between 110 & 111? Lots and lots of refactoring and code changes inside rails, but no actual direct SQL manipulation. Yes I realise the answer might be in there, but I don't know how to look, as this is not in the rails app - this is pure SQL via pgadmin, so something must have been done in postgres - I need to understand what that might be or I'm at a loss to debug.
Any ideas? In rails, things like:
Athlete.stats.last.score return the 'first' row instead of the 'last' row, completely screwing up the application, but only for athletes with an id above 110!
Totally confused!

If you don't specify an ORDER BY clause, there is no guarantee as to what order the rows will be returned in. They could be ascending order, descending order, an order that has no apparent logic to it, and potentially even a different order each time you query (though that is unlikely to happen in practice).
The specific order you get back is due to implementation details of the steps in the query plan. When you change the parameters it can cause PostgreSQL to prefer one query plan over another. For example, it may decide to use a table scan for one query, but an index for the other. It's hard to know exactly the reason in your specific case, but looking at the results of EXPLAIN might give some hints.
To fix your query to return the rows in the order you expect, add an ORDER BY clause:
SELECT *
FROM stats
WHERE athlete_id = 5
ORDER BY id

Unless you specify an ORDER BY clause on your query, you cannot assume/expect that SQL will return the data to you in any certain order.

Related

How does Oracle SQL order rows when there is no ORDER BY

I have currently written SQL to bring back data in a select statement without using ORDER BY. From what I have read, the selection seems to be random? As I am trying to run tests on data that should be updated once selected, is there a way to 'predict' which entries will be returned by the select statement (and therefore should be modified)? My select statement is:
SELECT x.FIELD_A,
x.FIELD_B,
x.FIELD_C,
y.FIELD_D,
y.FIELD_E,
y.FIELD_F
FROM TABLE_X x
LEFT JOIN TABLE_Y Y ON x.FIELD_A = y.FIELD_G
WHERE y.FIELD_G in ('xxx', 'yyy', 'zzz', 'www')
AND y.FIELD_E = 'vvvv'
AND y.FIELD_G IS NULL
AND y.FIELD_H IS NULL
AND y.FIED_I IS NULL
AND x.FIELD_J IN ('uuu', 'ttt', 'sss', 'rrrr')
AND y.FIELD_K_DATE > ADD_MONTHS(SYSDATE, -12)
AND ROWNUM <= 20
How does Oracle SQL order rows when there is no ORDER BY?
It doesn't order them at all.
From a classic Tom Kyte article, and quoting his own book:
You should think of a heap organized table as a big unordered collection of rows. These rows will come out in a seemingly random order, and depending on other options being used (parallel query, different optimizer modes and so on), they may come out in a different order with the same query. Do not ever count on the order of rows from a query unless you have an ORDER BY statement on your query!
If you run your query several times today you may see the same results in the same order, but might not. You could run it a hundred times and perhaps get the same result each time, and think that means that's what it will always return. But if you run it again tomorrow it may or may not be the same. If you run it weeks or months or years in the future it still could be the same as you see today, but is increasingly likely not to be as other things change - including data volumes, statistics gathering, database upgrades or even potentially patching, and so on.
Without an order-by clause you are at the mercy of the optimiser and things even that can't control, potentially even down to the order blocks of data are read from disk.
So no, you can't predict which entries will be returned by your query. You will get 20 rows but which ones you get and what order those are in is non-deterministic.
If you need a predictable result you need to order the results before apply the rownum filter (or using fetch first).

SQLite: How to SELECT "most recent record for each user" from single table with composite key?

I'm not a database guru and feel like I'm missing some core SQL knowledge to grok a solution to this problem. Here's the situation as briefly as I can explain it.
Context:
I have a SQLite database table that contains timestamped user event records. The records can be uniquely identified by the combination of timestamp and user ID (i.e., when the event took place and who the event is about). I understand this situation is called a "composite primary key." The table looks something like this (with a bunch of other columns removed, of course):
sqlite> select Last_Updated,User_ID from records limit 4;
Last_Updated User_ID
------------- --------
1434003858430 1
1433882146115 3
1433882837088 3
1433964103500 2
Question: How do I SELECT a result set containing only the most recent record for each user?
Given the above example, what I'd like to get back is a table that looks like this:
Last_Updated User_ID
------------- --------
1434003858430 1
1433882837088 3
1433964103500 2
(Note that the result set only includes user 3's most recent record.)
In reality, I have approximately 2.5 million rows in this table.
Bonus: I've been reading answers about JOINs, de-dupe procedures, and a bunch more, and I've been googling for tutorials/articles in the hopes that I would find what I'm missing. I have extensive programming background so I could de-dupe this dataset in procedural code like I've done a hundred times before, but I'm tired of writing scripts to do what I believe should be possible in SQL. That's what it's for, right?
So, what do you think is missing from my understand of SQL, conceptually, that I need in order to understand why the solution you've provided to my question actually works? (A reference to a good article that actually explains the theory behind the practice would suffice.) I want to know WHY the solution actually works, not just that it does.
Many thanks for your time!
You could try this:
select user_id, max(last_updated) as latest
from records
group by user_id
This should give you the latest record per user. I assume you have an index on user_id and last_updated combined.
In the above query, generally speaking - we are asking the database to group user_id records. If there are more than 1 records for user_id 1, they will all be grouped together. From that recordset, maximum last_updated will be picked for output. Then the next group is sought and the same operation is applied there.
If you have a composite index, sqlite will likely just use the index because the index contains both fields addressed in the query. Indexes are smaller than the table itself, so scanning or seeking is faster.
Well, in true "d'oh!" fashion, right after I ask this question, I find the answer.
For my case, the answer is:
SELECT MAX(Last_Updated),User_ID FROM records GROUP BY User_ID
I was making this more complicated than it needed to be by thinking I needed to use JOINs and stuff. Applying an aggregate function like MAX() is all that's needed to select only those rows whose content matches the function result. That means this statement…
SELECT MAX(Last_Updated),User_ID FROM records
…would therefor return a result set containing only 1 row, the most recent event.
By adding the GROUP BY clause, however, the result set contains a row for each "group" of results, i.e., for each user. My programmer-brain did not understand that GROUP BY is how we say "for each" in SQL. I think I get it now.
Note to self: keep it simple, stupid. :)

Using limit in sqlite SQL statement in combination with order by clause

Will the following two SQL statements always produce the same result set?
1. SELECT * FROM MyTable where Status='0' order by StartTime asc limit 10
2. SELECT * FROM (SELECT * FROM MyTable where Status='0' order by StartTime asc) limit 10
Yes, but ordering subqueries is probably a bad habit to get into. You could feasibly add a further ORDER BY outside the subquery in your second example, e.g.
SELECT *
FROM (SELECT *
FROM Test
ORDER BY ID ASC
) AS A
ORDER BY ID DESC
LIMIT 10;
SQLite still performs the ORDER BY on the inner query, before sorting them again in the outer query. A needless waste of resources.
I've done an SQL Fiddle to demonstrate so you can view the execution plans for each.
No. First because the StartTime column may not have UNIQUE constraint. So, even the first query may not always produce the same result - with itself!
Second, even if there are never two rows with same StartTime, the answer is still negative.
The first statement will always order on StartTime and produce the first 10 rows. The second query may produce the same result set but only with a primitive optimizer that doesn't understand that the ORDER BY in the subquery is redundant. And only if the execution plan includes this ordering phase.
The SQLite query optimizer may (at the moment) not be very bright and do just that (no idea really, we'll have to check the source code of SQLite*). So, it may appear that the two queries are producing identical results all the time. Still, it's not a good idea to count on it. You never know what changes will be made in a future version of SQLite.
I think it's not good practice to use LIMIT without ORDER BY, in any DBMS. It may work now, but you never know how long these queries will be used by the application. And you may not be around when SQLite is upgraded or the DBMS is changed.
(*) #Gareth's link provides the execution plan which suggests that current SQLite code is dumb enough to execute the redundant ordering.

Query Optimisation (ORDER BY)

I have an SQL query similar to below:
SELECT NAME,
MY_FUNCTION(NAME) -- carries out some string manipulation
FROM TITLES
ORDER BY NAME; -- has an index.
The TITLES table has approximately 12,000 records. At the moment the query takes over 5 minutes to execute but if I remove the ORDER BY clause then it executes within a couple of seconds.
Does anyone have any suggestions on how to be speed up this query.
If MY_FUNCTION is deterministic (i.e. always returns the same result for the same input value) then you could create an index on (NAME, MY_FUNCTION(NAME)) and it may help (or may not!)
In comments under the question, you say that it takes 2 seconds "to return N rows without the ORDER BY". That makes sense: without the ORDER BY you will just get the first N rows encountered, as soon as they are encountered. With the ORDER BY, the first N rows are returned only after the results have been sorted into the correct order.
If the query is being used in a situation where getting the first N rows fast is important (e.g. an online report with pagination) then you could try adding a FIRST_ROWS or FIRST_ROWS_n hint to the query, to try to persuade it to use the index. See Choosing an Optimizer Goal
Use the EXPLAIN statement to see where the issue is
EXPLAIN SELECT NAME, MY_FUNCTION(NAME) FROM TITLES ORDER BY NAME;
Sounds weird. What's name column type?
Have you checked for defective hardware errors? Maybe (just maybe) your query with the order by clause is using your index, and your index is located in a defective disk (it could be in a different disk from the table if they are located in different tablespaces).

SQL Server UNION - What is the default ORDER BY Behaviour

If I have a few UNION Statements as a contrived example:
SELECT * FROM xxx WHERE z = 1
UNION
SELECT * FROM xxx WHERE z = 2
UNION
SELECT * FROM xxx WHERE z = 3
What is the default order by behaviour?
The test data I'm seeing essentially does not return the data in the order that is specified above. I.e. the data is ordered, but I wanted to know what are the rules of precedence on this.
Another thing is that in this case xxx is a View. The view joins 3 different tables together to return the results I want.
There is no default order.
Without an Order By clause the order returned is undefined. That means SQL Server can bring them back in any order it likes.
EDIT:
Based on what I have seen, without an Order By, the order that the results come back in depends on the query plan. So if there is an index that it is using, the result may come back in that order but again there is no guarantee.
In regards to adding an ORDER BY clause:
This is probably elementary to most here but I thought I add this.
Sometimes you don't want the results mixed, so you want the first query's results then the second and so on. To do that I just add a dummy first column and order by that. Because of possible issues with forgetting to alias a column in unions, I usually use ordinals in the order by clause, not column names.
For example:
SELECT 1, * FROM xxx WHERE z = 'abc'
UNION ALL
SELECT 2, * FROM xxx WHERE z = 'def'
UNION ALL
SELECT 3, * FROM xxx WHERE z = 'ghi'
ORDER BY 1
The dummy ordinal column is also useful for times when I'm going to run two queries and I know only one is going to return any results. Then I can just check the ordinal of the returned results. This saves me from having to do multiple database calls and most empty resultset checking.
Just found the actual answer.
Because UNION removes duplicates it does a DISTINCT SORT. This is done before all the UNION statements are concatenated (check out the execution plan).
To stop a sort, do a UNION ALL and this will also not remove duplicates.
If you care what order the records are returned, you MUST use an order by.
If you leave it out, it may appear organized (based on the indexes chosen by the query plan), but the results you see today may NOT be the results you expect, and it could even change when the same query is run tomorrow.
Edit: Some good, specific examples: (all examples are MS SQL server)
Dave Pinal's blog describes how two very similar queries can show a different apparent order, because different indexes are used:
SELECT ContactID FROM Person.Contact
SELECT * FROM Person.Contact
Conor Cunningham shows how the apparent order can change when the table gets larger (if the query optimizer decides to use a parallel execution plan).
Hugo Kornelis proves that the apparent order is not always based on primary key. Here is his follow-up post with explanation.
A UNION can be deceptive with respect to result set ordering because a database will sometimes use a sort method to provide the DISTINCT that is implicit in UNION , which makes it look like the rows are deliberately ordered -- this doesn't apply to UNION ALL for which there is no implicit distinct, of course.
However there are algorithms for the implicit distinct, such as Oracle's hash method in 10g+, for which no ordering will be applied.
As DJ says, always use an ORDER BY
It's very common to come across poorly written code that assumes table data is returned in insert order, and 95% of the time the coder gets away with it and is never aware that this is a problem as on many common databases (MSSQL, Oracle, MySQL). It is of course a complete fallacy and should always be corrected when it's come across, and always, without exception, use an Order By clause yourself.