Related rows ordering when using JOIN without ORDER BY - sql

Let's say we have two tables:
user:
id,name
1,bob
2,alice
user_group:
id,user_id,group
1,1,g1
2,1,g2
3,2,g2
4,2,g3
We have no guarantee that SELECT * FROM user without ORDER BY returns its result set in the same order on each execution. But what about related rows in joins?
For example,
SELECT user.name, user_group.group FROM user INNER JOIN user_group ON (user.id = user_group.user_id);
Will the related (joined) rows be adjacent in the result set (take PostgreSQL, for example)? By that I mean:
bob,g1
bob,g2
alice,g2
alice,g3
OR
alice,g3
alice,g2
bob,g2
bob,g1
and NOT this:
bob,g1
alice,g2
bob,g2
alice,g3
The order of users doesn't matter, and neither does the order of groups within each user.

It is a fundamental rule in SQL that you can never rely on the ordering of a result set unless you add an ORDER BY. If you have no ORDER BY, the ordering of the result set can, among others, depend on
the order in which PostgreSQL reads the individual tables – it could be in index order or in sequential order, and even with a sequential scan you don't always get the same order (unless you disable synchronize_seqscans)
the join strategy chosen (nested loop, hash join or merge join)
the number of rows returned by the query (if you use a cursor, PostgreSQL optimizes the query so that the first rows can be returned quickly)
That said, with your specific example and PostgreSQL as the database, I think that none of the join strategies will return the result set in the order you describe as undesirable. But I wouldn't rely on that: the optimizer often finds a surprising way to process a query.
The desire to save yourself an ORDER BY often comes from a wish to optimize processing speed. But correctness is more important than speed, and PostgreSQL can often find a way to return the result in the desired order without having to sort explicitly.
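As a minimal sketch of the safe version of the query from the question, the snippet below uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made only to keep the example self-contained). The table contents are the ones from the question; ordering by the user id is enough to keep each user's rows adjacent, while the order of groups within a user is still left open.

```python
import sqlite3

def joined_rows():
    # In-memory SQLite database standing in for PostgreSQL (assumption).
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE "user" (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE user_group (id INTEGER PRIMARY KEY, user_id INTEGER, "group" TEXT);
        INSERT INTO "user" VALUES (1, 'bob'), (2, 'alice');
        INSERT INTO user_group VALUES (1, 1, 'g1'), (2, 1, 'g2'), (3, 2, 'g2'), (4, 2, 'g3');
    """)
    # ORDER BY u.id makes adjacency of each user's rows a guarantee
    # instead of a side effect of the chosen join strategy.
    rows = conn.execute("""
        SELECT u.name, ug."group"
        FROM "user" AS u
        INNER JOIN user_group AS ug ON u.id = ug.user_id
        ORDER BY u.id
    """).fetchall()
    conn.close()
    return rows

rows = joined_rows()
print(rows)
```

If the order of groups within each user ever starts to matter, adding a second sort key (for example ug.id) would pin that down as well.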

Related

Row Order in SQL

I wanted to know whether the row order returned by a query matters.
I'm not using a SQL service yet, just working with plain tables and Excel.
For example, if I do a left join on two tables, my take is that all the rows from the left (first mentioned) table will appear in the resulting table in their original order, whether or not there are matches in the right one. But a classmate ordered the results so that the rows with matches came first and the ones without, with null values, came at the end.
SQL tables represent unordered sets. SQL result sets are unordered unless you explicitly have an ORDER BY for the outermost SELECT.
This is always true and is a fundamental part of the language. Your class should have covered this on day 1.
The results from a query without an ORDER BY may look like they are in a particular order. However, you should not depend on that -- or, you depend on that at your peril. The rule is simple: without an ORDER BY, you do not know the ordering of the result set.
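To make the classmate's ordering an explicit request rather than an accident, one could sort on whether the right-hand side matched. The sketch below is only an illustration: the students and grades tables and their contents are hypothetical, and SQLite (via Python's sqlite3 module) stands in for whatever SQL engine the class ends up using.

```python
import sqlite3

def left_join_matches_first():
    conn = sqlite3.connect(":memory:")
    # Hypothetical tables: student 2 ('ben') has no matching grade row.
    conn.executescript("""
        CREATE TABLE students (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE grades (student_id INTEGER, grade TEXT);
        INSERT INTO students VALUES (1, 'ann'), (2, 'ben'), (3, 'cat');
        INSERT INTO grades VALUES (3, 'A'), (1, 'B');
    """)
    # First sort key: 0 for matched rows, 1 for unmatched (NULL) rows.
    # Second sort key: the left table's id, to fix the order inside each block.
    rows = conn.execute("""
        SELECT s.name, g.grade
        FROM students AS s
        LEFT JOIN grades AS g ON g.student_id = s.id
        ORDER BY CASE WHEN g.grade IS NULL THEN 1 ELSE 0 END, s.id
    """).fetchall()
    conn.close()
    return rows

rows = left_join_matches_first()
print(rows)
```

Dropping the CASE expression and keeping only s.id would give the other ordering described in the question (left-table order, matched or not); either way the order comes from the ORDER BY, not from the join.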

Why does the query optimizer use sort after merge join?

Consider this query:
select
map,line,pda,item,qty,qty_gift,pricelist,price,linevalue,vat,
vat_value,disc_perc,disc_value,dt_disc_value,netvalue,imp_qty,
imp_value,exp_qty,exp_value,price1,price2,price3,price4,
justification,notes
from appnameV2_Developer.dbo.pt
where exists (select 1 from [dbo].[dt] dt
where pt.map=dt.map and dt.pda=pt.pda and dt.canceled=0)
except
select
map,line,pda,item,qty,qty_gift,pricelist,price,linevalue,vat,
vat_value,disc_perc,disc_value,dt_disc_value,netvalue,imp_qty,
imp_value,exp_qty,exp_value,price1,price2,price3,price4,
justification,notes
from appnameV2_Developer_reporting.dbo.pt
I made this to make sure there is no data difference in the same table (pt) between a replication publisher database(appnameV2_Developer) and its subscriber database(appnameV2_Developer_reporting). The specific replication article has a semijoin on dt.
dt is a transaction header table with PK (map,pda)
pt is a transaction detail table with PK (map,pda,line)
Here's the execution plan
So, we have a Right Semi Join merge join. I would expect its result to be ordered by (map,pda,line). But then, a sort operator on (map,pda,line) is called.
Why does this sort occur (or, more accurately: why is the data not already sorted by that point)? Is the query optimizer lacking the logic of "when merge joining then its output is (still) sorted on the join predicates"? Am I missing something?
Because it decided to use a "Merge Join" to execute the EXCEPT clause. In order to perform a Merge Join both datasets must have the same ordering.
The thing is, the inner Merge Join (before the EXCEPT) is based on the table dt, not on pt. Therefore, the resulting rows won't have the same ordering as the other side of the EXCEPT, that is based on pt.
Why does SQL Server do that? Not clear. I would have done it differently. Maybe the statistics are out of date. Maybe there are so few rows that the strategy does not matter much.
The results from the first merge will be sorted by map, pda, line. However, you yourself mentioned join predicates, and the join predicates for this first merge are only based on map, pda (they're the predicates from inside the exists clause, except the canceled one, which has been pushed down to the index scan). All that that first merge required was input sorted by map and pda, and so that's the only sort order guaranteed on that data, so far as the rest of the query is concerned.
But as we know, the outputs from this first merge were actually derived from input that was additionally sorted by line. It appears the optimizer isn't currently able to spot this circumstance. It may be that the order of optimizations means that it's unlikely ever to recognise this situation. So currently, it introduces the extra sort.
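The point about what a merge join guarantees can be illustrated with a toy model in Python. This is a hand-written sketch, not SQL Server's actual implementation: merge_semi_join and the sample rows are hypothetical, with detail rows shaped as (map, pda, line) and header keys as (map, pda). The join only needs both inputs sorted on (map, pda), so it works equally well on an input whose line values are shuffled, and its output is then not sorted on line.

```python
def merge_semi_join(detail_rows, header_keys):
    """Keep detail rows whose (map, pda) prefix occurs in header_keys.

    Both inputs must be sorted on (map, pda); nothing is assumed about line.
    """
    out = []
    keys = iter(header_keys)
    current = next(keys, None)
    for row in detail_rows:
        key = row[:2]
        # Advance the header side until it catches up with the detail key.
        while current is not None and current < key:
            current = next(keys, None)
        if current == key:
            out.append(row)
    return out

dt_keys = [(1, 1), (2, 1)]
# Sorted on (map, pda, line), as an index scan on the primary key would deliver.
pt_fully_sorted = [(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1), (2, 1, 2)]
# Sorted on (map, pda) only: still a valid merge input.
pt_prefix_sorted = [(1, 1, 2), (1, 1, 1), (1, 2, 1), (2, 1, 2), (2, 1, 1)]

full = merge_semi_join(pt_fully_sorted, dt_keys)
partial = merge_semi_join(pt_prefix_sorted, dt_keys)
print(full)
print(partial)
```

Both calls are legitimate uses of the join, but only the first output happens to be ordered on all three columns. A planner that reasons only from the join's requirements therefore cannot assume the stronger ordering, which matches the extra sort seen in the plan.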

Splitting large table into 2 dataframes via JDBC connection in RStudio

Through R I connect to a remotely held database. The issue I have is my hardware isn't so great and the dataset contains tens of millions of rows with about 10 columns per table. When I run the below code, at the df step, I get a "Not enough RAM" error from R:
library(DatabaseConnector)
conn <- connect(connectionDetails)
df <- querySql(conn,"SELECT * FROM Table1")
What I thought about doing was splitting the tables into two parts and then filtering/analysing/combining as needed going forward. I think because I use the conn JDBC connection I have to use SQL syntax to make it work. With SQL, I start with the below code:
df <- querySql(conn,"SELECT TOP 5000000 * FROM Table1")
And then where I get stuck is how do I create a second dataframe starting with n - 5000000 rows and ending at the final row, retrieved from Table1.
I'm open to suggestions but I think there are two potential answers to this question. The first is to work within the querySql to get it working. The second is to use an R function other than querySql (no idea what this would look like). I'm limited to R due to work environment.
The SQL statement
SELECT TOP 5000000 * from Table1
is not doing what you think it's doing.
Relational tables are conceptually unordered.
A relation is defined as a set of n-tuples. In both mathematics and the relational database model, a set is an unordered collection of unique, non-duplicated items, although some DBMSs impose an order to their data.
Selecting from a table produces a result-set. Result-sets are also conceptually unordered unless and until you explicitly specify an order for them, which is generally done using an order by clause.
When you use a top (or limit, depending on the DBMS) clause to reduce the number of records to be returned by a query (let's call these the "returned records") below the number of records that could be returned by that query (let's call these the "selected records") and if you have not specified an order by clause, then it is conceptually unpredictable and random which of the selected records will be chosen as the returned records.
Since you have not specified an order by clause in your query, you are effectively getting 5,000,000 unpredictable and random records from your table. Every single time you run the query you might get a different set of 5,000,000 records (conceptually, at least).
Therefore, it doesn't make sense to ask about how to get a second result-set "starting with n - 5000000 and ending at the final row". There is no n, and there is no final row. The choice of returned records was not deterministic, and the DBMS does not remember such choices of past queries. The only conceivable way such information could be incorporated into a subsequent query would be to explicitly include it in the SQL, such as by using a not in condition on an id column and embedding id values from the first query as literals, or doing some kind of negative join, again, involving the embedding of id values as literals. But obviously that's unreasonable.
There are two possible solutions here.
1: order by with limit and offset
Take a look at the PostgreSQL documentation on limit and offset. First, just to reinforce the point about lack of order, take note of the following paragraphs:
When using LIMIT, it is important to use an ORDER BY clause that constrains the result rows into a unique order. Otherwise you will get an unpredictable subset of the query's rows. You might be asking for the tenth through twentieth rows, but tenth through twentieth in what ordering? The ordering is unknown, unless you specified ORDER BY.
The query optimizer takes LIMIT into account when generating query plans, so you are very likely to get different plans (yielding different row orders) depending on what you give for LIMIT and OFFSET. Thus, using different LIMIT/OFFSET values to select different subsets of a query result will give inconsistent results unless you enforce a predictable result ordering with ORDER BY. This is not a bug; it is an inherent consequence of the fact that SQL does not promise to deliver the results of a query in any particular order unless ORDER BY is used to constrain the order.
Now, this solution requires that you specify an order by clause that fully orders the result-set. An order by clause that only partially orders the result-set will not be enough, since it will still leave room for some unpredictability and randomness.
Once you have the order by clause, you can then repeat the query with the same limit value and increasing offset values.
Something like this:
select * from table1 order by id1, id2, ... limit 5000000 offset 0;
select * from table1 order by id1, id2, ... limit 5000000 offset 5000000;
select * from table1 order by id1, id2, ... limit 5000000 offset 10000000;
...
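A minimal sketch of that paging loop, written in Python with the built-in sqlite3 module rather than R with DatabaseConnector (an assumption made so the example is self-contained and runnable). The table1 layout, the payload column and the fetch_in_chunks helper are hypothetical; the important part is that every query uses the same ORDER BY on a unique key.

```python
import sqlite3

def fetch_in_chunks(conn, chunk_size):
    """Yield successive chunks of table1, ordered by its unique id."""
    offset = 0
    while True:
        chunk = conn.execute(
            "SELECT id, payload FROM table1 ORDER BY id LIMIT ? OFFSET ?",
            (chunk_size, offset),
        ).fetchall()
        if not chunk:
            break
        yield chunk
        offset += chunk_size

# Tiny stand-in for the real table: 10 rows instead of tens of millions.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, payload TEXT)")
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(i, f"row{i}") for i in range(1, 11)],
)

chunks = list(fetch_in_chunks(conn, 4))
print([len(c) for c in chunks])
```

Each chunk can be processed and discarded before the next one is fetched, which is what keeps memory use bounded. Note that the chunks only line up if the table is not modified between queries.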
2: synthesize a numbering column and filter on it
It is possible to add a column to the select clause which will provide a full order for the result-set. By wrapping this SQL in a subquery, you can then filter on the new column and thereby achieve your own pagination of the data. In fact, this solution is potentially slightly more powerful, since you could theoretically select discontinuous subsets of records, although I've never seen anyone actually do that.
To compute the ordering column, you can use the row_number() window function.
Importantly, you will still have to specify id columns by which to order the window. This is unavoidable under any conceivable solution; there always must be some deterministic, predictable record order to guide stateless paging through data.
Something like this:
select * from (select *, row_number() over (order by id1, id2, ...) rn from table1) t1 where rn>0 and rn<=5000000;
select * from (select *, row_number() over (order by id1, id2, ...) rn from table1) t1 where rn>5000000 and rn<=10000000;
select * from (select *, row_number() over (order by id1, id2, ...) rn from table1) t1 where rn>10000000 and rn<=15000000;
...
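The numbering approach can be sketched the same way. The example below again uses Python's sqlite3 module as a stand-in (an assumption; it needs an SQLite build with window-function support, 3.25 or later), and table1, payload and fetch_range are hypothetical names. The ids are deliberately not consecutive, to show that the filter works on the synthesized rn column and not on the id itself.

```python
import sqlite3

def fetch_range(conn, lo, hi):
    """Return rows whose synthesized row number rn satisfies lo < rn <= hi."""
    return conn.execute(
        """
        SELECT id, payload FROM (
            SELECT id, payload, row_number() OVER (ORDER BY id) AS rn
            FROM table1
        ) AS t1
        WHERE rn > ? AND rn <= ?
        ORDER BY rn
        """,
        (lo, hi),
    ).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE table1 (id INTEGER PRIMARY KEY, payload TEXT)")
# Ids 10, 20, ..., 100: unique but with gaps.
conn.executemany(
    "INSERT INTO table1 VALUES (?, ?)",
    [(i * 10, f"row{i}") for i in range(1, 11)],
)

first = fetch_range(conn, 0, 4)
second = fetch_range(conn, 4, 8)
third = fetch_range(conn, 8, 12)
print(first, second, third)
```

The outer ORDER BY rn is there because the window's ORDER BY only determines the numbering; it does not by itself promise the order in which the outer query returns the rows.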
Obviously, this solution is more complicated and verbose than the previous one. And the previous solution might allow for performance optimizations not possible under the more manual approach of numbering and filtering. Hence I would recommend the previous solution.
My above discussion focuses on PostgreSQL, but other DBMSs should provide equivalent features. For example, for SQL Server, see Equivalent of LIMIT and OFFSET for SQL Server?, which shows an example of the synthetic numbering solution, and also indicates that (at least as of SQL Server 2012) you can use OFFSET {offset} ROWS and FETCH NEXT {limit} ROWS ONLY to achieve limit/offset functionality.

hsqldb: do I have to ORDER BY to ensure consistent selection order?

I have a table containing samples. The inserted samples are already naturally ordered by the timestamp.
My question is this - when I SELECT from the table do I have to use the ORDER BY clause to ensure the fetched samples are ordered by the timestamp?
Rows in a relational database are NOT sorted. (Picture them as balls in a basket: which one is the "first"?)
The only way (really, the only) to get a consistently sorted result is to use ORDER BY.
You cannot rely on side effects of joins, GROUP BY, UNION, index retrieval or similar operators. They will never guarantee an order. The DBMS is free to return the rows in whatever order it thinks is the fastest unless you specify an ORDER BY.
If an HSQLDB table T has a column C as primary key, or has any index on that column,
SELECT * FROM T ORDER BY C
will return ordered rows without extra ORDER BY processing.
If there is a condition on the select, which uses an index on a different column, you can still force the use of the index for ORDER BY:
SELECT * FROM T WHERE <some condition> ORDER BY C USING INDEX
But in this case, you should only use USING INDEX if most of the rows of the table will be returned. Otherwise it is better to let the engine use the other index to reduce the table scan time.
USING INDEX is ignored if there is no index to use for ORDER BY.

For SQL select returning more than 1 value, how are they sorted when Id is GUID?

I'm wondering how SQL Server orders data that is returned from a query and the Id columns of the respective tables are all of type uniqueidentifier.
I'm using NHibernate GuidComb when creating all of the GUIDs and do things like:
Sheet sheet = sheetRepository.Get(_SheetGuid_); // has many lines items
IList<SheetLineItem> lineItems = sheet.LineItems;
I'm just trying to figure out how they'll be ordered when I do something like:
foreach (SheetLineItem lineItem in lineItems)
I can't seem to find a good article on the way GUIDs are compared by SQL Server when being ordered, if that's what's happening.
GUIDs are sorted this way by the ORDER BY. Quoting the article...
0..3 are evaluated in left to right order and are the less important, then
4..5 are evaluated in left to right order, then
6..7 are evaluated in left to right order, then
8..9 are evaluated in right to left order, then
A..F are evaluated in right to left order and are the most important
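The byte groups listed above can be turned into a sort key for experimenting outside the database. The sketch below rests on an assumption that should be verified against a real server: it uses the byte comparison order of .NET's System.Data.SqlTypes.SqlGuid (bytes 10 to 15 first, then 8 to 9, 6 to 7, 4 to 5, and 0 to 3 last), applied to the mixed-endian byte layout that Python exposes as UUID.bytes_le. The names SQL_SERVER_BYTE_ORDER and sql_server_sort_key are hypothetical.

```python
import uuid

# Assumed comparison order (most significant byte first), taken from the
# behaviour of .NET's SqlGuid; verify against your SQL Server before relying on it.
SQL_SERVER_BYTE_ORDER = (10, 11, 12, 13, 14, 15, 8, 9, 6, 7, 4, 5, 0, 1, 2, 3)

def sql_server_sort_key(value):
    """Build a tuple that sorts GUID strings the way ORDER BY is assumed to."""
    raw = uuid.UUID(value).bytes_le  # same layout as .NET Guid.ToByteArray()
    return tuple(raw[i] for i in SQL_SERVER_BYTE_ORDER)

guids = [
    "00000000-0000-0000-0000-000000000001",
    "01000000-0000-0000-0000-000000000000",
    "00000000-0000-0000-0100-000000000000",
]

ordered = sorted(guids, key=sql_server_sort_key)
for g in ordered:
    print(g)
```

Under this assumption the GUID with a non-zero byte in the last group sorts last, even though it would come first in a plain string comparison. That is also the reason GuidComb places its time-based component in the final six bytes.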
Unless you include an ORDER BY clause, SQL Server doesn't guarantee any order on the results. It may seem to come back in the same order consistently (e.g. clustered index order), but you can't be sure this will always be the case (e.g. if the query is split and executed on multiple threads, the order may differ on each execution when the results are combined, since the threads may complete in different orders).
The only way to get a particular order is to use an ORDER BY clause. In NHibernate, this would be achieved by specifying an order-by="..." on your bags (or equivalent) in your mapping files.
See the NHibernate docs for more info on "order-by": http://nhibernate.info/doc/nh/en/index.html