Row Order in SQL

I wanted to know whether the row order returned by a query matters.
I'm not using a SQL service yet, just working with plain tables and Excel.
For example, if I do a left join on two tables, my take is that all the rows from the left (first) table will come first in my resulting table, whether or not they have matches in the right one. But a classmate ordered the results so that the rows with matches came first and the ones without, with null values, came last.

SQL tables represent unordered sets. SQL result sets are unordered unless you explicitly have an ORDER BY for the outermost SELECT.
This is always true and is a fundamental part of the language. Your class should have covered this on day 1.
The results from a query without an ORDER BY may look like they are in a particular order. However, you should not depend on that -- or, you depend on that at your peril. The rule is simple: without an ORDER BY, you do not know the ordering of the result set.
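For example, if you want the unmatched rows of a left join to come last, you have to ask for it. A minimal sketch (the tables a and b and their columns are made up for illustration, not taken from the question):

SELECT a.id, a.name, b.value
FROM a
LEFT JOIN b ON b.a_id = a.id
ORDER BY CASE WHEN b.a_id IS NULL THEN 1 ELSE 0 END, a.id;

Without that ORDER BY, the engine is free to interleave matched and unmatched rows however the chosen join strategy happens to produce them.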

Related rows ordering when using JOIN without ORDER BY

Let's say we have two tables:
user:
id,name
1,bob
2,alice
user_group:
id,user_id,group
1,1,g1
2,1,g2
3,2,g2
4,2,g3
We have no guarantee that each execution of SELECT * FROM user without ORDER BY will return the result set in the same order. But what about related rows in joins?
For example,
SELECT user.name, user_group.group FROM user INNER JOIN user_group ON (user.id = user_group.user_id);
Will the related (joined) rows be adjacent in the result set (take PostgreSQL, for example)? By that I mean:
bob,g1
bob,g2
alice,g2
alice,g3
OR
alice,g3
alice,g2
bob,g2
bob,g1
and NOT this:
bob,g1
alice,g2
bob,g2
alice,g3
The order of users doesn't matter, and neither does the order of groups within each user.
It is a fundamental rule in SQL that you can never rely on the ordering of a result set unless you add an ORDER BY. If you have no ORDER BY, the ordering of the result set can, among other things, depend on:
the order in which PostgreSQL reads the individual tables – it could be in index order or in sequential order, and even with a sequential scan you don't always get the same order (unless you disable synchronize_seqscans)
the join strategy chosen (nested loop, hash join or merge join)
the number of rows returned by the query (if you use a cursor, PostgreSQL optimizes the query so that the first rows can be returned quickly)
That said, with your specific example and PostgreSQL as the database, I think that none of the join strategies will return the result set in the order you describe as undesirable. But I wouldn't rely on that: often, the optimizer finds a surprising way to process a query.
The desire to save yourself an ORDER BY often comes from a wish to optimize processing speed. But correctness is more important than speed, and PostgreSQL can often find a way to return the result in the desired order without having to sort explicitly.
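If the grouping of related rows matters, the safe thing is to say so explicitly. A minimal sketch against the example tables above (note that user and group are reserved words in PostgreSQL, hence the double quotes):

SELECT "user".name, user_group."group"
FROM "user"
JOIN user_group ON "user".id = user_group.user_id
ORDER BY "user".id, user_group.id;

This keeps each user's groups adjacent no matter which join strategy the planner picks, and the sort is often cheap because an index can already provide the order.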

MS Access SQL - Removing Duplicates From Query

MS Access SQL - This is a generic performance-related duplicates question. So, I don't have a specific example query, but I believe I have explained the situation below clearly and simply in 3 statements.
I have a standard/complex SQL query that Selects many columns; some computed, some with an asterisk, and some by name - e.g. (tab1.*, (tab2.col1 & tab2.col2) as computedFld1, tab3.col4, etc).
This query Joins about 10 tables. And the Where clause is based on user specified filters that could be based on any of the fields present in all 10 tables.
Based on these filters, I can sometimes get records with the same tab4.ID value.
Question: What is the best way to eliminate duplicate result rows with the same tab4.ID value? I don't care which rows get eliminated. They will differ in non-important ways.
Or, if important, they will differ in that they will have different tab5.ID values; and I want to keep the result rows with the LARGEST tab5.ID values.
But if the first query performs better than the second, then I really don't care which rows get eliminated. The performance is more important.
I have worked on this most of the morning and I am afraid that the answer is above my pay grade. I have tried Group By tab4.ID, but then can't use "*" in the Select clause; and I have tried many other things, but I just keep bumping my head against the wall.
Access does not support CTEs but you can do something similar with saved queries.
So first alias the columns that have the same names in your query, something like:
SELECT tab4.ID AS tab4_id, tab5.ID AS tab5_id, ........
and then save your query for example as myquery.
Then you can use this saved query like this:
SELECT q1.*
FROM myquery AS q1
WHERE q1.tab5_id = (SELECT MAX(q2.tab5_id) FROM myquery AS q2 WHERE q2.tab4_id = q1.tab4_id)
This will return 1 row for each tab4_id if there are no duplicate tab5_ids for each tab4_id.
If there are duplicates then you must provide additional conditions.
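If ties are possible (two result rows sharing both tab4_id and the maximum tab5_id), one way to break them is a second correlated condition on some column that is unique per row. The alias row_key below is hypothetical, not taken from the question; substitute whatever unique column your saved query exposes:

SELECT q1.*
FROM myquery AS q1
WHERE q1.tab5_id = (SELECT MAX(q2.tab5_id) FROM myquery AS q2 WHERE q2.tab4_id = q1.tab4_id)
AND q1.row_key = (SELECT MIN(q3.row_key) FROM myquery AS q3 WHERE q3.tab4_id = q1.tab4_id AND q3.tab5_id = q1.tab5_id)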

Why does the query optimizer use sort after merge join?

Consider this query:
select
map,line,pda,item,qty,qty_gift,pricelist,price,linevalue,vat,
vat_value,disc_perc,disc_value,dt_disc_value,netvalue,imp_qty,
imp_value,exp_qty,exp_value,price1,price2,price3,price4,
justification,notes
from appnameV2_Developer.dbo.pt
where exists (select 1 from [dbo].[dt] dt
where pt.map=dt.map and dt.pda=pt.pda and dt.canceled=0)
except
select
map,line,pda,item,qty,qty_gift,pricelist,price,linevalue,vat,
vat_value,disc_perc,disc_value,dt_disc_value,netvalue,imp_qty,
imp_value,exp_qty,exp_value,price1,price2,price3,price4,
justification,notes
from appnameV2_Developer_reporting.dbo.pt
I made this to make sure there is no data difference in the same table (pt) between a replication publisher database (appnameV2_Developer) and its subscriber database (appnameV2_Developer_reporting). The specific replication article has a semijoin on dt.
dt is a transaction header table with PK (map,pda)
pt is a transaction detail table with PK (map,pda,line)
Here's the execution plan
So, we have a Right Semi Join merge join. I would expect its result to be ordered by (map,pda,line). But then, a sort operator on (map,pda,line) is called.
Why does this sort occur (or, more accurately: why is the data not already sorted by that point)? Is the query optimizer lacking the logic of "when merge joining then its output is (still) sorted on the join predicates"? Am I missing something?
Because it decided to use a "Merge Join" to execute the EXCEPT clause. In order to perform a Merge Join, both datasets must have the same ordering.
The thing is, the inner Merge Join (the one before the EXCEPT) is based on the table dt, not on pt. Therefore, the resulting rows won't have the same ordering as the other side of the EXCEPT, which is based on pt.
Why does SQL Server do that? Not clear. I would have done it differently. Maybe the stats are not up to date. Maybe there is a small number of rows, in which case the strategy doesn't matter much.
The results from the first merge will be sorted by map, pda, line. However, you yourself mentioned join predicates, and the join predicates for this first merge are only based on map, pda (they're the predicates from inside the EXISTS clause, except the canceled one, which has been pushed down to the index scan). All that first merge required was input sorted by map and pda, so that's the only sort order guaranteed on that data as far as the rest of the query is concerned.
But as we know, the output of this first merge was actually derived from input that was additionally sorted by line. It appears the optimizer isn't currently able to spot this circumstance. It may be that the order of optimizations means it's unlikely ever to recognise this situation. So currently, it introduces the extra sort.
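As an aside on the comparison itself: EXCEPT only reports rows that are present in the first input and missing from the second, so to prove the two tables hold identical data you would normally run it in both directions. Schematically (the column lists and the EXISTS predicate are elided here; they are the same as in the query above):

(SELECT ... FROM appnameV2_Developer.dbo.pt WHERE EXISTS (...)
 EXCEPT
 SELECT ... FROM appnameV2_Developer_reporting.dbo.pt)
UNION ALL
(SELECT ... FROM appnameV2_Developer_reporting.dbo.pt
 EXCEPT
 SELECT ... FROM appnameV2_Developer.dbo.pt WHERE EXISTS (...))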

Is the GROUP BY clause applied after the WHERE clause in Hive?

Suppose I have the following SQL:
select user_group, count(*)
from table
where user_group is not null
group by user_group
Suppose further that 99% of the data has null user_group.
Will this discard the rows with null before the GROUP BY, or will one poor reducer end up with 99% of the rows that are later discarded?
I hope it is the former. That would make more sense.
Bonus points if you say what will happen by Hive version. We are using 0.11 and migrating to 0.13.
Bonus points if you can point to any documentation that confirms.
Sequence
FROM & JOINs determine & filter rows
WHERE more filters on the rows
GROUP BY combines those rows into groups
HAVING filters groups
SELECT assembles the columns it needs
ORDER BY arranges the remaining rows/groups
The first step is always the FROM clause. In your case, this is pretty straight-forward, because there's only one table, and there aren't any complicated joins to worry about. In a query with joins, these are evaluated in this first step. The joins are assembled to decide which rows to retrieve, with the ON clause conditions being the criteria for deciding which rows to join from each table. The result of the FROM clause is an intermediate result. You could think of this as a temporary table, consisting of combined rows which satisfy all the join conditions. (In your case the temporary table isn't actually built, because the optimizer knows it can just access your table directly without joining to any others.)
The next step is the WHERE clause. In a query with a WHERE clause, each row in the intermediate result is evaluated according to the WHERE conditions, and either discarded or retained. So the NULL rows will be discarded before going to the GROUP BY clause.
Next comes the GROUP BY. If there's a GROUP BY clause, the intermediate result is now partitioned into groups, one group for every combination of values in the columns in the GROUP BY clause.
Now comes the HAVING clause. The HAVING clause operates once on each group, and all rows from groups which do not satisfy the HAVING clause are eliminated.
Next comes the SELECT. From the rows of the new intermediate result produced by the GROUP BY and HAVING clauses, the SELECT now assembles the columns it needs.
Finally, the last step is the ORDER BY clause.
This query discards the rows with NULL before the GROUP BY operation.
Hope this link will be useful:
http://dev.hortonworks.com.s3.amazonaws.com/HDPDocuments/HDP2/HDP-2.2.0/bk_dataintegration/content/hive-013-feature-subqueries-in-where-clauses.html
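One way to check this on your own cluster is to look at the plan:

EXPLAIN
select user_group, count(*)
from table
where user_group is not null
group by user_group

If the Filter Operator appears inside the map-side TableScan stage, ahead of the Group By Operator, then the NULL rows are dropped by the mappers and never reach a reducer.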

For SQL select returning more than 1 value, how are they sorted when Id is GUID?

I'm wondering how SQL Server orders data returned from a query when the Id columns of the respective tables are all of type uniqueidentifier.
I'm using NHibernate GuidComb when creating all of the GUIDs and do things like:
Sheet sheet = sheetRepository.Get(_SheetGuid_); // has many line items
IList<SheetLineItem> lineItems = sheet.LineItems;
I'm just trying to figure out how they'll be ordered when I do something like:
foreach (SheetLineItem lineItem in lineItems)
I can't seem to find a good article on the way GUIDs are compared by SQL Server when they're being ordered, if that's what's happening.
GUIDs are sorted this way by the ORDER BY. Quoting the article...
0..3 are evaluated in left to right order and are the least important, then
4..5 are evaluated in left to right order, then
6..7 are evaluated in left to right order, then
8..9 are evaluated in right to left order, then
A..F are evaluated in right to left order and are the most important
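A quick way to see that ordering in action (a throwaway example; the values are chosen so each GUID has a non-zero byte in only one group):

SELECT g
FROM (VALUES
    (CAST('01000000-0000-0000-0000-000000000000' AS uniqueidentifier)),
    (CAST('00000000-0000-0000-0001-000000000000' AS uniqueidentifier)),
    (CAST('00000000-0000-0000-0000-000000000001' AS uniqueidentifier))
) AS t(g)
ORDER BY g;

The GUID whose non-zero byte sits in the last group sorts highest, because that group is the most significant in SQL Server's uniqueidentifier comparison; this is also why GuidComb puts its sequential part at the end.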
Unless you include an ORDER BY clause, SQL Server doesn't guarantee any order on the results. It may seem to come back in the same order consistently (e.g. clustered index order) but you can't be sure this will always be the case (e.g. if the query is split and executed on multiple threads, then when the results are combined, the order may differ on each execution since the threads may complete in different orders).
The only way to get a particular order is to use an ORDER BY clause. In NHibernate, this would be achieved by specifying an order-by="..." on your bags (or equivalent) in your mapping files.
See the NHibernate docs for more info on "order-by": http://nhibernate.info/doc/nh/en/index.html
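Whichever mechanism you use (the order-by mapping attribute, HQL, or a Criteria query), the end goal is simply that the SQL NHibernate issues contains an ORDER BY. A rough sketch of the kind of statement you want it to generate (the table and column names here are hypothetical, not taken from your mapping):

SELECT li.*
FROM SheetLineItem li
WHERE li.SheetId = @sheetId
ORDER BY li.LineNumber;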