UNION Statement Showing Inconsistent Results - sql

I have a SQL query that consists of two SELECT statements which are UNION'ed together. When run individually, the first SELECT returns 10 records and the second SELECT returns 1 record, so when I UNION the two SELECTs I would expect to get 11 records returned. This is not the case: I'm only getting 9 records.
Due to the nature of the SQL I can't actually post it here but it consists of numerous JOINS across 5 tables. Everything being returned is correct and valid.
Just wondering if anyone has seen this issue occur when UNION'ing two SELECT statements and if anyone has any advice on what could be the cause or even point me in the right direction, thanks.

UNION removes duplicates by default. To prevent duplicates from being removed, UNION ALL should be used.
Quoting the documentation:
The default behavior for UNION is that duplicate rows are removed from the result. The optional DISTINCT keyword has no effect other than the default because it also specifies duplicate-row removal. With the optional ALL keyword, duplicate-row removal does not occur and the result includes all matching rows from all the SELECT statements.
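To make the arithmetic from the question concrete, here is a minimal sketch that reproduces "10 rows plus 1 row gives 9 rows". The tables first_q and second_q and their contents are made up for illustration, and SQLite (through Python's sqlite3 module) stands in for the real database, so treat it as a demonstration of the quoted rule rather than of the asker's actual query.

```python
import sqlite3

# Hypothetical data; SQLite stands in for the real database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE first_q (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE second_q (id INTEGER, name TEXT)")

# 10 rows, but the row (1, 'a') appears twice.
first_rows = [(1, "a"), (1, "a")] + [(i, "x") for i in range(2, 10)]
conn.executemany("INSERT INTO first_q VALUES (?, ?)", first_rows)

# 1 row, identical to a row that first_q already returns.
conn.execute("INSERT INTO second_q VALUES (2, 'x')")

union_sql = "SELECT id, name FROM first_q UNION SELECT id, name FROM second_q"
union_all_sql = "SELECT id, name FROM first_q UNION ALL SELECT id, name FROM second_q"

# UNION drops the duplicate inside first_q and the one shared with second_q.
union_count = len(conn.execute(union_sql).fetchall())
# UNION ALL keeps every row from both inputs.
union_all_count = len(conn.execute(union_all_sql).fetchall())

print(len(first_rows), union_count, union_all_count)
```

One duplicate inside the first result plus one row shared between the two results is enough to account for the two "missing" records.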

By default, Oracle applies an implicit distinct clause to the result of a union. You may want to check whether the results of your separate queries include common items.
If you do not want this behavior, you need to use the UNION ALL clause instead.

Try UNION ALL instead of plain UNION. UNION returns only distinct rows.

Related

How does SQL UNION operator identify duplicates

Executing the following SQL (on a PostgreSQL database) results in 9 rows, even though the data sets from both tables are obviously not completely identical.
removed
Result:
removed
Why does it not result in 13 rows?
Using UNION ALL does the trick, but I am wondering how SQL UNION operator identifies duplicates?
UNION removes duplicates from the result set. It guarantees that the result has no duplicates at all. So, it removes duplicates both within tables and between tables.
You seem to have total duplicates within the tables. They are removed.
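As a rough illustration of how duplicates are identified: two rows count as duplicates when every selected column matches, and for this purpose NULLs are treated as equal to each other. The sketch below uses made-up tables t1 and t2 and runs on SQLite via Python's sqlite3, not on the asker's PostgreSQL data. The second query is one way to list which rows occur more than once across both inputs.

```python
import sqlite3

# Hypothetical tables; SQLite stands in for PostgreSQL here.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (a INTEGER, b TEXT)")
conn.execute("CREATE TABLE t2 (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [(1, "x"), (1, "x"), (1, None), (2, "x")],
)
conn.executemany(
    "INSERT INTO t2 VALUES (?, ?)",
    [(1, None), (1, "y"), (2, "x")],
)

# Whole rows are compared: (1, 'x') and (1, 'y') are different rows,
# while the two (1, NULL) rows collapse into one.
union_rows = conn.execute("SELECT a, b FROM t1 UNION SELECT a, b FROM t2").fetchall()

# List the rows that occur more than once across both inputs.
duplicates = conn.execute(
    """
    SELECT a, b, COUNT(*) AS copies
    FROM (SELECT a, b FROM t1 UNION ALL SELECT a, b FROM t2) AS u
    GROUP BY a, b
    HAVING COUNT(*) > 1
    """
).fetchall()

print(union_rows)
print(duplicates)
```

Seven input rows shrink to four here, for the same reason thirteen shrink to nine in the question.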

Postgres all subqueries in coalesce executed

COALESCE in Postgres is a function that returns its first non-null argument.
So I used coalesce in subqueries like:
SELECT COALESCE (
( SELECT * FROM users WHERE... ORDER BY ...),
( SELECT * FROM users WHERE... ORDER BY ...),
( SELECT * FROM users WHERE... ORDER BY ...),
( SELECT * FROM users WHERE... ORDER BY ...)
);
The WHERE clause is different in each query (they contain lots of parameters and CASE expressions), and so is the ORDER BY clause.
I do this because I always want to return something, with the queries tried in priority order.
What I noticed while issuing EXPLAIN ANALYZE is that every query is executed even when the first one actually returns a non-null value.
I would expect the engine to run only the first query, and to skip the following ones if it returns a non-null value.
As it is, I could get bad performance.
So am I following a bad practice, and is it better to run the queries separately for performance reasons?
EDIT:
Sorry, you were right: I don't select * but only one column. I didn't post my code because I am not interested in my particular query; it's a generic question to understand how the engine works. So I reproduced a very simple fiddle here http://sqlfiddle.com/#!17/a8aa7/4
I may be wrong, but I think it behaves as I described: it runs all the subqueries even though the first one already returns a non-null value.
EDIT 2: OK, I only now read that it says "never executed". So the other two queries aren't getting executed. What confused me was the fact that they were included in the query plan.
Anyway, my question still stands. Is it better to run all the queries separately for performance reasons? It seems that even if the first one returns a non-null value, the other two subqueries can slow things down.
For separate SELECT queries, I suggest using UNION ALL / LIMIT 1 instead. Based on your fiddle:
(select user_id from users order by age limit 1) -- misleading example, see below
UNION ALL
(select user_id from users where user_id=1)
UNION ALL
(select user_id from users order by user_id DESC limit 1)
LIMIT 1;
For three reasons:
Works for any SELECT list: single expressions (your fiddle), multiple or whole row (your example in the question).
You can distinguish actual NULL values from "no row". Since user_id is the PK in the example (and hence NOT NULL), the problem cannot surface in the example. But with an expression that can be NULL, COALESCE cannot distinguish between the two: "no row" is coerced to NULL for the purpose of the query. See:
Return a value if no record is found
Faster.
Aside, your first SELECT in the example makes this a wild-goose chase. It returns a row if there is at least one. The rest is noise in this case.
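To illustrate the "actual NULL versus no row" point, here is a small sketch. It assumes a hypothetical nullable column nickname (not in the fiddle) and runs on SQLite via Python's sqlite3. SQLite does not accept bare parentheses around the members of a UNION, so each branch is wrapped in a derived table instead. It also relies on UNION ALL returning the branches in the order written, which holds in practice here but is not something the SQL standard promises without an ORDER BY.

```python
import sqlite3

# Hypothetical schema: nickname is nullable, unlike user_id in the fiddle.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id INTEGER PRIMARY KEY, nickname TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)", [(1, None), (2, "bob")])

# Variant 1: COALESCE over scalar subqueries.
coalesce_sql = """
SELECT COALESCE(
    (SELECT nickname FROM users WHERE user_id = ?),
    (SELECT nickname FROM users WHERE user_id = 2)
)
"""

# Variant 2: UNION ALL / LIMIT 1, each branch wrapped in a derived table.
union_sql = """
SELECT nickname FROM (SELECT nickname FROM users WHERE user_id = ? LIMIT 1) AS q1
UNION ALL
SELECT nickname FROM (SELECT nickname FROM users WHERE user_id = 2 LIMIT 1) AS q2
LIMIT 1
"""

def first_match(sql, user_id):
    """Run one of the two variants with the given id for the first branch."""
    return conn.execute(sql, (user_id,)).fetchall()

# user_id = 1 exists with a NULL nickname; user_id = 99 does not exist.
coalesce_null, coalesce_missing = first_match(coalesce_sql, 1), first_match(coalesce_sql, 99)
union_null, union_missing = first_match(union_sql, 1), first_match(union_sql, 99)

print(coalesce_null, coalesce_missing)
print(union_null, union_missing)
```

COALESCE falls through to the second query in both cases, while the UNION ALL form stops at the first branch whenever that branch produces a row, even if the value in it is NULL.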
Related:
PostgreSQL combine multiple select statements
SQL - does order of OR conditions matter?
Way to try multiple SELECTs till a result is available?

I have a large query, how do I debug this?

So, I get this error message:
EDT ERROR: syntax error at or near "union" at character 436
The query in question is a large query that consists of 12 smaller queries all connected together with UNION ALL, and each small query has two inner join statements. So, something like:
SELECT table.someid as id
,table.lastname as name
,table2.groupname as groupname
, 'Leads ' as Type
from table
inner join table3 on table3.specificid = table.someid
INNER JOIN table2 on table3.specificid=table2.groupid
where table3.deleted=0
and table.someid > 0
and table2.groupid in ('2','3','4')
LIMIT 5
UNION all
query2....
Note that table2 and table3 are the same tables in each query, and the fields from table2 and table3 are also the same, I think.
Quick question (I am still kinda new to all this):
What does 'Leads ' as Type mean? Unlike the other statements preceding an AS, this one isn't written like table.something.
Quick edit question: What does table2.groupid in ('2','3','4') mean?
I checked each small query one by one. Each one works and returns a result, though the results are always empty for some reason (this may or may not depend on the logged-in user, as some PHP code generated this query).
As for the results themselves, most of them look something like this (they are arranged horizontally though):
id(integer)
name (character varying(80))
groupname (character varying(100))
type (unknown)
The difference in the results are twofold:
1) Most of the results contain the same field names, but quite a few of them have different field lengths. Some will say character varying(80), while others will say character varying(100); please correct me if this is not actually the field length.
2) Two of the queries contain different fields, but only the id field is different, probably because they don't have the "as id" part.
I am not quite sure what the requirements of UNION ALL are. I think it is meant to work only if all the fields are the same, but if that number in the brackets changes, are the fields considered different even if they have the same name?
Also, what's strange is that some of the queries returned the exact same fields, with the same field length, so I tried to UNION ALL only those queries, but no luck, still got a syntax error at UNION.
Another important thing I should mention is that the DB used to be MySQL, but we changed to PostgreSQL, so this bug might be a result of the change (i.e. code that might work in MySQL but not in Postgres).
Thanks for your time.
Without parentheses, you can have only one "LIMIT xxx" clause: at the end of the whole query, not before the UNION.
The error you get is due to missing parentheses here:
...
LIMIT 5
UNION all
...
The manual:
(ORDER BY and LIMIT can be attached to a subexpression if it is
enclosed in parentheses. Without parentheses, these clauses will be
taken to apply to the result of the UNION, not to its right-hand input
expression.)
Later example:
Sum results of a few queries and then find top 5 in SQL
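The effect is easy to reproduce on a small scale. The sketch below uses two made-up tables, leads and contacts, and runs on SQLite via Python's sqlite3, which also rejects a LIMIT placed before UNION ALL. Because SQLite does not accept the bare parentheses that the PostgreSQL manual describes, the fix is written here as derived tables, which is a variation on the same idea: each LIMIT is confined to its own sub-select.

```python
import sqlite3

# Hypothetical tables; SQLite stands in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE contacts (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO leads VALUES (?, ?)", [(i, f"lead{i}") for i in range(1, 9)])
conn.executemany("INSERT INTO contacts VALUES (?, ?)", [(i, f"contact{i}") for i in range(1, 9)])

# LIMIT directly in front of UNION ALL: rejected by the parser.
broken_sql = """
SELECT id, name, 'Leads' AS kind FROM leads LIMIT 5
UNION ALL
SELECT id, name, 'Contacts' AS kind FROM contacts LIMIT 5
"""

# Each LIMIT is enclosed in its own sub-select, so it applies to that part only.
fixed_sql = """
SELECT id, name, kind FROM (SELECT id, name, 'Leads' AS kind FROM leads LIMIT 5) AS q1
UNION ALL
SELECT id, name, kind FROM (SELECT id, name, 'Contacts' AS kind FROM contacts LIMIT 5) AS q2
"""

try:
    conn.execute(broken_sql)
    broken_error = None
except sqlite3.OperationalError as exc:
    broken_error = str(exc)

fixed_rows = conn.execute(fixed_sql).fetchall()
print(broken_error)
print(len(fixed_rows))
```

In PostgreSQL the same fix can be written by simply putting parentheses around each SELECT ... LIMIT 5, as the manual excerpt says.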
The only real way I have found to debug big queries is to break them into understandable parts and debug each subexpression independently:
Does each show the expected rows?
Are the resulting fields and types as expected?
For union, do the result fields and types exactly match corresponding other subexpressions?
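The checklist above can be partly automated. The following is only a sketch: describe_parts is a hypothetical helper, the table and the two sub-selects are invented, and it runs on SQLite via Python's sqlite3. It reports column names and row counts only, so column types still have to be checked with whatever metadata your own driver exposes.

```python
import sqlite3

def describe_parts(conn, parts):
    """Run each sub-select on its own and report its column names and row count."""
    report = []
    for sql in parts:
        cur = conn.execute(sql)
        columns = [d[0] for d in cur.description]
        report.append((columns, len(cur.fetchall())))
    return report

# Hypothetical table and sub-selects, loosely modelled on the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE leads (someid INTEGER, lastname TEXT)")
conn.execute("INSERT INTO leads VALUES (1, 'Smith')")

parts = [
    "SELECT someid AS id, lastname AS name, 'Leads' AS kind FROM leads",
    # Missing "AS id" and a filter that matches nothing.
    "SELECT someid, lastname AS name, 'Contacts' AS kind FROM leads WHERE someid > 1",
]

report = describe_parts(conn, parts)
for columns, row_count in report:
    print(columns, row_count)
```

Running the parts one at a time like this shows quickly which sub-select returns no rows or has a column list that differs from the others.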

In SQL, 'distinct' reduces the number of result rows from one to zero

I have a SQL statement of the following structure:
select distinct ...
from table1,
(select from table2, table3, table4 where ...)
where ...
order by ...
With certain values in the where clauses, the statement returns zero rows in the result set. When I remove the 'distinct' keyword, it returns a single row. I would expect to see a single result row in both cases. Is there some property of the 'distinct' keyword that I am not aware of and that causes this behavior?
The database is Oracle 11g.
What you describe is not the expected behaviour of DISTINCT. This is:
SQL> select * from dual
2 /
D
-
X
1 row selected.
SQL> select distinct * from dual
2 /
D
-
X
1 row selected.
SQL>
So, if what you say is happening really is what is happening then it's a bug. However, you also say it's a rare occurrence which means there is a good chance it is some peculiarity in your data and/or transient conditions in your environment, and not a bug.
You need to create a reproducible test case, for two reasons. Partly, nobody will be able to investigate your problem without one. But mainly because building a test case is an investigation in its own right: attempting to isolate the precise combination of data and/or ambient factors often generates the insight which leads to a solution.
It turned out that one of the sub-selects resulted in a data set that contained, among others, a row where every column was NULL. It seems that this row influenced the evaluation of the DISTINCT in a non-obvious way (at least to me). Maybe this is due to some under-the-hood SQL optimizations. After I removed the cause of this NULL-filled row, the problem is gone and the statement evaluates to one row in the result as it should.
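For reference, a test case of the kind suggested above could start out like the sketch below. It uses a made-up table sub and runs on SQLite via Python's sqlite3, so it does not reproduce the Oracle 11g behaviour. It only shows the semantics one would normally expect: all-NULL rows are collapsed into a single row by DISTINCT, and a row that is present without DISTINCT does not disappear with it.

```python
import sqlite3

# Hypothetical sub-select result containing rows where every column is NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sub (a INTEGER, b TEXT)")
conn.executemany(
    "INSERT INTO sub VALUES (?, ?)",
    [(1, "x"), (None, None), (None, None)],
)

plain_rows = conn.execute("SELECT a, b FROM sub").fetchall()
distinct_rows = conn.execute("SELECT DISTINCT a, b FROM sub").fetchall()

# DISTINCT may reduce the row count, but every distinct row must survive.
print(len(plain_rows), len(distinct_rows))
```

If the same comparison on the real data and database gives zero rows with DISTINCT and one row without, that is the reproducible case worth reporting.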

SQL Server UNION - What is the default ORDER BY Behaviour

If I have a few UNION Statements as a contrived example:
SELECT * FROM xxx WHERE z = 1
UNION
SELECT * FROM xxx WHERE z = 2
UNION
SELECT * FROM xxx WHERE z = 3
What is the default order by behaviour?
The test data I'm seeing essentially does not return the data in the order that is specified above. I.e. the data is ordered, but I wanted to know what are the rules of precedence on this.
Another thing is that in this case xxx is a View. The view joins 3 different tables together to return the results I want.
There is no default order.
Without an Order By clause the order returned is undefined. That means SQL Server can bring them back in any order it likes.
EDIT:
Based on what I have seen, without an Order By, the order that the results come back in depends on the query plan. So if there is an index that it is using, the result may come back in that order but again there is no guarantee.
In regards to adding an ORDER BY clause:
This is probably elementary to most here, but I thought I'd add this.
Sometimes you don't want the results mixed, so you want the first query's results then the second and so on. To do that I just add a dummy first column and order by that. Because of possible issues with forgetting to alias a column in unions, I usually use ordinals in the order by clause, not column names.
For example:
SELECT 1, * FROM xxx WHERE z = 'abc'
UNION ALL
SELECT 2, * FROM xxx WHERE z = 'def'
UNION ALL
SELECT 3, * FROM xxx WHERE z = 'ghi'
ORDER BY 1
The dummy ordinal column is also useful for times when I'm going to run two queries and I know only one is going to return any results. Then I can just check the ordinal of the returned results. This saves me from having to do multiple database calls and most empty resultset checking.
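Here is a runnable version of the dummy-column idea, as a sketch. The table contents are made up, the alias src is my own name for the dummy column, and it runs on SQLite via Python's sqlite3 rather than SQL Server. Note that ORDER BY 1 only fixes the order between the three blocks; rows inside one block still come back in no particular order unless more sort columns are added.

```python
import sqlite3

# Hypothetical data for the xxx table or view from the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE xxx (z TEXT, v INTEGER)")
conn.executemany(
    "INSERT INTO xxx VALUES (?, ?)",
    [("ghi", 30), ("abc", 10), ("def", 20), ("abc", 11)],
)

# The first column records which SELECT a row came from; ORDER BY 1 sorts on it.
sql = """
SELECT 1 AS src, z, v FROM xxx WHERE z = 'abc'
UNION ALL
SELECT 2, z, v FROM xxx WHERE z = 'def'
UNION ALL
SELECT 3, z, v FROM xxx WHERE z = 'ghi'
ORDER BY 1
"""

rows = conn.execute(sql).fetchall()
sources = [row[0] for row in rows]
print(sources)
```

The same src value is what you would inspect to find out which of the queries actually produced the rows.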
Just found the actual answer.
Because UNION removes duplicates it does a DISTINCT SORT. This is done before all the UNION statements are concatenated (check out the execution plan).
To stop a sort, do a UNION ALL and this will also not remove duplicates.
If you care what order the records are returned, you MUST use an order by.
If you leave it out, it may appear organized (based on the indexes chosen by the query plan), but the results you see today may NOT be the results you expect, and it could even change when the same query is run tomorrow.
Edit: Some good, specific examples: (all examples are MS SQL server)
Pinal Dave's blog describes how two very similar queries can show a different apparent order, because different indexes are used:
SELECT ContactID FROM Person.Contact
SELECT * FROM Person.Contact
Conor Cunningham shows how the apparent order can change when the table gets larger (if the query optimizer decides to use a parallel execution plan).
Hugo Kornelis proves that the apparent order is not always based on primary key. Here is his follow-up post with explanation.
A UNION can be deceptive with respect to result set ordering because a database will sometimes use a sort method to provide the DISTINCT that is implicit in UNION, which makes it look like the rows are deliberately ordered -- this doesn't apply to UNION ALL, for which there is no implicit distinct, of course.
However there are algorithms for the implicit distinct, such as Oracle's hash method in 10g+, for which no ordering will be applied.
As DJ says, always use an ORDER BY
It's very common to come across poorly written code that assumes table data is returned in insert order. 95% of the time the coder gets away with it and is never aware that this is a problem, because on many common databases (MSSQL, Oracle, MySQL) the rows often do happen to come back that way. It is of course a complete fallacy and should be corrected whenever you come across it. Always, without exception, use an ORDER BY clause yourself.