How does the SQL UNION operator identify duplicates? - sql

Executing the following SQL (on a PostgreSQL database) results in 9 rows, even though the data sets from both tables are obviously not completely identical.
removed
Result:
removed
Why does it not result in 13 rows?
Using UNION ALL does the trick, but I am wondering how the SQL UNION operator identifies duplicates.

UNION removes duplicates from the result set. It guarantees that the result has no duplicates at all. So, it removes duplicates both within tables and between tables.
You seem to have fully duplicated rows within the tables themselves. Those are removed as well.
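To make the behaviour concrete, here is a minimal PostgreSQL sketch. The tables t1 and t2 and their contents are made up for illustration and are not the asker's data. UNION compares whole rows, so two rows count as duplicates only when every selected column matches, and for this comparison two NULLs are treated as equal.

```sql
-- Hypothetical tables for illustration only; not the asker's schema.
CREATE TEMP TABLE t1 (id int, label text);
CREATE TEMP TABLE t2 (id int, label text);

-- t1 contains an exact duplicate of (1, 'a') within the same table.
INSERT INTO t1 VALUES (1, 'a'), (1, 'a'), (2, 'b'), (3, NULL);
-- t2 repeats (2, 'b') and (3, NULL) from t1.
INSERT INTO t2 VALUES (2, 'b'), (3, NULL), (4, 'd');

-- UNION ALL keeps every row: 4 + 3 = 7 rows.
SELECT id, label FROM t1
UNION ALL
SELECT id, label FROM t2;

-- UNION keeps one copy of each distinct row: 4 rows,
-- (1, 'a'), (2, 'b'), (3, NULL), (4, 'd'), in no guaranteed order.
SELECT id, label FROM t1
UNION
SELECT id, label FROM t2;
```

Note that the (1, 'a') duplicate disappears even though it never occurs in t2, which is the "within tables" case described above.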

Related

MS Access SQL - Removing Duplicates From Query

MS Access SQL - This is a generic performance-related duplicates question. So, I don't have a specific example query, but I believe I have explained the situation below clearly and simply in 3 statements.
I have a standard/complex SQL query that Selects many columns; some computed, some with asterisk, and some by name - e.g. (tab1.*, (tab2.co1 & tab2.col2) as computedFld1, tab3.col4, etc).
This query Joins about 10 tables. And the Where clause is based on user specified filters that could be based on any of the fields present in all 10 tables.
Based on these filters, I can sometimes get records with the same tab4.ID value.
Question: What is the best way to eliminate duplicate result rows with the same tab4.ID value? I don't care which rows get eliminated. They will differ in unimportant ways.
Or, if important, they will differ in that they will have different tab5.ID values; and I want to keep the result rows with the LARGEST tab5.ID values.
But if the first query performs better than the second, then I really don't care which rows get eliminated. The performance is more important.
I have worked on this most of the morning and I am afraid that the answer to this is above my pay scale. I have tried Group By tab4.ID, but can't use "*" in Select clause; and many other things that I just keep bumping my head against a wall.
Access does not support CTEs but you can do something similar with saved queries.
So first alias the columns that share a name in your query, something like:
SELECT tab4.ID AS tab4_id, tab5.ID AS tab5_id, ........
and then save your query, for example as myquery.
Then you can use this saved query like this:
SELECT q1.*
FROM myquery AS q1
WHERE q1.tab5_id = (SELECT MAX(q2.tab5_id) FROM myquery AS q2 WHERE q2.tab4_id = q1.tab4_id)
This will return 1 row for each tab4_id if there are no duplicate tab5_ids for each tab4_id.
If there are duplicates then you must provide additional conditions.

When are extra columns removed in Teradata SQL?

I understand that the order of operations for SQL in Teradata is as follows:
FROM clause
WHERE clause
GROUP BY clause
HAVING clause
SELECT clause
ORDER BY clause
This is from this link.
Does this mean that any extra, unneeded columns in the tables I am joining are always removed at the very end (when SELECT is performed)? Do those extra unselected columns take up spool space until they are finally dropped?
So if I am joining Table A (5 columns) with Table B (10 columns), the intermediate result right after the join is 14 columns (with 1 common key). But let's say I'm ultimately only selecting 3 columns at the end.
Does the query optimizer always include all 14 columns in the intermediate result (thus taking up spool space) or is it smart enough to only include the needed 3 columns in the intermediate result?
If it is smart enough to do this, then I could save spool space by rewriting every table I'm joining to as a subquery of ONLY the columns I need from that table.
Thank you for your help.
You are confusing the compiling and execution of queries.
Those are not the "order of operations". What you have described is the order of "interpreting the query". This occurs during the compilation phase, when the identifiers (column and table names and aliases) are interpreted.
SQL is a descriptive language. A SQL query describes the result set. It does not describe how the data is processed (a procedural language would do that).
As for not reading columns: Teradata is probably smart enough to read only the columns it needs from the data pages and not carry unreferenced columns through the rest of the processing.
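One way to check this rather than guess is to prefix the query with EXPLAIN, which Teradata supports, and read the plan text, which describes the steps and the spool files they build. The sketch below uses hypothetical table and column names (table_a, table_b, key_col, col1, col2, col7) standing in for the 5-column and 10-column tables from the question.

```sql
-- Hypothetical tables and columns, for illustration only.
-- EXPLAIN returns the optimizer's plan as text instead of running the query.
EXPLAIN
SELECT a.col1, a.col2, b.col7
FROM table_a AS a
INNER JOIN table_b AS b
    ON a.key_col = b.key_col;
```

Comparing this plan with the plan for a version that wraps each table in a subquery selecting only the needed columns should show whether the rewrite changes anything.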

How can I optimize an SQL query (Using Indexes but not bitmap indexes)?

I have an SQL query, example:
SELECT * FROM TAB1 NATURAL JOIN TAB2 WHERE TAB1.COL1 = 'RED'
How can I optimize this query to use indexes but not bitmap indexes in Oracle?
NOTE: This answers the original version of the question.
First, don't use NATURAL JOIN. It is an abomination because it does not use properly declared foreign key relationships. It simply uses columns with the same name, and that can produce misleading results.
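As a rough sketch of the alternative, an explicit join names the join columns instead of relying on matching column names. The key columns used here (tab1.id and tab2.tab1_id) and the index name are assumptions, since the question does not say how the tables are related.

```sql
-- Hypothetical join key (tab2.tab1_id references tab1.id);
-- the question does not name one.
SELECT t1.*, t2.*
FROM tab1 t1
INNER JOIN tab2 t2
    ON t2.tab1_id = t1.id
WHERE t1.col1 = 'RED';

-- In Oracle, a plain CREATE INDEX builds a B-tree index, not a bitmap index
-- (a bitmap index would need CREATE BITMAP INDEX).
CREATE INDEX tab1_col1_idx ON tab1 (col1);
```

Whether the optimizer actually uses such an index depends on how selective the filter on col1 is and on the table statistics.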
Second, the query is syntactically incorrect for two reasons. First, "Red" is a reference to a column, not a string value. Does the table have a column named "Red"? The second reason is that you have a self join, so ROW1 is ambiguous.
That brings up the larger issue. Your query basically makes no sense at all. You are joining the table to itself, returning duplicate columns. What are the results? Pretty indeterminate:
If any column contains a NULL value, then no rows are returned.
If all the rows are duplicates (with no NULL values), then you'll get a result set with N^2 rows and duplicate columns, where N is the number of rows in the table.
I cannot think of any use for the query. I see no reason to try to optimize it.
If you have a real query that you want to discuss, I would suggest that you ask another question.

Joining two 50-million-row tables one to one by row

I have two tables having 50 million unique rows each.
The row number in one table corresponds to the row number in the second table.
That is, the 1st row in the 1st table joins with the 1st row in the second table, the 2nd row in the first table joins with the 2nd row in the second table, and so on. Doing an inner join is costly.
It takes more than 5 hours on clusters. Is there an efficient way to do this in SQL?
To start with: tables are just sets, so the row number of a record is pure coincidence. You must not join two tables based on row numbers; join on IDs instead.
There is nothing more efficient than a simple inner join. As both tables must be read in full, you might not even gain anything from indexes (but as we are talking about IDs, there will be indexes anyhow, so that is nothing we need to ponder).
Depending on the DBMS you may be able to parallelize the query. In Oracle for example you would use a hint such as /*+ parallel( tablename , parallel_factor ) */.
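For illustration, the Oracle hint could be placed like this. The table names, column names, and the degree of parallelism (4) are hypothetical; a sensible degree depends on the hardware and on how the instance is configured.

```sql
-- Hypothetical table and column names; the degree of parallelism 4 is arbitrary.
-- The hint references the table aliases used in the FROM clause.
SELECT /*+ PARALLEL(a, 4) PARALLEL(b, 4) */
       a.id, a.payload_a, b.payload_b
FROM table_a a
INNER JOIN table_b b
    ON b.id = a.id;
```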
Try sorting both tables by row (if they aren't sorted already), then use a normal SELECT on each table (maybe with LIMIT to fetch it part by part) and connect the data line by line wherever you want.

UNION Statement Showing Inconsistent Results

I have a SQL query that consists of two SELECT statements which are UNION'ed together. When run individually, the first SELECT returns 10 records and the second SELECT returns 1 record, so when I UNION the two SELECTs I would expect to get 11 records returned. This is not the case: I'm only getting 9 records.
Due to the nature of the SQL I can't actually post it here but it consists of numerous JOINS across 5 tables. Everything being returned is correct and valid.
Just wondering if anyone has seen this issue occur when UNION'ing two SELECT statements and if anyone has any advice on what could be the cause or even point me in the right direction, thanks.
UNION removes duplicates by default. To prevent duplicates from being removed, use UNION ALL.
Quoting the documentation:
The default behavior for UNION is that duplicate rows are removed from the result. The optional DISTINCT keyword has no effect other than the default because it also specifies duplicate-row removal. With the optional ALL keyword, duplicate-row removal does not occur and the result includes all matching rows from all the SELECT statements.
By default, Oracle applies an implicit distinct clause to the result of a union. You may want to check whether the results of your separate queries include common items.
If you do not want this behavior, you need to use the UNION ALL clause instead.
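A quick way to run that check is sketched below. Here query_a and query_b are hypothetical stand-ins (for example views) for the two SELECT statements, and col1, col2 stand for their full column lists. Since the second SELECT returns only 1 row, it can account for at most one removed row, so a drop from 11 to 9 means the first SELECT must also contain at least one duplicate of its own.

```sql
-- query_a and query_b are hypothetical stand-ins for the two SELECTs.

-- Rows that appear in both result sets.
SELECT * FROM query_a
INTERSECT
SELECT * FROM query_b;

-- Rows that occur more than once within the first result set.
-- List every selected column in both SELECT and GROUP BY.
SELECT col1, col2, COUNT(*) AS copies
FROM query_a
GROUP BY col1, col2
HAVING COUNT(*) > 1;
```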
Try using UNION ALL instead of plain UNION. UNION returns only distinct rows.