2 Left join for 2 tables? - sql

I am on MySQL:
I have 2 tables: one is the main table, the other is an accessory table that contains some information supporting the records of the main table.
Example:
table portal:
id title desc
12 "aaa" "desc"
13 "bbb" "desc"
[etc]
secondary table (omitting the primary id field)
type portalid
x 12
2 12
3 12
4 12
5 12
1 13
2 13
4 13
I need to select every record in the table portal that has a record in the secondary table with type = 4 but no record with type = 5.
Example:
SELECT *
FROM portal,secondary_table s
WHERE portal.id=s.portalid
AND type of secondary_table is 4 and is not 5
Results:
In this case only record 13 of portal should be returned, because record 12 has both type 4 and type 5.
Please note I asked a similar question before, but considering only one table, and the query from that question took over 50 seconds to run.
Thanks for any help

You should consider rephrasing it using EXISTS / NOT EXISTS clauses. If all you want are records from portal, then a double EXISTS clause will work and very clearly reveals the intention of the query:
SELECT *
FROM portal
WHERE EXISTS (select * from secondary_table s1
where portal.id=s1.portalid
and s1.type=4)
AND NOT EXISTS (select * from secondary_table s2
where portal.id=s2.portalid
and s2.type=5)
However, due to how MySQL processes EXISTS clauses (even though they are clearer), you can trade off clarity for performance using LEFT JOIN / IS NULL. Please read the following link; note that the performance of each query may vary with the specific data distribution, so try both and use whichever works better for your data.
NOT IN vs. NOT EXISTS vs. LEFT JOIN / IS NULL: MySQL
The LEFT JOIN / IS NULL form would be written
SELECT *
FROM portal
JOIN secondary_table s1 ON portal.id=s1.portalid and s1.type=4
LEFT JOIN secondary_table s2 ON portal.id=s2.portalid and s2.type=5
WHERE s2.portalid IS NULL
The order of the tables (portal, inner, left) is to allow processing the first two tables (portal + secondary/type=4) and trimming the result set early before launching into the LEFT (outer) JOIN (that retains everything from the left side) for the existential test.
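To check that the two forms above return the same rows, here is a minimal sketch that loads the question's sample data into an in-memory SQLite database (a stand-in for MySQL, so it says nothing about MySQL performance). The first type value for portal 12 appears as "x" in the question; the sketch assumes it is 1, and it leaves out the desc column.

```python
import sqlite3

# In-memory SQLite database standing in for the MySQL tables in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE portal (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE secondary_table (type INTEGER, portalid INTEGER);
    INSERT INTO portal VALUES (12, 'aaa'), (13, 'bbb');
    -- Assumption: the "x" type for portal 12 in the question is taken to be 1.
    INSERT INTO secondary_table VALUES
        (1, 12), (2, 12), (3, 12), (4, 12), (5, 12),
        (1, 13), (2, 13), (4, 13);
""")

# Form 1: EXISTS for type 4, NOT EXISTS for type 5.
exists_sql = """
    SELECT p.id FROM portal p
    WHERE EXISTS (SELECT 1 FROM secondary_table s1
                  WHERE s1.portalid = p.id AND s1.type = 4)
      AND NOT EXISTS (SELECT 1 FROM secondary_table s2
                      WHERE s2.portalid = p.id AND s2.type = 5)
"""

# Form 2: inner join for type 4, LEFT JOIN / IS NULL for type 5.
join_sql = """
    SELECT p.id FROM portal p
    JOIN secondary_table s1 ON p.id = s1.portalid AND s1.type = 4
    LEFT JOIN secondary_table s2 ON p.id = s2.portalid AND s2.type = 5
    WHERE s2.portalid IS NULL
"""

exists_ids = [row[0] for row in conn.execute(exists_sql)]
join_ids = [row[0] for row in conn.execute(join_sql)]
print(exists_ids, join_ids)
```

Both lists contain only portal 13. One difference to keep in mind: if a portal had several type = 4 rows, the join form would return it once per matching row, while the EXISTS form would return it once.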

This is why you should avoid the older FROM A,B syntax - it's less powerful with respect to certain things. Use explicit join types (LEFT/RIGHT/INNER/FULL/CROSS) instead.
SELECT <columns>
FROM portal p
LEFT JOIN secondary s1 ON p.id=s1.portalid AND s1.type = 5
INNER JOIN secondary s2 ON p.id=s2.portalid AND s2.type = 4
WHERE s1.type IS NULL

I will use this query, which is very similar to richard's EXISTS version:
SELECT * FROM portal
WHERE id IN (SELECT portalid FROM sec WHERE type=4)
AND id NOT IN (SELECT portalid FROM sec WHERE type=5)
imo it's even more readable.

Related

left -join-on-concat-fields

I have to join (I am required to use a join) these 2 tables, table A and table B.
I have table A
CoverID
4!12569
3!18175
1!478931
And Table B
ID Accountid
4 12569
3 18175
1 478931
Please advise how can I join the table using concat
SELECT *
From TableA A
Left join TableB B
On A.CoverId = concat(B.ID, '!', B.AccountId)
You need to make sure the data types match, so you may have to experiment with casting the columns inside the concatenation. Besides that, it should work.
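As a rough illustration of the cast-then-concatenate join, the sketch below runs it against an in-memory SQLite database. SQLite is only a stand-in here: it uses the `||` operator where the answer uses concat(), and the explicit CAST calls show where type matching would happen. The row '9!1' is a hypothetical extra row added to show what an unmatched CoverID looks like.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA (CoverID TEXT);
    CREATE TABLE TableB (ID INTEGER, AccountId INTEGER);
    -- '9!1' is a hypothetical row with no counterpart in TableB.
    INSERT INTO TableA VALUES ('4!12569'), ('3!18175'), ('1!478931'), ('9!1');
    INSERT INTO TableB VALUES (4, 12569), (3, 18175), (1, 478931);
""")

# Cast both numeric columns to text, glue them with '!', and compare
# the result with the composite key stored in TableA.
rows = conn.execute("""
    SELECT A.CoverID, B.ID, B.AccountId
    FROM TableA A
    LEFT JOIN TableB B
      ON A.CoverID = CAST(B.ID AS TEXT) || '!' || CAST(B.AccountId AS TEXT)
    ORDER BY A.CoverID
""").fetchall()

for row in rows:
    print(row)
```

The unmatched '9!1' row comes back with NULL (None in Python) for both TableB columns, which is the usual LEFT JOIN behaviour.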

How to merge missing records from the backup table to the original one

I have 2 database schemas (on Oracle 11g): BackUp and Original.
The table TAX has the same structure in both schemas.
It doesn't have a primary key, but it has an index on:
TAXCOD, COUNTRY, LONG_SHORT and FUND.
I want to select the records that are missing from the original table and found on the backup.
I checked the Left Outer Join query on this Link
http://blog.codinghorror.com/a-visual-explanation-of-sql-joins/
And, I wrote this query according to what is mentioned :
select *
from BackUp.TAX x
left outer join Original.TAX y
on (x.TAXCOD = y.TAXCOD and
x.COUNTRY = y.COUNTRY and
x.LONG_SHORT = y.LONG_SHORT and
x.FUND = y.FUND)
where y.TAXCOD is NULL
and y.COUNTRY is NULL
and y.LONG_SHORT is NULL
and y.FUND is NULL
but this query returns even the records common to the 2 tables, not just the missing ones.
Could you please explain where the problem lies?
Thanks in advance.
I'd use EXCEPT to find rows in one table that do not exist in the other table (Oracle 11g spells this operator MINUS and has no ALL variant):
select * from BackUp.TAX
except all
select * from Original.TAX
Or, if you just want to match the mentioned columns, use NOT EXISTS:
select * from BackUp.TAX x
where NOT EXISTS (select 1 from Original.TAX y
where x.TAXCOD = y.TAXCOD
and x.COUNTRY = y.COUNTRY
and x.LONG_SHORT = y.LONG_SHORT
and x.FUND = y.FUND)
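The difference between the two approaches above (whole-row comparison versus matching only the four indexed columns) can be seen in a small sketch. It uses an in-memory SQLite database rather than Oracle, and the table names backup_tax and original_tax, the RATE column, and all the data are made up for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical stand-ins for BackUp.TAX and Original.TAX; RATE is an
    -- invented non-key column.
    CREATE TABLE backup_tax (TAXCOD TEXT, COUNTRY TEXT, LONG_SHORT TEXT,
                             FUND TEXT, RATE REAL);
    CREATE TABLE original_tax (TAXCOD TEXT, COUNTRY TEXT, LONG_SHORT TEXT,
                               FUND TEXT, RATE REAL);
    INSERT INTO backup_tax VALUES
        ('T1', 'FR', 'L', 'F1', 0.10),
        ('T2', 'DE', 'S', 'F2', 0.20),
        ('T3', 'IT', 'L', 'F3', 0.30);
    INSERT INTO original_tax VALUES
        ('T1', 'FR', 'L', 'F1', 0.10),
        ('T3', 'IT', 'L', 'F3', 0.35);  -- same key as backup, different RATE
""")

# NOT EXISTS on the four key columns: only rows whose key is absent.
missing_by_key = [row[0] for row in conn.execute("""
    SELECT x.TAXCOD FROM backup_tax x
    WHERE NOT EXISTS (SELECT 1 FROM original_tax y
                      WHERE x.TAXCOD = y.TAXCOD
                        AND x.COUNTRY = y.COUNTRY
                        AND x.LONG_SHORT = y.LONG_SHORT
                        AND x.FUND = y.FUND)
    ORDER BY x.TAXCOD
""")]

# EXCEPT on whole rows: also reports rows that differ in any column.
differing_rows = [row[0] for row in conn.execute("""
    SELECT * FROM backup_tax
    EXCEPT
    SELECT * FROM original_tax
    ORDER BY TAXCOD
""")]

print(missing_by_key, differing_rows)
```

Here the key-based query reports only T2, while the whole-row EXCEPT also reports T3 because its RATE differs between the two tables. Which behaviour is wanted depends on whether a changed row counts as "missing".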

Query on dremel-linked tables stopped working

I have 3 dremel-linked tables: 2 identity tables and 1 table connecting 2 identity tables.
Table A (4500 rows):
a_id (key);
a_attr1;
a_attr2.
Table B (1500 rows):
b_id (key);
b_attr1;
b_attr2.
Table C (700 rows):
a_id;
b_id.
The simplified query is:
SELECT
A.a_id,
a_attr1,
GROUP_CONCAT(STRING(b_attr1)) AS b_attr1,
STRFTIME_UTC_USEC(NOW(), '%a %e-%b-%Y %R %Z'),
SUM(b_attr2) AS b_attr2
FROM [dataspace_name]:[project_name]:[dataset_name].A
LEFT OUTER JOIN
(SELECT
b_id,
b_attr1,
b_attr2,
a_id
FROM [dataspace_name]:[project_name]:[dataset_name].B
JOIN [dataspace_name]:[project_name]:[dataset_name].C
ON [dataspace_name]:[project_name]:[dataset_name].B.b_id = [dataspace_name]:[project_name]:[dataset_name].C.b_id
) AS BC
ON A.a_id = BC.a_id
WHERE
a_attr2 = 1
GROUP BY
a_attr1
HAVING
(b_attr2 IS NULL) OR (b_attr2 > 0)
ORDER BY
a_attr1
;
This query was running fine for several months until last Monday, 5/13/2013.
The error message I get is:
Large table C must appear as the leftmost table in a join query.
I tried to rewrite the query following the error message and swapping the tables, but I get the same message about the rightmost table.
Any advice on what may be causing the failure and how to fix the query is much appreciated.
It was solved and the query is working again. If anybody is interested, the fix was to replace JOIN [dataspace_name]:[project_name]:[dataset_name].C with JOIN EACH [dataspace_name]:[project_name]:[dataset_name].C

No duplicates in SQL query

I'm doing a select in MySQL with an inner join:
SELECT DISTINCT tblcaritem.caritemid, tblcar.icarid
FROM tblcaritem
INNER JOIN tblprivatecar ON tblcaritem.partid = tblprivatecar.partid
INNER JOIN tblcar ON tblcaritem.carid = tblcar.carid
WHERE tblcaritem.userid=72;
Sometimes I get duplicates of tblcaritem.caritemid in the result. I want to make sure to never get duplicates of tblcaritem.caritemid, but how can I do that? I tried to use DISTINCT but it just checked if the whole row is a duplicate, I want to check for only tblcaritem.caritemid, is there a way?
Sorry if I didn't explain it very well, I'm not the best of SQL queries.
GROUP BY tblcaritem.caritemid
The problem here is just as you describe: you're checking for the uniqueness of the whole row, your dataset ends up looking like this:
CarItemId CarId
--------- -----
1 1
1 2
1 3
2 1
2 2
3 3
You want unique CarItemIds with no duplicates, meaning that you also need a single CarId per CarItemId. Alright, CarItemId 1 has 3 CarIds; which one should the database pick?
You don't have much choice here except to aggregate those extra CarIds away:
SELECT tblcaritem.caritemid, max(tblcar.icarid)
FROM tblcaritem
INNER JOIN tblprivatecar ON tblcaritem.partid = tblprivatecar.partid
INNER JOIN tblcar ON tblcaritem.carid = tblcar.carid
WHERE tblcaritem.userid=72
GROUP BY tblcaritem.caritemid
If you put 2 fields in a SELECT DISTINCT query, it will return rows where the combination of the 2 fields is unique, not just where each field is unique.
You are going to have to run 2 separate queries if you want only unique results for each column.
You could use an aggregate function on the other column, something like max or min:
SELECT tblcaritem.caritemid, max(tblcar.icarid)
FROM tblcaritem
INNER JOIN tblprivatecar ON tblcaritem.partid = tblprivatecar.partid
INNER JOIN tblcar ON tblcaritem.carid = tblcar.carid
WHERE tblcaritem.userid=72
group by tblcaritem.caritemid;
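A small sketch of the difference between DISTINCT and GROUP BY with an aggregate, run on an in-memory SQLite database. The table pairs is hypothetical: it stands for the already-joined result shown in the answer above, with one extra duplicate row added so that DISTINCT has something to remove.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical table holding the joined (caritemid, carid) pairs.
    CREATE TABLE pairs (caritemid INTEGER, carid INTEGER);
    INSERT INTO pairs VALUES
        (1, 1), (1, 2), (1, 3), (2, 1), (2, 2), (3, 3),
        (1, 1);  -- exact duplicate of the first row
""")

# DISTINCT removes only rows that are identical in every selected column.
distinct_rows = conn.execute("""
    SELECT DISTINCT caritemid, carid FROM pairs ORDER BY caritemid, carid
""").fetchall()

# GROUP BY collapses to one row per caritemid; MAX picks which carid survives.
grouped_rows = conn.execute("""
    SELECT caritemid, MAX(carid) FROM pairs
    GROUP BY caritemid ORDER BY caritemid
""").fetchall()

print(distinct_rows)
print(grouped_rows)
```

DISTINCT still returns six rows (caritemid 1 appears three times), while the grouped query returns exactly one row per caritemid. Swapping MAX for MIN changes which carid is kept, not the number of rows.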

NOT IN vs NOT EXISTS

Which of these queries is the faster?
NOT EXISTS:
SELECT ProductID, ProductName
FROM Northwind..Products p
WHERE NOT EXISTS (
SELECT 1
FROM Northwind..[Order Details] od
WHERE p.ProductId = od.ProductId)
Or NOT IN:
SELECT ProductID, ProductName
FROM Northwind..Products p
WHERE p.ProductID NOT IN (
SELECT ProductID
FROM Northwind..[Order Details])
The query execution plan says they both do the same thing. If that is the case, which is the recommended form?
This is based on the NorthWind database.
[Edit]
Just found this helpful article:
http://weblogs.sqlteam.com/mladenp/archive/2007/05/18/60210.aspx
I think I'll stick with NOT EXISTS.
I always default to NOT EXISTS.
The execution plans may be the same at the moment but if either column is altered in the future to allow NULLs the NOT IN version will need to do more work (even if no NULLs are actually present in the data) and the semantics of NOT IN if NULLs are present are unlikely to be the ones you want anyway.
When neither Products.ProductID nor [Order Details].ProductID allows NULLs, the NOT IN will be treated identically to the following query.
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
The exact plan may vary but for my example data I get the following.
A reasonably common misconception seems to be that correlated sub queries are always "bad" compared to joins. They certainly can be when they force a nested loops plan (sub query evaluated row by row) but this plan includes an anti semi join logical operator. Anti semi joins are not restricted to nested loops but can use hash or merge (as in this example) joins too.
/*Not valid syntax but better reflects the plan*/
SELECT p.ProductID,
p.ProductName
FROM Products p
LEFT ANTI SEMI JOIN [Order Details] od
ON p.ProductId = od.ProductId
If [Order Details].ProductID is NULL-able the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
The reason for this is that the correct semantics if [Order Details] contains any NULL ProductIds is to return no results. The extra anti semi join and row count spool that are added to the plan verify this.
If Products.ProductID is also changed to become NULL-able the query then becomes
SELECT ProductID,
ProductName
FROM Products p
WHERE NOT EXISTS (SELECT *
FROM [Order Details] od
WHERE p.ProductId = od.ProductId)
AND NOT EXISTS (SELECT *
FROM [Order Details]
WHERE ProductId IS NULL)
AND NOT EXISTS (SELECT *
FROM (SELECT TOP 1 *
FROM [Order Details]) S
WHERE p.ProductID IS NULL)
The reason for that one is because a NULL Products.ProductId should not be returned in the results except if the NOT IN sub query were to return no results at all (i.e. the [Order Details] table is empty). In which case it should. In the plan for my sample data this is implemented by adding another anti semi join as below.
The effect of this is shown in the blog post already linked by Buckley. In the example there, the number of logical reads increases from around 400 to 500,000.
Additionally the fact that a single NULL can reduce the row count to zero makes cardinality estimation very difficult. If SQL Server assumes that this will happen but in fact there were no NULL rows in the data the rest of the execution plan may be catastrophically worse, if this is just part of a larger query, with inappropriate nested loops causing repeated execution of an expensive sub tree for example.
This is not the only possible execution plan for a NOT IN on a NULL-able column however. This article shows another one for a query against the AdventureWorks2008 database.
For the NOT IN on a NOT NULL column or the NOT EXISTS against either a nullable or non nullable column it gives the following plan.
When the column changes to NULL-able the NOT IN plan now looks like
It adds an extra inner join operator to the plan. This apparatus is explained here. It is all there to convert the previous single correlated index seek on Sales.SalesOrderDetail.ProductID = <correlated_product_id> to two seeks per outer row. The additional one is on WHERE Sales.SalesOrderDetail.ProductID IS NULL.
As this is under an anti semi join if that one returns any rows the second seek will not occur. However if Sales.SalesOrderDetail does not contain any NULL ProductIDs it will double the number of seek operations required.
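The rule for a NULL on the outer side (returned only when the NOT IN subquery is empty) can be checked with a short sketch. It uses an in-memory SQLite database instead of SQL Server, with hypothetical lower-case tables products and order_details, so it illustrates the semantics only and not the execution plans discussed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical nullable columns on both sides; order_details starts empty.
    CREATE TABLE products (product_id INTEGER);
    CREATE TABLE order_details (product_id INTEGER);
    INSERT INTO products VALUES (1), (NULL);
""")

not_in_sql = """
    SELECT product_id FROM products
    WHERE product_id NOT IN (SELECT product_id FROM order_details)
"""

# Empty subquery: NOT IN is true for every outer row, including the NULL one.
empty_case = conn.execute(not_in_sql).fetchall()

# Non-empty subquery: NULL NOT IN (2) is unknown, so the NULL row disappears.
conn.execute("INSERT INTO order_details VALUES (2)")
nonempty_case = conn.execute(not_in_sql).fetchall()

print(empty_case, nonempty_case)
```

With the empty table both products rows come back; once order_details has any row, only product 1 remains.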
Also be aware that NOT IN is not equivalent to NOT EXISTS when it comes to null.
This post explains it very well
http://sqlinthewild.co.za/index.php/2010/02/18/not-exists-vs-not-in/
When the subquery returns even one null, NOT IN will not match any rows.
The reason for this can be found by looking at the details of what the NOT IN operation actually means.
Let’s say, for illustration purposes, that there are 4 rows in the table called t, and there’s a column called ID with values 1..4.
WHERE SomeValue NOT IN (SELECT AVal FROM t)
is equivalent to
WHERE SomeValue != (SELECT AVal FROM t WHERE ID=1)
AND SomeValue != (SELECT AVal FROM t WHERE ID=2)
AND SomeValue != (SELECT AVal FROM t WHERE ID=3)
AND SomeValue != (SELECT AVal FROM t WHERE ID=4)
Let’s further say that AVal is NULL where ID = 4. Hence that != comparison returns UNKNOWN. The logical truth table for AND states that UNKNOWN and TRUE is UNKNOWN, UNKNOWN and FALSE is FALSE. There is no value that can be AND’d with UNKNOWN to produce the result TRUE.
Hence, if any row of that subquery returns NULL, the entire NOT IN operator will evaluate to either FALSE or NULL and no records will be returned.
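The quoted explanation can be reproduced with a minimal sketch on an in-memory SQLite database. The table t with ID 1..4 and a NULL AVal at ID = 4 follows the quote; the concrete AVal numbers and the candidates table holding SomeValue are assumptions made for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (ID INTEGER, AVal INTEGER);
    INSERT INTO t VALUES (1, 10), (2, 20), (3, 30), (4, NULL);
    -- Hypothetical outer table: 10 is present in t.AVal, 99 is not.
    CREATE TABLE candidates (SomeValue INTEGER);
    INSERT INTO candidates VALUES (10), (99);
""")

# With a NULL in the subquery, NOT IN is FALSE for 10 and UNKNOWN for 99,
# so no row passes the WHERE clause.
with_null = conn.execute("""
    SELECT SomeValue FROM candidates
    WHERE SomeValue NOT IN (SELECT AVal FROM t)
""").fetchall()

# Filtering the NULL out of the subquery restores the expected answer.
without_null = conn.execute("""
    SELECT SomeValue FROM candidates
    WHERE SomeValue NOT IN (SELECT AVal FROM t WHERE AVal IS NOT NULL)
""").fetchall()

# NOT EXISTS never compares against the NULL row, so it needs no extra filter.
not_exists = conn.execute("""
    SELECT c.SomeValue FROM candidates c
    WHERE NOT EXISTS (SELECT 1 FROM t WHERE t.AVal = c.SomeValue)
""").fetchall()

print(with_null, without_null, not_exists)
```

The first query returns nothing, while the other two return the single value 99.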
If the execution planner says they're the same, they're the same. Use whichever one will make your intention more obvious -- in this case, the second.
Actually, I believe this would be the fastest:
SELECT ProductID, ProductName
FROM Northwind..Products p
LEFT OUTER JOIN Northwind..[Order Details] od ON p.ProductId = od.ProductId
WHERE od.ProductId is null
I have a table with about 120,000 records and need to select only those that do not exist (matched on a varchar column) in four other tables with approximately 1500, 4000, 40000 and 200 rows. All the tables involved have a unique index on the varchar column concerned.
NOT IN took about 10 mins, NOT EXISTS took 4 secs.
I have a recursive query that might have had some untuned section contributing to the 10 mins, but the other option taking 4 secs shows, at least to me, that NOT EXISTS is far better, or at least that IN and EXISTS are not exactly the same and it is always worth checking before going ahead with the code.
I was using
SELECT * from TABLE1 WHERE Col1 NOT IN (SELECT Col1 FROM TABLE2)
and found that it was giving wrong results (by wrong I mean no results), as there was a NULL in TABLE2.Col1.
Changing the query to
SELECT * from TABLE1 T1 WHERE NOT EXISTS (SELECT Col1 FROM TABLE2 T2 WHERE T1.Col1 = T2.Col1)
gave me the correct results.
Since then I have started using NOT EXISTS every where.
In your specific example they are the same, because the optimizer has figured out what you are trying to do is the same in both examples. But it is possible that in non-trivial examples the optimizer may not do this, and in that case there are reasons to prefer one to other on occasion.
NOT IN should be preferred if you are testing multiple rows in your outer select. The subquery inside the NOT IN statement can be evaluated at the beginning of the execution, and the temporary table can be checked against each value in the outer select, rather than re-running the subselect every time as would be required with the NOT EXISTS statement.
If the subquery must be correlated with the outer select, then NOT EXISTS may be preferable, since the optimizer may discover a simplification that prevents the creation of any temporary tables to perform the same function.
Database table model
Let’s assume we have the following two tables in our database, that form a one-to-many table relationship.
The student table is the parent, and the student_grade is the child table since it has a student_id Foreign Key column referencing the id Primary Key column in the student table.
The student table contains the following two records:
id first_name last_name admission_score
-- ---------- --------- ---------------
1  Alice      Smith     8.95
2  Bob        Johnson   8.75
And, the student_grade table stores the grades the students received:
id class_name grade student_id
-- ---------- ----- ----------
1  Math       10    1
2  Math       9.5   1
3  Math       9.75  1
4  Science    9.5   1
5  Science    9     1
6  Science    9.25  1
7  Math       8.5   2
8  Math       9.5   2
9  Math       9     2
10 Science    10    2
11 Science    9.4   2
SQL EXISTS
Let’s say we want to get all students that have received a 10 grade in Math class.
If we are only interested in the student identifier, then we can run a query like this one:
SELECT
student_grade.student_id
FROM
student_grade
WHERE
student_grade.grade = 10 AND
student_grade.class_name = 'Math'
ORDER BY
student_grade.student_id
But, the application is interested in displaying the full name of a student, not just the identifier, so we need info from the student table as well.
In order to filter the student records that have a 10 grade in Math, we can use the EXISTS SQL operator, like this:
SELECT
id, first_name, last_name
FROM
student
WHERE EXISTS (
SELECT 1
FROM
student_grade
WHERE
student_grade.student_id = student.id AND
student_grade.grade = 10 AND
student_grade.class_name = 'Math'
)
ORDER BY id
When running the query above, we can see that only the Alice row is selected:
id first_name last_name
-- ---------- ---------
1  Alice      Smith
The outer query selects the student row columns we are interested in returning to the client. However, the WHERE clause is using the EXISTS operator with an associated inner subquery.
The EXISTS operator returns true if the subquery returns at least one record and false if no row is selected. The database engine does not have to run the subquery entirely. If a single record is matched, the EXISTS operator returns true, and the associated outer query row is selected.
The inner subquery is correlated because the student_id column of the student_grade table is matched against the id column of the outer student table.
SQL NOT EXISTS
Let’s consider we want to select all students that have no grade lower than 9. For this, we can use NOT EXISTS, which negates the logic of the EXISTS operator.
Therefore, the NOT EXISTS operator returns true if the underlying subquery returns no record. However, if a single record is matched by the inner subquery, the NOT EXISTS operator will return false, and the subquery execution can be stopped.
To match all student records that have no associated student_grade with a value lower than 9, we can run the following SQL query:
SELECT
id, first_name, last_name
FROM
student
WHERE NOT EXISTS (
SELECT 1
FROM
student_grade
WHERE
student_grade.student_id = student.id AND
student_grade.grade < 9
)
ORDER BY id
When running the query above, we can see that only the Alice record is matched:
id first_name last_name
-- ---------- ---------
1  Alice      Smith
So, the advantage of using the SQL EXISTS and NOT EXISTS operators is that the inner subquery execution can be stopped as soon as a matching record is found.
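Both queries from this walkthrough can be run end to end with a short sketch. It loads the student and student_grade rows listed above into an in-memory SQLite database; the column types are an assumption, since the article does not state them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Column types are assumed; the data rows come from the tables above.
    CREATE TABLE student (id INTEGER PRIMARY KEY, first_name TEXT,
                          last_name TEXT, admission_score REAL);
    CREATE TABLE student_grade (id INTEGER PRIMARY KEY, class_name TEXT,
                                grade REAL,
                                student_id INTEGER REFERENCES student(id));
    INSERT INTO student VALUES
        (1, 'Alice', 'Smith', 8.95), (2, 'Bob', 'Johnson', 8.75);
    INSERT INTO student_grade VALUES
        (1, 'Math', 10, 1), (2, 'Math', 9.5, 1), (3, 'Math', 9.75, 1),
        (4, 'Science', 9.5, 1), (5, 'Science', 9, 1), (6, 'Science', 9.25, 1),
        (7, 'Math', 8.5, 2), (8, 'Math', 9.5, 2), (9, 'Math', 9, 2),
        (10, 'Science', 10, 2), (11, 'Science', 9.4, 2);
""")

# EXISTS: students with at least one grade of 10 in Math.
top_math = conn.execute("""
    SELECT id, first_name, last_name FROM student
    WHERE EXISTS (SELECT 1 FROM student_grade
                  WHERE student_grade.student_id = student.id
                    AND student_grade.grade = 10
                    AND student_grade.class_name = 'Math')
    ORDER BY id
""").fetchall()

# NOT EXISTS: students with no grade below 9 in any class.
no_low_grade = conn.execute("""
    SELECT id, first_name, last_name FROM student
    WHERE NOT EXISTS (SELECT 1 FROM student_grade
                      WHERE student_grade.student_id = student.id
                        AND student_grade.grade < 9)
    ORDER BY id
""").fetchall()

print(top_math, no_low_grade)
```

Both queries return only Alice: Bob's single 10 is in Science, not Math, and his 8.5 in Math excludes him from the second result.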
They are very similar but not really the same.
In terms of efficiency, I've found the left join is null statement more efficient (when an abundance of rows are to be selected that is)
If the optimizer says they are the same then consider the human factor. I prefer to see NOT EXISTS :)
It depends..
SELECT x.col
FROM big_table x
WHERE x.key IN( SELECT key FROM really_big_table );
would be relatively slow, because there isn't much to limit the size of what the query checks to see if the key is in. EXISTS would be preferable in this case.
But, depending on the DBMS's optimizer, this could be no different.
As an example of when EXISTS is better
SELECT x.col
FROM big_table x
WHERE EXISTS( SELECT key FROM really_big_table WHERE key = x.key)
AND id = very_limiting_criteria;