Why does outer reference in SQL subquery produce different results? - sql

I run two SQL queries: The first one have an outer reference to the table inside subquery. In the second one I add the same table inside subquery. The results are different, it fails due to multiple rows.
The first one runs on Oracle, but fails on Spark-SQL. Therefore I am looking for a solution similar to Oracle SQl as in the first SQL code.
Query 1:
select *,
(select N_CODE
from table2 f
where f.ID1 = (select min(f.ID1)
from table1 a left join table2 f on a.ID2 = f.ID2
where a.ID2 = table1.ID2
)
) AS CODE
from table1
Query 2:
select *,
(select N_CODE
from table1 t, table2 f
where f.ID1 = (select min(f.ID1)
from table1 a left join table2 f on a.ID2 = f.ID2
where a.ID2 = t.ID2
)
) AS CODE
from table1
The second one is my solution to the first one in Spark SQL, but it fails on both Oracle and Spark. How can I run the first query on Spark SQL similar to Oracle?
Please do not modify the structure of the query.

Oracle supports multiple inner queries but spark does not. The best way to overcome it is to split your super query into pieces and use join them.
For instance run this part and save it as a table3:
select min(table2 .ID1)
from table1 a left join table2 f on a.ID2 = f.ID2
where a.ID2 = t.ID2
from table2
Then use it for your main query:
....
where f.ID1 = table3

Related

BigQuery Full outer join producing "left join" results

I have 2 tables, both of which contain distinct id values. Some of the id values might occur in both tables and some are unique to each table. Table1 has 10,910 rows and Table2 has 11,304 rows
When running a left join query:
SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id
I get a total of 10,896 rows or 10,896 ids shared across both tables.
However, when I run a FULL OUTER JOIN on the 2 tables like this:
SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN EACH table2 b on a.id = b.id
I get total of 10,896 rows, but I was expecting all 10,910 rows from table1.
I am wondering if there is an issue with my query syntax.
As you are using EACH - it looks like you are running your queries in Legacy SQL mode.
In BigQuery Legacy SQL - COUNT(DISTINCT) function is probabilistic - gives statistical approximation and is not guaranteed to be exact.
You can use EXACT_COUNT_DISTINCT() function instead - this one gives you exact number but a little more expensive on back-end
Even better option - just use Standard SQL
For your specific query you will only need to remove EACH keyword and it should work as a charm
#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
JOIN table2 b on a.id = b.id
and
#standardSQL
SELECT COUNT(DISTINCT a.id)
FROM table1 a
FULL OUTER JOIN table2 b on a.id = b.id
I added the original query as a subquery and counted ids and produced the expected results. Still a little strange, but it works.
SELECT EXACT_COUNT_DISTINCT(a.id)
FROM
(SELECT a.id AS a.id,
b.id AS b.id
FROM table1 a FULL OUTER JOIN EACH table2 b on a.id = b.id))
It is because you count in both case the number of non-null lines for table a by using a count(distinct a.id).
Use a count(*) and it should works.
You will have to add coalesce... BigQuery, unlike traditional SQL does not recognize fields unless used explicitly
SELECT COUNT(DISTINCT coalesce(a.id,b.id))
FROM table1 a
FULL OUTER JOIN EACH table2 b on a.id = b.id
This query will now take full effect of full outer join :)

AND/OR statements in SQL JOIN

I have two tables: table1 and table2. I can join them using id1 or id2. I prefer to use id1, but as in some rows id1 is missing, so I have to use id2. Is the following syntax correct:
SELECT *
FROM table1 as a
LEFT JOIN table2 as b
ON (a.id1 is not null and a.id1 = b.id1) or
(a.id2 is not null and a.id2 = b.id2)
It returns some results but I want to be sure if it is valid as I haven't seen it used before.
Are there better ways to do this?
Looks like you have a decent answer in the comments, but to toss another possibility into the ring, you could run both queries and union them.
select *
from table1 as a
LEFT JOIN table2 as b
on a.id1 = b.id1
union
select *
from table1 as a
LEFT JOIN table2 as b
on a.id2 = b.id2
The union will eliminate any duplicates between the sets, and will return records where either condition is true, much like your or statement. Performance wise, the union is probably a little slower, but gives you easier control over the sets. For instance if you only want set 2 to return results when id1 is null, just add it to the where clause. Anyway hope that helps.

Use Temporary Tables or Nested Select in retrieving data from multiple table?

I've got a query similar to this below where data are retrieve from multiple tables.. The problem is if this table is to retrieve multiple data... the process would definitely would it be better or more efficient to use nested select or temp table to optimize my select statement... and how should I be grouping my joins...
Select a.Name,
b.type,
c.color,
d.group,
e.location
f.quantity
g.cost
from Table1 a
INNER JOIN Table2 b ON a.ID=b.ItemCode
INNER JOIN TABLE3 c ON b.ItemCOde = c.groupID
INNER JOIN TABLE4 d ON c.groupID = d.batchID
LEFT JOIN TABLE5 e ON d.batchID = e.PostalID
LEFT JOIN TABLE6 f ON e.PostalID = f.CountID
LEFT JOIN TABLE7 g ON f.CountID = g.InventoryNo
The order of join could be important: start with the most selective table(s) and continue with least selective table(s).
Nested queries vs. temp table: it's old dilemma and there is no "magic" solution. In some cases temp table can improve performance. The truth is: every query is different story. Try with both solution and analyze query execution plan.
This might work..!!!
Select a.Name,b.type,c.color,d.group,e.locationf.quantity,g.cost
from Table1 a,Table2 b,TABLE3 c,TABLE4 d,TABLE5 e,TABLE6 f,TABLE7 g
where a.ID=b.ItemCode,b.ItemCOde = c.groupID,c.groupID = d.batchID,
d.batchID = e.PostalID,
e.PostalID = f.CountID,f.CountID = g.InventoryNo;

How to select records from a Table that has a certain number of rows in a related table in SQL Server?

Not quite sure how to ask this, but I have 2 tables that are related in a 1 to many relationship, I need to select all records in the "1" table that have less than three records in the "many' table.
select b.foreignkey,count(b.foreignkey) as bidcount
from b
where b.foreignkey in (select a.id from a) and bidcount< 3
group by b.foreignkey
this doesn't work at all I know but I am at a loss how to do this.
I need to in the end select all the records from the "a" table based on this criteria. Sorry if that is confusing!
Just using your code, not tested:
SELECT
b.foreignkey,
count(b.foreignkey) as bidcount
FROM
b
WHERE
b.foreignkey IN (SELECT a.id FROM a)
GROUP BY
b.foreignkey
HAVING
count(b.foreignkey) < 3
Try this:
SELECT t1.id,COUNT(t2.parentId)
FROM table1 as t1
INNER JOIN table2 as t2
ON t1.id = t2.parentId
GROUP BY t1.id
HAVING COUNT(t2.parentId) < 3
You didn't mention which version of SQL Server you're using - if you're on SQL Server 2005 or newer, you could use this CTE (Common Table Expression):
;WITH ChildRows AS
(
SELECT A.Id, COUNT(b.Id) AS 'BCount'
FROM
dbo.TableA A
INNER JOIN
dbo.TableB B ON B.TableAId = A.Id
)
SELECT A.*, R.BCount
FROM dbo.TableA A
INNER JOIN ChildRows R ON A.Id = R.Id
The inner SELECT lists the Id columns from TableA and the count of the child rows associated with those (using the INNER JOIN to TableB) - and the outer SELECT just builds on top of that result set and shows all fields from table A (and the count from the B table)
if you want to return all fields of your (1) table in one query, I suggest you consider using CROSS APPLY:
SELECT t1.* FROM table_1 t1
CROSS APPLY (SELECT COUNT(*) cnt FROM Table_Many t2 WHERE t2.fk = t1.pk) a
where a.cnt < 3
in some particular cases, based on your indices and db structure, this query may run 4 times faster than the GROUP BY method
you have posted this question in sql server, I have a answer in oracle database system (don't know whether it will run in sql server as well or not)
this is as follow-
select [desired column list] from
(select b.*, count(*) over (partition by b.foreignkey) c_1
from b
where b.foreignkey in (select a.id from a) )
where c_1 < 3 ;
i hope it should work on sql server as well...
if not please let me update ..

SQL style question: INNER JOIN in FROM clause or WHERE clause?

If you are going to join multiple tables in a SQL query, where do you think is a better place to put the join statement: in the FROM clause or the WHERE clause?
If you are going to do it in the FROM clause, how do you format it so that it is clear and readable? (I'm talking about indents, newlines, whitespace in general.)
Are there any advantages/disadvantages to each?
I tend to use the FROM clause, or rather the JOIN clause itself, indenting like this (and using aliases):
SELECT t1.field1, t2.field2, t3.field3
FROM table1 t1
INNER JOIN table2 t2
ON t1.id1 = t2.id1
INNER JOIN table3 t3
ON t1.id1 = t3.id3
This keeps the join condition close to where the join is made. I find it easier to understand this way then trying to look through the WHERE clause to figure out what exactly is joined how.
When making OUTER JOINs (ANSI-89 or ANSI-92), filtration location matters because criteria specified in the ON clause is applied before the JOIN is made. Criteria against an OUTER JOINed table provided in the WHERE clause is applied after the JOIN is made. This can produce very different result sets.
In comparison, it doesn't matter for INNER JOINs if the criteria is provided in the ON or WHERE clauses -- the result will be the same. That said, I strive to keep the WHERE clause clean -- anything related to JOINed tables will be in their respective ON clause. Saves hunting through the WHERE clause, which is why ANSI-92 syntax is more readable.
I prefer the FROM clause if for no other reason that it distinguishes between filtering results (from a Cartesian product) merely between foreign key relationships and between a logical restriction. For example:
SELECT * FROM Products P JOIN ProductPricing PP ON P.Id = PP.ProductId
WHERE PP.Price > 10
As opposed to
SELECT * FROM Products P, ProductPricing PP
WHERE P.Id = PP.ProductID AND Price > 10
I can look at the first one and instantly know that the only logical restriction I'm placing is the price, as opposed to the implicit machinery of joining tables together on the relationship key.
I almost always use the ANSI 92 joins because it makes it clear that these conditions are for JOINING.
Typically I write it this way
FROM
foo f
INNER JOIN bar b
ON f.id = b.id
sometimes I write it this way when it trivial
FROM
foo f
INNER JOIN bar b ON f.id = b.id
INNER JOIN baz b2 ON b.id = b2.id
When its not trivial I do the first way
e.g.
FROM
foo f
INNER JOIN bar b
ON f.id = b.id
and b.type = 1
or
FROM
foo f
INNER JOIN (
SELECT max(date) date, id
FROM foo
GROUP BY
id) lastF
ON f.id = lastF.id
and f.date = lastF.Date
Or really the weird (not sure if I got the parens correctly but its supposed to be an LEFT join to table bar but bar needs an inner join to baz)
FROM
foo f
LEFT JOIN (bar b
INNER JOIN baz b2
ON b.id = b2.id
)ON f.id = b.id
You should put joins in Join clauses which means the From clause. A different question could be had about where to put filtering statements.
With respect to indenting, there are many styles. My preference is to indent related joins and keep main clauses like Select, From, Where, Group By, Having and Order By indented at the same level. In addition, I put each of these main attributes and the first line of an On clause on its own line.
Select ..
From Table1
Join Table2
On Table2.FK = Table1.PK
And Table2.OtherCol = '12345'
And Table2.OtherCol2 = 9876
Left Join (Table3
Join Table4
On Table4.FK = Table3.PK)
On Table3.FK = Table2.PK
Where ...
Group By ...
Having ...
Order By ...
Use the FROM clause to be compliant with ANSI-92 standards.
This:
select *
from a
inner join b
on a.id = b.id
where a.SomeColumn = 'x'
Not this:
select *
from a, b
where a.id = b.id
and a.SomeColumn = 'x'
I definitely always do my JOINS (of whatever type) in my FROM clause.
The way I indent them is this:
SELECT fields
FROM table1 t1
INNER JOIN table2 t2 ON t1.id = t2.t1_id
INNER JOIN table3 t3 ON t1.id = t3.t1_id
AND
t2.id = t3.t2_id
In fact, I'll generally go a step farther and move as much of my constraining logic from the WHERE clause to the FROM clause, because this (at least in MS SQL) front-loads the constraint, meaning that it reduces the size of the recordset sooner in the query construction (I've seen documentation that contradicts this, but my execution plans are invariably more efficient when I do it this way).
For example, if I wanted to only select things in the above query where t3.id = 3, you could but that in the WHERE clause, or you could do it this way:
SELECT fields
FROM table1 t1
INNER JOIN table2 t2 ON t1.id = t2.t1_id
INNER JOIN table3 t3 ON t1.id = t3.t1_id
AND
t2.id = t3.t2_id
AND
t3.id = 3
I personally find queries laid out in this way to be very readable and maintainable, but this is certainly a matter of personal preference, so YMMV.
Regardless, I hope this helps.
ANSI joins. I omit any optional keywords from the SQL as they only add noise to the equation. There's no such thing as a left inner join, is there? And by default, a simple join is an inner join, so there's no particular point to saying 'inner join'.
Then I column align things as much as possible.
The point being that a large complex SQL query can be very difficult to comprehend, so the more order that is imposed on it to make it more readable, the better. Any body looking at the query to fix, modify or tune it, needs to be able to answer a few things off right off the bat:
what tables/views are involved in the query?
what are the criteria for each join? What's the cardinality of each join?
what/how many columns are returned by the query
I like to write my queries so they look something like this:
select PatientID = rpt.ipatientid ,
EventDate = d.dEvent ,
Side = d.cSide ,
OutsideHistoryDate = convert(nchar, d.devent,112) ,
Outcome = p.cOvrClass ,
ProcedureType = cat.ctype ,
ProcedureCategoryMajor = cat.cmajor ,
ProcedureCategoryMinor = cat.cminor
from dbo.procrpt rpt
join dbo.procd d on d.iprocrptid = rpt.iprocrptid
join dbo.proclu lu on lu.iprocluid = d.iprocluid
join dbo.pathlgy p on p.iProcID = d.iprocid
left join dbo.proccat cat on cat.iproccatid = lu.iproccatid
where procrpt.ipatientid = #iPatientID