Hive - Multiple sub-queries in where clause is failing

I am trying to create a table using two sub-query expressions in the WHERE clause, but the query fails with the following error:
Unsupported sub query expression. Only 1 sub query expression is supported
The code is as follows (not the exact code, just a simplified version for illustration):
Create table winners row format delimited fields terminated by '|' as
select
games,
players
from olympics
where
exists (select 1 from dom_sports where dom_sports.players = olympics.players)
and not exists (select 1 from dom_sports where dom_sports.games = olympics.games)
If I run the same statement with only one sub-query in the WHERE clause, it executes successfully. Is there an alternative way to achieve the same result?

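Yes, you can use joins instead.
An inner join acts as the EXISTS, and a left join plus an IS NULL filter in the WHERE clause mimics the NOT EXISTS.
The joins can multiply rows when dom_sports has several matches for a row (hence the DISTINCT), so whether the result has the right granularity depends on your data.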
select distinct
olympics.games,
olympics.players
from olympics
inner join dom_sports dom_sports on dom_sports.players = olympics.players
left join dom_sports dom_sports2 on dom_sports2.games = olympics.games
where dom_sports2.games is null
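To check the rewrite on a small dataset, the sketch below runs both forms against an in-memory SQLite database through Python's sqlite3 module. SQLite only stands in for Hive here (an assumption made so the example is runnable; SQLite accepts both sub-queries in one WHERE clause), and the sample rows are invented for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE olympics (games TEXT, players TEXT);
    CREATE TABLE dom_sports (games TEXT, players TEXT);
    -- Hypothetical sample rows.
    INSERT INTO olympics VALUES
        ('archery', 'ann'),   -- player matches, game does not: kept
        ('rowing',  'bob'),   -- player matches, game matches too: dropped
        ('fencing', 'cat');   -- player does not match: dropped
    INSERT INTO dom_sports VALUES
        ('rowing', 'ann'),
        ('rowing', 'ann'),    -- duplicate row, to show why DISTINCT is needed
        ('judo',   'bob');
""")

# Original form: EXISTS and NOT EXISTS in the same WHERE clause.
exists_rows = conn.execute("""
    SELECT games, players
    FROM olympics
    WHERE EXISTS (SELECT 1 FROM dom_sports
                  WHERE dom_sports.players = olympics.players)
      AND NOT EXISTS (SELECT 1 FROM dom_sports
                      WHERE dom_sports.games = olympics.games)
    ORDER BY games, players
""").fetchall()

# Join form: inner join for EXISTS, left join + IS NULL for NOT EXISTS.
join_rows = conn.execute("""
    SELECT DISTINCT olympics.games, olympics.players
    FROM olympics
    INNER JOIN dom_sports ON dom_sports.players = olympics.players
    LEFT JOIN dom_sports AS dom_sports2 ON dom_sports2.games = olympics.games
    WHERE dom_sports2.games IS NULL
    ORDER BY olympics.games, olympics.players
""").fetchall()

print(exists_rows)
print(join_rows)
```

On this data both queries return the same single row. Without the DISTINCT, the join form would return that row twice because of the duplicate dom_sports row; conversely, DISTINCT would also collapse genuine duplicates in olympics, which the EXISTS form keeps.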

Related

How did this old SQL query work without a join in the subquery

Here is the T-SQL. The code has been around for years and was handed to me to migrate to another SQL Server instance. It apparently works, but I don't know why. The execution plan doesn't show any predicates being used, so how does it know which rows to exclude? If I run the subquery on its own, I get 1146 rows with the value 1.
SELECT EM.PERSON_ID
FROM EMP_BEN_ELECTS EBE, EMPLOYEE_MAP EM
WHERE EBE.BW_ID = EM.BW_ID
AND CHANGE_BENEFIT_EVENT_DATE IS NULL
AND OPTION_ID <> 'WAIVE'
AND NOT EXISTS (SELECT 1 FROM EMPLOYEE_BILLING WHERE BILLING_GROUPING_ID
IN('HWMONTHLY','HWINDIVIDUALBILLED') AND END_DATE IS NULL)
I plan to rewrite it without the subquery and use a left join instead, but it boggled me that this works. The only time I have seen code written like this, without the join being qualified, was in code coming from an Oracle developer.
The NOT EXISTS subquery is not used to return any of the 1146 rows.
It only checks whether at least 1 row exists in the table EMPLOYEE_BILLING that satisfies the specified conditions:
BILLING_GROUPING_ID IN('HWMONTHLY','HWINDIVIDUALBILLED') AND END_DATE IS NULL
If there is such a row, NOT EXISTS returns FALSE. Because all the conditions in the WHERE clause of the main query are combined with AND, the whole WHERE clause evaluates to FALSE and the query returns no rows. The subquery does not reference the outer query at all, which is why there is no predicate tying the two together.
Don't rewrite the query with a LEFT join.
EXISTS and NOT EXISTS usually perform as well as or better than the equivalent joins.
What you should change, though, is the archaic join syntax with the comma.
Change it to a proper INNER join with an ON clause:
SELECT EM.PERSON_ID
FROM EMP_BEN_ELECTS EBE INNER JOIN EMPLOYEE_MAP EM
ON EBE.BW_ID = EM.BW_ID
WHERE CHANGE_BENEFIT_EVENT_DATE IS NULL
AND OPTION_ID <> 'WAIVE'
AND NOT EXISTS (
SELECT 1
FROM EMPLOYEE_BILLING
WHERE BILLING_GROUPING_ID IN('HWMONTHLY','HWINDIVIDUALBILLED')
AND END_DATE IS NULL
)
Also, you should qualify all the column names with the name or alias of the table they belong to (I left CHANGE_BENEFIT_EVENT_DATE and OPTION_ID unqualified because I don't know which alias to use).
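The all-or-nothing behavior of an uncorrelated NOT EXISTS can be seen in a small sketch. It uses an in-memory SQLite database through Python's sqlite3 module rather than SQL Server, and the tables are cut down to a few hypothetical columns and rows, so treat it as an illustration of the semantics only.

```python
import sqlite3

# Simplified, hypothetical versions of the tables (SQLite, not SQL Server).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE employee_map (bw_id INTEGER, person_id INTEGER);
    CREATE TABLE employee_billing (billing_grouping_id TEXT, end_date TEXT);
    INSERT INTO employee_map VALUES (1, 100), (2, 200);
""")

# The subquery never mentions em, so it is evaluated independently of
# the outer row: it is either true for every row or false for every row.
QUERY = """
    SELECT em.person_id
    FROM employee_map AS em
    WHERE NOT EXISTS (
        SELECT 1
        FROM employee_billing AS eb
        WHERE eb.billing_grouping_id IN ('HWMONTHLY', 'HWINDIVIDUALBILLED')
          AND eb.end_date IS NULL
    )
    ORDER BY em.person_id
"""

# No matching billing row yet: NOT EXISTS is true, every row is returned.
before = conn.execute(QUERY).fetchall()

# One matching billing row: NOT EXISTS is false, no row is returned.
conn.execute("INSERT INTO employee_billing VALUES ('HWMONTHLY', NULL)")
after = conn.execute(QUERY).fetchall()

print(before)
print(after)
```

A single open-ended billing row is enough to empty the whole result, which is also why a LEFT JOIN on some key would not be an equivalent rewrite of this particular query.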

where clause conditions in SQL

Below is a join expressed through the WHERE clause:
SELECT a.* FROM TEST_TABLE1 a,TEST_TABLE2 b,TEST_TABLE3 c
WHERE a.COL11 = b.COL11
AND b.COL12 = c.COL12
AND a.COL3 = c.COL13;
I have been learning SQL from online resources and am trying to convert it to use explicit joins.
Two issues:
The original query is confusing. The outer joins (with the (+) suffix) are made irrelevant by the last where condition. Because of that condition, the query should only return records where there is an actual matching c record. So the original query is the same as if there were no such (+) suffixes.
Your query joins TEST_TABLE3 twice, while the first query only joins it once, and there are two conditions that determine how it is joined there. You should not split those conditions over two separate joins.
BTW, it is surprising that the SQL Fiddle site does not show an error, as it makes no sense to use the same alias twice. See for example how MySQL returns the error with the same query on dbfiddle (on all available versions of MySQL):
Not unique table/alias: 'C'
So to get the same result using the standard join notation, all joins should be inner joins:
SELECT *
FROM TEST_TABLE1 A
INNER JOIN TEST_TABLE2 B
ON A.COL11 = B.COL11
INNER JOIN TEST_TABLE3 C
ON B.COL12 = C.COL12
AND A.COL3 = C.COL13;
@tricot correctly pointed out that it is strange to have 2 aliases with the same name without getting an error. To answer your question:
In the first query, listing all 3 table names in the FROM clause conceptually performs a cross join between them. The conditions in the WHERE clause then filter the rows produced by that cross join.
In the second query, you need to join TEST_TABLE3 only once. Since all the required aliases A, B and C are then available, as in the first query, you can specify both conditions on the last join, as below:
SELECT A.* FROM TEST_TABLE1 A
LEFT JOIN TEST_TABLE2 B
ON A.COL11 = B.COL11
LEFT JOIN TEST_TABLE3 C
ON B.COL12 = C.COL12 AND A.COL3 = C.COL13;
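The relationship between the comma-style query and the explicit join forms can be checked with a small sketch. It runs on an in-memory SQLite database through Python's sqlite3 module; the column types and sample rows are assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TEST_TABLE1 (COL11 INTEGER, COL3 TEXT);
    CREATE TABLE TEST_TABLE2 (COL11 INTEGER, COL12 INTEGER);
    CREATE TABLE TEST_TABLE3 (COL12 INTEGER, COL13 TEXT);
    -- Hypothetical rows: row 1 matches all the way through,
    -- row 2 matches TEST_TABLE2 but fails the COL3 = COL13 condition.
    INSERT INTO TEST_TABLE1 VALUES (1, 'x'), (2, 'y');
    INSERT INTO TEST_TABLE2 VALUES (1, 10), (2, 20);
    INSERT INTO TEST_TABLE3 VALUES (10, 'x'), (20, 'z');
""")

# Comma-separated tables filtered in the WHERE clause.
comma_rows = conn.execute("""
    SELECT a.* FROM TEST_TABLE1 a, TEST_TABLE2 b, TEST_TABLE3 c
    WHERE a.COL11 = b.COL11
      AND b.COL12 = c.COL12
      AND a.COL3 = c.COL13
    ORDER BY a.COL11
""").fetchall()

# Same conditions moved into ON clauses of inner joins.
inner_rows = conn.execute("""
    SELECT a.* FROM TEST_TABLE1 a
    INNER JOIN TEST_TABLE2 b ON a.COL11 = b.COL11
    INNER JOIN TEST_TABLE3 c ON b.COL12 = c.COL12 AND a.COL3 = c.COL13
    ORDER BY a.COL11
""").fetchall()

# Same ON clauses, but with left joins: unmatched rows of a are kept.
left_rows = conn.execute("""
    SELECT a.* FROM TEST_TABLE1 a
    LEFT JOIN TEST_TABLE2 b ON a.COL11 = b.COL11
    LEFT JOIN TEST_TABLE3 c ON b.COL12 = c.COL12 AND a.COL3 = c.COL13
    ORDER BY a.COL11
""").fetchall()

print(comma_rows, inner_rows, left_rows)
```

On this data the comma form and the inner-join form agree, while the left-join form also keeps the row of TEST_TABLE1 that has no matching TEST_TABLE3 row. Which of the two behaviors is wanted depends on whether unmatched rows should appear in the result.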

BigQuery : WITH clause behavior in multiple JOIN conditions

For readability, I have defined an "org_location_ext" WITH clause in the query as follows.
This "org_location_ext" is first used to join with the main fact-table "LOCATION_SALES".
It is used in other JOIN conditions as well.
According to the BigQuery documentation : https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#with_clause
The WITH clause contains one or more named subqueries which execute
every time a subsequent SELECT statement references them
I want to know the behavior in this case.
Does this query execute the "org_location_ext" WITH clause multiple times?
Or, when the SELECT query is executed, is a temporary table created for "org_location_ext" and then used for all the JOINs?
In other words, after the first JOIN with the fact table, do later joins reuse that "filtered" result, or do they rerun the WITH clause?
WITH org_location_ext AS (
SELECT *
FROM ORG_LOC_MASTER AS loc_master
JOIN LOC_REGN1 as regn1 ON loc_master.id = regn1.id
JOIN ...
JOIN ...
)
SELECT
..
org_location_ext.store_class,
org_location_ext.country,
org_location_ext.
..
..
FROM LOCATION_SALES AS sales
JOIN org_location_ext ON org_location_ext.area_id = sales.area_id AND org_location_ext.date = sales.date
JOIN ....
JOIN ....
JOIN COUNTRY_VAT AS vat ON vat.key1 =TBL_Y.key1 AND vat.country_code = org_location_ext.country_code
It depends on the query plan. Check the plan for your query: it shows how many times each table is accessed.

RedShift SQL subquery with Inner join

I am using AWS Redshift SQL. I want to inner join a sub-query that itself contains a GROUP BY and an inner join. When I join it in the outer query, I get an error that a column does not exist.
Query:
SELECT si.package_weight
FROM "packageproduct" ub "clearpathpin" cp ON ub.cpipr_number = cp.pin_number
INNER JOIN "clearpathpin" cp ON ub.cpipr_number = cp.pin_number
INNER JOIN (
SELECT sf."AWB", SUM(up."weight") AS package_weight
FROM "productweight" up ON up."product_id" = sf."item_id"
GROUP BY sf."AWB"
HAVING sf."AWB" IS NOT NULL
) AS si ON si.item_id = ub.order_item_id
LIMIT 100;
Result:
ERROR: column si.item_id does not exist
This is simply because the column si.item_id does not exist.
Include item_id in the SELECT list of the sub-query on the table productweight
and it should work.
There are many things wrong with this query.
In your subquery, you have an ON clause, but no JOIN for it to belong to:
FROM "productweight" up ON up."product_id" = sf."item_id"
When you join the results of this subquery, you are referencing a field that does not exist within the subquery:
SELECT sf."AWB", SUM(up."weight") AS package_weight
...
) AS si ON si.item_id = ub.order_item_id
You should think of the subquery as creating a new, separate, short-lived table. The outer query then joins that temporary table to the rest of the query. So anything not explicitly selected in the subquery is not available to the outer query.
I would recommend that, when developing, you write and run the subquery on its own first. Only after it returns the results you expect (no errors, appropriate columns, etc.) should you copy/paste it in as a subquery and start developing the main query.
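The point about a derived table exposing only the columns it selects can be reproduced in a small sketch. It uses an in-memory SQLite database through Python's sqlite3 module rather than Redshift, and the tables are reduced to hypothetical minimal versions (the sf table from the question is left out), so the exact error text will differ from the Redshift message.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Hypothetical, cut-down versions of the tables in the question.
    CREATE TABLE packageproduct (order_item_id INTEGER);
    CREATE TABLE productweight (product_id INTEGER, weight REAL);
    INSERT INTO packageproduct VALUES (1), (2);
    INSERT INTO productweight VALUES (1, 0.5), (1, 1.5), (2, 3.0);
""")

# The derived table selects only package_weight, so si.product_id
# is not visible to the outer ON clause.
BROKEN = """
    SELECT si.package_weight
    FROM packageproduct AS ub
    INNER JOIN (
        SELECT SUM(up.weight) AS package_weight
        FROM productweight AS up
        GROUP BY up.product_id
    ) AS si ON si.product_id = ub.order_item_id
"""

# Selecting the grouping column makes it available to the outer query.
FIXED = """
    SELECT ub.order_item_id, si.package_weight
    FROM packageproduct AS ub
    INNER JOIN (
        SELECT up.product_id, SUM(up.weight) AS package_weight
        FROM productweight AS up
        GROUP BY up.product_id
    ) AS si ON si.product_id = ub.order_item_id
    ORDER BY ub.order_item_id
"""

try:
    conn.execute(BROKEN)
    error = None
except sqlite3.OperationalError as exc:
    error = str(exc)

rows = conn.execute(FIXED).fetchall()

print(error)
print(rows)
```

Running the inner SELECT of FIXED on its own first, as suggested above, shows exactly which columns the outer query will be able to reference.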

SQL subquery multiple times error

I am writing a subquery but I am getting a strange error:
The column 'RealEstateID' was specified multiple times for 'NotSold'.
Here is my code:
SELECT *
FROM
(SELECT *
FROM RealEstatesInfo AS REI
LEFT JOIN Purchases AS P
ON P.RealEstateID=REI.RealEstateID
WHERE DateBought IS NULL) AS NotSold
INNER JOIN OwnerEstate AS OE
ON OE.RealEstateID=NotSold.RealEstateID
It's on SQL Server, by the way.
That's because there will be 2 RealEstateID columns in your subquery. You need to change it to explicitly list the columns from both tables and include only 1 RealEstateID. It doesn't matter which one, since you only use it for your join.
If you're feeling lazy, you can select REI.* and then name only the P columns other than RealEstateID.
By the way, SELECT * is probably never a good idea in sub-queries, derived tables, or CTEs.
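A minimal sketch of the corrected form, with an explicit column list in the derived table, is shown below. It runs on an in-memory SQLite database through Python's sqlite3 module, so it only demonstrates the fixed query and does not reproduce the SQL Server error. The Address and OwnerID columns and all sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Address and OwnerID are hypothetical columns added for illustration.
    CREATE TABLE RealEstatesInfo (RealEstateID INTEGER, Address TEXT);
    CREATE TABLE Purchases (RealEstateID INTEGER, DateBought TEXT);
    CREATE TABLE OwnerEstate (RealEstateID INTEGER, OwnerID INTEGER);
    INSERT INTO RealEstatesInfo VALUES (1, 'A St'), (2, 'B St');
    INSERT INTO Purchases VALUES (1, '2020-01-01');
    INSERT INTO OwnerEstate VALUES (1, 10), (2, 20);
""")

# The derived table lists its columns explicitly and exposes
# RealEstateID only once, taken from REI.
not_sold = conn.execute("""
    SELECT NotSold.RealEstateID, NotSold.Address, OE.OwnerID
    FROM (
        SELECT REI.RealEstateID, REI.Address, P.DateBought
        FROM RealEstatesInfo AS REI
        LEFT JOIN Purchases AS P
            ON P.RealEstateID = REI.RealEstateID
        WHERE P.DateBought IS NULL
    ) AS NotSold
    INNER JOIN OwnerEstate AS OE
        ON OE.RealEstateID = NotSold.RealEstateID
    ORDER BY NotSold.RealEstateID
""").fetchall()

print(not_sold)
```

Only the estate without a purchase row is returned, and every column the outer query uses has a single, unambiguous name.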