I have the following query, which has an inner query with an inequality condition that depends on the outer query. It looks like Hive does not support referring to the outer query from the inner query in an inequality condition. How can I write this query in Hive?
SELECT
*
FROM
A
WHERE NOT EXISTS
(
SELECT *
FROM
B
WHERE
B.cust_id = A.cust_id
AND datediff(A.year_month, B.year_month) < 365 * 3
)
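The feature you are using is not supported in Hive, but the query can be rewritten with a LEFT JOIN. Note that this matches the NOT EXISTS version only when each cust_id has at most one row in B; with several rows in B, a row of A is returned once for every B row that is at least three years older: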
SELECT *
FROM A
LEFT JOIN B ON B.cust_id = A.cust_id
WHERE (datediff(A.year_month, B.year_month) >= 365 * 3) --note >= here
OR B.cust_id is NULL --is not joined
Will Hive accept this query?
SELECT ab.*
FROM (SELECT a.*, b.min_year_month
FROM a JOIN
(SELECT b.cust_id, MIN(b.year_month) as min_year_month
FROM b
GROUP BY b.cust_id
) b
ON a.cust_id = b.cust_id
) ab
WHERE datediff(ab.year_month, ab.min_year_month) < 365 * 3;
I think that gets the logic right. This returns all records from A where all records in B are in the past three years.
The key idea is to use aggregation and a subquery to get the data necessary.
If you also need the matching records from B, use another JOIN to get them.
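To reproduce the NOT EXISTS semantics of the question with the same aggregate-and-join idea, one option is to compare against the latest row in B per customer and keep customers with no row in B at all. The sketch below only checks that logic on made-up data. It runs on SQLite through Python's sqlite3 module, not on Hive, and the julianday() difference is an assumed stand-in for Hive's datediff().

```python
import sqlite3

# Hypothetical sample data; SQLite stands in for Hive only to check the logic.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (cust_id INTEGER, year_month TEXT);
    CREATE TABLE b (cust_id INTEGER, year_month TEXT);
    INSERT INTO a VALUES (1, '2020-01-01'), (2, '2020-01-01'), (3, '2020-01-01');
    -- customer 1: one old and one recent row in b; customer 2: only an old row
    INSERT INTO b VALUES (1, '2010-01-01'), (1, '2019-06-01'), (2, '2010-01-01');
""")

# Reference result: the correlated NOT EXISTS from the question.
not_exists = conn.execute("""
    SELECT a.cust_id FROM a
    WHERE NOT EXISTS (
        SELECT 1 FROM b
        WHERE b.cust_id = a.cust_id
          AND julianday(a.year_month) - julianday(b.year_month) < 365 * 3)
    ORDER BY a.cust_id
""").fetchall()

# Rewrite: aggregate b first, then join on equality only.
# The latest b row per customer yields the smallest date difference, so
# testing MAX(year_month) covers every b row of that customer.
aggregated = conn.execute("""
    SELECT a.cust_id
    FROM a
    LEFT JOIN (SELECT cust_id, MAX(year_month) AS max_year_month
               FROM b GROUP BY cust_id) m
           ON m.cust_id = a.cust_id
    WHERE m.cust_id IS NULL
       OR julianday(a.year_month) - julianday(m.max_year_month) >= 365 * 3
    ORDER BY a.cust_id
""").fetchall()

print(not_exists, aggregated)
```

Both queries return customers 2 and 3 on this data. Because B is reduced to one row per customer before the join, rows of A are never duplicated.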
I am trying to do a full outer join between two tables, Table1 and Table2, on ID with a query like the following in Teradata. The problem is that it acts like an inner join.
SELECT *
FROM Table1 AS a
FULL OUTER JOIN Table2 AS b
ON a.ID = b.ID
WHERE a.country in ('US','FR')
AND a.create_date = '2021-01-01'
AND b.country IN ('US','DE','BE')
AND b.create_date = '2021-01-01';
What I want is something like this:
SELECT * FROM
(
SELECT * FROM Table1 as a
WHERE a.country in ('US','FR')
AND a.create_date = '2021-01-01'
) as ax
FULL OUTER JOIN
(
SELECT * FROM Table2 as b
WHERE b.country IN ('US','DE','BE')
AND b.create_date = '2021-01-01'
) as bx
ON ax.ID=bx.ID;
I feel like the second query is not best practice, maybe inefficient and/or hard to read in complicated cases. How can I modify the first query to get the desired output?
I know that this is a fundamental problem and probably there are many other ways to do it (e.g. with USING, HAVING etc) but could not find a basic explanation. Would appreciate a comprehensive answer on alternative solutions as a guide for future reference.
EDIT
The difference in my question to Left Join With Where Clause is that I require a condition in both tables. I cannot figure out where to put the second WHERE condition.
The short answer: Both sets of predicates belong in the ON clause.
SELECT *
FROM Table1 AS a
FULL OUTER JOIN Table2 AS b
ON a.ID = b.ID
AND a.country in ('US','FR')
AND a.create_date = '2021-01-01'
AND b.country IN ('US','DE','BE')
AND b.create_date = '2021-01-01';
The ON clause both limits the rows that are eligible to participate in the join (pre-join filtering) and specifies how to match rows (join criteria). The WHERE clause filters results (after the join).
A generally less-desirable alternative would be to modify the predicates so as not to filter out the non-matching rows, e.g. assuming ID is NOT NULL in both tables
SELECT *
FROM Table1 AS a
FULL OUTER JOIN Table2 AS b
ON a.ID = b.ID
WHERE (a.country in ('US','FR')
AND a.create_date = '2021-01-01'
OR a.ID IS NULL)
AND (b.country IN ('US','DE','BE')
AND b.create_date = '2021-01-01'
OR b.ID IS NULL);
Logically, ON and WHERE are evaluated in the same order for INNER JOIN, but in that case the net result is the same wherever the filter goes (and many databases, including Teradata, will generate the same query plan for INNER JOIN regardless of where you put the filter predicates).
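The ON versus WHERE difference is easy to check on a tiny example. This is a sketch on SQLite through Python's sqlite3 module, not on Teradata, with made-up rows. It uses LEFT JOIN instead of FULL OUTER JOIN only because older SQLite builds do not support FULL OUTER JOIN; the effect on the NULL-extended rows is the same.

```python
import sqlite3

# Hypothetical two-row tables, just enough to show the effect.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (id INTEGER);
    CREATE TABLE table2 (id INTEGER, country TEXT);
    INSERT INTO table1 VALUES (1), (2);
    INSERT INTO table2 VALUES (1, 'US'), (2, 'JP');
""")

# Filter in WHERE: id 2 joins to the 'JP' row, which then fails the
# predicate, so the outer join behaves like an inner join.
in_where = conn.execute("""
    SELECT a.id, b.id FROM table1 a
    LEFT JOIN table2 b ON a.id = b.id
    WHERE b.country IN ('US', 'DE', 'BE')
    ORDER BY a.id
""").fetchall()

# Same filter in ON: id 2 finds no eligible partner but is still preserved,
# with NULL in the table2 columns.
in_on = conn.execute("""
    SELECT a.id, b.id FROM table1 a
    LEFT JOIN table2 b ON a.id = b.id AND b.country IN ('US', 'DE', 'BE')
    ORDER BY a.id
""").fetchall()

print(in_where)
print(in_on)
```

One caveat worth testing on your own data: a filter on a preserved side that is placed in ON does not remove that side's rows. Rows failing it come back NULL-extended, whereas the derived-table form from the question drops them before the join.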
There are two tables, A and B. Table A has a one-to-many relationship with B.
I want to fetch records from A together with the single corresponding record from B (if B has only one record).
If there are multiple records in table B, the one with status = 'ACTIVE' should be picked first.
Below is the query, which runs in Oracle, but we want the same functionality in AWS Athena. However, correlated queries are not supported in AWS Athena SQL. Athena supports ANSI SQL.
SELECT b.*
FROM A a ,B b
WHERE a.instruction_id = b.txn_report_instruction_id AND b.txn_report_instruction_id IN
(SELECT b2.txn_report_instruction_id FROM B b2
WHERE b2.txn_report_instruction_id=b.txn_report_instruction_id
GROUP BY b2.txn_report_instruction_id
HAVING COUNT(b2.txn_report_instruction_id)=1
)
UNION
SELECT * FROM
(SELECT b.*
FROM A a , B b
WHERE a.instruction_id = b.txn_report_instruction_id AND b.txn_report_instruction_id IN
(SELECT b2.txn_report_instruction_id
FROM B b2
WHERE b2.txn_report_instruction_id=b.txn_report_instruction_id
AND b2.status ='ACTIVE'
GROUP BY b2.txn_report_instruction_id
HAVING COUNT(b2.txn_report_instruction_id)> 1
)
)
With GROUP BY, every selected field has to appear either in the GROUP BY clause or inside an aggregate function, so GROUP BY is not preferable here.
Any help would be much appreciated.
[Image: output result table]
Joining the best row can be achieved with a lateral join.
select *
from a
outer apply
(
select *
from b
where b.txn_report_instruction_id = a.instruction_id
order by case when b.status = 'ACTIVE' then 1 else 2 end
fetch first row only
) bb;
Another option is a window function:
select *
from a
left join
(
select
b.*,
row_number() over (partition by txn_report_instruction_id
order by case when status = 'ACTIVE' then 1 else 2 end) as rn
from b
) bb on bb.txn_report_instruction_id = a.instruction_id and bb.rn = 1;
I don't know about Amazon Athena's SQL coverage. This is all standard SQL, however, except for OUTER APPLY, I think. If I am not mistaken, the SQL standard requires LEFT OUTER JOIN LATERAL (...) ON ... instead, for which you need a dummy ON clause, such as ON 1 = 1. So if the above queries fail, there is another option for you :-)
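The window-function variant can be tried out on a few rows. The sketch below runs it on SQLite (3.25 or newer, for window function support) through Python's sqlite3 module, not on Athena. The id column and all sample rows are made up for illustration.

```python
import sqlite3

# Hypothetical data: instruction 1 has one row in b, instruction 2 has two
# (one of them ACTIVE), instruction 3 has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (instruction_id INTEGER);
    CREATE TABLE b (id INTEGER, txn_report_instruction_id INTEGER, status TEXT);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (10, 1, 'INACTIVE'), (11, 2, 'INACTIVE'), (12, 2, 'ACTIVE');
""")

# Number the b rows per instruction, ACTIVE rows first, and join only rn = 1.
rows = conn.execute("""
    SELECT a.instruction_id, bb.id, bb.status
    FROM a
    LEFT JOIN (
        SELECT b.*,
               row_number() OVER (
                   PARTITION BY txn_report_instruction_id
                   ORDER BY CASE WHEN status = 'ACTIVE' THEN 1 ELSE 2 END
               ) AS rn
        FROM b
    ) bb ON bb.txn_report_instruction_id = a.instruction_id AND bb.rn = 1
    ORDER BY a.instruction_id
""").fetchall()

print(rows)
```

Instruction 1 gets its only row, instruction 2 gets the ACTIVE row, and instruction 3 is kept with NULLs because of the LEFT JOIN. If several rows tie within a partition, add a second ORDER BY key to make the pick deterministic.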
JOIN
SELECT *
FROM a
INNER JOIN (
SELECT b.id, Count(*) AS Count
FROM b
GROUP BY b.id ) AS b ON b.id = a.id;
LATERAL
SELECT *
FROM a,
LATERAL (
SELECT Count(*) AS Count
FROM b
WHERE a.id = b.id ) AS b;
As I understand it, the JOIN subquery is computed once and then merged with the main query, whereas the LATERAL subquery is executed once for each row of a.
It seems to me that the JOIN is more efficient when it collapses many rows of b into one row per id, while LATERAL is more efficient when the relationship is 1 to 1. Is that right?
If I understand you right you are asking which of the two statements is more efficient.
You can test that yourself using EXPLAIN (ANALYZE), and I guess that the answer depends on the data:
If there are few rows in a, the LATERAL join will probably be more efficient if there is an index on b(id).
If there are many rows in a, the first query will probably be more efficient, because it can use a hash or merge join.
I've got 2 tables in BigQuery that I'd like to join. Table 1 has integers, and table 2 has non-overlapping integer ranges (start, end). I'd like to join table 1 and 2 to give me something like this:
-- table 1
value
1
4
9
10
-- table 2
start, end
0,5
6,9
10,15
-- joined
value,start,end
1,0,5
4,0,5
9,6,9
10,10,15
I thought this query would work:
SELECT *
FROM
[table1] a
INNER JOIN [table2] b
ON a.value BETWEEN b.start AND b.end
But that gives me this error
ON clause must be AND of = comparisons of one field name from each
table, with all field names prefixed with table name
I can get the correct result with this CROSS JOIN query:
SELECT *
FROM
[table1] a
CROSS JOIN [table2] b
WHERE a.value BETWEEN b.start AND b.end
But the docs say this should be avoided if possible:
CROSS JOIN operations do not allow ON clauses. CROSS JOIN can return a
large amount of data and might result in a slow and inefficient query
or in a query that exceeds the maximum allowed per-query resources.
Such queries will fail with an error. When possible, prefer queries
that do not use CROSS JOIN
So, is it possible to do an INNER JOIN with a between, or improve the CROSS JOIN some other way?
This is a limitation for BigQuery Legacy SQL.
You should use BigQuery Standard SQL instead:
#standardSQL
SELECT *
FROM
`table1` a
INNER JOIN `table2` b
ON a.value BETWEEN b.start AND b.end
In standard SQL, you should use backticks instead of brackets.
Also keep in mind that end is a reserved keyword, so to make the above work you need to enclose it in backticks as well.
See below (along with dummy data from your question):
#standardSQL
WITH table1 AS (
SELECT value
FROM UNNEST([1, 4, 9, 10]) AS value
),
table2 AS (
SELECT chunk.start, chunk.`end`
FROM UNNEST([STRUCT<start INT64, `end` INT64>(0,5),(6,9),(10,15)]) AS chunk
)
SELECT *
FROM `table1` a
INNER JOIN `table2` b
ON a.value BETWEEN b.start AND b.`end`
-- ORDER BY value
Not quite sure how to ask this, but I have 2 tables that are related in a 1-to-many relationship. I need to select all records in the "1" table that have fewer than three records in the "many" table.
select b.foreignkey,count(b.foreignkey) as bidcount
from b
where b.foreignkey in (select a.id from a) and bidcount< 3
group by b.foreignkey
I know this doesn't work at all, but I am at a loss as to how to do this.
In the end I need to select all the records from the "a" table based on this criterion. Sorry if that is confusing!
Just using your code, not tested:
SELECT
b.foreignkey,
count(b.foreignkey) as bidcount
FROM
b
WHERE
b.foreignkey IN (SELECT a.id FROM a)
GROUP BY
b.foreignkey
HAVING
count(b.foreignkey) < 3
Try this:
SELECT t1.id,COUNT(t2.parentId)
FROM table1 as t1
INNER JOIN table2 as t2
ON t1.id = t2.parentId
GROUP BY t1.id
HAVING COUNT(t2.parentId) < 3
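One thing to watch: with an INNER JOIN (or the IN filter on b), parents that have no child rows at all never show up, although zero is also fewer than three. A LEFT JOIN with COUNT on the child column includes them. The sketch below compares both on made-up rows, using SQLite through Python's sqlite3 module instead of SQL Server.

```python
import sqlite3

# Hypothetical data: parent 1 has three children, parent 2 has two,
# parent 3 has none.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER);
    CREATE TABLE b (foreignkey INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1), (1), (1), (2), (2);
""")

# INNER JOIN: parents without children are dropped before counting.
inner_rows = conn.execute("""
    SELECT a.id, COUNT(b.foreignkey)
    FROM a INNER JOIN b ON b.foreignkey = a.id
    GROUP BY a.id
    HAVING COUNT(b.foreignkey) < 3
    ORDER BY a.id
""").fetchall()

# LEFT JOIN: COUNT(b.foreignkey) ignores the NULL of the unmatched row,
# so a parent without children is reported with a count of 0.
left_rows = conn.execute("""
    SELECT a.id, COUNT(b.foreignkey)
    FROM a LEFT JOIN b ON b.foreignkey = a.id
    GROUP BY a.id
    HAVING COUNT(b.foreignkey) < 3
    ORDER BY a.id
""").fetchall()

print(inner_rows)
print(left_rows)
```

Whether parents with zero children should be returned depends on what "fewer than three" is meant to cover, so pick the join type accordingly.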
You didn't mention which version of SQL Server you're using - if you're on SQL Server 2005 or newer, you could use this CTE (Common Table Expression):
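;WITH ChildRows AS
(
SELECT A.Id, COUNT(B.Id) AS 'BCount'
FROM
dbo.TableA A
INNER JOIN
dbo.TableB B ON B.TableAId = A.Id
GROUP BY A.Id -- required: A.Id is selected next to an aggregate
)
SELECT A.*, R.BCount
FROM dbo.TableA A
INNER JOIN ChildRows R ON A.Id = R.Id
WHERE R.BCount < 3 -- keep only rows with fewer than three child rows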
The inner SELECT lists the Id columns from TableA and the count of the child rows associated with those (using the INNER JOIN to TableB) - and the outer SELECT just builds on top of that result set and shows all fields from table A (and the count from the B table)
If you want to return all fields of your "1" table in one query, I suggest you consider using CROSS APPLY:
SELECT t1.* FROM table_1 t1
CROSS APPLY (SELECT COUNT(*) cnt FROM Table_Many t2 WHERE t2.fk = t1.pk) a
where a.cnt < 3
In some particular cases, depending on your indexes and database structure, this query may run 4 times faster than the GROUP BY method.
You have posted this question under SQL Server; my answer is for Oracle (I don't know whether it will run on SQL Server as well).
It is as follows:
select [desired column list] from
(select b.*, count(*) over (partition by b.foreignkey) c_1
from b
where b.foreignkey in (select a.id from a) )
where c_1 < 3 ;
I hope it works on SQL Server as well.
If not, please let me know.