How to do an inequality join in Apache Drill? - sql

I am trying to run a query in Drill that requires inequality joins (such as ‘on a.event_time >= b.event_time and a.event_time < b.next_event_time’). I am getting the error that Drill does not support inequality joins, and that is also what I am reading online.
Are there any workarounds in Drill that give the same results without using an inequality join? All I can think of is expanding one of my tables to include duplicate rows for every iteration of the field I am trying to join on, but I am guessing there is a more straightforward way Drill users get around this.

I guess you are trying something like this:
SELECT *
FROM Table1
JOIN Table2
ON Table1.time > Table2.time
Can you try this instead?
SELECT *
FROM Table1, Table2
WHERE Table1.time > Table2.time
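To see what the comma-join-plus-WHERE form returns for the range condition from the question, here is a small sketch. It uses Python's built-in sqlite3 as a stand-in, not Drill, and the tables a and b with their rows are made up for illustration; whether Drill accepts the same query depends on your version and planner settings, which this does not check.

```python
import sqlite3

# In-memory SQLite database as a stand-in engine (assumption: not Drill).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE a (id INTEGER, event_time INTEGER);
CREATE TABLE b (label TEXT, event_time INTEGER, next_event_time INTEGER);
INSERT INTO a VALUES (1, 5), (2, 10), (3, 25);
INSERT INTO b VALUES ('first', 0, 10), ('second', 10, 20);
""")

# Cross join, then keep only pairs where a.event_time falls in b's interval.
range_matches = conn.execute("""
SELECT a.id, b.label
FROM a, b
WHERE a.event_time >= b.event_time
  AND a.event_time < b.next_event_time
ORDER BY a.id
""").fetchall()

print(range_matches)
```

Row 3 (event_time 25) falls in no interval, so it drops out, just as it would with an inner join on the same condition.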

This is hacky but I was able to get it to work by duplicating and bundling the logic of the join in the "WHERE" clause, then adding an OR to the opposite of the join.
So for example if you want to do
SELECT * FROM
ORDERS as Ord
LEFT JOIN Customers as Cus
ON Cus.CustomerID = Ord.CustomerID
AND Cus.CustomerType <> 'Employee'
You can do this:
SELECT * FROM
ORDERS as Ord
LEFT JOIN Customers as Cus
ON Cus.CustomerID = Ord.CustomerID
WHERE ((Cus.CustomerID = Ord.CustomerID
AND Cus.CustomerType <> 'Employee') OR (Cus.CustomerID <> Ord.CustomerID))
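One caveat worth checking before relying on this: under standard NULL semantics the WHERE version can drop rows that the ON version keeps. The sketch below uses Python's sqlite3 as a stand-in (not Drill) with made-up rows, and compares the original LEFT JOIN, the WHERE rewrite above, and a third variant that pre-filters Customers in a subquery so the join condition stays a plain equality. The third variant is my suggestion, not something from the answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER);
CREATE TABLE Customers (CustomerID INTEGER, CustomerType TEXT);
INSERT INTO Orders VALUES (1, 10), (2, 20), (3, 30);
INSERT INTO Customers VALUES (10, 'Regular'), (20, 'Employee');
""")

# Original form: extra condition inside the ON clause.
on_filter = conn.execute("""
SELECT Ord.OrderID, Cus.CustomerID
FROM Orders AS Ord
LEFT JOIN Customers AS Cus
  ON Cus.CustomerID = Ord.CustomerID
 AND Cus.CustomerType <> 'Employee'
ORDER BY Ord.OrderID
""").fetchall()

# The WHERE rewrite from the answer above.
where_rewrite = conn.execute("""
SELECT Ord.OrderID, Cus.CustomerID
FROM Orders AS Ord
LEFT JOIN Customers AS Cus
  ON Cus.CustomerID = Ord.CustomerID
WHERE ((Cus.CustomerID = Ord.CustomerID
        AND Cus.CustomerType <> 'Employee')
       OR (Cus.CustomerID <> Ord.CustomerID))
ORDER BY Ord.OrderID
""").fetchall()

# Alternative: filter Customers first, then join on equality only.
prefiltered = conn.execute("""
SELECT Ord.OrderID, Cus.CustomerID
FROM Orders AS Ord
LEFT JOIN (SELECT * FROM Customers
           WHERE CustomerType <> 'Employee') AS Cus
  ON Cus.CustomerID = Ord.CustomerID
ORDER BY Ord.OrderID
""").fetchall()

print(on_filter)
print(where_rewrite)
print(prefiltered)
```

In SQLite the WHERE rewrite returns only order 1, because the comparisons against a NULL customer are neither true nor false, while the other two forms keep all three orders.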

Related

Is it true that all joins following a left join in a SQL query must also be left joins? Why or why not?

I remember this rule of thumb from back in college that if you put a left join in a SQL query, then all subsequent joins in that query must also be left joins instead of inner joins, or else you'll get unexpected results. But I don't remember what those results are, so I'm wondering if maybe I'm misremembering something. Anyone able to back me up on this or refute it? Thanks! :)
For instance:
select * from customer
left join ledger on customer.id= ledger.customerid
inner join order on ledger.orderid = order.id -- this inner join might be bad mojo
Not that they have to be. They should be (or perhaps a full join at the end). It is a safer way to write queries and express logic.
Your query is:
select *
from customer c left join
ledger l
on c.id = l.customerid inner join
order o
on l.orderid = o.id
The left join says, "keep all customers, even if there is no matching record in ledger". The inner join says, "I have to have a matching ledger record". So the inner join effectively converts the first join to an inner join.
Because you presumably want all customers, regardless of whether there is a match in the other two tables, you would use a left join:
select *
from customer c left join
ledger l
on c.id = l.customerid left join
order o
on l.orderid = o.id
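Here is a small sketch of the difference, using Python's sqlite3 with made-up rows. The table is named orders rather than order in the sketch, since order is a reserved word; that rename is mine.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customer (id INTEGER);
CREATE TABLE ledger (customerid INTEGER, orderid INTEGER);
CREATE TABLE orders (id INTEGER);
INSERT INTO customer VALUES (1), (2), (3);
INSERT INTO ledger VALUES (1, 100), (2, 200);  -- customer 3 has no ledger row
INSERT INTO orders VALUES (100);               -- order 200 does not exist
""")

# LEFT JOIN followed by INNER JOIN: unmatched customers are lost.
left_then_inner = conn.execute("""
SELECT c.id, l.orderid, o.id
FROM customer c
LEFT JOIN ledger l ON c.id = l.customerid
INNER JOIN orders o ON l.orderid = o.id
ORDER BY c.id
""").fetchall()

# LEFT JOIN followed by LEFT JOIN: every customer survives.
left_then_left = conn.execute("""
SELECT c.id, l.orderid, o.id
FROM customer c
LEFT JOIN ledger l ON c.id = l.customerid
LEFT JOIN orders o ON l.orderid = o.id
ORDER BY c.id
""").fetchall()

print(left_then_inner)
print(left_then_left)
```

Only customer 1 comes back from the first query; the second returns all three customers, with NULLs where there is no ledger or order row.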
You remember some parts of it correctly!
The thing is, when you chain joins like this:
select * from customer
left join ledger on customer.id= ledger.customerid
inner join order on ledger.orderid = order.id
The joins are evaluated in sequence, so when customer left join ledger happens, every row from customer is kept (because it's a left join and customer is on the left).
Next,
the result of that first join is joined with order using an inner join, which requires ledger.orderid to match a row in order, so you end up only with records that also have a match in the order table.
Bad mojo? It really depends on what you are trying to accomplish.
If you want to guarantee that all records from customer are returned, you should keep left joining to it.
You can, however, make this a little more intuitive to understand (not necessarily a better way of writing SQL!) by writing:
SELECT * FROM
(
SELECT c.id AS customerid, l.orderid
FROM customer c
LEFT JOIN ledger l
ON c.id = l.customerid
) c_and_l
INNER JOIN -- or LEFT JOIN, if customers without orders should be kept
order o
ON c_and_l.orderid = o.id
So now you can see that c_and_l is built first and then joined to order (you can think of it as two tables being joined again).

SQLite GROUP_CONCAT from another table, multiple joins

Having trouble with my sql query. Not an SQL expert by any means.
SELECT
transactions.*,
categories.*,
GROUP_CONCAT(tags.tagName) as concatTags
FROM transactions
INNER JOIN categories
ON transactions.category = categories.categoryId
LEFT JOIN TransactionTagRelation AS ttr
ON transactions.transactionId = ttr.transactionId
LEFT JOIN tags
ON tags.tagId = ttr.tagId;
(There's also a where and group by, but didn't think it was relevant to the question).
I'm trying to get:
transactionId1, ...otherStuff..., "tagId1,tagId2,tagId3"
transactionId2, ...otherStuff..., "tagId1,tagId3"
What I have now seems to merge the tags into one transaction or something. I tried adding a GROUP BY transactionID at the end, but it gives a syntax error for some reason. I have a feeling my joins are incorrect, but I wasn't able to get anything better.
Do something like this:
SELECT t.*, c.*,
(SELECT GROUP_CONCAT(tg.tagName)
FROM TransactionTagRelation ttr JOIN
Tags tg
ON tg.tagId = ttr.tagId
WHERE t.transactionId = ttr.transactionId
) as concatTags
FROM transactions t JOIN
categories c
ON t.category = c.categoryId;
This eliminates the GROUP BY in the outer query and allows you to use t.* and c.* in the SELECT.
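A runnable sketch of the correlated-subquery approach, using Python's sqlite3. The column layout and sample rows are guesses based on the question, not the asker's real schema. SQLite does not promise any particular order for the concatenated values, so the sketch compares them as sets.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE categories (categoryId INTEGER, categoryName TEXT);
CREATE TABLE transactions (transactionId INTEGER, category INTEGER);
CREATE TABLE tags (tagId INTEGER, tagName TEXT);
CREATE TABLE TransactionTagRelation (transactionId INTEGER, tagId INTEGER);
INSERT INTO categories VALUES (1, 'food');
INSERT INTO transactions VALUES (1, 1), (2, 1), (3, 1);
INSERT INTO tags VALUES (1, 'lunch'), (2, 'work'), (3, 'cash');
INSERT INTO TransactionTagRelation VALUES (1, 1), (1, 2), (1, 3), (2, 1), (2, 3);
""")

# One output row per transaction; the subquery gathers that row's tags.
rows = conn.execute("""
SELECT t.transactionId,
       (SELECT GROUP_CONCAT(tg.tagName)
        FROM TransactionTagRelation ttr
        JOIN tags tg ON tg.tagId = ttr.tagId
        WHERE ttr.transactionId = t.transactionId) AS concatTags
FROM transactions t
JOIN categories c ON t.category = c.categoryId
ORDER BY t.transactionId
""").fetchall()

# GROUP_CONCAT yields NULL (None) when a transaction has no tags.
tags_by_transaction = {
    tid: (set(tags.split(",")) if tags else set())
    for tid, tags in rows
}
print(tags_by_transaction)
```

Transaction 3 has no tags and still appears, with a NULL concatTags value.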

Restricting inner query with outer query attribute

I currently have a large SQL query (not mine) which I need to modify. I have a transaction and valuation table. The transaction has a one-to-many relationship with valuations. The two tables are being joined via a foreign key.
I've been asked to prevent any transactions (along with their subsequent valuations) from being returned if no valuations for a transaction exist past a certain date. The way I thought I would achieve this would be to use an inner query, but I need to make the inner query aware of the outer query and the transaction. So something like:
SELECT * FROM TRANSACTION_TABLE T
INNER JOIN VALUATION_TABLE V ON T.VAL_FK = V.ID
WHERE (SELECT COUNT(*) FROM V WHERE V.DATE > <GIVEN DATE>) > 1
Obviously the above wouldn't work as the inner query is separate and I can't reference the outer query V reference from the inner. How would I go about doing this, or is there a simpler way?
This wouldn't just be a case of adding WHERE V.DATE > <GIVEN DATE> to the outer query, as I want to return every valuation for a given transaction if ANY of them exceed the specified date, not just the ones that do.
Many thanks for any help you can offer.
You may be looking for one of these:
SELECT *
FROM TRANSACTION_TABLE T
INNER JOIN VALUATION_TABLE V1 ON T.VAL_FK = V1.ID
WHERE (SELECT COUNT(*)
FROM VALUATION_TABLE V2
WHERE V2.ID = V1.ID AND V2.DATE > <GIVEN DATE>) > 0
or:
SELECT *
FROM TRANSACTION_TABLE T
INNER JOIN VALUATION_TABLE V ON T.VAL_FK = V.ID
WHERE V.ID IN ( SELECT ID
FROM VALUATION_TABLE
WHERE DATE > <GIVEN DATE>
)
If execution time is important, you may want to test the various solutions on your actual data and see which works best in your situation.
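Another way to express "keep a transaction and all of its valuations only if at least one valuation is past the date" is a correlated EXISTS. This is a different technique from the COUNT and IN versions above. The sketch uses Python's sqlite3 and a hypothetical schema in which each valuation row carries a transaction_id column and an ISO-formatted val_date; both names are assumptions for illustration, not the asker's real columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE transaction_table (id INTEGER);
CREATE TABLE valuation_table (id INTEGER, transaction_id INTEGER, val_date TEXT);
INSERT INTO transaction_table VALUES (1), (2);
INSERT INTO valuation_table VALUES
  (11, 1, '2020-01-01'),
  (12, 1, '2021-06-01'),
  (21, 2, '2019-05-05');
""")

given_date = "2021-01-01"  # ISO dates compare correctly as text

# Keep all valuations of a transaction if ANY of them is after given_date.
kept_rows = conn.execute("""
SELECT t.id, v.id, v.val_date
FROM transaction_table t
JOIN valuation_table v ON v.transaction_id = t.id
WHERE EXISTS (SELECT 1
              FROM valuation_table v2
              WHERE v2.transaction_id = t.id
                AND v2.val_date > ?)
ORDER BY v.id
""", (given_date,)).fetchall()

print(kept_rows)
```

Transaction 1 keeps both valuations, including the one before the date, while transaction 2 disappears entirely.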

SQL Performance: Which is better, a WHERE clause or a JOIN ... ON condition? [duplicate]

Is there any difference (performance, best-practice, etc...) between putting a condition in the JOIN clause vs. the WHERE clause?
For example...
-- Condition in JOIN
SELECT *
FROM dbo.Customers AS CUS
INNER JOIN dbo.Orders AS ORD
ON CUS.CustomerID = ORD.CustomerID
AND CUS.FirstName = 'John'
-- Condition in WHERE
SELECT *
FROM dbo.Customers AS CUS
INNER JOIN dbo.Orders AS ORD
ON CUS.CustomerID = ORD.CustomerID
WHERE CUS.FirstName = 'John'
Which do you prefer (and perhaps why)?
Relational algebra allows the predicates in the WHERE clause and the INNER JOIN to be interchanged, so even for INNER JOIN queries with WHERE clauses the optimizer can rearrange the predicates so that rows are already excluded during the JOIN process.
I recommend you write the queries in the most readable way possible.
Sometimes this includes making the INNER JOIN relatively "incomplete" and putting some of the criteria in the WHERE simply to make the lists of filtering criteria more easily maintainable.
For example, instead of:
SELECT *
FROM Customers c
INNER JOIN CustomerAccounts ca
ON ca.CustomerID = c.CustomerID
AND c.State = 'NY'
INNER JOIN Accounts a
ON ca.AccountID = a.AccountID
AND a.Status = 1
Write:
SELECT *
FROM Customers c
INNER JOIN CustomerAccounts ca
ON ca.CustomerID = c.CustomerID
INNER JOIN Accounts a
ON ca.AccountID = a.AccountID
WHERE c.State = 'NY'
AND a.Status = 1
But it depends, of course.
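For inner joins the two placements return the same rows. Here is a quick check with Python's sqlite3 and made-up data; it only compares results and says nothing about how a given optimizer plans the two forms.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER, FirstName TEXT);
CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER);
INSERT INTO Customers VALUES (1, 'John'), (2, 'Jane');
INSERT INTO Orders VALUES (100, 1), (101, 1), (102, 2);
""")

# Filter written as part of the join condition.
condition_in_join = conn.execute("""
SELECT CUS.CustomerID, ORD.OrderID
FROM Customers AS CUS
INNER JOIN Orders AS ORD
  ON CUS.CustomerID = ORD.CustomerID
 AND CUS.FirstName = 'John'
ORDER BY ORD.OrderID
""").fetchall()

# Same filter written in the WHERE clause.
condition_in_where = conn.execute("""
SELECT CUS.CustomerID, ORD.OrderID
FROM Customers AS CUS
INNER JOIN Orders AS ORD
  ON CUS.CustomerID = ORD.CustomerID
WHERE CUS.FirstName = 'John'
ORDER BY ORD.OrderID
""").fetchall()

print(condition_in_join == condition_in_where, condition_in_join)
```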
For inner joins I have not really noticed a difference (but as with all performance tuning, you need to check against your database under your conditions).
However where you put the condition makes a huge difference if you are using left or right joins. For instance consider these two queries:
SELECT *
FROM dbo.Customers AS CUS
LEFT JOIN dbo.Orders AS ORD
ON CUS.CustomerID = ORD.CustomerID
WHERE ORD.OrderDate >'20090515'
SELECT *
FROM dbo.Customers AS CUS
LEFT JOIN dbo.Orders AS ORD
ON CUS.CustomerID = ORD.CustomerID
AND ORD.OrderDate >'20090515'
The first will give you only those records that have an order dated later than May 15, 2009 thus converting the left join to an inner join.
The second will give those records plus any customers with no orders after that date (including customers with no orders at all). The result set is very different depending on where you put the condition. (Select * is for example purposes only, of course you should not use this in production code.)
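The two LEFT JOIN queries can be compared on a tiny made-up data set with Python's sqlite3. Dates are stored as yyyymmdd text here so that plain string comparison works; that storage choice is an assumption for the sketch.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER);
CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER, OrderDate TEXT);
INSERT INTO Customers VALUES (1), (2), (3);
INSERT INTO Orders VALUES (100, 1, '20090601'), (101, 2, '20090101');
""")

# Date condition in WHERE: rows with a NULL OrderDate are filtered out.
date_in_where = conn.execute("""
SELECT CUS.CustomerID, ORD.OrderID
FROM Customers AS CUS
LEFT JOIN Orders AS ORD
  ON CUS.CustomerID = ORD.CustomerID
WHERE ORD.OrderDate > '20090515'
ORDER BY CUS.CustomerID
""").fetchall()

# Date condition in ON: every customer is kept, orders are optional.
date_in_on = conn.execute("""
SELECT CUS.CustomerID, ORD.OrderID
FROM Customers AS CUS
LEFT JOIN Orders AS ORD
  ON CUS.CustomerID = ORD.CustomerID
 AND ORD.OrderDate > '20090515'
ORDER BY CUS.CustomerID
""").fetchall()

print(date_in_where)
print(date_in_on)
```

Customer 2 has an order, but it is too old, and customer 3 has none; both appear with a NULL order only in the second result.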
The exception to this is when you want to see only the records in one table but not the other. Then you use the where clause for the condition not the join.
SELECT *
FROM dbo.Customers AS CUS
LEFT JOIN dbo.Orders AS ORD
ON CUS.CustomerID = ORD.CustomerID
WHERE ORD.OrderID is null
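A minimal sketch of that "in one table but not the other" pattern, again with Python's sqlite3 and made-up rows:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER);
CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER);
INSERT INTO Customers VALUES (1), (2), (3);
INSERT INTO Orders VALUES (100, 1), (101, 2);
""")

# LEFT JOIN, then keep only the customers for whom no order row matched.
customers_without_orders = conn.execute("""
SELECT CUS.CustomerID
FROM Customers AS CUS
LEFT JOIN Orders AS ORD
  ON CUS.CustomerID = ORD.CustomerID
WHERE ORD.OrderID IS NULL
ORDER BY CUS.CustomerID
""").fetchall()

print(customers_without_orders)
```

Here the IS NULL test has to live in the WHERE clause, because it checks the outcome of the join rather than restricting which rows may match.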
Most RDBMS products will optimize both queries identically. In "SQL Performance Tuning" by Peter Gulutzan and Trudy Pelzer, they tested multiple brands of RDBMS and found no performance difference.
I prefer to keep join conditions separate from query restriction conditions.
If you're using OUTER JOIN sometimes it's necessary to put conditions in the join clause.
WHERE will filter after the JOIN has occurred.
Filter on the JOIN to prevent rows from being added during the JOIN process.
I prefer the JOIN to join full tables/Views and then use the WHERE To introduce the predicate of the resulting set.
It feels syntactically cleaner.
I typically see performance increases when filtering on the join. Especially if you can join on indexed columns for both tables. You should be able to cut down on logical reads with most queries doing this too, which is, in a high volume environment, a much better performance indicator than execution time.
I'm always mildly amused when someone shows their SQL benchmarking and they've executed both versions of a sproc 50,000 times at midnight on the dev server and compare the average times.
I agree with the second-highest-voted answer that it makes a big difference when using LEFT JOIN or RIGHT JOIN. In fact, the two statements below are equivalent. So you can see that the AND clause filters before the JOIN, while a WHERE clause filters after the JOIN.
SELECT *
FROM dbo.Customers AS CUS
LEFT JOIN dbo.Orders AS ORD
ON CUS.CustomerID = ORD.CustomerID
AND ORD.OrderDate >'20090515'
SELECT *
FROM dbo.Customers AS CUS
LEFT JOIN (SELECT * FROM dbo.Orders WHERE OrderDate >'20090515') AS ORD
ON CUS.CustomerID = ORD.CustomerID
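The claimed equivalence is easy to spot-check. This sketch runs both statements through Python's sqlite3 on made-up rows (dates stored as yyyymmdd text, an assumption for the sketch) and compares the results; agreement on one data set is a sanity check, not a proof.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (CustomerID INTEGER);
CREATE TABLE Orders (OrderID INTEGER, CustomerID INTEGER, OrderDate TEXT);
INSERT INTO Customers VALUES (1), (2);
INSERT INTO Orders VALUES (100, 1, '20090601'), (101, 2, '20090101');
""")

# Filter as an extra AND in the ON clause.
filter_in_on = conn.execute("""
SELECT CUS.CustomerID, ORD.OrderID
FROM Customers AS CUS
LEFT JOIN Orders AS ORD
  ON CUS.CustomerID = ORD.CustomerID
 AND ORD.OrderDate > '20090515'
ORDER BY CUS.CustomerID
""").fetchall()

# Filter applied to Orders before the join, via a subquery.
filter_in_subquery = conn.execute("""
SELECT CUS.CustomerID, ORD.OrderID
FROM Customers AS CUS
LEFT JOIN (SELECT * FROM Orders
           WHERE OrderDate > '20090515') AS ORD
  ON CUS.CustomerID = ORD.CustomerID
ORDER BY CUS.CustomerID
""").fetchall()

print(filter_in_on == filter_in_subquery, filter_in_on)
```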
Joins are quicker in my opinion when you have a larger table. It really isn't that much of a difference though, especially if you are dealing with a fairly small table. When I first learned about joins, I was told that conditions in joins are just like where clause conditions and that I could use them interchangeably if the where clause was specific about which table to do the condition on.
Putting the condition in the join seems "semantically wrong" to me, as that's not what JOINs are "for". But that's very qualitative.
Additional problem: if you decide to switch from an inner join to, say, a right join, having the condition be inside the JOIN could lead to unexpected results.
It is better to add the condition in the Join. Performance is more important than readability. For large datasets, it matters.
