How to implement the MINUS operator in Google BigQuery

I am trying to implement a MINUS operation in Google BigQuery, but there appears to be no documentation for it in the Query Reference. Can somebody share their thoughts on this? I have done it in regular SQL in the past, but I am not sure whether Google offers it in BigQuery. Your input is appreciated. Thank you.

Just adding an update here since this post still comes up in Google Search. BigQuery now supports the EXCEPT set operator.
https://cloud.google.com/bigquery/docs/reference/standard-sql/query-syntax#except
select * from t1
EXCEPT DISTINCT
select * from t2;
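To see what EXCEPT DISTINCT returns, here is a minimal sketch that runs the same kind of query against an in-memory SQLite database from Python. SQLite is only a stand-in for BigQuery here (an assumption made so the example runs locally); SQLite spells the operator as plain EXCEPT, which already removes duplicates. The column names and the helper `rows_only_in_t1` are hypothetical.

```python
import sqlite3


def rows_only_in_t1(t1_rows, t2_rows):
    """Return the distinct rows of t1 that do not appear in t2."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical two-column schema, used only for this illustration.
    conn.execute("CREATE TABLE t1 (col1 INTEGER, col2 TEXT)")
    conn.execute("CREATE TABLE t2 (col1 INTEGER, col2 TEXT)")
    conn.executemany("INSERT INTO t1 VALUES (?, ?)", t1_rows)
    conn.executemany("INSERT INTO t2 VALUES (?, ?)", t2_rows)
    # Plain EXCEPT in SQLite deduplicates its output, which corresponds to
    # EXCEPT DISTINCT in the BigQuery query above.
    result = conn.execute(
        "SELECT * FROM (SELECT * FROM t1 EXCEPT SELECT * FROM t2) "
        "ORDER BY col1, col2"
    ).fetchall()
    conn.close()
    return result


if __name__ == "__main__":
    t1 = [(1, "a"), (1, "a"), (2, "b"), (3, None)]
    t2 = [(2, "b"), (3, None)]
    print(rows_only_in_t1(t1, t2))
```

In this sketch the duplicated row (1, 'a') comes back once, and the row (3, NULL) is removed because set operators compare NULLs as equal to each other.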

If BigQuery does not offer minus or except, you can do the same thing with not exists:
select t1.*
from table1 t1
where not exists (select 1
from table2 t2
where t2.col1 = t1.col1 and t2.col2 = t1.col2 . . .
);
This works correctly for non-NULL values. NULL values need a bit more effort, because = never evaluates to true when either side is NULL. This can also be written as a left join:
select t1.*
from table1 t1 left join
table2 t2
on t2.col1 = t1.col1 and t2.col2 = t1.col2
where t2.col1 is null;
One of these should be acceptable to bigquery.
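The NULL caveat can be checked with a small sketch. It runs the NOT EXISTS query against an in-memory SQLite database (a stand-in chosen so the example is runnable, not BigQuery itself) and compares plain = with SQLite's null-safe IS operator. Other dialects spell a null-safe comparison differently, so treat the IS variant as SQLite-specific. The sample data and the helper `anti_join` are hypothetical.

```python
import sqlite3


def anti_join(null_safe):
    """Rows of table1 with no matching row in table2.

    null_safe=False compares with "=", null_safe=True with SQLite's "IS".
    """
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table1 (col1 INTEGER, col2 TEXT);
        CREATE TABLE table2 (col1 INTEGER, col2 TEXT);
        INSERT INTO table1 VALUES (1, 'a'), (2, NULL), (3, 'c');
        INSERT INTO table2 VALUES (1, 'a'), (2, NULL);
    """)
    op = "IS" if null_safe else "="
    rows = conn.execute(f"""
        SELECT t1.col1, t1.col2
        FROM table1 t1
        WHERE NOT EXISTS (SELECT 1
                          FROM table2 t2
                          WHERE t2.col1 {op} t1.col1
                            AND t2.col2 {op} t1.col2)
        ORDER BY t1.col1
    """).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    print(anti_join(null_safe=False))  # (2, NULL) is reported as missing
    print(anti_join(null_safe=True))   # (2, NULL) is recognised as present
```

With =, the row (2, NULL) is reported as missing from table2 even though an identical row exists there; with the null-safe comparison only (3, 'c') remains.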

What I usually do is similar to Linoff's answer, and it does not depend on any table2 column being non-NULL:
SELECT t1.*
FROM table1 t1 LEFT JOIN
(SELECT 1 AS aux, * FROM table2) t2
ON t2.col1 = t1.col1 and t2.col2 = t1.col2
WHERE t2.aux IS NULL;
This avoids testing a nullable column in the WHERE clause. Rows whose join columns are themselves NULL still never match under =.
Notice: even though this is an old thread, I'm commenting just for sake of completeness if somebody gets to this page in the future.
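A minimal sketch of why a constant marker column helps, run against in-memory SQLite as a stand-in for BigQuery. To keep it short, the join here is on col1 only (a simplification, not the query from the answer), and the IS NULL test is applied either to a nullable table2 column or to the constant aux column. The data and the helper `unmatched_col1` are hypothetical.

```python
import sqlite3


def unmatched_col1(marker):
    """col1 values of table1 reported as having no partner in table2.

    marker is the table2-side column tested with IS NULL: 'aux' or 'col2'.
    """
    if marker not in ("aux", "col2"):
        raise ValueError("marker must be 'aux' or 'col2'")
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table1 (col1 INTEGER, col2 TEXT);
        CREATE TABLE table2 (col1 INTEGER, col2 TEXT);
        INSERT INTO table1 VALUES (1, 'a'), (2, 'b'), (3, 'c');
        -- Row 2 exists in table2, but its col2 happens to be NULL.
        INSERT INTO table2 VALUES (1, 'a'), (2, NULL);
    """)
    rows = conn.execute(f"""
        SELECT t1.col1
        FROM table1 t1
        LEFT JOIN (SELECT 1 AS aux, * FROM table2) t2
          ON t2.col1 = t1.col1
        WHERE t2.{marker} IS NULL
        ORDER BY t1.col1
    """).fetchall()
    conn.close()
    return [row[0] for row in rows]


if __name__ == "__main__":
    print(unmatched_col1("col2"))  # also flags the matched row with NULL col2
    print(unmatched_col1("aux"))   # flags only the truly unmatched row
```

Testing the nullable column flags row 2 as unmatched even though it joined; the constant marker is NULL only when the left join found no partner.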

Related

Convert OUTER APPLY to LEFT JOIN

We have a query that is slow in production (for some internal reason):
SELECT T2.Date
FROM Table1 T1
OUTER APPLY
(
SELECT TOP 1 T2.[DATE]
FROM Table2 T2
WHERE T1.Col1 = T2.Col1
AND T1.Col2 = T2.Col2
ORDER BY T2.[Date] DESC
) T2
But when I convert it to a LEFT JOIN, it becomes fast:
SELECT Max(T2.[Date])
FROM Table1 T1
LEFT JOIN Table2 T2
ON T1.Col1 = T2.Col1
AND T1.Col2 = T2.Col2
GROUP BY T1.Col1, T1.Col2
Can we say that both queries are equivalent? If not, how should I convert it properly?
The queries are not exactly the same. It is important to understand the differences.
If t1.col1/t1.col2 are duplicated, then the first query returns a separate row for each duplication. The second combines them into a single row.
If either t1.col1 or t1.col2 is NULL, then the first query will return NULL for the maximum date. The second will return a row and the appropriate maximum.
That said, the two queries should have similar performance, particularly if you have an index on table2(col1, col2, date). I should note that under some circumstances the apply method is faster than joins, so relative performance depends on circumstances.
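The duplicate-row difference can be reproduced in a small sketch. SQLite has no OUTER APPLY, so the per-row form is emulated here with a correlated scalar subquery; that substitution is an assumption of this example, not the original SQL Server query. The sample data and the helper `latest_dates` are hypothetical.

```python
import sqlite3


def latest_dates():
    """Return (per_row, grouped) results for the two query shapes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Table1 (Col1 INTEGER, Col2 INTEGER);
        CREATE TABLE Table2 (Col1 INTEGER, Col2 INTEGER, Date TEXT);
        -- (1, 1) appears twice in Table1; (2, 2) has no match in Table2.
        INSERT INTO Table1 VALUES (1, 1), (1, 1), (2, 2);
        INSERT INTO Table2 VALUES (1, 1, '2020-01-01'), (1, 1, '2021-06-01');
    """)
    # One output row per Table1 row, like OUTER APPLY with TOP 1.
    per_row = conn.execute("""
        SELECT (SELECT MAX(t2.Date)
                FROM Table2 t2
                WHERE t2.Col1 = t1.Col1 AND t2.Col2 = t1.Col2)
        FROM Table1 t1
        ORDER BY t1.Col1
    """).fetchall()
    # One output row per distinct (Col1, Col2) pair.
    grouped = conn.execute("""
        SELECT MAX(t2.Date)
        FROM Table1 t1
        LEFT JOIN Table2 t2
          ON t1.Col1 = t2.Col1 AND t1.Col2 = t2.Col2
        GROUP BY t1.Col1, t1.Col2
        ORDER BY t1.Col1
    """).fetchall()
    conn.close()
    return [r[0] for r in per_row], [r[0] for r in grouped]


if __name__ == "__main__":
    per_row, grouped = latest_dates()
    print(len(per_row), len(grouped))
```

With the duplicated (1, 1) row in Table1, the per-row form returns three rows and the grouped form returns two, so the rewrite is only safe when (Col1, Col2) is unique in Table1.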

SQL command usage of in / or

I have an SQL command similar to the one below.
select * from table1
where table1.col1 in (select columnA from table2 where table2.keyColumn=3)
or table1.col2 in (select columnA from table2 where table2.keyColumn=3)
Its performance is really bad, so how can I change this command? (Please note that the two subqueries in the parentheses are exactly the same.)
Try
select distinct t1.* from table1 t1
inner join table2 t2 ON t1.col1 = t2.columnA OR t1.col2 = t2.columnA
where t2.keyColumn = 3
This is your query:
select *
from table1
where table1.col1 in (select columnA from table2 t2 where t2.keyColumn = 3) or
table1.col2 in (select columnA from table2 t2 where t2.keyColumn = 3);
Probably the best approach is to build an index on table2(keyColumn, columnA).
It is also possible that in has poor performance characteristics. So, you can try rewriting this as an exists query:
select *
from table1 t1
where exists (select 1 from table2 t2 where t2.columnA = t1.col1 and t2.keyColumn = 3) or
exists (select 1 from table2 t2 where t2.columnA = t1.col2 and t2.keyColumn = 3);
In this case, the appropriate index is table2(columnA, keyColumn).
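As a sanity check that the IN form and the EXISTS form select the same rows, here is a minimal sketch on in-memory SQLite. The database engine of the question is not stated, so SQLite is an assumption made for runnability; the `id` column, the sample data, and the index name are hypothetical. The sketch says nothing about relative speed on a real data set.

```python
import sqlite3

IN_QUERY = """
    SELECT id FROM table1
    WHERE col1 IN (SELECT columnA FROM table2 WHERE keyColumn = 3)
       OR col2 IN (SELECT columnA FROM table2 WHERE keyColumn = 3)
    ORDER BY id
"""

EXISTS_QUERY = """
    SELECT t1.id FROM table1 t1
    WHERE EXISTS (SELECT 1 FROM table2 t2
                  WHERE t2.columnA = t1.col1 AND t2.keyColumn = 3)
       OR EXISTS (SELECT 1 FROM table2 t2
                  WHERE t2.columnA = t1.col2 AND t2.keyColumn = 3)
    ORDER BY t1.id
"""


def make_demo_db():
    """Build a small in-memory database with hypothetical sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table1 (id INTEGER, col1 INTEGER, col2 INTEGER);
        CREATE TABLE table2 (columnA INTEGER, keyColumn INTEGER);
        -- Index suggested for the EXISTS form.
        CREATE INDEX idx_table2_a_key ON table2 (columnA, keyColumn);
        INSERT INTO table1 VALUES (1, 10, 99), (2, 98, 20),
                                  (3, 97, 96), (4, 10, 20);
        INSERT INTO table2 VALUES (10, 3), (20, 3), (97, 4);
    """)
    return conn


def matching_ids(conn, query):
    """Run one of the two queries and return the matching ids."""
    return [row[0] for row in conn.execute(query)]


if __name__ == "__main__":
    conn = make_demo_db()
    print(matching_ids(conn, IN_QUERY), matching_ids(conn, EXISTS_QUERY))
```

Row 3 is excluded by both forms because its only candidate match (97) has keyColumn 4, not 3.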
Assuming you're doing this in VFP, use SYS(3054) to see how the query is being optimized and what part is not.
Are the main query and subqueries fully Rushmore-optimisable?
Since the subqueries do not appear to be correlated (i.e. they don't refer to table1), then as long as everything is fully supported by indexes you should be fine.

Comparing two tables to make sure they are the same row by row and column by column on SQL Server

I am comparing two tables to make sure they are the same row by row and column by column on SQL Server.
SELECT *
FROM t1, t2
WHERE t1.column1 = t2.column1 AND t1.column2 = t2.column2
AND t1.column3 = t2.column3 AND t1.column4 != t2.column4
The tables are very large, more than 100 million rows.
I got error:
ERROR [HY000] ERROR: 9434 : Not enough memory for merge-style join
Are there better ways to do this comparison?
thanks !
A much more efficient way of checking the row-by-row difference is to use the EXISTS operator.
Something like this:
SELECT *
FROM t1
WHERE NOT EXISTS (SELECT 1
FROM t2
WHERE t1.column1 = t2.column1
AND t1.column2 = t2.column2
AND t1.column3 = t2.column3
AND t1.column4 = t2.column4
)
You could try EXCEPT http://technet.microsoft.com/en-us/library/ms188055(v=sql.100).aspx
SELECT column1, column2, column3, column4 FROM t1
EXCEPT
SELECT column1, column2, column3, column4 FROM t2
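EXCEPT only reports rows of the first table that are missing from the second, so a full comparison needs the query in both directions. A minimal sketch on in-memory SQLite (a stand-in for SQL Server, chosen so the example runs anywhere); the sample data and the helpers `make_demo_db` and `table_diff` are hypothetical.

```python
import sqlite3

COLUMNS = "column1, column2, column3, column4"


def make_demo_db():
    """Two small tables that differ in one value and one whole row each."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE t1 (column1 INTEGER, column2 TEXT, column3 TEXT, column4 INTEGER);
        CREATE TABLE t2 (column1 INTEGER, column2 TEXT, column3 TEXT, column4 INTEGER);
        INSERT INTO t1 VALUES (1, 'a', 'x', 10), (2, 'b', 'y', 20), (3, 'c', 'z', 30);
        INSERT INTO t2 VALUES (1, 'a', 'x', 10), (2, 'b', 'y', 99), (4, 'd', 'w', 40);
    """)
    return conn


def table_diff(conn):
    """Return (rows only in t1, rows only in t2)."""
    only_in_t1 = conn.execute(
        f"SELECT {COLUMNS} FROM t1 EXCEPT SELECT {COLUMNS} FROM t2 ORDER BY 1"
    ).fetchall()
    only_in_t2 = conn.execute(
        f"SELECT {COLUMNS} FROM t2 EXCEPT SELECT {COLUMNS} FROM t1 ORDER BY 1"
    ).fetchall()
    return only_in_t1, only_in_t2


if __name__ == "__main__":
    only_in_t1, only_in_t2 = table_diff(make_demo_db())
    print(only_in_t1)
    print(only_in_t2)
```

The tables match exactly when both lists are empty. Because EXCEPT works on distinct rows, this check would not notice a row that appears a different number of times in each table.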
What if you try an INNER JOIN (and not select all the data from both tables)?
SELECT t1.column4, t2.column4
FROM t1 INNER JOIN t2 ON t1.column1 = t2.column1 AND t1.column2 = t2.column2
AND t1.column3 = t2.column3
WHERE t1.column4 != t2.column4
Do you want to identify all the rows that are different or just identify IF there are any rows that are different?
Here's how I would do this: first, I assume you have primary keys on both tables. When you join those tables, the best way to join is using primary key fields, not all of them:
select t1.*, t2.*
from t1 join t2 on t1.id = t2.id
Then you can compare those tables field by field without overloading SQL:
select t1.*, t2.*
from t1 full outer join t2 on t1.id = t2.id
where t1.field1 <> t2.field1 or t1.field2 <> t2.field2 .....
The resulting records would be mismatches.
The code I wrote here is conceptual; I didn't run it myself, so you might need to adjust it.
All of the above are good suggestions (My first try would be SELECT * FROM t1 EXCEPT SELECT * FROM t2), but you indicate they all give the same out of memory error. Therefore I must conclude your tables are simply too large to perform the operation you desire all in one go. You'll have to run the query in stages, using a technique like one of the ones from "Equivalent of LIMIT and OFFSET for SQL Server?" I'd start with something like this (SQL Fiddle):
DECLARE @offset INT = 0
SELECT TOP 50000000 *
FROM (
SELECT *,
ROW_NUMBER() over (order by column1) AS r_n_n
FROM t1
) xx
WHERE r_n_n >= @offset
EXCEPT
SELECT TOP 50000000 *
FROM (
SELECT *,
ROW_NUMBER() over (order by column1) AS r_n_n
FROM t2
) xx
WHERE r_n_n >= @offset
Then you can increment @offset by the amount of TOP n and do it again. This will likely involve some trial and error to find the limit for the TOP n clause that will run to completion rather than throw an error. I'd start with half, then try quarters, eighths, etc. as necessary.
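The staged approach can also be driven from client code. The sketch below is a variation rather than the query above: it batches by ranges of column1 values instead of ROW_NUMBER offsets, and it assumes column1 is a non-NULL integer key. It runs on in-memory SQLite as a stand-in for SQL Server, with a reduced two-column schema; the helpers `make_demo_db` and `diff_in_batches` are hypothetical.

```python
import sqlite3


def make_demo_db():
    """t2 is a copy of t1 except for one changed value and one missing row."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (column1 INTEGER, column2 TEXT)")
    conn.execute("CREATE TABLE t2 (column1 INTEGER, column2 TEXT)")
    rows = [(i, f"v{i}") for i in range(1, 11)]
    conn.executemany("INSERT INTO t1 VALUES (?, ?)", rows)
    changed = [(i, "changed" if i == 4 else v) for i, v in rows if i != 9]
    conn.executemany("INSERT INTO t2 VALUES (?, ?)", changed)
    return conn


def diff_in_batches(conn, batch_size):
    """Rows of t1 missing from t2, computed one key range at a time."""
    lo, hi = conn.execute("SELECT MIN(column1), MAX(column1) FROM t1").fetchone()
    if lo is None:  # t1 is empty, so nothing can be missing
        return []
    missing = []
    start = lo
    while start <= hi:
        end = start + batch_size
        # Both sides are restricted to the same half-open range [start, end).
        batch = conn.execute(
            """
            SELECT column1, column2 FROM t1 WHERE column1 >= ? AND column1 < ?
            EXCEPT
            SELECT column1, column2 FROM t2 WHERE column1 >= ? AND column1 < ?
            ORDER BY 1
            """,
            (start, end, start, end),
        ).fetchall()
        missing.extend(batch)
        start = end
    return missing


if __name__ == "__main__":
    print(diff_in_batches(make_demo_db(), batch_size=3))
```

Batching on key values keeps each slice of t1 aligned with the same slice of t2, which row-number offsets do not guarantee once the tables differ in row count.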

Optimal query writing

I have 3 tables t1,t2,t3 each having 35K records.
select t1.col1,t2.col2,t3.col3
from table1 t1,table2 t2,table3 t3
where t1.col1 = t2.col1
and t1.col1 = 100
and t3.col3 = t2.col3
and t3.col4 = 101
and t1.col2 = 102;
It takes a long time to return the result (15 secs). I have proper indexes.
What is the optimal way of rewriting it?
It's probably best to run your query with EXPLAIN EXTENDED placed in front of it. That will give you a good idea of which indexes it is or isn't using. Include the output in your question if you need help parsing the results.
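For readers who want to try the idea without a MySQL server, here is a minimal sketch using SQLite's EXPLAIN QUERY PLAN from Python. It is a stand-in for MySQL's EXPLAIN, not the same command, and its output format differs between SQLite versions. The index names, sample rows, and helpers `make_demo_db` and `query_plan` are hypothetical.

```python
import sqlite3

QUERY = """
    SELECT t1.col1, t2.col2, t3.col3
    FROM table1 t1, table2 t2, table3 t3
    WHERE t1.col1 = t2.col1
      AND t1.col1 = 100
      AND t3.col3 = t2.col3
      AND t3.col4 = 101
      AND t1.col2 = 102
"""


def make_demo_db():
    """Tiny versions of the three tables, with hypothetical indexes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE table1 (col1 INTEGER, col2 INTEGER);
        CREATE TABLE table2 (col1 INTEGER, col2 INTEGER, col3 INTEGER);
        CREATE TABLE table3 (col3 INTEGER, col4 INTEGER);
        CREATE INDEX idx_t1 ON table1 (col1, col2);
        CREATE INDEX idx_t2 ON table2 (col1);
        CREATE INDEX idx_t3 ON table3 (col3, col4);
        INSERT INTO table1 VALUES (100, 102), (100, 5), (7, 102);
        INSERT INTO table2 VALUES (100, 55, 9), (7, 66, 9);
        INSERT INTO table3 VALUES (9, 101), (9, 1);
    """)
    return conn


def query_plan(conn):
    """Return the human-readable plan steps SQLite reports for QUERY."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY).fetchall()
    # The descriptive text is the last column of each plan row.
    return [row[-1] for row in rows]


if __name__ == "__main__":
    conn = make_demo_db()
    for step in query_plan(conn):
        print(step)
    print(conn.execute(QUERY).fetchall())
```

Each printed step names the table being read and, when one is used, the index chosen for it, which is the same information to look for in MySQL's output.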
If you have an index based on t1.Col1 or t1.Col2, use THAT as the first part of your WHERE clause. Then, by using the "STRAIGHT_JOIN" clause, it tells MySQL to do exactly as I've listed here. Yes, this is older ANSI querying syntax which is still completely valid (as you originally had too), but should come out quickly with a response. The first two of the where clause will immediately restrict the dataset while the rest actually completes the joins to the other tables...
select STRAIGHT_JOIN
t1.Col1,
t2.Col2,
t3.Col3
from
table1 t1,
table2 t2,
table3 t3
where
t1.Col1 = 100
and t1.Col2 = 102
and t1.col1 = t2.col1
and t2.col3 = t3.col3
and t3.Col4 = 101

In SQL, we can use "Union" to merge two tables. What are different ways to do "Intersection"?

In SQL, there is an operator to "Union" two tables. In an interview, I was asked: if one table has just one field containing 1, 2, 7, 8, and another table also has just one field containing 2 and 7, how do I get the intersection? I was stunned at first, because I had never seen it that way.
Later on, I found that it is actually a "Join" (inner join), which is just
select * from t1, t2 where t1.number = t2.number
(although the name "join" feels more like "union" rather than "intersect")
another solution seems to be
select * from t1 INTERSECT select * from t2
but it is not supported in MySQL. Are there different ways to get the intersection besides these two methods?
This page explains how to implement INTERSECT and MINUS in MySQL. To implement INTERSECT you should use an inner join:
SELECT t1.number
FROM t1
INNER JOIN t2
ON t1.number = t2.number
Your code does this too, but it is not recommended to write joins like that any more.
An intersect is just an inner join. So
select * from t1 INTERSECT select * from t2
can be rewritten for MySQL like
select *
from t1
inner join t2
on t1.col1 = t2.col1
and t1.col2 = t2.col2
and t1.col3 = t2.col3
...
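A minimal sketch comparing three ways to get the intersection, using the 1, 2, 7, 8 and 2, 7 example from the question. It runs on in-memory SQLite, which does support INTERSECT, as a stand-in for MySQL; the helper `intersections` is hypothetical. The third variant, DISTINCT with an IN subquery, is one more alternative to the two methods mentioned in the question.

```python
import sqlite3

QUERIES = {
    "intersect": "SELECT number FROM t1 INTERSECT SELECT number FROM t2 ORDER BY 1",
    "inner_join": (
        "SELECT t1.number FROM t1 INNER JOIN t2 ON t1.number = t2.number ORDER BY 1"
    ),
    "in_subquery": (
        "SELECT DISTINCT number FROM t1 "
        "WHERE number IN (SELECT number FROM t2) ORDER BY 1"
    ),
}


def intersections(t1_values, t2_values):
    """Run each query on single-column tables and return the results by name."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t1 (number INTEGER)")
    conn.execute("CREATE TABLE t2 (number INTEGER)")
    conn.executemany("INSERT INTO t1 VALUES (?)", [(v,) for v in t1_values])
    conn.executemany("INSERT INTO t2 VALUES (?)", [(v,) for v in t2_values])
    results = {
        name: [row[0] for row in conn.execute(sql)] for name, sql in QUERIES.items()
    }
    conn.close()
    return results


if __name__ == "__main__":
    print(intersections([1, 2, 7, 8], [2, 7]))
    print(intersections([2, 2, 7], [2, 7]))
```

On the interview data all three agree. When t1 contains a value twice, the inner join returns it twice while INTERSECT and the DISTINCT form return it once, so the join is only a drop-in replacement when the joined columns have no duplicates (or when DISTINCT is added).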