Left excluding join with BigQuery

I have two tables (A and B) having identical structures. Table B is basically a subset of Table A. I want to retrieve all the records from Table A that are not present in Table B.
For this, I am considering Left Excluding Join (reference). Here is the query I am executing:
select a.id, a.category from a
left join b
on a.id = b.id
where b.id is null;
According to BigQuery's estimate, the query will process 44.9 GiB. However, it is taking unusually long to complete. Am I missing anything important?
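For comparison, the same anti-join can be written with not exists; this is only a sketch of an equivalent formulation (it assumes id is the sole join key), not a guaranteed speed-up:
select a.id, a.category
from a
where not exists (select 1 from b where b.id = a.id);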

Related

SQL Inner Join w/ Unique Vals

Questions similar to this one about using DISTINCT values in an INNER JOIN have been asked a few times, but I don't see my (simple) use case covered.
Problem Description:
I have two tables Table A and Table B. They can be joined via a variable ID. Each ID may appear on multiple rows in both Table A and Table B.
I would like to INNER JOIN Table A and Table B on the distinct values of ID appearing in Table B, selecting all rows of Table A whose ID matches a Table B ID that satisfies some condition.
What I want:
I want to make sure I get only one copy of each row of Table A with a Table A.ID matching a Table B.ID which satisfies [some condition].
What I would like to do:
SELECT * FROM TableA A
INNER JOIN (
    SELECT DISTINCT ID FROM TableB WHERE [some condition]
) B ON A.ID = B.ID  -- the derived table needs an alias (B) so the join condition can reference it
Additionally:
As a further (really dumb) constraint, I can't say anything about the SQL standard in use, since I'm executing the SQL query through Stata's odbc load command on a database I have no information about beyond the variable names and the fact that "it does accept SQL queries" (that is the extent of the information I have).
If you want all rows in a that match an id in b, then use exists:
select a.*
from a
where exists (select 1 from b where b.id = a.id);
Trying to use join just complicates matters, because it both filters and generates duplicates.
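To see both effects concretely, here is a tiny self-contained sketch using hypothetical inline data:
with a as (select 1 as id union all select 2),
     b as (select 1 as id union all select 1)  -- id 1 appears twice in b
select a.id
from a
join b on a.id = b.id;
-- returns id 1 twice; the exists version above returns it only once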

Difference with/without "left join" and matching in "where" or "on"?

Is there any performance difference between the two SQL queries below? The first one is without a left join, matching in the where clause; the other uses a left join, matching in the on clause.
I get exactly the same result/output from both, but I will soon be working with bigger tables (a couple of billion rows), so I don't want to run into performance issues. Thanks in advance ...
select a.customer_id
from table a, table b
where a.customer_id = b.customer_id;

select a.customer_id
from table a
left join table b
on a.customer_id = b.customer_id;
The two do different things and yes, there is a performance impact.
Your first example is a cross join with a filter; virtually all planners are smart enough to reduce this to an inner join, but semantically it is a cross join plus a filter.
Your second is a left join which means that where the filter is not met, you will still get all records from table a.
In your second example, the planner has to assume that all records from table a are relevant, along with any correlating records from table b; in your first example, it knows that only correlated records are relevant, and therefore has more freedom in planning.
On a very small data set you will see no performance difference, but you may get different results. On a large data set, your left join can never perform better than your inner join and may perform worse.
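A small self-contained illustration of the semantic difference, using hypothetical inline data:
with a as (select 1 as customer_id union all select 2),
     b as (select 1 as customer_id)
select a.customer_id, b.customer_id as b_customer_id
from a
left join b on a.customer_id = b.customer_id;
-- left join returns (1, 1) and (2, NULL)
-- the inner-join form returns only (1, 1)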

Use of more than one join and left join

If I have more than one join, and the second join is a left join, will the query take all the data from the first two tables, or only from the second table?
Thanks
A join is just a method of connecting different tables. Theoretically (though not computationally), there is no limit on the number of joins you can use in a query.
Keep in mind that the order of joins is important once you start to use something other than inner joins. For example, a LEFT JOIN b is not equivalent to b LEFT JOIN a (see the sketch at the end of this answer).
With that being said, when you have more than one join, the result should be interpreted carefully.
Consider
SELECT a.id, b.name, c.department
FROM a
INNER JOIN b ON a.id = b.id
LEFT JOIN c ON a.id = c.id
The resulting table would consist of every id present in both a and b, with NULL returned for department wherever no matching id exists in c.
So to answer your question: the joins consider all the data in your query, but the output table depends on the joins you use. If there is still confusion, you can refer to this question, which addresses a similar point.
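To make the order-sensitivity concrete, here is a minimal sketch with hypothetical inline data:
WITH a AS (SELECT 1 AS id UNION ALL SELECT 2),
     b AS (SELECT 1 AS id)
SELECT a.id AS a_id, b.id AS b_id
FROM a LEFT JOIN b ON a.id = b.id;
-- keeps every row of a: returns (1, 1) and (2, NULL)

WITH a AS (SELECT 1 AS id UNION ALL SELECT 2),
     b AS (SELECT 1 AS id)
SELECT a.id AS a_id, b.id AS b_id
FROM b LEFT JOIN a ON a.id = b.id;
-- keeps every row of b: returns only (1, 1)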

When using multiple joins in SQL, is it faster to join everything to Table A, or join Table A to Table B, Table B to Table C, etc.? [duplicate]

This question already has answers here:
Does the join order matter in SQL?
(4 answers)
Closed 7 years ago.
To help clarify, here's some code:
Method 1
SELECT * FROM tableA a
JOIN tableB b ON a.id=b.id
JOIN tableC c ON a.id=c.id
JOIN tableD d ON a.id=d.id
Method 2
SELECT * FROM tableA a
JOIN tableB b ON a.id=b.id
JOIN tableC c ON b.id=c.id
JOIN tableD d ON c.id=d.id
THERE IS NO DIFFERENCE.
Keep in mind that databases are based on mathematical set theory, and in terms of set theory these joins are equal.
Therefore, if you look at the actual query execution plan, you will see that SQL Server even reorganizes the joins, and it might rearrange them in a completely different order.
For example, if a table contains only 10 records, that table is often taken for the first join, because cutting away even 5 of its records already cuts the intermediate result set in half.
The database maintains statistics about the number of records and the distribution of the content. With these statistics, the query engine can make a very good "guess" at which order would be the fastest.
You can influence this behaviour by using query hints, but this is only useful in very rare situations.
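For example, in SQL Server (the engine mentioned above), the FORCE ORDER hint pins the joins to the order you wrote them in; this is only a sketch of the mechanism, not a recommendation:
SELECT * FROM tableA a
JOIN tableB b ON a.id = b.id
JOIN tableC c ON a.id = c.id
JOIN tableD d ON a.id = d.id
OPTION (FORCE ORDER); -- forces the optimizer to keep the written join order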

sqlite query does not complete - bad index, or just too much data?

I have two tables in SQLITE: from_original and from_binary. I want to LEFT OUTER JOIN to pull the records in from_original that are not present in from_binary. The problem is that the query I have written does not complete (I terminated it after about 1 hour). Can anyone help me to understand why?
I have an index defined for each of the fields in my query, but explain query plan mentions that only one of the indices will be referenced. I am not sure if this is the problem, or if it is just a matter of too much data.
Here is the query I am trying to run:
select * from from_original o
left outer join from_binary b
on o.id = b.id
and o.timestamp = b.timestamp
where b.id is null
The tables each have about 4 million records.
I have an index defined on all the id and timestamp fields (see schema at end of post), but explain query plan only indicates that one of the id indices will be used.
This is your query:
select o.*
from from_original o left outer join
from_binary b
on o.id = b.id and o.timestamp = b.timestamp
where b.id is null;
(Note: you do not need the columns from b because they should all be NULL.)
The best index for this query is a composite index on from_binary(id, timestamp). You do have a lot of data for SQLite, but this might finish in a finite amount of time.
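For reference, a sketch of the suggested composite index (the index name is illustrative):
create index from_binary_id_timestamp_idx on from_binary(id, timestamp);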