How to compute loop join in Dataiku - google-bigquery

With Dataiku, I am trying to compute multiple joins across the same table in BigQuery. For example, my query would be, in pseudocode:
For i = 1 to 24 :
CREATE TABLE table0 as
SELECT
A.*,
B.column as column_i
FROM
table0 AS A
LEFT JOIN table_i AS B
ON A.id=B.id
How can I do this in a simple way? I tried a SQL script and a SQL notebook, but it seems that Dataiku doesn't support the DECLARE statement for my variable i.
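One possible way to avoid the loop (and DECLARE) altogether is to generate a single SELECT that contains all 24 LEFT JOINs. The sketch below builds that query text in Python. The helper name build_join_query and the aliases B1..B24 are made up for illustration; table0, table_i, id and column are the placeholder names from the pseudocode above. It only builds a string and does not connect to BigQuery or Dataiku.

```python
def build_join_query(base_table: str, n_tables: int) -> str:
    """Build one SELECT that left-joins table_1..table_n onto base_table.

    Hypothetical helper: table and column names follow the pseudocode
    in the question and are placeholders, not a real schema.
    """
    select_cols = ["A.*"]
    joins = []
    for i in range(1, n_tables + 1):
        alias = f"B{i}"  # one alias per joined table
        select_cols.append(f"{alias}.column AS column_{i}")
        joins.append(f"LEFT JOIN table_{i} AS {alias} ON A.id = {alias}.id")
    return (
        "SELECT\n  " + ",\n  ".join(select_cols)
        + f"\nFROM {base_table} AS A\n"
        + "\n".join(joins)
    )


query = build_join_query("table0", 24)
print(query)
```

The printed text could then be pasted into a SQL recipe, or the string could be built and executed from a Python recipe. Writing the result to a new output table rather than overwriting table0 keeps the query to a single pass.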

Related

Left excluding join with BigQuery

I have two tables (A and B) having identical structures. Table B is basically a subset of Table A. I want to retrieve all the records from Table A that are not present in Table B.
For this, I am considering Left Excluding Join (reference). Here is the query I am executing:
select a.id, a.category from a
left join b
on a.id = b.id
where b.id is null;
As per BigQuery's estimate, the query will process 44.9 GiB. However, the query is taking much longer than expected to complete. Am I missing something important?
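For reference, the left excluding join can also be written with NOT EXISTS. The sketch below uses Python's sqlite3 module as a local stand-in for BigQuery, with made-up sample rows, only to show that the two forms return the same rows. It says nothing about which form runs faster on BigQuery.

```python
import sqlite3

# In-memory database with tiny, made-up sample data (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, category TEXT);
    CREATE TABLE b (id INTEGER, category TEXT);
    INSERT INTO a VALUES (1, 'red'), (2, 'green'), (3, 'blue');
    INSERT INTO b VALUES (2, 'green');
""")

# Left excluding join, as in the question.
left_excluding = conn.execute(
    "SELECT a.id, a.category FROM a "
    "LEFT JOIN b ON a.id = b.id "
    "WHERE b.id IS NULL"
).fetchall()

# Same result expressed as an anti-join with NOT EXISTS.
not_exists = conn.execute(
    "SELECT a.id, a.category FROM a "
    "WHERE NOT EXISTS (SELECT 1 FROM b WHERE b.id = a.id)"
).fetchall()

print(sorted(left_excluding))
print(sorted(not_exists))
```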

SQL Inner Join w/ Unique Vals

Questions similar to this one about using DISTINCT values in an INNER JOIN have been asked a few times, but I don't see my (simple) use case.
Problem Description:
I have two tables Table A and Table B. They can be joined via a variable ID. Each ID may appear on multiple rows in both Table A and Table B.
I would like to INNER JOIN Table A and Table B on the distinct values of ID that satisfy some condition in Table B, and select all rows of Table A whose Table A.ID is among those values.
What I want:
I want to make sure I get only one copy of each row of Table A with a Table A.ID matching a Table B.ID which satisfies [some condition].
What I would like to do:
SELECT * FROM TABLE A
INNER JOIN (
SELECT DISTINCT ID FROM TABLE B WHERE [some condition]
) ON TABLE A.ID=TABLE B.ID
Additionally:
As a further (really dumb) constraint, I can't say anything about the SQL standard in use, since I'm executing the SQL query through Stata's odbc load command on a database I have no information about beyond the variable names and the fact that "it does accept SQL queries," ( <- this is the extent of the information I have).
If you want all rows in a that match an id in b, then use exists:
select a.*
from a
where exists (select 1 from b where b.id = a.id);
Trying to use join just complicates matters, because it both filters and generates duplicates.
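The difference can be checked on a small example. The sketch below uses Python's sqlite3 module with made-up rows (the val and flag columns are assumptions standing in for "other data" and "[some condition]"). The plain join multiplies rows of a when an id repeats in b, while EXISTS returns each matching row of a once.

```python
import sqlite3

# In-memory database; ids repeat in both tables, as in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (id INTEGER, val TEXT);
    CREATE TABLE b (id INTEGER, flag INTEGER);
    INSERT INTO a VALUES (1, 'x'), (1, 'y'), (2, 'z');
    INSERT INTO b VALUES (1, 1), (1, 1), (3, 1);
""")

# Plain inner join: every a-row with id 1 pairs with every b-row with id 1.
join_rows = conn.execute(
    "SELECT a.id, a.val FROM a INNER JOIN b ON a.id = b.id"
).fetchall()

# EXISTS: each a-row is kept at most once; flag = 1 plays the role
# of [some condition] (hypothetical).
exists_rows = conn.execute(
    "SELECT a.id, a.val FROM a "
    "WHERE EXISTS (SELECT 1 FROM b WHERE b.id = a.id AND b.flag = 1)"
).fetchall()

print(len(join_rows))       # 2 a-rows x 2 b-rows = 4
print(sorted(exists_rows))  # the two a-rows with id 1, once each
```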

Querying a Partitioned table in BigQuery using a reference from a joined table

I would like to run a query that partitions table A using a value from table B.
For example:
#standard SQL
select A.user_id
from my_project.xxx A
inner join my_project.yyy B
on A._partitiontime = timestamp(B.date)
where B.date = '2018-01-01'
This query will scan all the partitions in table A and will not take into consideration the date I specified in the where clause (for partitioning purposes). I have tried running this query in several different ways but all produced the same result - scanning all partitions in table A.
Is there any way around it?
Thanks in advance.
With BigQuery scripting (Beta now), there is a way to prune the partitions.
Basically, a scripting variable is defined to capture the dynamic part of a subquery. Then, in the subsequent query, the scripting variable is used as a filter to prune the partitions to be scanned.
DECLARE date_filter ARRAY<DATETIME>
DEFAULT (SELECT ARRAY_AGG(date) FROM B WHERE ...);
select A.user_id
from my_project.xxx A
inner join my_project.yyy B
on A._partitiontime = timestamp(B.date)
where A._partitiontime IN UNNEST(date_filter)
The doc says this about your use case:
Express the predicate filter as closely as possible to the table
identifier. Complex queries that require the evaluation of multiple
stages of a query in order to resolve the predicate (such as inner
queries or subqueries) will not prune partitions from the query.
The following query does not prune partitions (note the use of a subquery):
#standardSQL
SELECT
t1.name,
t2.category
FROM
table1 t1
INNER JOIN
table2 t2
ON
t1.id_field = t2.field2
WHERE
t1.ts = (SELECT timestamp from table3 where key = 2)

Calculated Field from Linked Table

I'm using MS Access 2013. My database has four relevant tables:
Table A) Numerical_Key_A, {Other Data Items}
Table B) Numerical_Key_B, "Adjustment Value", {Other Data Items}
Table C) Numerical_Key_C, {Other Data Items}
Table D) {Other Data Items}, Link-to-B, Link-to-A, Link-to-C
The way it works, however, is that the Link-to-C going into a record of Table D is always exactly "Adjustment Value" away from Numerical_Key_A. As such, I would like to make the Link-to-C be automatically calculated when I enter Link-to-A and Link-to-B.
As far as I can tell, this would require Table D having a Calculated Field that gets its data from the linked Table B, which Access does not allow. Is there another way to do this? I'd prefer not to use VBA if possible, but if it's the only way, I'll just have to learn how to do it. (I know VBA for Excel, but have never used it in Access before).
If I understand you correctly, all you need to do is create a join between the tables, for example:
SELECT D.Field1, D.Field2, B.[Adjustment Value]
FROM Table_D AS D INNER JOIN Table_B AS B ON D.[Link-to-B] = B.Numerical_Key_B
For further information, please, see:
Join tables and queries
How to: Perform Joins Using Access SQL
[EDIT]
In table D, there is a field, which I have labeled in the above snippet as "Link-to-C". This field should be populated with the value of Link-to-A plus TableB.AdjustmentValue
SELECT D.Field1, D.[Link-to-C] + D.[Link-to-A] + B.[Adjustment Value] AS CalculatedField
FROM ((Table_D AS D INNER JOIN Table_B AS B ON D.[Link-to-B] = B.Numerical_Key_B)
INNER JOIN Table_C AS C ON C.Numerical_Key_C = D.[Link-to-C])
INNER JOIN Table_A AS A ON A.Numerical_Key_A = D.[Link-to-A]
Note: I have not had a chance to test it. It's just an idea.
Finally, a warning: Access requires parentheses around the joins when a query uses more than one, and field names containing hyphens or spaces must be wrapped in square brackets.
[EDIT2]
Based on the discussion, the query below should return the calculated field:
SELECT C.MetaLevel, R.LevelAdjustment, C.MetaLevel+ R.LevelAdjustment AS NominalLevel
FROM tblRank AS R INNER JOIN tblCreatures AS C ON R.ID = C.Rank;
If you would like to update NominalLevel in tblCreatures, use a query like this:
UPDATE tblCreatures AS C
INNER JOIN tblRank AS R ON R.ID = C.Rank
SET C.NominalLevel = (C.MetaLevel+ R.LevelAdjustment);
Once NominalLevel has been updated, you can also select it directly:
SELECT C.MetaLevel, R.LevelAdjustment, C.NominalLevel
FROM tblRank AS R INNER JOIN tblCreatures AS C ON R.ID = C.Rank;
Cheers,
Maciej

SQL JOIN that uses OR in the ON statement

I’m running a SQL query on Google BigQuery and want to do this kind of SQL command:
SELECT ... FROM A JOIN B
ON A.col1=B.col1 AND (A.col2=B.col2 OR A.col3=B.col3)
This fails though with the error:
Error: ON clause must be AND of = comparisons of one field name from each table, with all field names prefixed with table name.
Is there a way to rewrite the SQL to get this kind of functionality?
It turns out this works: in BigQuery (legacy SQL), the comma between the two subqueries is equivalent to UNION ALL. I'm not sure how to do it if you want a plain UNION, since DISTINCT is not actually supported there. Luckily, UNION ALL is enough for me as is.
SELECT ... FROM
(SELECT ... FROM A JOIN B ON A.col1=B.col1 AND A.col2=B.col2),
(SELECT ... FROM A JOIN B ON A.col1=B.col1 AND A.col3=B.col3)
This should work:
SELECT ... FROM A CROSS JOIN B
WHERE A.col1=B.col1 AND (A.col2=B.col2 OR A.col3=B.col3)
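One caveat with the UNION ALL rewrite is that a pair of rows matching on both col2 and col3 is returned twice. The sketch below checks this with Python's sqlite3 module on made-up rows. SQLite accepts OR inside ON, so it serves only as a reference engine here and says nothing about what a given BigQuery dialect accepts.

```python
import sqlite3

# In-memory tables with made-up rows: row 1 matches on col2 AND col3,
# row 2 matches on col3 only, row 3 matches on neither.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE A (col1 INTEGER, col2 INTEGER, col3 INTEGER);
    CREATE TABLE B (col1 INTEGER, col2 INTEGER, col3 INTEGER);
    INSERT INTO A VALUES (1, 10, 100), (2, 20, 200), (3, 30, 300);
    INSERT INTO B VALUES (1, 10, 100), (2, 99, 200), (3, 99, 999);
""")

join_col2 = "SELECT A.col1 FROM A JOIN B ON A.col1 = B.col1 AND A.col2 = B.col2"
join_col3 = "SELECT A.col1 FROM A JOIN B ON A.col1 = B.col1 AND A.col3 = B.col3"

# Reference result: the OR condition written directly in the ON clause.
or_rows = conn.execute(
    "SELECT A.col1 FROM A JOIN B "
    "ON A.col1 = B.col1 AND (A.col2 = B.col2 OR A.col3 = B.col3)"
).fetchall()

# UNION ALL keeps the duplicate for row 1; UNION removes it.
union_all_rows = conn.execute(f"{join_col2} UNION ALL {join_col3}").fetchall()
union_rows = conn.execute(f"{join_col2} UNION {join_col3}").fetchall()

print(sorted(or_rows))
print(sorted(union_all_rows))
print(sorted(union_rows))
```

Note that UNION matches the OR join here only because the selected rows are distinct to begin with. If the OR join itself can legitimately return identical rows, UNION would collapse those too.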