Joining two datasets with subqueries - google-bigquery

I am attempting to join two large datasets using BigQuery. They have a common field, but it has a different name in each dataset.
I want to count the number of rows and sum the results of my case logic for both table1 and table2.
I believe that I have errors resulting from subquery (subselect?) and syntax errors. I have tried to apply precedent from similar posts but I still seem to be missing something. Any assistance in getting this sorted is greatly appreciated.
SELECT
table1.field1,
table1.field2,
(
SELECT COUNT (*)
FROM table1) AS table1_total,
sum(case when table1.mutually_exclusive_metric1 = "Y" then 1 else 0 end) AS t1_pass_1,
sum(case when table1.mutually_exclusive_metric1 = "Y" AND table1.mutually_exclusive_metric2 IS null OR table1.mutually_exclusive_metric3 = 'Y' then 1 else 0 end) AS t1_pass_2,
sum(case when table1.mutually_exclusive_metric3 ="Y" AND table1.mutually_exclusive_metric2 ="Y" AND table1.mutually_exclusive_metric3 ="Y" then 1 else 0 end) AS t1_pass_3,
(
SELECT COUNT (*)
FROM table2) AS table2_total,
sum(case when table2.metric1 IS true then 1 else 0 end) AS t2_pass_1,
sum(case when table2.metric2 IS true then 1 else 0 end) AS t2_pass_2,
(
SELECT COUNT (*)
FROM dataset1.table1 JOIN EACH dataset2.table2 ON common_field_table1 = common_field_table2) AS overlap
FROM
dataset1.table1,
dataset2.table2
WHERE
XYZ
Thanks in advance!

Sho, let's take this one step at a time:
1) Using * is not explicit, and being explicit is good. Additionally, stating explicit selects and * will duplicate selects with autorenames. table1.field will become table1_field. Unless you are just playing around, don't use *.
2) You never joined. A query with a join looks like this (note order of WHERE and GROUP statements, note naming of each):
SELECT
t1.field1 AS field1,
t2.field2 AS field2
FROM dataset1.table1 AS t1
JOIN dataset2.table2 AS t2
ON t1.field1 = t2.field1
WHERE t1.field1 = "some value"
GROUP BY field1, field2
Here t1.field1 and t2.field1 contain corresponding values, so you wouldn't repeat both of them in the select.
3) Use whitespace to make your code easier to read. It helps everyone involved, including you.
4) Your subselects are pretty useless. A subselect is used instead of creating a new table. For example, you would use a subselect to group or filter out data from an existing table. For example:
SELECT
subselect.field1 AS ssf1,
subselect.max_f1 AS ss_max_f1
FROM (
SELECT
t1.field1 AS field1,
MAX(t1.field1) AS max_f1,
FROM dataset1.table1 AS t1
GROUP BY field1
) AS subselect
The subselect is practically a new table that you select from. Treat it logically like it happens first, and you take the results from that and use it in your main select.
5) This was a terrible question. It didn't even look like you tried to figure things out one step at a time.
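To make points 2 and 4 concrete, here is a minimal sketch of one way to get the totals, the pass counts and the overlap in a single row. It runs on SQLite through Python's sqlite3 module as a stand-in for BigQuery, so the syntax is generic SQL rather than BigQuery-specific. The sample rows, the single metric1 column per table and the alias names (t1_stats, t2_stats, overlap_stats, n_rows) are assumptions made up for illustration.

```python
import sqlite3

# In-memory stand-in for the two datasets; the rows are invented sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (common_field_table1 INTEGER, metric1 TEXT);
CREATE TABLE table2 (common_field_table2 INTEGER, metric1 INTEGER);
INSERT INTO table1 VALUES (1, 'Y'), (2, 'N'), (3, 'Y');
INSERT INTO table2 VALUES (2, 1), (3, 0), (4, 1), (5, 1);
""")

# Each subselect aggregates one source on its own, so the counts are not
# inflated by the join. Only the overlap subselect actually joins the tables.
query = """
SELECT
  t1_stats.n_rows       AS table1_total,
  t1_stats.pass_1       AS t1_pass_1,
  t2_stats.n_rows       AS table2_total,
  t2_stats.pass_1       AS t2_pass_1,
  overlap_stats.overlap AS overlap
FROM (
  SELECT COUNT(*) AS n_rows,
         SUM(CASE WHEN metric1 = 'Y' THEN 1 ELSE 0 END) AS pass_1
  FROM table1
) AS t1_stats
CROSS JOIN (
  SELECT COUNT(*) AS n_rows,
         SUM(CASE WHEN metric1 = 1 THEN 1 ELSE 0 END) AS pass_1
  FROM table2
) AS t2_stats
CROSS JOIN (
  SELECT COUNT(*) AS overlap
  FROM table1 AS t1
  JOIN table2 AS t2
    ON t1.common_field_table1 = t2.common_field_table2
) AS overlap_stats
"""

summary = conn.execute(query).fetchone()
print(summary)
```

Each of the three subselects returns exactly one row, which is why cross joining them is safe here.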

Related

Using A Count And A Case Statement In One Query

I'm pretty much out of ideas on how to get this to work. I haven't really used SQL in several years, so there's a lot I don't remember.
So here is what I would like to happen:
I return the rows where the Code field from table has the value 1208 AND estnumber = 1187216
Run a count on the selection, if 0 run a subquery
If >0 run a different subquery
I didn't get to the subquery part yet because I can't get this to work correctly at all. Right now I just want it to return text.
Here is the latest attempt. I'm actually using DB2, but maybe we can ignore that for now and I'll work that part out later, because it says the syntax isn't correct while other validators disagree (if you don't know anything about DB2, just use standard SQL when giving advice).
SELECT
count(*) AS t
FROM
table
WHERE
(
ESTNUMBER = 1187216
AND CODE = 1208
)
AND CASE WHEN t = 0 THEN 'it is zero' ELSE 'it is not zero' END;
Are you trying to do something like this?
WITH c AS (
SELECT count(*) AS cnt
FROM table
WHERE ESTNUMBER = 1187216 AND CODE = 1208
)
SELECT s1.*
FROM subquery1 s1
WHERE (SELECT cnt FROM c) = 0
UNION ALL
SELECT s2.*
FROM subquery2 s2
WHERE (SELECT cnt FROM c) > 0;
This assumes that the columns returned by the subqueries are compatible (same number, same types).
There are better ways to write this query (notably using EXISTS and NOT EXISTS), but this conforms directly to how you asked the question.
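For reference, a rough sketch of the EXISTS / NOT EXISTS variant, run on SQLite through Python's sqlite3 module rather than DB2. The table name est (standing in for the asker's table), the msg column and the sample rows are assumptions for illustration; subquery1 and subquery2 are modelled as plain tables.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE est (estnumber INTEGER, code INTEGER);
CREATE TABLE subquery1 (msg TEXT);
CREATE TABLE subquery2 (msg TEXT);
INSERT INTO est VALUES (1187216, 1208), (1187216, 999);
INSERT INTO subquery1 VALUES ('no matching rows');
INSERT INTO subquery2 VALUES ('at least one matching row');
""")

# Only one branch of the UNION ALL can return rows, because the two
# WHERE conditions are exact opposites of each other.
query = """
SELECT s1.msg FROM subquery1 AS s1
WHERE NOT EXISTS (SELECT 1 FROM est WHERE estnumber = :est AND code = :code)
UNION ALL
SELECT s2.msg FROM subquery2 AS s2
WHERE EXISTS (SELECT 1 FROM est WHERE estnumber = :est AND code = :code)
"""

def pick_branch(est, code):
    """Return the rows of whichever branch applies for the given filter."""
    return [row[0] for row in conn.execute(query, {"est": est, "code": code})]

print(pick_branch(1187216, 1208))
print(pick_branch(1, 1))
```

EXISTS can stop at the first matching row, whereas the count(*) version has to count all of them before comparing with zero.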
The string value should come up in the select clause and not in the where filter.
SELECT
count(*) AS t,
(CASE WHEN count(*) = 0 THEN 'it is zero' ELSE 'it is not zero' END) display_str
FROM
table
WHERE
(
ESTNUMBER = 1187216
AND CODE = 1208
)
You're thinking like an imperative programmer, not a declarative one. That is, SQL doesn't have sequential execution: it's all or nothing.
So, here's the start, the bit that works:
SELECT count(*) AS t
FROM table
WHERE ESTNUMBER = 1187216 AND CODE = 1208
Now, to check for the value of count(*), you by now know that WHERE isn't going to work. That's because COUNT is an aggregate function. To look at the result of such a function, you use HAVING.
For your CASE to work, you can move it up into the area that can get count(*) results:
SELECT count(*) AS t,
(CASE WHEN count(*) = 0 THEN 'it is zero' ELSE 'it is not zero' END) as msg
FROM table
WHERE ESTNUMBER = 1187216 AND CODE = 1208
Note that "t" is an alias you've given the result of count(*). In most SQL implementations, that alias can't be leveraged in the rest of the statement.
Now, for the either or kind of thing, it would be time to reconsider your approach and what you're really after. You'll probably ultimately have both result sets in your statement and choose how the results are served up.
Something like:
select a.id, a.ct, (case when a.ct=0 then b.amt else c.amt end) as amt
from (select id, count(*) as ct from table1 group by id) a
left join (select id, sum(amount) as amt from table2 group by id) b on a.id=b.id
left join (select id, sum(amount) as amt from table3 group by id) c on a.id=c.id
Hope this helps.
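A small sketch of the difference between labelling an aggregate in the select list and filtering on it with HAVING, run on SQLite through Python's sqlite3 module. The table name est and the sample rows are assumptions for illustration, not the asker's real data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE est (estnumber INTEGER, code INTEGER);
INSERT INTO est VALUES (1187216, 1208), (1187216, 1208), (42, 1208);
""")

# CASE in the select list: always one row back, labelled by the count.
label_query = """
SELECT COUNT(*) AS t,
       CASE WHEN COUNT(*) = 0 THEN 'it is zero' ELSE 'it is not zero' END AS msg
FROM est
WHERE estnumber = ? AND code = ?
"""

# HAVING: filters whole groups after aggregation, which WHERE cannot do.
having_query = """
SELECT estnumber, COUNT(*) AS t
FROM est
WHERE code = 1208
GROUP BY estnumber
HAVING COUNT(*) > 1
"""

matched = conn.execute(label_query, (1187216, 1208)).fetchone()
unmatched = conn.execute(label_query, (0, 0)).fetchone()
repeated = conn.execute(having_query).fetchall()
print(matched, unmatched, repeated)
```

Note that the labelled query still returns a row when nothing matches, because an aggregate without GROUP BY always produces exactly one row.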

Anyway to use IN operator in the SELECT statement? If not, why?

This may come off as a feature request more than anything, but it would be nice if SQL allowed use of the IN operator in a select statement such as the one below. I want to create new_variable in table1 based on the ID variable in table2, hence the case statement.
select ID,
case when ID in (select ID
from table2)
then 1
else 0
end as new_variable
from table1
I understand that SQL will give me an error if I run this, but why is that the case? It doesn't seem obvious to me why SQL developers couldn't enable the IN operator to be used outside of the WHERE clause.
Side note: I'm currently using a left join to avoid this issue, so I am not hung up on this.
select a.ID,
case when ifnull(b.ID, 0) = 0 then 0
else 1
end as variable_name
from table1 as a
left join(select ID from table2) as b
on a.ID = b.ID
SQL definitely supports this:
select ID,
(case when ID in (select ID from table2)
then 1 else 0
end) as new_variable
from table1
Note that there is a comma after id.
This is standard SQL. If your database doesn't support it, that would be a feature request, but all or almost all databases already support it.
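As a quick check, the same statement runs unchanged on SQLite through Python's sqlite3 module. The sample IDs are made up for illustration, and the ORDER BY is only added to make the output order predictable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE table1 (ID INTEGER);
CREATE TABLE table2 (ID INTEGER);
INSERT INTO table1 VALUES (1), (2), (3);
INSERT INTO table2 VALUES (2), (3), (4);
""")

# IN with a subquery inside a CASE expression in the select list.
query = """
SELECT ID,
       (CASE WHEN ID IN (SELECT ID FROM table2)
             THEN 1 ELSE 0
        END) AS new_variable
FROM table1
ORDER BY ID
"""

flags = conn.execute(query).fetchall()
print(flags)
```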

SQL Query Design for data validation issue

I have a fact table that contains some finance data. There is a column (VERS_NM) that defines whether the value is "Actual" or "Current Outlook". The value for these two should always be the same but we noticed in some reports it seems incorrect. So I want to write a query to find where the actual value does not match the current outlook.
I cannot wrap my head around a way to do this.
Here is what the table looks like:
So to recap there will be an identical row to row 1 except the VERS_NM column will say "Actual". At least it is supposed to be, I want to find any instances where the Actual and Current Outlook don't match. Any help or ideas is much appreciated. Just a push in the right direction or some kind of plan to tackle this would be great.
Thanks!
You could just self join the data, replacing the fields a, b, c, d with the fields that indicate that the rows are equivalent.
SELECT
*
FROM
yourTable AS actual
INNER JOIN
yourTable AS outlook
ON actual.a = outlook.a
AND actual.b = outlook.b
AND actual.c = outlook.c
AND actual.d = outlook.d
WHERE
actual.VERS_NM = 'Actual'
AND outlook.VERS_NM = 'Current Outlook'
AND actual.FINC_ACCT_METRIC_VAL <> outlook.FINC_ACCT_METRIC_VAL
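Here is a small runnable sketch of that self join on SQLite through Python's sqlite3 module. The table name finance, the key columns acct and period (playing the role of a and b above) and the sample rows are assumptions for illustration, since the real table layout was not shown.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE finance (
  acct TEXT, period TEXT, VERS_NM TEXT, FINC_ACCT_METRIC_VAL REAL
);
INSERT INTO finance VALUES
  ('A', '2020-01', 'Actual', 100),
  ('A', '2020-01', 'Current Outlook', 100),
  ('B', '2020-01', 'Actual', 50),
  ('B', '2020-01', 'Current Outlook', 55);
""")

# Pair each Actual row with its Current Outlook twin on the key columns,
# then keep only the pairs whose values disagree.
query = """
SELECT actual.acct, actual.period,
       actual.FINC_ACCT_METRIC_VAL  AS actual_val,
       outlook.FINC_ACCT_METRIC_VAL AS outlook_val
FROM finance AS actual
INNER JOIN finance AS outlook
   ON actual.acct = outlook.acct
  AND actual.period = outlook.period
WHERE actual.VERS_NM = 'Actual'
  AND outlook.VERS_NM = 'Current Outlook'
  AND actual.FINC_ACCT_METRIC_VAL <> outlook.FINC_ACCT_METRIC_VAL
"""

mismatches = conn.execute(query).fetchall()
print(mismatches)
```

Keep in mind that an inner join only reports pairs where both rows exist. A row whose twin is missing entirely will not show up here.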
An alternative method is to use UNION. It's more of a way to detect differences in many table fields but can work in your situation as well.
The technique is described in the article The shortest, fastest, and easiest way to compare two tables in SQL Server: UNION! to compare two separate tables but you can analyse a single table.
Replace the fields COL1, COL2 etc. with the columns you want to compare. For your comparison I've added a WHERE clause to the inner SELECT to effectively view your data as two separate tables.
SELECT MIN(TableName) as TableName, ID, COL1, COL2, COL3 ...
FROM
(
SELECT 'Actual' as TableName, A.ID, A.COL1, A.COL2, A.COL3, ...
FROM Finance_Data A
WHERE VERS_NM = 'Actual'
UNION ALL
SELECT 'Outlook' as TableName, B.ID, B.COL1, B.COL2, B.COL3, ...
FROM Finance_Data B
WHERE VERS_NM = 'Current Outlook'
) T
GROUP BY ID, COL1, COL2, COL3 ...
HAVING COUNT(*) = 1
ORDER BY ID
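A minimal runnable version of the UNION ALL technique, on SQLite through Python's sqlite3 module. It compares a single column COL1, and the sample rows are assumptions for illustration. Rows that appear in both halves fall into a group of two and are filtered out, so only the unmatched rows remain.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Finance_Data (ID INTEGER, COL1 REAL, VERS_NM TEXT);
INSERT INTO Finance_Data VALUES
  (1, 100, 'Actual'), (1, 100, 'Current Outlook'),
  (2, 50,  'Actual'), (2, 55,  'Current Outlook');
""")

# Identical (ID, COL1) pairs from both halves group together with COUNT(*) = 2.
# A group of size 1 means the row exists on one side only.
query = """
SELECT MIN(TableName) AS TableName, ID, COL1
FROM (
  SELECT 'Actual' AS TableName, A.ID, A.COL1
  FROM Finance_Data A
  WHERE VERS_NM = 'Actual'
  UNION ALL
  SELECT 'Outlook' AS TableName, B.ID, B.COL1
  FROM Finance_Data B
  WHERE VERS_NM = 'Current Outlook'
) T
GROUP BY ID, COL1
HAVING COUNT(*) = 1
ORDER BY ID, COL1
"""

differences = conn.execute(query).fetchall()
print(differences)
```

Unlike the self join, this also surfaces rows that have no counterpart at all on the other side.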
You could group by all other columns, and then use a having clause to find the groups that do not have exactly one "Actual" and one "Current Outlook" variant:
select col1
, col2
, ... all other columns ...
from YourTable
group by
col1
, col2
, ... all other columns ...
having sum(case when VERS_NM = 'Actual' then 1 else 0 end) <> 1
or sum(case when VERS_NM = 'Current Outlook' then 1 else 0 end) <> 1
or count(*) <> 2

Grouping data in the select statement

I have huge data which needs to be classified into different groups while retrieving. Each group has a different condition. I don't want to retrieve them separately. I want to know the number of items in each group using a single SQL statement.
For example, the pseudo code will be like this:
Select count(IssueID) as Issue1_Count if(condition1),
count(IssueID) as Issue2_Count if(condition2),
count(IssueID) as Issue3_Count if(condition3)
From table1, table2, tabl3
where common_condition1 and common_Condition2;
Can somebody help me in making an Oracle query for this...
Put it like this:
SELECT
SUM(CASE WHEN condition1 THEN 1 ELSE 0 END) as Issue1_Count,
SUM(CASE WHEN condition2 THEN 1 ELSE 0 END) as Issue2_Count,
SUM(CASE WHEN condition3 THEN 1 ELSE 0 END) as Issue3_Count
FROM
table1, table2, tabl3
WHERE
common_condition1 and common_Condition2;
Oracle's CASE statement should help you here. Have a look at this: http://www.dba-oracle.com/t_case_sql_clause.htm
There are limits though, so I'm not 100% positive you can do exactly what you have here using them.
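To show the SUM(CASE ...) pattern end to end, here is a small sketch on SQLite through Python's sqlite3 module rather than Oracle. The issues table, its severity and status columns and the sample rows are assumptions for illustration. The third column shows the COUNT(CASE ...) spelling, which relies on COUNT ignoring NULLs.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE issues (IssueID INTEGER, severity TEXT, status TEXT);
INSERT INTO issues VALUES
  (1, 'high', 'open'), (2, 'low', 'open'), (3, 'high', 'closed'),
  (4, 'high', 'open'), (5, 'low', 'open');
""")

# One pass over the table: the WHERE clause is the common condition,
# each CASE expression is one group's own condition.
query = """
SELECT
  SUM(CASE WHEN severity = 'high' THEN 1 ELSE 0 END)  AS Issue1_Count,
  SUM(CASE WHEN severity = 'low'  THEN 1 ELSE 0 END)  AS Issue2_Count,
  COUNT(CASE WHEN severity = 'high' THEN IssueID END) AS Issue1_Count_alt
FROM issues
WHERE status = 'open'
"""

counts = conn.execute(query).fetchone()
print(counts)
```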

SQL Summing Multiple Joins

I shortened the code quite a bit, but hopefully someone will get the idea of what I am trying to do. I need to sum totals from two different selects. I tried putting each of them in Left Outer Joins (tried Inner Joins too). If I run with either Left Outer Join commented out, I get the correct data, but when I run them together, I get really screwed up counts. So I know joins are probably not the correct approach to summing data from the same table, but I can't simply do it in a where clause because there is another table involved in the code I commented out.
I guess i am trying to sum together 2 different queries.
SELECT eeoc.EEOCode AS 'Test1',
SUM(eeosum.Col_One) AS 'Col_One',
FROM EEO1Analysis eeo
LEFT OUTER JOIN (
SELECT eeor.AnalysisID, eeor.Test1,
SUM(CASE eeor.ZZZ WHEN 1 THEN (CASE eeor.AAAA WHEN 1 THEN 1 ELSE 0 END) ELSE 0 END) AS 'Col_One',
FROM EEO1Roster eeor
..........
WHERE eeor.AnalysisID = 7
GROUP BY eeor.AnalysisID, eeor.EEOCode
) AS eeosum2 ON eeosum2.AnalysisID = eeo.AnalysisID
LEFT OUTER JOIN (
SELECT eeor.AnalysisID, eeor.Test1,
SUM(CASE eeor.ZZZ WHEN 1 THEN (CASE eeor.AAAA WHEN 1 THEN 1 ELSE 0 END) ELSE 0 END) AS 'Col_One',
FROM EEO1Roster eeor
........
) AS eeosum ON eeosum.AnalysisID = eeo.AnalysisID
WHERE eeo.AnalysisID = 7
GROUP BY eeoc.Test1
You could UNION ALL the 2 queries and then do a SUM + GROUP BY i.e.
SELECT Col1, Col2, SUM(Col_One) AS Col_One FROM
(SELECT Col1, Col2, SUM(Col_One) AS Col_One
FROM Table1
WHERE <Conditionset1>
GROUP BY Col1, Col2
UNION ALL
SELECT Col1, Col2, SUM(Col_One) AS Col_One
FROM Table1
WHERE <Conditionset2>
GROUP BY Col1, Col2) AS combined
GROUP BY
Col1, Col2
Of course, if there are row(s) returned by both <Conditionset1> and <Conditionset2>, they would be double counted.
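The double counting is easy to see in a small sketch on SQLite through Python's sqlite3 module. The flag_a and flag_b columns stand in for the two condition sets, and they and the sample rows are assumptions for illustration. The row with Col_One = 7 satisfies both conditions, so the UNION ALL version counts it twice, while a single pass with OR counts it once.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Table1 (Col1 TEXT, flag_a INTEGER, flag_b INTEGER, Col_One INTEGER);
INSERT INTO Table1 VALUES
  ('x', 1, 0, 10), ('x', 0, 1, 5), ('x', 1, 1, 7), ('y', 1, 0, 3);
""")

# Sum each condition set separately, stack the results, then sum again.
union_query = """
SELECT Col1, SUM(Col_One) AS Col_One
FROM (
  SELECT Col1, SUM(Col_One) AS Col_One FROM Table1
  WHERE flag_a = 1 GROUP BY Col1
  UNION ALL
  SELECT Col1, SUM(Col_One) AS Col_One FROM Table1
  WHERE flag_b = 1 GROUP BY Col1
) AS combined
GROUP BY Col1
ORDER BY Col1
"""

# Alternative when overlapping rows should be counted only once.
once_query = """
SELECT Col1, SUM(Col_One) AS Col_One
FROM Table1
WHERE flag_a = 1 OR flag_b = 1
GROUP BY Col1
ORDER BY Col1
"""

with_overlap = conn.execute(union_query).fetchall()
without_overlap = conn.execute(once_query).fetchall()
print(with_overlap, without_overlap)
```

Which of the two is right depends on whether the overlapping rows are meant to contribute to both totals.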
What about
SELECT ... FROM EEO1Analysis eeo,
(SELECT ... LEFT OUTER JOIN ... GROUP BY ... ) AS data
...
?
And, if you can, I'd recommend preparing the data into separate tables, then operating on them with different analysis IDs. That could save some execution time at least.
Need to sum totals from two different selects
If you expect a single-row, single-column result, this is enough:
SELECT
((SELECT SUM(...) FROM ... GROUP BY...) +
(SELECT SUM(...) FROM ... GROUP BY...)) as TheSumOfTwoSums
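A filled-in sketch of the scalar subquery approach, on SQLite through Python's sqlite3 module. The roster table, its grp and amount columns and the sample rows are assumptions for illustration. One thing to watch for: SUM over zero rows yields NULL, and NULL plus anything is NULL, so wrapping each subquery in COALESCE may be needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE roster (grp TEXT, amount INTEGER);
INSERT INTO roster VALUES ('a', 10), ('a', 5), ('b', 7);
""")

# Each parenthesised SELECT returns a single value, so they can be added.
plain_query = """
SELECT (SELECT SUM(amount) FROM roster WHERE grp = ?) +
       (SELECT SUM(amount) FROM roster WHERE grp = ?) AS TheSumOfTwoSums
"""

# Same idea, but an empty selection contributes 0 instead of NULL.
safe_query = """
SELECT COALESCE((SELECT SUM(amount) FROM roster WHERE grp = ?), 0) +
       COALESCE((SELECT SUM(amount) FROM roster WHERE grp = ?), 0) AS TheSumOfTwoSums
"""

both_present = conn.execute(plain_query, ("a", "b")).fetchone()[0]
one_missing = conn.execute(plain_query, ("a", "zzz")).fetchone()[0]
one_missing_safe = conn.execute(safe_query, ("a", "zzz")).fetchone()[0]
print(both_present, one_missing, one_missing_safe)
```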