BigQuery - joining on a repeated field - google-bigquery

I'm trying to run a join on a repeated field.
Originally I get an error:
Cannot join on repeated field payload.pages.action
I tried to fix this by running FLATTEN on the relevant table (this is only an example query; even if it ran successfully, it would return an empty result):
SELECT
t1.repository.forks
FROM publicdata:samples.github_nested t1
left join each flatten(publicdata:samples.github_nested,payload.pages) t2
on t2.payload.pages.action=t1.repository.url
I get a different error:
Table wildcard function 'FLATTEN' can only appear in FROM clauses
This used to work in the past. Is there some syntax change?

I don't think there has been a syntax change, but you should be able to wrap the flatten statement in a subselect. That is,
SELECT
t1.repository.forks
FROM publicdata:samples.github_nested t1
left join each (SELECT * FROM flatten(publicdata:samples.github_nested,payload.pages)) t2
on t2.payload.pages.action=t1.repository.url
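To illustrate why flattening is needed before the join, here is a minimal sketch in plain Python rather than BigQuery. The rows, field names, and the `flatten` helper are hypothetical stand-ins, not the real `github_nested` schema or BigQuery's FLATTEN implementation. It only shows the idea of producing one row per element of the repeated field so that each element can be compared in a join condition; how BigQuery treats rows whose repeated field is empty may differ from this sketch, which simply drops them.

```python
# Hypothetical in-memory rows standing in for a nested table.
rows = [
    {"url": "a", "forks": 3, "pages": [{"action": "a"}, {"action": "x"}]},
    {"url": "b", "forks": 5, "pages": []},
]


def flatten(records, repeated_key):
    """Yield one output row per element of the repeated field."""
    for record in records:
        for element in record[repeated_key]:
            flat = {k: v for k, v in record.items() if k != repeated_key}
            flat[repeated_key] = element
            yield flat


flat_rows = list(flatten(rows, "pages"))

# Left join: keep every original row, matched on pages.action == url.
joined = []
for left in rows:
    matches = [r for r in flat_rows if r["pages"]["action"] == left["url"]]
    if matches:
        joined.extend((left["forks"], m["pages"]["action"]) for m in matches)
    else:
        joined.append((left["forks"], None))
```

After flattening, `pages` holds a single element per row, so an equality test against it is well defined.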

Related

BigQuery select, join from multiple datasets and avoid name conflicts

Imagine I have several datasets and tables.
Format: dataset.table.field
dataset01.table_xxx.field_z
dataset02.table_xxx.field_z
I try to write something like
select
dataset01.table_xxx.field_z as dataset01_table_xxx_field_z,
dataset02.table_xxx.field_z as dataset02_table_xxx_field_z
from dataset01.table_xxx
join dataset02.table_xxx on dataset02.table_xxx.field_z = dataset01.table_xxx.field_z
to avoid conflicting names.
BigQuery says that dataset01.table_xxx.field_xxx is an unrecognised name in the SELECT clause.
It complains about an unrecognised name in the join clause too.
The query works if I remove dataset01 and dataset02 from the SELECT clause and the ON condition.
What is the right way to refer to fields in such a case?
select
t1.field_z as dataset01_table_xxx_field_z,
t2.field_z as dataset02_table_xxx_field_z
from dataset01.table_xxx t1
join dataset02.table_xxx t2
on t2.field_z = t1.field_z
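The alias approach can be tried locally. The following is a rough sketch using Python's built-in sqlite3 module, not BigQuery: two attached in-memory databases stand in for the two datasets (an assumption made purely for illustration), and the inserted values are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Two attached in-memory databases stand in for the two datasets.
conn.execute("ATTACH DATABASE ':memory:' AS dataset01")
conn.execute("ATTACH DATABASE ':memory:' AS dataset02")
conn.execute("CREATE TABLE dataset01.table_xxx (field_z INTEGER)")
conn.execute("CREATE TABLE dataset02.table_xxx (field_z INTEGER)")
conn.executemany("INSERT INTO dataset01.table_xxx VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO dataset02.table_xxx VALUES (?)", [(2,), (3,)])

# Each table gets a short alias; columns are referenced through the alias
# and renamed in the output so the two field_z columns do not collide.
cursor = conn.execute("""
    SELECT t1.field_z AS dataset01_table_xxx_field_z,
           t2.field_z AS dataset02_table_xxx_field_z
    FROM dataset01.table_xxx AS t1
    JOIN dataset02.table_xxx AS t2
      ON t2.field_z = t1.field_z
""")
column_names = [d[0] for d in cursor.description]
result = cursor.fetchall()
```

The aliases `t1` and `t2` remove the ambiguity between the two identically named tables, and the `AS` names keep the output columns distinct.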

SQL INNER JOIN duplicate columns

I am trying to return some columns from 2 tables which share an ID column using the following browser database query system, which reads from the tables shown on this webpage. I believe the way to do this is by using INNER JOIN (e.g. see this guide).
SELECT sami_dr2.DR2Sample.CATID,
sami_dr2.DR2Sample.Mstar,
sami_dr2.StellarKinematics.PA_STELKIN
FROM sami_dr2.DR2Sample INNER JOIN sami_dr2.StellarKinematics
ON sami_dr2.DR2Sample.CATID = sami_dr2.StellarKinematics.CATID;
However, when I run this query I get the error message:
sql: Duplicate columns are not supported. Try using an alias for those columns within the SELECT clause e.g., SELECT t1.CATAID, t2.CATAID becomes SELECT t1.CATAID as t1_CATAID, t2.CATAID as t2_CATAID
But as far as I'm aware the whole point of using the INNER JOIN is to remove duplications as I'm not returning sami_dr2.StellarKinematics.CATID in my output table, only sami_dr2.DR2Sample.CATID.
I've also found that using SELECT sami_dr2.DR2Sample.CATID as ID_1, sami_dr2.StellarKinematics.CATID as ID_2 in the selection doesn't fix the problem either.
Any help on this would be greatly appreciated!

Hive - Multiple sub-queries in where clause is failing

I am trying to create a table by checking two sub-query expressions within the where clause but my query fails with the below error :
Unsupported sub query expression. Only 1 sub query expression is
supported
Code snippet is as follows (Not the exact code. Just for better understanding) :
Create table winners row format delimited fields terminated by '|' as
select
games,
players
from olympics
where
exists (select 1 from dom_sports where dom_sports.players = olympics.players)
and not exists (select 1 from dom_sports where dom_sports.games = olympics.games)
If I execute the same command with only one sub-query in the where clause, it runs successfully. Is there an alternative way to achieve the same result?
Of course. You can use joins instead.
An inner join acts like EXISTS, and a left join combined with an IS NULL check in the where clause mimics NOT EXISTS.
The joins can change the granularity of the result (duplicate rows), but that depends on your data; DISTINCT guards against it here.
select distinct
olympics.games,
olympics.players
from olympics
inner join dom_sports dom_sports on dom_sports.players = olympics.players
left join dom_sports dom_sports2 on dom_sports2.games = olympics.games
where dom_sports2.games is null
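As a rough check of the rewrite, the sketch below runs both forms against SQLite through Python's sqlite3 module. This is not Hive, and the sample rows are invented; it only shows that the join form returns the same rows as the EXISTS / NOT EXISTS form on this data, and why DISTINCT matters.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE olympics (games TEXT, players TEXT);
    CREATE TABLE dom_sports (games TEXT, players TEXT);
    INSERT INTO olympics VALUES
        ('archery', 'ann'), ('rowing', 'bob'), ('judo', 'cy');
    INSERT INTO dom_sports VALUES
        ('rowing', 'ann'), ('rowing', 'ann'), ('golf', 'bob');
""")

# Original form with two correlated sub-queries.
with_exists = conn.execute("""
    SELECT o.games, o.players FROM olympics o
    WHERE EXISTS (SELECT 1 FROM dom_sports d WHERE d.players = o.players)
      AND NOT EXISTS (SELECT 1 FROM dom_sports d WHERE d.games = o.games)
    ORDER BY o.games
""").fetchall()

# Join form: inner join for EXISTS, left join + IS NULL for NOT EXISTS.
join_body = """
    FROM olympics o
    INNER JOIN dom_sports d1 ON d1.players = o.players
    LEFT JOIN dom_sports d2 ON d2.games = o.games
    WHERE d2.games IS NULL
"""
with_joins = conn.execute(
    "SELECT DISTINCT o.games, o.players" + join_body + "ORDER BY o.games"
).fetchall()

# Without DISTINCT, the duplicated 'ann' row in dom_sports multiplies the output.
without_distinct = conn.execute(
    "SELECT o.games, o.players" + join_body
).fetchall()
```

Here `with_exists` and `with_joins` agree, while `without_distinct` contains the matching row twice because `dom_sports` lists the same player twice.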

Use ORDER BY UNIX_DATE() for join table in Bigquery

In my queries, I have used the unix_date function to group and count the data from backlogs to specific date. All works very well.
..
SELECT
*,
FROM
table1
FULL OUTER JOIN table2 USING (ID)
I'm not sure what I should add to the join to get the right result. I skipped the details of the query because it is too long to include in this post. Please let me know if you need the full query.
Problem: I think the join appends rows instead of just adding the columns from the joined query results, because the same IDs appear many times in all tables (a many-to-many relationship problem). However, I'm not sure how to solve it.
Solved by joining on a composite key.
..
SELECT
*,
FROM
table1
FULL OUTER JOIN table2 USING (ID, Date)
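The effect of the composite key can be seen in a small sketch. It uses SQLite through Python's sqlite3 module rather than BigQuery, an inner join rather than FULL OUTER JOIN (which older SQLite versions lack), and invented columns `opened` and `closed`; all of these are assumptions for illustration only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table1 (ID INTEGER, Date TEXT, opened INTEGER);
    CREATE TABLE table2 (ID INTEGER, Date TEXT, closed INTEGER);
    INSERT INTO table1 VALUES (1, '2020-01-01', 4), (1, '2020-01-02', 6);
    INSERT INTO table2 VALUES (1, '2020-01-01', 2), (1, '2020-01-02', 3);
""")

# Joining on ID alone pairs every table1 row with every table2 row
# that shares the ID, so rows multiply.
by_id = conn.execute(
    "SELECT * FROM table1 JOIN table2 USING (ID)"
).fetchall()

# Joining on (ID, Date) pairs each row only with its counterpart
# for the same date.
by_id_and_date = conn.execute(
    "SELECT * FROM table1 JOIN table2 USING (ID, Date)"
).fetchall()
```

With two rows per ID on each side, the single-key join yields four rows and the composite-key join yields two.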

How to fix: Only one expression can be specified in the select list when the subquery is not introduced with EXISTS

I've been through a bunch of existing posts but couldn't get this to
work. I'm trying to build a query to get all the records in a table and
an extra column. The extra column is populated by this logic - the
first value represented in the row which has same session ID as the
original record and has ToolName=ReportingTool. When I try to
implement the query like this, I get this error.
I tried doing a left join but the problem there is I don't know how to
limit the left join output (from the right table's select) to 1. This
causes multiple matches per left row, and the number of records returned
changes. My query is as follows:
SELECT
*
FROM [TraceDB].[dbo].[TelemetryLogs] AS TelemetryOuter
LEFT JOIN [TraceDB].[dbo].[TelemetryLogs] AS TelemetryInner
ON
TelemetryInner.SessionID = TelemetryOuter.SessionID AND
TelemetryInner.ToolName='ReportingTool' AND
TelemetryInner.Name='Identity' AND
TelemetryInner.SessionID = (
SELECT TOP 1 *
FROM [TraceDB].[dbo].[TelemetryLogs] AS TelemtryIntInt
WHERE TelemtryIntInt.SessionID=TelemetryInner.SessionID
)
WHERE
TelemetryOuter.ToolName ='ReportingTool'
EDIT: Fixed a comma which got introduced as a copy-paste typo
Try
SELECT TOP 1 TelemtryIntInt.SessionID
in your inner SELECT. You're currently returning the whole row, and you can't compare a scalar SessionID against a whole row.
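The same rule can be reproduced outside SQL Server. The sketch below uses SQLite through Python's sqlite3 module, so `LIMIT 1` replaces `TOP 1` and the error text differs from the SQL Server message; the table contents and the `count_matches` helper are hypothetical. It compares a scalar against a sub-query that selects one column, then against one that selects `*`.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TelemetryLogs (SessionID INTEGER, ToolName TEXT, Name TEXT);
    INSERT INTO TelemetryLogs VALUES
        (1, 'ReportingTool', 'Identity'),
        (1, 'OtherTool', 'Start'),
        (2, 'ReportingTool', 'Identity');
""")


def count_matches(select_list):
    """Count rows whose SessionID equals the sub-query result."""
    query = f"""
        SELECT COUNT(*) FROM TelemetryLogs AS o
        WHERE o.SessionID = (
            SELECT {select_list} FROM TelemetryLogs AS i
            WHERE i.SessionID = o.SessionID
            LIMIT 1
        )
    """
    return conn.execute(query).fetchone()[0]


# One column in the sub-query: a scalar compared with a scalar.
single_column = count_matches("i.SessionID")

# Three columns in the sub-query: the comparison is rejected.
try:
    count_matches("*")
    star_error = None
except sqlite3.Error as exc:
    star_error = str(exc)
```

Selecting a single column makes the sub-query a scalar expression, which is what the `=` comparison requires.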