BigQuery select, join from multiple datasets and avoid name conflicts

Imagine I have several datasets and tables.
Format: dataset.table.field
dataset01.table_xxx.field_z
dataset02.table_xxx.field_z
I tried to write something like:
select
dataset01.table_xxx.field_z as dataset01_table_xxx_field_z,
dataset02.table_xxx.field_z as dataset02_table_xxx_field_z
from dataset01.table_xxx
join dataset02.table_xxx on dataset02.table_xxx.field_z = dataset01.table_xxx.field_z
to avoid conflicting names.
BigQuery says that dataset01.table_xxx.field_xxx is an unrecognised name in the SELECT clause.
It complains about an unrecognised name in the JOIN clause too.
The query works if I remove dataset01 and dataset02 from the SELECT clause and the ON condition.
What is the right way to refer to fields in such a case?

select
t1.field_z as dataset01_table_xxx_field_z,
t2.field_z as dataset02_table_xxx_field_z
from dataset01.table_xxx t1
join dataset02.table_xxx t2
on t2.field_z = t1.field_z
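The alias approach can be tried locally. This is only a sketch: SQLite (via Python's sqlite3) stands in for BigQuery, and two attached in-memory databases play the role of dataset01 and dataset02. The sample rows are made up for illustration.

```python
import sqlite3

# Assumption: SQLite stands in for BigQuery; the attached in-memory
# databases imitate the two datasets from the question.
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS dataset01")
conn.execute("ATTACH DATABASE ':memory:' AS dataset02")
conn.execute("CREATE TABLE dataset01.table_xxx (field_z INTEGER)")
conn.execute("CREATE TABLE dataset02.table_xxx (field_z INTEGER)")
conn.executemany("INSERT INTO dataset01.table_xxx VALUES (?)", [(1,), (2,)])
conn.executemany("INSERT INTO dataset02.table_xxx VALUES (?)", [(2,), (3,)])

# Each table gets a short alias; columns are then referenced through
# the alias instead of the full dataset.table path.
query = """
select
  t1.field_z as dataset01_table_xxx_field_z,
  t2.field_z as dataset02_table_xxx_field_z
from dataset01.table_xxx t1
join dataset02.table_xxx t2
  on t2.field_z = t1.field_z
"""
cursor = conn.execute(query)
columns = [d[0] for d in cursor.description]
rows = cursor.fetchall()
print(columns)
print(rows)
```

The output columns carry the long, unambiguous names from the AS clauses, while the query body only ever mentions t1 and t2.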

Related

Best way to "SELECT *" from multiple tabs in sql without duplicate?

I am trying to retrieve all the data stored in 2 tables from my database through a SELECT statement.
The problem is there are a lot of columns in each table and manually selecting each column would be a pain in the ass.
So naturally I thought about using a join:
select * from equipment
join data
on equipment.id = data.equipmentId
The problem is I am getting the equipment ID 2 times in the result.
I thought that maybe some specific join could help me filter out the duplicate key, but I can't manage to find a way...
Is there any way to filter out the foreign key or is there a better way to do the whole thing (I would rather not have to post process the data to manually remove those duplicate columns)?
You can use USING clause.
"The USING clause specifies which columns to test for equality when
two tables are joined. It can be used instead of an ON clause in the
JOIN operations that have an explicit join clause."
select *
from test
join test2 using(id)
You can also use NATURAL JOIN
select *
from test
natural join test2;
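To see the difference in the projected columns, here is a small sketch using Python's sqlite3 (SQLite is an assumption here, other engines may name or order columns differently). Note that USING and NATURAL JOIN only help when the join column has the same name in both tables, which is the case for the test/test2 tables above but not for equipment.id and data.equipmentId.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical columns a and b, added only so each table has a payload.
conn.execute("CREATE TABLE test (id INTEGER, a TEXT)")
conn.execute("CREATE TABLE test2 (id INTEGER, b TEXT)")
conn.execute("INSERT INTO test VALUES (1, 'x')")
conn.execute("INSERT INTO test2 VALUES (1, 'y')")

def column_names(sql):
    """Return the output column names of a query."""
    return [d[0] for d in conn.execute(sql).description]

# ON keeps the join column from both sides.
on_cols = column_names("select * from test join test2 on test.id = test2.id")
# USING and NATURAL JOIN project the shared column only once.
using_cols = column_names("select * from test join test2 using(id)")
natural_cols = column_names("select * from test natural join test2")

print(on_cols)
print(using_cols)
print(natural_cols)
```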

Hive - Multiple sub-queries in where clause is failing

I am trying to create a table by checking two sub-query expressions within the where clause but my query fails with the below error :
Unsupported sub query expression. Only 1 sub query expression is
supported
Code snippet is as follows (Not the exact code. Just for better understanding) :
Create table winners row format delimited fields terminated by '|' as
select
games,
players
from olympics
where
exists (select 1 from dom_sports where dom_sports.players = olympics.players)
and not exists (select 1 from dom_sports where dom_sports.games = olympics.games)
If I execute the same command with only one sub-query in the where clause, it runs successfully. Is there an alternative way to achieve the same result?
Of course. You can use a left join.
An inner join acts like EXISTS, and a left join plus an IS NULL filter in the where clause mimics NOT EXISTS.
The inner join can multiply rows when dom_sports holds several matches for a player, which is why the query uses DISTINCT; whether that matters depends on your data.
select distinct
olympics.games,
olympics.players
from olympics
inner join dom_sports dom_sports on dom_sports.players = olympics.players
left join dom_sports dom_sports2 on dom_sports2.games = olympics.games
where dom_sports2.games is null
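The equivalence can be checked on a toy data set. This is a sketch under assumptions: SQLite (through Python's sqlite3) accepts both sub-queries in one where clause, so it can run the original form next to the join form, and the rows below are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE olympics (games TEXT, players TEXT)")
conn.execute("CREATE TABLE dom_sports (games TEXT, players TEXT)")
conn.executemany("INSERT INTO olympics VALUES (?, ?)", [
    ("archery", "ann"),  # player known, game unknown -> kept
    ("rowing", "bob"),   # player known, game known -> dropped
    ("judo", "cid"),     # player unknown -> dropped
])
conn.executemany("INSERT INTO dom_sports VALUES (?, ?)", [
    ("rowing", "ann"),
    ("rowing", "ann"),   # duplicate match, shows why DISTINCT is needed
    ("chess", "bob"),
])

subquery_rows = conn.execute("""
    select games, players from olympics
    where exists (select 1 from dom_sports
                  where dom_sports.players = olympics.players)
      and not exists (select 1 from dom_sports
                      where dom_sports.games = olympics.games)
""").fetchall()

join_rows = conn.execute("""
    select distinct olympics.games, olympics.players
    from olympics
    inner join dom_sports on dom_sports.players = olympics.players
    left join dom_sports dom_sports2
      on dom_sports2.games = olympics.games
    where dom_sports2.games is null
""").fetchall()

print(subquery_rows)
print(join_rows)
```

One caveat with this rewrite: DISTINCT also collapses rows that are genuinely duplicated in olympics, which the EXISTS form would keep.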

How to use Except clause in Bigquery?

I am trying to use the existing Except clause in Bigquery. Please find my query below
select * EXCEPT (b.hosp_id, b.person_id,c.hosp_id) from
person a
inner join hospital b
on a.hosp_id= b.hosp_id
inner join reading c
on a.hosp_id= c.hosp_id
As you can see, I am using 3 tables. All 3 tables have the hosp_id column, so I would like to remove the duplicate columns, which are b.hosp_id and c.hosp_id. Similarly, I would like to remove the b.person_id column as well.
When I execute the above query, I get the syntax error as shown below
Syntax error: Expected ")" or "," but got "." at [9:19]
Please note that all the columns I am using in the EXCEPT clause are present in the tables used. Additionally, all the tables used are temp tables created using a WITH clause. When I do the same manually by selecting the columns of interest, it works fine. But I have many columns and can't do this manually.
Can you help? I am trying to learn BigQuery, so any input would help.
I use the EXCEPT on a per-table basis:
select p.* EXCEPT (hosp_id, person_id),
h.*,
r.* EXCEPT (hosp_id)
from person p inner join
hospital h
on p.hosp_id = h.hosp_id inner join
reading r
on p.hosp_id = r.hosp_id;
Note that this also uses meaningful abbreviations for table aliases, which makes the query much simpler to understand.
In your case, I don't think you need EXCEPT at all if you use the USING clause.
Try this instead:
select * EXCEPT (person_id) from
person a
inner join hospital b
using (hosp_id)
inner join reading c
using (hosp_id)
You can only put column names (not paths) in the EXCEPT list, and you can simply avoid projecting the duplicate columns with USING instead of ON.

Hive Query: defining a variable which is a list of strings

How can I create a constant list and use it in the WHERE clause of my query?
For example, I have a hive query, where I say
Select t1.Id,
t1.symptom
from t1
WHERE lower(symptom) NOT IN ('coughing','sneezing','xyz', etc,...)
Instead of repeating this long list of symptoms (which makes the code very ugly), is there a way to define it ahead of time as
MyList = ('coughing','sneezing','x',...)
and then in the WHERE clause just say WHERE lower(symptom) not in MyList?
You can put the list in a table and use a subquery:
Select t1.Id, t1.symptom
from t1
where lower(symptom) NOT IN (select symptom from mysymptoms_list);
This persists the list, so it can be used in multiple queries.
You can use a Hive variable to do this.
SET hivevar:InClause=('coughing','sneezing','x',...)
Make sure you don't leave spaces on either side of the equals sign.
SELECT t1.Id,
t1.symptom
FROM t1
WHERE LOWER(symptom) NOT IN ${InClause}
If you are comfortable with joins, you can use a left join with a where clause:
Select A.Id, A.symptom
from
t1 A left join MyList B
on
lower(A.symptom) = lower(B.symptom)
where B.symptom IS NULL;
The left join keeps every row of t1 (A.symptom) and adds a second column (B.symptom) from MyList, which equals the symptom in t1 when a match is found and is NULL when it is not.
You want the rows where no match is found, hence the where clause.
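As a quick check that the table-backed NOT IN and the left join return the same rows, here is a sketch using Python's sqlite3. SQLite stands in for Hive, and the table contents are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t1 (Id INTEGER, symptom TEXT)")
conn.execute("CREATE TABLE MyList (symptom TEXT)")
conn.executemany(
    "INSERT INTO t1 VALUES (?, ?)",
    [(1, "Coughing"), (2, "fever"), (3, "SNEEZING"), (4, "rash")],
)
# The exclusion list is stored once, in lower case.
conn.executemany("INSERT INTO MyList VALUES (?)", [("coughing",), ("sneezing",)])

not_in_rows = conn.execute("""
    select t1.Id, t1.symptom
    from t1
    where lower(t1.symptom) not in (select symptom from MyList)
    order by t1.Id
""").fetchall()

# Anti-join: keep only the rows of t1 that found no partner in MyList.
anti_join_rows = conn.execute("""
    select A.Id, A.symptom
    from t1 A left join MyList B
      on lower(A.symptom) = lower(B.symptom)
    where B.symptom is null
    order by A.Id
""").fetchall()

print(not_in_rows)
print(anti_join_rows)
```

Both forms read the list from one place, so adding a symptom means inserting a row rather than editing every query.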

BigQuery - joining on a repeated field

I'm trying to run a join on a repeated field.
Originally I get an error:
Cannot join on repeated field payload.pages.action
I tried to fix this by running flatten on the relevant table (this is only an example query; it would return an empty result even if it ran successfully):
SELECT
t1.repository.forks
FROM publicdata:samples.github_nested t1
left join each flatten(publicdata:samples.github_nested,payload.pages) t2
on t2.payload.pages.action=t1.repository.url
I get a different error:
Table wildcard function 'FLATTEN' can only appear in FROM clauses
This used to work in the past. Is there some syntax change?
I don't think there has been a syntax change, but you should be able to wrap the flatten statement in a subselect. That is,
SELECT
t1.repository.forks
FROM publicdata:samples.github_nested t1
left join each (SELECT * FROM flatten(publicdata:samples.github_nested,payload.pages)) t2
on t2.payload.pages.action=t1.repository.url