Joining tables with incompatible types - sql

I'm trying to join two tables using this command:
SELECT * FROM bigquery-public-data.github_repos.files INNER JOIN bigquery-public-data.github_repos.commits USING (repo_name)
but there are incompatible types on either side of the join: STRING and ARRAY<STRING>. Is there a way around this?
Thank you!

You want to join a 2 billion row table with a 200 million row one. This won't end well unless you restrict what you want to get out of it.
As for the technical problem with this query: the error says you are trying to JOIN a single value with an array of values. You need to UNNEST() that array first.
This would work syntactically:
SELECT *
FROM `bigquery-public-data.github_repos.files` a
INNER JOIN (
SELECT * EXCEPT(repo_name)
FROM `bigquery-public-data.github_repos.commits`
, UNNEST(repo_name) repo
) b
ON a.repo_name=b.repo
But if you go for it, it will use all your free monthly quota (1TB of data scanned) for no good purpose, as far as I can tell.
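To see the unnest-then-join pattern on a small scale without touching the public dataset, here is a minimal sketch. It is not BigQuery: it assumes SQLite (through Python's sqlite3 module) as a stand-in, with json_each() playing the role of UNNEST(). The table and column names mirror the question, but the sha and path columns and all the rows are made up for illustration.

```python
import json
import sqlite3

# SQLite stands in for BigQuery here; json_each() plays the role of UNNEST().
# Table names mirror the question, but the columns and rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE files   (repo_name TEXT, path TEXT);
    CREATE TABLE commits (sha TEXT, repo_name TEXT);  -- repo_name holds a JSON array
""")
conn.executemany("INSERT INTO files VALUES (?, ?)",
                 [("a/x", "README.md"), ("b/y", "main.py")])
conn.executemany("INSERT INTO commits VALUES (?, ?)",
                 [("c1", json.dumps(["a/x", "b/y"])),
                  ("c2", json.dumps(["a/x"]))])

# Flatten the array to one row per (commit, repo), then join on the scalar value.
rows = conn.execute("""
    SELECT f.repo_name, f.path, b.sha
    FROM files AS f
    INNER JOIN (
        SELECT c.sha, j.value AS repo
        FROM commits AS c, json_each(c.repo_name) AS j
    ) AS b ON f.repo_name = b.repo
    ORDER BY f.repo_name, b.sha
""").fetchall()

for row in rows:
    print(row)
```

Commit c1 touches two repositories, so it is flattened into two rows before the join and matches a file in each of them.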


Best way to "SELECT *" from multiple tabs in sql without duplicate?

I am trying to retrieve all the data stored in 2 tables from my database through a SELECT statement.
The problem is that there are a lot of columns in each table, and manually selecting each column would be a pain in the ass.
So naturally I thought about using a join:
select * from equipment
join data
on equipment.id = data.equipmentId
The problem is I am getting the equipment ID 2 times in the result.
I thought that maybe some specific join could help me filter out the duplicate key, but I can't manage to find a way...
Is there any way to filter out the foreign key or is there a better way to do the whole thing (I would rather not have to post process the data to manually remove those duplicate columns)?
You can use the USING clause. Note that it requires the join column to have the same name in both tables.
"The USING clause specifies which columns to test for equality when
two tables are joined. It can be used instead of an ON clause in the
JOIN operations that have an explicit join clause."
select *
from test
join test2 using(id)
You can also use NATURAL JOIN:
select *
from test
natural join test2;
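The difference between ON, USING and NATURAL JOIN can be checked by counting the columns each one returns. The sketch below assumes SQLite via Python's sqlite3 module; the test and test2 tables follow the answer, while their columns (id, name, reading) and rows are hypothetical. Both tables share a name column on purpose, to show that NATURAL JOIN joins on every column name the tables have in common, not only on the key.

```python
import sqlite3

# Hypothetical tables: both share "id" and also a "name" column.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE test  (id INTEGER, name TEXT);
    CREATE TABLE test2 (id INTEGER, name TEXT, reading REAL);
    INSERT INTO test  VALUES (1, 'pump'), (2, 'valve');
    INSERT INTO test2 VALUES (1, 'pump', 3.5), (2, 'sensor', 7.0);
""")

def columns_and_rows(sql):
    """Run a query and return (column names, rows)."""
    cur = conn.execute(sql)
    return [d[0] for d in cur.description], cur.fetchall()

# ON keeps the join column from both sides.
on_cols, on_rows = columns_and_rows(
    "SELECT * FROM test JOIN test2 ON test.id = test2.id ORDER BY test.id")

# USING (id) returns the join column once; the two name columns both remain.
using_cols, using_rows = columns_and_rows(
    "SELECT * FROM test JOIN test2 USING (id) ORDER BY test.id")

# NATURAL JOIN matches on id AND name, so the second row silently drops out.
natural_cols, natural_rows = columns_and_rows(
    "SELECT * FROM test NATURAL JOIN test2")

print(len(on_cols), len(using_cols), len(natural_cols))
print(natural_rows)
```

USING is usually the safer choice of the two, because the join columns are spelled out and adding a column to one table later cannot change the join condition.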

Unnesting 3rd level dependency in Google BigQuery

I'm trying to replace the schema of an existing table using BQ. Certain fields in BQ have 3 to 5 levels of schema nesting.
For example, the field comsalesorders.comSalesOrdersInfo.storetransactionid is nested under two fields.
Since I'm using this to replace the existing table, I cannot change the field names in the query.
The query looks similar to this:
SELECT * REPLACE(comsalesorders.comSalesOrdersInfo.storetransactionid AS STRING) FROM CentralizedOrders_streaming.orderStatusUpdated, UNNEST(comsalesorders) AS comsalesorders, UNNEST(comsalesorders.comSalesOrdersInfo) AS comsalesorders.comSalesOrdersInfo
BQ allows unnesting the first-level field but reports a problem for the second level of nesting.
What changes do I need to make to this query to use UNNEST() for such dependent schemas?
Given that you didn't provide a schema, I will try to give a generalized answer. Please try to understand the difference between the 2 queries.
-- Provide an alias for each unnest (as if each is a separate table)
select c.stuff
from table
left join unnest(table.first_level_nested) a
left join unnest(a.second_level_nested) b
left join unnest(b.third_level_nested) c
-- b and c won't work here because you are 'double unnesting'
select c.stuff
from table
left join unnest(table.first_level_nested) a
left join unnest(first_level_nested.second_level_nested) b
left join unnest(first_level_nested.second_level_nested.third_level_nested) c
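The rule in the first query (each level is unnested from the alias of the level above it) can be tried out locally. This is a rough sketch that assumes SQLite via Python's sqlite3 module, with json_each() standing in for BigQuery's UNNEST(); the orders table, its order_id column and the sample document are invented for illustration. It uses comma joins, so unlike the LEFT JOIN version above, a row whose array is empty would drop out.

```python
import json
import sqlite3

# SQLite stand-in: json_each() plays the role of UNNEST().
# The "orders" table and its contents are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, first_level_nested TEXT)")

doc = [
    {"second_level_nested": [
        {"third_level_nested": [{"stuff": "s1"}, {"stuff": "s2"}]},
        {"third_level_nested": [{"stuff": "s3"}]},
    ]},
]
conn.execute("INSERT INTO orders VALUES (?, ?)", (1, json.dumps(doc)))

# Each level reads from the alias of the previous level (a, then b, then c),
# never from the full path starting at the table.
rows = conn.execute("""
    SELECT o.order_id, json_extract(c.value, '$.stuff') AS stuff
    FROM orders AS o,
         json_each(o.first_level_nested) AS a,
         json_each(a.value, '$.second_level_nested') AS b,
         json_each(b.value, '$.third_level_nested') AS c
    ORDER BY stuff
""").fetchall()

for row in rows:
    print(row)
```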
I'm not sure I understand your question, but my guess is that you want to change the type of one column to another type, such as STRING.
The UNNEST function only works on columns of an array type, for example:
"comsalesorders":[{"comSalesOrdersInfo":{}}, {"comSalesOrdersInfo":{}}, {"comSalesOrdersInfo":{}}]
But not on columns of this kind:
"comSalesOrdersInfo":{"storeTransactionID":"X1056-943462","ItemsWarrenty":0,"currencyCountry":"USD"}
Therefore, if I haven't misunderstood your question, I would write a query like this:
SELECT *, CAST(A.comSalesOrdersInfo.storeTransactionID as STRING)
FROM `TABLE`, UNNEST(comsalesorders) as A

How to use Except clause in Bigquery?

I am trying to use the EXCEPT clause in BigQuery. Please find my query below:
select * EXCEPT (b.hosp_id, b.person_id,c.hosp_id) from
person a
inner join hospital b
on a.hosp_id= b.hosp_id
inner join reading c
on a.hosp_id= c.hosp_id
As you can see, I am using 3 tables. All 3 tables have the hosp_id column, so I would like to remove the duplicate columns b.hosp_id and c.hosp_id. Similarly, I would like to remove the b.person_id column as well.
When I execute the above query, I get the syntax error shown below:
Syntax error: Expected ")" or "," but got "." at [9:19]
Please note that all the columns I am using in the EXCEPT clause are present in the tables used. Also, all the tables used are temp tables created using a WITH clause. When I do the same manually by selecting the columns of interest, it works fine. But I have many columns and can't list them all manually.
Can you help? I am trying to learn BigQuery. Your input would help.
I use the EXCEPT on a per-table basis:
select p.* EXCEPT (hosp_id, person_id),
h.*,
r.* EXCEPT (hosp_id)
from person p inner join
hospital h
on p.hosp_id = h.hosp_id inner join
reading r
on p.hosp_id = r.hosp_id;
Note that this also uses meaningful abbreviations for table aliases, which makes the query much simpler to understand.
In your case, I don't think you need EXCEPT at all if you use the USING clause.
Try this instead:
select * EXCEPT (person_id) from
person a
inner join hospital b
using (hosp_id)
inner join reading c
using (hosp_id)
You can only put column names (not paths) in the EXCEPT list, and you can simply avoid projecting the duplicate columns with USING instead of ON.
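The effect of chaining USING across three tables can be checked by counting the columns in the result. The sketch below assumes SQLite via Python's sqlite3 module rather than BigQuery (SQLite has no SELECT * EXCEPT, so only the USING part is shown); the table names follow the question, while the columns other than hosp_id and person_id and all the rows are hypothetical.

```python
import sqlite3

# Hypothetical miniature versions of the three tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE person   (person_id INTEGER, hosp_id INTEGER, person_name TEXT);
    CREATE TABLE hospital (hosp_id INTEGER, hosp_name TEXT);
    CREATE TABLE reading  (hosp_id INTEGER, value REAL);
    INSERT INTO person   VALUES (10, 1, 'ann');
    INSERT INTO hospital VALUES (1, 'general');
    INSERT INTO reading  VALUES (1, 98.6);
""")

def column_names(sql):
    """Return the result column names of a query."""
    return [d[0] for d in conn.execute(sql).description]

# With ON, every table contributes its own copy of hosp_id.
on_cols = column_names("""
    SELECT * FROM person a
    INNER JOIN hospital b ON a.hosp_id = b.hosp_id
    INNER JOIN reading c ON a.hosp_id = c.hosp_id
""")

# With USING, hosp_id appears once in the result.
using_cols = column_names("""
    SELECT * FROM person a
    INNER JOIN hospital b USING (hosp_id)
    INNER JOIN reading c USING (hosp_id)
""")

print(len(on_cols), len(using_cols))
print(using_cols)
```

The position of the shared column within SELECT * can differ between engines, so it is safer to rely on column names than on column order.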

Join in Google BigQuery via Cloud Datalab

I am trying to do JOIN on two columns from two different tables (one of them is a view) in Google BigQuery. I have tried this numerous ways, but have received this error the most consistently:
invalidQuery: 2.1 - 0.0: JOIN cannot be applied directly to a table union or to a table wildcard function. Consider wrapping the table union or table wildcard function in a subquery (e.g., SELECT *).
Here is my SQL (legacy) query:
SELECT
blp_today.beta_key,
blp_today.px_last,
blp_today.eqy_weighted_avg_px,
blp_today.created_date,
blp_today.security_ticker,
ciq_company_stg.ticker,
ciq_company_stg.ciq
FROM
[fcm-dw:acquisition_bloomberg.blp_today],
[fcm-dw:acquisition_ciq]
JOIN
blp_today.security_ticker AS ticker
ON
blp_today.security_ticker = ciq_company_stg.ticker
LIMIT 1000
Any help would be much appreciated.
I think you want something like this:
SELECT * FROM(SELECT
beta_key,
px_last,
eqy_weighted_avg_px,
created_date,
security_ticker,
FROM
[fcm-dw:acquisition_bloomberg.blp_today],
[fcm-dw:acquisition_ciq] ) as a
JOIN
blp_today.security_ticker AS ticker
ON
a.security_ticker = ciq_company_stg.ticker
LIMIT 1000
Edit: I missed earlier that what you are joining (after your JOIN keyword) does not actually seem to be a table. Are you trying to join or to union these two tables: [fcm-dw:acquisition_bloomberg.blp_today] and [fcm-dw:acquisition_ciq]? And is the latter even a table? Your code seems to indicate that there is another table named [fcm-dw:acquisition_ciq.ciq_company_stg].
In legacy SQL, a comma between tables means UNION ALL, not a join. First wrap your union in a subselect, then join the result:
select ...
FROM
(select * from
[fcm-dw:acquisition_bloomberg.blp_today],
[fcm-dw:acquisition_ciq] ) t
JOIN
blp_today.security_ticker AS ticker

How to get names present in both views?

I have a very large view of 5 million records containing repeated names, with each row having a unique transaction number. Another view of 9000 records contains unique names. I want to retrieve the records in the first view whose names are present in the second view:
select * from v1 where name in (select name from v2)
But the query takes very long to run. Is there a shortcut?
Did you try just using an INNER JOIN? This will return all rows that exist in both tables:
select v1.*
from v1
INNER JOIN v2
on v1.name = v2.name
If you need help learning JOIN syntax, here is a great visual explanation.
You can add the DISTINCT keyword which will remove any duplicate values that the query returns.
Use JOIN.
DISTINCT lets you return only unique records: because you are joining to the other table, a record may have more than one match there.
SELECT DISTINCT a.*
FROM v1 a
INNER JOIN v2 b
ON a.name = b.name
For faster performance, add an index on the name column of both tables, since you are joining on it.
To learn more about joins, see the link below:
Visual Representation of SQL Joins
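To check that the IN query and the JOIN query really return the same rows, and to see when DISTINCT matters, here is a small sketch. It assumes SQLite via Python's sqlite3 module, uses plain tables named v1 and v2 in place of the views, and the txn column, index names and rows are made up. It says nothing about speed on the real 5 million row view; it only compares results.

```python
import sqlite3

# Plain tables stand in for the two views; all rows are invented.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE v1 (txn INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE v2 (name TEXT);
    INSERT INTO v1 VALUES (1, 'ann'), (2, 'bob'), (3, 'ann'), (4, 'cid');
    INSERT INTO v2 VALUES ('ann'), ('cid');
    -- Indexes on the join column, as suggested above.
    CREATE INDEX idx_v1_name ON v1 (name);
    CREATE INDEX idx_v2_name ON v2 (name);
""")

IN_SQL = "SELECT * FROM v1 WHERE name IN (SELECT name FROM v2) ORDER BY txn"
JOIN_SQL = ("SELECT v1.* FROM v1 INNER JOIN v2 ON v1.name = v2.name "
            "ORDER BY v1.txn")
DISTINCT_SQL = ("SELECT DISTINCT v1.* FROM v1 INNER JOIN v2 "
                "ON v1.name = v2.name ORDER BY v1.txn")

# While names in v2 are unique, IN and JOIN return the same rows.
in_rows = conn.execute(IN_SQL).fetchall()
join_rows = conn.execute(JOIN_SQL).fetchall()

# Once v2 contains a repeated name, the plain JOIN multiplies matching rows
# and DISTINCT is needed to get back the IN result.
conn.execute("INSERT INTO v2 VALUES ('ann')")
dup_rows = conn.execute(JOIN_SQL).fetchall()
distinct_rows = conn.execute(DISTINCT_SQL).fetchall()

print(in_rows)
print(len(dup_rows), len(distinct_rows))
```

If the names in the second view are guaranteed unique, as the question states, the DISTINCT is unnecessary and only adds work.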