Making IPython BigQuery Magic Function SQL query dynamic - google-bigquery

I am using the bigquery magic function in Jupyter and would like to be able to change the project and dataset dynamically. For example, instead of
%%bigquery table
SELECT * FROM `my_project.my_dataset.my_table`
I want
project = 'my_project'
dataset = 'my_dataset'
%%bigquery table
'SELECT * FROM `{}.{}.my_table`'.format(project,dataset)

According to the IPython Magics for BigQuery documentation, it is not possible to pass the project or the dataset as parameters; nonetheless, you can use the BigQuery client library to perform this action in a Jupyter notebook.
from google.cloud import bigquery
client = bigquery.Client()
project = 'bigquery-public-data'
dataset = 'baseball'
sql ="""SELECT * FROM `{}.{}.games_wide` LIMIT 10"""
query=sql.format(project,dataset)
query_job = client.query(query)
print("The query data:")
for row in query_job:
    # Row values can be accessed by field name or index.
    print("gameId={}, seasonId={}".format(row[0], row["seasonId"]))
I also recommend taking a look at the public documentation to learn how to visualize BigQuery data in Jupyter notebooks.

Related

Data to Native table

I want to move data from my external tables to native tables.
Please let me know if there is a simple way of doing this.
Thank you.
The better way would be to create native tables directly from Google Cloud Storage itself; the external tables are read-only, so you cannot modify them in place. A load-from-GCS sketch is shown just below, followed by the code for your actual use case, i.e. moving data from multiple external tables to native tables.
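If the source files are already in Cloud Storage, a load job can create the native table directly. This is only a sketch, with gs://my-bucket/my-file.csv and ProjectID.Dataset2.native_table as hypothetical placeholders:
from google.cloud import bigquery

client = bigquery.Client()
# Hypothetical source file and destination table, replace with your own values.
uri = "gs://my-bucket/my-file.csv"
table_id = "ProjectID.Dataset2.native_table"
job_config = bigquery.LoadJobConfig(
    source_format=bigquery.SourceFormat.CSV,
    autodetect=True,       # infer the schema from the file
    skip_leading_rows=1,   # assumes the file has a header row
)
load_job = client.load_table_from_uri(uri, table_id, job_config=job_config)
load_job.result()  # wait for the load job to complete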
I am using the Python client library, so you need to follow this document to install the library and set up authentication as described before running the code. In case you want to grant BigQuery-specific roles to the service account, you will also need to add the storage.objectViewer role to it, since we are querying external tables. My external tables (along with many other tables) are placed in dataset “Dataset1” and I am copying them to another dataset, “Dataset2” (which I have already created from the console). Please replace ProjectID, Dataset1 and Dataset2 with suitable values.
Code:
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
query1 = """select table_name as table_name from
(SELECT TABLE_NAME as table_name, TABLE_TYPE as table_type
FROM `ProjectID.dataset1.INFORMATION_SCHEMA.TABLES`) t1 where t1.table_type="EXTERNAL"
"""
# The above query returns table names of only external tables present in the dataset.
query_job = client.query(query1)
for row in query_job:
    # Row values can be accessed by field name or index.
    b = row[0]
    query2 = f"select * from `ProjectID.dataset1.{b}`"
    # In query2 I am doing a select * from the tables whose names are returned by query1.
    destination = f"ProjectID.dataset2.{b}"
    # Creating the table in another dataset with the same name as in the first dataset.
    job_config = bigquery.QueryJobConfig(destination=destination)
    query_job2 = client.query(query2, job_config=job_config)
    query_job2.result()
After I run the code, the external tables are copied to Dataset2 as native tables.
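To confirm the result, you can list the tables in the destination dataset and check their types; a small sketch, again with ProjectID and dataset2 as placeholders:
# Native tables report table_type == "TABLE" (external ones report "EXTERNAL").
for table in client.list_tables("ProjectID.dataset2"):
    print(table.table_id, table.table_type)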

How can I use BigQuery's dataset in Data Lab

I have some datasets in BigQuery, and I wonder if there is a way to use the same datasets in Data Lab? As the datasets are big, I can't download them and reload them in Data Lab.
Thank you very much.
The BigQuery Python client library supports querying data stored in BigQuery. To load the commands from the client library, paste the following code into the first cell of the notebook:
%load_ext google.cloud.bigquery
%load_ext is one of the many Jupyter built-in magic commands.
The BigQuery client library provides a %%bigquery cell magic, which runs a SQL query and returns the results as a pandas DataFrame.
You can query data from a public dataset or from the datasets in your project:
%%bigquery
SELECT *
FROM `MY_PROJECT.MY_DATASET.MY_TABLE`
LIMIT 50
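The magic can also store the results in a variable so that you can keep working with the DataFrame in later cells; a minimal sketch, with MY_PROJECT.MY_DATASET.MY_TABLE again as a placeholder:
%%bigquery df
SELECT *
FROM `MY_PROJECT.MY_DATASET.MY_TABLE`
LIMIT 50
The results are then available as the pandas DataFrame df in subsequent cells.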
I was able to successfully get data from the dataset without any issues.
You can follow this tutorial. I hope it helps.

Can I issue a query rather than specify a table when using the BigQuery connector for Spark?

I have used the "Use the BigQuery connector with Spark" guide to extract data from a table in BigQuery by running the code on Google Dataproc. As far as I'm aware, the code shared there:
conf = {
    # Input Parameters.
    'mapred.bq.project.id': project,
    'mapred.bq.gcs.bucket': bucket,
    'mapred.bq.temp.gcs.path': input_directory,
    'mapred.bq.input.project.id': 'publicdata',
    'mapred.bq.input.dataset.id': 'samples',
    'mapred.bq.input.table.id': 'shakespeare',
}
# Output Parameters.
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'
# Load data in from BigQuery.
table_data = sc.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
copies the entirety of the named table into input_directory. The table I need to extract data from contains >500m rows and I don't need all of those rows. Is there a way to instead issue a query (as opposed to specifying a table) so that I can copy a subset of the data from a table?
It doesn't look like BigQuery supports any kind of filtering/querying for table exports at the moment:
https://cloud.google.com/bigquery/docs/exporting-data
https://cloud.google.com/bigquery/docs/reference/rest/v2/jobs#configuration.extract
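One workaround (a pattern, not a connector feature) is to first materialize the subset you need into a staging table with a query job, and then point the connector at that much smaller table. A sketch, assuming the google-cloud-bigquery client is available on the driver and a staging dataset/table named tmp_dataset.subset_table that you create yourself:
from google.cloud import bigquery

client = bigquery.Client(project=project)
# Materialize only the rows you need into a staging table.
staging_table = '{}.tmp_dataset.subset_table'.format(project)
job_config = bigquery.QueryJobConfig(
    destination=staging_table,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE)
query = """
SELECT word, word_count
FROM `bigquery-public-data.samples.shakespeare`
WHERE word_count > 100
"""
client.query(query, job_config=job_config).result()
# Then point the connector at the staging table instead of the full source table.
conf['mapred.bq.input.project.id'] = project
conf['mapred.bq.input.dataset.id'] = 'tmp_dataset'
conf['mapred.bq.input.table.id'] = 'subset_table'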

Google BQ: Running Parameterized Queries where Parameter Variable is the BQ Table Destination

I am trying to run a SQL script from the Linux command line against a BQ table destination. This SQL script will be used for multiple dates, clients, and BQ table destinations, so it requires using parameters in my BQ command-line calls (the --parameter flag). Now, I have followed this link to learn about parameterized queries: https://cloud.google.com/bigquery/docs/parameterized-queries , but it is of limited help when it comes to declaring a table name.
My SQL script, called Advertiser_Date_Check.sql, is the following:
#standardSQL
SELECT *
FROM (SELECT *
FROM @variable_table
WHERE CAST(_PARTITIONTIME AS DATE) = @variable_date) as final
WHERE final.Advertiser IN UNNEST(@variable_clients)
Where the parameter variables represent the following:
variable_table: The BQ Table destination that I want to call
variable_date: The Date that I want to pull from the BQ table
variable_clients: An Array list of specific clients that I want to pull from the data (which is from the date I referenced)
Now, my command line (Linux) call for the BQ data is the following:
TABLE_NAME=table_name_example
BQ_TABLE=$(echo '`project_id.dataset_id.'$TABLE_NAME'`')
TODAY=$(date +%F)
/bin/bq query --use_legacy_sql=false \
--parameter='variable_table::'$BQ_TABLE'' \
--parameter=variable_date::"$TODAY" \
--parameter='variable_clients:ARRAY<STRING>:["Client_1","Client_2","Client_3"]' \
"`cat /path/to/script/Advertiser_Date_Check.sql`"
The @variable_date and @variable_clients parameters have worked just fine in the past when they were the only ones. However, since I want to run this exact SQL command on various tables in a loop, I created a parameter called variable_table. Parameterized queries have to be in standard SQL format, so the table name needs to follow this convention:
`project_id.dataset_id.table_name`
Whenever I try to run this on the Commandline, I usually get the following error:
Error in query string: Error processing job ... : Syntax error: Unexpected "@" at [4:12]
which is referencing the parameter @variable_table, so it's having a hard time processing that this references a table name.
In past attempts, there has even been the error:
project_id.dataset_id.table_name: command not found
But this was mostly due to a badly formed reference to the table destination name. The first error is the most common occurrence.
Overall, my questions regarding this matter are:
How do I reference a BQ table as a parameter in the command line for parameterized queries in the FROM clause (such as what I try to do with @variable_table)? Is it even possible?
Do you know of other methods to run a query on multiple BQ tables from the commandline besides by the way I am currently doing it?
Hope this all makes sense and thank you for your assistance!
From the documentation that you linked:
Parameters cannot be used as substitutes for identifiers, column names, table names, or other parts of the query.
I think what might work for you in this case, though, is performing the injection of the table name as a regular shell variable (instead of a query parameter). You'd want to make sure that you trust the contents of it, or that you are building the string yourself in order to avoid SQL injection. One approach is to have hardcoded constants for the table names and then choose which one to insert into the query text based on the user input.
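If you are open to driving this from Python instead of the bq CLI, the same split works there too: the table name goes in via ordinary string formatting (from a trusted list), while the date and the client list stay real query parameters. A sketch, assuming the google-cloud-bigquery library and a hypothetical allow-list of table names:
from datetime import date
from google.cloud import bigquery

client = bigquery.Client()

# Hypothetical allow-list; only these names can ever reach the query text.
ALLOWED_TABLES = {"table_name_example": "project_id.dataset_id.table_name_example"}
table = ALLOWED_TABLES["table_name_example"]

sql = f"""
SELECT *
FROM (SELECT *
      FROM `{table}`
      WHERE CAST(_PARTITIONTIME AS DATE) = @variable_date) AS final
WHERE final.Advertiser IN UNNEST(@variable_clients)
"""
job_config = bigquery.QueryJobConfig(query_parameters=[
    bigquery.ScalarQueryParameter("variable_date", "DATE", date.today()),
    bigquery.ArrayQueryParameter("variable_clients", "STRING",
                                 ["Client_1", "Client_2", "Client_3"]),
])
rows = client.query(sql, job_config=job_config).result()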
I thought I would just post my example here, which only covers your question about creating a "dynamic table name", but you can also use my approach for your other variables. My approach is to build the name directly in Python just before making the BigQuery API call, using Python's datetime module (assuming you want your variables to be time-based).
Create BigQuery-table via Python BQ API:
from google.colab import auth
from datetime import datetime
from google.cloud import bigquery
auth.authenticate_user()
now = datetime.now()
current_time = now.strftime("%Y%m%d%H%M")
project_id = '<project_id>'
client = bigquery.Client(project=project_id)
table_id = "<project_id>.<dataset_id>.table_"
table_id = table_id + current_time
job_config = bigquery.QueryJobConfig(destination=table_id)
sql = """
SELECT
dataset_id,
project_id,
table_id,
CASE
WHEN type = 1 THEN 'table'
WHEN type = 2 THEN 'view'
WHEN type = 3 THEN 'external'
ELSE '?'
END AS type,
DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date,
TIMESTAMP_MILLIS(creation_time) AS creation_time,
row_count,
size_bytes,
round(safe_divide(size_bytes, (1000*1000)),1) as size_mb,
round(safe_divide(size_bytes, (1000*1000*1000)),3) as size_gb
FROM (select * from `<project_id>:<dataset_id>.__TABLES__`)
ORDER BY dataset_id, table_id asc;
"""
query_job = client.query(sql, job_config=job_config)
query_job.result()
print("Query results loaded to the table {}".format(table_id))
# Output:
# Query results loaded to the table <project_id>.<dataset_id>.table_202101141450
Feel free to copy and test it within a Google Colab notebook. Just fill in your own:
<project_id>
<dataset_id>

Datalab does not populate BigQuery tables

Hi, I have a problem while using IPython notebooks on Datalab.
I want to write the result of a query into a BigQuery table, but it does not work; everyone says to use the insert_data(dataframe) function, but it does not populate my table.
To simplify the problem, I tried to read a table and write it to a newly created table (with the same schema), but it does not work. Can anyone tell me where I am going wrong?
import gcp
import gcp.bigquery as bq
#read the data
df = bq.Query('SELECT 1 as a, 2 as b FROM [publicdata:samples.wikipedia] LIMIT 3').to_dataframe()
#creation of a dataset and extraction of the schema
dataset = bq.DataSet('prova1')
dataset.create(friendly_name='aaa', description='bbb')
schema = bq.Schema.from_dataframe(df)
#creation of the table
temptable = bq.Table('prova1.prova2').create(schema=schema, overwrite=True)
#I try to put the same data into the temptable just created
temptable.insert_data(df)
Calling insert_data will do an HTTP POST and return once that is done. However, it can take some time for the data to show up in the BQ table (up to several minutes). Try waiting a while before using the table. We may be able to address this in a future update; see this.
The hacky way to block until ready right now should be something like:
import time
while True:
    info = temptable._api.tables_get(temptable._name_parts)
    if 'streamingBuffer' not in info:
        break
    if info['streamingBuffer']['estimatedRows'] > 0:
        break
    time.sleep(5)
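If you can use the newer google-cloud-bigquery library instead of the gcp.bigquery module, a batch load job avoids the streaming buffer delay entirely, since rows are queryable as soon as the job finishes. A sketch, assuming that library (and pyarrow) is installed and that df and the prova1.prova2 table exist as above:
from google.cloud import bigquery

client = bigquery.Client()
# Load the DataFrame with a batch load job instead of streaming inserts.
load_job = client.load_table_from_dataframe(df, 'prova1.prova2')
load_job.result()  # blocks until the rows are loaded and queryable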