Would someone please let me know if there is a way to save a BigQuery result in JSON or Avro format? I am using the following code to run a query on a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('/Users/gaurang.shah/Downloads/fb3735b731b9.json')
job_config = bigquery.QueryJobConfig()
job_config.priority = bigquery.QueryPriority.BATCH
sql = """
select *
FROM `bigquery-public-data.samples.shakespeare`
limit 1;
"""
location = 'US'
query_job = client.query(sql, location=location, job_config=job_config)
query_job = client.get_job(query_job.job_id, location=location)
print(query_job.result())
I am trying to export the BigQuery table without using GCS in between, and this is one way I think I could achieve that.
The other way I can think of is the bq command-line tool; however, I am not sure whether it has any limit on how many queries I can fire and how much data I can retrieve.
You need to first run your query, write the results to a table, and then hook into the BigQuery export/extract API, where the results/table can be exported to GCS in the format you want. For example, here's CSV:
# from google.cloud import bigquery
# client = bigquery.Client()
# bucket_name = 'my-bucket'
project = 'bigquery-public-data'
dataset_id = 'samples'
table_id = 'shakespeare'
destination_uri = 'gs://{}/{}'.format(bucket_name, 'shakespeare.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)
extract_job = client.extract_table(
table_ref,
destination_uri,
# Location must match that of the source table.
location='US') # API request
extract_job.result() # Waits for job to complete.
print('Exported {}:{}.{} to {}'.format(
project, dataset_id, table_id, destination_uri))
See the BigQuery documentation on exporting table data for more.
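Since the question asks specifically about JSON or Avro, here is a minimal sketch (assuming the same client, table_ref, and bucket_name as above) that sets destination_format on an ExtractJobConfig; NEWLINE_DELIMITED_JSON and AVRO are the supported values for those formats:
job_config = bigquery.ExtractJobConfig()
# use bigquery.DestinationFormat.AVRO for Avro output instead
job_config.destination_format = bigquery.DestinationFormat.NEWLINE_DELIMITED_JSON
destination_uri = 'gs://{}/{}'.format(bucket_name, 'shakespeare.json')
extract_job = client.extract_table(
    table_ref,
    destination_uri,
    job_config=job_config,
    location='US')  # Location must match that of the source table.
extract_job.result()  # Waits for job to complete.
Note that this still writes to GCS; the extract API does not export directly to a local file.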
This is a continuation of my previous post about making an API that takes a URL parameter, passes it to BigQuery, and returns True if the luid record has data in the orderid column: How to check whether data exists in a specific column on BigQuery with Flask?
I changed the SQL, and it seems to work well in the GCP console, but as you can see, it returns False ({'f0_': 0}) if you pass a correct parameter from the browser. Do I need to fix this SQL?
[URL:https://test-989898.df.r.appspot.com?luid=U77777]
The output of return str(row)
↓
Row((True,), {'f0_': 0})
The output of the SQL with the same luid on the console
↓
row | f0_
1 | true
SELECT EXISTS(SELECT 1
FROM `test-266110.conversion_log.conversion_log_2020*` as p
WHERE luid = "U77777" AND orderid != '' limit 1000)
I also tried the approach from the article below. Can user-input parameters not be used in BigQuery?
https://cloud.google.com/bigquery/docs/parameterized-queries
@app.route('/')
def get_request():
luid = request.args.get('luid') or ''
client = bigquery.Client()
query = """SELECT EXISTS(SELECT 1
FROM `test-266110.conversion_log.conversion_log_2020*` as p
WHERE @luid = p.luid AND orderid != '' limit 1000)"""
job_config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ScalarQueryParameter("luid", "STRING", luid),
]
)
query_job = client.query(query, job_config=job_config)
query_res = query_job.result()
for row in query_res:
return str(row)
↓
Row((True,), {'f0_': 0})
I've been stuck on this problem for a while, and I welcome any ideas. Does anyone have a good solution?
from flask import Flask, request, jsonify
from google.cloud import bigquery
app = Flask(__name__)
@app.route('/')
def get_request():
luid = request.args.get('luid') or ''
client = bigquery.Client()
query = """SELECT EXISTS(SELECT 1
FROM `test-266110.conversion_log.conversion_log_2020*` as p
WHERE @luid = p.luid AND orderid != '' limit 1000)"""
job_config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ScalarQueryParameter("luid", "STRING", luid),
]
)
query_job = client.query(query, job_config=job_config)
query_res = query_job.result()
# first_row = next(iter(query_job.result()))
for row in query_res:
return str(row)
#return jsonify({luid:query_res.total_rows})
"""
if query_res == :
return jsonify({luid: str(True)})
else:
return jsonify({luid: str(False)})
"""
if __name__ == "__main__":
app.run()
↓
Row((True,), {'f0_': 0})
You seem to have solved most of the pieces; it's just a question of getting them working together. Here's a quick sample that should help with the BigQuery side, and it shows a different way of writing your query pattern using a public dataset table.
from google.cloud import bigquery
client = bigquery.Client()
# assume you get this from your flask app's param. this is the "luid" you're checking.
value = "treason"
# rewriting the SQL to demonstrate a similar check against a public dataset table
sql = "SELECT COUNTIF(word=@luid AND corpus='sonnets') > 0 as word_is_sonnet FROM `bigquery-public-data.samples.shakespeare`"
config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ScalarQueryParameter("luid", "STRING", value),
]
)
job = client.query(sql, job_config=config)
# this is a bit odd, but in this case we know we're dealing with a single row
# coming from the iterable based on the query structure.
first_row = next(iter(job.result()))
print(first_row.get("word_is_sonnet"))
That said, I'd make sure you understand how BigQuery works and how it charges for queries. You seem to be doing point lookups across a range of tables (the wildcard table in your original query), which means you're potentially scanning a lot of data to satisfy each request.
I just wanted to call that out so you're not surprised by either the performance or the costs if the intent is to issue many requests like this.
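If it helps, here is a rough sketch (untested, reusing the table, parameter name, and route from the question, so treat those names as assumptions) of how that single boolean row could be surfaced from the Flask route as JSON:
from flask import Flask, request, jsonify
from google.cloud import bigquery

app = Flask(__name__)
client = bigquery.Client()

@app.route('/')
def get_request():
    luid = request.args.get('luid') or ''
    sql = """
        SELECT EXISTS(
          SELECT 1
          FROM `test-266110.conversion_log.conversion_log_2020*`
          WHERE luid = @luid AND orderid != ''
        ) AS has_order
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("luid", "STRING", luid)]
    )
    # the query always returns exactly one row with one boolean column
    first_row = next(iter(client.query(sql, job_config=job_config).result()))
    return jsonify({luid: first_row.get("has_order")})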
I'm using the Python client's create_table() function, which calls the underlying tables.insert API. There is an exists_ok parameter, but this causes the function to simply skip the create if the table already exists. The problem with this is that when creating a view, I would like to overwrite the existing view SQL if it's already there. What I'm currently doing to get around this is:
if overwrite:
bq_client.delete_table(view, not_found_ok=True)
view = bq_client.create_table(view)
What I don't like about this is that there are potentially several seconds during which the view no longer exists. And if the code dies for whatever reason after the delete but before the create, the view is effectively gone.
My question: is there a way to create a table (view) such that it overwrites any existing object? Or perhaps I have to detect this situation and run some kind of update_table() (patch)?
If you want to overwrite an existing table, you can use the google.cloud.bigquery.job.WriteDisposition class; please refer to the official documentation.
You have three possibilities here: WRITE_APPEND, WRITE_EMPTY, and WRITE_TRUNCATE. What you should use is WRITE_TRUNCATE, which overwrites the table data.
You can see the following example:
from google.cloud import bigquery
import pandas
client = bigquery.Client()
table_id = "<YOUR_PROJECT>.<YOUR_DATASET>.<YOUR_TABLE_NAME>"
records = [
{"artist": u"Michael Jackson", "birth_year": 1958},
{"artist": u"Madonna", "birth_year": 1958},
{"artist": u"Shakira", "birth_year": 1977},
{"artist": u"Taylor Swift", "birth_year": 1989},
]
dataframe = pandas.DataFrame(
records,
columns=["artist", "birth_year"],
index=pandas.Index(
[u"Q2831", u"Q1744", u"Q34424", u"Q26876"], name="wikidata_id"
),
)
job_config = bigquery.LoadJobConfig(
schema=[
bigquery.SchemaField("artist", bigquery.enums.SqlTypeNames.STRING),
bigquery.SchemaField("wikidata_id", bigquery.enums.SqlTypeNames.STRING),
],
write_disposition="WRITE_TRUNCATE",
)
job = client.load_table_from_dataframe(
dataframe, table_id, job_config=job_config
)
job.result()
table = client.get_table(table_id)
Let me know if it suits your needs. I hope it helps.
UPDATED:
You can use the following Python code to update a view's query using the client library:
client = bigquery.Client(project="projectName")
table_ref = client.dataset('datasetName').table('tableViewName')
table = client.get_table(table_ref)
table.view_query = "SELECT * FROM `projectName.dataset.sourceTableName`"
table = client.update_table(table, ['view_query'])
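As a further option (not part of the original answer, just a hedged suggestion), BigQuery's DDL supports CREATE OR REPLACE VIEW, which swaps the view definition in a single statement and so avoids the window where the view does not exist; the project, dataset, and view names below are placeholders:
from google.cloud import bigquery

client = bigquery.Client(project="projectName")
# CREATE OR REPLACE VIEW replaces the view atomically if it already exists
ddl = """
CREATE OR REPLACE VIEW `projectName.datasetName.tableViewName` AS
SELECT * FROM `projectName.dataset.sourceTableName`
"""
client.query(ddl).result()  # runs the DDL statement as a query job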
You can do it this way. Hope this helps!
from google.cloud import bigquery
from google.api_core.exceptions import NotFound
clientBQ = bigquery.Client()
def tableExists(tableID, client=clientBQ):
"""
Check if a table already exists using the tableID.
return : (Boolean)
"""
try:
table = client.get_table(tableID)
return True
except NotFound:
return False
if tableExists(viewID, client=clientBQ):
print("View already exists, Deleting the view ... ")
clientBQ.delete_table(viewID)
view = bigquery.Table(viewID)
view.view_query = "SELECT * FROM `PROJECT_ID.DATASET_NAME.TABLE_NAME`"
clientBQ.create_table(view)
We need to monitor table sizes in different environments, using the Google metadata API to get the information for a given project/environment.
We need to create a view which will provide:
1. What datasets exist
2. What tables are in each dataset
3. Table sizes
4. Dataset size
BigQuery already has such views built in: INFORMATION_SCHEMA is a series of views that provide access to metadata about datasets, tables, and views.
For example, the query below returns metadata for all datasets in the default project:
SELECT * FROM INFORMATION_SCHEMA.SCHEMATA
or, for a specific project such as my_project:
SELECT * FROM my_project.INFORMATION_SCHEMA.SCHEMATA
There are other such views for tables as well.
In addition, there are meta-tables that can be used to get more info about the tables in a given dataset: __TABLES__SUMMARY and __TABLES__
SELECT * FROM `project.dataset.__TABLES__`
For example:
SELECT table_id,
DATE(TIMESTAMP_MILLIS(creation_time)) AS creation_date,
DATE(TIMESTAMP_MILLIS(last_modified_time)) AS last_modified_date,
row_count,
size_bytes,
CASE
WHEN type = 1 THEN 'table'
WHEN type = 2 THEN 'view'
WHEN type = 3 THEN 'external'
ELSE '?'
END AS type,
TIMESTAMP_MILLIS(creation_time) AS creation_time,
TIMESTAMP_MILLIS(last_modified_time) AS last_modified_time,
dataset_id,
project_id
FROM `project.dataset.__TABLES__`
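For the dataset-size requirement, here is a minimal sketch (the project.dataset name is a placeholder) that rolls the per-table sizes of one dataset up into a total using the Python client; __TABLES__ is scoped to a single dataset, so this returns one row per dataset queried:
from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT dataset_id, COUNT(1) AS table_count, SUM(size_bytes) AS dataset_size_bytes
FROM `project.dataset.__TABLES__`
GROUP BY dataset_id
"""
for row in client.query(sql).result():
    print(row.dataset_id, row.table_count, row.dataset_size_bytes)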
In order to automate the query so that it checks every dataset in the project, instead of adding them manually with UNION ALL, you can follow the advice given by @ZinkyZinky here and create a query that generates the UNION ALL calls for every dataset.__TABLES__. I have not managed to make this solution fully automatic within BigQuery, because I don't see a way to execute a command that has been generated as a string (which is what string_agg creates). However, I have managed to develop the solution in Python, feeding the generated string into a second query. You can find the code below; it also creates a new table and stores the results there:
from google.cloud import bigquery
client = bigquery.Client()
project_id = "wave27-sellbytel-bobeda"
# Construct a full Dataset object to send to the API.
dataset_id = "project_info"
dataset = bigquery.Dataset(".".join([project_id, dataset_id]))
dataset.location = "US"
# Send the dataset to the API for creation.
# Raises google.api_core.exceptions.Conflict if the Dataset already
# exists within the project.
dataset = client.create_dataset(dataset) # API request
print("Created dataset {}.{}".format(client.project, dataset.dataset_id))
schema = [
bigquery.SchemaField("dataset_id", "STRING", mode="REQUIRED"),
bigquery.SchemaField("table_id", "STRING", mode="REQUIRED"),
bigquery.SchemaField("size_bytes", "INTEGER", mode="REQUIRED"),
]
table_id = "table_info"
table = bigquery.Table(".".join([project_id, dataset_id, table_id]), schema=schema)
table = client.create_table(table) # API request
print(
"Created table {}.{}.{}".format(table.project, table.dataset_id, table.table_id)
)
job_config = bigquery.QueryJobConfig()
# Set the destination table
table_ref = client.dataset(dataset_id).table(table_id)
job_config.destination = table_ref
# QUERIES
# 1. Creating the UNION ALL list with the table information of each dataset
query = (
r"SELECT string_agg(concat('SELECT * from `', schema_name, '.__TABLES__` '), 'union all \n') "
r"from INFORMATION_SCHEMA.SCHEMATA"
)
query_job = client.query(query, location="US") # API request - starts the query
select_tables_from_all_datasets = ""
for row in query_job:
select_tables_from_all_datasets += row[0]
# 2. Using the before mentioned list to create a table.
query = (
"WITH ALL__TABLES__ AS ({})"
"SELECT dataset_id, table_id, size_bytes FROM ALL__TABLES__;".format(select_tables_from_all_datasets)
)
query_job = client.query(query, location="US", job_config=job_config) # job_config configures in which table the results will be stored.
for row in query_job:
print(row)
print('Query results loaded to table {}'.format(table_ref.path))
I am trying to update a table using batched update statements. DML queries successfully execute in the BigQuery Web UI, but when batched, the first one succeeds while others fail. Why is this?
A sample query:
query = '''
update `project.dataset.Table`
set my_fk = 1234
where other_fk = 222 and
received >= PARSE_TIMESTAMP("%Y-%m-%d %H:%M:%S", "2018-01-22 05:28:12") and
received <= PARSE_TIMESTAMP("%Y-%m-%d %H:%M:%S", "2018-01-26 02:31:51")
'''
Sample code:
job_config = bigquery.QueryJobConfig()
job_config.priority = bigquery.QueryPriority.BATCH
queries = [] # list of DML Strings
jobs = []
for query in queries:
job = client.query(query, location='US', job_config=job_config)
jobs.append(job)
Job output:
for job in jobs[1:]:
print(job.state)
# Done
print(job.error_result)
# {'message': 'Cannot set destination table in jobs with DML statements',
# 'reason': 'invalidQuery'}
print(job.use_legacy_sql)
# False
print(job.job_type)
# Query
I suspect that the problem is that job_config gets some fields populated (destination in particular) by the BigQuery API after the first job is inserted. The second job then fails because it is a DML statement with a destination table set in the job configuration. You can verify that with:
for query in queries:
print(job_config.destination)
job = client.query(query, location='US', job_config=job_config)
print(job_config.destination)
jobs.append(job)
To solve this, avoid reusing the same job_config for all jobs:
for query in queries:
job_config = bigquery.QueryJobConfig()
job_config.priority = bigquery.QueryPriority.BATCH
job = client.query(query, location='US', job_config=job_config)
jobs.append(job)
Your code seems to be working fine on a single update. This is what I tried, using Python 3.6.5 and v1.9.0 of the client API:
from google.cloud import bigquery
client = bigquery.Client()
query = '''
UPDATE `project.dataset.table` SET msg = null WHERE x is null
'''
job_config = bigquery.QueryJobConfig()
job_config.priority = bigquery.QueryPriority.BATCH
job = client.query(query, location='US', job_config=job_config)
print(job.state)
# PENDING
print(job.error_result)
# None
print(job.use_legacy_sql)
# False
print(job.job_type)
# Query
Please check your configuration and provide the full code with an error log if this doesn't help you solve your problem.
By the way, I also verified this from the command line:
sh-3.2# ./bq query --nouse_legacy_sql --batch=true 'UPDATE `project.dataset.table` SET msg = null WHERE x is null'
Waiting on bqjob_r5ee4f5dd56dc212f_000001697d3f9a56_1 ... (133s) Current status: RUNNING
Waiting on bqjob_r5ee4f5dd56dc212f_000001697d3f9a56_1 ... (139s) Current status: DONE
sh-3.2#
sh-3.2# python --version
I am trying to write a query using the Google BigQuery Python API, setting the project ID and dataset name as parameters. I have looked at the parameterized queries implementation on Google's GitHub.io pages, but when executing the query I get the following error:
google.api_core.exceptions.BadRequest: 400 Invalid table name: @project:@dataset.AIRPORTS
I am confused about whether we can substitute the project and dataset names with parameters.
Below is my code:
from google.cloud import bigquery
client = bigquery.Client.from_service_account_json('service_account.json')
project = client.project
datasets = list(client.list_datasets())
dataset = datasets[0]
dataset_id = dataset.dataset_id
QUERY = (
'SELECT * '
'FROM `{}.{}.AIRPORTS`'.format(project, dataset_id)
)
query = (
'SELECT * '
'FROM `@project.@dataset.AIRPORTS`'
)
TIMEOUT = 30
param1 = bigquery.ScalarQueryParameter('project', 'STRING', project)
param2 = bigquery.ScalarQueryParameter('dataset', 'STRING', dataset_id)
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = [param1, param2]
query_job = client.query(
query, job_config=job_config)
iterator = query_job.result(timeout=TIMEOUT)
rows = list(iterator)
print(rows)
You can only use parameters in place of expressions, such as column_name = @param_value in a WHERE clause. A table name is not an expression, so you cannot use parameters in place of the project or dataset names. Note also that you need to use standard SQL in order to use parameters.
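A minimal sketch of the usual workaround, reusing the client, project, dataset_id, and TIMEOUT variables from the question (the iata_code column and 'SFO' value are hypothetical examples, not part of the original post): format the table reference into the SQL string and keep query parameters for values only:
# Table identifiers cannot be parameterized, so build them with string formatting,
# and use query parameters only for values that appear in expressions.
query = 'SELECT * FROM `{}.{}.AIRPORTS` WHERE iata_code = @code'.format(project, dataset_id)
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = [
    bigquery.ScalarQueryParameter('code', 'STRING', 'SFO'),
]
query_job = client.query(query, job_config=job_config)
rows = list(query_job.result(timeout=TIMEOUT))
print(rows)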