How to get result from BigQuery based on user input parameters - sql

This is a continuation of my previous post about making an API that takes a URL parameter, passes it to BigQuery, and returns True if the luid record has data in the orderid column: How to check whether data exists in specific column on BigQuery with Flask?
I changed the SQL, and it seems to work well in the GCP console, but as you can see, it returns False ({'f0_': 0}) if you input the correct parameter from the browser. Do I need to fix this SQL?
[URL:https://test-989898.df.r.appspot.com?luid=U77777]
The output of return str(row)
↓
Row((True,), {'f0_': 0})
The output of the same SQL, with the same luid as above, on the console
↓
row | f0_
1 | true
SELECT EXISTS(SELECT 1
FROM `test-266110.conversion_log.conversion_log_2020*` as p
WHERE luid = "U77777" AND orderid != '' limit 1000)
I also tried the approach from the article below. Can't a user input parameter be used in BigQuery?
https://cloud.google.com/bigquery/docs/parameterized-queries
@app.route('/')
def get_request():
    luid = request.args.get('luid') or ''
    client = bigquery.Client()
    query = """SELECT EXISTS(SELECT 1
               FROM `test-266110.conversion_log.conversion_log_2020*` as p
               WHERE @luid = p.luid AND orderid != '' limit 1000)"""
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("luid", "STRING", luid),
        ]
    )
    query_job = client.query(query, job_config=job_config)
    query_res = query_job.result()
    for row in query_res:
        return str(row)
↓
Row((True,), {'f0_': 0})
I've been stuck on this problem for a while, and I welcome any ideas. Does anyone have a good solution?
from flask import Flask, request, jsonify
from google.cloud import bigquery

app = Flask(__name__)

@app.route('/')
def get_request():
    luid = request.args.get('luid') or ''
    client = bigquery.Client()
    query = """SELECT EXISTS(SELECT 1
               FROM `test-266110.conversion_log.conversion_log_2020*` as p
               WHERE @luid = p.luid AND orderid != '' limit 1000)"""
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("luid", "STRING", luid),
        ]
    )
    query_job = client.query(query, job_config=job_config)
    query_res = query_job.result()
    # first_row = next(iter(query_job.result()))
    for row in query_res:
        return str(row)
    # return jsonify({luid: query_res.total_rows})
    """
    if query_res == :
        return jsonify({luid: str(True)})
    else:
        return jsonify({luid: str(False)})
    """

if __name__ == "__main__":
    app.run()
↓
Row((True,), {'f0_': 0})

You seem to have solved most of the bits; it's just a question of getting them working together. Here's a quick sample that should help with the BigQuery side, and it shows a different way of writing your query pattern using a public dataset table.
from google.cloud import bigquery

client = bigquery.Client()

# assume you get this from your flask app's param. this is the "luid" you're checking.
value = "treason"

# rewriting the sql to demonstrate a similar thing with a public dataset table
sql = "SELECT COUNTIF(word=@luid AND corpus='sonnets') > 0 as word_is_sonnet FROM `bigquery-public-data.samples.shakespeare`"

config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("luid", "STRING", value),
    ]
)

job = client.query(sql, job_config=config)

# this is a bit odd, but in this case we know we're dealing with a single row
# coming from the iterable based on the query structure.
first_row = next(iter(job.result()))
print(first_row.get("word_is_sonnet"))
However, that said, I'd make sure you understand how BigQuery works and charges for queries. You seem to be doing point lookups across a range of tables (the wildcard table in your original query), which means you're potentially doing a lot of table scanning to satisfy each request.
I just wanted to call that out so you're not surprised by either the performance or the costs if the intent is that you'll be issuing many requests like this.
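To tie this back to the Flask route from the question, here is a minimal sketch of how the pieces could fit together, assuming the project and table names from the question and that orderid is a string column; the has_order alias is just illustrative. Using COUNTIF collapses the wildcard scan to a single boolean row, so the route can return a clean JSON body instead of the raw Row repr.
from flask import Flask, request, jsonify
from google.cloud import bigquery

app = Flask(__name__)
client = bigquery.Client()

@app.route('/')
def get_request():
    luid = request.args.get('luid') or ''
    # Names taken from the question; adjust to your own project/table.
    sql = """
        SELECT COUNTIF(orderid != '') > 0 AS has_order
        FROM `test-266110.conversion_log.conversion_log_2020*`
        WHERE luid = @luid
    """
    job_config = bigquery.QueryJobConfig(
        query_parameters=[bigquery.ScalarQueryParameter("luid", "STRING", luid)]
    )
    # The query always returns exactly one row, so take the first one.
    first_row = next(iter(client.query(sql, job_config=job_config).result()))
    return jsonify({luid: bool(first_row.get("has_order"))})

if __name__ == "__main__":
    app.run()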

Related

Assigning date variable in Google Big Query query

I am trying to add a date variable to my query in GBQ.
I have a variable x (e.g. 2016-04-20) which I want to use in the query like this:
# Query the necessary data
customer_data_query = """
SELECT FirstName, LastName, Organisation, CustomerRegisterDate FROM `bigquery-bi.ofo.Customers`
where CustomerRegisterDate > @max_last_date LIMIT 5 """
print(customer_data_query)

# Creating a connection to Google BigQuery
client = bigquery.Client.from_service_account_json('./credentials/cred_ofo.json')
print("Connection to Google BigQuery is established")

query_params = [
    bigquery.ScalarQueryParameter("max_last_date", "STRING", max_last_date),
]
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = query_params

customer_data = client.query(
    customer_data_query,
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
    job_config=job_config,
).to_dataframe()  # API request - starts the query
Any tips on how I can do it?
I have tried it in the code above, but it didn't work.
There were two solutions:
1. Use format():
"""SELECT FirstName, LastName, Organisation, CustomerRegisterDate FROM `bigquery-bi.ofo.Customers` where CustomerRegisterDate > {} LIMIT 5""".format(max_date)
2. Define query parameters in job_config, as described in the BigQuery parameterized query documentation.
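For the second approach, a minimal sketch could look like the following, reusing the table and credentials path from the question and assuming max_last_date holds a date string such as '2016-04-20':
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('./credentials/cred_ofo.json')

max_last_date = '2016-04-20'  # e.g. your variable x

customer_data_query = """
    SELECT FirstName, LastName, Organisation, CustomerRegisterDate
    FROM `bigquery-bi.ofo.Customers`
    WHERE CustomerRegisterDate > @max_last_date
    LIMIT 5
"""

job_config = bigquery.QueryJobConfig(
    query_parameters=[
        # Use "DATE" if CustomerRegisterDate is a DATE column;
        # switch to "TIMESTAMP" (or "STRING") if the column type differs.
        bigquery.ScalarQueryParameter("max_last_date", "DATE", max_last_date),
    ]
)

customer_data = client.query(
    customer_data_query, location="US", job_config=job_config
).to_dataframe()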

BigQuery updates failing, but only when batched using Python API

I am trying to update a table using batched update statements. DML queries successfully execute in the BigQuery Web UI, but when batched, the first one succeeds while others fail. Why is this?
A sample query:
query = '''
update `project.dataset.Table`
set my_fk = 1234
where other_fk = 222 and
received >= PARSE_TIMESTAMP("%Y-%m-%d %H:%M:%S", "2018-01-22 05:28:12") and
received <= PARSE_TIMESTAMP("%Y-%m-%d %H:%M:%S", "2018-01-26 02:31:51")
'''
Sample code:
from google.cloud import bigquery

client = bigquery.Client()

job_config = bigquery.QueryJobConfig()
job_config.priority = bigquery.QueryPriority.BATCH

queries = []  # list of DML strings
jobs = []
for query in queries:
    job = client.query(query, location='US', job_config=job_config)
    jobs.append(job)
Job output:
for job in jobs[1:]:
    print(job.state)
    # Done
    print(job.error_result)
    # {'message': 'Cannot set destination table in jobs with DML statements',
    #  'reason': 'invalidQuery'}
    print(job.use_legacy_sql)
    # False
    print(job.job_type)
    # Query
I suspect that the problem is job_config getting some fields populated (destination in particular) by the BigQuery API after the first job is inserted. Then, the second job will fail as it will be a DML statement with a destination table in the job configuration. You can verify that with:
for query in queries:
    print(job_config.destination)
    job = client.query(query, location='US', job_config=job_config)
    print(job_config.destination)
    jobs.append(job)
To solve this you can avoid reusing the same job_config for all jobs:
for query in queries:
    job_config = bigquery.QueryJobConfig()
    job_config.priority = bigquery.QueryPriority.BATCH
    job = client.query(query, location='US', job_config=job_config)
    jobs.append(job)
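If you want to confirm that every batched job actually succeeded once they have all been submitted, a small follow-up sketch (continuing with the jobs list from above):
# result() blocks until the job finishes and raises if the job failed.
for job in jobs:
    try:
        job.result()
        print(job.job_id, job.state)  # expect DONE
    except Exception as exc:
        print(job.job_id, 'failed:', job.error_result, exc)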
Your code seems to work fine for a single update. This is what I tried using Python 3.6.5 and v1.9.0 of the client API:
from google.cloud import bigquery
client = bigquery.Client()
query = '''
UPDATE `project.dataset.table` SET msg = null WHERE x is null
'''
job_config = bigquery.QueryJobConfig()
job_config.priority = bigquery.QueryPriority.BATCH
job = client.query(query, location='US', job_config=job_config)
print(job.state)
# PENDING
print(job.error_result)
# None
print(job.use_legacy_sql)
# False
print(job.job_type)
# Query
Please check your configuration and provide the full code with an error log if this doesn't help you solve your problem.
BTW, I also verified this from the command line:
sh-3.2# ./bq query --nouse_legacy_sql --batch=true 'UPDATE `project.dataset.table` SET msg = null WHERE x is null'
Waiting on bqjob_r5ee4f5dd56dc212f_000001697d3f9a56_1 ... (133s) Current status: RUNNING
Waiting on bqjob_r5ee4f5dd56dc212f_000001697d3f9a56_1 ... (139s) Current status: DONE
sh-3.2#
sh-3.2# python --version

Export BigQuery Result to Avro or JSON

Would someone please let me know if there is a way to save a BigQuery result in JSON or Avro format?
I am using the following code to run the query on a BigQuery table.
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('/Users/gaurang.shah/Downloads/fb3735b731b9.json')
job_config = bigquery.QueryJobConfig()
job_config.priority = bigquery.QueryPriority.BATCH
sql = """
select *
FROM `bigquery-public-data.samples.shakespeare`
limit 1;
"""
location = 'US'
query_job = client.query(sql, location=location, job_config=job_config)
query_job = client.get_job(query_job.job_id, location=location)
print(query_job.result())
I am trying to export the BigQuery table without using GCS in between, and this is one way I think I could achieve that.
The other way I can think of is the bq command line tool; however, I am not sure if it has any limit on how many queries I can fire and how much data I can retrieve.
You need to first run your query, write the results to a table, and then hook into the BigQuery export/extract API, where the results/table can be exported to GCS in the format you want. For example, here's CSV:
# from google.cloud import bigquery
# client = bigquery.Client()
# bucket_name = 'my-bucket'
project = 'bigquery-public-data'
dataset_id = 'samples'
table_id = 'shakespeare'

destination_uri = 'gs://{}/{}'.format(bucket_name, 'shakespeare.csv')
dataset_ref = client.dataset(dataset_id, project=project)
table_ref = dataset_ref.table(table_id)

extract_job = client.extract_table(
    table_ref,
    destination_uri,
    # Location must match that of the source table.
    location='US')  # API request
extract_job.result()  # Waits for job to complete.

print('Exported {}:{}.{} to {}'.format(
    project, dataset_id, table_id, destination_uri))
See more here.
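Putting the two steps together for Avro or JSON, a rough sketch might look like this; the my_dataset, query_results, and my-bucket names are placeholders, and NEWLINE_DELIMITED_JSON can be swapped in for AVRO:
from google.cloud import bigquery

client = bigquery.Client()

# Placeholder names: adjust the dataset, table, and bucket to your project.
destination_table = client.dataset('my_dataset').table('query_results')

# 1) Run the query and write its results to a destination table.
query_config = bigquery.QueryJobConfig()
query_config.destination = destination_table
client.query(
    "SELECT * FROM `bigquery-public-data.samples.shakespeare` LIMIT 1",
    job_config=query_config,
).result()

# 2) Extract that table to GCS as Avro (or NEWLINE_DELIMITED_JSON).
extract_config = bigquery.ExtractJobConfig()
extract_config.destination_format = bigquery.DestinationFormat.AVRO
client.extract_table(
    destination_table,
    'gs://my-bucket/results.avro',
    job_config=extract_config,
    location='US',
).result()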

Error with parametrized query in Google BigQuery

I am trying to write a query using the Google BigQuery Python API, and I am setting the project ID and dataset name as parameters. I have looked into the parameterized queries implementation on Google's github.io docs, but when executing the query I get the following error:
google.api_core.exceptions.BadRequest: 400 Invalid table name: @project:@dataset.AIRPORTS
I am confused about whether we can substitute the project and dataset names with parameters.
Below is my code:
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('service_account.json')
project = client.project
datasets = list(client.list_datasets())
dataset = datasets[0]
dataset_id = dataset.dataset_id

QUERY = (
    'SELECT * '
    'FROM `{}.{}.AIRPORTS`'.format(project, dataset_id)
)
query = (
    'SELECT * '
    'FROM `@project.@dataset.AIRPORTS`'
)
TIMEOUT = 30
param1 = bigquery.ScalarQueryParameter('project', 'STRING', project)
param2 = bigquery.ScalarQueryParameter('dataset', 'STRING', dataset_id)
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = [param1, param2]

query_job = client.query(
    query, job_config=job_config)
iterator = query_job.result(timeout=TIMEOUT)
rows = list(iterator)
print(rows)
You can only use parameters in place of expressions, such as column_name = @param_value in a WHERE clause. A table name is not an expression, so you cannot use parameters in place of the project or dataset names. Note also that you need to use standard SQL in order to use parameters.
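In other words, build the table identifier into the SQL string and keep @ parameters for values only. A minimal sketch reusing the setup from the question (the state column and 'CA' value here are purely illustrative):
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('service_account.json')
project = client.project
dataset_id = list(client.list_datasets())[0].dataset_id

# Table identifiers must be formatted into the SQL text...
query = 'SELECT * FROM `{}.{}.AIRPORTS` WHERE state = @state'.format(project, dataset_id)

# ...while values are bound as query parameters.
job_config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter('state', 'STRING', 'CA')]
)
rows = list(client.query(query, job_config=job_config).result())
print(rows)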

BigQuery Pagination

How do I do pagination with BigQuery when using JavaScript?
First I send a request:
var request = gapi.client.bigquery.jobs.query({
    'projectId': project_id,
    'timeoutMs': '30000',
    'query': query,
    'maxResults': 50,
    'pageToken': pageToken
});
This query returns the first 50 results; how can I then retrieve the next 50? I want to do pagination dynamically using JavaScript and BigQuery.
query:
SELECT year, month,day,state,mother_age, AVG(weight_pounds) as AvgWeight FROM [publicdata:samples.natality] Group EACH By year, month,day,state, mother_age
This is the query that I am using.
TableData.list works, or alternatively you can use jobs.getQueryResults(), which is usually the preferred way to get query results (since it can also wait for the query to complete).
You should use the page token returned from the original query response, or from the previous jobs.getQueryResults() call, to iterate through pages. This is generally more efficient and reliable than index-based pagination.
I don't have a JavaScript example, but here is an example using Python that should be relatively easy to adapt:
from apiclient.discovery import build

def run_query(http, service, project_id, query, response_handler,
              timeout=30*1000, max_results=1024):
    query_request = {
        'query': query,
        'timeoutMs': timeout,
        'maxResults': max_results}

    print 'Running query "%s"' % (query,)
    response = service.jobs().query(projectId=project_id,
                                    body=query_request).execute(http)
    job_ref = response['jobReference']

    get_results_request = {
        'projectId': project_id,
        'jobId': job_ref['jobId'],
        'timeoutMs': timeout,
        'maxResults': max_results}

    while True:
        print 'Response %s' % (response,)
        page_token = response.get('pageToken', None)
        query_complete = response['jobComplete']
        if query_complete:
            response_handler(response)
            if page_token is None:
                # Our work is done, the query is done and there are no more
                # results to read.
                break
        # Set the page token so that we know where to start reading from.
        get_results_request['pageToken'] = page_token
        # Apply a python trick here to turn the get_results_request dict
        # into method arguments.
        response = service.jobs().getQueryResults(
            **get_results_request).execute(http)

def print_results(results):
    fields = results['schema']['fields']
    rows = results['rows']
    for row in rows:
        for i in xrange(0, len(fields)):
            cell = row['f'][i]
            field = fields[i]
            print "%s: %s " % (field['name'], cell['v']),
        print ''

def run(http, query):
    service = build('bigquery', 'v2')
    project_id = '#Your Project Here#'
    run_query(http, service, project_id, query, print_results,
              timeout=1)
Once the query has run, all results will be saved to a temporary table (or a permanent one, if you have set the respective flag).
You can read these results with tabledata.list. Notice that it offers a startIndex argument, so you can jump to any arbitrary page, not only the next one.
https://developers.google.com/bigquery/docs/reference/v2/tabledata/list
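For reference, here is roughly what the startIndex approach looks like with the current google-cloud-bigquery Python client (a sketch; the query, page size, and page number are placeholders), reading pages directly from the query's destination table:
from google.cloud import bigquery

client = bigquery.Client()

# Run the query; the results land in a (temporary) destination table.
job = client.query("SELECT word, corpus FROM `bigquery-public-data.samples.shakespeare`")
job.result()  # wait for completion

# Fetch the destination table, then read an arbitrary page via tabledata.list,
# jumping straight to a start index instead of walking page tokens.
table = client.get_table(job.destination)
page_size = 50
page_number = 3  # zero-based page to fetch
rows = client.list_rows(
    table,
    start_index=page_number * page_size,
    max_results=page_size,
)
for row in rows:
    print(row['word'], row['corpus'])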