BigQuery Pagination - google-bigquery

How can I do pagination with BigQuery when using JavaScript?
First I send the request:
var request = gapi.client.bigquery.jobs.query({
    'projectId': project_id,
    'timeoutMs': '30000',
    'query': query,
    'maxResults': 50,
    'pageToken': pageToken
});
This query returns the first 50 results; how can I then retrieve the next 50? I want to do pagination dynamically using JavaScript and BigQuery.
This is the query I am using:
SELECT year, month, day, state, mother_age, AVG(weight_pounds) as AvgWeight
FROM [publicdata:samples.natality]
GROUP EACH BY year, month, day, state, mother_age

TableData.list works, or alternatively you can use jobs.getQueryResults(), which is usually the preferred way to get query results (since it can also wait for the query to complete).
You should use the page token returned from the original query response or the previous jobs.getQueryResults() call to iterate through pages. This is generally more efficient and reliable than using index-based pagination.
I don't have a JavaScript example, but here is an example in Python that should be relatively easy to adapt:
from apiclient.discovery import build

def run_query(http, service, project_id, query, response_handler,
              timeout=30*1000, max_results=1024):
    query_request = {
        'query': query,
        'timeoutMs': timeout,
        'maxResults': max_results}
    print 'Running query "%s"' % (query,)
    response = service.jobs().query(projectId=project_id,
                                    body=query_request).execute(http)
    job_ref = response['jobReference']

    get_results_request = {
        'projectId': project_id,
        'jobId': job_ref['jobId'],
        'timeoutMs': timeout,
        'maxResults': max_results}

    while True:
        print 'Response %s' % (response,)
        page_token = response.get('pageToken', None)
        query_complete = response['jobComplete']
        if query_complete:
            response_handler(response)
            if page_token is None:
                # Our work is done: the query is done and there are no more
                # results to read.
                break
        # Set the page token so that we know where to start reading from.
        get_results_request['pageToken'] = page_token
        # Apply a python trick here to turn the get_results_request dict
        # into method arguments.
        response = service.jobs().getQueryResults(
            **get_results_request).execute(http)

def print_results(results):
    fields = results['schema']['fields']
    rows = results['rows']
    for row in rows:
        for i in xrange(0, len(fields)):
            cell = row['f'][i]
            field = fields[i]
            print "%s: %s " % (field['name'], cell['v']),
        print ''

def run(http, query):
    service = build('bigquery', 'v2')
    project_id = '#Your Project Here#'
    run_query(http, service, project_id, query, print_results,
              timeout=1)
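To call run(), you need an authorized http object. A minimal sketch, assuming application-default credentials and the oauth2client/httplib2 stack this API client was built around (these names come from those libraries, not from the answer above):

import httplib2
from oauth2client.client import GoogleCredentials

# Build an authorized http object from application-default credentials,
# then run the paginated query end to end.
credentials = GoogleCredentials.get_application_default()
http = credentials.authorize(httplib2.Http())
run(http, 'SELECT word FROM [publicdata:samples.shakespeare] LIMIT 200')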

Once the query has run, all results will be saved to a temporary table (or permanent, if you have set the respective flag).
You can read these results with tabledata.list. Notice that it offers a startIndex argument, so you can jump to any arbitrary page, not only the next one:
https://developers.google.com/bigquery/docs/reference/v2/tabledata/list
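For example, a minimal sketch of tabledata.list in Python against the v2 API, jumping straight to an arbitrary page with startIndex (the project/dataset/table ids are placeholders for wherever your query results were saved; `service` and `http` are as in the earlier example):

# Read rows 50..99 of the results table directly, no page tokens needed.
resp = service.tabledata().list(
    projectId='your-project',
    datasetId='your_dataset',
    tableId='your_table',
    startIndex=50,
    maxResults=50).execute(http)
for row in resp.get('rows', []):
    print([cell['v'] for cell in row['f']])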

Related

How to get result from BigQuery based on user input parameters

This is a continuation of my previous post about making an API that takes a URL parameter, passes it to BigQuery, and returns True if the luid record has data in the orderid column: How to check whether data exists in a specific column on BigQuery with Flask?
I changed the SQL, and it seems to work well in the GCP console, but as you can see, it returns False ({'f0_': 0}) even when you pass a correct parameter from the browser. Do I need to fix this SQL?
[URL:https://test-989898.df.r.appspot.com?luid=U77777]
The output of return str(row)
↓
Row((True,), {'f0_': 0})
The output of the same SQL with the luid above on the console
↓
row | f0_
1 | true
SELECT EXISTS(SELECT 1
FROM `test-266110.conversion_log.conversion_log_2020*` as p
WHERE luid = "U77777" AND orderid != '' limit 1000)
I also tried the parameterized-query approach from the article below. Can a user input parameter not be used in BigQuery?
https://cloud.google.com/bigquery/docs/parameterized-queries
@app.route('/')
def get_request():
    luid = request.args.get('luid') or ''
    client = bigquery.Client()
    query = """SELECT EXISTS(SELECT 1
        FROM `test-266110.conversion_log.conversion_log_2020*` as p
        WHERE @luid = p.luid AND orderid != '' limit 1000)"""
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("luid", "STRING", luid),
        ]
    )
    query_job = client.query(query, job_config=job_config)
    query_res = query_job.result()
    for row in query_res:
        return str(row)
↓
Row((True,), {'f0_': 0})
I've been stuck on this problem for a while; I'm open to any idea. Does anyone have a good solution?
from flask import Flask, request, jsonify
from google.cloud import bigquery

app = Flask(__name__)

@app.route('/')
def get_request():
    luid = request.args.get('luid') or ''
    client = bigquery.Client()
    query = """SELECT EXISTS(SELECT 1
        FROM `test-266110.conversion_log.conversion_log_2020*` as p
        WHERE @luid = p.luid AND orderid != '' limit 1000)"""
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("luid", "STRING", luid),
        ]
    )
    query_job = client.query(query, job_config=job_config)
    query_res = query_job.result()
    # first_row = next(iter(query_job.result()))
    for row in query_res:
        return str(row)
    #return jsonify({luid:query_res.total_rows})
    """
    if query_res == :
        return jsonify({luid: str(True)})
    else:
        return jsonify({luid: str(False)})
    """

if __name__ == "__main__":
    app.run()
↓
Row((True,), {'f0_': 0})
You seem to have solved most of the bits; it's just a question of getting them working together. Here's a quick sample that should help with the BigQuery parts, and it shows a different way of writing your query pattern using a public dataset table.
from google.cloud import bigquery

client = bigquery.Client()

# assume you get this from your flask app's param. this is the "luid" you're checking.
value = "treason"

# rewriting the sql to demonstrate a similar thing with a public dataset table
sql = "SELECT COUNTIF(word=@luid AND corpus='sonnets') > 0 as word_is_sonnet FROM `bigquery-public-data.samples.shakespeare`"
config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ScalarQueryParameter("luid", "STRING", value),
    ]
)
job = client.query(sql, job_config=config)

# this is a bit odd, but in this case we know we're dealing with a single row
# coming from the iterable based on the query structure.
first_row = next(iter(job.result()))
print(first_row.get("word_is_sonnet"))
That said, I'd make sure you understand how BigQuery works and charges for queries. You seem to be doing point lookups across a range of tables (the wildcard table in your original query), which means you're potentially doing a lot of table scanning to satisfy each request.
I just wanted to call that out so you're not surprised by either the performance or the costs if the intent is to issue many requests like this.
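If the tables are sharded by date, one way to bound the scan is to prune the wildcard with _TABLE_SUFFIX. A sketch, not tested against your schema: it assumes daily tables like conversion_log_20200101, so the suffix holds the MMDD part; the table name is taken from the question:

from google.cloud import bigquery

client = bigquery.Client()
sql = """
SELECT COUNT(*) > 0 AS found
FROM `test-266110.conversion_log.conversion_log_2020*`
WHERE _TABLE_SUFFIX BETWEEN '0101' AND '0131'
  AND luid = @luid AND orderid != ''
"""
config = bigquery.QueryJobConfig(
    query_parameters=[bigquery.ScalarQueryParameter("luid", "STRING", "U77777")]
)
# The query yields exactly one row with a single boolean column.
first_row = next(iter(client.query(sql, job_config=config).result()))
print(first_row.get("found"))  # in the Flask view, return jsonify(...) instead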

Assigning date variable in Google Big Query query

I am trying to add a date variable to my query in GBQ. I have a variable (e.g. 2016-04-20) which I want to use in the query like this:
# Query the necessary data
customer_data_query = """
    SELECT FirstName, LastName, Organisation, CustomerRegisterDate
    FROM `bigquery-bi.ofo.Customers`
    WHERE CustomerRegisterDate > @max_last_date LIMIT 5 """
print(customer_data_query)

# Creating a connection to Google BigQuery
client = bigquery.Client.from_service_account_json('./credentials/cred_ofo.json')
print("Connection to Google BigQuery is established")

query_params = [
    bigquery.ScalarQueryParameter("max_last_date", "STRING", max_last_date),
]
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = query_params
customer_data = client.query(
    customer_data_query,
    # Location must match that of the dataset(s) referenced in the query.
    location="US",
    job_config=job_config,
).to_dataframe()  # API request - starts the query
Any tips on how I can do this? I have tried the code above, but it did not work.
There were two solutions.
1: Use string formatting:
"""SELECT FirstName, LastName, Organisation, CustomerRegisterDate FROM `bigquery-bi.ofo.Customers` where CustomerRegisterDate > '{}' LIMIT 5""".format(max_date)
2: Define query parameters for job_config, as described in the BigQuery parameterized queries documentation.
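For the parameterized route, a minimal sketch (it assumes CustomerRegisterDate is a DATE column; note the parameter can be typed DATE instead of STRING, so no string formatting of the date is needed):

import datetime
from google.cloud import bigquery

client = bigquery.Client.from_service_account_json('./credentials/cred_ofo.json')
max_last_date = datetime.date(2016, 4, 20)
sql = """
SELECT FirstName, LastName, Organisation, CustomerRegisterDate
FROM `bigquery-bi.ofo.Customers`
WHERE CustomerRegisterDate > @max_last_date
LIMIT 5
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        # A DATE-typed parameter compares directly against a DATE column.
        bigquery.ScalarQueryParameter("max_last_date", "DATE", max_last_date),
    ]
)
customer_data = client.query(sql, job_config=job_config, location="US").to_dataframe()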

Django: How to know the raw sql and the time taken by a model queryset

I am trying to find out:
1) the time taken (duration), and
2) the raw SQL
for any Django model queryset. E.g.:
users = User.objects.all()
# print the time taken for users
# print the raw SQL of users
Here is what I have tried, after going through some previous solutions to this question on Stack Overflow.
SQL:
I am thinking of using the function below for the SQL query. It is given by @Flash in https://stackoverflow.com/a/47542953/2897115
from django.db import connections

def str_query(qs):
    """
    qs.query returns something that isn't valid SQL, this returns the actual
    valid SQL that's executed: https://code.djangoproject.com/ticket/17741
    """
    cursor = connections[qs.db].cursor()
    query, params = qs.query.sql_with_params()
    cursor.execute('EXPLAIN ' + query, params)
    res = str(cursor.db.ops.last_executed_query(cursor, query, params))
    assert res.startswith('EXPLAIN ')
    return res[len('EXPLAIN '):]
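For instance (a hypothetical queryset, just to show the call):

from django.contrib.auth.models import User

qs = User.objects.filter(is_active=True)
print(str_query(qs))  # the SQL exactly as the database receives it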
Time taken:
For the time taken I use start = time() and stop = time(), so the code becomes:
from time import time

def someview(request):
    start = time()
    qs = Somemodels.objects.all()
    stop = time()
    sql = str_query(qs)
    timetaken = "%.3f" % (stop - start)
    ...
Q: Will this show the correct values for the SQL and the time taken?
Q: Is there any way to get the time taken from the cursor.db module instead of using start = time() and stop = time()?
I also found somewhere that you can get the SQL using:
from django import db
db.connection.queries[-1]
Q: How is this different from the str_query(qs) method I am trying to use?
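For reference, a sketch of what db.connection.queries gives you. It is only populated when settings.DEBUG is True, and each entry records the executed SQL together with its duration, so it also gives a time taken without manual time() calls. Note that querysets are lazy, so the query only runs once the queryset is evaluated:

from django.db import connection, reset_queries
from django.contrib.auth.models import User

reset_queries()                 # clear this connection's query log
list(User.objects.all())        # evaluate the lazy queryset so SQL actually runs
entry = connection.queries[-1]  # requires settings.DEBUG = True
print(entry['sql'])             # the SQL Django sent to the database
print(entry['time'])            # duration in seconds, as a string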

Efficient way of iterating rows in pandas

I have a dataframe in pandas containing transaction information. Using a for loop over each entry in the TRANSACTION_ID column, I am calling the following function:
def checkForImages(TransNum):
    """Pass the function a transaction number and get a string with the
    image-found information, then store that string in the same row in a
    new column."""
    try:
        cursor.execute('select CAMERA_TYPE from VEHICLE_IMAGE where TRANSACTION_ID=' + str(TransNum))
        result = ''
        for img_type in cursor:
            result = result + img_type[0]
        if result == '':
            result = 'No image available'
        print 'Images found: ' + str(TransNum) + " " + result
        resultSort = result.split()
        resultSort.sort()
        result = ''
        for i in range(len(resultSort)):
            result = result + " " + resultSort[i]
        cursor.close()
        return result
    except Exception as e:
        # print 'Error occurred while getting image references: ', e
        pass
This function returns a string which is either 'No image available' or contains the image information if found. I have to create a new column in the dataframe populated with this result.
My question is: how can I speed up this process? Using a for loop on 100k+ rows is extremely slow and painful. I have looked into functions like dataframe.map and dataframe.apply but haven't been able to get them working. Other options I see are using Cython or multiple threads. Which option should I invest my time in? Any help is appreciated.
You query Oracle for each transaction and then additionally aggregate the fetched data for each transaction in a loop; that is very inefficient.
First I would create a "mapping" DataFrame like the following:
transaction_id  images
111             No image available
112             FRONT REAR
113             OVERVIEW
This can be done using Oracle's LISTAGG function:
qry = """
select
    transaction_id,
    NVL(listagg(camera_type, ' ') within group (order by camera_type),
        'No image available') as images
from vehicle_image
group by transaction_id
"""

# `engine` is a SQLAlchemy engine or connection ...
cam = pd.read_sql(qry, con=engine, index_col=['transaction_id'])
After that we can use the Series.map() method:
df['Image_Found'] = df.transaction_id.map(cam.images)
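If some transaction ids are missing from the mapping table (a guess about your data, not something the query above guarantees), map() leaves NaN for them, and you can backfill the default label:

df['Image_Found'] = df.transaction_id.map(cam.images).fillna('No image available')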

django using .extra() got error `only a single result allowed for a SELECT that is part of an expression`

I'm trying to use .extra() where the query returns more than one result, like:
'SELECT "books_books"."*" FROM "books_books" WHERE "books_books"."owner_id" = %s' % request.user.id
I got an error: only a single result allowed for a SELECT that is part of an expression
I tried it on the dev server using sqlite3. Does anybody know how to fix this? Or is my query wrong?
EDIT:
I'm using django-simple-ratings; my model looks like this:
class Thread(models.Model):
    # ...
    # ...
    ratings = Ratings()
I want to display each Thread's ratings and whether a user has already rated it or not. For 2 items, it will hit the database 6 times: 1 for the actual Thread and 2 for accessing the ratings. The query:
threads = Thread.ratings.order_by_rating().filter(section=section)\
    .select_related('creator')\
    .prefetch_related('replies')
threads = threads.extra(select=dict(
    myratings="SELECT SUM('section_threadrating'.'score') AS 'agg' FROM 'section_threadrating' WHERE 'section_threadrating'.'content_object_id' = 'section_thread'.'id' ",
))
Then I can print each Thread's ratings without hitting the db again. For the 2nd query, I add:
# continuing from the .extra() above
blahblah.extra(select=dict(
    myratings='#####code above####',
    voter_id="SELECT 'section_threadrating'.'user_id' FROM 'section_threadrating' WHERE ('section_threadrating'.'content_object_id' = 'section_thread'.'id' AND 'section_threadrating'.'user_id' = '3') ",
))
I hard-coded the user_id. Then when I use it in a template like this:
{% ifequal threads.voter_id user.id %}
#the rest of the code
I get the error: only a single result allowed for a SELECT that is part of an expression
Let me know if this isn't clear enough.
The problem is in the query. Generally, when you write a subquery that is used as an expression, it must return only one result. So a subquery like the voter_id one:
select ..., (select section_threadrating.user_id from ...) as voter_id from ....
is invalid, because it can return more than one result. If you are sure it will always return one result, you can use the max() or min() aggregate function:
blahblah.extra(select=dict(
    myratings='#####code above####',
    voter_id="SELECT max('section_threadrating'.'user_id') FROM 'section_threadrating' WHERE ('section_threadrating'.'content_object_id' = 'section_thread'.'id' AND 'section_threadrating'.'user_id' = '3') ",
))
This will make the subquery always return one result.
As for removing that hard-coded value: what user_id are you expecting to retrieve here? Maybe you just can't reduce it to one user using only SQL.
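If the goal is just "has this user rated this thread?", here is a sketch of an alternative (my suggestion, not part of the original answer): a COUNT(*) subquery always returns exactly one row, and the template can treat the 0/1 result as a boolean:

threads = threads.extra(select={
    'myratings': '#####code above####',  # the SUM() subquery, unchanged
    # COUNT(*) yields 0 or 1 per thread, so this subquery is always single-row.
    'user_voted': "SELECT COUNT(*) FROM section_threadrating "
                  "WHERE section_threadrating.content_object_id = section_thread.id "
                  "AND section_threadrating.user_id = 3",
})

In the template, {% if thread.user_voted %} then replaces the ifequal comparison.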