In the google function I'm triggering a dag:
endpoint = f"api/v1/dags/{dag_id}/dagRuns"
request_url = f"{web_server_url}/{endpoint}"
json_data = {"conf": {"file_name": file_name}}
response = make_composer2_web_server_request(
request_url, method="POST", json=json_data
)
It is possible to obtain that data in python operator:
def printParams(**kwargs):
for k, v in kwargs["params"].items():
print(f"{k}:{v}")
task = PythonOperator(
task_id="print_parameters",
python_callable=printParams)
What is the way to pass file_name into the sql statement taken from the POST request?
perform_analytics = BigQueryExecuteQueryOperator(
task_id="perform_analytics",
sql=f"""
Select **passFileNameVariableFromJsonDataHere** as FileName
FROM my_project.sample.table1
""",
destination_dataset_table=f"my_project.sample.table2",
write_disposition="WRITE_APPEND",
use_legacy_sql=False
)
Is there a way avoid using xCom variables as it is a single task level operation?
I was able to read the value in sql query like this:
sql=f"""
Select "{{{{dag_run.conf.get('file_name')}}}}" as FileName
FROM my_project.sample.table1
"""
Related
My code uses SQL to query a database hosted in BigQuery. Say I have a list of items stored in a variable:
list = ['a','b','c']
And I want to use that list as a parameter on a query like this:
%%bigquery --project xxx query
SELECT *
FROM `xxx.database.table`
WHERE items in list
As the magic command that calls the database is a full-cell command, how can I make some escape to get it to call the environment variables in the SQL query?
You can try UNNEST and the query in BIGQUERY works like this:
SELECT * FROM `xx.mytable` WHERE items in UNNEST (['a','b','c'])
In your code it should look like this:
SELECT * FROM `xx.mytable` WHERE items in UNNEST (list)
EDIT
I found two different ways to pass variables in Python.
The first approach is below. Is from google documentation[1].
from google.cloud import bigquery
# Construct a BigQuery client object.
client = bigquery.Client()
query = """
SELECT * FROM `xx.mytable` WHERE items in UNNEST (#list)
"""
job_config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ArrayQueryParameter("list", "STRING", ["a", "b", "c"]),
]
)
query_job = client.query(query, job_config=job_config) # Make an API request.
for row in query_job:
print("{}: \t{}".format(row.name, row.count))
The second approach is in the next document[2]. In your code should look like:
params = {'list': '[“a”,”b”,”c”]'}
%%bigquery df --params $params --project xxx query
select * from `xx.mytable`
where items in unnest (#list)
I also found some documentation[3] where it shows the parameters for %%bigquery magic.
[1]https://cloud.google.com/bigquery/docs/parameterized-queries#using_arrays_in_parameterized_queries
[2]https://notebook.community/GoogleCloudPlatform/python-docs-samples/notebooks/tutorials/bigquery/BigQuery%20query%20magic
[3]https://googleapis.dev/python/bigquery/latest/magics.html
I am using SQLAlchemy by create engine to connect python to snowflake for fetching data. Adding a snippet of code on how I am doing it. Before you suggest to use connector.snowflake, I have tried that and it has query tag but I need to pull queries by thread pool method and couldn't find a way to add query tag.
Have also tried ALTER SESSION SET QUERY_TAG but since the query runs parallelly it doesn't give the query tag.
Code:
vendor_class_query ='select * from table'
query_list1 = [vendor_class_query]
pool = ThreadPool(8)
def query(x):
engine = create_engine(
'snowflake://{user}:{password}#{account}/{database_name}/{schema_name}?\
warehouse={warehouse}&role={role}¶mstyle={paramstyle}'.format(
user=---------,
password=----------,
account=----------,
database_name=----------,
schema_name=----------,
warehouse=----------,
role=----------,
paramstyle='pyformat'
),
poolclass=NullPool
)
try:
connection = engine.connect()
for df in pd.read_sql_query(x, engine, chunksize=1000000000):
df.columns = map(str.upper, df.columns)
return df
finally:
connection.close()
engine.dispose()
return df
results1 = pool.map(query, query_list1)
vendor_class = results1[0]'''
This is a continuation of my previous post for making api that takes url parameter , passes it to BigQuery and if the luid record has data in orderid column, it returns True . How to check whether data exists in specific column on BigQuery with Flask?
I changed sql and it seems this sql works well on GCP console but as you can see , it returns Flase({'f0_': 0})) if you input correct parameter from browser. Do I need to fix this sql ??
[URL:https://test-989898.df.r.appspot.com?luid=U77777]
The output of return str(row)
↓
Row((True,), {'f0_': 0})
The output of SQL with same luid above on console
↓
row | f0_
1 | true
SELECT EXISTS(SELECT 1
FROM `test-266110.conversion_log.conversion_log_2020*` as p
WHERE luid = "U77777" AND orderid != '' limit 1000)
and I tried this article as below . User input parameter can not be available in BigQuery ??
https://cloud.google.com/bigquery/docs/parameterized-queries
#app.route('/')
def get_request():
luid = request.args.get('luid') or ''
client = bigquery.Client()
query = """SELECT EXISTS(SELECT 1
FROM `test-266110.conversion_log.conversion_log_2020*` as p
WHERE #luid = p.luid AND orderid != '' limit 1000)"""
job_config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ScalarQueryParameter("luid", "STRING", luid),
]
)
query_job = client.query(query, job_config=job_config)
query_res = query_job.result()
for row in query_res:
return str(row)
↓
Row((True,), {'f0_': 0})
I've been stack in this problem for a while , I'm welcome to any idea . Anyone has good solutions ??
from flask import Flask, request, jsonify
from google.cloud import bigquery
app = Flask(__name__)
#app.route('/')
def get_request():
luid = request.args.get('luid') or ''
client = bigquery.Client()
query = """SELECT EXISTS(SELECT 1
FROM `test-266110.conversion_log.conversion_log_2020*` as p
WHERE #luid = p.luid AND orderid != '' limit 1000)"""
job_config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ScalarQueryParameter("luid", "STRING", luid),
]
)
query_job = client.query(query, job_config=job_config)
query_res = query_job.result()
# first_row = next(iter(query_job.result()))
for row in query_res:
return str(row)
#return jsonify({luid:query_res.total_rows})
"""
if query_res == :
return jsonify({luid: str(True)})
else:
return jsonify({luid: str(False)})
"""
if __name__ == "__main__":
app.run()
↓
Row((True,), {'f0_': 0})
You seem to have solved most of the bits, it's just a question of getting them working together. Here's a quick sample that should help with the BigQuery things, and shows a different way of writing your query pattern using a public dataset table.
from google.cloud import bigquery
client = bigquery.Client()
# assume you get this from your flask app's param. this is the "luid" you're checking.
value = "treason"
# rewriting the sql demonstrate a similar thing with a public dataset table
sql = "SELECT COUNTIF(word=#luid AND corpus='sonnets') > 0 as word_is_sonnet FROM `bigquery-public-data.samples.shakespeare`"
config = bigquery.QueryJobConfig(
query_parameters=[
bigquery.ScalarQueryParameter("luid", "STRING", value),
]
)
job = client.query(sql, job_config=config)
# this is a bit odd, but in this case we know we're dealing with a single row
# coming from the iterable based on the query structure.
first_row = next(iter(job.result()))
print(first_row.get("word_is_sonnet"))
However, that said I'd make sure you're understanding how BigQuery works and charges for queries. You seem to be doing point lookups for a range of tables (the wildcard table in your original query), which means you're potentially doing a lot of table scanning to satisfy this request.
I just wanted to call that out so you're not surprised by either the performance or the costs if the intent is that you're issuing many requests like this.
I am trying to write a query using Google BigQuery Python API. I am setting the project id and dataset name as parameters. I have looked into the parametrized queries implementation on Google Github.io. But when executing the query I get the following error
google.api_core.exceptions.BadRequest: 400 Invalid table name: #project:#dataset.AIRPORTS
I am confused whether we can substitute the project, dataset names with parameters.
Below is my code
from google.cloud import bigquery
client = bigquery.Client.from_service_account_json('service_account.json')
project = client.project
datasets = list(client.list_datasets())
dataset = datasets[0]
dataset_id = dataset.dataset_id
QUERY = (
'SELECT * '
'FROM `{}.{}.AIRPORTS`'.format(project, dataset_id)
)
query = (
'SELECT * '
'FROM `#project.#dataset.AIRPORTS`'
)
TIMEOUT = 30
param1 = bigquery.ScalarQueryParameter('project', 'STRING', project)
param2 = bigquery.ScalarQueryParameter('dataset', 'STRING', dataset_id)
job_config = bigquery.QueryJobConfig()
job_config.query_parameters = [param1, param2]
query_job = client.query(
query, job_config=job_config)
iterator = query_job.result(timeout=TIMEOUT)
rows = list(iterator)
print(rows)
You can only use parameters in place of expressions, such as column_name = #param_value in a WHERE clause. A table name is not an expression, so you cannot use parameters in place of the project or dataset names. Note also that you need to use standard SQL in order to use parameters.
How to do pagination using BigQuery ,when using javascript
First I send request:
var request = gapi.client.bigquery.jobs.query({
'projectId': project_id,
'timeoutMs': '30000',
'query': query,
'maxResults' : 50,
'pageToken': pageToken
});
This query will return me first 50 results and then how can i retrieve next 50 results.I want to do pagination dynamically using javascript and Bigquery.
query:
SELECT year, month,day,state,mother_age, AVG(weight_pounds) as AvgWeight FROM [publicdata:samples.natality] Group EACH By year, month,day,state, mother_age
This is the query that i am using.
TableData.list works, or alternately you can use jobs.getQueryResults(), which is usually the preferred way to get query results (since it can also wait for the query to complete).
You should use the page token returned from the original query response or the previous jobs.getQueryResults() call to iterate through pages. This is generally more efficient and reliable than using index-based pagination.
I don't have a javascript example, but here is an example using python that should be relatively easy to adapt:
from apiclient.discovery import build
def run_query(http, service, project_id, query, response_handler,
timeout=30*1000, max_results=1024):
query_request = {
'query': query,
'timeoutMs': timeout,
'maxResults': max_results}
print 'Running query "%s"' % (query,)
response = service.jobs().query(projectId=project_id,
body=query_request).execute(http)
job_ref = response['jobReference']
get_results_request = {
'projectId': project_id,
'jobId': job_ref['jobId'],
'timeoutMs': timeout,
'maxResults': max_results}
while True:
print 'Response %s' % (response,)
page_token = response.get('pageToken', None)
query_complete = response['jobComplete']
if query_complete:
response_handler(response)
if page_token is None:
# Our work is done, query is done and there are no more
# results to read.
break;
# Set the page token so that we know where to start reading from.
get_results_request['pageToken'] = page_token
# Apply a python trick here to turn the get_results_request dict
# into method arguments.
response = service.jobs().getQueryResults(
**get_results_request).execute(http)
def print_results(results):
fields = results['schema']['fields']
rows = results['rows']
for row in rows:
for i in xrange(0, len(fields)):
cell = row['f'][i]
field = fields[i]
print "%s: %s " % (field['name'], cell['v']),
print ''
def run(http, query):
service = build('bigquery', 'v2')
project_id = '#Your Project Here#'
run_query(http, service, project_id, query, print_results,
timeout=1)
Once the query has run, all results will be saved to a temporary table (or permanent, if you have set the respective flag).
You can read these results with tabledata.list. Notice that it offers an startIndex argument, so you can jump to any arbitrary page too, not only to the next one.
https://developers.google.com/bigquery/docs/reference/v2/tabledata/list