How can I get the results of a query in bigquery into a list? - google-bigquery

I'm running a DAG that runs multiple stored procedures in BigQuery in each DAG run. Currently, my code is the following:
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

sp_names = [
    'sp_airflow_test_1',
    'sp_airflow_test_2'
]

task_array = []
i = 0

# Define DAG
with DAG(
        dag_id,
        default_args=default_args) as dag:
    for sp in sp_names:
        i = i + 1
        task_array.append(
            BigQueryOperator(
                task_id='run_{}'.format(sp),
                sql="""CALL `[project].[dataset].{}`();""".format(sp),
                use_legacy_sql=False
            )
        )
        # Chain each task after the previous one.
        if i > 1:
            task_array[i - 2] >> task_array[i - 1]
I'd like my list "sp_names" to be the result of a query I do to a 1 column table that is stored on my BQ dataset, instead of being hardcoded like it is right now.
How can I do this?
Thanks in advance.

To execute multiple BigQuery stored procedures with a similar SQL structure, create the BigQueryOperator tasks dynamically with a create_dynamic_task function.
# function to create a task
def create_dynamic_task(sp):
    task = BigQueryOperator(
        task_id='run_{}'.format(sp),
        sql="""CALL `[project].[dataset].{}`();""".format(sp),
        use_legacy_sql=False
    )
    return task

# dynamically create the tasks
task_list = []
for sp in sp_names:
    task_list.append(create_dynamic_task(sp))
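To make sp_names come from a one-column table instead of being hardcoded, one option is to fetch the names with the BigQuery client and feed them into the same loop. This is a minimal sketch: the control table `[project].[dataset].sp_list` and its sp_name column are assumed names, not from the question.
from google.cloud import bigquery

def get_sp_names():
    # Read the stored-procedure names from a one-column control table.
    client = bigquery.Client()
    rows = client.query(
        "SELECT sp_name FROM `[project].[dataset].sp_list`"
    ).result()
    return [row.sp_name for row in rows]

sp_names = get_sp_names()

task_list = [create_dynamic_task(sp) for sp in sp_names]

# Chain the tasks sequentially, as in the original DAG.
for upstream, downstream in zip(task_list, task_list[1:]):
    upstream >> downstream

Note that this query runs every time the scheduler parses the DAG file, so keep the control table small or cache the result (for example in an Airflow Variable).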

Related

Insert into BigQuery using parameters from web server request json data

In the Google Cloud Function I'm triggering a DAG:
endpoint = f"api/v1/dags/{dag_id}/dagRuns"
request_url = f"{web_server_url}/{endpoint}"
json_data = {"conf": {"file_name": file_name}}

response = make_composer2_web_server_request(
    request_url, method="POST", json=json_data
)
It is possible to obtain that data in a PythonOperator:
def printParams(**kwargs):
    for k, v in kwargs["params"].items():
        print(f"{k}:{v}")

task = PythonOperator(
    task_id="print_parameters",
    python_callable=printParams)
What is the way to pass file_name into the sql statement taken from the POST request?
perform_analytics = BigQueryExecuteQueryOperator(
    task_id="perform_analytics",
    sql=f"""
        Select **passFileNameVariableFromJsonDataHere** as FileName
        FROM my_project.sample.table1
    """,
    destination_dataset_table=f"my_project.sample.table2",
    write_disposition="WRITE_APPEND",
    use_legacy_sql=False
)
Is there a way to avoid using XCom variables, since this is a single task-level operation?
I was able to read the value in the SQL query like this:
sql=f"""
    Select "{{{{dag_run.conf.get('file_name')}}}}" as FileName
    FROM my_project.sample.table1
"""

How to pass a variable in a full cell magic command in Jupyter/Colab?

My code uses SQL to query a database hosted in BigQuery. Say I have a list of items stored in a variable:
list = ['a','b','c']
And I want to use that list as a parameter on a query like this:
%%bigquery --project xxx query
SELECT *
FROM `xxx.database.table`
WHERE items in list
As the magic command that calls the database is a full-cell command, how can I make some kind of escape so that the SQL query can reference the variable from the Python environment?
You can try UNNEST; the query in BigQuery works like this:
SELECT * FROM `xx.mytable` WHERE items in UNNEST (['a','b','c'])
In your code it should look like this:
SELECT * FROM `xx.mytable` WHERE items in UNNEST (list)
EDIT
I found two different ways to pass variables in Python.
The first approach is below; it is from the Google documentation [1].
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    SELECT * FROM `xx.mytable` WHERE items IN UNNEST(@list)
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ArrayQueryParameter("list", "STRING", ["a", "b", "c"]),
    ]
)
query_job = client.query(query, job_config=job_config)  # Make an API request.

for row in query_job:
    # Column names here come from the documentation sample; adjust to your table.
    print("{}: \t{}".format(row.name, row.count))
The second approach is in the next document [2]. In your code it should look like this (note that params must be defined in a cell before the %%bigquery cell):
params = {"list": ["a", "b", "c"]}
%%bigquery df --params $params --project xxx
SELECT * FROM `xx.mytable`
WHERE items IN UNNEST(@list)
I also found documentation [3] that shows the parameters for the %%bigquery magic.
[1] https://cloud.google.com/bigquery/docs/parameterized-queries#using_arrays_in_parameterized_queries
[2] https://notebook.community/GoogleCloudPlatform/python-docs-samples/notebooks/tutorials/bigquery/BigQuery%20query%20magic
[3] https://googleapis.dev/python/bigquery/latest/magics.html

Need to add query tag to snowflake queries while I fetch data from python ( using threadpool, code is provided)

I am using SQLAlchemy's create_engine to connect Python to Snowflake for fetching data; a snippet of how I am doing it is below. Before you suggest the Snowflake connector (snowflake.connector): I have tried that and it supports a query tag, but I need to pull queries with the thread-pool method and couldn't find a way to add a query tag there.
I have also tried ALTER SESSION SET QUERY_TAG, but since the queries run in parallel it doesn't apply the query tag.
Code:
from multiprocessing.pool import ThreadPool

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

vendor_class_query = 'select * from table'
query_list1 = [vendor_class_query]
pool = ThreadPool(8)

def query(x):
    engine = create_engine(
        'snowflake://{user}:{password}@{account}/{database_name}/{schema_name}?'
        'warehouse={warehouse}&role={role}&paramstyle={paramstyle}'.format(
            user=----------,
            password=----------,
            account=----------,
            database_name=----------,
            schema_name=----------,
            warehouse=----------,
            role=----------,
            paramstyle='pyformat'
        ),
        poolclass=NullPool
    )
    try:
        connection = engine.connect()
        for df in pd.read_sql_query(x, engine, chunksize=1000000000):
            df.columns = map(str.upper, df.columns)
            return df
    finally:
        connection.close()
        engine.dispose()
    return df

results1 = pool.map(query, query_list1)
vendor_class = results1[0]
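One possible way to still get a tag with the thread-pool approach is sketched below. It is not tested against the setup above; it assumes snowflake-sqlalchemy forwards connect_args to snowflake.connector.connect(), whose session_parameters argument supports QUERY_TAG, and the make_engine helper and its placeholder values are illustrative only.
# Hedged sketch (not from the question): set QUERY_TAG per connection via
# session_parameters when each engine is created.
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool


def make_engine(query_tag):
    # Placeholder credentials; reuse the same values as in the snippet above.
    url = (
        'snowflake://{user}:{password}@{account}/{database_name}/{schema_name}'
        '?warehouse={warehouse}&role={role}&paramstyle=pyformat'
    ).format(
        user='USER',
        password='PASSWORD',
        account='ACCOUNT',
        database_name='DATABASE',
        schema_name='SCHEMA',
        warehouse='WAREHOUSE',
        role='ROLE',
    )
    return create_engine(
        url,
        # QUERY_TAG is applied to every statement run on connections
        # opened by this engine.
        connect_args={'session_parameters': {'QUERY_TAG': query_tag}},
        poolclass=NullPool,
    )


# Inside query(x), each thread would then build its engine with its own tag, e.g.
# engine = make_engine('vendor_class_fetch')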

BigQuery updates failing, but only when batched using Python API

I am trying to update a table using batched update statements. DML queries successfully execute in the BigQuery Web UI, but when batched, the first one succeeds while others fail. Why is this?
A sample query:
query = '''
update `project.dataset.Table`
set my_fk = 1234
where other_fk = 222 and
received >= PARSE_TIMESTAMP("%Y-%m-%d %H:%M:%S", "2018-01-22 05:28:12") and
received <= PARSE_TIMESTAMP("%Y-%m-%d %H:%M:%S", "2018-01-26 02:31:51")
'''
Sample code:
job_config = bigquery.QueryJobConfig()
job_config.priority = bigquery.QueryPriority.BATCH

queries = []  # list of DML strings
jobs = []
for query in queries:
    job = client.query(query, location='US', job_config=job_config)
    jobs.append(job)
Job output:
for job in jobs[1:]:
    print(job.state)
    # Done
    print(job.error_result)
    # {'message': 'Cannot set destination table in jobs with DML statements',
    #  'reason': 'invalidQuery'}
    print(job.use_legacy_sql)
    # False
    print(job.job_type)
    # Query
I suspect that the problem is job_config getting some fields populated (destination in particular) by the BigQuery API after the first job is inserted. Then, the second job will fail as it will be a DML statement with a destination table in the job configuration. You can verify that with:
for query in queries:
    print(job_config.destination)
    job = client.query(query, location='US', job_config=job_config)
    print(job_config.destination)
    jobs.append(job)
To solve this you can avoid reusing the same job_config for all jobs:
for query in queries:
    # Use a fresh job_config for each job so the destination populated by a
    # previous job cannot leak into the next DML statement.
    job_config = bigquery.QueryJobConfig()
    job_config.priority = bigquery.QueryPriority.BATCH
    job = client.query(query, location='US', job_config=job_config)
    jobs.append(job)
Your code seems to be working fine on a single update. This is what I tried using Python 3.6.5 and v1.9.0 of the client API:
from google.cloud import bigquery
client = bigquery.Client()
query = '''
UPDATE `project.dataset.table` SET msg = null WHERE x is null
'''
job_config = bigquery.QueryJobConfig()
job_config.priority = bigquery.QueryPriority.BATCH
job = client.query(query, location='US', job_config=job_config)
print(job.state)
# PENDING
print(job.error_result)
# None
print(job.use_legacy_sql)
# False
print(job.job_type)
# Query
Please check your configuration and provide the full code with an error log if this doesn't help you solve your problem.
BTW, I also verified this from the command line:
sh-3.2# ./bq query --nouse_legacy_sql --batch=true 'UPDATE `project.dataset.table` SET msg = null WHERE x is null'
Waiting on bqjob_r5ee4f5dd56dc212f_000001697d3f9a56_1 ... (133s) Current status: RUNNING
Waiting on bqjob_r5ee4f5dd56dc212f_000001697d3f9a56_1 ... (139s) Current status: DONE
sh-3.2#
sh-3.2# python --version

Airflow BigQueryOperator: how to save query result in a partitioned Table?

I have a simple DAG
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

with DAG(dag_id='my_dags.my_dag') as dag:
    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    sql = """
    SELECT *
    FROM 'another_dataset.another_table'
    """

    bq_query = BigQueryOperator(bql=sql,
                                destination_dataset_table='my_dataset.my_table20180524',
                                task_id='bq_query',
                                bigquery_conn_id='my_bq_connection',
                                use_legacy_sql=False,
                                write_disposition='WRITE_TRUNCATE',
                                create_disposition='CREATE_IF_NEEDED',
                                query_params={})

    start >> bq_query >> end
When executing the bq_query task, the SQL query gets saved in a sharded table. I want it to get saved in a daily partitioned table. In order to do so, I only changed destination_dataset_table to my_dataset.my_table$20180524. I got the error below when executing the bq_query task:
Partitioning specification must be provided in order to create partitioned table
How can I tell BigQuery to save the query result to a daily partitioned table? My first guess was to use query_params in BigQueryOperator,
but I didn't find any example of how to use that parameter.
EDIT:
I'm using google-cloud==0.27.0 python client ... and it's the one used in Prod :(
You first need to create an empty partitioned destination table. Follow the instructions here (link to create an empty partitioned table) or see the sketch just below, and then run the Airflow pipeline further down again.
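For example, with a recent google-cloud-bigquery client (newer than the 0.27.0 version mentioned in the question), the empty day-partitioned table could be created as sketched below; the project, dataset, table name and schema are placeholders, not taken from the question.
from google.cloud import bigquery

client = bigquery.Client()

# Define the destination table with an ingestion-time DAY partition.
table = bigquery.Table(
    "my_project.my_dataset.my_table",
    schema=[
        bigquery.SchemaField("id", "INT64"),
        bigquery.SchemaField("name", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)

# Create the empty partitioned table before running the pipeline.
client.create_table(table)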
You can try this code:
import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

today_date = datetime.datetime.now().strftime("%Y%m%d")
table_name = 'my_dataset.my_table' + '$' + today_date

with DAG(dag_id='my_dags.my_dag') as dag:
    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    sql = """
    SELECT *
    FROM 'another_dataset.another_table'
    """

    bq_query = BigQueryOperator(bql=sql,
                                # destination_dataset_table is a templated field, so the
                                # Jinja expression below is rendered from the operator's
                                # params dict (not query_params) at run time.
                                destination_dataset_table='{{ params.t_name }}',
                                task_id='bq_query',
                                bigquery_conn_id='my_bq_connection',
                                use_legacy_sql=False,
                                write_disposition='WRITE_TRUNCATE',
                                create_disposition='CREATE_IF_NEEDED',
                                params={'t_name': table_name},
                                dag=dag
                                )

    start >> bq_query >> end
So what I did was create a dynamic table name variable and pass it to the BQ operator.
The main issue here is that I don't have access to the new version of the Google Cloud Python API; prod is using version 0.27.0.
So, to get the job done, I did something bad and dirty:
saved the query result in a sharded table, let it be table_sharded;
got table_sharded's schema, let it be table_schema;
saved the "SELECT * FROM dataset.table_sharded" query to a partitioned table, providing table_schema.
All this is abstracted in one single operator that uses a hook. The hook is responsible for creating/deleting tables/partitions, getting table schemas and running queries on BigQuery.
Have a look at the code. If there is any other solution, please let me know.
Using BigQueryOperator you can pass the time_partitioning parameter, which will create ingestion-time partitioned tables:
bq_cmd = BigQueryOperator(
    task_id="task_id",
    sql=[query],
    destination_dataset_table=destination_tbl,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    time_partitioning={'time_partitioning_type': 'DAY'},
    allow_large_results=True,
    trigger_rule='all_success',
    query_params=query_params,
    dag=dag
)
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

DEFAULT_DAG_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=10),
    'project_id': Variable.get('gcp_project'),
    'zone': Variable.get('gce_zone'),
    'region': Variable.get('gce_region'),
    'location': Variable.get('gce_zone'),
}

with DAG(
        'test',
        start_date=datetime(2019, 1, 1),
        schedule_interval=None,
        catchup=False,
        default_args=DEFAULT_DAG_ARGS) as dag:

    bq_query = BigQueryOperator(
        task_id='create-partition',
        # Table from which you want to pull data.
        bql="""SELECT
                   *
               FROM
                   `dataset.table_name`""",
        # Ingestion-time partitioned table in BigQuery; today's date is used as the partition decorator.
        destination_dataset_table='project.dataset.table_name' + '$' + datetime.now().strftime('%Y%m%d'),
        write_disposition='WRITE_TRUNCATE',
        create_disposition='CREATE_IF_NEEDED',
        use_legacy_sql=False,
    )
I recommend using Airflow Variables to define all these fields and referencing them in the DAG.
With the above code, a partition for today's date will be added to the BigQuery table.