How can I get the results of a query in bigquery into a list? - google-bigquery

I'm running a DAG that runs multiple stored procedures in BigQuery in each DAG run. Currently, my code is the following:
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

sp_names = [
    'sp_airflow_test_1',
    'sp_airflow_test_2'
]

task_array = []
i = 0

# Define DAG
with DAG(
        dag_id,
        default_args=default_args) as dag:
    for sp in sp_names:
        i = i + 1
        task_array.append(
            BigQueryOperator(
                task_id='run_{}'.format(sp),
                sql="""CALL `[project].[dataset].{}`();""".format(sp),
                use_legacy_sql=False
            )
        )
        # Chain each task after the previous one.
        if i > 1:
            task_array[i - 2] >> task_array[i - 1]
I'd like my list "sp_names" to be the result of a query I do to a 1 column table that is stored on my BQ dataset, instead of being hardcoded like it is right now.
How can I do this?
Thanks in advance.

To execute multiple BigQuery stored procedures with a similar SQL structure, create the BigQueryOperator tasks dynamically with a create_dynamic_task function.
# function to create a task
def create_dynamic_task(sp):
    task = BigQueryOperator(
        task_id='run_{}'.format(sp),
        sql="""CALL `[project].[dataset].{}`();""".format(sp),
        use_legacy_sql=False
    )
    return task

# dynamically create the tasks
task_list = []
for sp in sp_names:
    task_list.append(create_dynamic_task(sp))
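To make sp_names come from a one-column table instead of being hardcoded, one option is to fetch the names with the BigQuery client and feed them into the same loop. This is a minimal sketch: the control table `[project].[dataset].sp_list` and its sp_name column are assumed names, not from the question.
from google.cloud import bigquery

def get_sp_names():
    # Read the stored-procedure names from a one-column control table.
    client = bigquery.Client()
    rows = client.query(
        "SELECT sp_name FROM `[project].[dataset].sp_list`"
    ).result()
    return [row.sp_name for row in rows]

sp_names = get_sp_names()

task_list = [create_dynamic_task(sp) for sp in sp_names]

# Chain the tasks sequentially, as in the original DAG.
for upstream, downstream in zip(task_list, task_list[1:]):
    upstream >> downstream

Note that this query runs every time the scheduler parses the DAG file, so keep the control table small or cache the result (for example in an Airflow Variable).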

Related

Insert into BigQuery using parameters from web server request json data

In the Google Cloud Function I'm triggering a DAG:
endpoint = f"api/v1/dags/{dag_id}/dagRuns"
request_url = f"{web_server_url}/{endpoint}"
json_data = {"conf": {"file_name": file_name}}

response = make_composer2_web_server_request(
    request_url, method="POST", json=json_data
)
It is possible to obtain that data in a PythonOperator:
def printParams(**kwargs):
    for k, v in kwargs["params"].items():
        print(f"{k}:{v}")

task = PythonOperator(
    task_id="print_parameters",
    python_callable=printParams)
What is the way to pass file_name into the sql statement taken from the POST request?
perform_analytics = BigQueryExecuteQueryOperator(
    task_id="perform_analytics",
    sql=f"""
        Select **passFileNameVariableFromJsonDataHere** as FileName
        FROM my_project.sample.table1
    """,
    destination_dataset_table=f"my_project.sample.table2",
    write_disposition="WRITE_APPEND",
    use_legacy_sql=False
)
Is there a way to avoid using XCom variables, since this is a single task-level operation?
I was able to read the value in the SQL query like this:
sql=f"""
    Select "{{{{dag_run.conf.get('file_name')}}}}" as FileName
    FROM my_project.sample.table1
"""

How to pass a variable in a full cell magic command in Jupyter/Colab?

My code uses SQL to query a database hosted in BigQuery. Say I have a list of items stored in a variable:
list = ['a','b','c']
And I want to use that list as a parameter on a query like this:
%%bigquery --project xxx query
SELECT *
FROM `xxx.database.table`
WHERE items in list
As the magic command that calls the database is a full-cell command, how can I make some kind of escape so that the SQL query can reference the variable from the Python environment?
You can try UNNEST; the query in BigQuery works like this:
SELECT * FROM `xx.mytable` WHERE items in UNNEST (['a','b','c'])
In your code it should look like this:
SELECT * FROM `xx.mytable` WHERE items in UNNEST (list)
EDIT
I found two different ways to pass variables in Python.
The first approach is below; it is from the Google documentation [1].
from google.cloud import bigquery

# Construct a BigQuery client object.
client = bigquery.Client()

query = """
    SELECT * FROM `xx.mytable` WHERE items IN UNNEST(@list)
"""
job_config = bigquery.QueryJobConfig(
    query_parameters=[
        bigquery.ArrayQueryParameter("list", "STRING", ["a", "b", "c"]),
    ]
)
query_job = client.query(query, job_config=job_config)  # Make an API request.

for row in query_job:
    # Column names here come from the documentation sample; adjust to your table.
    print("{}: \t{}".format(row.name, row.count))
The second approach is in the next document [2]. In your code it should look like this (note that params must be defined in a cell before the %%bigquery cell):
params = {"list": ["a", "b", "c"]}
%%bigquery df --params $params --project xxx
SELECT * FROM `xx.mytable`
WHERE items IN UNNEST(@list)
I also found documentation [3] that shows the parameters for the %%bigquery magic.
[1] https://cloud.google.com/bigquery/docs/parameterized-queries#using_arrays_in_parameterized_queries
[2] https://notebook.community/GoogleCloudPlatform/python-docs-samples/notebooks/tutorials/bigquery/BigQuery%20query%20magic
[3] https://googleapis.dev/python/bigquery/latest/magics.html

Need to add query tag to snowflake queries while I fetch data from python ( using threadpool, code is provided)

I am using SQLAlchemy's create_engine to connect Python to Snowflake for fetching data; a snippet of how I am doing it is below. Before you suggest the Snowflake connector (snowflake.connector): I have tried that and it supports a query tag, but I need to pull queries with the thread-pool method and couldn't find a way to add a query tag there.
I have also tried ALTER SESSION SET QUERY_TAG, but since the queries run in parallel it doesn't apply the query tag.
Code:
from multiprocessing.pool import ThreadPool

import pandas as pd
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool

vendor_class_query = 'select * from table'
query_list1 = [vendor_class_query]
pool = ThreadPool(8)

def query(x):
    engine = create_engine(
        'snowflake://{user}:{password}@{account}/{database_name}/{schema_name}?'
        'warehouse={warehouse}&role={role}&paramstyle={paramstyle}'.format(
            user=----------,
            password=----------,
            account=----------,
            database_name=----------,
            schema_name=----------,
            warehouse=----------,
            role=----------,
            paramstyle='pyformat'
        ),
        poolclass=NullPool
    )
    try:
        connection = engine.connect()
        for df in pd.read_sql_query(x, engine, chunksize=1000000000):
            df.columns = map(str.upper, df.columns)
            return df
    finally:
        connection.close()
        engine.dispose()
    return df

results1 = pool.map(query, query_list1)
vendor_class = results1[0]
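One possible way to still get a tag with the thread-pool approach is sketched below. It is not tested against the setup above; it assumes snowflake-sqlalchemy forwards connect_args to snowflake.connector.connect(), whose session_parameters argument supports QUERY_TAG, and the make_engine helper and its placeholder values are illustrative only.
# Hedged sketch (not from the question): set QUERY_TAG per connection via
# session_parameters when each engine is created.
from sqlalchemy import create_engine
from sqlalchemy.pool import NullPool


def make_engine(query_tag):
    # Placeholder credentials; reuse the same values as in the snippet above.
    url = (
        'snowflake://{user}:{password}@{account}/{database_name}/{schema_name}'
        '?warehouse={warehouse}&role={role}&paramstyle=pyformat'
    ).format(
        user='USER',
        password='PASSWORD',
        account='ACCOUNT',
        database_name='DATABASE',
        schema_name='SCHEMA',
        warehouse='WAREHOUSE',
        role='ROLE',
    )
    return create_engine(
        url,
        # QUERY_TAG is applied to every statement run on connections
        # opened by this engine.
        connect_args={'session_parameters': {'QUERY_TAG': query_tag}},
        poolclass=NullPool,
    )


# Inside query(x), each thread would then build its engine with its own tag, e.g.
# engine = make_engine('vendor_class_fetch')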

BigQuery updates failing, but only when batched using Python API

I am trying to update a table using batched update statements. DML queries successfully execute in the BigQuery Web UI, but when batched, the first one succeeds while others fail. Why is this?
A sample query:
query = '''
update `project.dataset.Table`
set my_fk = 1234
where other_fk = 222 and
received >= PARSE_TIMESTAMP("%Y-%m-%d %H:%M:%S", "2018-01-22 05:28:12") and
received <= PARSE_TIMESTAMP("%Y-%m-%d %H:%M:%S", "2018-01-26 02:31:51")
'''
Sample code:
job_config = bigquery.QueryJobConfig()
job_config.priority = bigquery.QueryPriority.BATCH

queries = []  # list of DML strings
jobs = []
for query in queries:
    job = client.query(query, location='US', job_config=job_config)
    jobs.append(job)
Job output:
for job in jobs[1:]:
    print(job.state)
    # Done
    print(job.error_result)
    # {'message': 'Cannot set destination table in jobs with DML statements',
    #  'reason': 'invalidQuery'}
    print(job.use_legacy_sql)
    # False
    print(job.job_type)
    # Query
I suspect that the problem is job_config getting some fields populated (destination in particular) by the BigQuery API after the first job is inserted. Then, the second job will fail as it will be a DML statement with a destination table in the job configuration. You can verify that with:
for query in queries:
    print(job_config.destination)
    job = client.query(query, location='US', job_config=job_config)
    print(job_config.destination)
    jobs.append(job)
To solve this you can avoid reusing the same job_config for all jobs:
for query in queries:
    # Use a fresh job_config for each job so the destination populated by a
    # previous job cannot leak into the next DML statement.
    job_config = bigquery.QueryJobConfig()
    job_config.priority = bigquery.QueryPriority.BATCH
    job = client.query(query, location='US', job_config=job_config)
    jobs.append(job)
Your code seems to be working fine on a single update. This is what I tried using Python 3.6.5 and v1.9.0 of the client API:
from google.cloud import bigquery
client = bigquery.Client()
query = '''
UPDATE `project.dataset.table` SET msg = null WHERE x is null
'''
job_config = bigquery.QueryJobConfig()
job_config.priority = bigquery.QueryPriority.BATCH
job = client.query(query, location='US', job_config=job_config)
print(job.state)
# PENDING
print(job.error_result)
# None
print(job.use_legacy_sql)
# False
print(job.job_type)
# Query
Please check your configuration and provide the full code with an error log if this doesn't help you solve your problem.
BTW, I also verified this from the command line:
sh-3.2# ./bq query --nouse_legacy_sql --batch=true 'UPDATE `project.dataset.table` SET msg = null WHERE x is null'
Waiting on bqjob_r5ee4f5dd56dc212f_000001697d3f9a56_1 ... (133s) Current status: RUNNING
Waiting on bqjob_r5ee4f5dd56dc212f_000001697d3f9a56_1 ... (139s) Current status: DONE
sh-3.2#
sh-3.2# python --version

Airflow BigQueryOperator: how to save query result in a partitioned Table?

I have a simple DAG
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

with DAG(dag_id='my_dags.my_dag') as dag:
    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    sql = """
    SELECT *
    FROM 'another_dataset.another_table'
    """

    bq_query = BigQueryOperator(bql=sql,
                                destination_dataset_table='my_dataset.my_table20180524',
                                task_id='bq_query',
                                bigquery_conn_id='my_bq_connection',
                                use_legacy_sql=False,
                                write_disposition='WRITE_TRUNCATE',
                                create_disposition='CREATE_IF_NEEDED',
                                query_params={})

    start >> bq_query >> end
When executing the bq_query task, the SQL query gets saved in a sharded table. I want it to get saved in a daily partitioned table. In order to do so, I only changed destination_dataset_table to my_dataset.my_table$20180524. I got the error below when executing the bq_query task:
Partitioning specification must be provided in order to create partitioned table
How can I tell BigQuery to save the query result to a daily partitioned table? My first guess was to use query_params in BigQueryOperator,
but I didn't find any example of how to use that parameter.
EDIT:
I'm using google-cloud==0.27.0 python client ... and it's the one used in Prod :(
You first need to create an empty partitioned destination table. Follow the instructions here (link to create an empty partitioned table) or see the sketch just below, and then run the Airflow pipeline further down again.
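For example, with a recent google-cloud-bigquery client (newer than the 0.27.0 version mentioned in the question), the empty day-partitioned table could be created as sketched below; the project, dataset, table name and schema are placeholders, not taken from the question.
from google.cloud import bigquery

client = bigquery.Client()

# Define the destination table with an ingestion-time DAY partition.
table = bigquery.Table(
    "my_project.my_dataset.my_table",
    schema=[
        bigquery.SchemaField("id", "INT64"),
        bigquery.SchemaField("name", "STRING"),
    ],
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)

# Create the empty partitioned table before running the pipeline.
client.create_table(table)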
You can try this code:
import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

today_date = datetime.datetime.now().strftime("%Y%m%d")
table_name = 'my_dataset.my_table' + '$' + today_date

with DAG(dag_id='my_dags.my_dag') as dag:
    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    sql = """
    SELECT *
    FROM 'another_dataset.another_table'
    """

    bq_query = BigQueryOperator(bql=sql,
                                # destination_dataset_table is a templated field, so the
                                # Jinja expression below is rendered from the operator's
                                # params dict (not query_params) at run time.
                                destination_dataset_table='{{ params.t_name }}',
                                task_id='bq_query',
                                bigquery_conn_id='my_bq_connection',
                                use_legacy_sql=False,
                                write_disposition='WRITE_TRUNCATE',
                                create_disposition='CREATE_IF_NEEDED',
                                params={'t_name': table_name},
                                dag=dag
                                )

    start >> bq_query >> end
So what I did was create a dynamic table name variable and pass it to the BQ operator.
The main issue here is that I don't have access to the new version of the Google Cloud Python API; prod is using version 0.27.0.
So, to get the job done, I did something bad and dirty:
saved the query result in a sharded table, let it be table_sharded;
got table_sharded's schema, let it be table_schema;
saved the "SELECT * FROM dataset.table_sharded" query to a partitioned table, providing table_schema.
All this is abstracted in one single operator that uses a hook. The hook is responsible for creating/deleting tables/partitions, getting table schemas and running queries on BigQuery.
Have a look at the code. If there is any other solution, please let me know.
Using BigQueryOperator you can pass the time_partitioning parameter, which will create ingestion-time partitioned tables:
bq_cmd = BigQueryOperator(
    task_id="task_id",
    sql=[query],
    destination_dataset_table=destination_tbl,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    time_partitioning={'time_partitioning_type': 'DAY'},
    allow_large_results=True,
    trigger_rule='all_success',
    query_params=query_params,
    dag=dag
)
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

DEFAULT_DAG_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=10),
    'project_id': Variable.get('gcp_project'),
    'zone': Variable.get('gce_zone'),
    'region': Variable.get('gce_region'),
    'location': Variable.get('gce_zone'),
}

with DAG(
        'test',
        start_date=datetime(2019, 1, 1),
        schedule_interval=None,
        catchup=False,
        default_args=DEFAULT_DAG_ARGS) as dag:

    bq_query = BigQueryOperator(
        task_id='create-partition',
        # Table from which you want to pull data.
        bql="""SELECT
                   *
               FROM
                   `dataset.table_name`""",
        # Ingestion-time partitioned table in BigQuery; today's date is used as the partition decorator.
        destination_dataset_table='project.dataset.table_name' + '$' + datetime.now().strftime('%Y%m%d'),
        write_disposition='WRITE_TRUNCATE',
        create_disposition='CREATE_IF_NEEDED',
        use_legacy_sql=False,
    )
I recommend using Airflow Variables to define all these fields and referencing them in the DAG.
With the above code, a partition for today's date will be added to the BigQuery table.