pull xcom in BigQueryOperator - google-bigquery

I'm trying to run a BigQueryOperator with a dynamic parameter based on a previous task, using XCom (I managed to push the value from a BashOperator with xcom_push=True).
I thought the following would do the trick:
def get_next_run_date(**context):
    last_date = context['task_instance'].xcom_pull(task_ids=['get_autoplay_last_run_date'])[0].rstrip()
    last_date = datetime.strptime(last_date, "%Y%m%d").date()
    return last_date + timedelta(days=1)
t3 = BigQueryOperator(
    task_id='autoplay_calc',
    bql='autoplay_calc.sql',
    params={
        "env": deployment,
        "region": region,
        "partition_start_date": get_next_run_date()
    },
    bigquery_conn_id='gcp_conn',
    use_legacy_sql=False,
    write_disposition='WRITE_APPEND',
    allow_large_results=True,
    # provide_context=True,
    destination_dataset_table=reporting_project + '.pa_reporting_public_batch.autoplay_calc',
    dag=dag
)
but using the above gives me a broken DAG error complaining about 'task_instance'.

Have you tried using context['ti'].xcom_pull()?

You are using it the wrong way.
You cannot use XCom in params; you need to use it in the bql/sql parameter, which is templated. Your SQL file, autoplay_calc.sql, can contain something like:
select * from XYZ where date = "{{ task_instance.xcom_pull(task_ids=['get_autoplay_last_run_date'])[0].rstrip() }}"
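For reference, a minimal sketch of the same idea with the query passed inline rather than from the .sql file. The table and column names are placeholders, not from the original question, and deployment, region, reporting_project and dag come from the question's own DAG; the +1 day shift from get_next_run_date would be applied in the upstream task before pushing the XCom.
# A sketch, not the poster's exact DAG: bql/sql is a templated field, so the
# XCom value is rendered at task run time rather than at DAG-parse time.
from airflow.contrib.operators.bigquery_operator import BigQueryOperator

templated_sql = """
SELECT *
FROM `some_dataset.some_table`  -- hypothetical table
WHERE partition_date = "{{ task_instance.xcom_pull(task_ids='get_autoplay_last_run_date').rstrip() }}"
"""

t3 = BigQueryOperator(
    task_id='autoplay_calc',
    bql=templated_sql,
    params={"env": deployment, "region": region},
    bigquery_conn_id='gcp_conn',
    use_legacy_sql=False,
    write_disposition='WRITE_APPEND',
    allow_large_results=True,
    destination_dataset_table=reporting_project + '.pa_reporting_public_batch.autoplay_calc',
    dag=dag,
)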


Insert into BigQuery using parameters from web server request json data

In a Google Cloud Function I'm triggering a DAG:
endpoint = f"api/v1/dags/{dag_id}/dagRuns"
request_url = f"{web_server_url}/{endpoint}"
json_data = {"conf": {"file_name": file_name}}
response = make_composer2_web_server_request(
    request_url, method="POST", json=json_data
)
It is possible to read that data in a PythonOperator:
def printParams(**kwargs):
    for k, v in kwargs["params"].items():
        print(f"{k}:{v}")

task = PythonOperator(
    task_id="print_parameters",
    python_callable=printParams)
What is the way to pass file_name, taken from the POST request, into the SQL statement?
perform_analytics = BigQueryExecuteQueryOperator(
    task_id="perform_analytics",
    sql=f"""
        Select **passFileNameVariableFromJsonDataHere** as FileName
        FROM my_project.sample.table1
    """,
    destination_dataset_table=f"my_project.sample.table2",
    write_disposition="WRITE_APPEND",
    use_legacy_sql=False
)
Is there a way to avoid using XCom variables, since this is a single-task-level operation?
I was able to read the value in the SQL query like this:
sql=f"""
    Select "{{{{dag_run.conf.get('file_name')}}}}" as FileName
    FROM my_project.sample.table1
"""

How can I get the results of a query in bigquery into a list?

I'm running a DAG that calls multiple stored procedures in BigQuery on each run. Currently, my code is the following:
sp_names = [
    'sp_airflow_test_1',
    'sp_airflow_test_2'
]

# Define DAG
with DAG(
        dag_id,
        default_args=default_args) as dag:

    i = 0
    task_array = []
    for sp in sp_names:
        i = i + 1
        task_array.append(
            BigQueryOperator(
                task_id='run_{}'.format(sp),
                sql="""CALL `[project].[dataset].{}`();""".format(sp),
                use_legacy_sql=False
            )
        )
        # chain the stored-procedure tasks sequentially
        if i > 1:
            task_array[i - 2] >> task_array[i - 1]
I'd like my list sp_names to be the result of a query against a one-column table stored in my BQ dataset, instead of being hardcoded like it is right now.
How can I do this?
Thanks in advance.
To execute multiple BigQuery tasks with a similar SQL structure, create the BigQueryOperator instances dynamically with a create_dynamic_task function:
# function to create a task
def create_dynamic_task(sp):
    task = BigQueryOperator(
        task_id='run_{}'.format(sp),
        sql="""CALL `[project].[dataset].{}`();""".format(sp),
        use_legacy_sql=False
    )
    return task

# dynamically create tasks
task_list = []
for sp in sp_names:
    task_list.append(create_dynamic_task(sp))
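The answer above still assumes sp_names is defined in the DAG file. One way to build it from a one-column BigQuery table is to query it with the google-cloud-bigquery client; this is only a sketch with placeholder project/dataset/table/column names, and it runs at DAG-parse time, which the scheduler re-executes regularly, so keep the query small (or cache the list in an Airflow Variable).
# Sketch: build sp_names from a one-column BigQuery table.
# Project, dataset, table and column names are placeholders.
from google.cloud import bigquery

def get_sp_names():
    client = bigquery.Client()
    rows = client.query(
        "SELECT sp_name FROM `my_project.my_dataset.sp_names_table`"
    ).result()
    return [row["sp_name"] for row in rows]

sp_names = get_sp_names()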

use parameter as date in execute script

I am working with the ExecuteScript processor in NiFi. I declared lastdate as 2020-12-21 and I am trying to use this attribute in the ExecuteScript (Groovy) script to fetch data from Oracle.
In Oracle the query gave me the correct result; in NiFi the flowfile goes to failure.
My code (script body):
def last_data = flowFile.getAttribute('last_date')
query = "select t.* from mytable where r.mydate > " + last_date + ",'yyyy-mm-dd')
Hehe, your variable is last_data but you concatenate last_date.
I think that's it, buddy :)

Sqlalchemy use json_set to update specific JSON field

I intend to use the func construct to update a specific JSON field in SQLAlchemy, but I ran into a problem. Here is my update code:
self.db.query(TestModel).filter(TestModel.test_id == self._test_id).update(
    {field_name: func.json_set(
        field_name,
        "$." + key,
        formatted_val)},
    synchronize_session='fetch'
)
self.db.commit()
I ran the code above and got the error:
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) malformed JSON
So I checked the log and found that SQLAlchemy had emitted an SQL clause like this:
UPDATE test_model SET field_name=json_set('field_name', '$.keyname', 'value') WHERE test_model.test_id = 1;
The problem is that SQLAlchemy should not use the string literal 'field_name' to identify the field; it should use the column field_name. When I run the corrected SQL clause below in an SQL client:
UPDATE test_model SET field_name=json_set(field_name, '$.keyname', 'value') WHERE test_model.test_id = 1;
it works fine.
I just want to know how to make SQLAlchemy emit the correct reference, field_name instead of 'field_name'.
You should pass the model column as the first parameter to func.json_set:
self.db.query(TestModel).filter(TestModel.test_id == self._test_id).update(
    {field_name: func.json_set(
        TestModel.field_name,
        "$." + key,
        formatted_val)},
    synchronize_session='fetch'
)
self.db.commit()
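If field_name is only available as a string at run time, a small variation (my assumption, not part of the original answer) is to resolve the column object with getattr, so both the dictionary key and the json_set argument refer to the column rather than a quoted string literal.
# Sketch: resolve the column from its string name so json_set receives the
# column reference (rendered unquoted) instead of a string literal.
column = getattr(TestModel, field_name)
self.db.query(TestModel).filter(TestModel.test_id == self._test_id).update(
    {column: func.json_set(column, "$." + key, formatted_val)},
    synchronize_session='fetch',
)
self.db.commit()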

Airflow BigQueryOperator: how to save query result in a partitioned Table?

I have a simple DAG
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

with DAG(dag_id='my_dags.my_dag') as dag:

    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    sql = """
        SELECT *
        FROM `another_dataset.another_table`
    """

    bq_query = BigQueryOperator(bql=sql,
                                destination_dataset_table='my_dataset.my_table20180524',
                                task_id='bq_query',
                                bigquery_conn_id='my_bq_connection',
                                use_legacy_sql=False,
                                write_disposition='WRITE_TRUNCATE',
                                create_disposition='CREATE_IF_NEEDED',
                                query_params={})

    start >> bq_query >> end
When executing the bq_query task, the query result gets saved in a sharded table. I want it to get saved in a daily partitioned table. In order to do so, I only changed destination_dataset_table to my_dataset.my_table$20180524. I got the error below when executing the task:
Partitioning specification must be provided in order to create partitioned table
How can I tell BigQuery to save the query result to a daily partitioned table? My first guess was to use query_params in BigQueryOperator, but I didn't find any example of how to use that parameter.
EDIT:
I'm using the google-cloud==0.27.0 Python client ... and it's the one used in prod :(
You first need to create an empty partitioned destination table. Follow the instructions here: link to create an empty partitioned table, and then run the Airflow pipeline below again.
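For that first step, a sketch of creating the empty day-partitioned table with the Python BigQuery client; the schema field is a placeholder, the bq CLI or the console works just as well, and this assumes a client version newer than the 0.27.0 mentioned in the question.
# Sketch: create an empty day-partitioned destination table up front.
from google.cloud import bigquery

client = bigquery.Client()
table = bigquery.Table(
    "my_project.my_dataset.my_table",
    schema=[bigquery.SchemaField("some_column", "STRING")],  # placeholder schema
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
client.create_table(table)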
You can try this code:
import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

today_date = datetime.datetime.now().strftime("%Y%m%d")
table_name = 'my_dataset.my_table' + '$' + today_date

with DAG(dag_id='my_dags.my_dag') as dag:

    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    sql = """
        SELECT *
        FROM `another_dataset.another_table`
    """

    bq_query = BigQueryOperator(bql=sql,
                                destination_dataset_table='{{ params.t_name }}',
                                task_id='bq_query',
                                bigquery_conn_id='my_bq_connection',
                                use_legacy_sql=False,
                                write_disposition='WRITE_TRUNCATE',
                                create_disposition='CREATE_IF_NEEDED',
                                params={'t_name': table_name},
                                dag=dag
                                )

    start >> bq_query >> end
So what I did is that I created a dynamic table-name variable and passed it to the BQ operator.
The main issue here is that I don't have access to the new version of the Google Cloud Python API; prod is using version 0.27.0.
So, to get the job done, I did something bad and dirty:
saved the query result in a sharded table, let it be table_sharded
got table_sharded's schema, let it be table_schema
ran a "SELECT * FROM dataset.table_sharded" query into a partitioned table, providing table_schema
All this is abstracted in one single operator that uses a hook. The hook is responsible for creating/deleting tables/partitions, getting table schemas and running queries on BigQuery.
Have a look at the code. If there is any other solution, please let me know.
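For completeness, if a newer google-cloud-bigquery client had been available (the question is pinned to google-cloud==0.27.0, so this is only a sketch of the more recent API, with placeholder names), the last step of that workaround could be done by writing the query result straight into a daily partition:
# Sketch with a newer google-cloud-bigquery client (not the 0.27.0 from the
# question): write a query result into one daily partition of the destination.
from google.cloud import bigquery

client = bigquery.Client()
destination = bigquery.TableReference(
    bigquery.DatasetReference("my_project", "my_dataset"),
    "my_table$20180524",  # partition decorator selects the daily partition
)
job_config = bigquery.QueryJobConfig(
    destination=destination,
    write_disposition=bigquery.WriteDisposition.WRITE_TRUNCATE,
    time_partitioning=bigquery.TimePartitioning(
        type_=bigquery.TimePartitioningType.DAY
    ),
)
client.query(
    "SELECT * FROM `my_dataset.table_sharded`", job_config=job_config
).result()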
Using BigQueryOperator you can pass the time_partitioning parameter, which will create ingestion-time partitioned tables:
bq_cmd = BigQueryOperator(
    task_id="task_id",
    sql=[query],
    destination_dataset_table=destination_tbl,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    time_partitioning={'time_partitioning_type': 'DAY'},
    allow_large_results=True,
    trigger_rule='all_success',
    query_params=query_params,
    dag=dag
)
from datetime import datetime, timedelta
from airflow import DAG
from airflow.models import Variable
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

DEFAULT_DAG_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=10),
    'project_id': Variable.get('gcp_project'),
    'zone': Variable.get('gce_zone'),
    'region': Variable.get('gce_region'),
    'location': Variable.get('gce_zone'),
}

with DAG(
        'test',
        start_date=datetime(2019, 1, 1),
        schedule_interval=None,
        catchup=False,
        default_args=DEFAULT_DAG_ARGS) as dag:

    bq_query = BigQueryOperator(
        task_id='create-partition',
        # table from which you want to pull data
        bql="""SELECT
                   *
               FROM
                   `dataset.table_name`""",
        # auto-partitioned table in BQ
        destination_dataset_table='project.dataset.table_name' + '$' + datetime.now().strftime('%Y%m%d'),
        write_disposition='WRITE_TRUNCATE',
        create_disposition='CREATE_IF_NEEDED',
        use_legacy_sql=False,
    )
I recommend using Airflow Variables for all these fields and referencing them in the DAG.
With the code above, a partition for today's date will be added to the BigQuery table.