In the Google Cloud Function I'm triggering a DAG:
endpoint = f"api/v1/dags/{dag_id}/dagRuns"
request_url = f"{web_server_url}/{endpoint}"
json_data = {"conf": {"file_name": file_name}}
response = make_composer2_web_server_request(
    request_url, method="POST", json=json_data
)
It is possible to obtain that data in a PythonOperator:
def printParams(**kwargs):
    for k, v in kwargs["params"].items():
        print(f"{k}:{v}")

task = PythonOperator(
    task_id="print_parameters",
    python_callable=printParams)
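For reference, the same payload is also available from the dag_run object in the task context, so a hedged variant of the callable (the function name here is just illustrative) can read the conf dict directly:

def print_conf(**kwargs):
    # dag_run.conf holds whatever was sent under "conf" in the POST body
    conf = kwargs["dag_run"].conf or {}
    print(conf.get("file_name"))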
How can I pass the file_name from the POST request into the SQL statement?
perform_analytics = BigQueryExecuteQueryOperator(
    task_id="perform_analytics",
    sql=f"""
    Select **passFileNameVariableFromJsonDataHere** as FileName
    FROM my_project.sample.table1
    """,
    destination_dataset_table=f"my_project.sample.table2",
    write_disposition="WRITE_APPEND",
    use_legacy_sql=False
)
Is there a way to avoid using XCom variables, since this is a single-task operation?
I was able to read the value in the SQL query like this (sql is a templated field, so the Jinja expression is rendered from dag_run.conf at run time):

sql="""
Select "{{ dag_run.conf.get('file_name') }}" as FileName
FROM my_project.sample.table1
"""
I'm running a DAG that runs multiple stored procedures in BigQuery in each DAG run. Currently, my code is the following:
sp_names = [
    'sp_airflow_test_1',
    'sp_airflow_test_2'
]

i = 0
task_array = []

# Define DAG
with DAG(
        dag_id,
        default_args=default_args) as dag:

    for sp in sp_names:
        i = i + 1
        task_array.append(
            BigQueryOperator(
                task_id='run_{}'.format(sp),
                sql="""CALL `[project].[dataset].{}`();""".format(sp),
                use_legacy_sql=False
            )
        )
        # chain each stored-procedure task after the previous one
        if i > 1:
            task_array[i - 2] >> task_array[i - 1]
I'd like my list sp_names to be the result of a query against a one-column table stored in my BigQuery dataset, instead of being hardcoded like it is right now.
How can I do this?
Thanks in advance.
To execute multiple BigQuery stored procedures that share a similar SQL structure, create the BigQueryOperator tasks dynamically with a create_dynamic_task function.
# function to create a task
def create_dynamic_task(sp):
    task = BigQueryOperator(
        task_id='run_{}'.format(sp),
        sql="""CALL `[project].[dataset].{}`();""".format(sp),
        use_legacy_sql=False
    )
    return task

# dynamically create the tasks (inside the DAG context so they are attached to the DAG)
task_list = []
for sp in sp_names:
    task_list.append(create_dynamic_task(sp))
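As for sourcing sp_names from a one-column BigQuery table instead of hardcoding it, one option is to query that table when the DAG file is parsed. A minimal sketch, assuming the google-cloud-bigquery client is installed and using placeholder table/column names:

from google.cloud import bigquery

def get_sp_names():
    # Runs at DAG parse time; table and column names are placeholders.
    client = bigquery.Client()
    rows = client.query(
        "SELECT sp_name FROM `project.dataset.sp_names_table`"
    ).result()
    return [row.sp_name for row in rows]

sp_names = get_sp_names()

for sp in sp_names:
    task_list.append(create_dynamic_task(sp))

Keep in mind that the scheduler re-parses DAG files frequently, so this query runs often; caching the list in an Airflow Variable refreshed by a separate task is a common alternative.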
I am working with the ExecuteScript processor in NiFi. I declared last_date
as 2020-12-21 and I am trying to use this attribute in the ExecuteScript
(Groovy) script to fetch data from Oracle.
In Oracle the query gives me the correct result, but in NiFi the flow file goes to failure.
My code (script body):
def last_data = flowFile.getAttribute('last_date')
query = "select t.* from mytable where r.mydate > " + last_date + ",'yyyy-mm-dd')
Hehe, your variable is last_datA but you concatenate last_datE.
I think that's it, buddy :)
I intend to use the func construct to update a specific JSON field in SQLAlchemy, but I've run into a problem. Here is my code to update the field:
self.db.query(TestModel).filter(TestModel.test_id == self._test_id).update(
    {field_name: func.json_set(
        field_name,
        "$." + key,
        formatted_val)},
    synchronize_session='fetch'
)
self.db.commit()
I ran the code above and got this error:
sqlalchemy.exc.OperationalError: (sqlite3.OperationalError) malformed JSON
So I checked the log and found that SQLAlchemy generated a SQL clause like this:
UPDATE test_model SET field_name=json_set('field_name', '$.keyname', 'value') WHERE test_model.test_id = 1;
The problem is that SQLAlchemy should not use the string literal 'field_name' to refer to the column; it should use the column field_name itself. I tried running the corrected SQL clause below in a SQL client:
UPDATE test_model SET field_name=json_set(field_name, '$.keyname', 'value') WHERE test_model.test_id = 1;
and it works fine.
How can I make SQLAlchemy generate the column reference field_name instead of the string literal 'field_name'?
You should pass the model's column attribute as the first parameter to func.json_set:
self.db.query(TestModel).filter(TestModel.test_id == self._test_id).update(
    {field_name: func.json_set(
        TestModel.field_name,
        "$." + key,
        formatted_val)},
    synchronize_session='fetch'
)
self.db.commit()
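If the column name is only available as a string in field_name, a hedged variant (assuming TestModel actually has a column with that name) is to resolve it with getattr, so SQLAlchemy renders a column reference rather than a string literal:

column = getattr(TestModel, field_name)
self.db.query(TestModel).filter(TestModel.test_id == self._test_id).update(
    {column: func.json_set(
        column,  # the column itself, not the string 'field_name'
        "$." + key,
        formatted_val)},
    synchronize_session='fetch'
)
self.db.commit()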
I have a simple DAG:
from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

with DAG(dag_id='my_dags.my_dag') as dag:
    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    sql = """
    SELECT *
    FROM `another_dataset.another_table`
    """

    bq_query = BigQueryOperator(bql=sql,
                                destination_dataset_table='my_dataset.my_table20180524',
                                task_id='bq_query',
                                bigquery_conn_id='my_bq_connection',
                                use_legacy_sql=False,
                                write_disposition='WRITE_TRUNCATE',
                                create_disposition='CREATE_IF_NEEDED',
                                query_params={})

    start >> bq_query >> end
When executing the bq_query task, the query result gets saved in a sharded table. I want it to be saved in a daily partitioned table instead. To do so, I only changed destination_dataset_table to my_dataset.my_table$20180524. I got the error below when executing the bq_query task:
Partitioning specification must be provided in order to create partitioned table
How can I tell BigQuery to save the query result to a daily partitioned table? My first guess was to use query_params in BigQueryOperator,
but I didn't find any example of how to use that parameter.
EDIT:
I'm using the google-cloud==0.27.0 Python client ... and it's the one used in prod :(
You first need to create an empty partitioned destination table (follow the instructions here: link to create an empty partitioned table)
and then run the Airflow pipeline below again.
You can try this code:
import datetime

from airflow import DAG
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

today_date = datetime.datetime.now().strftime("%Y%m%d")
table_name = 'my_dataset.my_table' + '$' + today_date

with DAG(dag_id='my_dags.my_dag') as dag:
    start = DummyOperator(task_id='start')
    end = DummyOperator(task_id='end')

    sql = """
    SELECT *
    FROM `another_dataset.another_table`
    """

    bq_query = BigQueryOperator(bql=sql,
                                # destination_dataset_table is a templated field,
                                # so '{{ params.t_name }}' renders to table_name
                                destination_dataset_table='{{ params.t_name }}',
                                task_id='bq_query',
                                bigquery_conn_id='my_bq_connection',
                                use_legacy_sql=False,
                                write_disposition='WRITE_TRUNCATE',
                                create_disposition='CREATE_IF_NEEDED',
                                params={'t_name': table_name},
                                dag=dag
                                )

    start >> bq_query >> end
So what I did is create a dynamic table name variable and pass it to the BQ operator.
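For the prerequisite step of creating the empty partitioned destination table, a minimal sketch with a recent google-cloud-bigquery client (the table name and schema below are placeholders):

from google.cloud import bigquery

client = bigquery.Client()
table = bigquery.Table(
    "my_project.my_dataset.my_table",
    schema=[bigquery.SchemaField("some_column", "STRING")],  # placeholder schema
)
table.time_partitioning = bigquery.TimePartitioning(
    type_=bigquery.TimePartitioningType.DAY
)
client.create_table(table)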
The main issue here is that I don't have access to the new version of the google-cloud Python API; prod is using version 0.27.0.
So, to get the job done, I did something bad and dirty:
saved the query result in a sharded table, call it table_sharded
got table_sharded's schema, call it table_schema
saved a "SELECT * FROM dataset.table_sharded" query to a partitioned table, providing table_schema
All of this is abstracted in one single operator that uses a hook. The hook is responsible for creating/deleting tables/partitions, getting table schemas, and running queries on BigQuery.
Have a look at the code. If there is any other solution, please let me know.
Using BigQueryOperator you can pass the time_partitioning parameter, which will create ingestion-time partitioned tables:
bq_cmd = BigQueryOperator(
    task_id='task_id',
    sql=[query],
    destination_dataset_table=destination_tbl,
    use_legacy_sql=False,
    write_disposition='WRITE_TRUNCATE',
    # follows the BigQuery API timePartitioning dict
    time_partitioning={'type': 'DAY'},
    allow_large_results=True,
    trigger_rule='all_success',
    query_params=query_params,
    dag=dag
)
from datetime import datetime, timedelta

from airflow import DAG
from airflow.models import Variable
from airflow.contrib.operators.bigquery_operator import BigQueryOperator
from airflow.operators.dummy_operator import DummyOperator

DEFAULT_DAG_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=10),
    'project_id': Variable.get('gcp_project'),
    'zone': Variable.get('gce_zone'),
    'region': Variable.get('gce_region'),
    'location': Variable.get('gce_zone'),
}

with DAG(
        'test',
        start_date=datetime(2019, 1, 1),
        schedule_interval=None,
        catchup=False,
        default_args=DEFAULT_DAG_ARGS) as dag:

    bq_query = BigQueryOperator(
        task_id='create-partition',
        # table from which you want to pull data
        bql="""SELECT
                 *
               FROM
                 `dataset.table_name`""",
        # auto-partitioned destination table in BigQuery
        destination_dataset_table='project.dataset.table_name' + '$' + datetime.now().strftime('%Y%m%d'),
        write_disposition='WRITE_TRUNCATE',
        create_disposition='CREATE_IF_NEEDED',
        use_legacy_sql=False,
    )
I recommend using Airflow Variables to define all of these fields and referencing them in the DAG.
With the code above, a partition for today's date will be added to the BigQuery table.
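A hedged variant worth considering: destination_dataset_table is a templated field, so the partition decorator can come from Airflow's execution-date macro instead of datetime.now(), which keeps backfills writing to the partition of their logical date (table names are the same placeholders as above):

bq_query = BigQueryOperator(
    task_id='create-partition',
    # {{ ds_nodash }} renders as the execution date in YYYYMMDD form
    bql="""SELECT * FROM `dataset.table_name`""",
    destination_dataset_table='project.dataset.table_name${{ ds_nodash }}',
    write_disposition='WRITE_TRUNCATE',
    create_disposition='CREATE_IF_NEEDED',
    use_legacy_sql=False,
)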