It seems like this is supported: you pass in an HQL script containing ${xxx} variables, and it gets preprocessed to convert them to Jinja-style {{xxx}} before the stage that actually does the template rendering and replaces them with values from a user-supplied dictionary. I believe this because there is a function like this in the HiveOperator class:
def prepare_template(self):
    if self.hiveconf_jinja_translate:
        self.hql = re.sub(
            "(\$\{([ a-zA-Z0-9_]*)\})", "{{ \g<2> }}", self.hql)
    if self.script_begin_tag and self.script_begin_tag in self.hql:
        self.hql = "\n".join(self.hql.split(self.script_begin_tag)[1:])
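To make the effect of that translation concrete, here is a small standalone sketch (the query text is made up, not taken from my script):

import re

# Made-up HQL line, just to show what the translation regex above produces.
hql = "SELECT * FROM events WHERE dt = ${run_date}"
translated = re.sub(r"(\$\{([ a-zA-Z0-9_]*)\})", r"{{ \g<2> }}", hql)
print(translated)  # SELECT * FROM events WHERE dt = {{ run_date }}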
The problem is I cannot figure out how to trigger this piece of code to get called before the template rendering stage. I have a basic DAG script like this:
from airflow import DAG
from airflow.operators.hive_operator import HiveOperator
from datetime import datetime, timedelta

default_args = dict(
    owner='mpetronic',
    depends_on_past=False,
    start_date=datetime(2017, 5, 2),
    verbose=True,
    retries=1,
    retry_delay=timedelta(minutes=5)
)

dag = DAG(
    dag_id='report',
    schedule_interval='* * * * *',
    user_defined_macros=dict(a=1, b=2),
    default_args=default_args)

hql = open('/home/mpetronic/repos/airflow/resources/hql/report.hql').read()

task = HiveOperator(
    task_id='report_builder',
    hive_cli_conn_id='hive_dv',
    schema='default',
    mapred_job_name='report_builder',
    hiveconf_jinja_translate=True,
    dag=dag,
    hql=hql)
I can see that my user_defined_macros dictionary makes it to the code where it gets merged with a global jinja context dictionary that then is applied to my HQL script to render it as a template. However, because my HQL is native HQL, all my variables that I want to update are in the form of ${xxx} and jinja just skips over them. I need airflow to call prepare_template() first but just don't see how to make that happen.
I realize I could just manually alter my HQL from ${xxx} to {{xxx}}, and that works, but it seems like an anti-pattern. I want the script to be able to run natively or via Airflow. This is the function, in the TaskInstance class, that does render my manually altered {{xxx}} values:
def render_templates(self):
    task = self.task
    jinja_context = self.get_template_context()
    if hasattr(self, 'task') and hasattr(self.task, 'dag'):
        if self.task.dag.user_defined_macros:
            jinja_context.update(
                self.task.dag.user_defined_macros)

    rt = self.task.render_template  # shortcut to method
    for attr in task.__class__.template_fields:
        content = getattr(task, attr)
        if content:
            rendered_content = rt(attr, content, jinja_context)
            setattr(task, attr, rendered_content)
I figured out my issue. This is the regex used in the noted method:
(\$\{([ a-zA-Z0-9_]*)\})
It does not account for beeline variables in the form of:
${hivevar:var_name}
The pattern does not allow for a colon. That form is the more standard way of defining Hive variables within a namespace using beeline. To make this Jinja substitution work, you have to reference variables in the HQL using just ${var_name}, but you can only define variables in beeline using:
set hivevar:var_name=123;
I think that Airflow should fully support the hivevar:var_name style of namespaced variables when you are running with beeline given beeline is the preferred client to use with Hive.
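For reference, here is a small sketch of an extended translation regex that would also catch the namespaced form (this is my own variant, not what Airflow ships; hql just stands for the script text):

import re

# Sketch only: also strips a leading "hivevar:" namespace, so both
# ${hivevar:var_name} and ${var_name} become {{ var_name }}.
hql = "SELECT * FROM t WHERE dt = ${hivevar:run_date} AND n = ${row_limit}"
hql = re.sub(r"(\$\{(hivevar:)?([ a-zA-Z0-9_]*)\})", r"{{ \g<3> }}", hql)
print(hql)  # SELECT * FROM t WHERE dt = {{ run_date }} AND n = {{ row_limit }}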
Related
I'm trying to call a DataFrame that I created in notebook1 so I can use it in my notebook2, on Databricks Community Edition with PySpark, and I tried this code: dbutils.notebook.run("notebook1", 60, {"dfnumber2"})
but it shows this error.
py4j.Py4JException: Method _run([class java.lang.String, class java.lang.Integer, class java.util.HashSet, null, class java.lang.String]) does not exist
Any help please?
The actual problem is that you pass the last parameter ({"dfnumber2"}) incorrectly: with this syntax it's a set, not a map. You need to use the syntax {"table_name": "dfnumber2"} to represent it as a dict/map.
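In other words (a minimal sketch, reusing the "table_name" key from the example above):

# The third argument must be a dict mapping widget names to values, not a set.
dbutils.notebook.run("notebook1", 60, {"table_name": "dfnumber2"})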
But if you look into the documentation of dbutils.notebook.run, you will see the following phrase:
To implement notebook workflows, use the dbutils.notebook.* methods. Unlike %run, the dbutils.notebook.run() method starts a new job to run the notebook.
But jobs aren't supported on the Community Edition, so it won't work anyway.
Create a global temp view and pass the table name as an argument to your next notebook:
dfnumber2.createOrReplaceGlobalTempView("dfnumber2")
dbutils.notebook.run("notebook1", 60, {"table_name": "dfnumber2"})
In your notebook1 you can do:
table_name = dbutils.widgets.get("table_name")
dfnumber2 = spark.sql("select * from global_temp." + table_name)
I have a Spark Structured Streaming job that needs to use rdd.foreach inside the foreachBatch function, as per the code below:
val tableName = "ddb_table"

df
  .writeStream
  .foreachBatch { (batchDF: DataFrame, _: Long) =>
    batchDF
      .rdd
      .foreach(
        r => updateDDB(r, tableName, "key")
      )
    curDate = LocalDate.now().toString.replaceAll("-", "/")
    prevDate = LocalDate.now().minusDays(1).toString.replaceAll("-", "/")
  }
  .outputMode(OutputMode.Append)
  .option("checkpointLocation", "checkPointDir")
  .start()
  .awaitTermination()
What happens is that the tableName variable is not recognized inside the rdd.foreach function: the call to the DynamoDB API inside updateDDB raises an exception stating that the tableName cannot be null.
The issue is clearly in rdd.foreach and the way it handles variables in closures. I have read a bit about broadcast variables, but I don't have enough experience working with RDDs and Spark at that lower level to be sure of the way to go.
Some notes:
1. I need this to be inside the foreachBatch function because I need to update other variables apart from this write to DDB (in this case the curDate and prevDate variables).
2. The code runs successfully when I pass the tableName parameter directly in the function call.
3. I have one class that extends ForeachWriter that works fine when using foreach instead of foreachBatch, but as stated in point 1) I need to use the latter because I need to update several things at each streaming batch.
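For what it's worth, here is a rough PySpark sketch of the broadcast-variable idea mentioned above (update_ddb is a placeholder for the question's DynamoDB call, and df is assumed to be the streaming DataFrame):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Stand-in for the DynamoDB update call from the question.
def update_ddb(row, table_name, key):
    ...

# Broadcasting ships the value to every executor, so the closure executed by
# rdd.foreach can read it even though it runs off the driver.
table_name_bc = spark.sparkContext.broadcast("ddb_table")

def process_batch(batch_df, batch_id):
    batch_df.rdd.foreach(lambda row: update_ddb(row, table_name_bc.value, "key"))
    # Other per-batch bookkeeping (e.g. curDate/prevDate) can stay here,
    # since this function itself runs on the driver.

# df.writeStream \
#   .foreachBatch(process_batch) \
#   .option("checkpointLocation", "checkPointDir") \
#   .start() \
#   .awaitTermination()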
I am using BigQueryExecuteQueryOperator in Airflow. I have my SQL in a separate file, and I am passing params to that file, specifically folder names and dates. When I pass a date, though, I get an error, because the date is passed into the SQL file without any quotes.
Error
400 No matching signature for function DATE for argument types
From the Airflow logs, the param is sent to the SQL file with no quotes around it:
DATE(2020-01-01)
Ways I have tried to define the date
partitionFilter = datetime.today().strftime('%Y-%m-%d')
partitionFilter = str(datetime.today()).split()[0]
partitionFilter = '2020-01-01'
partitionFilter = datetime.now().strftime('%Y-%m-%d')
Fix that seems to work
In the SQL file, if I wrap the passed param in quotes, SQL is able to read it as a date. This seems a little hacky, and I am wondering whether there is a better way. What I am concerned about is that as this application grows, maybe the date will be passed a different way and will already include quotes, which would not work since the quotes are hard-coded in the SQL file.
This works in the sql file if I wrap param in 'quotes':
DATE('{{params.partitionFilter}}')
This does not work in the sql file:
DATE( {{params.partitionFilter}} )
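For what it's worth, here is a small illustration of why the unquoted form fails (the rendered text below is my reconstruction, not taken from the logs):

# Jinja's params substitution is plain text replacement.
partitionFilter = '2020-01-01'

unquoted = "DATE( {{params.partitionFilter}} )"   # renders to: DATE( 2020-01-01 )
quoted   = "DATE('{{params.partitionFilter}}')"   # renders to: DATE('2020-01-01')

# BigQuery reads the unquoted 2020-01-01 as integer arithmetic (2020 - 1 - 1 = 2018),
# and there is no DATE() signature that takes a single INT64, hence the
# "no matching signature" error. The quoted form is a string literal, which works.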
Excerpt from the DAG:
partitionFilter = datetime.today().strftime('%Y-%m-%d')

with DAG(
    dag_id,
    schedule_interval='@daily',
    start_date=days_ago(1),
    catchup=False,
    user_defined_macros={"DATASET": DATASET_NAME}
) as dag:

    query_one = BigQueryExecuteQueryOperator(
        task_id='query_one',
        sql='/sql/my-file.sql',
        use_legacy_sql=False,
        params={'partitionFilter': partitionFilter}
    )
There are the following attributes in client_output:
weights_delta = attr.ib()
client_weight = attr.ib()
model_output = attr.ib()
client_loss = attr.ib()
After that, I turned the client_output into a sequence through a = tff.federated_collect(client_output) and round_model_delta = tff.federated_map(selecting_fn, a), and I declared:
@tff.tf_computation()  # append
def selecting_fn(a):
    # TODO
    return round_model_delta
In the process of averaging on the server, I want to average the weights_delta by selecting some of the clients with a small loss value. So I try to access it via a.weights_delta, but it doesn't work.
tff.federated_collect returns a tff.SequenceType placed at tff.SERVER, which you can manipulate the same way as, for example, a client dataset is usually handled in a method decorated by tff.tf_computation.
Note that you have to use the tff.federated_collect operator in the scope of a tff.federated_computation. What you probably want to do[*] is pass it into a tff.tf_computation, using the tff.federated_map operator. Once inside the tff.tf_computation, you can think of it as a tf.data.Dataset object and everything in the tf.data module is available.
[*] I am guessing. More detailed explanation of what you would like to achieve would be helpful.
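A rough, untested sketch of that pattern, with a made-up element type standing in for client_output (the field shapes and the loss threshold are assumptions, not from the question):

import collections
import tensorflow as tf
import tensorflow_federated as tff

# Hypothetical element type for one client_output; replace with the real one.
CLIENT_OUTPUT_SPEC = collections.OrderedDict(
    weights_delta=tff.TensorType(tf.float32, [10]),
    client_loss=tff.TensorType(tf.float32),
)

@tff.tf_computation(tff.SequenceType(CLIENT_OUTPUT_SPEC))
def selecting_fn(outputs):
    # Inside the tf_computation the server-placed sequence behaves like a
    # tf.data.Dataset, so ordinary tf.data transformations are available.
    selected = outputs.filter(lambda o: o['client_loss'] < 1.0)
    return selected.map(lambda o: o['weights_delta'])

# In the enclosing tff.federated_computation, roughly:
#   collected = tff.federated_collect(client_outputs)            # sequence at SERVER
#   round_model_delta = tff.federated_map(selecting_fn, collected)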
If I were retrieving the data I wanted from a plain SQL query, the following would suffice:
select * from stvterm where stvterm_code > TT_STUDENT.STU_GENERAL.F_Get_Current_term()
I have a Grails domain set up correctly for this table, and I can run the following code successfully:
def a = SaturnStvterm.findAll("from SaturnStvterm as s where id > 201797") as JSON
a.render(response)
return false
In other words, I can hardcode the result of the Oracle function and have the HQL run correctly, but it chokes on every way I can figure out to call the function itself. I have read through some of the Hibernate documentation on using procs and functions, but I'm having trouble making much sense of it. Can anyone give me a hint as to the proper way to handle this?
Also, since I think it is probably relevant, there aren't any synonyms in place that would allow the function to be called without qualifying it as schema.package.function(). I'm sure that'll make things more difficult. This is all for Grails 1.3.7, though I could use a later version if needed.
To call a function in HQL, the SQL dialect must be aware of it. You can add your function at runtime in BootStrap.groovy like this:
import org.hibernate.dialect.function.SQLFunctionTemplate
import org.hibernate.Hibernate
def dialect = applicationContext.sessionFactory.dialect
def getCurrentTerm = new SQLFunctionTemplate(Hibernate.INTEGER, "TT_STUDENT.STU_GENERAL.F_Get_Current_term()")
dialect.registerFunction('F_Get_Current_term', getCurrentTerm)
Once registered, you should be able to call the function in your queries:
def a = SaturnStvterm.findAll("from SaturnStvterm as s where id > TT_STUDENT.STU_GENERAL.F_Get_Current_term()")