Extend BigQueryExecuteQueryOperator with additional labels using jinja2 - google-bigquery

In order to track GCP costs using labels, I would like to extend BigQueryExecuteQueryOperator with some additional labels so that each task instance gets these labels automatically set in its constructor.
class ExtendedBigQueryExecuteQueryOperator(BigQueryExecuteQueryOperator):

    @apply_defaults
    def __init__(self,
                 *args,
                 **kwargs) -> None:
        task_labels = {
            'dag_id': '{{ dag.dag_id }}',
            'task_id': kwargs.get('task_id'),
            'ds': '{{ ds }}',
            # ugly, all three params got in diff. ways
        }
        super().__init__(*args, **kwargs)
        if self.labels is None:
            self.labels = task_labels
        else:
            self.labels.update(task_labels)
with DAG(dag_id=...,
         start_date=...,
         schedule_interval=...,
         default_args=...) as dag:

    t1 = ExtendedBigQueryExecuteQueryOperator(
        task_id=f't1',
        sql=f'SELECT 1;',
        labels={'some_additional_label2': 'some_additional_label2'}
        # all labels should be: dag_id, task_id, ds, some_additional_label2
    )

    t2 = ExtendedBigQueryExecuteQueryOperator(
        task_id=f't2',
        sql=f'SELECT 2;',
        labels={'some_additional_label3': 'some_additional_label3'}
        # all labels should be: dag_id, task_id, ds, some_additional_label3
    )

    t1 >> t2
But then I lose the task-level labels some_additional_label2 or some_additional_label3.

You could create the following policy in airflow_local_settings.py:
def policy(task):
    if task.__class__.__name__ == "BigQueryExecuteQueryOperator":
        task.labels.update({'dag_id': task.dag_id, 'task_id': task.task_id})
From the docs:
Your local Airflow settings file can define a policy function that has the ability to mutate task attributes based on other task or DAG attributes. It receives a single argument as a reference to task objects, and is expected to alter its attributes.
More details on applying Policy: https://airflow.readthedocs.io/en/1.10.9/concepts.html#cluster-policy
You won't need to extend BigQueryExecuteQueryOperator in that case. The only missing part is the execution date (ds), which you can set in the task itself.
Example:
with DAG(dag_id=...,
         start_date=...,
         schedule_interval=...,
         default_args=...) as dag:

    t1 = BigQueryExecuteQueryOperator(
        task_id=f't1',
        sql=f'SELECT 1;',
        labels={'some_additional_label2': 'some_additional_label2', 'ds': '{{ ds }}'}
    )
The airflow_local_settings file needs to be on your PYTHONPATH. You can put it under $AIRFLOW_HOME/config or inside your dags directory.
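Putting the pieces together, a minimal sketch of what airflow_local_settings.py could look like (assuming Airflow 1.10.x cluster policies; the extra None check is added here only because labels defaults to None on the operator):

# airflow_local_settings.py -- minimal sketch, assuming Airflow 1.10.x cluster policy
def policy(task):
    if task.__class__.__name__ == "BigQueryExecuteQueryOperator":
        if task.labels is None:
            task.labels = {}
        task.labels.update({'dag_id': task.dag_id, 'task_id': task.task_id})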


How can I make a DataFrame from a ColumnTransformer composed by LOO Encoder, OHE and Ordinal Encoder?

Since get_feature_names() was deprecated in the "native" categorical encoders in scikit-learn (it was replaced by get_feature_names_out()), how can I build a DataFrame where the transformed variables have their proper names, given that the ColumnTransformer contains some encoders that respond to get_feature_names_out() and others that only respond to get_feature_names()? Here is the situation:
import pandas as pd
import category_encoders as ce
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder

features_pipe = make_column_transformer(
    (OneHotEncoder(handle_unknown='ignore', sparse=False), ['Gender', 'Race']),
    (OrdinalEncoder(), ['Age', 'Overall Work Exp.', 'Fieldwork Exp.', 'Level of Education']),
    (ce.LeaveOneOutEncoder(), ['State (US)'])
).fit(X_train, y_train)

X_train_encoded = features_pipe.transform(X_train)
X_test_encoded = features_pipe.transform(X_test)

X_train_encoded_df = pd.DataFrame(X_train_encoded, columns=features_pipe.get_features_names_out())
X_train_encoded_df.head()
I got this error: AttributeError: 'ColumnTransformer' object has no attribute 'get_features_names_out'
That's because LeaveOneOutEncoder does not support get_feature_names_out(). It supports get_feature_names().
How could I overcome this issue and print my DataFrame correctly?
I had this same issue once in the past.
If you don't mind using a subclass of ColumnTransformer, you can create one and modify it to call get_feature_names() when get_feature_names_out() is not available.
In this case, you should declare the class:
from sklearn.compose import ColumnTransformer
from sklearn.compose._column_transformer import _is_empty_column_selection


class MyColumnTransformer(ColumnTransformer):
    def __init__(self, transformers, **kwargs):
        super().__init__(transformers=transformers, **kwargs)

    def _get_feature_name_out_for_transformer(
        self, name, trans, column, feature_names_in
    ):
        column_indices = self._transformer_to_input_indices[name]
        names = feature_names_in[column_indices]
        if trans == "drop" or _is_empty_column_selection(column):
            return
        elif trans == "passthrough":
            return names
        if not hasattr(trans, "get_feature_names_out"):
            return trans.get_feature_names()
        return trans.get_feature_names_out(names)
Although the use of ColumnTransformer is not as simple as using make_column_transformer, it's much more customizable.
So, in this case, you also have to pass a name to each transformer using the following schema:
(name, transformer, columns)
features_pipe = MyColumnTransformer(transformers=[
    ('OHE', OneHotEncoder(handle_unknown='ignore', sparse=False), ['Gender', 'Race']),
    ('OE', OrdinalEncoder(), ['Age', 'Overall Work Exp.', 'Fieldwork Exp.', 'Level of Education']),
    ('LOOE', ce.LeaveOneOutEncoder(), ['State (US)'])
])
features_pipe.fit(X_train, y_train)
and finally continue the code the way you suggested.
If you don't want to append the transformer names to the feature names, just include verbose_feature_names_out=False when initializing MyColumnTransformer.
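To make that last step explicit, here is a sketch reusing the variable names from the question (note the method is get_feature_names_out, without the extra "s" from the question's snippet):

# Sketch of the final step, reusing the fitted MyColumnTransformer from above.
X_train_encoded = features_pipe.transform(X_train)
X_test_encoded = features_pipe.transform(X_test)

# get_feature_names_out() now works because the subclass falls back to
# get_feature_names() for encoders that lack the newer method.
X_train_encoded_df = pd.DataFrame(X_train_encoded,
                                  columns=features_pipe.get_feature_names_out())
X_train_encoded_df.head()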

Change Airflow BigQueryInsertJobOperator and BigQueryGetDataOperator Priority to Batch

I've read the Apache Airflow documentation for the BigQuery operators (https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_modules/airflow/providers/google/cloud/operators/bigquery.html#BigQueryGetDataOperator) and I can't find how to change the job priority to batch. How can it be done?
BigQueryExecuteQueryOperator has a priority parameter that can be set to INTERACTIVE or BATCH; the default is INTERACTIVE:
execute_insert_query = BigQueryExecuteQueryOperator(
    task_id="execute_insert_query",
    sql=INSERT_ROWS_QUERY,
    use_legacy_sql=False,
    location=location,
    priority='BATCH',
)
The BigQueryInsertJobOperator doesn't have it.
I think you can create a custom operator that inherits from BigQueryInsertJobOperator and adds it by overriding the _submit_job function:
class MyBigQueryInsertJobOperator(BigQueryInsertJobOperator):

    def __init__(
        self,
        priority: str = 'INTERACTIVE',
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.priority = priority

    def _submit_job(
        self,
        hook: BigQueryHook,
        job_id: str,
    ) -> BigQueryJob:
        # Submit a new job
        job = hook.insert_job(
            configuration=self.configuration,
            project_id=self.project_id,
            location=self.location,
            job_id=job_id,
            priority=self.priority,
        )
        # Start the job and wait for it to complete and get the result.
        job.result()
        return job
I didn't test it though, but it should work.
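For completeness, a sketch of how the custom operator might be wired into a DAG (the configuration dict follows the standard BigQuery query-job layout; the task id and query are placeholders taken from the earlier example):

# Hypothetical usage of the custom operator sketched above.
batch_insert = MyBigQueryInsertJobOperator(
    task_id="batch_insert",
    configuration={
        "query": {
            "query": INSERT_ROWS_QUERY,   # placeholder query
            "useLegacySql": False,
        }
    },
    location=location,
    priority="BATCH",   # consumed by the custom __init__ and forwarded to insert_job
)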

!gcloud dataproc jobs submit pyspark - ERROR AttributeError: 'str' object has no attribute 'batch'

How can I pass a dataset-type object as input to a Dataproc job?
My code is below:
%%writefile spark_job.py

import sys
import time
import pyspark
import argparse
import pickle

#def time_configs_rdd(test_set, batch_sizes, batch_numbers, repetitions):
def time_configs_rdd(argv):
    print(argv)
    parser = argparse.ArgumentParser()  # get a parser object
    parser.add_argument('--out_bucket', metavar='out_bucket', required=True,
                        help='The bucket URL for the result.')  # add a required argument
    parser.add_argument('--out_file', metavar='out_file', required=True,
                        help='The filename for the result.')  # add a required argument
    parser.add_argument('--batch_size', metavar='batch_size', required=True,
                        help='The bucket URL for the result.')  # add a required argument
    parser.add_argument('--batch_number', metavar='batch_number', required=True,
                        help='The filename for the result.')  # add a required argument
    parser.add_argument('--repetitions', metavar='repetitions', required=True,
                        help='The filename for the result.')  # add a required argument
    parser.add_argument('--test_set', metavar='test_set', required=True,
                        help='The filename for the result.')  # add a required argument
    args = parser.parse_args(argv)  # read the values
    # the value provided with --out_bucket is now in args.out_bucket
    time_configs_results = []
    for s in args.batch_size:
        for n in args.batch_number:
            dataset = args.test_set.batch(s).take(n)  # <-- the line that fails
            for r in args.repetitions:
                tt0 = time.time()
                for i in enumerate(dataset):
                    totaltime = str(time.time() - tt0)
                    batchtime = totaltime
                    #imgpersec = s*n/totaltime
                time_configs_results.append((s, n, r, float(batchtime)))
                #time_configs_results.append((s, n, r, batchtime, imgpersec))
    time_configs_results_rdd = sc.parallelize(time_configs_results)  # create an RDD with all results for each parameter
    time_configs_results_rdd_avg = time_configs_results_rdd.map(lambda x: (x, x[0]*x[1]/x[3]))  # RDD with the average reading speeds (RDD.map)
    #mapping = time_configs_results_rdd_avg.collect()
    #print(mapping)
    return time_configs_results_rdd_avg

if 'google.colab' not in sys.modules:  # Don't use system arguments when run in Colab
    time_configs_rdd(sys.argv[1:])
elif __name__ == "__main__":  # but define them manually
    time_configs_rdd(["--out_bucket", BUCKET, "--out_file", "time_configs_rdd_out.pkl",
                      "--batch_size", batch_size, "--batch_number", batch_number,
                      "--test_set", test_set])
and the code to execute it:
FILENAME = 'file_RDD_OUT.pkl'
batch_size = [1]
batch_number = [1]
repetitions = [1]
#test_set = 1 will give string error
test_set = dataset2  # a <ParallelMapDataset shapes: ((192, 192, None), ()), types: (tf.float32, tf.string)> cannot be passed in

!gcloud dataproc jobs submit pyspark --cluster $CLUSTER --region $REGION \
    ./spark_job.py \
    -- --out_bucket $BUCKET --out_file $FILENAME --batch_size $batch_size --batch_number $batch_number --repetitions $repetitions --test_set $test_set
Unfortunately it keeps failing with the error:
AttributeError: 'str' object has no attribute 'batch'
ERROR: (gcloud.dataproc.jobs.submit.pyspark) Job [c2048c422f334b08a628af5a1aa492eb] failed with error:
Job failed with message [AttributeError: 'str' object has no attribute 'batch'].
The problem is with test_set: how should I convert dataset2 (a ParallelMapDataset) so it can be read by the job?
So you are trying to parse a command-line argument string into a ParallelMapDataset. You want to use the type param in your add_argument calls.
From https://docs.python.org/3/library/argparse.html#type and I quote:
By default, ArgumentParser objects read command-line arguments in as simple strings. However, quite often the command-line string should instead be interpreted as another type, like a float or int. The type keyword argument of add_argument() allows any necessary type-checking and type conversions to be performed.
and
type= can take any callable that takes a single string argument and returns the converted value
So you probably want something like:
def parse_parallel_map_dataset(string):
    # your logic to parse the string into your desired data structure
    ...

parser.add_argument('--test_set', metavar='test_set', required=True,
                    type=parse_parallel_map_dataset)
Or better yet, read your test_set from a file and pass the file name as an argument.
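A minimal sketch of that file-based approach, assuming the test set was serialized with pickle before submitting the job (the helper name, format, and path are illustrative, not part of the original answer):

import argparse
import pickle

def load_test_set(path):
    # Load whatever serialized form of the test set was written out beforehand.
    with open(path, 'rb') as f:
        return pickle.load(f)

parser = argparse.ArgumentParser()
parser.add_argument('--test_set', metavar='test_set', required=True,
                    type=load_test_set,
                    help='Path to the serialized test set.')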

Odoo 9 context value missing in override method

In Odoo 9 I override the search_read method. The super method works OK. With the data returned I want to apply a filter; the filter value is in the context, and it was assigned on the click of the button coming from the view.
<button name="status_instalacion" string="Instalación" type="action" icon="fa-wrench fa-2x" context="{'stage_id' : 1, 'current_id': active_id}"/>
The problem occurs when I query the context in the search_read method. It exists, but it doesn't have the values I placed there.
context on click of button:
self._context
{u'lang': u'en_US', u'stage_id': 1, u'tz': False, u'uid': 1, u'current_id': 40, u'tipo_validacion': u'Sistemas Cr\xedticos', u'sistema_critico': u'AGUA'}
the stage_id is the value I want
context in search_read:
self._context
{u'lang': u'en_US', u'bin_size': True, u'tipo_validacion': u'Sistemas Cr\xedticos', u'tz': False, u'uid': 1, u'active_test': False, u'sistema_critico': u'AGUA'}
As you can see, the 'stage_id' value is missing.
I also tried assigning the value to a property of the class, but the value never changes; it is always the initial value.
from logging import getLogger

from openerp import api, fields, models

_logger = getLogger(__name__)


class MgmtsystemSistemasEquipos(models.Model):
    """ Equipos."""

    _name = 'mgmtsystem.sistemas.equipos'

    dmy = 99  # ---> this value never changes

    def dummy(self):  # ---> tried calling a function. not work
        return self.dmy

    def set_dummy(self, id):  # ----> set the value
        self.dmy = id or self.dmy

    codigo = fields.Char(
        string=u'Código',
        help=u"Código equipo",
        required=True,
        size=30)

    name = fields.Char(
        string=u'Nombre equipo',
        required=True,
        readonly=False,
        index=True,
        help="Nombre corto equipo",
        size=30)

    stage_id = fields.Many2one(
        'mgmtsystem.action.stage',
        'Fase',
        default=_default_stage,
        readonly=True)

    @api.multi
    def status_instalacion(self):
        import pudb
        pu.db
        # save value to variable dmy to retrieve later
        id = self._context.get('stage_id')
        self.set_dummy(id)

    @api.model
    def search_read(
            self, domain=None, fields=None, offset=0,
            limit=None, order=None):
        import pudb
        pu.db
        # here the variable always has the original value (99)
        current_stage_id = self.dmy
        current_stage_id = self.dummy()
        current_stage_id = getattr(self, dmy)

        res = super(MgmtsystemSistemasEquipos, self).search_read(
            domain, fields, offset, limit, order)

        current_id = res[0]['id']
        valid_protocols_ids = self._get_ids(
            current_stage_id, current_id,
            'mgmtsystem_equipos_protocolos',
            'mgmtsystem_equipos_protocolos_rel',
            'protocolo_id')

        # # remove ids
        res[0]['protocolos_ids'] = valid_protocols_ids
        res[0]['informes_ids'] = valid_informes_ids
        res[0]['anexos_ids'] = valid_anexos_ids
        return res

    # @api.multi
    def _get_ids(self, current_stage_id, current_id, model, model_rel, field_rel):
        import pudb
        pu.db
        # in this method the value of the variable is always the original
        current_stage_id = self.dummy()

        sql = """ select a.id from
                    %s as a
                    join %s as b
                    on a.id = b.%s where b.equipo_id = %s
                    and a.stage_id = %s; """ % (model, model_rel, field_rel,
                                                current_id, current_stage_id)
        import psycopg2
        try:
            self.env.cr.execute(sql)
        except psycopg2.ProgrammingError, ex:
            message = 'Error trying to download data from server. \n {0} \n {1}'.format(ex.pgerror, sql)
            _logger.info(message)
            return False

        rows = self.env.cr.fetchall()
        list_of_ids = []
        for row in rows:
            list_of_ids.append(row[0])

        return list_of_ids
I don't know Python very well, and that's the cause of my misunderstanding of how to read the value of the variable.
But then again, why is the context modified in the search_read method?
Thank you.
You should try the following.
@api.model
def search_read(self, domain=None, fields=None, offset=0, limit=None, order=None):
    import pudb
    pu.db
    # Here you need to get the value from the context.
    current_stage_id = self._context.get('stage_id', getattr(self, 'dmy'))

    res = super(MgmtsystemSistemasEquipos, self).search_read(
        domain=domain, fields=fields, offset=offset, limit=limit, order=order)

    current_id = res[0]['id']
    valid_protocols_ids = self._get_ids(
        current_stage_id, current_id,
        'mgmtsystem_equipos_protocolos',
        'mgmtsystem_equipos_protocolos_rel',
        'protocolo_id')

    # # remove ids
    res[0]['protocolos_ids'] = valid_protocols_ids
    res[0]['informes_ids'] = valid_informes_ids
    res[0]['anexos_ids'] = valid_anexos_ids
    return res
In your code those lines won't work simply because there is no recordset available in self (that's correct behaviour: search_read must have the @api.model decorator).
# here the variable always has the original value (99)
current_stage_id = self.dmy
current_stage_id = self.dummy()
current_stage_id = getattr(self, dmy)
So just remove those lines and apply some other logic to get the data.

Can I restrict objects in Python3 so that only attributes that I make a setter for are allowed?

I have something called a Node. Both Definition and Theorem are a type of node, but only Definitions should be allowed to have a plural attribute:
class Definition(Node):
    def __init__(self, dic):
        self.type = "definition"
        super(Definition, self).__init__(dic)
        self.plural = move_attribute(dic, {'plural', 'pl'}, strict=False)

    @property
    def plural(self):
        return self._plural

    @plural.setter
    def plural(self, new_plural):
        if new_plural is None:
            self._plural = None
        else:
            clean_plural = check_type_and_clean(new_plural, str)
            assert dunderscore_count(clean_plural) >= 2
            self._plural = clean_plural


class Theorem(Node):
    def __init__(self, dic):
        self.type = "theorem"
        super().__init__(dic)
        self.proofs = move_attribute(dic, {'proofs', 'proof'}, strict=False)

        # theorems CANNOT have plurals:
        # if 'plural' in self:
        #     raise KeyError('Theorems cannot have plurals.')
As you can see, Definitions have a plural.setter, but theorems do not. However, the code
theorem = Theorem(some input)
theorem.plural = "some plural"
runs just fine and raises no errors. But I want it to raise an error. As you can see, I tried to check for plurals manually at the bottom of my code shown, but this would only be a patch. I would like to block the setting of ANY attribute that is not expressly defined. What is the best practice for this sort of thing?
I am looking for an answer that satisfies the "chicken" requirement:
I do not think this solves my issue. In both of your solutions, I can
append the code t.chicken = 'hi'; print(t.chicken), and it prints hi
without error. I do not want users to be able to make up new
attributes like chicken.
The short answer is "Yes, you can."
The follow-up question is "Why?" One of the strengths of Python is the remarkable dynamism, and by restricting that ability you are actually making your class less useful (but see edit at bottom).
However, there are good reasons to be restrictive, and if you do choose to go down that route you will need to modify your __setattr__ method:
def __setattr__(self, name, value):
    if name not in ('my', 'attribute', 'names',):
        raise AttributeError('attribute %s not allowed' % name)
    else:
        super().__setattr__(name, value)
There is no need to mess with __getattr__ nor __getattribute__ since they will not return an attribute that doesn't exist.
Here is your code, slightly modified -- I added the __setattr__ method to Node, and added an _allowed_attributes to Definition and Theorem.
class Node:
    def __setattr__(self, name, value):
        if name not in self._allowed_attributes:
            raise AttributeError('attribute %s does not and cannot exist' % name)
        super().__setattr__(name, value)


class Definition(Node):
    _allowed_attributes = '_plural', 'type'

    def __init__(self, dic):
        self.type = "definition"
        super().__init__(dic)
        self.plural = move_attribute(dic, {'plural', 'pl'}, strict=False)

    @property
    def plural(self):
        return self._plural

    @plural.setter
    def plural(self, new_plural):
        if new_plural is None:
            self._plural = None
        else:
            clean_plural = check_type_and_clean(new_plural, str)
            assert dunderscore_count(clean_plural) >= 2
            self._plural = clean_plural


class Theorem(Node):
    _allowed_attributes = 'type', 'proofs'

    def __init__(self, dic):
        self.type = "theorem"
        super().__init__(dic)
        self.proofs = move_attribute(dic, {'proofs', 'proof'}, strict=False)
In use it looks like this:
>>> theorem = Theorem(...)
>>> theorem.plural = 3
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 6, in __setattr__
AttributeError: attribute plural does not and cannot exist
edit
Having thought about this some more, I think a good compromise for what you want, and to actually answer the part of your question about restricting allowed changes to setters only, would be to:
- use a metaclass to inspect the class at creation time and dynamically build the _allowed_attributes tuple
- modify the __setattr__ of Node to always allow modification/creation of attributes with at least one leading _
This gives you some protection against both misspellings and creation of attributes you don't want, while still allowing programmers to work around or enhance the classes for their own needs.
Okay, the new metaclass looks like:
class NodeMeta(type):
    def __new__(metacls, cls, bases, classdict):
        node_cls = super().__new__(metacls, cls, bases, classdict)
        allowed_attributes = []
        for base in (node_cls, ) + bases:
            for name, obj in base.__dict__.items():
                # only properties that actually define a setter are allowed
                if isinstance(obj, property) and obj.fset is not None:
                    allowed_attributes.append(name)
        node_cls._allowed_attributes = tuple(allowed_attributes)
        return node_cls
The Node class has two adjustments: include the NodeMeta metaclass and adjust __setattr__ to only block non-underscore-leading attributes:
class Node(metaclass=NodeMeta):
    def __init__(self, dic):
        self._dic = dic

    def __setattr__(self, name, value):
        if not name[0] == '_' and name not in self._allowed_attributes:
            raise AttributeError('attribute %s does not and cannot exist' % name)
        super().__setattr__(name, value)
Finally, the Node subclasses Theorem and Definition have the type attribute moved into the class namespace, so there is no issue with setting them. As a side note, type is a bad name as it is also a built-in function; maybe node_type instead?
class Definition(Node):
    type = "definition"
    ...

class Theorem(Node):
    type = "theorem"
    ...
As a final note: even this method is not immune to somebody actually adding or changing attributes, as object.__setattr__(theorem_instance, 'an_attr', 99) can still be used, or (even simpler) the _allowed_attributes can be modified; however, if somebody is going to all that work, they hopefully know what they are doing... and if not, they own all the pieces. ;)
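To make that last point concrete, a quick demonstration reusing the Theorem class from above (the attribute name is arbitrary):

t = Theorem(...)
object.__setattr__(t, 'an_attr', 99)   # goes straight to the base implementation, skipping Node.__setattr__
print(t.an_attr)                       # 99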
You can check for the attribute every time you access it.
class Theorem(Node):
    ...
    def __getattribute__(self, name):
        if name not in ["allowed", "attribute", "names"]:
            raise MyException("attribute " + name + " not allowed")
        else:
            # delegate to the default lookup so this doesn't recurse into __getattribute__
            return super().__getattribute__(name)

    def __setattr__(self, name, value):
        if name not in ["allowed", "attribute", "names"]:
            raise MyException("attribute " + name + " not allowed")
        else:
            super().__setattr__(name, value)
You can build the allowed method list dynamically as a side effect of a decorator:
allowed_attrs = []

def allowed(f):
    allowed_attrs.append(f.__name__)
    return f
You would also need to add non-method attributes manually.
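A small self-contained sketch of how that decorator could be wired up (the class and attribute names are illustrative only, not part of the original answer):

allowed_attrs = []

def allowed(f):
    allowed_attrs.append(f.__name__)
    return f

class Example:
    # decorating the setter records the name 'plural' as a side effect
    @allowed
    def plural(self, value):
        self._plural = value

# non-method attributes still have to be added by hand
allowed_attrs.extend(['type', '_plural'])
print(allowed_attrs)   # ['plural', 'type', '_plural']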
If you really want to prevent all other dynamic attributes, I assume there's a well-defined time window in which you want to allow adding attributes.
Below I allow it until object initialisation is finished (you can control it with the allow_dynamic_attribute variable).
class A:
    def __init__(self):
        self.allow_dynamic_attribute = True
        self.abc = "hello"
        self._plural = None  # need to give default value
        # A.__setattr__ = types.MethodType(__setattr__, A)
        self.allow_dynamic_attribute = False

    def __setattr__(self, name, value):
        if hasattr(self, 'allow_dynamic_attribute'):
            if not self.allow_dynamic_attribute:
                if not hasattr(self, name):
                    raise Exception
        super().__setattr__(name, value)

    @property
    def plural(self):
        return self._plural

    @plural.setter
    def plural(self, new_plural):
        self._plural = new_plural


a = A()
print(a.abc)      # fine
a.plural = "yes"  # fine
print(a.plural)   # fine
a.dkk = "bed"     # raise exception
Or it can be more compact this way; I couldn't figure out how MethodType + super can get along together.
import types

def __setattr__(self, name, value):
    if not hasattr(self, name):
        raise Exception
    else:
        # zero-argument super() doesn't work here because this function is defined
        # outside a class body, so there is no __class__ cell for it to bind to
        super().__setattr__(name, value)

class A:
    def __init__(self):
        self.foo = "hello"
        # after this point, there's no more setattr for you
        A.__setattr__ = types.MethodType(__setattr__, A)

a = A()
print(a.foo)   # fine
a.bar = "bed"  # raise exception
Yes, you can create private members that cannot be modified from outside the class. The variable name should start with two underscores:
class Test(object):
    def __init__(self, t):
        self.__t = t

    def __str__(self):
        return str(self.__t)


t = Test(2)
print(t)   # prints 2
t.__t = 3
print(t)   # prints 2
That said, trying to access such a variable as we do in t.__t = 3 will not raise an exception.
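For context, the reason the assignment is silently ignored is name mangling: inside the class body, self.__t is rewritten to self._Test__t, while t.__t = 3 outside the class just creates an unrelated attribute. The mangled name can still be reached directly, as this small demo (reusing the Test class from above) shows:

t = Test(2)
t.__t = 3        # creates a new attribute literally named '__t'
print(t)         # still prints 2, because __str__ reads self._Test__t
t._Test__t = 3   # the mangled name bypasses the "privacy"
print(t)         # prints 3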
A different approach which you can take to achieve the wanted behavior is using functions. This approach will require "accessing attributes" using functional notation, but if that doesn't bother you, you can get exactly what you want. The following demo "hardcodes" the values, but obviously you can have Theorem() accept an argument and use it to set values to the attributes dynamically.
Demo:
# -*- coding: utf-8 -*-

def Theorem():
    def f(attrib):
        def proofs():
            return ''

        def plural():
            return '◊◊◊◊◊◊◊◊'

        if attrib == 'proofs':
            return proofs()
        elif attrib == 'plural':
            return plural()
        else:
            raise ValueError("Attribute [{}] doesn't exist".format(attrib))

    return f


t = Theorem()
print(t('proofs'))
print(t('plural'))
print(t('wait_for_error'))
OUTPUT

◊◊◊◊◊◊◊◊
Traceback (most recent call last):
  File "/Users/alfasi/Desktop/1.py", line 40, in <module>
    print(t('wait_for_error'))
  File "/Users/alfasi/Desktop/1.py", line 32, in f
    raise ValueError("Attribute [{}] doesn't exist".format(attrib))
ValueError: Attribute [wait_for_error] doesn't exist