Change Airflow BigQueryInsertJobOperator and BigQueryGetDataOperator Priority to Batch - google-bigquery

I've read the Apache Airflow documentation for the BigQuery job operators here (https://airflow.apache.org/docs/apache-airflow-providers-google/stable/_modules/airflow/providers/google/cloud/operators/bigquery.html#BigQueryGetDataOperator) and I can't find how to change the job priority to batch. How can it be done?

BigQueryExecuteQueryOperator has a priority parameter that can be set to INTERACTIVE or BATCH; the default is INTERACTIVE:
execute_insert_query = BigQueryExecuteQueryOperator(
    task_id="execute_insert_query",
    sql=INSERT_ROWS_QUERY,
    use_legacy_sql=False,
    location=location,
    priority='BATCH',
)
The BigQueryInsertJobOperator doesn't have it.
I think you can create a custom operator that inherits from BigQueryInsertJobOperator and adds it by overriding the _submit_job method:
from airflow.providers.google.cloud.hooks.bigquery import BigQueryHook, BigQueryJob
from airflow.providers.google.cloud.operators.bigquery import BigQueryInsertJobOperator


class MyBigQueryInsertJobOperator(BigQueryInsertJobOperator):
    def __init__(
        self,
        priority: str = 'INTERACTIVE',
        **kwargs,
    ) -> None:
        super().__init__(**kwargs)
        self.priority = priority

    def _submit_job(
        self,
        hook: BigQueryHook,
        job_id: str,
    ) -> BigQueryJob:
        # Set the requested priority on the query configuration before submitting.
        self.configuration.setdefault("query", {})["priority"] = self.priority
        # Submit a new job
        job = hook.insert_job(
            configuration=self.configuration,
            project_id=self.project_id,
            location=self.location,
            job_id=job_id,
        )
        # Start the job and wait for it to complete and get the result.
        job.result()
        return job
I haven't tested it, but it should work.
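For completeness, a minimal usage sketch of the custom operator inside a DAG, reusing the placeholder INSERT_ROWS_QUERY and location names from the example above:
insert_batch_job = MyBigQueryInsertJobOperator(
    task_id="insert_batch_job",
    priority="BATCH",
    configuration={
        "query": {
            "query": INSERT_ROWS_QUERY,  # placeholder, as in the example above
            "useLegacySql": False,
        }
    },
    location=location,
)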

Related

pytest-qt waitSignal for long running computation running in a thread pool

I have successfully implemented a Python Qt app based off this very nice tutorial.
I am now writing tests using pytest-qt to specifically test a button that triggers a long running computation that eventually emits a signal when finished. I would like to use waitSignal as documented here.
def test_long_computation(qtbot):
    app = Application()
    # Watch for the app.worker.finished signal, then start the worker.
    with qtbot.waitSignal(app.worker.finished, timeout=10000) as blocker:
        blocker.connect(app.worker.failed)  # Can add other signals to blocker
        app.worker.start()
        # Test will block at this point until either the "finished" or the
        # "failed" signal is emitted. If 10 seconds passed without a signal,
        # TimeoutError will be raised.
    assert_application_results(app)
When the button is clicked, this function is executed:
def on_button_click_function(self):
    """
    Start a thread pool worker to run the function.
    """
    # Pass the function to execute
    worker = Worker(self.execute_check_urls)
    worker.signals.result.connect(self.print_output)  # type: ignore
    worker.signals.finished.connect(self.thread_complete)  # type: ignore
    worker.signals.progress.connect(self.progress_fn)  # type: ignore
    # Execute
    log_info("Starting thread pool worker ...")
    self.threadpool.start(worker)
And when the thread completes, a signal is emitted:
def thread_complete(self):
    main_window = self.find_main_window()
    if main_window:
        main_window.button_click_signal.emit(
            self.results
        )
        log_info(
            f"Emitted signal for button click function: {self.results}"
        )
Below is the init function of the main class:
class MyClass:
    def __init__(
        self,
        *args,
        **kwargs
    ):
        super(MyClass, self).__init__(*args, **kwargs)
        self.threadpool = QThreadPool()
        print(
            "Multithreading with max %d threads"
            % self.threadpool.maxThreadCount()
        )
And the worker class, which is a QRunnable:
class Worker(QRunnable):
    def __init__(
        self,
        fn,
        *args,
        **kwargs
    ):
        super(Worker, self).__init__()
        self.fn = fn
        self.args = args
        self.kwargs = kwargs
        self.signals = WorkerSignals()
        self.kwargs['progress_callback'] = self.signals.progress

    @Slot()
    def run(self):
        try:
            result = self.fn(*self.args, **self.kwargs)
        except Exception:
            traceback.print_exc()
            exctype, value = sys.exc_info()[:2]
            self.signals.error.emit(
                (exctype, value, traceback.format_exc())
            )
        else:
            self.signals.result.emit(result)
        finally:
            self.signals.finished.emit()
I would appreciate some guidance on how to access the worker object from the threadpool. I also tested qtbot.mouseClick() which triggers the function but never emits the signal.
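Since QThreadPool does not expose the runnables it is executing, one common workaround (a sketch only, not tested against this app; prepare_worker and app.threadpool are illustrative names, not from the original code) is to keep a reference to the Worker on the object that creates it and wait on the worker's own signals:
# Hypothetical refactor: split worker creation from starting it, and keep a
# reference on self so tests can reach worker.signals.
def prepare_worker(self):
    self.worker = Worker(self.execute_check_urls)
    self.worker.signals.result.connect(self.print_output)       # type: ignore
    self.worker.signals.finished.connect(self.thread_complete)  # type: ignore
    return self.worker

def on_button_click_function(self):
    worker = self.prepare_worker()
    log_info("Starting thread pool worker ...")
    self.threadpool.start(worker)
The test can then connect waitSignal before the pool starts the worker, so the emission cannot be missed:
def test_long_computation(qtbot):
    app = Application()
    worker = app.prepare_worker()
    with qtbot.waitSignal(worker.signals.finished, timeout=10000):
        app.threadpool.start(worker)
    assert_application_results(app)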

Extend BigQueryExecuteQueryOperator with additional labels using jinja2

In order to track GCP costs using labels, I would like to extend BigQueryExecuteQueryOperator with some additional labels so that each task instance gets these labels automatically set in its constructor.
class ExtendedBigQueryExecuteQueryOperator(BigQueryExecuteQueryOperator):
    @apply_defaults
    def __init__(self,
                 *args,
                 **kwargs) -> None:
        task_labels = {
            'dag_id': '{{ dag.dag_id }}',
            'task_id': kwargs.get('task_id'),
            'ds': '{{ ds }}',
            # ugly, all three params got in diff. ways
        }
        super().__init__(*args, **kwargs)
        if self.labels is None:
            self.labels = task_labels
        else:
            self.labels.update(task_labels)
with DAG(dag_id=...,
         start_date=...,
         schedule_interval=...,
         default_args=...) as dag:
    t1 = ExtendedBigQueryExecuteQueryOperator(
        task_id=f't1',
        sql=f'SELECT 1;',
        labels={'some_additional_label2': 'some_additional_label2'}
        # all labels should be: dag_id, task_id, ds, some_additional_label2
    )
    t2 = ExtendedBigQueryExecuteQueryOperator(
        task_id=f't2',
        sql=f'SELECT 2;',
        labels={'some_additional_label3': 'some_additional_label3'}
        # all labels should be: dag_id, task_id, ds, some_additional_label3
    )
    t1 >> t2
but then I lose the task-level labels some_additional_label2 or some_additional_label3.
You could create the following policy in airflow_local_settings.py:
def policy(task):
    if task.__class__.__name__ == "BigQueryExecuteQueryOperator":
        if task.labels is None:
            task.labels = {}
        task.labels.update({'dag_id': task.dag_id, 'task_id': task.task_id})
From docs:
Your local Airflow settings file can define a policy function that has the ability to mutate task attributes based on other task or DAG attributes. It receives a single argument as a reference to task objects, and is expected to alter its attributes.
More details on applying Policy: https://airflow.readthedocs.io/en/1.10.9/concepts.html#cluster-policy
You won't need to extend BigQueryExecuteQueryOperator in that case. The only missing part is execution_date which you can set in the task itself.
Example:
with DAG(dag_id=...,
         start_date=...,
         schedule_interval=...,
         default_args=...) as dag:
    t1 = BigQueryExecuteQueryOperator(
        task_id=f't1',
        sql=f'SELECT 1;',
        labels={'some_additional_label2': 'some_additional_label2', 'ds': '{{ ds }}'}
    )
The airflow_local_settings file needs to be on your PYTHONPATH. You can put it under $AIRFLOW_HOME/config or inside your dags directory.

How can I use Pool instead of Process & Pipe

I have a first tree class that does some calculations, and another forest class that contains trees and does more calculations.
The tree class uses the Pipe() connection mechanism to fit a tree: the resulting nodes and leaves need to be recv() by the tree.
The forest class, after fitting a certain number of trees, also uses the Pipe() connection mechanism to compute other things depending on the trees, which are also recv() by the forest.
When the number of trees in the forest rises, I run into OSError: Too many open files, and I can't increase the ulimit since I have no root access on the machine where the code is running.
I am guessing that using Pool() will not have these problems. But I do not understand how I can transmit information between child and main processes using Pool. For the moment, I use an idiom that looks like this:
from multiprocessing import Pipe, Process


class tree:
    def __init__(self):
        self.random_number = 5  # Chosen by fair dice roll

    def start_fit(self):
        self.parent_conn, child_conn = Pipe(duplex=False)
        self.process = tree_fitter(self.random_number, child_conn)
        self.process.start()
        return None

    def join_fit(self):
        self.fitting_result = self.parent_conn.recv()
        self.process.join()
        self.parent_conn.close()
        del self.parent_conn
        del self.process
        return None


class tree_fitter(Process):
    def __init__(self, random_number, connection):
        super().__init__()
        self.connection = connection
        self.random_number = random_number
        self.new_random_number = 2  # Chosen by fair dice roll

    def run(self):
        result = self.new_random_number * self.random_number
        self.connection.send(result)
        self.connection.close()
        return 0


class forest:
    def __init__(self, n_trees):
        self.n = n_trees  # Number of trees
        self.trees = [tree() for i in range(0, self.n)]

    def compute_more_complicate_things(self):
        processes = []
        parents = []
        results = []
        for i in range(0, self.n):
            self.trees[i].start_fit()
        for i in range(0, self.n):
            self.trees[i].join_fit()
            parent_connection, child_connection = Pipe(duplex=False)
            parents.append(parent_connection)
            processes.append(forest_cumulative_mean(self, i, child_connection))
        for i in range(0, self.n):
            processes[i].start()
        for i in range(0, self.n):
            results.append(parents[i].recv())
            processes[i].join()
            parents[i].close()
            print(i, end=" ")
        self.cumusum_of_result = results
        del processes
        del parents


class forest_cumulative_mean(Process):
    def __init__(self, forest, i, connection):
        super().__init__()
        self.trees = forest.trees[:i]
        self.connection = connection

    def run(self):
        result = sum([t.fitting_result for t in self.trees])
        self.connection.send(result)
        self.connection.close()
        return 0
1° How can I transform this to the Pool idiom without changing the structure of my code too much? (This is of course a dummy example (I did not try to run it); I have a lot more code around it.)
2° Will this tackle the error I get about spawning too many processes?
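A possible direction, sketched below (just an illustration of the standard Pool idiom, not tested against the real code; the function names fit_tree and cumulative_sum are illustrative): instead of one Process and one Pipe per tree, submit plain functions to a multiprocessing.Pool and let it return the results, so only a fixed number of worker processes and pipes exist at any time.
from multiprocessing import Pool


def fit_tree(random_number):
    # Equivalent of tree_fitter.run: return the value instead of send()-ing it.
    new_random_number = 2
    return new_random_number * random_number


def cumulative_sum(fitting_results, i):
    # Equivalent of forest_cumulative_mean.run for the first i trees.
    return sum(fitting_results[:i])


if __name__ == "__main__":
    n_trees = 100
    random_numbers = [5] * n_trees

    # The pool size bounds the number of simultaneous processes and open pipes.
    with Pool(processes=8) as pool:
        fitting_results = pool.map(fit_tree, random_numbers)
        cumulative = pool.starmap(
            cumulative_sum,
            [(fitting_results, i) for i in range(n_trees)],
        )
    print(cumulative[:5])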

How to properly pass arguments to scrapy spider on scrapinghub?

I am trying to pass parameters to my spider (ideally a DataFrame or CSV) with:
self.client = ScrapinghubClient(apikey)
self.project = self.client.get_project()
job = spider.jobs.run()
I tried using the *args and **kwargs argument type but each time I only get the last result. For example:
data = ["1", "2", "3"]
job = spider.jobs.run(data=data)
When I try to print them from inside my spider I only get the element 3:
def __init__(self, **kwargs):
    for key in kwargs:
        print kwargs[key]
2018-05-17 08:39:28 INFO [stdout] 3
I think that there is some easy explanation that I just can't seem to understand.
Thanks in advance!
For passing arguments and tags, you can do it like this:
priority = randint(0, 4)
job = spider.jobs.run(
    units=1,
    job_settings=setting,
    add_tag=['auto', 'test', 'somethingelse'],
    job_args={'arg1': arg1, 'arg2': arg2, 'arg3': arg3},
    priority=priority
)
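On the spider side, each job_args entry should then arrive as its own Scrapy spider argument, i.e. a keyword argument to the spider's __init__ with string values. A sketch with illustrative argument names (matching the arg1/arg2/arg3 keys above):
import scrapy


class MySpider(scrapy.Spider):
    name = "my_spider"

    def __init__(self, arg1=None, arg2=None, arg3=None, *args, **kwargs):
        super().__init__(*args, **kwargs)
        # Each job_args entry is available by name; values arrive as strings.
        self.logger.info("arg1=%s arg2=%s arg3=%s", arg1, arg2, arg3)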

getting runtime statistics with monitoredtrainingsession in tensorflow

I am trying to profile my TensorFlow code (runtime and memory consumption of each layer in the network) by following the runtime statistics instructions here. As far as I understand, I need to create run options and run metadata like this:
run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()
and pass them to sess.run.
However, as I am also trying to use tf.train.MonitoredTrainingSession, I don't know if I can pass the same thing into this class. A plausible approach could make use of hooks, but I do not know how to do it; I am still very new to them.
You can simply create a custom hook and pass it to the MonitoredTrainingSession. There is no need to pass your own tf.RunMetadata() instance to the run call.
Here is an example Hook which stores metadata every N steps to ckptdir:
import tensorflow as tf


class TraceHook(tf.train.SessionRunHook):
    """Hook to perform Traces every N steps."""

    def __init__(self, ckptdir, every_step=50, trace_level=tf.RunOptions.FULL_TRACE):
        self._trace = every_step == 1
        self.writer = tf.summary.FileWriter(ckptdir)
        self.trace_level = trace_level
        self.every_step = every_step

    def begin(self):
        self._global_step_tensor = tf.train.get_global_step()
        if self._global_step_tensor is None:
            raise RuntimeError("Global step should be created to use _TraceHook.")

    def before_run(self, run_context):
        if self._trace:
            options = tf.RunOptions(trace_level=self.trace_level)
        else:
            options = None
        return tf.train.SessionRunArgs(fetches=self._global_step_tensor,
                                       options=options)

    def after_run(self, run_context, run_values):
        global_step = run_values.results - 1
        if self._trace:
            self._trace = False
            self.writer.add_run_metadata(run_values.run_metadata,
                                         f'{global_step}', global_step)
        if not (global_step + 1) % self.every_step:
            self._trace = True
It checks in before_run whether it has to trace or not and, if so, adds the RunOptions. In after_run it checks whether the next run call needs to be traced and, if so, sets _trace to True again. Additionally, it stores the metadata when it is available.
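A minimal usage sketch, assuming the hook above is in scope (the ckptdir path and train_op are placeholders):
trace_hook = TraceHook(ckptdir='/tmp/ckptdir', every_step=50)

with tf.train.MonitoredTrainingSession(checkpoint_dir='/tmp/ckptdir',
                                       hooks=[trace_hook]) as sess:
    while not sess.should_stop():
        sess.run(train_op)  # train_op is a placeholder for your training step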