TF-Agents ParallelPyEnvironment with an environment that has input parameters - tensorflow2.0

Suppose you have an environment that has input parameters: for example, to create an instance you would use
env_instance = MyEnv(var_1=3, var_2=5, ...)
Now suppose you want to create a ParallelPyEnvironment using the environment "MyEnv". Since the constructor needs input parameters, you cannot simply use
tf_py_environment.TFPyEnvironment(parallel_py_environment.ParallelPyEnvironment([MyEnv]*int(n_envs)))

The solution is to create a subclass that fixes the parameters:
class MyEnvPar(MyEnv):
    def __init__(self):
        super().__init__(var_1=3, var_2=5)
And then you can use
tf_py_environment.TFPyEnvironment(parallel_py_environment.ParallelPyEnvironment([MyEnvPar]*int(n_envs)))
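An alternative that avoids defining an extra class is to bind the parameters with functools.partial. This is a sketch that assumes ParallelPyEnvironment only requires each list entry to be a no-argument callable that constructs an environment; MyEnv and n_envs are the names from the question:
import functools
from tf_agents.environments import parallel_py_environment, tf_py_environment

# Each list entry is a no-argument callable that builds one MyEnv instance
# with the desired constructor arguments already bound.
env_constructor = functools.partial(MyEnv, var_1=3, var_2=5)
tf_env = tf_py_environment.TFPyEnvironment(
    parallel_py_environment.ParallelPyEnvironment([env_constructor] * int(n_envs)))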

Related

Can we pass dataframes between different notebooks in databricks and sequentially run multiple notebooks? [duplicate]

I have a notebook which processes a file and creates a data frame in structured format.
Now I need to import the data frame created there into another notebook, but the problem is that the notebook should only be run for certain scenarios, so I need to validate a condition before running it.
Usually, to import all data structures, we use %run. But in my case it should be a combination of an if clause and then the notebook run:
if "dataset" in path: %run ntbk_path
This gives an error "path not exist".
if "dataset" in path: dbutils.notebook.run(ntbk_path)
With this one I cannot get all the data structures.
Can someone help me to resolve this error?
To implement this correctly you need to understand how things work:
%run is a separate directive that should be put into a separate notebook cell; you can't mix it with Python code. Also, it can't accept the notebook name as a variable. What %run does is evaluate the code from the specified notebook in the context of the current Spark session, so everything that is defined in that notebook - variables, functions, etc. - is available in the caller notebook.
dbutils.notebook.run is a function that takes a notebook path plus parameters and executes it as a separate job on the current cluster. Because it's executed as a separate job, it doesn't share the context with the current notebook, and everything that is defined in it won't be available in the caller notebook (you can return a simple string as the execution result, but it has a relatively small maximum length). One of the problems with dbutils.notebook.run is that scheduling of a job takes several seconds, even if the code is very simple.
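For reference, the string-return mechanism mentioned above looks roughly like this (the notebook path and the returned payload here are hypothetical):
# In the called notebook: hand a small string back to the caller.
dbutils.notebook.exit("some small result, e.g. a JSON string")

# In the caller notebook: the return value of dbutils.notebook.run is that string.
result = dbutils.notebook.run("/path/to/called_notebook", 300)
print(result)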
How can you implement what you need?
If you use dbutils.notebook.run, then in the called notebook you can register a temp view, and the caller notebook can read data from it (the examples are adapted from this demo).
Called notebook (Code1 - it requires two parameters: name for the view name & n for the number of entries to generate):
# Read the parameters passed by the caller
name = dbutils.widgets.get("name")
n = int(dbutils.widgets.get("n"))
# Generate data and register it as a temp view that the caller can query
df = spark.range(0, n)
df.createOrReplaceTempView(name)
Caller notebook (let's call it main):
if "dataset" in "path":
view_name = "some_name"
dbutils.notebook.run(ntbk_path, 300, {'name': view_name, 'n': "1000"})
df = spark.sql(f"select * from {view_name}")
... work with data
It's even possible to do something similar with %run, but it requires a kind of "magic". The foundation of it is the fact that you can pass arguments to the called notebook by using the $arg_name="value" syntax, and you can even refer to values specified in widgets. But in any case, the check of the value will happen in the called notebook.
The called notebook could look as follows:
flag = dbutils.widgets.get("generate_data")
dataframe = None
if flag == "true":
    dataframe = ..... # create dataframe
and the caller notebook could look as follows:
------ cell in python
if "dataset" in "path":
    gen_data = "true"
else:
    gen_data = "false"
dbutils.widgets.text("gen_data", gen_data)
------ cell for %run
%run ./notebook_name $generate_data=$gen_data
------ again in python
dbutils.widgets.remove("gen_data")  # remove the widget
if dataframe:  # dataframe is defined
    # do something with dataframe

How can I access a value in a sequence type?

There are the following attributes in client_output
weights_delta = attr.ib()
client_weight = attr.ib()
model_output = attr.ib()
client_loss = attr.ib()
After that, I made the client_output into a sequence through
a = tff.federated_collect(client_output) and round_model_delta = tff.federated_map(selecting_fn, a) in here, and I declared
@tff.tf_computation()  # append
def selecting_fn(a):
    # TODO
    return round_model_delta
in here. In the process of averaging on the server, I want to average the weights_delta by selecting some of the clients with a small loss value. So I try to access it via a.weights_delta but it doesn't work.
tff.federated_collect returns a tff.SequenceType placed at tff.SERVER, which you can manipulate the same way as, for example, a client dataset is usually handled inside a method decorated with tff.tf_computation.
Note that you have to use the tff.federated_collect operator in the scope of a tff.federated_computation. What you probably want to do[*] is pass it into a tff.tf_computation, using the tff.federated_map operator. Once inside the tff.tf_computation, you can think of it as a tf.data.Dataset object and everything in the tf.data module is available.
[*] I am guessing. More detailed explanation of what you would like to achieve would be helpful.
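As a rough illustration of that wiring (a sketch only: the element types, field names, shapes, and the loss threshold below are assumptions based on the question, not working code from it):
import tensorflow as tf
import tensorflow_federated as tff

# Assumed element type for one client's output; shapes are placeholders.
CLIENT_OUTPUT_TYPE = tff.StructType([
    ('weights_delta', tff.TensorType(tf.float32, [10])),
    ('client_loss', tff.TensorType(tf.float32)),
])

@tff.tf_computation(tff.SequenceType(CLIENT_OUTPUT_TYPE))
def selecting_fn(collected):
    # Inside a tff.tf_computation a sequence argument behaves like a
    # tf.data.Dataset, so filter/reduce from the tf.data API are available.
    low_loss = collected.filter(lambda x: x['client_loss'] < 1.0)
    summed, count = low_loss.reduce(
        (tf.zeros([10]), tf.constant(0.0)),
        lambda acc, x: (acc[0] + x['weights_delta'], acc[1] + 1.0))
    return summed / tf.maximum(count, 1.0)

@tff.federated_computation(tff.FederatedType(CLIENT_OUTPUT_TYPE, tff.CLIENTS))
def average_selected(client_outputs):
    collected = tff.federated_collect(client_outputs)  # sequence placed at SERVER
    return tff.federated_map(selecting_fn, collected)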

Where does SimObject name get set?

I want to know where SimObject names like mem_ctrls, membus, and replacement_policy are set in gem5. After looking at the code, I understood that these names are used in stats.txt.
I have looked into the SimObject code files (.py, .cc, .hh). I printed all SimObject names by stepping through the root's descendants in Simulation.py and then searched for some of the names, like mem_ctrls, using VS Code, but could not find the place where these names are set.
for obj in root.descendants():
    print("object name: %s\n" % obj.get_name())
These names are the Python variable names from the configuration/run script.
For instance, from the Learning gem5 simple.py script...
from m5.objects import *
# create the system we are going to simulate
system = System()
# Set the clock frequency of the system (and all of its children)
system.clk_domain = SrcClockDomain()
system.clk_domain.clock = '1GHz'
system.clk_domain.voltage_domain = VoltageDomain()
# Set up the system
system.mem_mode = 'timing' # Use timing accesses
system.mem_ranges = [AddrRange('512MB')] # Create an address range
The names will be system, clk_domain, mem_ranges.
Note that only the SimObjects will have a name. The other parameters (e.g., integers, etc.) will not have a name.
You can see where this is set here: https://gem5.googlesource.com/public/gem5/+/master/src/python/m5/SimObject.py#1352
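As a simplified illustration of the mechanism (this is a sketch of the idea only, not gem5's actual SimObject implementation): assigning a SimObject to an attribute of a parent is what gives the child its name, and those names are joined into the paths that appear in stats.txt.
class SimObjectSketch:
    # Illustrative stand-in for a SimObject; the real logic lives in
    # src/python/m5/SimObject.py at the link above.
    def __setattr__(self, name, value):
        if isinstance(value, SimObjectSketch):
            # The attribute name used in the config script becomes the
            # child object's name (e.g. clk_domain, mem_ctrls).
            value._name = name
        object.__setattr__(self, name, value)

system = SimObjectSketch()
system.clk_domain = SimObjectSketch()
print(system.clk_domain._name)  # -> clk_domain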

How to get Apache Airflow to render Hive HQL ${variables} with Jinja

It seems like this is supported: you pass in an HQL script with ${xxx} variables, and it gets preprocessed to convert them to Jinja-style {{xxx}} before the stage that actually does the template rendering, which then replaces them with values from a user-supplied dictionary. I believe this because there is a function like this in the HiveOperator class:
def prepare_template(self):
    if self.hiveconf_jinja_translate:
        self.hql = re.sub(
            "(\$\{([ a-zA-Z0-9_]*)\})", "{{ \g<2> }}", self.hql)
    if self.script_begin_tag and self.script_begin_tag in self.hql:
        self.hql = "\n".join(self.hql.split(self.script_begin_tag)[1:])
The problem is I cannot figure out how to trigger this piece of code to get called before the template rendering stage. I have a basic DAG script like this:
from airflow import DAG
from airflow.operators.hive_operator import HiveOperator
from datetime import datetime, timedelta

default_args = dict(
    owner='mpetronic',
    depends_on_past=False,
    start_date=datetime(2017, 5, 2),
    verbose=True,
    retries=1,
    retry_delay=timedelta(minutes=5)
)

dag = DAG(
    dag_id='report',
    schedule_interval='* * * * *',
    user_defined_macros=dict(a=1, b=2),
    default_args=default_args)

hql = open('/home/mpetronic/repos/airflow/resources/hql/report.hql').read()

task = HiveOperator(
    task_id='report_builder',
    hive_cli_conn_id='hive_dv',
    schema='default',
    mapred_job_name='report_builder',
    hiveconf_jinja_translate=True,
    dag=dag,
    hql=hql)
I can see that my user_defined_macros dictionary makes it to the code where it gets merged with a global jinja context dictionary that then is applied to my HQL script to render it as a template. However, because my HQL is native HQL, all my variables that I want to update are in the form of ${xxx} and jinja just skips over them. I need airflow to call prepare_template() first but just don't see how to make that happen.
I realize I could just manually alter my HQL ${xxx} to {{xxx}}, and that works, but it seems like an anti-pattern. I want the script to be able to work natively or via Airflow. This is the function, in the TaskInstance class, that does render my manually altered {{xxx}} values:
def render_templates(self):
    task = self.task
    jinja_context = self.get_template_context()
    if hasattr(self, 'task') and hasattr(self.task, 'dag'):
        if self.task.dag.user_defined_macros:
            jinja_context.update(
                self.task.dag.user_defined_macros)
    rt = self.task.render_template  # shortcut to method
    for attr in task.__class__.template_fields:
        content = getattr(task, attr)
        if content:
            rendered_content = rt(attr, content, jinja_context)
            setattr(task, attr, rendered_content)
I figured out my issue. This is the regex used in the noted method:
(\$\{([ a-zA-Z0-9_]*)\})
It does not account for beeline variables in the form of:
${hivevar:var_name}
It does not consider a colon in the pattern. That form is the more standard way of defining Hive variables within a namespace using beeline. To make this Jinja substitution work, you have to reference variables in the HQL using just ${var_name} but you can only define variables in beeline using:
set hivevar:var_name=123;
I think that Airflow should fully support the hivevar:var_name style of namespaced variables when you are running with beeline, given that beeline is the preferred client to use with Hive.
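For illustration, a widened version of that substitution which also strips an optional hivevar:/hiveconf: namespace prefix could look like the following (an untested sketch, not Airflow's actual code):
import re

hql = "SELECT * FROM t WHERE ds = '${hivevar:run_date}' AND x = ${threshold}"

# Optionally match a hivevar:/hiveconf: prefix and keep only the bare
# variable name as the Jinja placeholder.
translated = re.sub(
    r"\$\{(?:hivevar:|hiveconf:)?\s*([a-zA-Z0-9_]+)\s*\}",
    r"{{ \g<1> }}",
    hql)

print(translated)
# SELECT * FROM t WHERE ds = '{{ run_date }}' AND x = {{ threshold }}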

Spark: How to send arguments to Spark foreach function

I am trying to save the contents of a Spark RDD to Redis with the following code
import redis

class RedisStorageAdapter(BaseStorageAdapter):

    @staticmethod
    def save(record):
        ### --- How do I get action_name --- ###
        redis_key = #<self.source_action_name>
        redis_host = settings['REDIS']['HOST']
        redis_port = settings['REDIS']['PORT']
        redis_db = settings['REDIS']['DB']
        redis_client = redis.StrictRedis(redis_host, redis_port, redis_db)
        redis_client.sadd(redis_key, record)

    def store_output(self, results_rdd):
        print self.source_action_name
        results_rdd.foreach(RedisStorageAdapter.save)
But I want the Redis Key to be different based on what self.source_action_name is initialized to (in BaseStorageAdapter)
How do I pass source_action_name to the RedisStorageAdapter.save function? foreach only accepts the function to be executed, with no parameter list.
Also - if there is a better way to move data from RDD to Redis, let me know
Of course, foreach takes a function, not a function name. So you can pass it a lambda function:
results_rdd.foreach(lambda x: RedisStorageAdapter.save(x, self.source_action_name))
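One caveat worth noting (this is an assumption about the surrounding class, since BaseStorageAdapter isn't shown): referencing self inside the lambda makes Spark pickle the whole adapter object and ship it to the executors, which can fail if the object holds non-serializable state. Binding the attribute to a local variable first avoids that, with save taking the key as an extra parameter, for example:
@staticmethod
def save(record, redis_key):
    # The Redis key is now passed in explicitly instead of being read from self.
    redis_client = redis.StrictRedis(
        settings['REDIS']['HOST'],
        settings['REDIS']['PORT'],
        settings['REDIS']['DB'])
    redis_client.sadd(redis_key, record)

def store_output(self, results_rdd):
    # Copy the attribute into a local variable so the closure captures only
    # a string, not the whole adapter object.
    action_name = self.source_action_name
    results_rdd.foreach(lambda record: RedisStorageAdapter.save(record, action_name))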