What is the difference between the `policy` and `collect_policy` of a tf-agent? - tensorflow

I am looking at tf-agents to learn about reinforcement learning, and I am following this tutorial. A different policy, called collect_policy, is used for training than for evaluation (policy).
The tutorial states there is a difference, but IMO it does not explain why there are two policies, since it does not describe a functional difference:
Agents contain two policies:
agent.policy — The main policy that is used for evaluation and deployment.
agent.collect_policy — A second policy that is used for data collection.
I've looked at the source code of the agent. It says
policy: An instance of tf_policy.Base representing the Agent's current policy.
collect_policy: An instance of tf_policy.Base representing the Agent's current data collection policy (used to set self.step_spec).
But I do not see self.step_spec anywhere in the source file. The closest thing I find is time_step_spec, but that is the first constructor argument of the TFAgent class, so it makes no sense to set it via a collect_policy.
So the only thing I could think of was to put it to the test: I used policy instead of collect_policy for training, and the agent reached the max score in the environment nonetheless.
So what is the functional difference between the two policies?

There are some reinforcement learning algorithms, such as Q-learning, that use a policy to behave in (or interact with) the environment to collect experience, which is different than the policy they are trying to learn (sometimes known as the target policy). These algorithms are known as off-policy algorithms. An algorithm that is not off-policy is known as on-policy (i.e. the behaviour policy is the same as the target policy). An example of an on-policy algorithm is SARSA. That's why we have both policy and collect_policy in TF-Agents, i.e., in general, the behavioural policy can be different than the target policy (though this may not always be the case).
Why should this be the case? Because during learning and interaction with the environment, you need to explore it (i.e. take random actions), while, once you have learned the near-optimal policy, you may not need to explore anymore and can just take the near-optimal action (I say near-optimal rather than optimal because you may not have learned the optimal one).
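For instance, with the DQN agent from the tutorial you linked, the two attributes are different policy objects: policy acts greedily with respect to the learned Q-network, while collect_policy adds epsilon-greedy exploration on top of it. A minimal sketch, assuming the standard CartPole setup from that tutorial (not a definitive implementation):

```python
import tensorflow as tf
from tf_agents.agents.dqn import dqn_agent
from tf_agents.environments import suite_gym, tf_py_environment
from tf_agents.networks import q_network

env = tf_py_environment.TFPyEnvironment(suite_gym.load('CartPole-v0'))

q_net = q_network.QNetwork(env.observation_spec(), env.action_spec(),
                           fc_layer_params=(100,))

agent = dqn_agent.DqnAgent(
    env.time_step_spec(),
    env.action_spec(),
    q_network=q_net,
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
    epsilon_greedy=0.1)  # exploration rate used only by collect_policy
agent.initialize()

# The greedy policy used for evaluation/deployment vs. the exploring
# policy used to collect training experience.
print(type(agent.policy).__name__)
print(type(agent.collect_policy).__name__)
```

Training with agent.policy can still work on an easy environment like CartPole (which matches what you observed), but without the exploration provided by collect_policy the agent can easily get stuck in harder environments.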

Related

Client participation in the federated computation rounds

I am building a federated learning model using Tensorflow Federated.
Based on what I have read in the tutorials and papers, I understand that the state-of-the-art method (FedAvg) works by selecting a random subset of clients at each round.
My concern is:
I have a small number of clients: 8 in total, of which I select 6 for training and keep 2 for testing.
All of the data is available on my local device, so I am using TFF as the simulation environment.
If I use all 6 clients in every federated communication round, would this be an incorrect application of the FedAvg method?
Note that I am also planning to run the same experiment used in this paper, which applies different server optimization methods and compares their performance. So, would the all-clients-participating procedure work here or not?
Thanks in advance
This is certainly a valid application of FedAvg and the variants proposed in the linked paper, though one that is only studied empirically in a subset of the literature. On the other hand, many theoretical analyses of FedAvg assume a similar situation to the one you're describing; at the bottom of page 4 of that linked paper, you will see that the analysis is performed in this so-called 'full participation' regime, where every client participates on every round.
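Concretely, in a TFF simulation full participation just means passing the same list of client datasets into every round instead of sampling a subset. A rough sketch (assuming you already have a model_fn that builds a tff.learning.Model and train_datasets, a list of your six client tf.data.Datasets; the optimizer and round count are illustrative):

```python
import tensorflow as tf
import tensorflow_federated as tff

iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02))

state = iterative_process.initialize()
for round_num in range(1, 101):
    # Full participation: all 6 training clients are used in every round,
    # no random client sampling.
    state, metrics = iterative_process.next(state, train_datasets)
    print('round {}, metrics={}'.format(round_num, metrics))
```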
Often the setting you describe is called 'cross silo'; see, e.g., section 7.5 of Advances and Open Problems in Federated Learning, which will also contain many useful pointers for the cross-silo literature.
Finally, depending on the application, consider that it may be more natural to literally train on all clients, reserving portions of each client's data for validation and test. Questions around natural partitions of data to model the 'setting we care about' are often thorny in the federated setting.

How to write a custom policy in tf_agents

I want to use the contextual bandit agents (the Linear Thompson Sampling agent) in tf_agents.
I am using a custom environment, and my rewards are delayed by 3 days. Hence, for training, the observations are taken from saved historical tables (predictions generated 3 days ago) together with their corresponding rewards (also in the table).
Given this, how do I make the policy output the action from the historical tables for a given observation, but only during training? During evaluation I want the policy to behave the usual way, generating actions using what it has learned.
It looks like I need to write a custom policy that behaves this way during training and behaves as its usual self (linearthompsonsampling.policy) during evaluation. Unfortunately I couldn't find any examples or documentation for this use case. Can someone please explain how to code this? An example would be very useful.
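For reference, here is a rough, untested skeleton of what I imagine such a wrapper could look like (all names are hypothetical, and depending on the tf_agents version the base class may be tf_policy.Base rather than tf_policy.TFPolicy):

```python
from tf_agents.policies import tf_policy
from tf_agents.trajectories import policy_step


class HistoricalOrLearnedPolicy(tf_policy.TFPolicy):
    """During training, return the action recorded in the historical tables;
    during evaluation, delegate to the learned Thompson Sampling policy."""

    def __init__(self, learned_policy, historical_lookup_fn, training=True):
        # learned_policy: e.g. the agent's Thompson Sampling policy.
        # historical_lookup_fn: hypothetical function mapping an observation
        # to the action that was actually taken 3 days ago (from the table).
        super().__init__(learned_policy.time_step_spec,
                         learned_policy.action_spec)
        self._learned_policy = learned_policy
        self._lookup = historical_lookup_fn
        self._training = training

    def _action(self, time_step, policy_state, seed):
        if self._training:
            action = self._lookup(time_step.observation)
            return policy_step.PolicyStep(action, policy_state, ())
        # Evaluation: behave as the usual learned policy.
        return self._learned_policy.action(time_step, policy_state)
```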

What is the difference between Symbolic and Concrete model checking when the search is bounded in time?

Could someone please spend a few words explaining, to someone who does not come from a formal-methods background, what the difference is between verifying a specification using Symbolic Model Checking and doing the same using Concrete Model Checking, when the search is bounded in time? I am referring to the definitions of SMC and Concrete MC used in UPPAAL.
In particular, I wrote a program that uses the UPPAAL Java API to verify a query against a network of timed automata. If the query is verified, UPPAAL returns a symbolic trace to parse, or something else if it is not. If the verification takes too long, I halt the verification process, return a message and move on to the next query to verify. Everything is good so far.
Recently, I have been playing around with UPPAAL Stratego, which natively offers the possibility of choosing a maximum time or depth of exploration to bound the search. However, these options can be applied only when the verification is carried out using Concrete Model Checking.
My question is: is there any difference between halting the symbolic verification process, as I am doing in my Java program, and what UPPAAL Stratego does natively? In both cases I don't get an answer (or a trace), but what about the "reliability" of the exploration?
Which would be better (i.e. more complete) between the two options? Halting the symbolic verification or halting the concrete verification?
My understanding so far is that in Symbolic Model Checking the possible states are defined using intervals of variables, whilst in Concrete Model Checking variables assume an actual value. My view is that, in terms of "completeness", halting the SMC after some time is more "complete", since the exploration of the state space happens systematically using a BFS or DFS algorithm and, if I use BFS, I can be "sure" that within N steps nothing bad happens. But again, my background in model checking is not rich, so I might have gotten it completely wrong.
Also, if you could drop any references about these strategies, it would be really appreciated.
Thanks!

Automatically create test cases for web page?

If someone has a web page, the usual way of testing the site for user-interaction bugs is to create each test case by hand and run it with Selenium.
Is there a tool to create these test cases automatically, so that if the web page gets altered, new test cases are created automatically?
You can look at a paid product. That type of technology is not being developed as open source and will probably cost a bit. Some of the major test tools get close to this, but I have not heard of anything fully automatic.
If this existed, the role of QA Engineer, and especially Automation Engineer, would not be as important, and those jobs would drop off pretty quickly. I would imagine that if such a tool were out there, it would be breaking news to the entire industry worldwide.
If you go down the artificial intelligence path, this is possible in theory and concept; however, AI development efforts usually cost more than the app that needs the testing, so... that's not going to happen.
The best thing to do at this point is to separate out as much of the maintenance as possible into a single section, so that you limit the maintenance headache when modifying and keep a core that stays the same. I usually keep control manipulation generic, while the workflow, the specific maps and the data change. That will allow it to function against any website... but you still have to write/update the tests and maintain the maps.
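To illustrate that split, here is a hedged sketch (the URL, locators and names are purely illustrative): the generic control layer rarely changes, the page map is the only part you update when the page is altered, and the workflow/test case uses both.

```python
from selenium import webdriver
from selenium.webdriver.common.by import By

# --- generic control manipulation layer (stays the same) ---
class Controls:
    def __init__(self, driver):
        self.driver = driver

    def click(self, locator):
        self.driver.find_element(*locator).click()

    def type(self, locator, text):
        self.driver.find_element(*locator).send_keys(text)

# --- page map (the part you maintain when the page changes) ---
LOGIN_MAP = {
    'user':   (By.ID, 'username'),
    'pass':   (By.ID, 'password'),
    'submit': (By.CSS_SELECTOR, 'button[type=submit]'),
}

# --- workflow / test case (still written and updated by hand) ---
def test_login():
    driver = webdriver.Chrome()
    try:
        driver.get('https://example.com/login')   # illustrative URL
        ui = Controls(driver)
        ui.type(LOGIN_MAP['user'], 'alice')
        ui.type(LOGIN_MAP['pass'], 'secret')
        ui.click(LOGIN_MAP['submit'])
        assert 'Dashboard' in driver.title
    finally:
        driver.quit()
```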
I think Growing Test Cases Automatically is more of what you're asking about. To be more specific, I'll try to introduce the basics, and if you're interested, take a closer look at Evolutionary Testing.
Usually there is a standard set of constraints we meet, like changing functionality of the system under test (SUT), a limited timeframe, lack of appropriate test tools, and the list goes on… Yet there is another type of challenge which arises as technological solutions progress further: an increase in system complexity.
While the typical constraints are solvable through different technical and management approaches, in the case of system complexity we face the limit of our ability to define a straightforward analytical method for assessing and validating system behavior. A complex system consists of multiple, often heterogeneous components which, when working together, amplify each other's statistical and behavioral deviations, resulting in a system which acts in ways that were not part of its initial design. To make matters worse, the same mechanism makes complex systems more sensitive to their environment as well.
Options for testing complex systems
How can we test a system which behaves differently each time we run a test scenario? How can we reproduce a problem which costs days and millions to recover from, but happens only from time to time under conditions which are known just approximately?
One possible solution, which I want to focus on, is to embrace our lack of knowledge and work with whatever we have by using evolutionary testing. In this context, evolutionary testing can be viewed as a variant of black-box testing, because we feed input into and evaluate output from the SUT without focusing on its internal structure. The key difference here is that we organize this process of automatic test case generation and execution on a massive scale, as an iterative optimization process which mimics natural evolution.
Evolutionary testing
Elements:
• Population – the set of test case executions participating in the optimization process
• Generation – the part of the Population involved in a given iteration
• Individual – a single test case execution and its results, an element of the Population
• Genome – the unified definition of all test cases, a model describing the Population
• Genotype – a single test case instance, a model describing an Individual, an instance of the Genome
• Recombination – transformation of one or more Genotypes into a new Genotype
• Mutation – random change in a Genotype
• Fitness Function – a formalized criterion expressing the suitability of an Individual against the goal of the optimization
How do we create these elements?
• Definition of the experiment goal (selection criteria) – sets the direction of the optimization process and is related to the behavior of the SUT. It involves certain characteristics of the SUT state or environment during the performed test case experiments. Examples:
o “SUT should complete the test case execution with an error code”
o “The test case should drive the SUT through the largest number of branches in SUT’s logical structure”
o “Ambient temperature in the room where the SUT is situated should not exceed 40 °C during test case execution”
o “CPU utilization on the system where the SUT runs should exceed 80% during test case execution”
Any measurable parameters of the SUT and its environment could be used in a goal statement. Knowledge of the relation between the test input and the goal itself is not obligatory. This makes it possible to cover goals which are derived directly from requirements, rather than based on some later requirement derivative like a business, architectural or technical model.
• Definition of the relevant inputs and outputs of the tested system – identification of SUT inputs and outputs, as well as environment parameters, relevant to the experiment goal.
• Formal definition of the experiment genome – encoding the summarized set of test cases into a parameterized model (usually a data structure), expressing relevant SUT input data, environment parameters and action sequences. This definition also needs to comply with the two major operations applied over genome instances – recombination and mutation. The mechanism for these two operations can be predefined for the type of data or action present in the genome, or have custom definitions.
• Formal definition of the selection criteria (fitness function) – an evaluation mechanism which takes the SUT output or environment parameters resulting from a test case execution (an Individual) and calculates a number (Fitness), signifying how close this particular Individual is to the experiment goal.
How does the process work?
1. We use the Genome to create a Generation of random Genotypes (test case instances).
2. We execute the test cases (Genotypes), generating results (Individuals).
3. We evaluate each execution result (Individual) against our goal using the Fitness Function.
4. We select only those Individuals from the given Generation which have a Fitness above a given threshold (the top 10%, above the average, etc.).
5. We use the selected Individuals to produce a new, full Generation by applying Recombination and Mutation.
6. We repeat the process, returning to step 2.
The iteration process is usually stopped by setting a condition on the evaluated Fitness of a Generation. For example:
• If the top Fitness hasn’t changed by more than 0.1% since the last Iteration
• If the difference between the top and the bottom Fitness in a Generation is less than 0.3%
then probably it is time to stop.
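A minimal, generic sketch of that loop in Python (all names, numbers and the placeholder execution/fitness definitions are illustrative; a real setup would replace them with SUT-specific ones):

```python
import random

POP_SIZE, GENOME_LEN = 50, 10
STOP_DELTA = 0.001  # stop when the top Fitness improves by less than this

def random_genotype():
    # A Genotype here is just a list of input parameters for one test case.
    return [random.uniform(0, 1) for _ in range(GENOME_LEN)]

def execute(genotype):
    # Placeholder for running the test case against the SUT and collecting
    # its output; here we fake a numeric result.
    return sum(genotype)

def fitness(result):
    # Placeholder Fitness Function: how close the execution result is to
    # the (made-up) experiment goal of 7.0.
    return -abs(result - 7.0)

def recombine(a, b):
    cut = random.randrange(1, GENOME_LEN)
    return a[:cut] + b[cut:]

def mutate(genotype, rate=0.1):
    return [random.uniform(0, 1) if random.random() < rate else gene
            for gene in genotype]

generation = [random_genotype() for _ in range(POP_SIZE)]        # step 1
best_fitness = float('-inf')
while True:
    scored = [(fitness(execute(g)), g) for g in generation]     # steps 2-3
    scored.sort(key=lambda pair: pair[0], reverse=True)
    top = [g for _, g in scored[:POP_SIZE // 10]]                # step 4
    if scored[0][0] - best_fitness < STOP_DELTA:                 # stopping rule
        break
    best_fitness = scored[0][0]
    generation = [mutate(recombine(random.choice(top), random.choice(top)))
                  for _ in range(POP_SIZE)]                      # steps 5-6
```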
Upsides and downsides
Upsides:
• We can work with limited knowledge of the SUT and goal-oriented test definitions
• We use a test case model (Genome) which allows us to mass-produce a large number of test cases (Genotypes) with little effort
• We can “seed” test cases (Genotypes) in the first iteration instead of generating them at random in order to speed up the optimization process.
• We could run test cases in parallel in order to speed up the process
• We could find multiple solutions which meet our test goal
• If the optimization process is convergent, we have a guarantee that each following Generation is a better approximate solution to our test goal. This means that even if we need to stop before we have reached optimal Fitness, we will still have better test cases than the ones we started with.
• We can achieve replay of very complex, hard to reproduce test scenarios which mimic real life and which are far beyond the reach of any other automated or manual testing technique.
Downsides:
• The process of defining the necessary elements for evolutionary test implementation is non-trivial and requires specific knowledge.
• Implementing such automation approach is time- and resource-consuming and should be employed only when it is justifiable.
• The convergence of the optimization process depends on the smoothness of the Fitness Function. If its definition results in zones of discontinuity or areas of small or no gradient, then we can expect slow or no convergence.
Update:
I also recommend that you look at Genetic algorithms, and this article about Test data generation can give you approaches and guidelines.
I happen to develop ecFeed - an open-source tool that may assist in test design. It's in the pre-release phase and we are going to add better integration with Selenium, but you may have a look at the current snapshot: https://github.com/testify-no/ecFeed/wiki . The next version should arrive in October and will have major improvements in usability. Anyway, I am looking forward to constructive criticism.
In the Microsoft development world there is Visual Studio's Coded UI Test framework. It will record your actions in a web browser and generate test cases to replicate that use case. It won't update the test cases with any changes to your code, though; you would need to update them manually or regenerate them.

Own process manager based on Hydra (MPICH)

How would you assess the level of difficulty of writing my own process manager based on the sources of Hydra (MPICH), i.e., on a scale of 1 to 100? The change would be to the part responsible for the assignment of processes to computers.
This shouldn't be too hard, but Hydra already implements several rank allocation strategies, so you might not even need to write any code.
You can already provide a user-specified rank allocation. Based on the provided configuration, Hydra can use the hwloc library to obtain hardware topology information and bind processes to cores.