I am using a contextual bandits algorithm in TF-Agents.
Is there a way to train the agent using historical data (context, action, reward) stored in a table, instead of using the replay buffer?
The environment provides the context and reward, so I can make the environment serve these from the table. But the action is chosen by the agent, and I am not sure how to override the action the agent chooses (for a specific context) with the action recorded in the historical table.
I am using a custom environment and a prebuilt agent (LinearThompsonSampling - Bandit agent). I am not sure whether I can use the built-in LinearThompsonSampling agent and at the same time provide actions from the historical data for training. I couldn't find any examples of this in the tf_agents documentation.
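What I have in mind is something like the following (a rough sketch, not verified end-to-end; the context width, number of arms, and helper names are my own assumptions): skip the driver entirely, wrap each batch of logged rows as a single-step Trajectory, and feed it straight to agent.train(), so the logged action replaces whatever the policy would have chosen.

```python
import tensorflow as tf
from tf_agents.bandits.agents import linear_thompson_sampling_agent
from tf_agents.specs import tensor_spec
from tf_agents.trajectories import time_step as ts
from tf_agents.trajectories import trajectory

CONTEXT_DIM = 4   # assumed width of the context columns in the table
NUM_ACTIONS = 3   # assumed number of arms in the historical log

observation_spec = tensor_spec.TensorSpec([CONTEXT_DIM], tf.float32)
time_step_spec = ts.time_step_spec(observation_spec)
action_spec = tensor_spec.BoundedTensorSpec(
    shape=(), dtype=tf.int32, minimum=0, maximum=NUM_ACTIONS - 1)

agent = linear_thompson_sampling_agent.LinearThompsonSamplingAgent(
    time_step_spec=time_step_spec, action_spec=action_spec)
agent.initialize()

def logged_batch_to_trajectory(contexts, actions, rewards):
  """Wraps one batch of logged rows as a single-step Trajectory.

  contexts: [batch, CONTEXT_DIM] float32; actions: [batch] int32;
  rewards: [batch] float32 -- all read from the historical table.
  """
  batch = tf.shape(rewards)[0]
  # Bandit agents train on a time dimension of 1: [batch, 1, ...].
  return trajectory.Trajectory(
      step_type=tf.fill([batch, 1], ts.StepType.FIRST),
      observation=tf.expand_dims(contexts, axis=1),
      action=tf.expand_dims(actions, axis=1),
      policy_info=(),
      next_step_type=tf.fill([batch, 1], ts.StepType.LAST),
      reward=tf.expand_dims(rewards, axis=1),
      discount=tf.ones([batch, 1], tf.float32))

# One update per logged batch; the table's action is used, not the policy's:
# agent.train(logged_batch_to_trajectory(contexts, actions, rewards))
```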
I would like to have a simple CQRS implementation on an API.
In short:
Separate routes for Command and Query.
Separate DB tables (on the same DB at the moment): a normalized one for Command and a de-normalized one for Query.
Asynchronous event-driven update of Query Read Model, using existing external Event Bus.
After the Command is executed, naturally I need to raise an event and pass it to the Event Bus.
The Event Bus would process the event and pass it to its subscriber(s).
In this case the subscriber is Read Model which needs to be updated.
So I need a callback route on the API which gets the event from the Event Bus and updates the Read Model projection (i.e. updating the de-normalized DB table which is used for Queries).
The problem is that the update of the Read Model projection is neither a Command (we do not execute any Domain Logic) nor a Query.
The question is:
How should this async Read Model update work in order to be compliant both with CQRS and DDD?
I normally think of the flow of information as a triangle.
We copy information from the outside world into our "write model", via commands
We copy information from the write model into our "read model"
We copy information from the read model to the outside world, via queries.
Common language for that middle step is "projection".
So the projection (typically) runs asynchronously, querying the "write model" and updating the "read model".
In the architecture you outlined, it would be the projection that is subscribed to the bus. When the bus signals that the write model has changed, we wake up the projection, and let it run so that it can update the read model.
(Note the flow of information - the signal we get from the bus triggers the projection to run, but the projection copies data from the write model, not from the event bus message. This isn't the only way to arrange things, but it is simple, and therefore easy to reason about when things start going pear shaped.)
It is often the case that the projection will store some of its own metadata when it updates the read model, so as to not repeat work.
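To make the triangle concrete, here is a minimal sketch of such a projection in Python. The bus client, the event fields, and the DB helpers (load_order, upsert_order_summary, save_checkpoint) are all hypothetical names for illustration:

```python
import json

class OrderSummaryProjection:
    """Copies data from the write model into the read model."""

    def __init__(self, write_db, read_db):
        self.write_db = write_db
        self.read_db = read_db

    def on_order_changed(self, event):
        # The bus message is only a wake-up signal: we re-query the
        # write model rather than trusting the message payload.
        order_id = json.loads(event.body)["order_id"]
        order = self.write_db.load_order(order_id)   # normalized tables
        self.read_db.upsert_order_summary(           # de-normalized table
            order_id=order_id,
            status=order.status,
            total=sum(line.price for line in order.lines))
        # Projection metadata: remember how far we've projected so a
        # replayed or duplicated event doesn't repeat work.
        self.read_db.save_checkpoint("order_summary", event.sequence)

# Wiring (hypothetical bus API):
# bus.subscribe("orders.changed", projection.on_order_changed)
```

Note that the handler is neither a Command nor a Query in the CQRS sense; it is infrastructure sitting beside them, so it can reasonably get its own endpoint or consumer process rather than going through the Command or Query routes.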
As far as I understand, TensorFlow uses MLMD to record and retrieve metadata associated with workflows. This may include:
results of pipeline components
metadata about artifacts generated through the components of the pipelines
metadata about executions of these components
metadata about the pipeline and associated lineage information
Features:
Does the above (e.g. #1, aka "results of pipeline components") imply that MLMD stores actual data (e.g. input features for ML training)? If not, what is meant by results of pipeline components?
Orchestration and pipeline history:
Also, when using TFX with e.g. Airflow, which uses its own metastore (e.g. metadata about DAGs, their runs, and other Airflow configuration such as users, roles, and connections), does MLMD store redundant information? Does it supersede it?
TFX is an ML pipeline/workflow framework, so when you write a TFX application, what you are essentially doing is constructing the structure of the workflow and preparing it to accept a particular set of data and process or use it (transformations, model build, inference, deployment, etc.). In that respect it never stores the actual data; it stores the information (metadata) needed to process or use the data. For example, when it checks for anomalies, it needs to remember the previous data schema/stats (not the actual data), so it saves that information as metadata in MLMD, besides the actual run metadata.
Airflow will also save its own run metadata. This can be seen as a subset of all the metadata, very limited in comparison to what is saved in MLMD. There will be some redundancy involved, though.
And the controller is TFX, which defines and makes use of the underlying Airflow orchestration. It will not supersede Airflow's metastore, but it will definitely fail if there is a clash.
Imagine the filesystem of a disk drive. The contents of the files are stored on the disk, but it's the index and the pointers to those data that we call the filesystem. It's that metadata that brings value to the user, who can find the relevant data when they need them by searching or navigating through the filesystem.
Similarly with MLMD: it stores the metadata of an ML pipeline, like which hyperparameters you used in an execution, which version of training data, what the distribution of the features was, etc. But it's more than just a registry of runs. This metadata can be used to power two killer features of an ML pipeline tool:
asynchronous execution of its components, for example retraining a model when there is new data, without necessarily having a new vocabulary generated
reuse results from previous runs, or step-level output caching. For example, do not run a step if its input parameters haven't changed, but reuse the output of a previous run from the cache to feed the next component.
So yes, the actual data are indeed stored in storage, maybe a cloud bucket, in the form of Parquet files across transformations, or model files and schema protobufs. MLMD stores the URI to these data along with some meta information. For example, a SavedModel is stored in s3://mymodels/1, and it has an entry in the Artifacts table of MLMD, with a relation to the Trainer run and its TrainArgs parameters in the ContextProperty table.
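As a quick illustration (a sketch assuming a local SQLite-backed store), the ml-metadata client lets you list those artifact entries and see that each row is just a URI plus properties, not the data itself:

```python
from ml_metadata import metadata_store
from ml_metadata.proto import metadata_store_pb2

# Assumed: a local SQLite store; TFX pipelines point MLMD at a real DB.
config = metadata_store_pb2.ConnectionConfig()
config.sqlite.filename_uri = '/tmp/mlmd.db'
config.sqlite.connection_mode = (
    metadata_store_pb2.SqliteMetadataSourceConfig.READWRITE_OPENCREATE)
store = metadata_store.MetadataStore(config)

for artifact in store.get_artifacts():
    # e.g. uri='s3://mymodels/1' plus a type id and custom properties
    print(artifact.id, artifact.uri, artifact.properties)
```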
If not, what does it mean by results of pipeline components?
It means the pointers to the data that have been generated by the run of a component, including the input parameters. In the previous example, if neither the input data nor the TrainArgs of a Trainer component have changed in a run, that expensive component shouldn't run again; instead, the model file should be reused from the cache.
This requirement of a continuous ML pipeline makes workflow managers such as Tekton or Argo more relevant than Airflow, and MLMD a more focused metadata store compared to the latter.
Does GCP have a job scheduling service like Azure Scheduler, where jobs can be scheduled and managed dynamically via API?
Google's cron service is configured in a static file, and it seems like their answer to this is to use that to poke a roll-your-own service backed by Pub/Sub and a data store. I am looking for Quartz-like functionality, consumable by App Engine, which can be managed and invoked via API, as opposed to managing a cluster, queue, and compute instance/VM deployment of Quartz (or the like) or rolling a custom solution. It should support 50 million simultaneous jobs per day with retry/recoverability and dynamic per-tenant scheduling capabilities.
This is the cheapest and easiest way I can imagine building a solution today on top of an existing App Engine-based project:
As you observed, currently there is no such API/service directly available on GCP. There is an open feature request (on GAE) for it.
But, also as you observed, it is possible to build and use a custom solution, just like the one you proposed.
Depending on the context even simpler solutions are possible. For a GAE context check out, for example, How to schedule repeated jobs or tasks from user parameters in Google App Engine?.
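As an example of the "roll your own" direction (a sketch of the classic App Engine pattern, not a managed scheduler API; the worker route and parameters are made up), dynamic per-tenant scheduling can be approximated by enqueuing push tasks with an eta instead of a static cron entry:

```python
import datetime
from google.appengine.api import taskqueue

def schedule_job(tenant_id, payload, run_at):
    """run_at: datetime when the job should fire. Task Queue retries
    failed tasks automatically, which covers the recoverability part."""
    taskqueue.add(
        url='/tasks/run-job',                         # assumed worker route
        params={'tenant': tenant_id, 'data': payload},
        eta=run_at)

# schedule_job('tenant-42', 'nightly-report',
#              datetime.datetime.utcnow() + datetime.timedelta(hours=1))
```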
We would like to automate scale-up of streaming units for certain Stream Analytics jobs if 'SU utilization' is high. Is it possible to achieve this using PowerShell? Thanks.
Firstly, as Pete M said, we could call the REST API to create or update a transformation within a job.
Besides, the Azure Stream Analytics cmdlet New-AzureRmStreamAnalyticsTransformation could be used to update a transformation within a job.
It depends on what you mean by "automate". You can update a transformation via the API from a scheduled job, including the streaming unit allocation. I'm not sure if you can do this via the PS object model, but you can always make a REST call:
https://learn.microsoft.com/en-us/rest/api/streamanalytics/stream-analytics-transformation
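For illustration, a sketch of that call in Python (the subscription/resource names, the token, the api-version, and the default transformation name "Transformation" are all placeholders to adapt):

```python
import requests

SUB, RG, JOB = "<subscription-id>", "<resource-group>", "<job-name>"
url = (f"https://management.azure.com/subscriptions/{SUB}"
       f"/resourceGroups/{RG}/providers/Microsoft.StreamAnalytics"
       f"/streamingjobs/{JOB}/transformations/Transformation"
       "?api-version=2020-03-01")

resp = requests.patch(
    url,
    headers={"Authorization": "Bearer <token>",    # Azure AD token
             "Content-Type": "application/json"},
    json={"properties": {"streamingUnits": 12}})   # desired SU count
resp.raise_for_status()
```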
If you mean you want to use PowerShell to create and configure a job that automatically scales on its own, unfortunately that isn't possible today, regardless of how you create the job. ASA doesn't support elastic scaling. You have to do it "manually", either by hand or via some manner of scheduled WebJob or similar.
It is three years later now, but I think you can use App Insights to automatically create an alert rule based on percent utilization. Is it an absolute MUST that you use PowerShell? If so, there is an Azure Automation script on GitHub:
https://github.com/Azure/azure-stream-analytics/blob/master/Autoscale/StepScaleUp.ps1
I am writing software that creates a large graph database. The software needs to access dozens of different REST APIs with millions of total requests. The data will then be processed by a Hadoop cluster. Each of these APIs has rate limits that vary by requests/second, per window, per day, and per user (typically via OAuth).
Does anyone have any suggestions on how I might use either a Map function or another Hadoop-ecosystem tool to manage these queries? The goal would be to leverage the parallel processing in Hadoop.
Because of the varied rate limits, it often makes sense to switch to a different API query while waiting for the first limit to reset. An example would be one API call that creates nodes in the graph and another that enriches the data for that node. I could have the system go out and enrich the data for the new nodes while waiting for the first API limit to reset.
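To sketch what I mean (a toy example; the APIs, limits, and queue contents are made up), I picture a token bucket per API, with a dispatcher that always picks whichever API currently has both pending work and budget:

```python
import time

class TokenBucket:
    """Tracks one API's budget: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(self, rate, capacity):
        self.rate, self.capacity = rate, capacity
        self.tokens, self.last = capacity, time.monotonic()

    def try_take(self):
        now = time.monotonic()
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

buckets = {"create_node": TokenBucket(5, 10),    # 5 req/s, burst of 10
           "enrich_node": TokenBucket(20, 40)}   # 20 req/s, burst of 40
queues = {"create_node": [], "enrich_node": []}  # pending requests per API

def dispatch_once():
    # Prefer any API with pending work and available budget, so
    # enrichment proceeds while node creation waits out its limit.
    for api, queue in queues.items():
        if queue and buckets[api].try_take():
            return api, queue.pop(0)
    return None  # everything is rate-limited right now
```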
I have tried using SQS queuing on EC2 to manage the various API limits and states (creating a queue for each API call), but have found it to be ridiculously slow.
Any ideas?
It looks like the best option for my scenario will be using Storm, or specifically the Trident abstraction. It gives me the greatest flexibility not only for workload management but for process management as well.