In Flyte, is it possible to dynamically control cache usage?

My use case is as follows:
the initial task of the Flyte workflow extracts a dataset
in production, I want to always use the latest data (or use a cache with a short expiration)
during development, runtime is more important than data freshness, so I prefer to use the cached dataset
Is there an option to control caching when launching a workflow (and not just when defining the workflow)?

Related

Data stored in MLMD in TensorFlow TFX

As far as I understand, TensorFlow uses MLMD to record and retrieve metadata associated with workflows. This may include:
1. results of pipeline components
2. metadata about artifacts generated through the components of the pipelines
3. metadata about executions of these components
4. metadata about the pipeline and associated lineage information
Features:
Does the above (e.g. #1 aka "results of components") imply that MLMD stores actual data? (e.g. input features for ML training?). If not, what does it mean by results of pipeline components?
Orchestration and pipeline history:
Also, when using TFX with e.g. AirFlow, which uses its own metastore (e.g. metadata about DAGs, their runs, and other Airflow configurations like users, roles, and connections) does MLMD store redundant information? Does it supersede it?
TFX is an ML pipeline/workflow framework, so when you write a TFX application you are essentially constructing the structure of the workflow and preparing it to accept a particular set of data and process or use it (transformations, model building, inference, deployment, etc.). In that respect it never stores the actual data; it stores the information (metadata) needed to process or use the data. For example, when it checks for anomalies, it needs to remember the previous data schema and statistics (not the actual data), so it saves that information as metadata in MLMD, alongside the actual run metadata.
In terms of Airflow it will also save the run metadata. This can be seen as a subset of all the metadata, very limited in comparison to the metadata saved in MLMD. There will be a redundancy involved though.
The controller is TFX, which defines and makes use of the underlying Airflow orchestration. It will not supersede it, but it will definitely fail if there is a clash.
Imagine the filesystem of a disk drive. The contents of the files are stored on the disk, but it is the index and the pointers to this data that are called the filesystem. It is that metadata that brings value to users, who can find the relevant data when they need it by searching or navigating through the filesystem.
Similarly, MLMD stores the metadata of an ML pipeline, such as which hyperparameters you used in an execution, which version of the training data, what the distribution of the features was, etc. But it goes beyond being just a registry of runs. This metadata can be used to power two killer features of an ML pipeline tool:
asynchronous execution of its components, for example retraining a model when there is new data without necessarily having a new vocabulary generated
reuse results from previous runs, or step-level output caching. For example, do not run a step if its input parameters haven't changed, but reuse the output of a previous run from the cache to feed the next component.
So yes, the actual data is indeed stored in some storage, maybe a cloud bucket, in the form of parquet files across transformations, or model files and schema protobufs. MLMD stores the URI to this data along with some meta information. For example, a savedmodel is stored in s3://mymodels/1, and it has an entry in the Artifacts table of MLMD, with a relation to the Trainer run and its TrainArgs parameters in the ContextProperty table.
If not, what does it mean by results of pipeline components?
It means the pointers to the data that has been generated by the run of a component, including the input parameters. In our previous example, if neither the input data nor the TrainArgs of a Trainer component have changed in a run, the pipeline shouldn't run that expensive component again, but should reuse the model file from the cache.
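The reuse rule described here can be sketched in a few lines of Python. This is only an illustration of the idea, not MLMD's API: MetadataStore, run_cached, and the num_steps parameter are hypothetical names, and the store is a plain in-memory dictionary. Note that it records only the output URI, never the data itself.

```python
import hashlib
import json


class MetadataStore:
    """Toy stand-in for a metadata store: keeps pointers (URIs), never the data."""

    def __init__(self):
        self._executions = {}  # fingerprint -> output URI

    @staticmethod
    def fingerprint(component, input_uri, params):
        # Stable hash over everything that determines the component's output.
        payload = json.dumps(
            {"component": component, "input": input_uri, "params": params},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()

    def lookup(self, key):
        return self._executions.get(key)

    def record(self, key, output_uri):
        self._executions[key] = output_uri


def run_cached(store, component, input_uri, params, run_fn):
    """Return (output_uri, cache_hit); run_fn is only called on a cache miss."""
    key = store.fingerprint(component, input_uri, params)
    cached = store.lookup(key)
    if cached is not None:
        return cached, True
    output_uri = run_fn(input_uri, params)
    store.record(key, output_uri)
    return output_uri, False
```

Calling run_cached twice with the same component name, input URI, and parameters runs the (hypothetical) trainer function once; changing any of them, for example num_steps, produces a different fingerprint and triggers a new run.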
This requirement of a continuous ML pipeline makes the use of workflow managers such as Tekton or Argo more relevant compared to Airflow, and MLMD a more focused metadata store compared to the latter.

Is it possible to configure .Net Core to use the file system to cache responses?

Specifically, is there any way in .Net Core (3.0 or earlier) to use the local file system as a Response Cache instead of just in-memory?
After a fair amount of researching, the closest thing seems to be the Response Caching middleware [1], but this does not:
allow pages to be cached indefinitely,
preserve caches between application and server restarts,
allow invalidating the cache on a per-page basis (e.g. blog entry updated),
allow invalidating the entire cache when global changes are made (e.g. theme update, menu changes, etc.).
I'm guessing these features will require custom implementation of ResponseCaching that hits the local file system, but I don't want to reinvent it if it already exists.
Some background:
This will replace our use of a static site-generator, which is problematic for site-wide changes because of the sheer quantity of data (nearly 24 hours to generate and copy to all of the servers).
The scenario is very similar to an encyclopedia or news site -- the vast majority of the content changes infrequently, a few things are added per day, and there is no user-specific content (and if or when there is, it would be dynamically loaded via JS/Ajax). Additionally, the page loads happen to be processor/memory/database intensive.
We will be using a reverse proxy like CloudFlare or AWS CloudFront, but AWS automatically expires their edge caches daily. Edge node cache misses are still frequent.
This is different than IDistributedCache [2] in that it should be response caching, not just caching data used by the MVC Model.
We will also use in-memory cache [3], but again, that solves a different caching scenario.
References
[1] https://learn.microsoft.com/en-us/aspnet/core/performance/caching/middleware
[2] https://learn.microsoft.com/en-us/aspnet/core/performance/caching/distributed
[3] https://learn.microsoft.com/en-us/aspnet/core/performance/caching/memory
I implemented this.
https://www.nuget.org/packages/AspNetCore.ResponseCaching.Extensions/
https://github.com/speige/AspNetCore.ResponseCaching.Extensions
Doesn't support .net core 3 yet though.
Currently (April 2019) the answer appears to be: no, there is nothing out-of-the-box for this.
There are three viable approaches to accomplish this using .Net Core:
Fork the built-in ResponseCaching middleware and create a flag for cache-to-disk:
https://github.com/aspnet/AspNetCore/tree/master/src/Middleware/ResponseCaching
This might be annoying to maintain because the namespaces and class names will collide with the core framework.
Implement this missing feature in EasyCaching, which apparently already has caching to disk on their radar:
https://github.com/dotnetcore/EasyCaching/blob/master/ToDoList.md
A pull request may be more likely accepted, since it's a planned feature.
There is apparently a port of Strathweb.CacheOutput to .Net Core, which would allow one to implement IApiOutputCache to save to disk:
https://github.com/Iamcerba/AspNetCore.CacheOutput#server-side-caching
Although this question is about caching within .Net Core using the local file system, this could also be accomplished using a local instance of Sqlite on each server node, and then configuring EasyCaching for response caching and to point it to the Sqlite instance on localhost.
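To make the SQLite idea concrete, here is a minimal, language-neutral sketch written in Python rather than C#. It is not the ASP.NET Core or EasyCaching API; DiskResponseCache and cached_response are hypothetical names. It covers the four requirements from the question: entries never expire on their own, they survive restarts because they live in a file, a single page can be invalidated, and the whole cache can be cleared.

```python
import sqlite3


class DiskResponseCache:
    """Response cache in a SQLite file: survives restarts, no automatic expiry."""

    def __init__(self, db_path):
        self._conn = sqlite3.connect(db_path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS responses ("
            "path TEXT PRIMARY KEY, body BLOB NOT NULL)"
        )
        self._conn.commit()

    def get(self, path):
        row = self._conn.execute(
            "SELECT body FROM responses WHERE path = ?", (path,)
        ).fetchone()
        return row[0] if row else None

    def put(self, path, body):
        self._conn.execute(
            "INSERT OR REPLACE INTO responses (path, body) VALUES (?, ?)",
            (path, body),
        )
        self._conn.commit()

    def invalidate(self, path):
        # Per-page invalidation, e.g. after a blog entry is updated.
        self._conn.execute("DELETE FROM responses WHERE path = ?", (path,))
        self._conn.commit()

    def invalidate_all(self):
        # Global invalidation, e.g. after a theme or menu change.
        self._conn.execute("DELETE FROM responses")
        self._conn.commit()

    def close(self):
        self._conn.close()


def cached_response(cache, path, render):
    """Serve the stored body if present; otherwise render once and store it."""
    body = cache.get(path)
    if body is None:
        body = render(path)
        cache.put(path, body)
    return body
```

In a real middleware the cache key would also need to account for anything that varies the response (query string, headers), which this sketch leaves out.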
I hope this helps someone else who finds themselves in this scenario!

MFP 8.0 adapter cache

I am using MFP 8.0, and we have a requirement to implement caching at the adapter level.
Whenever the MFP server starts, we want to load the entire database into a cache that is kept until the server restarts.
Then, whenever a user triggers a transaction or adapter procedure that would call the database, it should read from the cache instead.
Adapters support read-only and transactional access modes to back-end systems.
Adapters are Maven projects that contain server-side code implemented in either Java or JavaScript. Adapters are used to perform
any necessary server-side logic, and to transfer and retrieve
information from back-end systems to client applications and cloud
services.
JSONStore is an optional client-side API providing a lightweight, document-oriented storage system. JSONStore enables persistent storage
of JSON documents. Documents in an application are available in
JSONStore even when the device that is running the application is
offline. This persistent, always-available storage can be useful to
give users access to documents when, for example, there is no network
connection available in the device.
From your description, assuming you are talking about some custom DB where your data is stored, you need to implement the caching logic yourself.
Adapters have two classes, <AdapterName>Application.java and <AdapterName>Resource.java. <>Application.java contains the lifecycle methods init() and destroy().
You should put your custom code for loading data from your DB into the cache in the init() method. Also take care of removing it in the destroy() method.
Now during transactional access (which hits <>Resource.java), you refer to the cache you have already created.
Your requirement, however, may not be ideal for heavily loaded systems. You need to consider that:
a) Your adapter initialization is delayed. Any wrongly written code can also break the adapter initialization. An adapter isn't available to service your request until it has been initialized. In a clustered environment, the adapter load on all cluster members will be delayed depending on the amount of data you are loading. Any client request intended for this adapter will get a runtime exception until the initialization is complete.
b) Holding the cache in memory means that a corresponding amount of heap space is used up. If your DB keeps growing, this adversely affects both adapter initialization and heap usage.
c) You are in charge of keeping the cached data up to date and of cleaning it up after use.
To summarize, while it is possible, it is not recommended. While this may work in case of very small data set, this cannot scale well. The design of adapters is to provide you transactional access to data/backend systems. You should use the adapter the way it was designed to.

Google App Engine automatically updating memcache

So here's the problem: I've created a database model. When I create the model, a = Model(args), and then perform a.put(), GAE seems to automatically update the memcache, because all the data seems up-to-date even without me hitting the database. Logging the number of elements in the cache also shows the correct number of elements. But I'm not manually updating the cache. How do I prevent this? Cheers.
You can set policy functions:
Automatic caching is convenient for most applications but maybe your application is unusual and you want to turn off automatic caching for some or all entities. You can control the behavior of the caches by setting policy functions.
Memcache Policy
That's for NDB. You don't say what language/DB you are using but I'm sure it's all similar.
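As a rough sketch of what such a policy function could look like for the legacy Python NDB library: a policy receives the entity's Key and returns True to use the cache or False to skip it. The helper name make_memcache_policy and the kind name "Model" are illustrative assumptions; the registration calls are shown only in comments because they need the App Engine runtime.

```python
def make_memcache_policy(uncached_kinds):
    """Build a policy function that returns False (skip memcache) for given kinds."""
    uncached = frozenset(uncached_kinds)

    def policy(key):
        # NDB passes the entity's Key; key.kind() is the model class name.
        return key.kind() not in uncached

    return policy


# Inside an App Engine request, registration would look roughly like this:
#   from google.appengine.ext import ndb
#   ctx = ndb.get_context()
#   ctx.set_memcache_policy(make_memcache_policy(["Model"]))
#   ctx.set_cache_policy(make_memcache_policy(["Model"]))  # in-context cache
```

For a single model class, NDB also lets you set the class attributes _use_memcache = False and _use_cache = False instead of registering a function.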

When is the configuration loaded with nHibernate?

I was reading that the initial load time for the configuration can be fairly long in nHibernate depending on the # of mapping tables, etc.
Is this done once and stored in session or cache?
Will it happen every time the ASP.NET process recycles?
A Configuration object is normally associated with an ISessionFactory. If you have lots of mappings, building a session factory (by calling cfg.BuildSessionFactory) might be slow. That's why you should construct a session factory only once and use it throughout your entire application. In an ASP.NET application, when the process recycles you lose the reference to this session factory and it needs to be rebuilt.
If you find it is extremely slow to construct your session factory you could improve performance by disabling the reflection optimizer : Environment.UseReflectionOptimizer = false (cf doc)
The Configuration is used to build the ISessionFactory. It's a one-shot deal, which occurs at application startup.