Vulkan pipeline derivatives

I'm trying to use derivatives when creating graphics pipelines. I'm creating two graphics pipelines that are exactly the same except for the specialization info.
I create the first pipeline with the VK_PIPELINE_CREATE_ALLOW_DERIVATIVES_BIT flag and the second one with the VK_PIPELINE_CREATE_DERIVATIVE_BIT flag. When creating the second pipeline I set basePipelineHandle to the first pipeline. Is there anything else I need to do?
When using VkPipelineCreationFeedbackEXT, VK_PIPELINE_CREATION_FEEDBACK_BASE_PIPELINE_ACCELERATION_BIT_EXT is not set, but VK_PIPELINE_CREATION_FEEDBACK_VALID_BIT_EXT is, so the feedback is valid.
The validation layers report no errors. I'm currently using an NVIDIA GTX 1060; is it possible that it just doesn't support pipeline derivatives?
Also, I'm not using a pipeline cache at the moment, but would combining a pipeline cache with derivatives work?

Related

What exactly are orchestrators in ML?

In ML pipeline components we specify inputs and outputs explicitly.
For example, in TFX, StatisticsGen takes its input from ExampleGen and outputs some statistics, so the inputs and outputs are clear, and this is the same for all components. So why do we need orchestrators? If anyone knows, please help me.
In real-life projects, everything can be much more complicated:
The input data can come from different sources: databases, file systems, third-party services. So we need to do classical ETL before we can start working with the data.
You can use different technologies in one pipeline. For instance, Spark as a preprocessing tool, and afterwards you may need an instance with a GPU for model training.
Last but not least, in production you need to take care of many more things, for instance data validation, model evaluation, etc. I wrote a separate article about how to organize this part using Apache Airflow.
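To make the orchestration idea concrete, here is a minimal, hypothetical Airflow DAG (Airflow 2.x) that chains such stages. The task names and callables are illustrative placeholders, not taken from the article or from TFX; the point is that the orchestrator, not the components, owns scheduling, retries, and the dependency graph.

# Minimal illustrative Airflow DAG: the orchestrator decides when and where
# each pipeline stage runs, retries failures, and tracks dependencies.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def extract():        # e.g. pull raw data from a database or an API
    pass

def preprocess():     # e.g. submit a Spark job for feature engineering
    pass

def train():          # e.g. launch training on a GPU instance
    pass

def evaluate():       # e.g. validate the model before deployment
    pass

with DAG(
    dag_id="ml_pipeline_example",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    t_extract = PythonOperator(task_id="extract", python_callable=extract)
    t_preprocess = PythonOperator(task_id="preprocess", python_callable=preprocess)
    t_train = PythonOperator(task_id="train", python_callable=train)
    t_evaluate = PythonOperator(task_id="evaluate", python_callable=evaluate)

    # The dependency chain is what the orchestrator actually manages.
    t_extract >> t_preprocess >> t_train >> t_evaluate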

How to create your own federated dataset and do learning over multiple devices with TensorflowFederated?

I am trying to use TFF to implement federated learning. I have spun up 3 EC2 instances and set up TFF in a conda environment. I am trying to figure out how to create a federated dataset from some CSV files and then start training over these EC2 instances, with one of them as the central server and the other 2 as the clients. In the TFF code I could see that tff.CLIENTS has a URI attribute, but I'm not sure how to map it to the IP/endpoint for communication between the clients and the server.
I have searched for ways to do this using the current TFF functions but couldn't find any pointers on achieving it. (As it is, the placement literals tff.CLIENT and tff.SERVER are not exposed via the API currently; this is planned for future releases.)
Inside tensorflow_federated\python\core\impl\placement_literals.py:

class PlacementLiteral(object):
  """A representation of one of the globally recognized placement literals."""

  def __init__(self, name, uri, default_all_equal, description):
    self._name = name
    self._uri = uri  # URI for client/server
    self._description = description
    self._default_all_equal = default_all_equal
TFF currently only fully supports single-machine simulation. There is ongoing work to enable multi-machine simulation environments for faster simulation (though it will semantically be the same result).
I would recommend first starting with running TFF in single-machine simulation.
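As a starting point, here is a rough single-machine simulation sketch. It assumes an older TFF release that still provides tff.learning.build_federated_averaging_process (newer releases moved this under tff.learning.algorithms); the CSV file names, the "label" column, and the toy Keras model are all illustrative placeholders, not from the question.

# Rough single-machine TFF simulation sketch; file names, the "label" column,
# and the toy model are illustrative assumptions.
import pandas as pd
import tensorflow as tf
import tensorflow_federated as tff

CSV_PATHS = ["client_0.csv", "client_1.csv", "client_2.csv"]  # hypothetical files

def load_client_dataset(path):
    # Assumes each CSV has numeric feature columns plus a "label" column.
    df = pd.read_csv(path)
    labels = df.pop("label").values.astype("int32")
    features = df.values.astype("float32")
    return tf.data.Dataset.from_tensor_slices((features, labels)).batch(16)

# In simulation, "federated data" is just a Python list of tf.data.Datasets,
# one per simulated client; no networking between machines is involved.
federated_train_data = [load_client_dataset(p) for p in CSV_PATHS]

def model_fn():
    keras_model = tf.keras.Sequential([tf.keras.layers.Dense(2)])  # toy model
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=federated_train_data[0].element_spec,
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))

trainer = tff.learning.build_federated_averaging_process(
    model_fn, client_optimizer_fn=lambda: tf.keras.optimizers.SGD(0.1))
state = trainer.initialize()
for round_num in range(10):
    state, metrics = trainer.next(state, federated_train_data)
    print(round_num, metrics)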

Where is Wengert List in TensorFlow?

TensorFlow uses reverse-mode automatic differentiation (reverse-mode AD), as shown in https://github.com/tensorflow/tensorflow/issues/675.
Reverse-mode AD needs a data structure called a Wengert list; see https://en.wikipedia.org/wiki/Automatic_differentiation#Reverse_accumulation.
However, searching through the TensorFlow repository with the keyword "Wengert List", I get nothing.
Do they use a different name, or did they get rid of the Wengert list? If so, how?
AD terminology is very old. It was invented when there was no Python and things were complicated. Nowadays you could just use a regular Python list for that purpose.
The implementation of reverse AD is in the gradients function of gradients_impl.py.
The data structure used to store the tape is initialized on line 532, and it's a Python deque used as a queue:
# Initialize queue with to_ops.
queue = collections.deque()
However, searching through the TensorFlow repository with the keyword "Wengert List", I get nothing.
This is because TensorFlow is not a tape-based AD system; it is a graph-based AD system.
The Wengert list would be the tape describing the order in which operations were originally executed.
There is also source code transformation based AD and a nice example of that system is Tangent.
Nowadays almost no one uses a tape (Wengert list) any more. Check, for instance, what PyTorch does (page 2).
TensorFlow 2 uses a Wengert list (a tape), as do JAX and Autograd. This is because these tools keep track of the operations performed on variables with some sort of gradient tape.
TensorFlow 1 did not use a Wengert list to keep track of what computation was being performed; instead it used a static graph. This had certain performance benefits, but limited what TensorFlow 1 was capable of doing.
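For concreteness, here is a tiny sketch of the TensorFlow 2 tape in action; the toy function is mine, not from the answer.

import tensorflow as tf

x = tf.Variable(3.0)
with tf.GradientTape() as tape:   # the tape records ops as they execute,
    y = x * x + 2.0 * x           # playing the role of a Wengert list
dy_dx = tape.gradient(y, x)       # the tape is replayed in reverse for reverse-mode AD
print(dy_dx.numpy())              # 8.0, i.e. 2*x + 2 at x = 3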

Between-graph replication in tensorflow: sessions and variables

I have a question about between-graph replication in distributed TensorFlow, because I didn't understand a few points from the tutorials. As I understand the current model:
We have a parameter server which we just launch in a separate process and on which we call server.join().
We have workers, each of which builds a similar computational graph containing parameter nodes linked to the parameter server (through tf.train.replica_device_setter) and computation nodes placed on the worker itself.
What I didn't find:
How do sessions work in this model? In the examples/tutorials this is hidden behind tf.train.Supervisor.
Do we have separate sessions on each worker, or just one huge session that accumulates the graphs from all the workers and the parameter server?
How are global variables initialized on the parameter server? I assume I can initialize them in one of the worker processes (chosen as a "master"), given that I linked these parameters to the worker through tf.train.replica_device_setter. Is that correct?
In the following gist:
https://gist.github.com/yaroslavvb/ea1b1bae0a75c4aae593df7eca72d9ca
the global variables are initialized only in the parameter server process, yet all the workers consider them initialized. How is that possible, given that they even work in different sessions? I could not replicate it in a simpler example.
I have a main session in the core program where I perform training of the model. Part of the training loop is data collection, which in turn requires computation on the TensorFlow cluster. So I need to create this cluster, put the current state of the trained model on the parameter server, then collect data from the computation and continue with the training loop. How can I: 1) pass the current trained model to the cluster? 2) extract the collected data from the cluster and pass it back to the main program?
Thanks in advance!
EDIT:
To q.3:
It was answered previously (In tensorflow, is variable value the only context information a session stores?) that in the distributed runtime variables are shared between sessions.
Does that mean that when I create a session with some "target", all the variables will be shared between the sessions that run on the same graph?
I guess I can try answering these questions myself; at least it may be helpful for other newbies trying to harness distributed TensorFlow, because as of now there is a lack of concise and clear blog posts on the topic.
I hope more knowledgeable people will correct me if needed.
We have separate sessions on all the servers, and these sessions share their resources (variables, queues, and readers), but only in the distributed setting, i.e. when you pass server.target to the tf.Session constructor (see the sketch after these points).
Ref: https://www.tensorflow.org/api_docs/python/client/session_management#Session
Parameter variables are usually initialized in one "master" process. It can be the process where the parameter server is launched, but it is not strictly necessary to do it in just one process.
Because of point 1, the variables are shared between sessions, so the model is effectively replicated across the cluster :)
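A tiny sketch of the resource sharing from point 1, in TF 1.x style and using a local in-process server for brevity (the variable name and value are placeholders):

# Two sessions connected to the same server share variables, so initializing
# in one session makes the value visible in the other.
import tensorflow as tf

server = tf.train.Server.create_local_server()
v = tf.get_variable("v", initializer=tf.constant(42.0))

sess_a = tf.Session(server.target)
sess_b = tf.Session(server.target)

sess_a.run(v.initializer)   # initialize in session A only
print(sess_b.run(v))        # 42.0 -- session B sees the shared variable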
Thanks to ideas from #YaroslavBulatov, I came to the following approach, which appears to be the simplest possible:
Cluster: one local "calculation server" and N "workers".
"Calculation server" keeps all the parameters of global network and performs training steps. All training ops are assigned to it.
"Workers" collect data in parallel and then put it in Queue; these data are used by "calculation server" when doing training steps.
So, the high-level algorithm is:
1. launch all the units in the cluster
2. build the computation graph and training ops on the calculation server
3. build the computation graph on the workers (variables are linked to the calculation server)
4. collect data with the workers
5. perform a training step on the calculation server and update the global network
6. repeat steps 4-5 until convergence :)
For now I coordinate the calculation server and the workers through queues (signalling when to start collecting data and when to start a training step), which is definitely not the most elegant solution. Any feedback is very welcome.
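A rough TF 1.x sketch of this layout, under the assumption of a two-job cluster; the job names, host addresses, tensor shapes, queue capacity, and the stand-in "training op" are all illustrative, not taken from the answer.

# Rough between-graph sketch: variables and a shared queue on the calculation
# server, workers enqueueing data, the calculation server consuming it.
import tensorflow as tf

job_name, task_index = "worker", 0   # set per process, e.g. via command-line flags

cluster = tf.train.ClusterSpec({
    "calc_server": ["host0:2222"],
    "worker": ["host1:2222", "host2:2222"],
})
server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)

# The global network's variables and the shared queue live on the calculation
# server; every process builds its own copy of this graph (between-graph).
with tf.device("/job:calc_server/task:0"):
    global_step = tf.get_variable("global_step", [], tf.int64,
                                  initializer=tf.zeros_initializer(),
                                  trainable=False)
    data_queue = tf.FIFOQueue(capacity=1000, dtypes=[tf.float32],
                              shapes=[[4]], shared_name="data_queue")

if job_name == "worker":
    # Workers collect data and push it into the shared queue.
    enqueue_op = data_queue.enqueue(tf.random_uniform([4]))
    with tf.Session(server.target) as sess:
        while True:
            sess.run(enqueue_op)
else:
    # The calculation server consumes the queue and runs training steps.
    batch = data_queue.dequeue_many(32)
    train_op = global_step.assign_add(1)   # stand-in for a real training op
    with tf.Session(server.target) as sess:
        sess.run(tf.global_variables_initializer())
        while True:
            sess.run([batch, train_op])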
I also stumbled upon these and very similar, related questions. I tried to clarify all of this in my overview of distributed TensorFlow. Maybe that is useful for some.
More specifically, let me try to answer your questions:
You say you do between-graph replication, i.e. you build a separate computation graph for every worker. This implies that you also have a separate session everywhere, because there would be no way to use that computation graph otherwise. The server (tf.distribute.Server) will not use the local computation graph; it just executes things when remote sessions (clients) connect to it. The session has the graph. If there were only a single session, there would also be only a single graph, and then you would have in-graph replication.
If you share the variables (e.g. they live on a parameter server), it is enough if one of the workers does the initialization (e.g. the parameter server itself). Otherwise it depends on the specific distribution strategy and how you synchronize the variables. E.g. a mirrored variable has separate copies on every replica, and you would need to make sure in some way that they are synchronized.
There is only a single copy of the variable in this case, which lives on the parameter server. Every read and write of this variable is an RPC call to the parameter server.
I'm not exactly sure what you mean by main program. You would have multiple instances of your program, one for each worker. But you will likely mark one of the workers as the chief worker, which has some further responsibilities like saving the checkpoint. Otherwise, all the workers are equal and do the same thing (this is again for between-graph replication). What your gradient accumulation or parameter update looks like depends on your strategy (e.g. whether you do sync training or async training, etc.).

tensorflow: one of 20 parameter servers is very slow

I am trying to train a DNN model using TensorFlow. My script has two variables, one dense feature and one sparse feature; each minibatch pulls the full dense feature and pulls the specified sparse feature using embedding_lookup_sparse, and the feedforward pass can only begin after the sparse feature is ready. I run my job using 20 parameter servers, and increasing the worker count did not scale out. So I profiled my job using the TensorFlow timeline and found that one of the 20 parameter servers is very slow compared to the other 19. There is no dependency between the different parts of the trainable variables. I am not sure if there is a bug or a limitation, for instance that TensorFlow can only queue 40 fan-out requests. Any ideas on how to debug it? Thanks in advance.
[Screenshot: TensorFlow timeline profiling]
It sounds like you might have exactly 2 variables, one stored on PS0 and the other on PS1. The other 18 parameter servers are not doing anything. Please take a look at variable partitioning (https://www.tensorflow.org/versions/master/api_docs/python/state_ops/variable_partitioners_for_sharding), i.e. partition a large variable into small chunks and store them on separate parameter servers.
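A hedged TF 1.x sketch of that idea: shard one large embedding variable across the parameter servers with a partitioner, so that no single PS holds (and serves) the whole thing. The table shape, embedding size, and variable names are placeholders, not from the question.

# Shard a large variable into num_ps pieces; with tf.train.replica_device_setter
# in effect, the shards are placed round-robin across the parameter servers.
import tensorflow as tf

num_ps = 20  # illustrative; matches the 20 parameter servers in the question

with tf.variable_scope(
        "embeddings",
        partitioner=tf.fixed_size_partitioner(num_shards=num_ps)):
    embedding_table = tf.get_variable(
        "sparse_embedding", shape=[10000000, 64], dtype=tf.float32)

# embedding_lookup_sparse accepts the partitioned variable directly and only
# touches the shards that hold the requested rows.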
This is kind of a hacky way to log Send/Recv timings from the Timeline object for each iteration, but it works pretty well for analyzing the dumped JSON data (compared to visualizing it in chrome://trace).
The steps you have to perform are:
download the TensorFlow source and check out the correct branch (r0.12 for example)
modify the only place that calls the SetTimelineLabel method inside executor.cc
instead of only recording non-transferable nodes, also record Send/Recv nodes
be careful to call SetTimelineLabel only once inside NodeDone, as it sets the text string of a node, which will later be parsed by a Python script
build TensorFlow from the modified source
modify the model code (for example, inception_distributed_train.py) to use the Timeline and graph metadata correctly (a basic Timeline usage sketch follows below)
Then you can run the training and retrieve a JSON file for each iteration! :)
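For reference, here is a minimal, self-contained sketch of the standard (unmodified) Timeline usage that the model-code step refers to; the toy op and the output file names are placeholders.

# Dump a Chrome-trace Timeline JSON per step (TF 1.x).
import tensorflow as tf
from tensorflow.python.client import timeline

x = tf.Variable(0.0)
train_op = x.assign_add(1.0)   # stand-in for a real training op

run_options = tf.RunOptions(trace_level=tf.RunOptions.FULL_TRACE)
run_metadata = tf.RunMetadata()

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    for step in range(3):
        sess.run(train_op, options=run_options, run_metadata=run_metadata)
        trace = timeline.Timeline(step_stats=run_metadata.step_stats)
        with open("timeline_step_%d.json" % step, "w") as f:
            f.write(trace.generate_chrome_trace_format())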
Some suggestions that were too big for a comment:
You can't see data transfers in the timeline because tracing of Send/Recv is currently turned off; there is some discussion here: https://github.com/tensorflow/tensorflow/issues/4809
In the latest version (a nightly that is 5 days old or newer) you can turn on verbose logging with export TF_CPP_MIN_VLOG_LEVEL=1, which shows second-level timestamps (see here about higher granularity).
So with vlog enabled, you can perhaps use the messages generated by this line to see the times at which Send ops are issued.