How is the goodness value of a process calculated from its static and dynamic priorities? I read on a website (quoting the exact words): "The Linux scheduler is a priority based scheduler that schedules tasks based upon their static and dynamic priorities. When these priorities are combined they form a task's goodness". So how is this combined goodness calculated? Because in the run queue, tasks with static (real-time) priorities and tasks with dynamic priorities will both be present.
This question is different from another question, Difference between 'Niceness' and 'Goodness' in linux kernel 2.4.27: I am asking how the goodness value is calculated for static and dynamic tasks.
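For kernel 2.4, the combination is done in the goodness() function in kernel/sched.c. Here is a simplified Python sketch of its logic (the SMP cache-affinity bonus PROC_CHANGE_PENALTY is 15 on x86; treat this as an illustration of the 2.4 code path, not a drop-in translation):

```python
def goodness(counter, nice, policy="SCHED_OTHER", rt_priority=0,
             same_cpu=False, same_mm=False):
    # Real-time tasks (SCHED_FIFO/SCHED_RR) use only their static
    # rt_priority, offset by 1000 so they always beat normal tasks.
    if policy != "SCHED_OTHER":
        return 1000 + rt_priority
    # Normal tasks: start from the dynamic part, the remaining
    # time slice (p->counter, in ticks).
    weight = counter
    if weight == 0:
        return 0  # quantum exhausted; wait for the recalculation pass
    if same_cpu:
        weight += 15  # PROC_CHANGE_PENALTY: cache-affinity bonus (x86)
    if same_mm:
        weight += 1   # shares the address space of the previous task
    # Static part: nice ranges -20..19, so this contributes 1..40.
    weight += 20 - nice
    return weight

print(goodness(6, 0))                    # 26: counter 6 + (20 - 0)
print(goodness(0, -20))                  # 0: no time slice left
print(goodness(0, 0, "SCHED_FIFO", 50))  # 1050
```

So static and dynamic tasks are ranked on the same scale: the 1000 offset guarantees any runnable real-time task outranks every normal task, and among normal tasks the goodness mixes the nice value (static) with the remaining quantum (dynamic).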
I've got a problem which I think optaplanner may be able to solve, but I haven't seen a demo that quite fits what I'm looking to do. My problem set is scheduling IoT node usage for a testbed. Each test execution (job) requires different sets of constraints on the nodes it will use. For example, a job may ask for M nodes with resource A, and N nodes with resource B. It will also specify a length of time it needs the nodes for and a window in which the job start is acceptable. To successfully schedule a job, it must be able to claim enough resources to meet the job specific requirements (ie, hard limits).
Being new to optaplanner, my understanding is that most of the examples focus on only needing one resource per Job. Any insight into whether this problem could be solved with optaplanner and where to start would be highly appreciated.
If you haven't already, look at the [cheap time scheduling example](https://www.youtube.com/watch?v=r6KsveB6v-g&list=PLJY69IMbAdq0uKPnjtWXZ2x7KE1eWg3ns) and the project job scheduling example.
The differentiating question is: when job J1 needs M nodes with resource A, can any of those M nodes also supply resource B, just not at the same time?
If that's not the case, this is an easy model: you can treat resource A as a capacity, like cloud balancing.
If that is the case, it's a complex model (but still possible): for example, the jobs are chained or time-grained (=> planning variable 1) and each job has tasks which are assigned to nodes (=> planning variable 2). All of this is likely to need custom moves for efficiency.
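To illustrate the "easy" case, the hard constraint reduces to a capacity check per resource over the job's time slot. A minimal plain-Python feasibility check (all names are hypothetical, not OptaPlanner API — in OptaPlanner this logic would live in a constraint/score rule):

```python
# Sketch of the "easy" case: each node supplies one resource type, so
# scheduling is a per-resource capacity check within the job's slot.

def can_schedule(job, nodes, claims, start):
    """job:    {'needs': {'A': M, 'B': N}, 'duration': d}
    nodes:  {'A': count_of_A_nodes, 'B': ...}
    claims: list of (start, end, needs_dict) already accepted."""
    end = start + job["duration"]
    for resource, needed in job["needs"].items():
        # capacity already held by claims overlapping [start, end)
        in_use = sum(c[2].get(resource, 0)
                     for c in claims
                     if c[0] < end and start < c[1])
        if in_use + needed > nodes.get(resource, 0):
            return False  # hard constraint violated
    return True

nodes = {"A": 3, "B": 2}
claims = [(0, 10, {"A": 2})]  # a running job holding 2 A-nodes until t=10
job = {"needs": {"A": 2, "B": 1}, "duration": 5}
print(can_schedule(job, nodes, claims, start=5))   # False: only 1 A-node free
print(can_schedule(job, nodes, claims, start=10))  # True: earlier claim ended
```

The acceptable start window then becomes the planning variable the solver moves around while this check scores the hard constraint.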
A question about between-graph replication in distributed TensorFlow, because a few points weren't clear to me from the tutorials. As I understand it, the current model is:
We have a parameter server, which we just launch in a separate process, and call server.join().
We have workers, each of which builds a similar computational graph that contains parameter nodes linked to the parameter server (through tf.train.replica_device_setter) and calculation nodes placed on the workers themselves.
What I didn't find:
1. How do sessions work in this model? Because in the examples/tutorials it is hidden behind tf.train.Supervisor. Do we have separate sessions on each worker, or just one huge session that accumulates the graphs from all the workers and the parameter server?
2. How are global variables initialized on the parameter server? I wonder whether I can initialize them in one of the worker processes (chosen as a "master"), given that I linked these parameters on the worker through tf.train.replica_device_setter. Is that correct?
In the following gist:
https://gist.github.com/yaroslavvb/ea1b1bae0a75c4aae593df7eca72d9ca
the global variables are initialized only in the parameter server process, and all the workers consider them initialized. How is that possible, given that they even work in different sessions? I could not replicate it in a simpler example.
3. I have a main session in the core program where I perform training of the model. Part of the training loop is the collection of data, which in turn requires calculation on the TensorFlow cluster. So I need to create this cluster, put the current state of the trained model on the parameter server, then collect data from the calculation and continue with the training loop. How can I: 1) pass the current trained model to the cluster? 2) extract the collected data from the cluster and pass it to the main program?
Thanks in advance!
EDIT:
To q.3:
It was answered previously (In tensorflow, is variable value the only context information a session stores?) that in distributed runtime variables are shared between sessions.
Does it mean that when I create sessions with the same "target", all the variables will be shared between those sessions that run on the same graph?
I guess I can try answering these questions myself; at least it may be helpful for other newbies trying to harness distributed TensorFlow, because as of now there is a lack of concise and clear blog posts on that topic.
Hope more knowledgeable people will correct me if needed.
1. We have separate sessions on all the servers, and these sessions share their resources (variables, queues, and readers), but only in the distributed setting (i.e. you pass server.target to the tf.Session constructor).
Ref: https://www.tensorflow.org/api_docs/python/client/session_management#Session
2. Parameter variables are usually initialized in one "master" process. This can be the process where the parameter server is launched, but it is not strictly necessary to do it in just one process.
3. Because of p.1: the variables are shared between the sessions in the distributed setting, so the current state is replicated automatically :)
Thanks to ideas from @YaroslavBulatov, I came to the following approach, which appears to be the simplest possible:
Cluster: one local "calculation server" and N "workers".
"Calculation server" keeps all the parameters of global network and performs training steps. All training ops are assigned to it.
"Workers" collect data in parallel and then put it in Queue; these data are used by "calculation server" when doing training steps.
So, high-level algorithm:
1. launch all the units in the cluster
2. build the comp graph and training ops on the calculation server
3. build the comp graph on the workers (variables are linked to the calculation server)
4. collect data with the workers
5. perform a training step on the calculation server and update the global network
6. repeat 4-5 till convergence :)
As of now I do the coordination between the calculation server and the workers through Queues (when to start collecting data and when to start a training step), which is definitely not the most elegant solution. Any feedback is very welcome.
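The hand-off pattern itself can be shown without TensorFlow. A plain-Python analogue using threads and queue.Queue (the TF version would use tf.FIFOQueue; names here are illustrative):

```python
import queue
import threading

# Workers put collected data into a shared queue; the "calculation
# server" blocks on the queue and consumes each sample as a stand-in
# for running a training step.

data_queue = queue.Queue()
N_WORKERS = 3
STEPS = 2

def worker(worker_id):
    for step in range(STEPS):
        data_queue.put((worker_id, step))  # "collect data in parallel"

def calculation_server(results):
    for _ in range(N_WORKERS * STEPS):
        sample = data_queue.get()  # blocks until a worker delivers
        results.append(sample)     # stand-in for a training step
        data_queue.task_done()

results = []
workers = [threading.Thread(target=worker, args=(i,)) for i in range(N_WORKERS)]
trainer = threading.Thread(target=calculation_server, args=(results,))
for t in workers:
    t.start()
trainer.start()
for t in workers:
    t.join()
trainer.join()

print(len(results))  # 6: every collected sample consumed exactly once
```

The blocking get() is what replaces explicit "when to start" signalling: the trainer simply waits until data exists.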
I also stumbled upon these and very similar related questions. I tried to clarify all of this in my overview of distributed TensorFlow. Maybe that is useful for some.
More specifically, let me try to answer your questions:
You say you do between-graph replication, i.e. you build a separate computation graph for every worker. This implies that you also have a separate session everywhere, because there would be no way to use that computation graph otherwise. The server (tf.distribute.Server) will not use the local computation graph; it just executes things when remote sessions (clients) connect to it. The session has the graph. If there were only a single session, there would also be only a single graph, and then you would have in-graph replication.
If you share the variables (e.g. they live on a parameter server), it is enough if one of the workers does the initialization (e.g. the parameter server itself). Otherwise it depends on the specific distribution strategy and how you synchronize the variables. E.g. a mirrored variable has a separate copy on every replica, and you would need to make sure in some way that they are synchronized.
There is only a single copy of the variable in this case, which lives on the parameter server. Every read and write on this variable is an RPC call to the parameter server.
I'm not exactly sure what you mean by the main program. You would have multiple instances of your program, one for each worker. But you will likely mark one of the workers as the chief worker, which has some further responsibilities like saving the checkpoint. Otherwise, all the workers are equal and do the same thing (this is again for between-graph replication). What your gradient accumulation or parameter update looks like depends on your strategy (e.g. whether you do sync training or async training, etc.).
NOTE: This question has been ported over from Programmers since it appears to be more appropriate here given the limitation of the language I'm using (VBA), the availability of appropriate tags here and the specificity of the problem (on the inference that Programmers addresses more theoretical Computer Science questions).
I'm attempting to build a Discrete Event Simulation library by following this tutorial and fleshing it out. I am limited to using VBA, so "just switch to [insert language here] and it's easy!" is unfortunately not possible. I have specifically chosen to implement this in Access VBA to have a convenient location to store configuration information and metrics.
How should I handle logging metrics in my Discrete Event Simulation engine?
If you don't want/need background, skip to The Design or The Question section below...
Simulation
The goal of a simulation of the type in question is to model a process to perform analysis of it that wouldn't be feasible or cost-effective in reality.
The canonical example of a simulation of this kind is a Bank:
Customers enter the bank and get in line with a statistically distributed frequency
Tellers are available to handle customers from the front of the line one by one taking an amount of time with a modelable distribution
As the line grows longer, the number of tellers available may have to be increased or decreased based on business rules
You can break this down into generic objects:
Entity: These would be the customers
Generator: This object generates Entities according to a distribution
Queue: This object represents the line at the bank. They find much real world use in acting as a buffer between a source of customers and a limited service.
Activity: This is a representation of the work done by a teller. It generally processes Entities from a Queue
Discrete Event Simulation
Instead of a continuous tick-by-tick simulation such as one might do with physical systems, a "Discrete Event" Simulation is a recognition that in many systems only critical events require processing, and the rest of the time nothing important to the state of the system is happening.
In the case of the Bank, critical events might be a customer entering the line, a teller becoming available, the manager deciding whether or not to open a new teller window, etc.
In a Discrete Event Simulation, the flow of time is kept by maintaining a Priority Queue of Events instead of an explicit clock. Time is incremented by popping the next event in chronological order (the minimum event time) off the queue and processing as necessary.
The Design
I've got a Priority Queue implemented as a Min Heap for now.
In order for the objects of the simulation to be processed as events, they implement an ISimulationEvent interface that provides an EventTime property and an Execute method. Those together mean the Priority Queue can schedule the events, then Execute them one at a time in the correct order and increment the simulation clock appropriately.
The simulation engine is a basic event loop that pops the next event and Executes it until there are none left. An event can reschedule itself to occur again or allow itself to go idle. For example, when a Generator is Executed it creates an Entity and then reschedules itself for the generation of the next Entity at some point in the future.
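The engine described above can be sketched compactly; Python is used here for brevity (the VBA version implements the same idea with a min-heap class and an ISimulationEvent interface exposing EventTime and Execute):

```python
import heapq

class Generator:
    """Creates an Entity when executed, then reschedules itself."""
    def __init__(self, interarrival, stop_at):
        self.interarrival = interarrival  # fixed gap; a real model draws from a distribution
        self.stop_at = stop_at
        self.entities = []

    def execute(self, now, schedule):
        self.entities.append(now)        # "create an Entity"
        nxt = now + self.interarrival
        if nxt <= self.stop_at:
            schedule(nxt, self)          # reschedule for the next arrival

def run(initial_events):
    clock = 0.0
    counter = 0  # tie-breaker so heapq never compares event objects
    heap = []

    def schedule(time, event):
        nonlocal counter
        heapq.heappush(heap, (time, counter, event))
        counter += 1

    for time, event in initial_events:
        schedule(time, event)
    while heap:  # pop the chronologically next event and process it
        clock, _, event = heapq.heappop(heap)
        event.execute(clock, schedule)
    return clock

gen = Generator(interarrival=2.0, stop_at=10.0)
end_time = run([(0.0, gen)])
print(len(gen.entities), end_time)  # 6 arrivals at t=0,2,4,6,8,10
```

Note the counter tie-breaker: simultaneous events need a deterministic order, which a bare (time, object) tuple would not give.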
The Question
How should I handle logging metrics in my Discrete Event Simulation engine?
In the midst of this simulation, it is necessary to take metrics. How long are Entities waiting in the Queue? How many Activity resources are being utilized at any one point? How many Entities were generated since the last metrics were logged?
It follows logically that the metric logging should be scheduled as an event to take place every few units of time in the simulation.
The difficulty is that this ends up being a cross-cutting concern: metrics may need to be taken of Generators or Queues or Activities or even Entities. Consider also that it might be necessary to take derivative calculated metrics: e.g. measure a, b, c, and ((a-c)/100) + Log(b).
I'm thinking there are a few main ways to go:
Have a single, global Stats object that is aware of all of the simulation objects. Have the Generator/Queue/Activity/Entity objects store their properties in an associative array so that they can be referred to at runtime (VBA doesn't support much in the way of reflection). This way the statistics can be attached as needed: Stats.AddStats(Object, Properties). This wouldn't support calculated metrics easily unless they are somehow built into each object class as properties.
Have a single, global Stats object that is aware of all of the simulation objects. Create some sort of ISimStats interface for the Generator/Queue/Activity/Entity classes to implement that returns an associative array of the important stats for that particular object. This would also allow runtime attachment, Stats.AddStats(ISimStats). The calculated metrics would have to be hardcoded in the straightforward implementation of this option.
Have multiple Stats objects, one per Generator/Queue/Activity/Entity as a child object. This might make it easier to implement simulation object-specific calculated metrics, but clogs up the Priority Queue a little bit with extra things to schedule. It might also cause tighter coupling, which is bad :(.
Some combination of the above or completely different solution I haven't thought of?
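To make option 2 concrete, here is a rough sketch in Python (in VBA, the interface would be a class module implemented via Implements, and the associative array a Scripting.Dictionary; all names are illustrative):

```python
class BankQueue:
    """A simulation object implementing the ISimStats contract:
    get_stats() returns an associative array of its key metrics."""
    def __init__(self):
        self.length = 0
        self.max_length = 0

    def get_stats(self):
        return {"length": self.length, "max_length": self.max_length}

class Stats:
    """Global collector; scheduled on the priority queue like any event."""
    def __init__(self):
        self.sources = []
        self.log = []

    def add_stats(self, source):   # runtime attachment: Stats.AddStats(ISimStats)
        self.sources.append(source)

    def execute(self, now):        # called by the event loop
        snapshot = {}
        for src in self.sources:
            snapshot.update(src.get_stats())
        self.log.append((now, snapshot))

q = BankQueue()
q.length, q.max_length = 3, 5
stats = Stats()
stats.add_stats(q)
stats.execute(now=12.0)
print(stats.log)  # [(12.0, {'length': 3, 'max_length': 5})]
```

Calculated metrics could then be closures or small objects registered with Stats alongside the raw sources, which keeps them out of the simulation classes.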
Let me know if I can provide more (or less) detail to clarify my question!
Any and every performance metric is a function of the model's state. The only time the state changes in a discrete event simulation is when an event occurs, so events are the only times you have to update your metrics. If you have enough storage, you can log every event, its time, and the state variables which got updated, and retrospectively construct any performance metric you want. If storage is an issue, you can calculate some performance measures within the events that affect those measures. For instance, the appropriate time to calculate delay in queue is when a customer begins service (assuming you tagged each customer object with its arrival time). For delay in system, it's when the customer ends service. If you want average delays, you can update the averages in those events. When somebody arrives, the size of the queue gets incremented; when they begin service, it gets decremented. Etc., etc., etc.
You'll have to be careful calculating statistics such as average queue length, because you have to weight the queue lengths by the amount of time spent in that state: Avg(queue_length) = (1/T) * integral[queue_length(t) dt]. Since queue_length can only change at events, this actually boils down to summing the queue lengths multiplied by the amount of time you were at that length, then dividing by the total elapsed time.
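That time-weighted sum is mechanical to compute from an event log. A sketch, assuming a hypothetical log of (event_time, queue_length_after_event) pairs:

```python
def avg_queue_length(events, total_time):
    """(1/T) * sum(length_i * duration_i) over piecewise-constant segments."""
    total = 0.0
    # each recorded length holds until the next event...
    for (t0, length), (t1, _) in zip(events, events[1:]):
        total += length * (t1 - t0)
    # ...and the last one holds until the end of the run
    t_last, last_length = events[-1]
    total += last_length * (total_time - t_last)
    return total / total_time

# queue is empty for 5 time units, holds 2 customers for 3, then 1 for 2
log = [(0.0, 0), (5.0, 2), (8.0, 1)]
print(avg_queue_length(log, 10.0))  # (0*5 + 2*3 + 1*2) / 10 = 0.8
```

A naive unweighted mean of the recorded lengths (1.0 here) would overstate the answer, which is exactly the trap described above.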
How would you assess the difficulty of writing one's own process manager based on the Hydra (MPICH) sources, say on a scale of 1 to 100? The change would be to the part responsible for assigning processes to computers.
This shouldn't be too hard, but Hydra already implements several rank allocation strategies, so you might not even need to write any code.
You can already provide a user-specified rank allocation. Based on the provided configuration, Hydra can use the hwloc library to obtain hardware topology information and bind processes to cores.
Almost anywhere I read about programming with CUDA there is a mention of the importance that all of the threads in a warp do the same thing.
In my code I have a situation where I can't avoid a certain condition. It looks like this:
// some math code, calculating d1, d2
if (d1 < 0.5)
{
buffer[x1] += 1; // buffer is in the global memory
}
if (d2 < 0.5)
{
buffer[x2] += 1;
}
// some more math code.
Some of the threads might enter one of the conditions, some might enter both, and others might enter neither.
Now, in order to make all the threads get back to "doing the same thing" again after the conditions, should I synchronize them using __syncthreads()? Or does this somehow happen automagically?
Can two threads end up not doing the same thing because one of them is one operation behind, thus ruining it for everyone? Or is there some behind-the-scenes effort to get them doing the same thing again after a branch?
Within a warp, no threads will "get ahead" of any others. If there is a conditional branch and it is taken by some threads in the warp but not others (a.k.a. warp "divergence"), the other threads will just idle until the branch is complete and they all "converge" back together on a common instruction. So if you only need within-warp synchronization of threads, that happens "automagically."
But different warps are not synchronized this way. So if your algorithm requires that certain operations be complete across many warps then you'll need to use explicit synchronization calls (see the CUDA Programming Guide, Section 5.4).
EDIT: reorganized the next few paragraphs to clarify some things.
There are really two different issues here: Instruction synchronization and memory visibility.
__syncthreads() enforces instruction synchronization and ensures memory visibility, but only within a block, not across blocks (CUDA Programming Guide, Appendix B.6). It is useful for write-then-read on shared memory, but is not appropriate for synchronizing global memory access.
__threadfence() ensures global memory visibility but doesn't do any instruction synchronization, so in my experience it is of limited use (but see sample code in Appendix B.5).
Global instruction synchronization is not possible within a kernel. If you need f() done on all threads before calling g() on any thread, split f() and g() into two different kernels and call them serially from the host.
If you just need to increment shared or global counters, consider using the atomic increment function atomicInc() (Appendix B.10). In the case of your code above, if x1 and x2 are not globally unique (across all threads in your grid), non-atomic increments will result in a race-condition, similar to the last paragraph of Appendix B.2.4.
Finally, keep in mind that any operations on global memory, and synchronization functions in particular (including atomics) are bad for performance.
Without knowing the problem you're solving it is hard to speculate, but perhaps you can redesign your algorithm to use shared memory instead of global memory in some places. This will reduce the need for synchronization and give you a performance boost.
From section 6.1 of the CUDA Best Practices Guide:
Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, increasing the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.
So, you don't need to do anything special.
In Gabriel's response:
"Global instruction synchronization is not possible within a kernel. If you need f() done on all threads before calling g() on any thread, split f() and g() into two different kernels and call them serially from the host."
What if the reason you need f() and g() in the same kernel is that you're using register memory, and you want register or shared data from f to get to g?
That is, for my problem, the whole reason for synchronizing across blocks is that data from f is needed in g, and breaking out into a new kernel would require a large amount of additional global memory to transfer the register data from f to g, which I'd like to avoid.
The answer to your question is no. You don't need to do anything special.
Anyway, you can avoid the branches entirely; instead of your code you can do something like this:
buffer[x1] += (d1 < 0.5);
buffer[x2] += (d2 < 0.5);
You should check whether you can use shared memory and access global memory in a coalesced pattern. Also be sure that you DON'T want to write to the same index from more than one thread.