Technique to measure GPU utilization over a given period of time - gpu

We run an HPC cluster with GPUs, and we would like to report the overall GPU utilization for each job. I know I can do this by periodically sampling in the background and doing the math. Is there a tool that lets me start sampling at the beginning of the job, stop it at the end, and have it report the overall average GPU utilization? For instance, AFAICT nvidia-smi will only do 1-second intervals. I am looking (hoping) for an option on it, or a similar tool, with start/stop functionality. Note that a fixed, arbitrary time period won't work unless I can end it early and get the results up to that point, since you never know how long a job will run. I would appreciate any pointers or ideas anyone could provide.
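For reference, the background-sampling approach mentioned above could look roughly like the sketch below. It is only a minimal illustration, not an existing tool: the class name UtilizationSampler and its start()/stop() methods are hypothetical, and it assumes nvidia-smi is on the PATH and that a wrapper process stays alive for the whole job.

```python
import subprocess
import threading


def read_gpu_utilization():
    """Return the current utilization (percent) of each GPU via nvidia-smi."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=utilization.gpu",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [float(line) for line in out.splitlines() if line.strip()]


class UtilizationSampler:
    """Hypothetical helper: sample between start() and stop(), then average."""

    def __init__(self, read=read_gpu_utilization, interval=1.0):
        self._read = read            # callable returning one value per GPU
        self._interval = interval    # seconds between samples
        self._stop = threading.Event()
        self._thread = None
        self._sums = []
        self._count = 0

    def _run(self):
        while True:
            values = self._read()
            if not self._sums:
                self._sums = [0.0] * len(values)
            for i, value in enumerate(values):
                self._sums[i] += value
            self._count += 1
            # wait() returns True as soon as stop() is called, so the
            # sampler can be ended early at any point.
            if self._stop.wait(self._interval):
                break

    def start(self):
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def stop(self):
        """Stop sampling and return the average utilization per GPU."""
        self._stop.set()
        self._thread.join()
        if self._count == 0:
            return []
        return [total / self._count for total in self._sums]
```

A wrapper script could call start() just before launching the job and stop() once it exits. Only running sums are kept, so memory use does not grow with job length.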

Related

AnyLogic freezes after 36 replications

I'm running a stochastic experiment and would therefore like to do N=500 (or some reasonably large N) replications of the simulation before collecting averaged results.
I've set up a Monte Carlo experiment to do this, and because I was told AnyLogic doesn't naturally average outputs over replications, I cumulatively add the output of each replication and, once all replications are finished, divide by the number of replications I ran. I don't store the output of each replication, just the cumulative value.
My problem is that the experiment seems to freeze after 36 replications, and I'm not sure why this might happen. Note that each replication takes around 5 seconds to run (and they are not taking progressively longer each time).
Has anyone else experienced something like this/can anyone suggest a way to diagnose the problem?
Yes, I've had this many times. Two possibilities:
Too little memory: increase the memory available to the experiment.
A fault in your model, which has nothing to do with AnyLogic :) . You need to do some investigating yourself; it is probably some infinite loop that is triggered in that particular replication.

Recommended way of measuring execution time in Tensorflow Federated

I would like to know whether there is a recommended way of measuring execution time in TensorFlow Federated. To be more specific, suppose one would like to extract the execution time for each client in a certain round, e.g., for each client involved in a FedAvg round, by saving the time stamp before local training starts and the time stamp just before the updates are sent back. What is the best (or just correct) strategy for doing this? Furthermore, since the clients' code runs in parallel, are such time stamps unreliable (especially considering that different clients may be using differently sized models for local training)?
To be very practical: is it appropriate to call tf.timestamp() at the beginning and at the end of the @tf.function client_update(model, dataset, server_message, client_optimizer) (this is probably a simplified signature) and then subtract the two time stamps?
I have the feeling that this is not the right way to do this given that clients run in parallel on the same machine.
Thanks to anyone who can help me with this.
There are multiple potential places to measure execution time; the first step might be defining very specifically what the intended measurement is.
Measuring the training time of each client as proposed is a great way to get a sense of the variability among clients. This could help identify whether rounds frequently have stragglers. Using tf.timestamp() at the beginning and end of the client_update function seems reasonable. The question correctly notes that this happens in parallel; summing all of these times would be akin to CPU time.
Measuring the time it takes to complete all client training in a round would generally be the maximum of the values above. This might not be true when simulating FL in TFF, as TFF may decide to run some number of clients sequentially due to system resource constraints. In a real deployment, all of these clients would run in parallel.
Measuring the time it takes to complete a full round (the maximum time it takes to run a client, plus the time it takes for the server to update) could be done by moving the tf.timestamp calls to the outer training loop. This would be wrapping the call to trainer.next() in the snippet on https://www.tensorflow.org/federated. This would be most similar to elapsed real time (wall clock time).
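The difference between these three measurements can be illustrated without TFF at all. The sketch below is plain Python, not TFF code: timed_client_update and run_round are hypothetical stand-ins, time.sleep plays the role of local training, and a thread pool plays the role of the simulation runtime.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed_client_update(work_seconds):
    """Stand-in for client_update: 'trains', then returns its own elapsed time."""
    start = time.perf_counter()
    time.sleep(work_seconds)  # placeholder for local training
    return time.perf_counter() - start


def run_round(client_workloads, max_parallel):
    """Run one simulated round and return the three timing views."""
    round_start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        per_client = list(pool.map(timed_client_update, client_workloads))
    wall_clock = time.perf_counter() - round_start
    return {
        "per_client": per_client,            # variability, stragglers
        "sum_of_clients": sum(per_client),   # akin to CPU time
        "slowest_client": max(per_client),   # round time if fully parallel
        "wall_clock": wall_clock,            # measured around the whole round
    }
```

With max_parallel equal to the number of clients, wall_clock stays close to slowest_client; with max_parallel=1 it approaches sum_of_clients, which mirrors the caveat about clients being run sequentially in simulation.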

Record CPU usage per minute for a particular process

I want to record CPU usage, CPU time, and VM size in Notepad once per minute for one particular process (not for all processes). Is there any way to do this? I work as a performance/stress tester, and it is my job to record CPU performance at particular times. The script takes a long time to run, so it is sometimes inconvenient for me to take all the readings.
Please suggest.
Thank you.
Use Performance Monitor, if Windows is the system that you are working on. It has all kinds of logging options and will do what you need.
Performance Monitor gives you the option of recording performance data for a particular process. Look under 'Process'...

Compare Round Robin and Multilevel Feedback Queue in terms of waiting time, response time, turnaround time

I want to make a comparison between RR and MLFQ in terms of waiting time, response time, turnaround time in 3 cases:
a) More CPU-bound jobs than I/O-bound jobs
b) More I/O-bound jobs than CPU-bound jobs
c) When only a few jobs need to be scheduled
Could you help me clarify this, or give me some sources for reference? Thanks a lot.
There's some maths for this called "queueing theory", which can give you some equations to use.
Another way is to develop a simulation (software model) of the queue, and measure things (e.g. the distribution of response times) as you change various parameters (e.g. utilization).
The important thing to decide is the distribution of inter-arrival times of the input events (jobs to be processed): if they arrive regularly then there may typically be no queueing delay at all (assuming the system utilization is less than 100%), but if they arrive randomly (e.g. as a Poisson process, with exponentially distributed inter-arrival times) then there's (on average) a non-zero queue.
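A minimal sketch of such a simulation, for a single server processing jobs in arrival order with a fixed service time, is shown below. It is not a model of RR or MLFQ themselves; it only illustrates the effect of the arrival distribution. The function names and the parameter values in the usage note are illustrative assumptions.

```python
import random


def mean_wait(inter_arrivals, service_time):
    """Mean queueing delay for a single-server FIFO queue (Lindley recursion)."""
    wait = 0.0   # queueing delay of the current job; the first job never waits
    total = 0.0
    for gap in inter_arrivals:
        total += wait
        # The next job waits for whatever work is still pending when it arrives.
        wait = max(0.0, wait + service_time - gap)
    return total / len(inter_arrivals)


def regular_arrivals(n, mean_gap):
    """Jobs arrive at perfectly regular intervals."""
    return [mean_gap] * n


def random_arrivals(n, mean_gap, seed=0):
    """Jobs arrive as a Poisson process (exponential inter-arrival times)."""
    rng = random.Random(seed)
    return [rng.expovariate(1.0 / mean_gap) for _ in range(n)]
```

For example, with a mean gap of 1.0 and a service time of 0.8 (80% utilization), mean_wait(regular_arrivals(10000, 1.0), 0.8) is exactly zero, while mean_wait(random_arrivals(10000, 1.0), 0.8) is positive. Swapping in a different scheduling policy means replacing the recursion with an explicit event loop.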

Algorithmically suggest best node to perform demanding computation

At work we perform demanding numerical computations.
We have a network of several Linux boxes with different processing capabilities. At any given time, there can be anywhere from zero to dozens of people connected to a given box.
I created a script to measure the MFLOPS (Millions of Floating-Point Operations Per Second) using the Linpack Benchmark; it also provides the number of cores and the amount of memory.
I would like to use this information together with the load average (obtained using the uptime command) to suggest the best computer for performing a demanding computation. In other words, it's 3:00 pm; I have a meeting in two hours; I need to run a demanding process: which node will get me the answer fastest?
I envision a script which will output a suggestion along the lines of:
SUGGESTED HOSTS (IN ORDER OF PREFERENCE)
HOST1.MYNETWORK
HOST2.MYNETWORK
HOST3.MYNETWORK
Such a suggestion should favor fast computers (high MFLOPS) if the load average is low and, as the load average increases for a given node, it should favor available nodes instead (i.e., I'd rather run on a slower computer with no users than on an eight-core with forty dudes logged in).
How should I prioritize? What algorithm (rationale) would you use? Again, what I have is:
Load Average (1min, 5min, 15min)
MFLOPS measure
Number of users logged in
RAM (installed and available)
Number of cores (important to normalize the load average)
Any thoughts? Thanks!
You don't have enough data to make a well-informed decision. It sounds as though the scheduling is very volatile: "At any given time, there can be anywhere from zero to dozens of people connected to a given box." So the current load does not necessarily reflect the future load of the machines.
To properly assess which hosts someone should use to minimize computation time would require knowing when the current jobs will terminate. If a powerful machine is about to finish most of its jobs, it would be a good candidate even though it currently has a high load.
If you want to guess purely based on the current situation, you can do a weighted calculation to find out which hosts have the most MFLOPS available.
MFLOPS available = host's MFLOPS + (number of logical processors - load average)
Sort the hosts by MFLOPS available and suggest them in a descending order.
This formula assumes that the MFLOPS of a host is linearly related to its load average. This might not be exactly true, but it's probably fairly close.
I would favor the most recent load average since it's closer to the current/future situation, whereas jobs from 15 minutes ago might have completed by now.
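A rough sketch of this ranking is given below. Note one assumption: the formula above is written as a sum, which mixes MFLOPS with a processor count, so the sketch instead scales the host's MFLOPS by the fraction of idle logical processors. That proportional form is my reading of "linearly related", not necessarily the intended formula. The dictionary keys and host names are hypothetical.

```python
def available_mflops(mflops, cores, load_average):
    """Estimate spare capacity as MFLOPS times the fraction of idle cores.

    Assumption: proportional scaling is used here; the formula in the text
    is written as a sum.
    """
    idle_fraction = max(0.0, cores - load_average) / cores
    return mflops * idle_fraction


def rank_hosts(hosts):
    """Return host names sorted by estimated available MFLOPS, best first."""
    ranked = sorted(
        hosts,
        # load1 is the 1-minute load average, the most recent of the three.
        key=lambda h: available_mflops(h["mflops"], h["cores"], h["load1"]),
        reverse=True,
    )
    return [h["name"] for h in ranked]


if __name__ == "__main__":
    # Hypothetical measurements for three hosts.
    hosts = [
        {"name": "HOST1.MYNETWORK", "mflops": 8000, "cores": 8, "load1": 7.5},
        {"name": "HOST2.MYNETWORK", "mflops": 2000, "cores": 2, "load1": 0.0},
        {"name": "HOST3.MYNETWORK", "mflops": 4000, "cores": 4, "load1": 3.0},
    ]
    print("SUGGESTED HOSTS (IN ORDER OF PREFERENCE)")
    for name in rank_hosts(hosts):
        print(name)
```

With these made-up numbers the idle two-core host ranks above the heavily loaded eight-core host, which is the behavior the question asks for. Available RAM and the number of logged-in users are not used here and could be added as extra filters.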
Have you considered a distributed approach to computation? Not all computations can be broken up such that more than one cpu can work on them. But perhaps your problem space can benefit from some parallelization. Have a look at Hadoop.
You don't need to know FLOPS. The parallel computing center I go to uses Beowulf modules and has a script for this, for sure.
PDC operates leading-edge, high-performance computers on a national level. PDC offers easily accessible computational resources that primarily cater to the ...