How to understand the SmUtil returned by nvmlDeviceGetProcessUtilization?

I'm writing a program that monitors how processes use the GPU, and I found an API provided by NVML, nvmlDeviceGetProcessUtilization.
According to this API's comment, it reads the recent utilization of GPU SM (3D/Compute), framebuffer, video encoder, and video decoder for running processes.
I called the API every 1 or every 10 seconds and printed the returned samples, as follows.
"ReadTime" indicates when my program called the API; each "sample" line is one sample returned by the API.
ReadTime:10:58:56.194 - sample:[Pid:28128, Timestamp:10:58:55.462519, SmUtil:05, MemUtil:01, EncUtil:00, DecUtil:00]
ReadTime:10:58:56.194 - sample:[Pid:28104, Timestamp:10:58:55.127657, SmUtil:05, MemUtil:02, EncUtil:00, DecUtil:00]
ReadTime:10:58:56.194 - sample:[Pid:28084, Timestamp:10:58:48.051124, SmUtil:03, MemUtil:01, EncUtil:00, DecUtil:00]
ReadTime:10:58:56.194 - sample:[Pid:28050, Timestamp:10:58:53.944518, SmUtil:03, MemUtil:01, EncUtil:00, DecUtil:00]
ReadTime:10:58:56.194 - sample:[Pid:27989, Timestamp:10:58:47.043732, SmUtil:03, MemUtil:01, EncUtil:00, DecUtil:00]
ReadTime:10:58:56.194 - sample:[Pid:27976, Timestamp:10:58:53.604955, SmUtil:09, MemUtil:03, EncUtil:00, DecUtil:00]
ReadTime:10:58:56.194 - sample:[Pid:27814, Timestamp:10:58:48.386200, SmUtil:19, MemUtil:07, EncUtil:00, DecUtil:00]
ReadTime:10:58:56.194 - sample:[Pid:27900, Timestamp:10:58:56.132879, SmUtil:17, MemUtil:06, EncUtil:00, DecUtil:00]
ReadTime:10:58:56.194 - sample:[Pid:27960, Timestamp:10:58:51.423172, SmUtil:06, MemUtil:02, EncUtil:00, DecUtil:00]
ReadTime:10:58:56.194 - sample:[Pid:27832, Timestamp:10:58:47.883811, SmUtil:21, MemUtil:08, EncUtil:00, DecUtil:00]
SUM - GPUId:0, process:10, smSum:91, memSum:32, encSum:0, decSum:0
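For reference, output like the above can be produced by a polling loop along the following lines. This is only a minimal pynvml sketch, not the original program: the device index, poll interval, and print formatting are my own choices, and error handling for the case where no new samples are available is omitted.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: monitor GPU 0

last_seen = 0  # 0 asks the driver for every sample it currently buffers
while True:
    # Returns at most one nvmlProcessUtilizationSample_t per process
    # observed since the `last_seen` timestamp.
    samples = pynvml.nvmlDeviceGetProcessUtilization(handle, last_seen)
    for s in samples:
        print(f"Pid:{s.pid}, Timestamp:{s.timeStamp}, SmUtil:{s.smUtil}, "
              f"MemUtil:{s.memUtil}, EncUtil:{s.encUtil}, DecUtil:{s.decUtil}")
    time.sleep(10)  # or 1 second, as in the experiment above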
My questions are:
Why does this API return only one sample per process, regardless of whether I call it every second or every 10 seconds?
The timestamps of the samples look irregular. How does NVML determine the sampling time?
How is SmUtil derived? According to the description of the nvmlUtilization_st struct in the nvml.h header file (the struct used by nvmlDeviceGetUtilizationRates), GPU utilization is the "Percent of time over the past sample period during which one or more kernels was executing on the GPU". My understanding is that even if the GPU has many cores, a time slice in which just one core is busy counts as the whole GPU being busy, and those time slices form the numerator when computing the utilization of the entire GPU. Given that, how should the per-process SmUtil returned by nvmlDeviceGetProcessUtilization be understood?
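For comparison, the device-wide number that the quoted definition applies to can be read like this (again a minimal pynvml sketch for GPU 0); whether and how the per-process SmUtil decomposes this device-wide figure is exactly what the question is about.
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # assumption: GPU 0

# Device-wide utilization per nvmlUtilization_st:
#   .gpu    = percent of time one or more kernels was executing
#   .memory = percent of time global memory was being read or written
rates = pynvml.nvmlDeviceGetUtilizationRates(handle)
print(f"GPU util: {rates.gpu}%, memory util: {rates.memory}%")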

Related

Optimize batch transform inference on SageMaker

With the current batch transform inference I see a lot of bottlenecks:
Each input file can only have close to 1000 records.
It is currently processing 2000 records/min on 1 instance of ml.g4dn.12xlarge.
GPU instances are not necessarily giving any advantage over CPU instances.
I wonder if this is a limitation of the currently available TensorFlow Serving container v2.8. If that's the case, which config should I play with to increase performance?
I tried changing max_concurrent_transforms, but it doesn't seem to really help.
My current config:
transformer = tensorflow_serving_model.transformer(
    instance_count=1,
    instance_type="ml.g4dn.12xlarge",
    max_concurrent_transforms=0,
    output_path=output_data_path,
)
transformer.transform(
    data=input_data_path,
    split_type='Line',
    content_type="text/csv",
    job_name=job_name + datetime.now().strftime("%m-%d-%Y-%H-%M-%S"),
)
Generally speaking, you should first have a performant model (steps 1+2 below) yielding a satisfactory TPS before you move on to batch transform parallelization techniques to push your overall TPS higher with parallelization knobs.
Steps:
GPU enabling - Run a manual test to see that your model can utilize GPU instances to begin with (this isn't related to batch transform).
Picking an instance - Use SageMaker Inference Recommender to find the most cost-effective instance type to run inference on.
Batch transform inputs - Sounds like you have multiple input files, which is needed if you want to speed up the job by adding more instances.
Batch transform job single-instance knobs - If you are using the CreateTransformJob API, you can reduce the time it takes to complete batch transform jobs by using optimal values for parameters such as MaxPayloadInMB, MaxConcurrentTransforms, or BatchStrategy. The ideal value for MaxConcurrentTransforms is equal to the number of compute workers in the batch transform job. If you are using the SageMaker console, you can specify these optimal parameter values in the Additional configuration section of the Batch transform job configuration page. SageMaker automatically finds the optimal parameter settings for built-in algorithms. For custom algorithms, provide these values through an execution-parameters endpoint.
Batch transform cluster size - Increase the instance_count to more than 1, using the cost-effective instance you found in (1)+(2). A sketch of the last two knobs follows below.
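A minimal sketch of how the single-instance knobs and cluster size might look with the Transformer from the question; the specific values (instance_count=2, max_concurrent_transforms=4, MultiRecord batching, 6 MB payload) are illustrative assumptions to be tuned for your model, not verified recommendations.
transformer = tensorflow_serving_model.transformer(
    instance_count=2,                  # scale out; needs multiple input files to help
    instance_type="ml.g4dn.12xlarge",  # or whatever Inference Recommender suggests
    max_concurrent_transforms=4,       # roughly the number of compute workers per instance
    max_payload=6,                     # MaxPayloadInMB, tuned together with BatchStrategy
    strategy="MultiRecord",            # batch several CSV lines per request
    output_path=output_data_path,
)
transformer.transform(
    data=input_data_path,
    split_type='Line',
    content_type="text/csv",
)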

Tensorboard printed summary Precision and Accuracy does not show the Step count

We want to correlate the step count with the trends of Precision and Accuracy. The TensorBoard GUI allows this, but we want to automate the results. The printed results do not include the step count, at least not by default.
Is there a way to tease the Step Count from Tensorboard's printed results?
The steps are actually available in the TF Events file but require coding to extract and correlate. I show how to do that here: How to parse the tensorflow events file?
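A minimal sketch of that extraction, assuming TF 1.x-style event files and summary tags named 'Precision' and 'Accuracy' (the actual tag names and file path depend on how the summaries were written):
import tensorflow as tf  # TF 1.x; in TF 2.x use tf.compat.v1.train.summary_iterator

events_path = "path/to/events.out.tfevents.XXXX"  # hypothetical events file

# Collect (step, value) pairs per tag so the step count stays correlated with each metric.
history = {"Precision": [], "Accuracy": []}
for event in tf.train.summary_iterator(events_path):
    for value in event.summary.value:
        if value.tag in history:
            history[value.tag].append((event.step, value.simple_value))

for tag, points in history.items():
    for step, val in points:
        print(f"step={step} {tag}={val:.4f}")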

How to report algorithm running time?

I am running a variational auto-encoder in TensorFlow, which could take a long time. Thus I want to report the time the algorithm has been running for as a scalar on TensorBoard.
One dirty way is to hard-code the start time of the compilation into a global variable, or pass it as an argument to the model function and compute the difference with current time.
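That "dirty" approach might look roughly like the following TF 1.x sketch; the log directory, tag name, and the place where log_elapsed is called are my own assumptions.
import time
import tensorflow as tf  # TF 1.x summary API

start_time = time.time()                    # global start time (the "dirty" part)
writer = tf.summary.FileWriter("./logdir")  # hypothetical log directory

def log_elapsed(step):
    # Write the elapsed wall-clock time as a TensorBoard scalar at the given step.
    elapsed = time.time() - start_time
    summary = tf.Summary(value=[
        tf.Summary.Value(tag="train/elapsed_seconds", simple_value=elapsed)
    ])
    writer.add_summary(summary, global_step=step)

# call log_elapsed(step) from inside the training loop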
Does Tensorflow have a native way to do it?
There is tf.train.ProfilerHook; it comes with release 1.14.
Example usage:
estimator = tf.estimator.LinearClassifier(...)
hooks = [tf.train.ProfilerHook(output_dir=model_dir, save_secs=600, show_memory=False)]
estimator.train(input_fn=train_input_fn, hooks=hooks)
Executing the hook will generate files timeline-xx.json in output_dir.
Then open chrome://tracing/ in the Chrome browser and load the file. You will get a time-usage timeline.

How to save the training model at each training step instead of periodically based on a time interval? - in TensorFlow-Slim

slim.learning.train(...) accepts two arguments pertaining to saving the model (save_interval_secs) and saving the summaries (save_summaries_secs). The problem with this API is that it only allows saving the model/summary based on a time interval, but I need to do this at each step of the training.
How can I achieve this using the TF-Slim API?
Here is the slim.learning.train API:
def train(train_op,
          logdir,
          train_step_fn=train_step,
          train_step_kwargs=_USE_DEFAULT,
          log_every_n_steps=1,
          graph=None,
          master='',
          is_chief=True,
          global_step=None,
          number_of_steps=None,
          init_op=_USE_DEFAULT,
          init_feed_dict=None,
          local_init_op=_USE_DEFAULT,
          init_fn=None,
          ready_op=_USE_DEFAULT,
          summary_op=_USE_DEFAULT,
          save_summaries_secs=600,  # time-based summary saving
          summary_writer=_USE_DEFAULT,
          startup_delay_steps=0,
          saver=None,
          save_interval_secs=600,  # time-based model saving
          sync_optimizer=None,
          session_config=None,
          session_wrapper=None,
          trace_every_n_steps=None,
          ignore_live_threads=False):
Slim is deprecated, and using Estimator you get full control over saving / summary frequency.
You can also set the seconds to a very small number so it always saves.
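A minimal sketch of the Estimator route; save_checkpoints_steps and save_summary_steps give the per-step control that slim.learning.train lacks (my_model_fn, my_train_input_fn, and the directory are hypothetical placeholders).
import tensorflow as tf  # TF 1.x Estimator API

run_config = tf.estimator.RunConfig(
    model_dir="./model_dir",   # hypothetical checkpoint/summary directory
    save_checkpoints_steps=1,  # checkpoint after every training step
    save_summary_steps=1,      # write summaries after every training step
)
estimator = tf.estimator.Estimator(
    model_fn=my_model_fn,      # hypothetical: your existing model function
    config=run_config,
)
estimator.train(input_fn=my_train_input_fn, steps=1000)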

TF API Dataset: initialization

The tf.data pipeline works really great; I was able to speed up learning ~2x. But I still have a performance problem: GPU utilization is low (despite using tf.data with several workers).
My use case is the following:
~400 training examples, each with 10 input channels (~5 GB in total)
The task is segmentation using ResNet50. The forward-backward pass takes ~0.15 s. Batch size = 32
Data loading is fast, taking ~0.06 s.
But after each epoch (400/32 ≈ 13 iterations), data loading takes ~3.5 seconds, the same as the initialization of the loader (which is more than processing the whole epoch). This makes learning very slow.
My question is: is there an option to eliminate the initialization after each epoch and just feed the data continuously?
I tried setting dataset.repeat(10), but it does not help.
The loading code and train is here: https://gist.github.com/melgor/0e681a4fe8f125d25573aa30d8ace5f3
The model is just ResNet transformed into an encoder-decoder architecture for image segmentation. Most of the code is taken from https://github.com/argman/EAST, but since loading there is very slow, I would like to move it to TFRecords.
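As a point of reference for the repeat(10) attempt above, the usual pattern for avoiding a per-epoch re-initialization is an unbounded repeat placed before batching, with one iterator built once and reused for the whole run. A minimal TF 1.x sketch; the file list and parse_fn are hypothetical placeholders for this dataset.
import tensorflow as tf  # TF 1.x tf.data API

files = ["train-0.tfrecord", "train-1.tfrecord"]   # hypothetical TFRecord shards

dataset = (tf.data.TFRecordDataset(files)
           .map(parse_fn, num_parallel_calls=4)    # parse_fn is assumed to exist
           .shuffle(buffer_size=400)
           .repeat()                               # repeat forever: no per-epoch re-init
           .batch(32)
           .prefetch(1))                           # overlap loading with the train step

iterator = dataset.make_one_shot_iterator()        # built once, reused across epochs
next_batch = iterator.get_next()
# run the training step in a plain loop and treat every ~13 steps as one epoch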
I partly resolved my problem with the long initialization: I just made the tfrecord file smaller.
In my base implementation I used raw strings as images (i.e. a string from a NumPy array). The new tfrecord contains images compressed as JPEG or PNG. Thanks to that, the file is 50x smaller, which makes initialization much faster. But there is also a downside: your images need to be uint8 (JPEG) or uint16 (PNG). In the case of float data, you can use uint16, but there will be a loss of information.
For encoding a NumPy array to a compressed string you can use TensorFlow itself:
encoded_jpeg = tf.image.encode_jpeg(tf.constant(img), format='rgb').eval(session=sess)
encoded_png = tf.image.encode_png(tf.constant(png_image)).eval(session=sess)
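On the reading side, the compressed bytes then have to be decoded inside the input pipeline. A minimal sketch, assuming the bytes were stored under a feature named 'image/encoded' (the feature name is my own choice, and the decode call would need to be adapted to however the 10 channels are actually packed):
import tensorflow as tf  # TF 1.x

def parse_example(serialized):
    # 'image/encoded' is a hypothetical feature name for the compressed bytes.
    features = tf.parse_single_example(serialized, {
        'image/encoded': tf.FixedLenFeature([], tf.string),
    })
    # decode_image handles both JPEG and PNG payloads (uint8);
    # for 16-bit PNGs use tf.image.decode_png(..., dtype=tf.uint16) instead.
    image = tf.image.decode_image(features['image/encoded'])
    return tf.cast(image, tf.float32)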