NVIDIA GPU slows down unexpectedly - gpu

We have an application that takes data from a camera and processes it in real time on the GPU to render a scene. The GPU is an NVIDIA RTX 3000 in a Lenovo T15 laptop.
Our application starts and all goes well for a first rendering session: FPS holds at 30, GPU power is at 70 W, and CPU load is around 50%. One application session lasts a few minutes and the results are fully rendered in real time.
Then we initiate a second application session and the FPS plummets and rendering lags, which is unacceptable for our application (in the health space). Power drops to 30 W, and neither the GPU nor the CPU appears to be loaded at all.
We tried moving part of the initialization code to the point where we detect the beginning of a session. That works better and lets us handle a few sessions in a row, but eventually the behavior reappears after a while.
Temperature does not seem to be involved here, so we do not think the GPU is being thermally throttled.
Any suggestions on where to look and how to get more deterministic behavior? Is there anything specific we could reset between sessions?
Thanks a lot for any help.

Related

Why is fence status checking and resetting in Vulkan so slow?

If I check the status of a fence with vkGetFenceStatus() it takes about 0.002 milliseconds. This may not seem like a long time, but in a rendering or game engine it is a very long time, especially when repeated fence checks while doing other scheduled jobs quickly add up to something approaching a millisecond. If the fence statuses are kept host-side, why does it take so long to check and reset them? Do other people get similar timings when calling this function?
Ideally, the time it takes to check whether a fence is set shouldn't matter. While taking up 0.02% of a frame at 120 FPS isn't ideal, at the end of the day it should not be all that important. The ideal scenario works like this:
Basically, you should build your fence logic around the idea that you're only going to check the fence if it's almost certainly already set.
If you submit frame 1, you should not check the fence when you're starting to build frame 2. You should only check it when you're starting to build frame 3 (or 4, depending on how much delay you're willing to tolerate).
And most importantly, if it isn't set, that should represent a case where either the CPU isn't doing enough work or the GPU has been given too much work. If the CPU is outrunning the GPU, it's fine for the CPU to wait. That is, the CPU performance no longer matters, since you're GPU-bound.
So the time it takes to check the fence is more or less irrelevant.
If you're in a scenario where you're task dispatching and you want to run the graphics task ASAP, but you have other tasks available if the graphics task isn't ready yet, that's where this may become a problem. But even so, it would only be a problem for that small space of time between the first check to see if the graphics task is ready and the point where you've run out of other tasks to start and the CPU needs to start waiting on the GPU to be ready.
In that scenario, I would suggest testing the fence only twice per frame. Test it at the first opportunity; if it's not set, do all of the other tasks you can. After those tasks are dispatched/done... just wait on the GPU with vkWaitForFences. Either the fence is set and the function will return immediately, or you're waiting for the GPU to be ready for more data.
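For illustration, here is a minimal host-side sketch of that "check once, then wait" pattern. The device and fence come from your own setup; the three helper functions are hypothetical placeholders, not part of the Vulkan API:

```
// Hedged sketch of the "check once, then wait" pattern described above.
#include <vulkan/vulkan.h>
#include <cstdint>

bool have_other_task();                       // placeholder: any other CPU work queued?
void do_other_task();                         // placeholder: run one such task
void build_and_submit_next_frame(VkDevice);   // placeholder: record + vkQueueSubmit,
                                              // signaling frameFence again

void pump_frame(VkDevice device, VkFence frameFence)
{
    // First opportunity: has the GPU already finished the earlier frame?
    if (vkGetFenceStatus(device, frameFence) != VK_SUCCESS) {
        // Not signaled yet -- spend the time on whatever other work is available.
        while (have_other_task())
            do_other_task();

        // Out of other work: just block. This returns immediately if the
        // fence was signaled in the meantime.
        vkWaitForFences(device, 1, &frameFence, VK_TRUE, UINT64_MAX);
    }

    vkResetFences(device, 1, &frameFence);
    build_and_submit_next_frame(device);
}
```

The only fence call on the hot path is the initial vkGetFenceStatus; once the other work is exhausted, vkWaitForFences either returns immediately or blocks until the GPU catches up.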
There are other scenarios where this could be a problem. If the GPU lacks dedicated transfer queues, you may be testing the same fence for different purposes. But even in those cases, I would suggest only testing the fence once per frame. If the data upload isn't done, you either have to do a hard sync if that data is essential right now, or you delay using it until the next frame.
If this remains a concern, and your Vulkan implementation allows timeline semaphores, consider using them to keep track of queue progress. vkGetSemaphoreCounterValue may be faster than vkGetFenceStatus, since it's just reading a number.
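As a rough sketch of that approach, assuming Vulkan 1.2 (or VK_KHR_timeline_semaphore) and a timeline semaphore that each submit signals with an increasing frame index:

```
// Rough sketch: poll queue progress by reading a timeline semaphore counter.
#include <vulkan/vulkan.h>
#include <cstdint>

bool frame_finished(VkDevice device, VkSemaphore timeline, uint64_t frameIndex)
{
    uint64_t completed = 0;
    // Just reads the current counter value; no reset step is needed.
    if (vkGetSemaphoreCounterValue(device, timeline, &completed) != VK_SUCCESS)
        return false;  // treat errors as "not ready" in this sketch
    return completed >= frameIndex;
}
```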

Advanced GPU control needed for BrowserWindows

I would like to set how much GPU RAM is being used so as to prevent the app from overflowing on Windows. Does anyone have any expert advice?
I am using Electron to build an automatic player for Windows. This player plays a mix of videos (H.264-encoded MP4), HTML and JPEG content based on a schedule (sort of like a presentation).
I tested the app on several Windows devices, and the results vary greatly!
All devices are tiny computers by Asus. In general I noticed two distinct patterns:
On devices that have no hardware acceleration, the Chromium GPU process uses up about 30 MB of shared RAM, and this number never changes, regardless of the content played. The CPU, however, carries all the load here, meaning it is decoding the MP4s (H.264) in software instead of hardware.
On devices with hardware acceleration the CPU load is of course lower, but the RAM used by the Chromium GPU process varies greatly. While displaying JPEG or HTML content it is about 0.5 GB; when MP4s kick in, it easily goes up to 2 GB and more.
On the stronger devices without hardware acceleration this is not a big issue: they have 8 GB of shared memory or more and don't crash. However, some of the other devices have only 4 GB of shared memory and can run out of memory quite easily.
The result of this lack of memory is that either the app crashes completely (a memory-overflow message is displayed) or it just hangs (keeps running but doesn't do anything anymore, usually just displaying a white screen).
I know that I can pass certain Chromium flags to the BrowserWindow using app.commandLine.appendSwitch.
These are a few of the flags that I tried and the effect they had; I found a list of them here:
--force-gpu-mem-available-mb=600 ==> no effect whatsoever, process behaves as before and still surpasses 2GB of RAM.
--disable-gpu ==> This one obviously worked but is undesirable because it disabled hardware acceleration completely
--disable-gpu-memory-buffer-compositor-resources ==> no change
--disable-gpu-memory-buffer-video-frames ==> no change
--disable-gpu-rasterization ==> no change
--disable-gpu-sandbox ==> no change
Why are some of these command line switches not having any effect on the GPU behaviour? All devices have an onboard GPU and shared RAM. I know the command line switches are being applied on startup, because when I check the processes in the Windows Task Manager I can see the switches have been passed to the processes (using the Command Line column in Task Manager). So the switches are loaded but still appear to be ignored.

Using a GPU both as video card and GPGPU

Where I work, we do a lot of numerical computations and we are considering buying workstations with NVIDIA video cards because of CUDA (to work with TensorFlow and Theano).
My question is: should these computers come with another video card to handle the display and free the NVIDIA for the GPGPU?
I would appreciate it if anyone knows of hard data on using a video card for display and GPGPU at the same time.
Having been through this, I'll add my two cents.
It is helpful to have a dedicated card for computations, but it is definitely not necessary.
I have used a development workstation with a single high-end GPU for both display and compute. I have also used workstations with multiple GPUs, as well as headless compute servers.
My experience is that doing compute on the display GPU is fine as long as demands on the display are typical for software engineering. In a Linux setup with a couple monitors, web browsers, text editors, etc., I use about 200MB for display out of the 6GB of the card -- so only about 3% overhead. You might see the display stutter a bit during a web page refresh or something like that, but the throughput demands of the display are very small.
One technical issue worth noting for completeness is that the NVIDIA driver, GPU firmware, or OS may have a timeout for kernel completion on the display GPU (run NVIDIA's 'deviceQueryDrv' to see the driver's "run time limit on kernels" setting). In my experience (on Linux), with machine learning, this has never been a problem since the timeout is several seconds and, even with custom kernels, synchronization across multiprocessors constrains how much you can stuff into a single kernel launch. I would expect the typical runs of the pre-baked ops in TensorFlow to be two or more orders of magnitude below this limit.
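If you would rather check this from code than from deviceQueryDrv, here is a minimal sketch using the CUDA runtime API; the attribute below reports the same "run time limit on kernels" flag:

```
// Minimal sketch: report the driver's "run time limit on kernels" flag for
// each device via the CUDA runtime API.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);
    for (int dev = 0; dev < deviceCount; ++dev) {
        int timeoutEnabled = 0;
        cudaDeviceGetAttribute(&timeoutEnabled, cudaDevAttrKernelExecTimeout, dev);
        printf("Device %d: run time limit on kernels: %s\n",
               dev, timeoutEnabled ? "yes (display watchdog applies)" : "no");
    }
    return 0;
}
```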
That said, there are some big advantages to having multiple compute-capable cards in a workstation (whether or not one is used for display). Of course there is the potential for more throughput (if your software can use it). However, the main advantage, in my experience, is being able to run long experiments while concurrently developing new ones.
It is of course feasible to start with one card and then add one later, but make sure your motherboard has lots of room and your power supply can handle the load. If you decide to have two cards, with one being a low-end card dedicated to display, I would specifically advise against having the low-end card be a CUDA-capable card lest it get selected as a default for computation.
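One way to guard against that, sketched below, is to select the compute device explicitly rather than relying on device 0 being the right card. For frameworks like TensorFlow, setting the CUDA_VISIBLE_DEVICES environment variable before launch accomplishes the same thing; in your own CUDA code you can do something like the following, where the "most memory" heuristic is just an illustrative assumption:

```
// Sketch only: pick the compute GPU explicitly instead of relying on the
// default device ordering. The "most memory" heuristic usually points at the
// dedicated card rather than a low-end display card, but verify device names
// on your own machine.
#include <cstdio>
#include <cuda_runtime.h>

int pick_compute_device()
{
    int deviceCount = 0;
    cudaGetDeviceCount(&deviceCount);

    int best = 0;
    size_t bestMem = 0;
    for (int dev = 0; dev < deviceCount; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        if (prop.totalGlobalMem > bestMem) {
            bestMem = prop.totalGlobalMem;
            best = dev;
        }
    }
    cudaSetDevice(best);  // subsequent CUDA calls in this thread use this device
    printf("Using CUDA device %d for compute\n", best);
    return best;
}
```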
Hope that helps.
In my experience it is awkward to share a GPU card between numerical computation tasks and driving a video monitor. For example, there is limited memory available on any GPU, which is often the limiting factor in the size of a model you can train. Unless you're doing gaming, a fairly modest GPU is probably adequate to drive the video. But for serious ML work you will probably want a high-performance card. Where I work (Google) we typically put two GPUs in desk-side machines when one is to be used for numerical computation.

CUDA - NVIDIA driver crash while running

I run a raytracer in CUDA with N bounces (each ray will bounce N times).
I view the results using OpenGL.
When N is small (1-4) everything works great. Once I make N big (~10), each of the roughly 800x1000 threads has to do a lot of computing, and this is when the screen goes black and then comes back on, with a notification that my NVIDIA driver crashed.
I searched online and now think that the cause is some sort of watchdog timer, since I use the same graphics card for my display and my computation (the computation takes more than 2 seconds, so the driver resets itself).
Is there a command to make the host (CPU) WAIT for the device (GPU) for as long as it takes?
What do I need to do? I'm stuck :(
Thanks
Based on your description, you are running on Windows Vista or Windows 7. Windows operating systems have a watchdog timer, as you guessed. The watchdog timer only applies to GPUs with displays attached.
The easiest solution is to run 2 or more GPUs, and run CUDA on GPU(s) without a display attached.
You can disable the watchdog timer. See this question for more details. However you should do so with care—remember that when you have a long running kernel on your primary display GPU you will make your computer completely unresponsive (at least you won't be able to see what it is doing) until the kernel completes.

Non-graphics benchmarks for GPU

Most of the benchmarks for GPU performance and load testing are graphics related. Is there any benchmark that is computationally intensive but not graphics related? I am using:
DELL XPS 15 laptop,
NVIDIA GT 525M graphics card,
Ubuntu 11.04 with Bumblebee installed.
I want to load test my system to come up with the maximum load the graphics card can handle. Are there any non-graphics benchmarks for the GPU?
What exactly do you want to measure?
To measure GFLOPS on the card, just write a simple kernel in CUDA (or OpenCL).
If you have never written anything in CUDA, let me know and I can post something for you.
If your application is not compute-intensive (take a look at a roofline paper), then I/O will be the bottleneck. Getting data from global (card) memory to the processor takes hundreds of cycles.
On the other hand, if your application IS compute-intensive, then just time it and calculate how many bytes you process per second. In order to hit the maximum GFLOPS (your card can do about 230), you need many FLOPs per memory access, so that the processors stay busy rather than stalling on memory and switching threads.
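To make that concrete, here is a minimal, untuned sketch of the kind of CUDA kernel you could use as a non-graphics load test: each thread runs a long chain of fused multiply-adds per single memory access, so the run is compute-bound, and CUDA events give a rough throughput figure. The grid size and iteration count are arbitrary choices for illustration, and because each thread's FMAs form a dependent chain the result is a rough lower bound rather than a peak number:

```
// Minimal, untuned sketch of a compute-bound load-test kernel: each thread
// runs a long chain of fused multiply-adds (2 FLOPs each) per single store,
// so the run measures arithmetic throughput rather than memory bandwidth.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void fma_loop(float* out, int itersPerThread)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    float a = 1.0001f, b = 0.9999f;
    for (int i = 0; i < itersPerThread; ++i)
        a = a * b + b;      // 2 FLOPs per iteration, dependent chain
    out[idx] = a;           // one store so the loop isn't optimized away
}

int main()
{
    const int threads = 256, blocks = 4096, iters = 20000;  // arbitrary sizes
    float* d_out;
    cudaMalloc(&d_out, threads * blocks * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    fma_loop<<<blocks, threads>>>(d_out, iters);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    double flops = 2.0 * iters * (double)threads * blocks;  // 2 FLOPs per loop iteration
    printf("~%.1f GFLOPS\n", flops / (ms / 1000.0) / 1e9);

    cudaFree(d_out);
    return 0;
}
```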