On two PCs, the exact same data is exported from “Cinema 4D” with the “Redshift” renderer.
Comparing the two, one uses the GPU at 100% while the other uses very little (both use about the same amount of GPU memory).
Cinema4D, Redshift and GPU driver versions are the same.
GPU is RTX3060
64GB memory
OS is Windows 10
M.2 SSD
The only difference is the CPU.
The PC that uses the GPU at 100% has a 12th Gen Intel Core i9-12900K.
The other one has an AMD Ryzen 9 5950 16-core.
Why is the GPU utilization so different?
Also, is it possible to adjust the PC settings so that the second machine also uses the GPU at 100%?
This is a total newbie question but I've been searching for a couple days and cannot find the answer.
I am using CuPy to allocate a large array of doubles (circa 655k rows x 4k columns), which is about 16 GB in RAM. I'm running on a p2.8xlarge (the AWS instance that claims to have 96 GB of GPU RAM and 8 GPUs), but when I allocate the array it gives me an out-of-memory error.
Is this happening because the 96 GB of RAM is split into 8 x 12 GB lots that are each accessible only to their own GPU? Is there no concept of pooling the GPU RAM across the GPUs (like regular RAM in a multi-CPU machine)?
From playing around with it a fair bit, I think the answer is no, you cannot pool memory across GPUs. You can move data back and forth between GPUs and the CPU, but there's no concept of unified GPU RAM accessible to all GPUs.
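The sizing argument can be checked with a few lines of arithmetic. This is only a sketch in plain Python (no CuPy calls): the dimensions are the rounded "circa" figures from the question, and the 12 GiB per-GPU figure is an assumption derived from dividing 96 GB by 8 GPUs. With these rounded numbers the array comes out somewhat larger than the 16 GB quoted, but either way it exceeds what a single GPU holds.

```python
import math

BYTES_PER_DOUBLE = 8
GIB = 2**30

# Rounded dimensions from the question ("circa 655k rows x 4k columns").
rows, cols = 655_000, 4_000

# ASSUMPTION: each of the 8 GPUs exposes about 12 GiB (96 GB / 8).
num_gpus = 8
per_gpu_bytes = 12 * GIB

# Size of the whole array if allocated in one piece.
total_bytes = rows * cols * BYTES_PER_DOUBLE
fits_on_one_gpu = total_bytes <= per_gpu_bytes

# Size of each piece if the rows are split evenly across the GPUs.
rows_per_gpu = math.ceil(rows / num_gpus)
chunk_bytes = rows_per_gpu * cols * BYTES_PER_DOUBLE

print(f"whole array: {total_bytes / GIB:.2f} GiB, fits on one GPU: {fits_on_one_gpu}")
print(f"per-GPU chunk: {rows_per_gpu} rows, {chunk_bytes / GIB:.2f} GiB")
```

If the computation can be expressed per row block, each chunk could then be allocated on its own device, for example inside a `with cupy.cuda.Device(i):` block, and the partial results combined on the host.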
I am using an AMD Radeon Pro Duo for my OpenCL application.
It has dual Fiji GPUs. How can I configure CrossFire to make them work as one device? I am using clGetDeviceInfo in OpenCL to check the device compute units, but it shows 64 for each Fiji GPU.
I have a total of 128 compute units across the two GPUs. How can I use all of them with CrossFire?
OpenCL has device fission but not device fusion. Devices can share memory for efficiency but shaders can't be joined.
There are also some functions that can't synchronize between two GPUs yet:
Atomic functions in kernels
Prefetch command (which GPU's global cache?)
clEnqueueAcquireGLObjects (which GPU's buffer?)
clCreateBuffer (which device memory does it choose? we can't choose.)
clEnqueueTask (where does this task go?)
You should partition the encoding work into two pieces and run them on both GPUs. This may even require CrossFire to be disabled if the drivers have problems with it. This shouldn't be harder than writing a GPGPU encoder.
But you may need to copy the data to only one of the devices, then copy half of it from that buffer to the other GPU, instead of passing through PCIe twice. The inter-GPU connection should be faster than PCIe.
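The partitioning step can be sketched without any OpenCL calls. The helper below, `split_range`, is a hypothetical name introduced for illustration; it only computes which slice of the work items each device would get.

```python
def split_range(total_items, num_devices):
    """Return one (offset, count) pair per device.

    The pairs cover range(total_items) exactly once; when the split is
    uneven, the first devices receive one extra item each.
    """
    base, extra = divmod(total_items, num_devices)
    parts = []
    offset = 0
    for i in range(num_devices):
        count = base + (1 if i < extra else 0)
        parts.append((offset, count))
        offset += count
    return parts


# Example: an odd number of work items split across the two Fiji GPUs.
parts = split_range(1_000_001, 2)
for device_index, (offset, count) in enumerate(parts):
    print(f"device {device_index}: offset={offset}, count={count}")
```

In a real host program, each pair would presumably be passed as the global work offset and global work size when enqueuing the kernel on that device's own command queue.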
I have a Mac, and consequently have been running TensorFlow without GPU support (because it's not official yet). However, there are some hacked-together implementations that I'm thinking of installing, provided the performance gains are worth the trouble. How much faster (approximately) would TensorFlow run on a MacBook Pro with GPU support?
Thanks
As a rule of thumb, somewhere between 10 and 20 times faster. That is what I've found just running the standard examples.
To give you an idea of the speed difference, I ran some language modelling code (similar to the PTB example), with a fairly large data set, on 3 different machines with the following results:
Intel Xeon X5690 (CPU only): 1 day, 19 hours
Nvidia Grid K520 (on Amazon AWS): 17 hours
Nvidia Tesla K80: 4 hours
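For reference, the speedups implied by these three timings can be worked out directly. This is just arithmetic on the figures quoted above, with the CPU-only run as the baseline; the variable names are made up for the example.

```python
# Wall-clock times from the three runs, converted to hours.
timings_hours = {
    "Intel Xeon X5690 (CPU only)": 1 * 24 + 19,  # 1 day, 19 hours
    "Nvidia Grid K520 (on Amazon AWS)": 17,
    "Nvidia Tesla K80": 4,
}

baseline = timings_hours["Intel Xeon X5690 (CPU only)"]

# Speedup relative to the CPU-only run.
speedups = {name: baseline / hours for name, hours in timings_hours.items()}

for name, speedup in speedups.items():
    print(f"{name}: {speedup:.2f}x")
```

The K80 lands at roughly 11 times the CPU-only speed, while the older K520 gives only about 2.5 times, so the gain depends heavily on which GPU is used.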
I am interested in trying out GPU programming. One thing that is not clear to me is what hardware I need. Is it right that any PC with a graphics card will do? I have very little knowledge of GPU programming, so the initial learning curve should not be too steep. If I have to make a lot of hacks just to run some tutorial because my hardware is not good enough, I'd rather buy new hardware.
I have a retired PC (~10 years old) with Ubuntu Linux installed. I am not sure what graphics card it has; it must be some old one.
I am also planning to buy a new sub-$500 desktop which, according to my casual research, normally has an AMD Radeon 7x or Nvidia GT 6x graphics card. I assume any new PC is good enough for learning to program.
Anyway any suggestion is appreciated.
If you want to use CUDA, you'll need a GPU from NVidia, and their site explains the compute capabilities of their different products.
If you want to learn OpenCL, you can start right now with an OpenCL implementation that has a CPU back-end. The basics of writing OpenCL code targeting CPUs or GPUs are the same; the two differ mainly in performance tuning.
For GPU programming, any AMD or NVidia GPU made in the past several years will have some degree of OpenCL support, though there have been some new features introduced with newer generations that can't be easily emulated for older generations.
Intel's integrated GPUs in Ivy Bridge and later support OpenCL, but Intel only provides a GPU-capable OpenCL implementation for Windows, not Linux.
Also be aware that there is a huge difference between a mid-range and high-end GPU in terms of compute capabilities, especially where double-precision arithmetic is supported. Some low-end GPUs don't support double-precision at all, and mid-range GPUs often perform double-precision arithmetic 24 times slower than single-precision. When you want to do a lot of double-precision calculations, it's absolutely worth it to get a compute-oriented GPU (Radeon 7900 series or GeForce Titan and up).
If you want a low-end system with non-trivial GPU power, your best bet at the moment is probably to get a system built around an AMD APU.
Most of the benchmarks for GPU performance and load testing are graphics related. Is there any benchmark that is computationally intensive but not graphics related? I am using
a DELL XPS 15 laptop,
an NVIDIA GT 525M graphics card,
Ubuntu 11.04 with Bumblebee installed.
I want to load test my system to find the maximum load the graphics card can handle. Are there any non-graphics benchmarks for GPUs?
What exactly do you want to measure?
To measure GFLOPS on the card, just write a simple kernel in CUDA (or OpenCL).
If you have never written anything in CUDA, let me know and I can post something for you.
If your application is not compute intensive (take a look at a roofline paper), then I/O will be the bottleneck. Getting data from global (card) memory to the processor takes hundreds of cycles.
On the other hand, if your application IS compute intensive, then just time it and calculate how many bytes you process per second. In order to hit the maximum GFLOPS (your card can do 230), you need many FLOPs per memory access, so that the processors stay busy instead of stalling on memory and switching threads.
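The roofline idea mentioned above can be illustrated with a rough calculation. In this sketch the 230 GFLOPS peak is the figure quoted in this answer, while the memory bandwidth of 28.8 GB/s is an assumption made for the example and should be replaced with the real value for the card. The function name `attainable_gflops` is likewise made up.

```python
PEAK_GFLOPS = 230.0        # peak compute quoted in the answer
MEM_BANDWIDTH_GBS = 28.8   # ASSUMPTION: card memory bandwidth in GB/s


def attainable_gflops(flops_per_byte, peak=PEAK_GFLOPS, bandwidth=MEM_BANDWIDTH_GBS):
    """Roofline bound: limited by compute or by memory traffic, whichever is lower."""
    return min(peak, bandwidth * flops_per_byte)


# Arithmetic intensity at which a kernel stops being memory bound.
ridge_point = PEAK_GFLOPS / MEM_BANDWIDTH_GBS

# Example kernel: single-precision y = a*x + y.
# Per element: 2 FLOPs, 12 bytes moved (read x, read y, write y).
saxpy_intensity = 2 / 12

print(f"ridge point: {ridge_point:.2f} FLOPs/byte")
print(f"SAXPY bound: {attainable_gflops(saxpy_intensity):.1f} GFLOPS")
print(f"at 100 FLOPs/byte: {attainable_gflops(100):.1f} GFLOPS")
```

Under these assumptions a streaming kernel like SAXPY is capped at a few GFLOPS no matter how fast the cores are, which is why a benchmark meant to reach the peak needs many FLOPs per byte loaded.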