Julia uses only 20-30% of my CPU. What should I do? - optimization

I am running a program that does numeric ODE integration in Julia. I am running Windows 10 (64bit), with Intel Core i7-4710MQ # 2.50Ghz (8 logical processors).
I noticed that when my code was running on julia, only max 30% of CPU is in usage. Going into the parallelazation documentation, I started Julia using:
C:\Users\*****\AppData\Local\Julia-0.4.5\bin\julia.exe -p 8 and expected to see improvements. I did not see them however.
Therefore my question is the following:
Is there a special way I have to write my code in order for it to use CPU more efficiently? Is this maybe a limitation posed by my operating system (windows 10)?
I submit my code in the julia console with the command:
include("C:\\Users\\****\\AppData\\Local\\Julia-0.4.5\\13. Fast Filesaving Format.jl").
Within this code I use some additional packages with:
using ODE; using PyPlot; using JLD.
I measure the CPU usage with windows' "Task Manager".

The -p 8 option to julia starts 8 worker processes, and disables multithreading in libraries like BLAS and FFTW so that the workers don't oversubscribe the physical threads on the system – since this kills performance in well-balanced distributed workloads. If you want to get more speed out of -p 8 then you need to distribute work between those workers, e.g. by having each of them do an independent computation, or by having the collaborate on a computation via SharedArrays. You can't just add workers and not change the program. If you are using BLAS (doing lots of matrix multiplies) or FFTW (doing lots of Fourier transforms), then if you don't use the -p flag, you'll automatically get multithreading from those libraries. Otherwise, there is no (non-experimental) user-level threading in Julia yet. There is experimental threading support and version 1.0 will support threading, but I wouldn't recommend that yet unless you're an expert.

Related

Is it possible to use system memory instead of GPU memory for processing Dask tasks

We have been running DASK clusters on Kubernetes for some time. Up to now, we have been using CPUs for processing and, of course, system memory for storing our Dataframe of around 1,5 TB (per DASK cluster, split onto 960 workers). Now we want to update our algorithm to take advantage of GPUs. But it seems like the available memory on GPUs is not going to be enough for our needs, it will be a limiting factor(with our current setup, we are using more than 1GB of memory per virtual core).
I was wondering if it is possible to use GPUs (thinking about NVDIA, AMD cards with PCIe connections and their own VRAMS, not integrated GPUs that use system memory) for processing and system memory (not GPU memory/VRAM) for storing DASK Dataframes. I mean, is it technically possible? Have you ever tried something like this? Can I schedule a kubernetes pod such that it uses GPU cores and system memory together?
Another thing is, even if it was possible to allocate the system RAM as VRAM of GPU, is there a limitation to the size of this allocatable system RAM?
Note 1. I know that using system RAM with GPU (if it was possible) will create an unnecessary traffic through PCIe bus, and will result in a degraded performance, but I would still need to test this configuration with real data.
Note 2. GPUs are fast because they have many simple cores to perform simple tasks at the same time/in parallel. If an individual GPU core is not superior to an individual CPU core then may be I am chasing the wrong dream? I am already running dask workers on kubernetes which already have access to hundreds of CPU cores. In the end, having a huge number of workers with a part of my data won't mean better performance (increased shuffling). No use infinitely increasing the number of cores.
Note 3. We are mostly manipulating python objects and doing math calculations using calls to .so libraries implemented in C++.
Edit1: DASK-CUDA library seems to support spilling from GPU memory to host memory but spilling is not what I am after.
Edit2: It is very frustrating that most of the components needed to utilize GPUs on Kubernetes are still experimental/beta.
Dask-CUDA: This library is experimental...
NVIDIA device plugin: The NVIDIA device plugin is still considered beta and...
Kubernetes: Kubernetes includes experimental support for managing AMD and NVIDIA GPUs...
I don't think this is possible directly as of today, but it's useful to mention why and reply to some of the points you've raised:
Yes, dask-cuda is what comes to mind first when I think of your use-case. The docs do say it's experimental, but from what I gather, the team has plans to continue to support and improve it. :)
Next, dask-cuda's spilling mechanism was designed that way for a reason -- while doing GPU compute, your biggest bottleneck is data-transfer (as you have also noted), so we want to keep as much data on GPU-memory as possible by design.
I'd encourage you to open a topic on Dask's Discourse forum, where we can reach out to some NVIDIA developers who can help confirm. :)
A sidenote, there are some ongoing discussion around improving how Dask manages GPU resources. That's in its early stages, but we may see cool new features in the coming months!

What is the difference between the gem5 CPU models and which one is more accurate for my simulation?

When running a simulation in gem5, I can select a CPU with fs.py --cpu-type.
This option can also show a list of all CPU types if I use an invalid CPU type such as fs.py --cpu-type.
What is the difference between those CPU types and which one should I choose for my experiment?
Question inspired by: https://www.mail-archive.com/gem5-users#gem5.org/msg16976.html
An overview of the CPU types can be found at: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-cpu-types
In summary:
simplistic CPUs (derived from BaseSimpleCPU): for example AtomicSimpleCPU (the default one). They have no CPU pipeline, and therefor are completely unrealistic. However, they also run much faster. Therefore,they are mostly useful to boot Linux fast and then checkpoint and switch to a more detailed CPU.
Within the simple CPUs we can notably distinguish:
AtomicSimpleCPU: memory requests finish immediately
TimingSimpleCPU: memory requests actually take time to go through to the memory system and return. Since there is no CPU pipeline however, the simulated CPU stalls on every memory request waiting for a response.
An alternative to those is to use KVM CPUs to speed up boot if host and guest ISA are the same, although as of 2019, KVM is less stable as it is harder to implement and debug.
in-order CPUs: derived from the generic MinorCPU by parametrization, Minor stands for In Order:
for ARM: HPI is made by ARM and models a "(2017) modern in-order Armv8-A implementation". This is your best in-order ARM bet.
out-of-order CPUs, derived from the generic DerivO3CPU by parametrization, O3 stands for Out Of Order:
for ARM: there are no models specifically published by ARM as of 2019. The only specific O3 model available is ex5_big for an A15, but you would have to verify its authors claims on how well it models the real core A15 core.
If none of those are accurate enough for your purposes, you could try to create your own in-order/out-of-order models by parametrizing MinorCPU / DerivO3CPU like HPI and ex5_big do, although this could be hard to get right, as there isn't generally enough public information on non-free CPUs to do this without experiments or reverse engineering.
The other thing you will want to think about is the memory system model. There are basically two choices: classical vs Ruby, and within Ruby, several options are available, see also: https://cirosantilli.com/linux-kernel-module-cheat/#gem5-ruby-build

Simple explanation of terms while running configure command during Tensorflow installation

I am installing tensorflow from this link.
When I run the ./configurecommand, I see following terms
XLA JIT
GDR
VERBS
OpenCL
Can somebody explain in simple language, what these terms mean and what are they used for?
XLA stands for 'Accelerated Linear Algebra'. The XLA page states that: 'XLA takes graphs ("computations") [...] and compiles them into machine instructions for various architectures." As far as I understand will this take the computation you define in tensorflow and compile it. Think of producing code in C and then running it through the C compiler for the CPU and load the resulting shared library with the code for the full computation instead of making separate calls from python to compiled functions for each part of your computation. Theano does something like this by default. JIT stands for 'just in time compiler', i.e. the graph is compiled 'on the fly'.
GDR seems to be support for exchanging data between GPUs on different servers via GPU direct. GPU direct makes it possible that e.g. the network card which receives data from another server over the network writes directly into the local GPU's memory without going through the CPU or main memory.
VERBS refers to the Infiniband Verbs application programming interface ('library'). Infiniband is a low latency network used in many supercomputers for example. This can be used when you want to run tensorflow on more than one server for communication between them. The Verbs API is to Infiniband what the Berkeley Socket API is to TCP/IP communication (although there are many more communication options and different semantics optimized for performance with Verbs).
OpenCL is a programming language suited for executing parallel computing tasks on CPU and non-CPU devices such as GPUs, with a C like syntax. With respect to C however there are certain restrictions such as no support for recursion etc. One could probably say that OpenCL is to AMD what CUDA is to NVIDIA (although also OpenCL is also used by other companies like ALTERA).

Keras using GPU vs CPU

Are there any differences from the programming point of view(syntax,functions or any other ) in 'Keras with back-end as Tensorflow' while working on 'Keras GPU' and 'Keras CPU'? I meant if one program can run on a GPU enabled Keras, Will the same program run on Keras CPU(efficiency doesn't matter)?
GPU code running on CPU? Sure, it is basic Multithreading. And you needed to avoid the race conditions anyway.
For most intents a GPU is just a giant load of realy weak CPU's, wich allows highly effective Multithreading (basically a display-high times display-width Core).
The other way (running procedural CPU code in a massively paralell GPU enviroment) is where the work lies.

RAR password recovery on GPU using ATI Stream processor

I'm newbie in GPU programming , and i work on brute force RAR Password Recovery on ATI Stream Processor using brook+ language, but i see that the kernel written in brook+ language doesn't allow any calling to normal functions (except kernel functions) , my questions is :
1) how to use unrar.dll (to unrar archive files) API in this situation? and is this the only way to program RAR password recovery?
2) what about crack and ElcomSoft software that use GPU , how they work ?
3) what exactly the role for the function work inside GPU (ATI Stream processor or CUDA) in this program?
4) is nVidia/CUDA technology is easier/more flexible than ATI/brook+ language ?
1) unrar.dll is a compiled dynamic link library. These execute on the CPU. GPUs have vastly different machine code and a very different execution model, so they can't run dlls.
You could try to implement a callback from the GPU to the CPU via events, or build an x86 interpreter on the GPU, but these would almost certainly run slower than just running on the CPU.
Using unrar.dll is not the only way to program RAR password recovery. You could instead just build your own code for CPU and GPU from scratch.
2) They work by having the CPU code explicitly request that some GPU code run on the GPU.
3) I don't know exactly. I would guess though that it has a GPU program that tries many different combinations, and benefits from having these run in parallel.
4) CUDA is more mature than brook+. brook+ may be just as easy for simple tasks, but isn't as fully featured. For new projects most people would now choose OpenCL over brook+.
(I'm not sure what you're intending to do, but none of the above seems likely to enable anything sinister.)