Disabling AVX2 in CPU for testing purposes - testing

I've got an application that requires AVX2 to work correctly. A check was implemented to check during application start if CPU has AVX2 instruction. I would like to check if it works correctly, but i only have CPU that has AVX2. Is there a way to temporarly turn it off for testing purposes? Or to somehow emulate other CPU?

Yes, use an "emulation" (or dynamic recompilation) layer like Intel's Software Development Emulator (SDE), or maybe QEMU.
SDE is closed-source freeware, and very handy for both testing AVX512 code on old CPUs, or for simulating old CPUs to check that you don't accidentally execute instructions that are too new.
Example: I happened to have a binary that unconditionally uses an AVX2 vpmovzxwq load instruction (for a function I was testing). It runs fine on my Skylake CPU natively, but SDE has a -snb option to emulate a Sandybridge in both CPUID and actually checking every instruction.
$ sde64 -snb -- ./mask
TID 0 SDE-ERROR: Executed instruction not valid for specified chip (SANDYBRIDGE): 0x401005: vpmovzxwq ymm2, qword ptr [rip+0xff2]
Image: /tmp/mask+0x5 (in multi-region image, region# 1)
Instruction bytes are: c4 e2 7d 34 15 f2 0f 00 00
There are options to emulate CPUs as old as -quark, -p4 (SSE2), or Core 2 Merom (-mrm), to as new as IceLake-Server (-icx) or Tremont (-tnt). (And Xeon Phi CPUs like KNL and KNM.)
It runs pretty quickly, using dynamic recompilation (JIT) so code using only instructions that are supported natively can run at basically native speed, I think.
It also has instrumentation options (like -mix to dump the instruction mix), and options to control the JIT more closely. I think you could maybe get it to not report AVX2 in CPUID, but still let AVX2 instructions run without faulting.
Or probably emulate a CPU that supports AVX2 but not FMA (there is a real CPU like this from Via, unfortunately). Or combinations that no real CPU has, like AVX2 but not popcnt, or BMI1/BMI2 but not AVX. But I haven't looked into how to do that.
The basic sde -help options only let you set it to specific Intel CPUs, and for checking for potentially-slow SSE/AVX transitions (without correct vzeroupper usage). And a few other things.
One important test-case that SDE is missing is AVX+FMA without AVX2 (AMD Piledriver / Steamroller, i.e. most AMD FX-series CPUs). It's easy to forget and use an AVX2 shuffle in code that's supposed to be AVX1+FMA3, and some compilers (like MSVC) won't catch this at compile time the way gcc -march=bdver2 would. (Bulldozer only has AVX + FMA4, not FMA3, because Intel changed their plans after it was too late for AMD to redesign.)
If you just want CPUID not report the presence of AVX2 (and FMA?) so your code uses its AVX1 or non-AVX versions of functions, you can do that with most VMs.
For AVX instructions to run without faulting, a bit in a control register has to be set. (So this works like a promise by the OS that it will correctly save/restore the new architectural state of YMM upper halves). So disabling AVX in CPUID will give you a VM instance where AVX instructions fault. (At least 256-bit instructions? I haven't tried this to see if 128-bit AVX instructions can still execute in this state on HW that supports AVX.)

Related

Is it safer to use OpenCL rather than SYCL when the objective is to have the most hardware-compatible program?

My objective is to obtain the ability of parallelizing a code in order to be able to run it on GPU, and the Graal would be to have a software that can run in parallel on any GPU or even CPU (Intel, NVIDIA, AMD, and so...).
From what I understood, the best solution would be to use OpenCL. But shortly after that, I also read about SYCL, that is supposed to simplify the codes that run on GPU.
But is it just that ? Isn't better to use a lower level language in order to be sure that it will be possible to be used in the most hardware possible ?
I know that all the compatibilities are listed on The Khronos Group website, but I am reading everything and its opposite on the Internet (like if a NVIDIA card supports CUDA, then it supports OpenCL, or NVIDIA cards will never work with OpenCL, even though OpenCL is supposed to work with everything)...
This is a new topic to me and there are lots of informations on the Internet... It would be great if someone could give me a simple answer to this question.
Probably yes.
OpenCL is supported on all AMD/Nvidia/Intel GPUs and on all Intel CPUs since around 2009. For best compatibility with almost any device, use OpenCL 1.2. The nice thing is that the OpenCL Runtime is included in the graphics drivers, so you don't have to install anything extra to work with it or to get it working on another machine.
SYCL on the other hand is newer and not yet established that well. For example, it is not officially supported (yet) on Nvidia GPUs: https://forums.developer.nvidia.com/t/is-sycl-available-on-cuda/37717/7
But there are already SYCL implememtations that are compatible with Nvidia/AMD GPUs, essentially built on top of CUDA or OpenCL 1.2, see here: https://stackoverflow.com/a/63468372/9178992

Simple explanation of terms while running configure command during Tensorflow installation

I am installing tensorflow from this link.
When I run the ./configurecommand, I see following terms
XLA JIT
GDR
VERBS
OpenCL
Can somebody explain in simple language, what these terms mean and what are they used for?
XLA stands for 'Accelerated Linear Algebra'. The XLA page states that: 'XLA takes graphs ("computations") [...] and compiles them into machine instructions for various architectures." As far as I understand will this take the computation you define in tensorflow and compile it. Think of producing code in C and then running it through the C compiler for the CPU and load the resulting shared library with the code for the full computation instead of making separate calls from python to compiled functions for each part of your computation. Theano does something like this by default. JIT stands for 'just in time compiler', i.e. the graph is compiled 'on the fly'.
GDR seems to be support for exchanging data between GPUs on different servers via GPU direct. GPU direct makes it possible that e.g. the network card which receives data from another server over the network writes directly into the local GPU's memory without going through the CPU or main memory.
VERBS refers to the Infiniband Verbs application programming interface ('library'). Infiniband is a low latency network used in many supercomputers for example. This can be used when you want to run tensorflow on more than one server for communication between them. The Verbs API is to Infiniband what the Berkeley Socket API is to TCP/IP communication (although there are many more communication options and different semantics optimized for performance with Verbs).
OpenCL is a programming language suited for executing parallel computing tasks on CPU and non-CPU devices such as GPUs, with a C like syntax. With respect to C however there are certain restrictions such as no support for recursion etc. One could probably say that OpenCL is to AMD what CUDA is to NVIDIA (although also OpenCL is also used by other companies like ALTERA).

What instruction set does the Nvidia GeForce 6xx Series use?

Does the GeForce 6xx Series GPUS use RISC, CISC or VLIW style instructions?
In one source, at http://www.motherboardpoint.com/risc-cisc-t241234.html someone said "GPUs are probably closer to VLIW than to RISC or CISC".
In another source, at http://en.wikipedia.org/wiki/Very_long_instruction_word#implementations it says "both Nvidia and AMD have since moved to RISC architectures in order to improve performance on non-graphics workload"
AFAIK, Nvidia does not publicly document it's hardware instruction sets.
The best you can see officially is PTX ISA which is the instruction set of a virtual machine which Nvidia's compiler (or drivers) then convert to the real instruction set to be executed on specific GPU. cuobjdump utility can show you disassembled GPU code. IMHO it looks like a fairly typical RISC -- load+store+operations on registers.
On the other hand, some operations are very complex. For instance, texture lookup instruction does a lot -- it may interpolate coordinates, deal with coordinates being out of range, fetch required data and convert it to desired data type. While the syntax remains RISC-y, the substance feels like CISC.
They use a combination of VLIW and SIMID, this allows them to complete several processes at once rather than doing them one by one like RISC or CISC.

GPU Processing in vb.net

I've got a program that takes about 24 hours to run. It's all written in VB.net and it's about 2000 lines long. It's already multi-threaded and this works perfectly (after some sweat and tears). I typically run the processes with 10 threads but I'd like to increase that to reduce processing time, which is where using the GPU comes into it. I've search google for everything related that I can think of to find some info but no luck.
What I'm hoping for is a basic example of a vb.net project that does some general operations then sends some threads to the GPU for processing. Ideally I don't want to have to pay for it. So something like:
'Do some initial processing eg.
dim x as integer
dim y as integer
dim z as integer
x=int(textbox1.text)
y=int(textbox2.text)
z=x*y
'Do some multi-threaded operations on the gpu eg.
'show some output to the user once this has finished.
Any help or links will be much appreciated. I've read plenty of articles about it in c++ and other languages but I'm rubbish at understanding other languages!
Thanks all!
Fraser
The VB.NET compiler does not compile onto the GPU, it compiles down to an intermediate language (IL) that is then just-in-time compiled (JITed) for the target architecture at runtime. Currently only x86, x64 and ARM targets are supported. CUDAfy (see below) takes the IL and translates it into CUDA C code. In turn this is compiled with NVCC to generate code that the GPU can execute. Note that this means that you are limited to NVidia GPUs as CUDA is not supported on AMD.
There are other projects that have taken the same approach, for example a Python to CUDA translator in Copperhead.
CUDAfy - A wrapper on top of the CUDA APIs with additional libraries for FFTs etc. There is also a commercial version. This does actually
CUDAfy Translator
Using SharpDevelop's decompiler ILSpy as basis the translator converts .NET code to CUDA C.
There are other projects to allow you to use GPUs from .NET languages. For example:
NMath - A set of math libraries that can be used from .NET and are GPU enabled.
There may be others but these seem to be the main ones. If you decide to use CUDAfy then you will still need to invest some time in understanding enough of CUDA and how GPUs work to port your algorithm to fit the GPU data-parallel model. Unless it is something that can be done out of the box with one of the math libraries.
It's important to realize that there is still a performance hit for accessing the GPU from a .NET language. You must pay a cost for moving (marshaling) data from the .NET managed runtime into the native runtime. The overhead here largely depends on not only the size but also the type of data and if it can be marshaled without conversion. This is in addition to the cost of moving data from the CPU to the GPU etc.

RAR password recovery on GPU using ATI Stream processor

I'm newbie in GPU programming , and i work on brute force RAR Password Recovery on ATI Stream Processor using brook+ language, but i see that the kernel written in brook+ language doesn't allow any calling to normal functions (except kernel functions) , my questions is :
1) how to use unrar.dll (to unrar archive files) API in this situation? and is this the only way to program RAR password recovery?
2) what about crack and ElcomSoft software that use GPU , how they work ?
3) what exactly the role for the function work inside GPU (ATI Stream processor or CUDA) in this program?
4) is nVidia/CUDA technology is easier/more flexible than ATI/brook+ language ?
1) unrar.dll is a compiled dynamic link library. These execute on the CPU. GPUs have vastly different machine code and a very different execution model, so they can't run dlls.
You could try to implement a callback from the GPU to the CPU via events, or build an x86 interpreter on the GPU, but these would almost certainly run slower than just running on the CPU.
Using unrar.dll is not the only way to program RAR password recovery. You could instead just build your own code for CPU and GPU from scratch.
2) They work by having the CPU code explicitly request that some GPU code run on the GPU.
3) I don't know exactly. I would guess though that it has a GPU program that tries many different combinations, and benefits from having these run in parallel.
4) CUDA is more mature than brook+. brook+ may be just as easy for simple tasks, but isn't as fully featured. For new projects most people would now choose OpenCL over brook+.
(I'm not sure what you're intending to do, but none of the above seems likely to enable anything sinister.)