Can we offload OpenMP to any Intel GPU?

I'm using Ubuntu 14.04.
Is there a way to use OpenMP and offload the parallel code to Intel GPUs such as Intel HD Graphics?
If yes:
Which icc version do I need? (Can I do it with gcc?)
Which Intel processors are supported?

As far as I know, you can currently only offload OpenMP code to Intel MIC/Xeon Phi.
However, in the (near?) future OpenMP 4 should offer this kind of feature (see this post).
So for the moment, I think GPGPU on Intel HD Graphics can only be done with OpenCL and Intel Cilk Plus.

Some OpenMP 4 constructs work on Intel GPUs with the Intel C/C++ compiler.
I've tested the following code on a Xeon E3, probably of the Haswell (v3) generation, with Intel compiler version 15 or 16 (probably the latter). I tested on Linux and found that this feature is not supported on macOS.
void vadd4(int n, float * RESTRICT a, float * RESTRICT b, float * RESTRICT c)
{
#if defined(_OPENMP) && (_OPENMP >= 201307)
//#pragma omp target teams distribute map(to:n,a[0:n],b[0:n]) map(from:c[0:n])
#pragma omp target map(to:n,a[0:n],b[0:n]) map(from:c[0:n])
#pragma omp parallel for simd
#else
#warning No OpenMP target/simd support!
#pragma omp parallel for
#endif
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}
The full test code I used to evaluate Intel GPU compute software is https://github.com/jeffhammond/HPCInfo/blob/master/openmp/offload/test_vadd.c.
Unfortunately, the distribute and teams constructs are not supported for the -qopenmp-offload=gfx target, so one needs some preprocessing to generate functionally portable code.
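For reference, the compile line with the Intel compiler looked roughly like the following (hedged: the exact spelling of the offload flag varies between compiler versions):
$ icc -std=c99 -qopenmp -qopenmp-offload=gfx test_vadd.c -o test_vadd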
Additional documentation includes:
https://software.intel.com/en-us/articles/how-to-offload-computation-to-intelr-graphics-technology
https://software.intel.com/en-us/articles/pldi-tutorial-using-the-intelr-c-compiler-for-general-purpose-computation-offload-to-intelr
Disclaimer: I work for Intel, but in a research capacity. I am not responsible for implementing or supporting the Intel compiler or Intel GPU software.

Related

How do you make SYCL "default_selector" select an Intel GPU rather than an NVIDIA GPU?

I am currently working on a project using SYCL to apply an unsharp mask to an image. My machine has an NVIDIA and an Intel GPU inside it. I am starting with the following code:
default_selector deviceSelector;
queue myQueue(deviceSelector);
The issue is that the line "default_selector deviceSelector;" automatically grabs the NVIDIA GPU in my machine, which breaks all the code that follows, as SYCL does not work with NVIDIA.
Therefore my question is: how can I force "default_selector deviceSelector;" to get my Intel GPU and not the NVIDIA GPU? Perhaps I can say something like:
if (device.has_extension(cl::sycl::string_class("Intel")))
if (device.get_info<info::device::device_type>() == info::device_type::gpu)
then select this GPU;//pseudo code
Thus making the code skip over the NVIDIA GPU and guaranteeing the selection of my Intel GPU.
You are checking whether the extensions contain an entry called "Intel", which they will not. Extensions are features the device supports, such as SPIR-V; you can see the supported extensions by running clinfo at the command line. To choose the Intel GPU you need to check the manufacturer (or name) of the device and select the correct one.
So in the sample code for custom device selection https://github.com/codeplaysoftware/computecpp-sdk/blob/master/samples/custom-device-selector.cpp#L46
You would need to just have something like
if (device.get_info<info::device::name>() == "Name of device") {
  return 100;
}
You could print out the value of
device.get_info<info::device::name>()
to get the value to check against.
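For illustration, here is a minimal custom selector along those lines. It is a sketch assuming the SYCL 1.2.1 device_selector API used by ComputeCpp, and it matches on the vendor string rather than the exact device name (the class name and the "Intel" substring check are assumptions):
#include <CL/sycl.hpp>
#include <string>

using namespace cl::sycl;

// Score Intel GPUs highest; a negative score means "never select this device".
class intel_gpu_selector : public device_selector {
public:
  int operator()(const device& dev) const override {
    const bool is_gpu =
        dev.get_info<info::device::device_type>() == info::device_type::gpu;
    const std::string vendor = dev.get_info<info::device::vendor>();
    if (is_gpu && vendor.find("Intel") != std::string::npos)
      return 100;
    return -1;
  }
};
Passing an instance of this selector to the queue constructor (queue myQueue(intel_gpu_selector{});) replaces default_selector, so the NVIDIA device is skipped because it always receives a negative score.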

How to use perf_event_open under sampling mode to read the value of BRANCH STACK?

I use perf_event_open() in sampling mode to sample the branch stack, but it fails and I don't know why.
attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
If I don't set PERF_SAMPLE_BRANCH_STACK in attr.sample_type, everything works fine. I don't know why.
#include <linux/perf_event.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <errno.h>

static int perf_event_open(struct perf_event_attr *attr,
                           pid_t pid, int cpu, int group_fd, unsigned long flags)
{
    return syscall(__NR_perf_event_open, attr, pid, cpu, group_fd, flags);
}

int main(int argc, char **argv)
{
    pid_t pid = 0;

    // set up the event attributes
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(struct perf_event_attr));
    attr.size = sizeof(struct perf_event_attr);
    // disabled at init time
    attr.disabled = 1;
    // which event to count
    attr.type = PERF_TYPE_HARDWARE;
    attr.config = PERF_COUNT_HW_BRANCH_INSTRUCTIONS;
    // take a sample every 1,000,000 branch instructions
    attr.sample_period = 1000000;
    // what to record in each sample: the IP and the branch stack
    attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;
    // notify on every overflow
    attr.wakeup_events = 1;
    attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY_RETURN;

    // open the perf fd
    int perf_fd = perf_event_open(&attr, pid, -1, -1, 0);
    if (perf_fd < 0)
    {
        perror("perf_event_open() failed!");
        return errno;
    }
This fails with the error: Operation not supported.
I can think of three reasons why that error would occur in your case:
You're running the code on an IBM POWER processor. On these processors PERF_SAMPLE_BRANCH_STACK is supported and some of the branch filters are supported in the hardware, but PERF_SAMPLE_BRANCH_ANY_RETURN is not supported on any of the current POWER processors. You said that the code works fine by removing PERF_SAMPLE_BRANCH_STACK, but that doesn't tell us whether the problem is from PERF_SAMPLE_BRANCH_STACK or PERF_SAMPLE_BRANCH_ANY_RETURN.
You're running the code on a hypervisor (e.g., KVM). Most hypervisors (if not all) don't virtualize branch sampling. Yet the host processor may actually support branch sampling and maybe even the ANY_RETURN filter.
The processor doesn't support the branch sampling feature. This includes Intel processors older than the Pentium 4.
Not all Intel processors support the ANY_RETURN filter in hardware. This filter is supported starting with Core2. However, on Intel processors, for branch filters that are not supported in the hardware, Linux provides software filtering, so PERF_SAMPLE_BRANCH_ANY_RETURN should still work on these processors.
There could be other reasons that I have missed.
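To narrow down whether the failure comes from branch-stack sampling itself or from the ANY_RETURN filter, one possible diagnostic (a sketch reusing pid, attr and perf_event_open() from the question) is to retry the open with the least restrictive filter first:
// Keep branch sampling enabled, but try the generic filter first.
attr.sample_type = PERF_SAMPLE_IP | PERF_SAMPLE_BRANCH_STACK;

attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY;          // any taken branch
int fd_any = perf_event_open(&attr, pid, -1, -1, 0);
if (fd_any < 0)
    perror("branch-stack sampling itself is unsupported");
else
    close(fd_any);

attr.branch_sample_type = PERF_SAMPLE_BRANCH_ANY_RETURN;   // near returns only
int fd_ret = perf_event_open(&attr, pid, -1, -1, 0);
if (fd_ret < 0)
    perror("the ANY_RETURN filter is what is rejected");
else
    close(fd_ret);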
error : Operation not supported
The perf_event_open() manual page says about this error:
EOPNOTSUPP
Returned if an event requiring a specific hardware feature is
requested but there is no hardware support. This includes
requesting low-skid events if not supported, branch tracing if
it is not available, sampling if no PMU interrupt is
available, and branch stacks for software events.
And about PERF_SAMPLE_BRANCH_STACK it says:
PERF_SAMPLE_BRANCH_STACK (since Linux 3.4)
This provides a record of recent branches, as provided
by CPU branch sampling hardware (such as Intel Last
Branch Record). Not all hardware supports this feature.
So it looks like your hardware doesn't support this.

Disabling AVX2 in CPU for testing purposes

I've got an application that requires AVX2 to work correctly. A check was implemented at application start to verify that the CPU supports AVX2 instructions. I would like to test that this check works correctly, but I only have a CPU that has AVX2. Is there a way to temporarily turn it off for testing purposes, or to somehow emulate another CPU?
Yes, use an "emulation" (or dynamic recompilation) layer like Intel's Software Development Emulator (SDE), or maybe QEMU.
SDE is closed-source freeware, and very handy both for testing AVX-512 code on old CPUs and for simulating old CPUs to check that you don't accidentally execute instructions that are too new.
Example: I happened to have a binary that unconditionally uses an AVX2 vpmovzxwq load instruction (for a function I was testing). It runs fine on my Skylake CPU natively, but SDE has a -snb option to emulate a Sandybridge in both CPUID and actually checking every instruction.
$ sde64 -snb -- ./mask
TID 0 SDE-ERROR: Executed instruction not valid for specified chip (SANDYBRIDGE): 0x401005: vpmovzxwq ymm2, qword ptr [rip+0xff2]
Image: /tmp/mask+0x5 (in multi-region image, region# 1)
Instruction bytes are: c4 e2 7d 34 15 f2 0f 00 00
There are options to emulate CPUs as old as -quark, -p4 (SSE2), or Core 2 Merom (-mrm), to as new as IceLake-Server (-icx) or Tremont (-tnt). (And Xeon Phi CPUs like KNL and KNM.)
It runs pretty quickly, using dynamic recompilation (JIT) so code using only instructions that are supported natively can run at basically native speed, I think.
It also has instrumentation options (like -mix to dump the instruction mix), and options to control the JIT more closely. I think you could maybe get it to not report AVX2 in CPUID, but still let AVX2 instructions run without faulting.
Or probably emulate a CPU that supports AVX2 but not FMA (there is a real CPU like this from Via, unfortunately). Or combinations that no real CPU has, like AVX2 but not popcnt, or BMI1/BMI2 but not AVX. But I haven't looked into how to do that.
The basic sde -help options only let you set it to specific Intel CPUs, and for checking for potentially-slow SSE/AVX transitions (without correct vzeroupper usage). And a few other things.
One important test-case that SDE is missing is AVX+FMA without AVX2 (AMD Piledriver / Steamroller, i.e. most AMD FX-series CPUs). It's easy to forget and use an AVX2 shuffle in code that's supposed to be AVX1+FMA3, and some compilers (like MSVC) won't catch this at compile time the way gcc -march=bdver2 would. (Bulldozer only has AVX + FMA4, not FMA3, because Intel changed their plans after it was too late for AMD to redesign.)
If you just want CPUID to not report the presence of AVX2 (and FMA?) so your code uses its AVX1 or non-AVX versions of functions, you can do that with most VMs.
For AVX instructions to run without faulting, a bit in a control register has to be set. (So this works like a promise by the OS that it will correctly save/restore the new architectural state of YMM upper halves). So disabling AVX in CPUID will give you a VM instance where AVX instructions fault. (At least 256-bit instructions? I haven't tried this to see if 128-bit AVX instructions can still execute in this state on HW that supports AVX.)
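As a side note, a minimal startup check of the kind described in the question might look like the sketch below (it uses the GCC/Clang __builtin_cpu_supports builtin; MSVC would need a __cpuidex-based check instead). Running it under sde64 -snb should take the fallback branch, because the emulated Sandy Bridge CPUID does not advertise AVX2:
#include <stdio.h>

int main(void) {
    // Queries the CPUID feature bits at runtime (GCC >= 4.8 or a recent Clang).
    if (__builtin_cpu_supports("avx2"))
        puts("AVX2 reported: using the AVX2 code path");
    else
        puts("AVX2 not reported: falling back to the SSE/AVX1 path");
    return 0;
}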

How to program intel hd graphics gpu clock rate?

I found a tool (Intel Extreme Tuning Utility, on Windows 7 Ultimate x64) that I can use to change the GPU clock on my laptop (CPU: Intel Core i5-4210U with built-in Intel HD Graphics 4400). I marked the relevant slider in red on this screenshot of Intel XTU to avoid any confusion.
I would be happy to build such functionality into my own program. It is enough if my program works at least on my own processor (the model mentioned above).
My problem is that I do not know how to access the GPU clock rate (or the absolute GPU clock, or whatever really exists behind the scenes). Some documentation or any advice would be great.

OpenCL assembly optimization for "testing carry flag after adding"

In my OpenCL kernel, I find this:
error += y;
++y;
error += y;
// The following test may be implemented in assembly language in
// most machines by testing the carry flag after adding 'y' to
// the value of 'error' in the previous step, since 'error'
// nominally has a negative value.
if (error >= 0)
{
    error -= x;
    --x;
    error -= x;
}
Obviously, those operations could easily be optimized using some nifty assembly instructions. How can I optimize this code in OpenCL?
You don't. The OpenCL compiler decides what to do with the code, depending on the target hardware and the optimization settings, which can be set as pragmas or as parameters when building the kernel. If it is smart enough, it'll use the nifty assembly instructions for the platform on which the kernel is to be run. If not, well, it won't.
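For instance, the build parameters mentioned above are passed as an options string to clBuildProgram. A hedged sketch in plain C (the program and device handles are assumed to already exist):
#include <CL/cl.h>

/* Standard OpenCL build options that allow more aggressive optimization. */
static void build_with_options(cl_program program, cl_device_id device)
{
    const char *opts = "-cl-mad-enable -cl-fast-relaxed-math";
    cl_int err = clBuildProgram(program, 1, &device, opts, NULL, NULL);
    if (err != CL_SUCCESS) {
        /* query CL_PROGRAM_BUILD_LOG here to see the device compiler's output */
    }
}
Whether these options actually change the generated code for the carry-flag pattern above is entirely up to the device compiler, which is the point of this answer.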
You have to keep in mind that OpenCL is a general framework applicable to many devices, not just your standard consumer-grade processor, so going "under the hood" is not really possible due to differences in assembly instructions (i.e., OpenCL is meant to be portable; if you start writing x86 opcodes in your kernel, how is it going to run on a graphics card, for instance?).
If you need absolute maximum performance on a specific device, you shouldn't be using OpenCL, IMHO.