This is a section of multi-GPU code developed in C++.
I am testing the CPU version, where OPENACC=0:
#if (OPENACC==1)
#pragma acc routine
#endif
void myClass::method( int i, int j, int dir, int index )
{
#if (OPENACC==1)
    double Sn[ZSIZE];
#else
    double *Sn = new double[ZSIZE];  // or C-style: double *Sn = (double *)malloc(ZSIZE * sizeof(double));
#endif
}
The method above gives the compiler error "PGCC-S-1000-Call in
OpenACC region to procedure '_Znam' which has no acc routine
information", but if I replace the "new" with C-style allocation
(i.e. malloc) it compiles fine. Is this to be expected? I
use PGI version 18.1.
Is it safe to use a large private variable such as Sn?
You need to give a level of parallelism on your acc routine pragma. It looks like the correct pragma in your case would be acc routine seq, since the routine contains no loops to parallelize.
"_Znam" is the mangled name for the new operator which doesn't have a device side equivalent. However, there is a CUDA device malloc routine and why it works.
Though, I would highly recommend that you not perform device-side dynamic allocation. Besides being very slow, the device heap is very small (the default is around 8MB). While you can increase this by setting the PGI environment variable "PGI_ACC_CUDA_HEAPSIZE", the max is still only around 32MB. (Note that I haven't tested the max device heap size in a while, so this may have increased, but I'd still not recommend doing device-side allocation.)
Also, yes, if the level of parallelism is not specified, PGI will default to using "seq".
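Putting that advice together, a minimal sketch of the fix (assuming ZSIZE is a compile-time constant, so the array can be a fixed-size local instead of a device-heap allocation):

#if (OPENACC==1)
#pragma acc routine seq   /* explicit level of parallelism */
#endif
void myClass::method( int i, int j, int dir, int index )
{
    double Sn[ZSIZE];   /* fixed-size local: private per thread, no device-side new/malloc */
    /* ... */
}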
I'm currently trying to do some tests with the buffer overflow vulnerability.
Here is the vulnerable code
#include <stdio.h>

void win()
{
    printf("code flow successfully changed\n");
}

int main(int argc, char **argv)
{
    volatile int (*fp)();
    char buffer[64];

    fp = 0;
    gets(buffer);
    if (fp) {
        printf("calling function pointer, jumping to 0x%08x\n", fp);
        fp();
    }
}
The exploit is quite simple and very basic: all I need here is to overflow the buffer and overwrite the fp value so that it holds the address of the win() function.
While trying to debug the program, I figured out that fp is placed below the buffer (i.e. at a lower address in memory), and thus I am not able to overwrite its value.
I thought that if we declare a local variable x before a variable y, x would be placed higher in memory (i.e. toward the bottom of the stack), so that y, overflowing past its boundary, would overwrite x; that is not what happens here.
I'm compiling the program with gcc version 5.2.1, with no special flags (only tested at -O0).
Any clue?
The order of local variables on the stack is unspecified.
It may change between different compilers, different versions or different optimization options. It may even depend on the names of the variables or other seemingly unrelated things.
The order of local variables on the stack is not fixed until build time. I'm no expert, certainly, but I think you'd have to do some sort of hex dump, or run the code in a debugger, to find out where a variable is allocated. And I'd also guess that this would be a relative address, not an absolute one.
And as #RalfFriedl has indicated in his answer, the location could change under any number of compiler options invoked for different platforms, etc.
But we do know it's possible to do buffer overflows. Although you hear less about them now, due to defensive measures such as address space layout randomization (ASLR), they're still around and paying the bills for someone, I'd guess. There are of course many, many online articles on the subject; here's one that seems fairly current: https://www.synopsys.com/blogs/software-security/detect-prevent-and-mitigate-buffer-overflow-attacks/
Good luck (should you even say that to someone practicing buffer overflow attacks?). At any rate, I hope you learn some things, and use it for good :)
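As a quick sanity check before crafting the overflow, you can print the two addresses yourself; a minimal sketch (the layout it reports is, as the answers above note, specific to your compiler, version, and flags):

#include <stdio.h>

int main(int argc, char **argv)
{
    volatile int (*fp)();
    char buffer[64];

    fp = 0;
    /* If &fp is above buffer, an overflow can reach it;
       if it is below, this particular layout defeats the exploit. */
    printf("buffer at %p, fp at %p\n", (void *)buffer, (void *)&fp);
    return 0;
}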
In an attempt to understand C memory alignment, or whatever the term is (data structure alignment?), I'm trying to write code that results in an alignment error. The original reason that brought me to learning about this is that I'm writing parsing code that reads binary data received over the network. The data contains some uint32s, uint64s, floats, and doubles, and I'd like to make sure they are never corrupted due to errors in my parsing code.
An unsuccessful attempt at causing some problem due to misalignment:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    uint32_t integer = 1027;
    uint8_t *pointer = (uint8_t *)&integer;
    uint8_t *bytes = malloc(5);

    bytes[0] = 23; /* extra byte to misalign the uint32_t data */
    bytes[1] = pointer[0];
    bytes[2] = pointer[1];
    bytes[3] = pointer[2];
    bytes[4] = pointer[3];

    uint32_t integer2 = *(uint32_t *)(bytes + 1);
    printf("integer: %u\ninteger2: %u\n", integer, integer2);
    return 0;
}
On my machine both integers print out the same. (MacBook Pro with a 64-bit Intel processor; I'm not sure what exactly determines alignment behaviour: is it the architecture? The exact CPU model? Or maybe the compiler? I use Xcode, so clang.)
I guess my processor/machine/setup supports unaligned reads, so it accepts the above code without any problems.
What would be a case where parsing of, say, a uint32_t would fail because the code does not take alignment into account? Is there a way to make it fail on a modern 64-bit Intel system? Or am I safe from alignment errors when using simple data types like integers and floats (no structs)?
Edit: if anyone's reading this later, I found a similar question with interesting info: Mis-aligned pointers on x86
Normally, the x86 architecture doesn't have alignment requirements (except for some SIMD instructions such as movdqa).
However, since you're trying to write code to cause such an exception ...
There is an alignment check exception bit that can be set in the x86 flags register. If you turn it on, an unaligned access will generate an exception, which shows up (under Linux, at least) as a bus error (i.e. SIGBUS).
See my answer here: any way to stop unaligned access from c++ standard library on x86_64? for details and some sample programs to generate an exception.
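For completeness, a minimal sketch of that approach (assuming gcc or clang on x86-64 Linux; the AC flag is bit 18 of EFLAGS, and once it is set the misaligned store below should die with SIGBUS):

#include <stdint.h>
#include <stdlib.h>

int main(void)
{
    /* Set the AC (alignment check) flag, bit 18 of EFLAGS. */
    __asm__ volatile ("pushfq\n\t"
                      "orl $0x40000, (%%rsp)\n\t"
                      "popfq" ::: "cc", "memory");

    uint8_t *bytes = malloc(8);
    volatile uint32_t *p = (volatile uint32_t *)(bytes + 1); /* misaligned on purpose */
    *p = 1027; /* expected to raise SIGBUS now that AC is set */
    return 0;
}

Be warned that with AC set, unaligned accesses inside library code (memcpy, printf, and friends) can fault too, which is part of what the linked answer discusses.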
I have noticed that PTX code allows for some instructions with complex semantics, such as bit field extract (bfe), find most-significant non-sign bit (bfind), and population count (popc).
Is it more efficient to use them explicitly rather than write code with their intended semantics in C/C++?
For example: "population count", or popc, means counting the one bits. So should I write:
__device__ int popc(int a) {
    /* Count set bits one at a time; shifting an unsigned copy keeps
       the loop from running forever on negative inputs. */
    unsigned int u = (unsigned int)a;
    int d = 0;
    while (u != 0) {
        if (u & 0x1) d++;
        u = u >> 1;
    }
    return d;
}
for that functionality, or should I, rather, use:
__device__ int popc(int a) {
    int d;
    asm("popc.b32 %0, %1;" : "=r"(d) : "r"(a));
    return d;
}
? Will the inline PTX be more efficient? Should we write inline PTX to get peak performance?
Also: does the GPU have some extra magic instructions corresponding to these PTX instructions?
The compiler may identify what you're doing and use a fancy instruction to do it, or it may not. The only way to know in the general case is to look at the PTX assembly produced by the compilation, by adding the -ptx flag to nvcc. If the compiler generates the instruction for you, there is no need to hand-code the inline assembly yourself (or to use an intrinsic).
Also, whether or not it makes a performance difference in the general case depends on whether or not the code path is used in a significant way, and on other factors such as the current performance limiters of your kernel (e.g. compute-bound or memory-bound).
A few more points in addition to #RobertCrovella's answer:
Even if you do use PTX for something, that should happen rarely. Limit it to small functions of no more than a few PTX lines, which you can then re-use for multiple purposes as you see fit, with most of your coding being in C/C++.
An example of this principle is the set of intrinsics #njuffa mentioned (the copy of the file linked there is not an official one, I think). Please read it through to see which intrinsics are available to you. That doesn't mean you should use them all, of course.
For your specific example: you do want the PTX over the first version; it certainly won't do any harm. But, again, it is also an example of how you do not need to actually write PTX, since popc has a corresponding __popc intrinsic (again, as #njuffa has noted).
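For reference, a sketch of that intrinsic route (the wrapper name is mine; __popc is the documented CUDA device intrinsic, so no inline PTX is involved):

__device__ int popc_via_intrinsic(int a) {
    return __popc(a);  /* the compiler maps this to a single popc.b32 instruction */
}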
You might also want to have a look at the source code of some CUDA-based libraries to see what kind of PTX snippets they've chosen to use.
I have a simple C program that prints n Fibonacci numbers, and I would like to compile it to an ELF object file. Instead of setting the number of Fibonacci numbers (n) directly in my C code, I would like to set it in a register, since I am simulating the program for an ARM processor. How can I do that?
Here is the code snippet:
#include <stdio.h>
#include <stdlib.h>

#define ITERATIONS 3

static float fib(float i) {
    return (i > 1) ? fib(i - 1) + fib(i - 2) : i;
}

int main(int argc, char **argv) {
    float i;
    printf("starting...\n");
    for (i = 0; i < ITERATIONS; i++) {
        printf("fib(%f) = %f\n", i, fib(i));
    }
    printf("finishing...\n");
    return 0;
}
I would like to set the ITERATIONS counter in a register rather than in the code.
Thanks in advance
The register keyword can be used to suggest that the compiler use registers for the iterator and the number of iterations:
register float i;
register int numIterations = ITERATIONS;
but that will not help much. First of all, the compiler may or may not follow your suggestion. Next, values will still need to be placed on the stack for the call to fib(), and, finally, depending on what functions you call within your loop, code in the procedures you are calling could save your register contents in the stack frame at procedure entry and restore them as part of the procedure return.
If you really need to make every instruction count, then you will need to write machine code (using an assembly language). That way, you have direct control over your register usage. Assembly language programming is not for the faint of heart. Assembly language development is several times slower than using higher level languages, your risk of inserting bugs is greater, and they are much more difficult to track down. High level languages were developed for a reason, and the C language was developed to help write Unix. The minicomputers that ran the first Unix systems were extremely slow, but the reason C was used instead of assembly was that even then, it was more important to have code that took less time to code, had fewer bugs, and was easier to debug than assembler.
If you want to try this, here are the answers to a previous question on stackoverflow about resources for ARM programming that might be helpful.
One tactic you might take is to isolate your performance-critical code into a procedure, write the procedure in C, then capture the generated assembly language representation. Then rewrite the assembly to be more efficient. Test thoroughly, and get at least one other set of eyeballs to look the resulting code over.
Good Luck!
Make ITERATIONS a variable rather than a literal constant, then you can set its value directly in your debugger/simulator's watch or locals window just before the loop executes.
Alternatively, as it appears you have stdio support, why not just accept the value via console input?
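A hypothetical sketch combining both suggestions, with ITERATIONS replaced by a plain variable (the name iterations is mine) that you can patch in the simulator's locals window or supply via console input:

#include <stdio.h>

static float fib(float i) {
    return (i > 1) ? fib(i - 1) + fib(i - 2) : i;
}

int main(void) {
    int iterations = 3;                /* default; patch this in the debugger/simulator... */
    float i;
    if (scanf("%d", &iterations) != 1) /* ...or read it from console input */
        iterations = 3;
    printf("starting...\n");
    for (i = 0; i < iterations; i++) {
        printf("fib(%f) = %f\n", i, fib(i));
    }
    printf("finishing...\n");
    return 0;
}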
I have an idea about what it is. My questions are:
1.) If I write code that is amenable to tail call optimization (the last statement in a recursive function being only a function call, with no other operation there), do I need to set any optimization level so that the compiler performs TCO? In which optimization mode will a compiler perform TCO: optimizing for space or for time?
2.) How do I find out which compilers (MSVC, gcc, ARM RVCT) support TCO?
3.) Assuming some compiler does TCO and we enable it, what is the way to find out that the compiler has actually done it? Will code size tell, or the cycles taken to execute, or both?
-AD
Most compilers support TCO; it is a relatively old technique. As far as how to enable it with a specific compiler, check the documentation. gcc enables the optimization at -O2 and above (including -Os); the specific option is -foptimize-sibling-calls. As far as how to tell whether the compiler is doing TCO, look at the assembler output (gcc -S, for example) or disassemble the object code.
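For example, here is a genuinely tail-recursive function to test with (my own toy example): compile it with gcc -O2 -S and look for a jump, or a plain loop, where the recursive call used to be.

/* sum_acc.c: the recursive call is in tail position */
unsigned long sum_acc(unsigned long n, unsigned long acc)
{
    if (n == 0)
        return acc;
    return sum_acc(n - 1, acc + n); /* nothing happens after this call */
}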
Optimization is compiler-specific. Consult the documentation for the various optimization flags.
You will find that in the compiler's documentation too. If you are curious, you can write a tail-recursive function, pass it a big argument, and look out for a stack overflow. (Though checking the generated assembler might be a better choice, if you understand the generated code.)
You can just use the debugger and look at the addresses of function arguments/local variables. If they increase/decrease on each logical frame that the debugger shows (or if it actually only shows one frame, even though you did several calls), you know whether TCO was done.
If you want your compiler to do tail call optimization, just check either
a) the compiler's documentation for the optimization level at which it is performed, or
b) the generated assembly, to see whether the function calls itself (you don't even need deep asm knowledge to spot the function's own symbol appearing again).
If you really, really want tail recursion, my question would be:
why don't you perform the tail call removal yourself? It means nothing more than removing the recursion, and if it is removable, then it is doable not only by the compiler at a low level but also by you at the algorithmic level: you can program it directly into your code (it simply means using a loop instead of a call to yourself).
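As a sketch of that manual removal (a hypothetical factorial, not code from the question), the call's new arguments simply become assignments in a loop:

/* tail-recursive form */
long fact(long n, long acc)
{
    return (n <= 1) ? acc : fact(n - 1, acc * n);
}

/* the same computation after manual tail call removal */
long fact_iter(long n)
{
    long acc = 1;
    while (n > 1) {
        acc *= n;
        n -= 1;
    }
    return acc;
}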
One way to determine whether a tail call is happening is to see if you can force a stack overflow. The following program does not produce a stack overflow using VC++ 2005 Express Edition, and, even though its results exceed the capacity of long double rather quickly, you can tell that all of the iterations are being processed when TCO is happening:
/* FibTail.c 0.00 UTF-8 dh:2008-11-23
 * --|----1----|----2----|----3----|----4----|----5----|----6----|----*
 *
 * Demonstrate Fibonacci computation by tail call to see whether it
 * is eliminated through compiler optimization.
 */

#include <stdio.h>

long double fibcycle(long double f0, long double f1, unsigned i)
{   /* accumulate successive fib(n-i) values by tail calls */
    if (i == 0) return f1;
    return fibcycle(f1, f0 + f1, --i);
}

long double fib(unsigned n)
{   /* the basic fib(n) setup and return */
    return fibcycle(1.0, 0.0, n);
}

int main(int argc, char *argv[])
{   /* compute some fibs until something breaks */
    int i;

    printf("\n i fib(i)\n\n");
    for (i = 1; i > 0; i += i)
    {   /* do for powers of 2 until i flips negative
         * or the stack overflows, whichever comes first */
        printf("%12d %30.20LG \n", i, fib((unsigned) i));
    }

    printf("\n\n");
    return 0;
}
Notice, however, that the simplifications made to achieve a pure tail call in fibcycle are tantamount to figuring out an iterative version that doesn't do a tail call at all (and that will work with or without TCO in the compiler).
It might be interesting to experiment to see how well TCO can find optimizations that are not already near-optimal and easily replaced by iteration.