I used ROCm to compile a HIP program.
When I ran a simple but thread-intensive HIP program and debugged it with ROCgdb, the "info threads" command did not show the expected number of wavefronts.
I have compiled the ROCm toolchain with the Debug build type, and it compiles HIP programs correctly.
So, can someone tell me how to make the wavefronts that are currently not shown appear when debugging with rocgdb?
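For reference, a minimal version of what I'm doing (my_hip_app and my_kernel are placeholder names for my binary and kernel):
rocgdb ./my_hip_app
(gdb) break my_kernel
(gdb) run
(gdb) info threads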
I have tried to compile TensorFlow 2.0 to get the benefit of extra CPU instructions like AVX, but to no avail. I have read through How to compile Tensorflow with SSE4.2 and AVX instructions?, but I am still confused: unless you are building for another PC, surely -march=native should just work. I have tried building twice with different commands and am still getting the warning message.
I think I used the below, and I think I have the logs still saved if someone wants to help.
"bazel build //tensorflow/tools/pip_package:build_pip_package
d_pip_package --config=mkl"
"bazel build -c opt --copt=-march=native --config=mkl //tensorflow/tools/pip_package:build_pip_package
This is only for the satisfaction of understanding what is going on. I currently don't need the benefit the optimisation will bring, but I do not understand why the method I used isn't working as I followed it exactly.
As noted by my edit in the top answer on the question you linked, it seems bazel and/or TensorFlow's build scripts are buggy. They mishandle -march=native and fail to pass it on to the compiler. I'm guessing it does something wrong with args that have an = in their name, because args like -mfma work.
You are correct: if they were correctly passing -march=native to the compiler, there would be no problem and no need for any of this complication.
I don't know why nobody has fixed this huge inconvenience yet; instead, lots of users who aren't experts on x86 CPU features are left to stumble around trying to figure out which features their CPU has and how to enable them for gcc/clang. This is exactly what -march=native is for, along with the other important feature of setting tuning options appropriately for the machine you're compiling on.
I had a look once, but I don't actually use TensorFlow and don't know bazel, so I got bogged down in the maze of build machinery between that command line and the actual invocation of g++ ... foo.cpp.
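If it helps, a sketch of the commonly suggested workaround is to spell out the individual feature flags instead of -march=native (adjust the list to whatever grep flags /proc/cpuinfo reports for your CPU):
bazel build -c opt --copt=-mavx --copt=-mavx2 --copt=-mfma --copt=-msse4.2 --config=mkl //tensorflow/tools/pip_package:build_pip_package
This loses the tuning side of -march=native, but at least the instruction-set extensions get through to the compiler.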
Observations
When a Linux executable is compiled as a PIE (Position Independent Executable, the default on Ubuntu 18.04), the symbols from shared libraries (e.g. libc) are resolved when the program starts executing; setting the LD_BIND_NOW environment variable to null does not defer this process.
However, if the executable is compiled with the -no-pie flag, the symbol resolution can be controlled by LD_BIND_NOW.
Question
Is it possible to control when the symbols from shared libraries are resolved in an ELF PIE executable?
Below are the code and system info used in my test:
ubuntu: 18.04
kernel: Linux 4.15.0-50-generic #54-Ubuntu SMP Mon May 6 18:46:08 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux
gcc: gcc (Ubuntu 7.4.0-1ubuntu1~18.04) 7.4.0
#include <stdio.h>
int main() {
printf("Hello world!\n");
}
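For reference, the two binaries were built and run roughly like this (hello.c is a placeholder name for the file above; PIE is the default on Ubuntu 18.04):
gcc -o hello_pie hello.c
gcc -no-pie -o hello_nopie hello.c
LD_BIND_NOW= ./hello_pie
LD_BIND_NOW= ./hello_nopie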
Details of the experiments (which lead to the above observations).
The experiments were carried out in gdb-peda. Searching for gdb-peda in the output will reveal the commands used for each step.
The address stored at the GOT entry for puts is displayed (via disp) whenever execution proceeds in gdb, so the point at which it is patched with the real address can easily be spotted.
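Roughly, the GOT slot was located and watched like this (hello_nopie is the placeholder binary name from above; the address comes from the objdump output and will differ on your system):
objdump -R hello_nopie | grep puts
gdb ./hello_nopie
gdb-peda$ break main
gdb-peda$ run
gdb-peda$ display/x *(void **)0x601018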
Outputs for -no-pie binary test.
Outputs for pie binary test.
BTW, the same question was originally posted on Reverse Engineering Stack Exchange, and the answer there confirmed the above observations.
When a Linux executable is compiled as a PIE (Position Independent Executable, the default on Ubuntu 18.04), the symbols from shared libraries (e.g. libc) are resolved when the program starts executing; setting the LD_BIND_NOW environment variable to null does not defer this process.
However, if the executable is compiled with the -no-pie flag, the symbol resolution can be controlled by LD_BIND_NOW.
You are mistaken: the symbol resolution happens only when the program starts executing in either case, regardless of whether LD_BIND_NOW is defined or not.
What LD_BIND_NOW controls is whether all functions are resolved at once (during program startup), or whether such symbols are resolved lazily (the first time a program calls a given unresolved function). More on lazy resolution here.
As far as I understand, nothing in the above picture changes between PIE and non-PIE binaries. I'd be interested to know how you came to your conclusion that there is a difference.
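If you want to see the difference directly, glibc's LD_DEBUG facility prints each binding as it happens (hello_pie is a placeholder binary name):
LD_DEBUG=bindings ./hello_pie 2>&1 | grep -c 'binding file'
LD_BIND_NOW=1 LD_DEBUG=bindings ./hello_pie 2>&1 | grep -c 'binding file'
With lazy binding only the symbols that are actually called get bound, so the first count is much smaller than the second, where everything is resolved up front during startup.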
I've included an OpenCL kernel (.cl file) in my OS X framework, and I'm able to reference it from one of my implementation (.m) files.
However, when I compile, I get the following error, related with the kernel:
openclc: error: cannot specify -o when generating multiple output files
This error appears once for every architecture listed in the OPENCL_ARCHS build setting. I've tried leaving only a single architecture (gpu_64 or gpu_32, tried both); however, the error persists.
I went through two examples offered by Apple (Hello World and n-Body simulation, both of which compile and run fine on my system), looking for any special build options, but I failed to find any.
Any thoughts?
Thanks.
EDIT: Added Xcode7 tag, as I am working in Xcode 7 Beta.
Context: I have several loops in an Objective-C library I am writing which deal with processing large text arrays. I can see that right now it is running in a single threaded manner.
I understand that LLVM is now capable of auto-vectorising loops, as described in Apple's session at WWDC. It is, however, very cautious in the way it does this; one reason is the possibility of variables being modified due to CPU pipelining.
My question: how can I see where LLVM has vectorised my code, and, more usefully, how can I receive debug messages that explain why it can't vectorise my code? I'm sure if it can see why it can't auto-vectorise it, it could point that out to me and I could make the necessary manual adjustments to make it vectorisable.
I would be remiss if I didn't point out that this question has been more or less asked already, but quite obtusely, here.
Identifies loops that were successfully vectorized:
clang -Rpass=loop-vectorize
Identifies loops that failed vectorization and indicates if vectorization was specified:
clang -Rpass-missed=loop-vectorize
Identifies the statements that caused vectorization to fail:
clang -Rpass-analysis=loop-vectorize
Source: http://llvm.org/docs/Vectorizers.html#diagnostics
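For example, all three can be combined in one invocation (foo.c is a placeholder file; the vectorizer only runs with optimization enabled, e.g. -O2 or -O3):
clang -O3 -Rpass=loop-vectorize -Rpass-missed=loop-vectorize -Rpass-analysis=loop-vectorize -c foo.c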
The standard llvm toolchain provided by Xcode doesn't seem to support getting debug info from the optimizer. However, if you roll your own llvm and use that, you should be able to pass flags as mishr suggested above. Here's the workflow I used:
1. Using homebrew, install llvm
brew tap homebrew/versions
brew install llvm33 --with-clang --with-asan
This should install the full and relatively current llvm toolchain. It's linked into /usr/local/bin/*-3.3 (i.e. clang++-3.3). The actual on-disk location is available via brew info llvm33 - probably /usr/local/Cellar/llvm33/3.3/bin.
2. Build the single file you're optimizing, with homebrew llvm and flags
If you've built in Xcode, you can easily copy-paste the build parameters, and use your clang++-3.3 instead of Xcode’s own clang.
Appending -mllvm -debug-only=loop-vectorize will get you the auto-vectorization report. Note: this will likely NOT work with any remotely complex build, e.g. if you've got PCH's, but is a simple way to tweak a single cpp file to make sure it's vectorizing correctly.
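For example, something along these lines (MyFile.cpp and -O3 stand in for whatever your Xcode build log shows):
clang++-3.3 -O3 -mllvm -debug-only=loop-vectorize -c MyFile.cpp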
3. Create a compiler plugin from the new llvm
I was able to build my entire project with homebrew llvm by:
Grabbing this Xcode compiler plugin: http://trac.seqan.de/browser/trunk/util/xcode/Clang%20LLVM%20MacPorts.xcplugin.zip?order=name
Modifying the clang-related paths to point to my homebrew llvm and clang bin names (by appending '-3.3')
Placing it in /Library/Application Support/Developer/5.0/Xcode/Plug-ins/
Relaunching Xcode should show this plugin in the list of available compilers. At this point, the -mllvm -debug-only=loop-vectorize flag will show the auto-vectorization report.
I have no idea why this isn't exposed in the Apple builds.
UPDATE: This is exposed in current (8.x) versions of Xcode. The only thing required is to enable one or more of the loop-vectorize flags.
Assuming you are using opt and you have a debug build of llvm, you can do it as follows:
opt -O1 -loop-vectorize -debug-only=loop-vectorize code.ll
where code.ll is the IR you want to vectorize.
If you are using clang, you will need to pass the -debug-only=loop-vectorize flag using -mllvm option.
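For example (code.c is a placeholder; as with opt, this only prints anything if clang was built with assertions/debug enabled):
clang -O3 -mllvm -debug-only=loop-vectorize -c code.c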
I followed the Linux build instructions and when I try running "ninja -C out/Debug chrome", I just get the output "Illegal Instruction (core dumped)". Now, I wish I could actually find where the core dump is located to see if there is more specific information in there...
For reference, I am trying to run Ninja on Ubuntu 13.10.
Has anyone else experienced this while building Chromium or while trying to build anything else using Ninja? Also, where could I find the core dump?
The error message "Illegal Instruction (core dumped)" indicates that the binary is using an instruction that is not supported by your CPU.
Please check whether the software used for compilation (compiler, linker, ar, ninja-build, etc.) matches your CPU architecture. Unless you have an unusual system like ARM or POWER, you have probably mixed up 32-bit (e.g. i586) and 64-bit (x86-64).
Or you are compiling for the wrong target. Do your compiler flags include flags beginning with -m, such as "-march="? That could lead to the same error, but only when the compiled code is executed.
Have you built gyp or ninja-build yourself? That would be another place to make such a mistake.
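To find out where the core dump went and which instruction-set extensions your CPU supports, a few generic checks (the binary path is a placeholder):
ulimit -c unlimited
cat /proc/sys/kernel/core_pattern
grep -m1 flags /proc/cpuinfo
gdb /path/to/crashing/binary core
On stock Ubuntu, core_pattern usually points at apport, which stores crash reports under /var/crash instead of writing a core file into the working directory.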