I encountered a problem when using OpenMP to parallelize my code. I have attached the simplest code that can reproduce my problem.
#include <iostream>
#include <vector>

using namespace std;

int main()
{
    int n = 10;
    int size = 1;
    vector<double> vec(1, double(1.0));
    double sum = 0.0;

    #pragma omp parallel for private(vec) reduction(+: sum)
    for (int i = 0; i != n; ++i)
    {
        /* in real case, complex operations applied on vec here */
        sum += vec[0];
    }

    cout << "sum: " << sum << endl;
    return 0;
}
I compile with g++ using the -fopenmp flag, and when I run the program it crashes with "Segmentation fault (core dumped)". I am wondering what's wrong with the code.
Note that vec should be made private, since in the real code a complex operation is applied to vec inside the for-loop.
The problem indeed comes from the private(vec) clause. There are two issues with this code.
First, from a semantics perspective, the private(vec) should be shared(vec), as the intent seems to be to work on the same std::vector instance in parallel. So, the code should look like this:
#pragma omp parallel for shared(vec) reduction(+: sum)
for (int i = 0; i != n; ++i)
{
    sum += vec[0];
}
In the previous code, private(vec) made a private instance of std::vector for each thread and initialized each of these instances by calling the default constructor of std::vector, so every private vec starts out empty.
Second, the segmentation fault then arises from the fact that there is no vec[0] element in any of these empty private instances. This can be confirmed by calling vec.size() from within the threads.
PS: shared(vec) would have been the default sharing attribute for vec as per the OpenMP specification anyway.
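Since the question notes that each thread really does need its own modifiable copy of vec, a minimal sketch using firstprivate instead may be closer to the intent: firstprivate copy-constructs each private instance from the original, so vec[0] exists in every thread's copy.

#include <iostream>
#include <vector>

using namespace std;

int main()
{
    int n = 10;
    vector<double> vec(1, 1.0);
    double sum = 0.0;

    // firstprivate: each thread gets a copy constructed from the
    // original vec, so vec[0] is present in every private instance
    // and each thread may modify its copy freely
    #pragma omp parallel for firstprivate(vec) reduction(+: sum)
    for (int i = 0; i != n; ++i)
    {
        /* complex per-thread operations on the private vec would go here */
        sum += vec[0];
    }

    cout << "sum: " << sum << endl;  // prints "sum: 10"
    return 0;
}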
I’m trying to compute on the GPU, using OpenACC, the element-wise sum of two std::vector-of-std::vector containers. As my compiler I’m using GCC+NVPTX with OpenACC support, but when I compile the code with these flags:
g++ -fopenacc -foffload=nvptx-none -fopt-info-optimized-omp -g -std=c++17
I get: “array_1 does not have pointer or array type” and “array_2 does not have pointer or array type”. Is there any way to use std::vector with OpenACC?
This is a minimal reproducible example:
#include <iostream>
#include <vector>

int main(int argc, char **argv) {
    std::vector<std::vector<float>> array1, array2;
    float result[1000] = {0.0};

    for (int i = 0; i < 1000; i++) {
        std::vector<float> accumulator1, accumulator2;
        for (int j = 0; j < 1000; j++) {
            accumulator1.push_back(99.99);
            accumulator2.push_back(66.66);
        }
        array1.push_back(accumulator1);
        array2.push_back(accumulator2);
    }

    #pragma acc data copyin(array1[:1000][:1000],array2[:1000][:1000])
    #pragma acc data copy(result[:1000])
    #pragma acc parallel loop
    for (int i = 0; i < 1000; i++) {
        for (int j = 0; j < 1000; j++) {
            result[i] += array1[i][j] + array2[i][j];
        }
    }

    for (int i = 0; i < 10; i++) {
        std::cout << result[i] << std::endl;
    }
    return 0;
}
Compiling with GCC+NVPTX is mandatory for me, but trying to compile it with nvc++ as well returns:
main:
18, Generating copyin(array1,array2) [if not already present]
Generating copy(result[:]) [if not already present]
Generating NVIDIA GPU code
23, #pragma acc loop gang, vector(128) /* blockIdx.x threadIdx.x */
24, #pragma acc loop seq
24, Complex loop carried dependence of prevents parallelization
Loop carried dependence of result prevents parallelization
Loop carried backward dependence of result prevents vectorization
std::vector<std::vector<float, std::allocator<float>>, std::allocator<std::vector<float, std::allocator<float>>>>::operator [](unsigned long):
3, include "vector"
64, include "stl_vector.h"
771, Generating implicit acc routine seq
Generating acc routine seq
Generating NVIDIA GPU code
std::vector<float, std::allocator<float>>::operator [](unsigned long):
3, include "vector"
64, include "stl_vector.h"
771, Generating implicit acc routine seq
Generating acc routine seq
Generating NVIDIA GPU code
And launching the application:
Failing in Thread:0
call to cuInit returned error 999: Unknown
Any advice? Thanks
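The diagnostics point at the core issue: OpenACC data clauses need pointer or array types, and neither std::vector itself nor the row-by-row storage of a vector of vectors qualifies (the rows are not even contiguous with each other). A common workaround is to flatten the data into a single contiguous buffer and expose it through .data(); a minimal sketch, assuming the 1000x1000 sizes above:

#include <iostream>
#include <vector>

int main() {
    const int N = 1000;
    // one contiguous buffer per matrix, indexed as [i * N + j]
    std::vector<float> array1(N * N, 99.99f), array2(N * N, 66.66f);
    float result[N] = {0.0f};

    // .data() yields a plain float*, which OpenACC can map to the device
    float *a1 = array1.data();
    float *a2 = array2.data();

    #pragma acc data copyin(a1[0:N*N], a2[0:N*N]) copy(result[0:N])
    #pragma acc parallel loop
    for (int i = 0; i < N; i++) {
        float acc = 0.0f;  // local accumulator avoids the loop-carried dependence warning
        for (int j = 0; j < N; j++)
            acc += a1[i * N + j] + a2[i * N + j];
        result[i] = acc;
    }

    for (int i = 0; i < 10; i++)
        std::cout << result[i] << std::endl;
    return 0;
}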
For two days I have been trying to get printf/sprintf working in my project...
MCU: STM32F722RETx
I tried to use newlib, heap_3, heap_4, etc., etc.; nothing works. HardFault_Handler is run every time.
Now I am trying to use the simple implementation from this link, and still have the same problem. I suspect my device has some problem with double numbers, because the program runs HardFault_Handler from the line if (value != value) in the _ftoa function (which is strange, because this STM32 supports the FPU).
Do you guys have any idea? (Now I am using heap_4.c)
My compiler options:
target_compile_options(${PROJ_NAME} PUBLIC
    $<$<COMPILE_LANGUAGE:CXX>:
        -std=c++14
    >
    -mcpu=cortex-m7
    -mthumb
    -mfpu=fpv5-d16
    -mfloat-abi=hard
    -Wall
    -ffunction-sections
    -fdata-sections
    -O1 -g
    -DLV_CONF_INCLUDE_SIMPLE
)
Linker options:
target_link_options(${PROJ_NAME} PUBLIC
    ${LINKER_OPTION} ${LINKER_SCRIPT}
    -mcpu=cortex-m7
    -mthumb
    -mfloat-abi=hard
    -mfpu=fpv5-sp-d16
    -specs=nosys.specs
    -specs=nano.specs
    # -Wl,--wrap,malloc
    # -Wl,--wrap,_malloc_r
    -u_printf_float
    -u_sprintf_float
)
Linker script:
/* Highest address of the user mode stack */
_estack = 0x20040000; /* end of RAM */
/* Generate a link error if heap and stack don't fit into RAM */
_Min_Heap_Size = 0x200; /* required amount of heap */
_Min_Stack_Size = 0x400; /* required amount of stack */
/* Specify the memory areas */
MEMORY
{
    RAM (xrw)  : ORIGIN = 0x20000000, LENGTH = 256K
    FLASH (rx) : ORIGIN = 0x08000000, LENGTH = 512K
}
UPDATE:
I don't think it is a stack problem: I have set configCHECK_FOR_STACK_OVERFLOW to 2, but the hook function is never called. I did find a strange thing. This solution works:
float d = 23.5f;
char buffer[20];
sprintf(buffer, "temp %f", 23.5f);
but this one does not:
float d = 23.5f;
char buffer[20];
sprintf(buffer, "temp %f",d);
No idea why passing the variable by value generates a HardFault_Handler...
You can implement a hard fault handler that will at least provide you with the SP location where the issue is occurring. This should provide more insight.
https://www.freertos.org/Debugging-Hard-Faults-On-Cortex-M-Microcontrollers.html
It should let you know whether your issue is due to a floating-point error within the MCU, or to a branching error possibly caused by some linking problem.
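For reference, a minimal sketch of the handler pattern that page describes: the assembly stub works out which stack was active at the fault and hands the stacked register frame to a C function, where the faulting PC can be inspected under a debugger.

#include <stdint.h>

void prvGetRegistersFromStack(uint32_t *pulFaultStackAddress)
{
    /* These are the registers the Cortex-M core stacked on exception entry. */
    volatile uint32_t r0  = pulFaultStackAddress[0];
    volatile uint32_t r1  = pulFaultStackAddress[1];
    volatile uint32_t r2  = pulFaultStackAddress[2];
    volatile uint32_t r3  = pulFaultStackAddress[3];
    volatile uint32_t r12 = pulFaultStackAddress[4];
    volatile uint32_t lr  = pulFaultStackAddress[5]; /* return address */
    volatile uint32_t pc  = pulFaultStackAddress[6]; /* faulting instruction */
    volatile uint32_t psr = pulFaultStackAddress[7];

    (void) r0; (void) r1; (void) r2; (void) r3;
    (void) r12; (void) lr; (void) pc; (void) psr;

    for (;;) { } /* set a breakpoint here and inspect pc */
}

void HardFault_Handler(void)
{
    __asm volatile
    (
        " tst lr, #4                                              \n" /* MSP or PSP in use? */
        " ite eq                                                  \n"
        " mrseq r0, msp                                           \n"
        " mrsne r0, psp                                           \n"
        " ldr r1, [r0, #24]                                       \n"
        " ldr r2, handler2_address_const                          \n"
        " bx r2                                                   \n"
        " handler2_address_const: .word prvGetRegistersFromStack  \n"
    );
}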
I also had an error with printf when using FreeRTOS on my SiFive HiFive Rev B.
To solve it, I rewrote the _fstat and _write functions to change the output target of printf:
/*
 * Retarget functions for printf()
 */
#include <errno.h>
#include <sys/stat.h>

int _fstat (int file, struct stat * st) {
    errno = ENOSYS;
    return -1;
}

int _write (int file, char * ptr, int len) {
    extern void uart_putc(int c);
    int i;
    /* Output each character to the UART port */
    for (i = 0; i < len; i++) uart_putc((int)*ptr++);
    return len;  /* newlib expects the number of bytes written */
}
And I created a uart_putc function for UART0 of the SiFive HiFive Rev B hardware:
#include <stdint.h>

void uart_putc(int c)
{
#define uart0_txdata (*(volatile uint32_t*)(0x10013000)) /* uart0 txdata register */
#define UART_TXFULL  (1u << 31)                          /* uart0 txdata "full" flag */
    while ((uart0_txdata & UART_TXFULL) != 0) { }
    uart0_txdata = c;
}
The newlib C runtime library (used in many embedded toolchains) internally uses its own malloc-family routines. newlib maintains some internal buffers and requires some support for thread-safety:
http://www.nadler.com/embedded/newlibAndFreeRTOS.html
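As a sketch of the kind of support that article describes (assuming FreeRTOS, as in the question): newlib calls these two hooks around its internal heap operations, and a minimal implementation just suspends task switching while the heap is being modified.

#include <reent.h>
#include "FreeRTOS.h"
#include "task.h"

/* newlib invokes these around its internal malloc-family calls */
void __malloc_lock(struct _reent *reent)
{
    (void) reent;
    vTaskSuspendAll();  /* no task switch while newlib's heap is busy */
}

void __malloc_unlock(struct _reent *reent)
{
    (void) reent;
    xTaskResumeAll();
}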
A hard fault can also be caused by an unaligned memory access:
https://www.keil.com/support/docs/3777.htm
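A tiny hypothetical illustration of that failure mode (not from the question's code): on Cortex-M, floating-point and multi-word loads require aligned addresses, so dereferencing a deliberately misaligned pointer can fault.

#include <stdint.h>

uint8_t buf[16];

double read_misaligned(void)
{
    /* buf + 1 is not 8-byte aligned; an FP load (VLDR) through this
       pointer can raise a fault on Cortex-M, and the cast is undefined
       behavior in C anyway */
    return *(double *)(buf + 1);
}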
I have this piece of C/C++ code:
void * myThreadFun(void *vargp)
{
    int start = atoi((char*)vargp) % nFracK;
    printf("Thread start = %d, dQ = %d\n", start, dQ);
    pthread_mutex_lock(&nItermutex);
    nIter++;
    pthread_mutex_unlock(&nItermutex);
}

void Opt() {
    pthread_t thread[200];
    char start[100];
    for(int i = 0; i < 10; i++) {
        sprintf(start, "%d", i);
        int ret = pthread_create (&thread[i], NULL, myThreadFun, (void*) start);
        printf("ret = %d on thread %d\n", ret, i);
    }
    for(int i = 0; i < 10; i++)
        pthread_join(thread[i], NULL);
}
It should create 10 threads, but I don't understand why it instead creates n < 10 threads.
The ret value is always 0 (for all 10 calls).
Your program contains at least one data race, therefore its behavior is undefined.
The provided source is also incomplete, so it's impossible to be sure that I can test the same thing you are testing. Nevertheless, I performed the minimum augmentation needed for g++ to compile it without warnings, and tested that:
#include <cstdlib>
#include <cstdio>
#include <pthread.h>

pthread_mutex_t nItermutex = PTHREAD_MUTEX_INITIALIZER;
const int nFracK = 100;
const int dQ = 4;
int nIter = 0;

void * myThreadFun(void *vargp)
{
    int start = atoi((char*)vargp) % nFracK;
    printf("Thread start = %d, dQ = %d\n", start, dQ);
    pthread_mutex_lock(&nItermutex);
    nIter++;
    pthread_mutex_unlock(&nItermutex);
    return NULL;
}

void Opt() {
    pthread_t thread[200];
    char start[100];
    for(int i = 0; i < 10; i++) {
        sprintf(start, "%d", i);
        int ret = pthread_create (&thread[i], NULL, myThreadFun, (void*) start);
        printf("ret = %d on thread %d\n", ret, i);
    }
    for(int i = 0; i < 10; i++)
        pthread_join(thread[i], NULL);
}

int main(void) {
    Opt();
    return 0;
}
The fact that its behavior is undefined notwithstanding, when I run this program on my Linux machine, it invariably prints exactly ten "Thread start" lines, albeit not all with distinct numbers. The most plausible conclusion is that the program indeed does start ten (additional) threads, which is consistent with the fact that the output also seems to indicate that each call to pthread_create() indicates success by returning 0. I therefore reject your assertion that fewer than ten threads are actually started.
Presumably, the follow-up question would be why the program does not print the expected output, and here we return to the data race and accompanying undefined behavior. The main thread writes a text representation of iteration variable i into the local array start of function Opt, and passes a pointer to that same array to each call to pthread_create(). When it then cycles back to do it again, there is a race between the newly created thread trying to read the data back and the main thread overwriting the array's contents with new data. I suppose that your idea was to avoid passing &i, but this is neither better nor fundamentally different.
You have several options for avoiding a data race in such a situation, prominent among them being:
initialize each thread indirectly from a different object, for example:
int start[10];
for(int i = 0; i < 10; i++) {
    start[i] = i;
    int ret = pthread_create(&thread[i], NULL, myThreadFun, &start[i]);
}
Note that each thread is passed a pointer to a different array element, which the main thread does not subsequently modify.
initialize each thread directly from the value passed to it. This is not always a viable alternative, but it is possible in this case:
for(int i = 0; i < 10; i++) {
    // requires <cstdint> for std::intptr_t; no array is needed here,
    // since the value itself travels inside the pointer
    int ret = pthread_create(&thread[i], NULL, myThreadFun,
                             reinterpret_cast<void *>(static_cast<std::intptr_t>(i)));
}
accompanied by corresponding code in the thread function:
int start = reinterpret_cast<std::intptr_t>(vargp) % nFracK;
This is a fairly common idiom, though more often used when writing in pthreads's native language, C, where it's less verbose.
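For comparison, a sketch of the same idiom in C, using the names from above:

#include <stdint.h>

/* in Opt(): */
pthread_create(&thread[i], NULL, myThreadFun, (void *)(intptr_t)i);

/* in myThreadFun(): */
int start = (int)(intptr_t)vargp % nFracK;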
Use a mutex, semaphore, or other synchronization object to prevent the main thread from modifying the array before the child has read it. (Left as an exercise.)
Any of those options can be used to write a program that produces the expected output, with each thread responsible for printing one line. Supposing, of course, that the expectations of the output do not include that the relative order of the threads' outputs will be the same as the relative order in which they were started. If you want that, then only the option of synchronizing the parent and child threads will achieve it.
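For concreteness, here is a sketch that folds option 1 into the augmented program above. Note that myThreadFun now reads an int through the pointer instead of calling atoi(), since each thread receives a pointer to its own int slot.

#include <cstdio>
#include <pthread.h>

pthread_mutex_t nItermutex = PTHREAD_MUTEX_INITIALIZER;
const int nFracK = 100;
const int dQ = 4;
int nIter = 0;

void * myThreadFun(void *vargp)
{
    int start = *(int *)vargp % nFracK;  // read this thread's own slot
    printf("Thread start = %d, dQ = %d\n", start, dQ);
    pthread_mutex_lock(&nItermutex);
    nIter++;
    pthread_mutex_unlock(&nItermutex);
    return NULL;
}

int main(void) {
    pthread_t thread[10];
    int start[10];
    for(int i = 0; i < 10; i++) {
        start[i] = i;  // each thread gets its own element, never overwritten
        pthread_create(&thread[i], NULL, myThreadFun, &start[i]);
    }
    for(int i = 0; i < 10; i++)
        pthread_join(thread[i], NULL);
    return 0;
}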
I am developing a simple NAND module in SystemC. By specification, it should have a 4 ns delay so I tried to describe it with a process with a "wait" statement and SC_THREAD, as follows:
//file: nand.h
#include "systemc.h"

SC_MODULE(nand2){
    sc_in<bool> A, B;
    sc_out<bool> F;

    void do_nand2(){
        bool a, b, f;
        a = A.read();
        b = B.read();
        f = !(a && b);
        wait(4, SC_NS);
        F.write(f);
    }

    SC_CTOR(nand2){
        SC_THREAD(do_nand2);
        sensitive << A << B;
    }
};
To simulate, I've created another module that outputs the stimulus for the NAND, as follows:
//file: stim.h
#include "systemc.h"

SC_MODULE(stim){
    sc_out<bool> A, B;
    sc_in<bool> Clk;

    void stimGen(){
        wait();
        A.write(false);
        B.write(false);
        wait();
        A.write(false);
        B.write(true);
        wait();
        A.write(true);
        B.write(true);
        wait();
        A.write(true);
        B.write(false);
    }

    SC_CTOR(stim){
        SC_THREAD(stimGen);
        sensitive << Clk.pos();
    }
};
Having these two modules described, the top module (where sc_main is) looks like this:
//file: top.cpp
#include "systemc.h"
#include "nand.h"
#include "stim.h"

int sc_main(int argc, char* argv[]){
    sc_signal<bool> ASig, BSig, FSig;
    sc_clock Clk("Clock", 100, SC_NS, 0.5);

    stim Stim("Stimulus");
    Stim.A(ASig); Stim.B(BSig); Stim.Clk(Clk);

    nand2 nand2("nand2");
    nand2.A(ASig); nand2.B(BSig); nand2.F(FSig);

    sc_trace_file *wf = sc_create_vcd_trace_file("sim");
    sc_trace(wf, Stim.Clk, "Clock");
    sc_trace(wf, nand2.A, "A");
    sc_trace(wf, nand2.B, "B");
    sc_trace(wf, nand2.F, "F");

    sc_start(400, SC_NS);
    sc_close_vcd_trace_file(wf);
    return 0;
}
The code compiled and simulated with no errors; however, when visualizing the .vcd file in GTKWave, the output (F) gets stuck at 1, only showing the delay at the beginning of the simulation.
To test if there were any errors in the code I removed the "wait" statements and changed SC_THREAD to SC_METHOD in the nand.h file and simulated again, now getting the correct results, but without the delays of course.
What am I doing wrong?
It's best to use an SC_METHOD for the process do_nand2, which is sensitive to the inputs. A thread usually has an infinite loop inside it and runs for the entire length of the simulation; your do_nand2 thread runs once, writes F once, and then terminates, which is why F stays stuck. A method, by contrast, runs once from beginning to end each time it is triggered. You use threads mostly for stimulus or concurrent processes, and threads may or may not be sensitive to any events.
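If you do want to keep SC_THREAD, a sketch of the loop the thread body would need (input changes arriving during the 4 ns delay are simply missed in this simple form):

void do_nand2(){
    while (true) {
        bool f = !(A.read() && B.read());
        wait(4, SC_NS);   // model the 4 ns propagation delay
        F.write(f);
        wait();           // block on static sensitivity until A or B changes
    }
}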
Just solved the problem:
instead of using
wait(4, SC_NS);
with SC_THREAD I used
next_trigger(4, SC_NS);
with SC_METHOD and it worked just fine.
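The full fixed module isn't shown above, but here is a sketch of one way that SC_METHOD/next_trigger() combination can be structured (the pending flag is my assumption, used to distinguish the evaluate phase from the delayed write):

#include "systemc.h"

SC_MODULE(nand2){
    sc_in<bool> A, B;
    sc_out<bool> F;
    bool f;        // latched result, written after the delay
    bool pending;  // true while the 4 ns delay is in flight

    void do_nand2(){
        if (!pending) {
            f = !(A.read() && B.read());
            pending = true;
            next_trigger(4, SC_NS);   // re-run this method after 4 ns
        } else {
            F.write(f);
            pending = false;          // revert to static sensitivity on A, B
        }
    }

    SC_CTOR(nand2) : pending(false) {
        SC_METHOD(do_nand2);
        sensitive << A << B;
    }
};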
I have the following code, which progressively goes through a string of bits and rearranges them into blocks of 20 bytes. I'm using 32*8 blocks with 40 threads per block. However, the process takes around 36 ms on my GT630M. Are there any further optimizations I can do, especially with regard to removing the if-else in the innermost loop?
__global__ void test(unsigned char *data)
{
    __shared__ unsigned char dataBlock[20];
    __shared__ int count;
    count = 0;
    unsigned char temp = 0x00;

    for(count=0; count<(streamSize/8); count++)
    {
        for(int i=0; i<8; i++)
        {
            if(blockIdx.y >= i)
                temp |= (*(data + threadIdx.x*(blockIdx.x + gridDim.x*(i+count)))&(0x01<<blockIdx.y))>>(blockIdx.y - i);
            else
                temp |= (*(data + threadIdx.x*(blockIdx.x + gridDim.x*(i+count)))&(0x01<<blockIdx.y))<<(i - blockIdx.y);
        }
        dataBlock[threadIdx.x] = temp;
        //do something
    }
}
It's not clear what your code is trying to accomplish, but a couple of obvious opportunities are:
1) if possible, use 32-bit words instead of unsigned char.
2) use block sizes that are multiples of 32.
3) The conditional code may not be costing you as much as you expect. You can check by compiling with --cubin --gpu-architecture sm_xx (where xx is the SM version of your target hardware), and using cuobjdump --dump-sass on the resulting cubin file to look at the generated assembly. You may have to modify the source code to loft the common subexpression into a separate variable, and/or use the ternary operator ? : to hint to the compiler to use predication.