I'm trying to run the OpenACC tutorial at https://gcc.gnu.org/wiki/OpenACC#OpenACC_kernels_Construct_Optimization_Tutorial
The compiler is g++ 9.2 64-bit as part of the MSYS MINGW64 package.
C:\Users\TJ\Documents\GpuDemo>where g++
C:\Users\TJ\Documents\GpuDemo>g++ --version
g++ (Rev2, Built by MSYS2 project) 9.2.0
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
Here's the command that builds my code:
g++ -m64 -std=c++17 gpudemo.cpp -o gpudemo.exe -fopenmp -fopenacc
The single-thread and OpenMP multi-thread calls work fine. But the OpenACC code is not going to the GPU; it's running on the CPU. The GPU run time is the same as the single-thread run time. My computer is a Lenovo D20 with dual Intel Xeon 5675 processors (6 cores each) and an NVidia GeForce GTX 970 video card, running Windows 7 Pro SP1 64-bit.
Program output:
Multiply a 2000x2000 matrix.
single thread: 54104.1 milliseconds
multi thread: 5036.29 milliseconds
GPU: 54371.1 milliseconds
If I set the environment variable ACC_DEVICE_TYPE=NVIDIA, it gives an error "libgomp: device type NVIDIA not supported."
How can I get this tutorial code to use the GPU?
// https://gcc.gnu.org/wiki/OpenACC
#include <iostream>
#include <chrono>
#include <cstdlib>   // rand
#define N 2000
void matrix_multiply_single_thread (float r[N][N], const float a[N][N], const float b[N][N])
{
    for (int j = 0; j < N; j++)
    {
        for (int i = 0; i < N; i++)
        {
            float sum = 0;
            for (int k = 0; k < N ; k++)
                sum += a[i][k] * b[k][j];
            r[i][j] = sum;
        }
    }
}
void matrix_multiply_multi_thread (float r[N][N], const float a[N][N], const float b[N][N])
{
    #pragma omp parallel for
    for (int j = 0; j < N; j++)
    {
        for (int i = 0; i < N; i++)
        {
            float sum = 0;
            for (int k = 0; k < N ; k++)
                sum += a[i][k] * b[k][j];
            r[i][j] = sum;
        }
    }
}
void matrix_multiply_gpu (float r[N][N], const float a[N][N], const float b[N][N])
{
    #pragma acc kernels \
    copy(r[0:N][0:N], a[0:N][0:N], b[0:N][0:N])
    {
        #pragma acc loop independent
        for (int j = 0; j < N; j++)
        {
            #pragma acc loop independent
            for (int i = 0; i < N; i++)
            {
                float sum = 0;
                // #pragma acc loop seq
                #pragma acc loop independent reduction(+: sum)
                for (int k = 0; k < N ; k++)
                    sum += a[i][k] * b[k][j];
                r[i][j] = sum;
            }
        }
    }
}
static float a[N][N], b[N][N], r[N][N];
int main()
{
    std::cout << "Multiply a " << N << "x" << N << " matrix.\n\n";
    for (int i = 0; i < N; i++)
    {
        for (int j = 0; j < N; j++)
        {
            a[i][j] = rand();
            b[i][j] = rand();
        }
    }
    auto start = std::chrono::high_resolution_clock::now();
    matrix_multiply_single_thread(r, a, b);
    auto finish = std::chrono::high_resolution_clock::now();
    auto microseconds = std::chrono::duration_cast<std::chrono::microseconds>(finish - start);
    double milliseconds = (double)microseconds.count() / 1000;
    std::cout << "\nsingle thread: " << milliseconds << " milliseconds\n";
    start = std::chrono::high_resolution_clock::now();
    matrix_multiply_multi_thread(r, a, b);
    finish = std::chrono::high_resolution_clock::now();
    microseconds = std::chrono::duration_cast<std::chrono::microseconds>(finish - start);
    milliseconds = (double)microseconds.count() / 1000;
    std::cout << "multi thread: " << milliseconds << " milliseconds\n";
    start = std::chrono::high_resolution_clock::now();
    matrix_multiply_gpu(r, a, b);
    finish = std::chrono::high_resolution_clock::now();
    microseconds = std::chrono::duration_cast<std::chrono::microseconds>(finish - start);
    milliseconds = (double)microseconds.count() / 1000;
    std::cout << "GPU: " << milliseconds << " milliseconds\n";
    return 0;
}
Thanks for your interest in this. I'm part of the team who contributed OpenACC support and GPU code offloading to GCC, and we're still working on that.
The compiler you're using has not been built with support for GPU code offloading -- as indicated by the error message "libgomp: device type NVIDIA not supported" that you ran into.
Indeed, we so far haven't seen any reports of people building GCC with code offloading support for Windows hosts. It's likely that a bit of development effort for GCC/nvptx-tools will be required, but neither have we so far been contracted to work on that, nor has any volunteer contributed the respective code changes.
I am trying to learn Metal through the Apple documentation. So far, I have finished writing an application that calculates the square root of 4096 random numbers. However, when I run it through the terminal, it immediately throws a segmentation fault.
Segmentation fault: 11
Saving session...
...copying shared history...
...saving history...truncating history files...
[Process completed]
So far, I have tried inserting std::couts almost everywhere in the code and I have found the problem to be with the function that generates the random numbers (generateRandomFloatData(id<MTLBuffer> buffer)).
When I tried to print out the address of the input buffer, I got this output:
Segmentation fault: 11
Saving session...
...copying shared history...
...saving history...truncating history files...
[Process completed]
Weirdly, it prints out the address of a NULL pointer.
More testing revealed that changing the function to input a char pointer correctly outputs an address 0x7ffee8bd8620 pointing to the string.
Is there a problem in my code?
// main.mm
// MetalComputeCPP
// Created by [] on 5/1/21.
// Copyright © 2021 thng. All rights reserved.
#include <iostream>
#include <ApplicationServices/ApplicationServices.h>
#include <Metal/Metal.h>
#include <Foundation/Foundation.h>
#include <chrono>
const unsigned int arrayLength = 1 << 12;
const unsigned int bufferSize = arrayLength * sizeof(float);
void generateRandomFloatData(id<MTLBuffer> buffer) {
std::cout << ((float*)buffer.contents) << "\n";
float* dataPtr = ((float*)buffer.contents);
for (unsigned long index = 0; index < arrayLength; index++)
dataPtr[index] = (float)((rand()/(float)(RAND_MAX))*10);
std::cout << dataPtr[index] << "\n";
int main(int argc, const char * argv[]) {
id<MTLDevice> _mDevice = MTLCreateSystemDefaultDevice();
NSError* error = nil;
id<MTLLibrary> defaultLibrary = [_mDevice newDefaultLibrary];
id<MTLFunction> SqrtFunction = [defaultLibrary newFunctionWithName:@"SqrtArray"];
id<MTLComputePipelineState> _mSqrtFunctionPSO = [_mDevice newComputePipelineStateWithFunction: SqrtFunction error:&error];
id<MTLCommandQueue> _mCommandQueue = _mDevice.newCommandQueue;
id<MTLBuffer> _mBufferA = [_mDevice newBufferWithLength:bufferSize options:MTLResourceStorageModeShared];
id<MTLBuffer> _mBufferResult = [_mDevice newBufferWithLength:bufferSize options:MTLResourceStorageModeShared];
MTLSize gridSize = MTLSizeMake(arrayLength, 1, 1);
NSUInteger threadGroupSize = _mSqrtFunctionPSO.maxTotalThreadsPerThreadgroup;
if (threadGroupSize > arrayLength)
threadGroupSize = arrayLength;
MTLSize threadgroupSize = MTLSizeMake(threadGroupSize, 1, 1);
std::cout << "Generated random float data.\n";
id<MTLCommandBuffer> commandBuffer = _mCommandQueue.commandBuffer;
id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
[computeEncoder setComputePipelineState:_mSqrtFunctionPSO];
[computeEncoder setBuffer:_mBufferA offset:0 atIndex:0];
[computeEncoder setBuffer:_mBufferResult offset:0 atIndex:1];
[computeEncoder dispatchThreads:gridSize threadsPerThreadgroup:threadgroupSize];
[computeEncoder endEncoding];
[commandBuffer commit];
std::chrono::high_resolution_clock::time_point start = std::chrono::high_resolution_clock::now();
[commandBuffer waitUntilCompleted];
std::chrono::high_resolution_clock::time_point end = std::chrono::high_resolution_clock::now();
uint64_t time = std::chrono::duration_cast<std::chrono::nanoseconds>(end-start).count();
float* a = ((float*)_mBufferA.contents);
float* result = ((float*)_mBufferResult.contents);
bool err = false;
for (unsigned long index = 0; index < arrayLength; index++)
if (abs(result[index] - (float)sqrt(a[index])) > 0.0001) err = true;
std::cout << "√" << a[index] << (err ? " != " : " = ") << result[index] << "\n";
std::cout << time << " nanoseconds\n";
printf("Compute results as expected\n");
return 0;
// File.metal
// MetalComputeCPP
// Created by [] on 5/1/21.
// Copyright © 2021 thng. All rights reserved.
#include <metal_stdlib>
using namespace metal;
kernel void SqrtArray(device const float* inA,
device float* outB,
uint ind [[thread_position_in_grid]]) {
//(x^n-k)' = (nx^(n-1))
outB[ind] = 0.1;
for (int i = 0; i < 20; i++) {
outB[ind] = outB[ind]-((outB[ind]*outB[ind]-inA[ind])/(outB[ind]*2));
buffer in generateRandomFloatData is nil because _mBufferA is nil.
_mBufferA is nil because _mDevice is nil.
MTLCreateSystemDefaultDevice returns nil because (from the MTLCreateSystemDefaultDevice documentation):
In macOS, in order for the system to provide a default Metal device object, you must link to the CoreGraphics framework. You usually need to do this explicitly if you are writing apps that don't use graphics by default, such as command line tools.
Your previous question:
Why does Metal not work when run via the Terminal but is fine when run through Xcode?
In Xcode MTLCreateSystemDefaultDevice returns on my Mac
_mDevice: <CaptureMTLDevice: 0x10050bbb0> -> <MTLDebugDevice: 0x10050aae0> -> <MTLIGAccelDevice: 0x1031c8000>
name = Intel HD Graphics 4000
In Terminal MTLCreateSystemDefaultDevice returns
_mDevice: <MTLIGAccelDevice: 0x7f9c32f17000>
name = Intel HD Graphics 4000
Apparently Xcode wraps the device in a debugging device, which has the side effect of fixing the issue.
I have implemented Merge Sort and Quick Sort as described in my textbook, which gives the time complexity of each sort as follows:
Merge Sort: O(n.log(n)) / Quick Sort: O(n.log(n)) on average and O(n^2) in the worst case (if the key array is already sorted).
So I ran the programs on two types of arrays, sorted and random, with different sizes.
Since I wanted the average time, I ran each case 10 times.
Here is the code of Merge & Quick Sort:
#include <iostream>
#include <ctime>
#include <vector>
#include <algorithm>
using namespace std;
void Merge(vector<int>& s, int low, int mid, int high) {
    int i = low;
    int j = mid + 1;
    int k = low;
    vector<int> u(s);
    while (i <= mid && j <= high) {
        if (s.at(i) < s.at(j)) {
            u.at(k) = s.at(i);
            i++;
        } else {
            u.at(k) = s.at(j);
            j++;
        }
        k++;
    }
    if (i > mid) {
        for (int a = j; a < high + 1; a++) {
            u.at(k) = s.at(a);
            k++;
        }
    } else {
        for (int a = i; a < mid + 1; a++) {
            u.at(k) = s.at(a);
            k++;
        }
    }
    for (int a = low; a < high + 1; a++)
        s.at(a) = u.at(a);
}
void MergeSort(vector<int>& s, int low, int high) {
    int mid;
    if (low < high) {
        mid = (low + high) / 2;
        MergeSort(s, low, mid);
        MergeSort(s, mid + 1, high);
        Merge(s, low, mid, high);
    }
}
void swap(int& a, int& b) {
    int tmp = a;
    a = b;
    b = tmp;
}
void Partition(vector<int>& s, int low, int high, int& pvpoint) {
    int j;
    int pvitem;
    pvitem = s.at(low);
    j = low;
    for (int i = low + 1; i <= high; i++) {
        if (s.at(i) < pvitem) {
            j++;
            swap(s.at(i), s.at(j));
        }
    }
    pvpoint = j;
    swap(s.at(low), s.at(pvpoint));
}
void QuickSort(vector<int>& s, int low, int high) {
    int pvpoint;
    if (high > low) {
        Partition(s, low, high, pvpoint);
        QuickSort(s, low, pvpoint - 1);
        QuickSort(s, pvpoint + 1, high);
    }
}
Each of the following main() functions prints the execution times, one for SORTED and one for RANDOM key arrays.
You can reproduce the results by adding one of these main functions in Visual Studio (C++):
//Sorted key array
int main() {
int s;
for (int i = 1; i < 21; i++) { //Size is from 300 to 6000
s = i * 300;
vector<int> Arr(s);
cout << "N : " << s << "\n";
//Assign Random numbers to each elements
Arr.front() = rand() % Arr.size();
for (int j = 1; j < Arr.size(); j++) { Arr.at(j) = ((737 * Arr.at(j - 1) + 149) % (Arr.size() * 5)); }
sort(Arr.begin(), Arr.end());
//QuickSort(Arr, 0, Arr.size() - 1); <- you can switch using this instead of MergeSort(...) below
for (int i = 0; i < 10; i++) { //print 10 times of execution time
clock_t start, end;
start = clock();
MergeSort(Arr, 0, Arr.size() - 1);
end = clock() - start;
printf("%12.3f ", (double)end * 1000.0 / CLOCKS_PER_SEC);
cout << endl;
return 0;
//Random key array
int main() {
int s;
for (int i = 1; i < 21; i++) {
s = i * 3000;
vector<int> Arr(s);
cout << "N : " << s << "\n";
for (int i = 0; i < 10; i++) {
//Assign Random numbers to each elements
Arr.front() = rand() % Arr.size();
for (int j = 1; j < Arr.size(); j++) { Arr.at(j) = ((737 * Arr.at(j - 1) + 149) % (Arr.size() * 5)); }
//QuickSort(Arr, 0, Arr.size() - 1); <- you can switch using this instead of MergeSort(...) below
clock_t start, end;
start = clock();
MergeSort(Arr, 0, Arr.size() - 1);
end = clock() - start;
printf("%12.3f ", (double)end * 1000.0 / CLOCKS_PER_SEC);
cout << endl;
return 0;
The problem is that the results do not match the expected time complexity. For example, Merge Sort on a RANDOM array
of size N=3000 takes 20 ms, but size N=60000 takes 1400~1600 ms! I expected it to take roughly 400 ms, because the time complexity (outside the worst case of Quick Sort) is O(n.log(n)), isn't it? I want to know what affects this time and how I can get the timings I expected.
You posted the same code in this question: Calculate Execution Times in Sort algorithm and you did not take my answer into account.
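Your MergeSort function has a flaw: Merge duplicates the whole array on every call, causing a lot of overhead and quadratic time complexity. This innocent-looking definition: vector<int> u(s); defines u as a vector initialized as a copy of s, the full array. Sorting n elements calls Merge n - 1 times, and each call now costs O(n) for the copy alone, no matter how small the merged range is, which gives O(n^2) overall.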
C++ is a very powerful language, often too powerful, littered with traps and pitfalls such as this. It is a very good thing you tried to verify that your program meets the expected performance from the known time complexity of the algorithm. Such a concern is alas too rare.
Here are some guidelines:
For getting execution time:
#include <stdio.h>
#include <sys/time.h>  // gettimeofday, struct timeval
int main()
struct timeval stop, start;
int arr[10000];
gettimeofday(&start, NULL);
mergeSort(arr, 0, 9999);
gettimeofday(&stop, NULL);
printf("Time taken for Quick sort is: %ld microseconds\n",
I am running a CUDA kernel which seems to be indexing out of bounds and I cannot figure out why. I get error 8 write-of-size in cuda-memcheck.
I have tried changing the number of blocks and the number of threads in each block, as well as running only a fraction of all the iterations needed. Here is some useful information as well as a reproducible example which gives the error:
blockSize: 128
numBlocks: 512
Nvidia GTX 970
#include <iostream>
#include <cuda_runtime_api.h>
#include <cuda.h>
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <vector>
#include <iterator>
#include <cuda_profiler_api.h>
#include <algorithm>
#include <cmath>
#include <numeric>
#include <stdio.h>
#include <fstream>
int NchooseK(const int &N, const int &K)
int result = 1;
for (int i = 1; i <= K; i++)
result *= N - (K - i);
result /= i;
return result;
inline int get_flatten_size(const unsigned int N){
int sum = 0;
for(int i=1; i<=N ; i++){
sum +=i*NchooseK(N,i);
return sum;
std::vector<int> comb(const int &N, const int &K, const int &length)
//void comb(int N, int K, int length)
int k;
std::vector<int> vec(K);
std::vector<int> flatten_vec(0);
std::string bitmask(K, 1); // K leading 1's
bitmask.resize(N, 0); // N-K trailing 0's
for (int j = 0; j < length; j++) {
k = 0;
for (int i = 0; i < N; ++i) // [0..N-1] integers
if (bitmask[i]) {
//std::cout << i << " ";
vec[k] = i;
//std::cout << std::endl;
std::prev_permutation(bitmask.begin(), bitmask.end());
flatten_vec.insert(flatten_vec.end(), vec.begin(),vec.end());
return flatten_vec;
void get_matrix_indices(const unsigned int N, int *sub_col, int *sub_size, int *cumulative_size)
int size, itterator = 0;
cumulative_size[0] = 0;
std::vector<int> size_i_columns;
std::vector<int> all_columns(0);
for(int i=1; i<=N; i++){
size = NchooseK(N,i);
size_i_columns = comb(N,i,size);
for(int j=0; j<size; j++){
//sub_col = &all_columns[0];
for(int i = 0; i < all_columns.size(); i++) sub_col[i] = all_columns[i];
__global__ void comb_ols(const unsigned int M, const unsigned int N, int* sub_col, int *sub_size, int* cumulative_size, const unsigned int numberOfCalculations, const unsigned int max_size){
int size;
int start_index;
int index = blockIdx.x*blockDim.x+threadIdx.x;
int stride = blockDim.x*gridDim.x;
double *sub_matrix = new double[M*(1+max_size)];
for(int i = index; i < numberOfCalculations; i+=stride){
size = sub_size[i];
start_index = cumulative_size[i];
for(int j = 0; j < size; j++){
for(int k = 0; k<M; k++){
sub_matrix[k] = 1;
delete [] sub_matrix;
And then the main function:
int main()
int N = 17;
int M = 263;
const unsigned int regressors = N-1;
const unsigned int numberOfCalculations = (int) (exp2((double) regressors) - 1);
const unsigned int size_sub_col = get_flatten_size(regressors);
int blockSize =128;
int numBlocks = (numberOfCalculations + blockSize-1)/blockSize;
std::cout << "\nblockSize :" << blockSize;
std::cout << "\nnumBlocks :" << numBlocks;
std::cout << "\nblockSize*numBlocks :" << blockSize*numBlocks;
std::cout << "\nregressors :" << regressors;
std::cout << "\nNumberOfCalculations :" << numberOfCalculations;
std::cout << "\nsize_sub_col :" << size_sub_col << '\n' ;
int *sub_size, *cumulative_size, *sub_columns;
cudaMallocManaged(&sub_size, numberOfCalculations*sizeof(int));
cudaMallocManaged(&cumulative_size, (numberOfCalculations+1)*sizeof(int));
cudaMallocManaged(&sub_columns, size_sub_col*sizeof(int));
get_matrix_indices(regressors,sub_columns, sub_size, cumulative_size);
const unsigned int max_size = N*M;
comb_ols<<<numBlocks, blockSize>>>(M,N,sub_columns, sub_size, cumulative_size, numberOfCalculations, max_size);
return 0;
I fail to see why the threads would try to access illegal memory. The way I understand it, the matrix sub_matrix is initialized once on each thread and then the parallel for loop runs, so each thread should have the memory it needs. Am I allocating too much memory on the GPU? How is "new sub_matrix" handled here?
If I read your code correctly, each thread is attempting to allocate M * (1 + M*N) doubles, which is 263 * (1 + 263*17) = 1,176,136 doubles, or 8.97 MiB of heap memory per thread. You launch 128 * 512 threads. That would mean you require about 574 GiB of heap space for the kernel to run successfully.
Clearly your GPU lacks that amount of memory and the out of bounds memory access comes from failures in the new call (which you can check for, BTW).
Might I suggest that something in the size calculations for the heap memory you require is wrong. Otherwise you have an extremely unrealistic problem for the GPU and will require some other approach.
Note that even if you manage to redesign things to limit the code to a feasible malloc heap memory size, you will still need, in all likelihood, to resize the malloc heap to a suitable size before running the kernel. The cudaDeviceSetLimit API can be used for this.
I am posting here to ask for help with a problem that has taken up a lot of my time. This is my first program using SystemC, so I will explain my aim as well as I can. I stored the pixel values of two images as matrices in two different text files. I wrote SystemC code that loads the two matrices and applies a sum of absolute differences; if the number of differences is greater than a threshold, the code displays a message (motion).
My code is composed of two modules. The first module checks whether a number is stored in a text file; if so, it triggers the other module to load the two matrices and compare them. I really need this code for my graduation project, so any help or suggestions would be appreciated.
#include "systemC.h"
#include "string.h"
#include "stdio.h"
#include <time.h>
#include <math.h> /* fabs */
#include <fstream>
#include <iostream>
#include <fstream>
using namespace std;
double elapsed;
int H = 0;
int D = 0;
int a, b;
int in = false;
int L = 0;
char *mode1 = "r";
char *mode2 = "w";
int i, j, k;
int rows1, cols1, rows2, cols2;
bool fileFound = false;
FILE *SwitchContext;
FILE *image1;
FILE *image2;
FILE *image3;
int sum = 0;
clock_t start = clock();
sc_in<bool>sig ;
void synchroprocess()
cout << "\n Running Automation";
SwitchContext = fopen("F:/SWITCH CONTEXT.txt", mode2);
fscanf(SwitchContext, "%d", &L);
while (L != 0)
cout << "waiting...";
sig == true;
void MotionDetector()
image3 = fopen("F:/image3.txt", mode2);
char *mode1 = "r";
char *mode2 = "w";
image1 = fopen("F:/image1.txt", mode1);
if (!image1)
printf("File Not Found!!\n");
fileFound = true;
fileFound = false;
while (fileFound);
image2 = fopen("F:/image2.txt", mode1);
if (!image2)
printf("File Not Found!!\n");
fileFound = true;
fileFound = false;
while (fileFound);
rows1 = rows2 = 384;
cols1 = cols2 = 512;
int **mat1 = (int **)malloc(rows1 * sizeof(int*));
for (i = 0; i < rows1; i++)
mat1[i] = (int *)malloc(cols1 * sizeof(int));
i = 0;
int **mat2 = (int **)malloc(rows2 * sizeof(int*));
for (i = 0; i < rows2; i++)
mat2[i] = (int *)malloc(cols2 * sizeof(int));
i = 0;
while (!feof(image1))
for (i = 0; i < rows1; i++)
for (j = 0; j < cols1; j++)
fscanf(image1, "%d%", &mat1[i][j]);
i = 0;
j = 0;
while (!feof(image2))
for (i = 0; i < rows2; i++)
for (j = 0; j < cols2; j++)
fscanf(image2, "%d%", &mat2[i][j]);
i = 0;
j = 0;
for (i = 0; i < rows1; i++)
for (j = 0; j < cols1; j++) {
a = abs(mat1[i][j] = mat2[i][j]);
b = b + a;
i = j = 0;
D = b / 196608;
if (D > 0.9)
for (i = 0; i < rows1; i++) {
for (j = 0; j < cols1; j++)
fprintf(image3, "%d ", mat2[i][j]);
fprintf(image3, "\n");
printf("\n Image Saved....");
std::ofstream mon_fichier("F:\toto.txt");
mon_fichier << elapsed << '\n';
clock_t end = clock();
elapsed = ((double)end - start) / CLOCKS_PER_SEC;
printf("time is %f", elapsed);
int sc_main(int argc, char* argv[])
imageProcess master("EE2");
What you did is basically wrong.
You copy-pasted code into an SC_MODULE, and that code is plain C code
(do not mix C and C++ files).
This is not how you use a clock.
What you should do:
First, check that your algorithm works; for this you do not need SystemC at all.
Then you can replace the data types with hardware ones and check that it still works.
Then you have to find out which data interface is used in the hardware and how to use it.
Then you have to tweak your algorithm to work with this interface (there you can use SC_MODULE, sc ports, etc.).
Also take a look at SC_CTHREAD; you will need it.
Without any information about the target platform, I cannot provide any further help.
Here's a simple program:
void multiply(const int* v_in, const int* w_in, int n_v, int n_w, int* w_out)
for(int i=0; i<n_w; i++)
int sum=0;
for(int j=0; j<n_v; j++)
sum += (w_in[i]*v_in[j])>>1;
Presume n_v, n_w ~10^6. Clearly, there are at least a dozen equivalent ways to do this in CUDA, with different ways to subdivide the (n_v*n_w) operations into threads, with and without shared memory... Which way should, theoretically speaking, be the fastest?
void multiply(const int* v_in, const int* w_in, int n_v, int n_w, int* w_out)
int *v = shared; // dynamic
for(int i = block.rank; i < n_w; i += block.size)
int w = w_in[i]; // coalesced
int sum=0;
for(int j=0; j<n_v; j += block.size) { // assumption
v[block.rank] = v_in[j+block.rank];
for(int k = 0; k < block.size; ++k)
sum += (w*v[k])>>1; //
__synch(); // ouch
w_out[i] = sum; // ditto