Simple cblas gemm code but strange result - blas

The following is the code (C++11, using the CBLAS interface):
const int M = 4;
const int N = 1;
const int K = 2;
const int LDA = M;
const int LDB = K;
const int LDC = M;
float input_data[2]{1, 1};
float weight_data[8]{1.1, 2.01, 3.001, 4.0001, 5.1, 6.01, 7.001, 8.0001};
float output_data[4];
cblas_sgemm(CblasColMajor, CblasNoTrans, CblasNoTrans, M, N, K, 1, weight_data, LDA, input_data, LDB, 0, output_data, LDC);
The expected result is {6.2, 8.02, 10.002, 12.0002}. Instead, I got {4.101, 6.0101, 12.101, 14.0101}.
The code is very simple. I have checked the documentation many times, but I don't know what I did wrong.
Could you please help point out the problem? Thanks in advance!
Update:
I tried 2*2 and 3*2 weight_data, and both results are correct. However, a 4*2 weight_data produces the wrong result.
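For reference, the expected values in the question can be checked without any BLAS library. The following is a minimal sketch in Python of a column-major, non-transposed GEMM; the function name gemm_col_major is made up for this illustration and is not part of CBLAS. It only mirrors the argument order of the call above.

```python
def gemm_col_major(m, n, k, alpha, a, lda, b, ldb, beta, c, ldc):
    """Reference C = alpha*A*B + beta*C for column-major, non-transposed inputs."""
    out = list(c)
    for col in range(n):
        for row in range(m):
            acc = 0.0
            for p in range(k):
                # Column-major storage: element (r, c) lives at index r + c * ld.
                acc += a[row + p * lda] * b[p + col * ldb]
            # BLAS convention: when beta is zero, the old contents of C are ignored.
            prev = 0.0 if beta == 0 else beta * out[row + col * ldc]
            out[row + col * ldc] = alpha * acc + prev
    return out


weight_data = [1.1, 2.01, 3.001, 4.0001, 5.1, 6.01, 7.001, 8.0001]
input_data = [1.0, 1.0]
output_data = gemm_col_major(4, 1, 2, 1.0, weight_data, 4,
                             input_data, 2, 0.0, [0.0] * 4, 4)
print([round(v, 4) for v in output_data])  # [6.2, 8.02, 10.002, 12.0002]
```

With LDA = 4 the first four entries of weight_data form column 0 and the last four form column 1, so the product with {1, 1} is simply the sum of the two columns. That agrees with the expected result, which suggests the call itself is set up correctly.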

It turned out to be a bug in OpenBLAS. I never thought OpenBLAS might have a bug. It cost me two days.
https://github.com/xianyi/OpenBLAS/issues/1870

Related

How to handle large batched multiplication in ArrayFire

I'm new to ArrayFire and I'm currently having some problems. I'm doing a large batch of matrix multiplications, something like the code below, but I run out of memory. Could someone show me example code that deals with this issue?
int n = 60;
int m = 18000;
int k = 600;
int t = 200;
af::array A = randu(m, k, t);
af::array B = randu(k, n);
af::array C = af::constant(0, m, n, t);
C = matmul(A, B);
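A rough way to see why this runs out of memory is to add up the array sizes. The sketch below is plain Python arithmetic and does not call ArrayFire. It assumes the arrays hold 32-bit floats, and the chunk size of 25 and the helper names are hypothetical choices for illustration. It also shows one way to split the third dimension into slices that could be multiplied one at a time.

```python
BYTES_PER_FLOAT32 = 4  # assumption: the arrays hold 32-bit floats


def footprint_bytes(*dims):
    """Bytes needed for a dense float32 array with the given dimensions."""
    total = BYTES_PER_FLOAT32
    for d in dims:
        total *= d
    return total


def batch_slices(t, chunk):
    """Yield (start, stop) pairs covering range(t) in pieces of at most `chunk`."""
    for start in range(0, t, chunk):
        yield start, min(start + chunk, t)


m, k, n, t = 18000, 600, 60, 200
a_bytes = footprint_bytes(m, k, t)  # 8,640,000,000 bytes for A alone
c_bytes = footprint_bytes(m, n, t)  #   864,000,000 bytes for C
print(a_bytes, c_bytes)

chunk = 25  # hypothetical chunk size; pick whatever fits the device
slices = list(batch_slices(t, chunk))
print(len(slices), footprint_bytes(m, k, chunk))  # 8 slices of 1,080,000,000 bytes each
```

Under these assumptions A alone needs about 8.6 GB, so generating and multiplying a few slices of the third dimension at a time keeps the working set closer to 1 GB per step.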

CUDA profiling - high shared transactions/access but low local replay rate

After running the Visual Profiler, guided analysis tells me that I'm memory-bound, and that in particular my shared memory accesses are poorly aligned/accessed: basically every line where I access shared memory is marked as ~2 transactions per access.
However, I couldn't figure out why that was the case (my shared memory is padded/strided so that there shouldn't be bank conflicts), so I went back and checked the shared replay metric - and that says that only 0.004% of shared accesses are replayed.
So, what's going on here, and what should I be looking at to speed up my kernel?
EDIT: Minimal reproduction:
import numpy as np
import pycuda.autoinit
import pycuda.driver as cuda
from pycuda.compiler import SourceModule
import pycuda.gpuarray as gp
mod = SourceModule("""
typedef unsigned char ubyte;
__global__ void identity(ubyte *arr, int stride)
{
    const int dim2 = 16;
    const int dim1 = 64;
    const int dim0 = 33;
    int shrstrd1 = dim2;
    int shrstrd0 = dim1 * dim2;
    __shared__ ubyte shrarr[dim0 * dim1 * dim2];
    auto shrget = [shrstrd0, shrstrd1, &shrarr](int i, int j, int k) -> int{
        return shrarr[i * shrstrd0 + j * shrstrd1 + k];
    };
    auto shrset = [shrstrd0, shrstrd1, &shrarr](int i, int j, int k, ubyte val) -> void {
        shrarr[i * shrstrd0 + j * shrstrd1 + k] = val;
    };
    int in_x = threadIdx.x;
    int in_y = threadIdx.y;
    shrset(in_y, in_x, 0, arr[in_y * stride + in_x]);
    arr[in_y * stride + in_x] = shrget(in_y, in_x, 0);
}
""",
options=['-std=c++11'])
#Equivalent to identity<<<1, dim3(32, 32, 1)>>>(arr, 64);
identity = mod.get_function("identity")
identity(gp.zeros((64, 64), np.ubyte), np.int32(64), block=(32, 32, 1))
2 transactions per access, shared replay overhead 0.083. Decreasing dim2 to 8 makes the problem go away, which I also don't understand.
Partial answer: I had a fundamental misunderstanding of how shared memory banks worked (namely, that they are banks of around a thousand byte-banks each) and so didn't realize that they looped around, so that too much padding meant that 32 row elements might end up using each bank more than once.
Presumably, though, that conflict just didn't come up every time - instead it came up, oh, about 85 times a block, from the numbers.
I'll leave this here for a day in hopes of a more complete explanation, then close and accept this answer.
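The wrap-around described in the partial answer can be modelled with a few lines of Python. This is only a simplified sketch: it assumes 32 banks, that consecutive bank-width words map to consecutive banks, and that threads reading the same word do not conflict. The function name and the bank_width parameter are made up for this illustration, and real hardware rules differ between architectures, so the profiler numbers need not match it.

```python
def max_bank_conflict(stride_bytes, lanes=32, num_banks=32, bank_width=4):
    """Largest number of distinct words that one warp requests from a single bank.

    Lane i is assumed to access byte offset i * stride_bytes.
    """
    words_per_bank = {}
    for lane in range(lanes):
        word = (lane * stride_bytes) // bank_width
        # Banks wrap around: word w lives in bank w % num_banks.
        words_per_bank.setdefault(word % num_banks, set()).add(word)
    return max(len(words) for words in words_per_bank.values())


# In the kernel above, neighbouring threadIdx.x values are dim2 bytes apart.
for bank_width in (4, 8):
    for dim2 in (16, 8, 1):
        print(bank_width, dim2, max_bank_conflict(dim2, bank_width=bank_width))
```

In this model a 16-byte stride gives a 4-way conflict with 4-byte banks and a 2-way conflict with 8-byte banks, while an 8-byte stride gives 2-way and no conflict respectively. The 8-byte case happens to agree with the numbers reported above, but that alone does not establish what the hardware was doing.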

PyCUDA large nonuniform matrix operations

I am working with large, nonuniform matrices and am having problems with what I believe to be mismatching on the elements.
In example.py, get_simulated_ipp() builds echo and tx, two linear arrays of size 250000 and 25000 respectively. The code also hardcoded sr=25.
My code is attempting to complex multiply tx into echo along different stretches, depending on specified ranges and value of sr. This will then be stored in an array S.
After searching through some other people's examples, I found a way of building blocks and grids here that I thought would work well. I'm unfamiliar with C code, but have been trying to learn over the past week. Here is my code:
#!/usr/bin/python
#This iteration only works on the first and last elements, mismatching after that.
# However, this doesn't result in any empty elements in S
import numpy as np
import example as ex
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
#pull simulated data and get info about it
((echo,tx)) = ex.get_simulated_ipp()
ranges = np.arange(4000,6000).astype(np.int32)
S = np.zeros([len(ranges),len(tx)],dtype=np.complex64)
sr = ex.sr
#copying input to gpu
# will try this explicitly if in/out (in the function call) don't work
block_dim_x = 8 #thread number is product of block dims,
block_dim_y = 8 # want a multiple of 32 (warp multiple)
# float() keeps Python 2 integer division from truncating before the ceil
blocks_x = np.ceil(len(ranges)/float(block_dim_x)).astype(np.int32).item()
blocks_y = np.ceil(len(tx)/float(block_dim_y)).astype(np.int32).item()
kernel_code="""
#include <cuComplex.h>
__global__ void complex_mult(cuFloatComplex *tx, cuFloatComplex *echo, cuFloatComplex *result,
                             int *ranges, int sr)
{
    unsigned int block_num = blockIdx.x + blockIdx.y * gridDim.x;
    unsigned int thread_num = threadIdx.x + threadIdx.y * blockDim.x;
    unsigned int threads_in_block = blockDim.x * blockDim.y;
    unsigned long int idx = threads_in_block * block_num + thread_num;
    //aligning the i,j to idx, something is mismatched?
    int i = ((idx % (threads_in_block * gridDim.x)) % blockDim.x) +
            ((block_num % gridDim.x) * blockDim.x);
    int j = ((idx - (threads_in_block * block_num)) / blockDim.x) +
            ((block_num / gridDim.x) * blockDim.y);
    result[idx] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);
}
"""
## want something to work like this:
## result[i][j] = cuCmulf(echo[j+ranges[i]*sr], tx[j]);
#includes directory of where cuComplex.h is located
mod = SourceModule(kernel_code, include_dirs=['/usr/local/cuda-7.0/include/'])
complex_mult = mod.get_function("complex_mult")
complex_mult(cuda.In(tx), cuda.In(echo), cuda.Out(S), cuda.In(ranges), np.int32(sr),
block=(block_dim_x,block_dim_y,1),
grid=(blocks_x,blocks_y))
compare = np.zeros_like(S) #built to compare CPU vs GPU calcs
txidx = np.arange(len(tx))
for ri,r in enumerate(ranges):
    compare[ri,:] = echo[txidx+r*sr]*tx
print np.subtract(S, compare)
At the bottom here, I've put in a CPU implementation of what I'm attempting to accomplish and put in a subtraction. The result is that the very first and very last elements come out as 0+0j, but the rest do not. The kernel is attempting to align an i and j to the idx so that I can traverse echo, ranges, and tx more easily.
Is there a better way to implement something like this? Also, why might the result not come out as all 0+0j as I intend?
Edit:
Trying a little example to get a better grasp of how the arrays are being indexed with this block/grid configuration, I stumbled upon something very strange. Before I tried to index the elements, I just wanted to run a little test multiplication. It seems like my block/grid covers all of the ary_in matrix, but the result ends up only doubling the top half of ary_in, and the bottom half returns whatever was left over from the previous bottom-half calculation.
If I change blocks_x to 4 so that I cover more space than needed, however, the doubling works fine. If I then run it with a 4x4 grid, with * 3 instead, it'll work out fine with ary_out as ary_in tripled. When I run it again with a 2x4 grid and only doubling, the top half of ary_out returns the doubled ary_in, but the bottom half returns the previous result in memory, a tripled value instead. I would understand this to be something in my index/block/grid mapping wrongly to the values, but I can't figure out what.
ary_in = np.arange(128).reshape((8,16))
print ary_in
ary_out = np.zeros_like(ary_in)
block_dim_x = 4
block_dim_y = 4
blocks_x = 2
blocks_y = 4
limit = block_dim_x * block_dim_y * blocks_x * blocks_y
mod = SourceModule("""
__global__ void indexing_order(int *ary_in, int *ary_out, int n)
{
    unsigned int block_num = blockIdx.x + blockIdx.y * gridDim.x;
    unsigned int thread_num = threadIdx.x + threadIdx.y * blockDim.x;
    unsigned int threads_in_block = blockDim.x * blockDim.y;
    unsigned int idx = threads_in_block * block_num + thread_num;
    if (idx < n) {
        // ary_out[idx] = thread_num;
        ary_out[idx] = ary_in[idx] * 2;
    }
}
""")
indexing_order = mod.get_function("indexing_order")
indexing_order(drv.In(ary_in), drv.Out(ary_out), np.int32(limit),
block=(block_dim_x,block_dim_y,1),
grid=(blocks_x,blocks_y))
print ary_out
FINAL EDIT:
I figured out the problems. In the edit just above, the ary_in is by default an int64, mismatching with the int initialization in the C code of an int32. This only allocated half the amount of data needed on the GPU for the entire array, so only the top half was moved over and operated on. Adding a .astype(np.int32) solved this problem.
This allowed me to figure out the ordering of the indexing in my case and fix the main code with:
int i = idx / row_len;
int j = idx % row_len;
I still don't understand how to get this working with non even division of block dimensions into the output array (e.g. 16x16), even with an if (idx
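One common way to handle array sizes that the block dimensions do not divide evenly is to round the grid up and let out-of-range threads do nothing. The sketch below simulates that indexing scheme in plain Python rather than on the GPU. The function name simulate_launch and the 10x18 example shape are made up for illustration, and the 2D (x, y) mapping is one possible choice, not the mapping used in the kernels above.

```python
def simulate_launch(rows, cols, block_x, block_y):
    """Count how often each output element is written by a rounded-up 2D grid."""
    grid_x = -(-cols // block_x)  # ceiling division without floats
    grid_y = -(-rows // block_y)
    hits = [0] * (rows * cols)
    for by in range(grid_y):                  # plays the role of blockIdx.y
        for bx in range(grid_x):              # blockIdx.x
            for ty in range(block_y):         # threadIdx.y
                for tx in range(block_x):     # threadIdx.x
                    x = bx * block_x + tx     # column index
                    y = by * block_y + ty     # row index
                    if x < cols and y < rows:  # guard: surplus threads do nothing
                        idx = y * cols + x
                        # Same relation as the fix above, with row_len == cols.
                        assert idx // cols == y and idx % cols == x
                        hits[idx] += 1
    return grid_x, grid_y, hits


grid_x, grid_y, hits = simulate_launch(rows=10, cols=18, block_x=4, block_y=4)
print(grid_x, grid_y, set(hits))  # 5 3 {1}
```

The guard is applied per axis (x < cols and y < rows) rather than to a single flattened idx, because a thread past the end of a row would otherwise wrap around into the next row.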

Swift get Fraction of Float

I've been trying out Swift lately and I've come across a rather simple problem.
In Obj-C, when I want to get the fraction digits of a float, I'd do the following:
float x = 3.141516;
int integer_x = (int)x;
float fractional_x = x-integer_x;
//Result: fractional_x = 0.141516
In Swift:
let x:Float = 3.141516
let integerX:Int = Int(x)
let fractionalX:Float = x - integerX
-> this results in an error because of mismatching types
Any idea how to do it correctly?
Thanks in advance
Malte
Use the modf function:
let v = 3.141516
var integer = 0.0
let fraction = modf(v, &integer)
println("fraction: \(fraction)");
output:
fraction: 0.141516
For float instead of double just use: modff
Use .truncatingRemainder(dividingBy:), which replaced the modulo operator (x % 1). The modulo operator gave the fractional part immediately (it is only one character) and with few CPU cycles (presumably only one cycle, since modulo is a common CPU instruction).
let x:Float = 3.141516
let fracPart = x.truncatingRemainder(dividingBy: 1) // fracPart is now 0.141516
fracPart will assume the value: 0.141516. This works for double and float.
The problem is that you cannot subtract an Int from a Float; you should convert one of the values to the type of the other. Try this:
let fractionalX:Float = x - Float(integerX)
Swift 3 does not like the modulus operator %. It wants me to use truncatingRemainder of Double type.
let x1:Double = 123.00
let t1 = x1.truncatingRemainder(dividingBy: 1)
print("t1 = \(t1)")
let x2:Double = 123.45
let t2 = x2.truncatingRemainder(dividingBy: 1)
print("t2 = \(t2)")
Produces output:
t1 = 0.0
t2 = 0.450000000000003
To remove the 3 quadrillionth artifact you should probably round the result.
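The artifact comes from 123.45 not being exactly representable as a binary double, so the remainder carries the representation error with it. Here is a small sketch of the rounding step, written in Python rather than Swift, where math.fmod plays the role of truncatingRemainder(dividingBy:). The choice of two decimal places is an assumption that suits this particular input.

```python
import math

x = 123.45
frac = math.fmod(x, 1.0)   # remainder with the sign of x, like C's fmod
rounded = round(frac, 2)   # assumes two decimal places are meaningful here

print(frac)     # slightly more than 0.45
print(rounded)  # 0.45
```

Rounding to a fixed number of places only makes sense when you know how many digits the input really carries; otherwise it is usually better to round when formatting the value for display.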
Why use an Int at all?
What about this instead:
import Darwin
let x = 3.1415926
let xf = x - (x > 0 ? floor(x) : ceil(x))
It will use doubles by default here. Feel free to use floats if it's what you need:
let x: Float = 3.1415926
Is that what you are looking for?

Divide integer by 16 without using division or cast

OKAY... let me rephrase this question...
How can I obtain x 16ths of an integer without using division or casting to double....
int res = (ref * frac) >> 4
(but worry a bit about overflow. How big can ref and frac get? If it could overflow, cast to a longer integer type first)
In any operation of this kind it makes sense to multiply first, then divide. Now, if your operands are integers and you are using a compiled language (e.g. C), use a right shift by 4 instead of /16 - this will save some processor cycles.
Assuming everything here are ints, any optimizing compiler worth its salt will notice 16 is a power of two, and shift frac accordingly -- so long as optimizations are turned on. Worry more about major optimizations the compiler can't do for you.
If anything, you should bracket ref * frac and then have the divide, as any value of frac less than 16 will result in 0, whether by shift or divide.
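The effect of the bracketing is easy to check. Here is a minimal sketch in Python, with function names made up for this illustration. Python integers do not overflow, so this only demonstrates the order of operations and not the overflow concern raised above.

```python
def sixteenths(ref, frac):
    """frac/16 of ref: multiply first, then shift right by 4."""
    return (ref * frac) >> 4


def sixteenths_shift_first(ref, frac):
    """Same operands, but shifting frac first loses everything when frac < 16."""
    return ref * (frac >> 4)


print(sixteenths(10, 2))              # 1  (20 / 16, fraction dropped)
print(sixteenths_shift_first(10, 2))  # 0
print(sixteenths(100, 5))             # 31 (500 / 16 = 31.25)
print(sixteenths(-10, 2))             # -2
```

Note that a right shift rounds toward negative infinity, so for negative products it gives -2 where C's integer division -20 / 16 would give -1.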
You can use left shift or right shift:
public static final long divisionUsingMultiplication(int a, int b) {
    int temp = b;
    int counter = 0;
    while (temp <= a) {
        temp = temp<<1;
        counter++;
    }
    a -= b<<(counter-1);
    long result = (long)Math.pow(2, counter-1);
    if (b <= a) result += divisionUsingMultiplication(a,b);
    return result;
}
public static final long divisionUsingShift(int a, int b) {
    int absA = Math.abs(a);
    int absB = Math.abs(b);
    int x, y, counter;
    long result = 0L;
    while (absA >= absB) {
        x = absA >> 1;
        y = absB;
        counter = 1;
        while (x >= y) {
            y <<= 1;
            counter <<= 1;
        }
        absA -= y;
        result += counter;
    }
    return (a>0&&b>0 || a<0&&b<0)?result:-result;
}
I don't understand the constraint, but this pseudo code rounds up (?):
res = 0
ref = 10
frac = 2
denominator = 16
temp = frac * ref
while temp > 0
    temp -= denominator
    res += 1
repeat
echo res