As part of the Nand2Tetris course I'm taking this term, I have to build a binary multiplication chip.
I have built a chip that seems to handle most cases correctly.
But it gives a wrong result when I multiply negative numbers such as -5 and -2.
It gives me -32758 instead of 10.
Here is the HDL code:
// This file is part of www.nand2tetris.org
// and the book "The Elements of Computing Systems"
// by Nisan and Schocken, MIT Press.
// File name: projects/02/Mul.hdl
/**
* The chip will multiply 2 numbers.
* Handling overflows: any number larger than 16 bits
* can be truncated to include only the 16 least significant bits.
*/
CHIP Mul{
IN a[16], b[16]; // Two 16-bit numbers to multiply
OUT out[16]; // 16-bit output number
PARTS:
Mux16(a=false, b=a, sel=b[0], out=out0); // If the current bit of b is 1 then output a, else 0
ShiftLeft(in=a, out=shift1); // Shift a left by one bit for each higher bit position of b
Mux16(a=false, b= shift1, sel=b[1], out=out1);
ShiftLeft(in=shift1, out=shift2);
Mux16(a=false, b= shift2, sel=b[2], out=out2);
ShiftLeft(in=shift2, out=shift3);
Mux16(a=false, b= shift3, sel=b[3], out=out3);
ShiftLeft(in=shift3, out=shift4);
Mux16(a=false, b= shift4, sel=b[4], out=out4);
ShiftLeft(in=shift4, out=shift5);
Mux16(a=false, b= shift5, sel=b[5], out=out5);
ShiftLeft(in=shift5, out=shift6);
Mux16(a=false, b= shift6, sel=b[6], out=out6);
ShiftLeft(in=shift6, out=shift7);
Mux16(a=false, b= shift7, sel=b[7], out=out7);
ShiftLeft(in=shift7, out=shift8);
Mux16(a=false, b= shift8, sel=b[8], out=out8);
ShiftLeft(in=shift8, out=shift9);
Mux16(a=false, b= shift9, sel=b[9], out=out9);
ShiftLeft(in=shift9, out=shift10);
Mux16(a=false, b= shift10, sel=b[10], out=out10);
ShiftLeft(in=shift10, out=shift11);
Mux16(a=false, b= shift11, sel=b[11], out=out11);
ShiftLeft(in=shift11, out=shift12);
Mux16(a=false, b= shift12, sel=b[12], out=out12);
ShiftLeft(in=shift12, out=shift13);
Mux16(a=false, b= shift13, sel=b[13], out=out13);
ShiftLeft(in=shift13, out=shift14);
Mux16(a=false, b= shift14, sel=b[14], out=out14);
ShiftLeft(in=shift14, out=shift15);
Mux16(a=false, b= shift15, sel=b[15], out=out15);
//add all options
Add16(a=out0, b=out1, out=firstAdd0);
Add16(a=out2, b=out3, out=firstAdd1);
Add16(a=out4, b=out5, out=firstAdd2);
Add16(a=out6, b=out7, out=firstAdd3);
Add16(a=out8, b=out9, out=firstAdd4);
Add16(a=out10, b=out11, out=firstAdd5);
Add16(a=out12, b=out13, out=firstAdd6);
Add16(a=out14, b=out15, out=firstAdd7);
Add16(a=firstAdd0, b=firstAdd1, out=secondAdd0);
Add16(a=firstAdd2, b=firstAdd3, out=secondAdd1);
Add16(a=firstAdd4, b=firstAdd5, out=secondAdd2);
Add16(a=firstAdd6, b=firstAdd7, out=secondAdd3);
Add16(a=secondAdd0, b=secondAdd1, out=thirdAdd0);
Add16(a=secondAdd2, b=secondAdd3, out=thirdAdd1);
Add16(a=thirdAdd0, b=thirdAdd1, out=out);
}
Does anyone know what the issue is?
Thanks!
In signed 2's complement representation, the most significant bit carries a negative weight. As a result, the last partial product (out15) must be subtracted from the sum rather than added to it.
Have a look at http://www-inst.eecs.berkeley.edu/~eecs151/sp18/files/Lecture21.pdf for more information on signed 2's complement multiplies.
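To see the signed weighting concretely, here is a small Python model of the shift-and-add scheme. It is a sketch, not a replacement for the HDL, and the helper names (to_signed16, signed_mul, mul16_truncated) are made up for illustration. signed_mul keeps full precision and subtracts the partial product for the top bit of b; mul16_truncated adds every partial product and keeps only the low 16 bits, as the chip's header comment describes.

```python
def to_signed16(x):
    """Interpret the low 16 bits of x as a two's complement number."""
    x &= 0xFFFF
    return x - 0x10000 if x & 0x8000 else x

def signed_mul(a, b, bits=16):
    """Full-precision shift-and-add: bit i of b weighs 2**i, the top bit -2**(bits-1)."""
    ub = b & ((1 << bits) - 1)          # bit pattern of b
    total = 0
    for i in range(bits):
        if (ub >> i) & 1:
            partial = a << i            # a stays a signed Python int
            if i == bits - 1:
                total -= partial        # negative weight of the sign bit
            else:
                total += partial
    return total

def mul16_truncated(a, b):
    """Add all 16 partial products and keep only the low 16 bits."""
    ua, ub = a & 0xFFFF, b & 0xFFFF
    total = 0
    for i in range(16):
        if (ub >> i) & 1:
            total = (total + (ua << i)) & 0xFFFF
    return to_signed16(total)

print(signed_mul(-5, -2))
print(mul16_truncated(-5, -2))
print(to_signed16(0x800A))
```

In this model both multipliers return 10 for -5 * -2. They agree after truncation because 2^15 and -2^15 are the same value modulo 2^16, so the subtraction only changes the result when more than 16 product bits are kept. The reported -32758 is the bit pattern 0x800A, that is, 10 with bit 15 set, so it may also be worth checking what ShiftLeft does with the sign bit.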
I have a very simple function where I'm passing in a char array and doing a simple character match. I want to return an array of 1/0 depending on which characters are matched.
Problem: although I can see the value has been set in the data structure (as I print it in the function after it's assigned) when the int array is copied back from the device the values aren't as expected.
I'm sure it's something silly.
import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy as np
mod = SourceModule("""
__global__ void test(const char *q, const int chrSize, int *d, const int intSize) {
int v = 0;
if( q[threadIdx.x * chrSize] == 'a' || q[threadIdx.x * chrSize] == 'c' ) {
v = 1;
}
d[threadIdx.x * intSize] = v;
printf("x=%d, y=%d, val=%c ret=%d\\n", threadIdx.x, threadIdx.y, q[threadIdx.x * chrSize], d[threadIdx.x * intSize]);
}
""")
func = mod.get_function("test")
# input data
a = np.asarray(['a','b','c','d'], dtype=np.str_)
# allocate/copy to device
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
# destination array
d = np.zeros((4), dtype=np.int16)
# allocate/copy to device
d_gpu = cuda.mem_alloc(d.nbytes)
cuda.memcpy_htod(d_gpu, d)
# run the function
func(a_gpu, np.int8(a.dtype.itemsize), d_gpu, np.int8(d.dtype.itemsize), block=(4,1,1))
# copy data back and print
cuda.memcpy_dtoh(d, d_gpu)
print(d)
Output:
x=0, y=0, val=a ret=1
x=1, y=0, val=b ret=0
x=2, y=0, val=c ret=1
x=3, y=0, val=d ret=0
[1 0 0 0]
Expected output:
x=0, y=0, val=a ret=1
x=1, y=0, val=b ret=0
x=2, y=0, val=c ret=1
x=3, y=0, val=d ret=0
[1 0 1 0]
You have two main problems, neither of which has anything to do with memcpy_dtoh:
You have declared d and d_gpu with dtype np.int16, but the kernel expects a C++ int, leading to a type mismatch. You should use np.int32 to define the arrays.
The indexing of d within the kernel is incorrect. Because the kernel declares d as a pointer to a 32-bit type, indexing it as d[threadIdx.x] already scales the offset by the element size. Passing intSize to the kernel and using it to index d is unnecessary and incorrect.
If you fix those two issues, I suspect the code will work as intended.
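The effect of both problems can be reproduced on the host without a GPU. The following is only a model of the kernel's memory writes, using the standard struct module; the function simulate and the oversized scratch buffer are assumptions made for illustration. It writes 4-byte ints at index tid * int_size, as the kernel does, and then reads back four elements of the host dtype.

```python
import struct

def simulate(int_size, host_itemsize, values=(1, 0, 1, 0)):
    """Model 'd[tid * int_size] = v' on an int* and the copy back to the host."""
    # Oversized scratch buffer so out-of-range writes are visible, not fatal.
    device = bytearray(64)
    for tid, v in enumerate(values):
        # An int* index advances in 4-byte steps, whatever the host dtype is.
        struct.pack_into("<i", device, 4 * tid * int_size, v)
    # The host copies back only four elements of its own item size.
    fmt = {2: "<4h", 4: "<4i"}[host_itemsize]
    return list(struct.unpack_from(fmt, device, 0))

print(simulate(int_size=2, host_itemsize=2))  # original: int16 host array, d[tid * 2]
print(simulate(int_size=1, host_itemsize=4))  # fixed: int32 host array, d[tid]
```

In the first call, threads 1 to 3 write at byte offsets 8, 16 and 24, all beyond the 8 bytes the host allocated, so only the value from thread 0 comes back. That matches the [1 0 0 0] seen in the question.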
NumPy provides a well-defined C API, so one can easily handle NumPy arrays in C/C++ code. For example, if I have a C function that takes C arrays (pointers) as arguments, I can just #include <numpy/arrayobject.h> and pass a NumPy array to it by accessing its data member (or by using the C API PyArray_DATA).
Recently I wanted to achieve the same for CuPy, but I could not find a header file to include. To be specific, my goal is as follows:
I have some CUDA kernels and their callers written in C/C++. The callers run on host but take handles of memory buffers on device as arguments. The computed results of the callers are also stored on device.
I want to wrap the callers into Python functions so that I can control, from Python, when data is transferred from device to host. That means I have to wrap the resulting device memory pointers in Python objects. CuPy's ndarray is the best choice I can think of.
I can't use CuPy's user-defined kernel mechanism because the functions I want to wrap are not plain CUDA kernels; they have to contain host code.
Currently, I've found a workaround. I write the Python functions in cython, which take CuPy arrays as inputs and return CuPy arrays. And then I cast .data.ptr attribute into C's size_t type, and then further cast it to whatever pointer type I need. Example code follows.
Example Code
//kernel.cu
#include <math.h>
__global__ void vecSumKernel(float *A, float *B, float *C, int n) {
int i = threadIdx.x + blockIdx.x * blockDim.x;
if (i < n)
C[i] = A[i] + B[i];
}
// This is the C function I want to wrap into Python.
// Notice it does not allocate any memory on device. I want that to be done by cupy.
extern "C" void vecSum(float *A_d, float *B_d, float *C_d, int n) {
int threadsPerBlock = 512;
if (threadsPerBlock > n) threadsPerBlock = n;
int nBlocks = (int)ceilf((float)n / (float)threadsPerBlock);
vecSumKernel<<<nBlocks, threadsPerBlock>>>(A_d, B_d, C_d, n);
}
//kernel.h
#ifndef KERNEL_H_
#define KERNEL_H_
void vecSum(float *A_d, float *B_d, float *C_d, int n);
#endif
# test_module.pyx
import cupy as cp
import numpy as np
cdef extern from "kernel.h":
    void vecSum(float *A_d, float *B_d, float *C_d, int n)
cdef vecSum_wrapper(size_t aPtr, size_t bPtr, size_t cPtr, int n):
    # here the Python int -- cp.ndarray.data.ptr -- is first cast to size_t,
    # and then cast to (float *).
    vecSum(<float*>aPtr, <float*>bPtr, <float*>cPtr, n)
# This is the Python function I want to use
# a, b are cupy arrays
def vec_sum(a, b):
    a_ptr = a.data.ptr
    b_ptr = b.data.ptr
    n = a.shape[0]
    output = cp.empty(shape=(n,), dtype=a.dtype)
    c_ptr = output.data.ptr
    vecSum_wrapper(a_ptr, b_ptr, c_ptr, n)
    return output
Compile and Run
To compile, one can first compile kernel.cu into a static library, say, libVecSum. Then use Cython to compile test_module.pyx into test_module.c, and build the Python extension as usual.
# setup.py
from setuptools import Extension, setup
ext_module = Extension(
"cupyExt.test_module",
sources=["cupyExt/test_module.c"],
library_dirs=["cupyExt/"],
libraries=['libVecSum', 'cudart'])
setup(
name="cupyExt",
version="0.0.0",
ext_modules = [ext_module],
)
It seems to work.
>>> import cupy as cp
>>> from cupyExt import test_module
>>> a = cp.ones(5, dtype=cp.float32) * 3
>>> b = cp.arange(5, dtype=cp.float32)
>>> c = test_module.vec_sum(a, b)
>>> print(c.device)
<CUDA Device 0>
>>> print(c)
[3. 4. 5. 6. 7.]
Any better ways?
I am not sure if this way is memory safe. I also feel the casting from .data.ptr to C pointers is not good. I want to know people's thoughts and comments on this.
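The safety concern can be illustrated on the host with ctypes, which performs the same kind of cast from a plain integer address to a typed pointer. This is an analogue under stated assumptions, not CuPy code: the buffer lives in host memory, and sum_via_raw_address is a hypothetical helper standing in for the wrapped C function.

```python
import ctypes

def sum_via_raw_address(addr, n):
    """Treat an integer address as float* and sum n elements."""
    ptr = ctypes.cast(ctypes.c_void_p(addr), ctypes.POINTER(ctypes.c_float))
    return sum(ptr[i] for i in range(n))

# 'owner' plays the role of the array object that owns the memory.
owner = (ctypes.c_float * 5)(3.0, 4.0, 5.0, 6.0, 7.0)
addr = ctypes.addressof(owner)   # a plain Python int, like a.data.ptr
total = sum_via_raw_address(addr, 5)
print(total)
```

The integer carries neither ownership nor type information. The cast is only valid while the owning object is alive, and nothing checks that the memory really holds n float32 values, so a wrapper along these lines would probably want to validate dtype, shape and contiguity before taking the pointer.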
I have a problem with my Negamax algorithm and hope someone can help me.
I'm writing it in Cython.
My search method is as follows:
cdef _search(self, object game_state, int depth, long alpha, long beta, int max_depth):
    if depth == max_depth or game_state.is_terminated:
        value = self.evaluator.evaluate(game_state)  # evaluates based on current player
        return value, []
    moves = self.prepare_moves(depth, game_state)  # getting moves and sorting
    max_value = LONG_MIN
    for move in moves:
        new_board = game_state.make_move(move)
        value, pv_moves = self._search(new_board, depth + 1, -beta, -alpha, max_depth)
        value = -value
        if max_value < value:
            max_value = value
            best_move = move
            best_pv_moves = pv_moves
        if alpha < max_value:
            alpha = max_value
        if max_value >= beta:
            return LONG_MAX, []
    best_pv_moves.insert(0, best_move)
    return alpha, best_pv_moves
In many examples you break after a cutoff is detected, but when I do this the algorithm doesn't find the optimal solution. I'm testing against some chess puzzles and was wondering why this is the case. If I return the maximum number after a cutoff is detected it works fine, but it takes a long time (252 sec for depth 6)...
Speed: nodes per second: 21550.33203125
If you have other improvements, let me know (I use a transposition table, PVS and killer heuristics).
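For comparison, here is a minimal negamax with a plain break on cutoff, run on a toy tree. It is a sketch under stated assumptions, not the search above: the tree is a nested list whose leaves are scores from the point of view of the side to move, moves are identified by their index, and the bounds use Python's float infinity, which can be negated without overflow.

```python
INF = float("inf")

def negamax(node, alpha, beta):
    """Return (value, principal variation) for the side to move at node."""
    if not isinstance(node, list):
        return node, []                     # leaf: static score
    best_value, best_line = -INF, []
    for index, child in enumerate(node):
        value, line = negamax(child, -beta, -alpha)
        value = -value                      # flip to our point of view
        if value > best_value:
            best_value, best_line = value, [index] + line
        alpha = max(alpha, best_value)
        if alpha >= beta:
            break                           # cutoff: opponent avoids this node
    return best_value, best_line

tree = [[3, 5], [2, 9], [0, 1]]
result = negamax(tree, -INF, INF)
print(result)
```

On this tree the second and third subtrees are cut off after their first leaf, and the root still reports value 3 via the first move, the same answer a full minimax gives.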
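It turns out I used the C limits:
cdef extern from "limits.h":
    cdef long LONG_MAX
    cdef long LONG_MIN
When you negate LONG_MIN with -LONG_MIN, you typically get LONG_MIN back because the negation overflows: LONG_MAX is only -(LONG_MIN + 1), so the positive counterpart of LONG_MIN does not fit in a long.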
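The wraparound can be reproduced in plain Python by masking to a fixed width. This is only a model: it assumes a 64-bit two's complement long (the width is platform dependent), and in C signed overflow is formally undefined behaviour even though wraparound is what is commonly observed. The helper wrap_negate is hypothetical.

```python
BITS = 64  # assumed width of a C long; platform dependent
LONG_MAX = (1 << (BITS - 1)) - 1
LONG_MIN = -(1 << (BITS - 1))

def wrap_negate(x, bits=BITS):
    """Negate x the way a fixed-width two's complement register would."""
    mask = (1 << bits) - 1
    r = (-x) & mask
    if r >> (bits - 1):      # top bit set: value is negative
        r -= 1 << bits
    return r

print(wrap_negate(LONG_MIN) == LONG_MIN)   # the negation wraps back onto itself
print(wrap_negate(-LONG_MAX) == LONG_MAX)  # a symmetric bound negates cleanly
```

Seeding the search with -LONG_MAX instead of LONG_MIN keeps every bound negatable, which is one common way to avoid the problem.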
I'm trying to calculate the average Luminance of an RGB image. To do this, I find the luminance of each pixel i.e.
L(r,g,b) = X*r + Y*g + Z*b (some linear combination).
And then find the average by summing up luminance of all pixels and dividing by width*height.
To speed this up, I'm using pyopencl.reduction.ReductionKernel.
The array I pass to it is a one-dimensional NumPy array, so it works just like the example given.
from PIL import Image
import numpy as np
im = Image.open('image_00000001.bmp')
data = np.asarray(im).reshape(-1) # so data is a single dimension list
# data.dtype is uint8, data.shape is (w*h*3, )
I want to adapt the following code from the example, i.e. change the data types and the arrays I pass in. This is the example:
a = pyopencl.array.arange(queue, 400, dtype=numpy.float32)
b = pyopencl.array.arange(queue, 400, dtype=numpy.float32)
krnl = ReductionKernel(ctx, numpy.float32, neutral="0",
reduce_expr="a+b", map_expr="x[i]*y[i]",
arguments="__global float *x, __global float *y")
my_dot_prod = krnl(a, b).get()
Except that my map_expr would work on each pixel and convert it to its luminance value.
The reduce_expr remains the same.
The problem is that the kernel works on each element of the array, and I need it to work on each pixel, which is 3 consecutive elements (R, G, B) at a time.
One solution is to have three separate arrays, one each for R, G and B, which would work, but is there another way?
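As a host-side reference for what the reduction should compute, the per-pixel indexing can be sketched in plain Python: iterate over pixel indices and read elements 3*i, 3*i+1 and 3*i+2 of the flat array. The function mean_luminance is hypothetical, and the default weights are the Rec. 601 luma coefficients, chosen here only as an assumption in place of the unspecified X, Y and Z.

```python
def mean_luminance(flat_rgb, weights=(0.299, 0.587, 0.114)):
    """Average luminance of a flat [r, g, b, r, g, b, ...] sequence."""
    if len(flat_rgb) % 3 != 0:
        raise ValueError("length must be a multiple of 3")
    wr, wg, wb = weights
    n_pixels = len(flat_rgb) // 3
    total = 0.0
    for i in range(n_pixels):               # one iteration per pixel
        r = flat_rgb[3 * i]
        g = flat_rgb[3 * i + 1]
        b = flat_rgb[3 * i + 2]
        total += wr * r + wg * g + wb * b   # the "map" step
    return total / n_pixels                 # the "reduce" step, then the mean

print(mean_luminance([1, 2, 3, 4, 5, 6], weights=(1, 1, 1)))
```

The same strided reads could in principle be used inside a kernel's map expression, provided the reduction runs over the number of pixels rather than the number of array elements.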
Edit: I changed the program to illustrate the char4 usage instead of float4:
import numpy as np
import pyopencl as cl
import pyopencl.array as cl_array
deviceID = 0
platformID = 0
workGroup=(1,1)
N = 10
testData = np.zeros(N, dtype=cl_array.vec.char4)
dev = cl.get_platforms()[platformID].get_devices()[deviceID]
ctx = cl.Context([dev])
queue = cl.CommandQueue(ctx)
mf = cl.mem_flags
Data_In = cl.Buffer(ctx, mf.READ_WRITE, testData.nbytes)
prg = cl.Program(ctx, """
__kernel void Pack_Cmplx( __global char4* Data_In, int N)
{
int gid = get_global_id(0);
//Data_In[gid] = 1; // This would change all components to one
Data_In[gid].x = 1; // changing single component
Data_In[gid].y = 2;
Data_In[gid].z = 3;
Data_In[gid].w = 4;
}
""").build()
prg.Pack_Cmplx(queue, (N,1), workGroup, Data_In, np.int32(N))
cl.enqueue_copy(queue, testData, Data_In)
print(testData)
I hope it helps.
The numpy polynomial fit function for masked arrays, ma.polyfit, crashes on integer input:
import numpy as np
import numpy.ma as ma
x = ma.arange(2)
y = ma.arange(2)
p1 = ma.polyfit(np.float32(x), y, deg=1)
p2 = ma.polyfit( x , y, deg=1)
The last line results in an error:
ValueError: data type <type 'numpy.int64'> not inexact
Why can't I fit data with integer x-values? It's no problem with the normal numpy.polyfit function. Is this a (known) bug?
It is indeed a bug in numpy.ma: rcond (the cutoff below which small singular values are ignored in the fit) defaults to len(x)*np.finfo(x.dtype).eps, and np.finfo is not defined for integer dtypes such as np.int64, because an integer has no relative precision. Passing rcond explicitly avoids the problem:
import numpy as np
import numpy.ma as ma
eps = np.finfo(np.float32).eps
x = ma.arange(2)
y = ma.arange(2)
p1 = ma.polyfit(np.float32(x), y, deg=1, rcond = len(x)*eps)
p2 = ma.polyfit( x , y, deg=1, rcond = len(x)*eps)
I've looked quickly through numpy's issues, and this bug does not seem to be listed there. It might be a good idea to raise a new issue.
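Another way around the problem, sketched here as an assumption rather than an official recommendation, is to convert the integer x-values to floating point before fitting, so that the default rcond is computed from a float dtype. The variable names are illustrative only.

```python
import numpy as np
import numpy.ma as ma

x = ma.arange(2)            # integer masked array, as in the question
y = ma.arange(2)

# Cast to float so that np.finfo(x.dtype) is defined inside the fit.
x_float = x.astype(float)
coeffs = ma.polyfit(x_float, y, deg=1)
print(coeffs)               # slope and intercept of the fitted line
```

With the two points (0, 0) and (1, 1), the fit should give a slope close to 1 and an intercept close to 0.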