This code works for multiply 2 matrix which it send matrix a and matrix b and them pointer to matrixMultiply method.
But I have trouble at the line "matrix12[i][j] += matrix1[i][k] * matrix2[k];"
double **matrixMultiply(double *matrix1,double *matrix2,int row1,int col1,int col2){
double **matrix12 = malloc(sizeof(double*)*row1);
for (int i=0; i<row1; i++){
matrix12[i] = malloc(sizeof(double*)col2);
for (int j=0; j<col2; j++){
matrix12[i][j] = 0.0;
for (int k=0; k<col1; k++){
matrix12[i][j] += matrix1[i][k] * matrix2[k]; //invalid operands to binary expression
}
}
}
return matrix12;
}
double *kmult = *matrixMultiply(a, b, 4, 4, 4,);
Ps.This code declare in ViewController.m
Are you sure Matrix1 is a 2-dimension matrix? It's declared at a 1-dimension matrix: double *matrix1 instead of "double **matrix1"
That's probably why the line matrix12[i][j] += matrix1[i][k] * matrix2[k]; doesn't work.
Related
I need the method to check sum CRC8.
I found this code, but it's not working:
- (int)crc8Checksum:(NSString*)dataFrame{
char j;
int crc8 = 0;
int x = 0;
for (int i = 0; i < [dataFrame length]; i++){
x = [dataFrame characterAtIndex:i];
for (int k = 0; k < 8; k++){
j = 1 & (x ^ crc8);
crc8 = floor0(crc8 / 2) & 0xFF;
x = floor0(x / 2) & 0xFF;
if (j != 0 ){
crc8 = crc8 ^ 0x8C;
}
}
}
return crc8;
}
Help me please!
What do you mean "it's not working"? There are 14 different CRC-8 definitions in this catalog, and probably many more out there in the wild. Do you have some CRC values you are comparing to? Is there documentation on what CRC you actually need? What are your test messages and corresponding expected CRCs?
You can't just pick some random CRC-8 code and expect it to do what you need.
That particular code computes a CRC-8/MAXIM in the linked catalog. However it is truly awful code. With unnecessary divides and floors. Here is a better, simpler, faster inner loop:
crc8 ^= x;
for (int k = 0; k < 8; k++)
crc8 = crc8 & 1 ? (crc8 >> 1) ^ 0x8c : crc8 >> 1;
You can get it faster still with tables and algorithms that compute the CRC a byte at a time or a machine word at a time.
The x in the code has its own problems, since an NSString can be a string of unicode characters, so characterAtIndex may not return a byte, and length may not return the number of bytes. You need a way to get the message as a series of bytes.
I'm trying to compute batch 1D FFTs using cufftPlanMany. The data set comes from a 3D field, stored in a 1D array, where I want to compute 1D FFTs in the x and y direction. The data is stored as shown in the figure below; continuous in x then y then z.
Doing batch FFTs in the x-direction is (I believe) straighforward; with input stride=1, distance=nx and batch=ny * nz, it computes the FFTs over elements {0,1,2,3}, {4,5,6,7}, ..., {28,29,30,31}. However, I can't think of a way to achieve the same for the FFTs in the y-direction. A batch for each xy plane is again straightforward (input stride=nx, dist=1, batch=nx results in FFTs over {0,4,8,12}, {1,5,9,13}, etc.). But with batch=nx * nz, going from {3,7,11,15} to {16,20,24,28}, the distance is larger than 1. Can this somehow be done with cufftPlanMany?
I think that the short answer to your question (possibility of using a single cufftPlanMany to perform 1D FFTs of the columns of a 3D matrix) is NO.
Indeed, transformations performed according to cufftPlanMany, that you call like
cufftPlanMany(&handle, rank, n,
inembed, istride, idist,
onembed, ostride, odist, CUFFT_C2C, batch);
must obey the Advanced Data Layout. In particular, 1D FFTs are worked out according to the following layout
input[b * idist + x * istride]
where b addresses the b-th signal and istride is the distance between two consecutive items in the same signal. If the 3D matrix has dimensions M * N * Q and if you want to perform 1D transforms along the columns, then the distance between two consecutive elements will be M, while the distance between two consecutive signals will be 1. Furthermore, the number of batched executions must be set equal to M. With those parameters, you are able to cover only one slice of the 3D matrix. Indeed, if you try increasing M, then the cuFFT will start trying to compute new column-wise FFTs starting from the second row. The only solution to this problem is an iterative call to cufftExecC2C to cover all the Q slices.
For the record, the following code provides a fully worked example on how performing 1D FFTs of the columns of a 3D matrix.
#include <thrust/device_vector.h>
#include <cufft.h>
/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
int main() {
const int M = 3;
const int N = 4;
const int Q = 2;
thrust::host_vector<float2> h_matrix(M * N * Q);
for (int k=0; k<Q; k++)
for (int j=0; j<N; j++)
for (int i=0; i<M; i++) {
float2 temp;
temp.x = (float)(j + k * M);
//temp.x = 1.f;
temp.y = 0.f;
h_matrix[k*M*N+j*M+i] = temp;
printf("%i %i %i %f %f\n", i, j, k, temp.x, temp.y);
}
printf("\n");
thrust::device_vector<float2> d_matrix(h_matrix);
thrust::device_vector<float2> d_matrix_out(M * N * Q);
// --- Advanced data layout
// input[b * idist + x * istride]
// output[b * odist + x * ostride]
// b = signal number
// x = element of the b-th signal
cufftHandle handle;
int rank = 1; // --- 1D FFTs
int n[] = { N }; // --- Size of the Fourier transform
int istride = M, ostride = M; // --- Distance between two successive input/output elements
int idist = 1, odist = 1; // --- Distance between batches
int inembed[] = { 0 }; // --- Input size with pitch (ignored for 1D transforms)
int onembed[] = { 0 }; // --- Output size with pitch (ignored for 1D transforms)
int batch = M; // --- Number of batched executions
cufftPlanMany(&handle, rank, n,
inembed, istride, idist,
onembed, ostride, odist, CUFFT_C2C, batch);
for (int k=0; k<Q; k++)
cufftExecC2C(handle, (cufftComplex*)(thrust::raw_pointer_cast(d_matrix.data()) + k * M * N), (cufftComplex*)(thrust::raw_pointer_cast(d_matrix_out.data()) + k * M * N), CUFFT_FORWARD);
cufftDestroy(handle);
for (int k=0; k<Q; k++)
for (int j=0; j<N; j++)
for (int i=0; i<M; i++) {
float2 temp = d_matrix_out[k*M*N+j*M+i];
printf("%i %i %i %f %f\n", i, j, k, temp.x, temp.y);
}
}
The situation is different for the case when you want to perform 1D transforms of the rows. In that case, the distance between two consecutive elements is 1, while the distance between two consecutive signals is M. This allows you to set a number of N * Q transformations and then invoking cufftExecC2C only one time. For the record, the code below provides a full example of 1D transformations of the rows of a 3D matrix.
#include <thrust/device_vector.h>
#include <cufft.h>
/********************/
/* CUDA ERROR CHECK */
/********************/
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
if (code != cudaSuccess)
{
fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
if (abort) exit(code);
}
}
int main() {
const int M = 3;
const int N = 4;
const int Q = 2;
thrust::host_vector<float2> h_matrix(M * N * Q);
for (int k=0; k<Q; k++)
for (int j=0; j<N; j++)
for (int i=0; i<M; i++) {
float2 temp;
temp.x = (float)(j + k * M);
//temp.x = 1.f;
temp.y = 0.f;
h_matrix[k*M*N+j*M+i] = temp;
printf("%i %i %i %f %f\n", i, j, k, temp.x, temp.y);
}
printf("\n");
thrust::device_vector<float2> d_matrix(h_matrix);
thrust::device_vector<float2> d_matrix_out(M * N * Q);
// --- Advanced data layout
// input[b * idist + x * istride]
// output[b * odist + x * ostride]
// b = signal number
// x = element of the b-th signal
cufftHandle handle;
int rank = 1; // --- 1D FFTs
int n[] = { M }; // --- Size of the Fourier transform
int istride = 1, ostride = 1; // --- Distance between two successive input/output elements
int idist = M, odist = M; // --- Distance between batches
int inembed[] = { 0 }; // --- Input size with pitch (ignored for 1D transforms)
int onembed[] = { 0 }; // --- Output size with pitch (ignored for 1D transforms)
int batch = N * Q; // --- Number of batched executions
cufftPlanMany(&handle, rank, n,
inembed, istride, idist,
onembed, ostride, odist, CUFFT_C2C, batch);
cufftExecC2C(handle, (cufftComplex*)(thrust::raw_pointer_cast(d_matrix.data())), (cufftComplex*)(thrust::raw_pointer_cast(d_matrix_out.data())), CUFFT_FORWARD);
cufftDestroy(handle);
for (int k=0; k<Q; k++)
for (int j=0; j<N; j++)
for (int i=0; i<M; i++) {
float2 temp = d_matrix_out[k*M*N+j*M+i];
printf("%i %i %i %f %f\n", i, j, k, temp.x, temp.y);
}
}
I guess, idist=nx*nz could also jump a whole plane and batch=nz would then cover one yx plane. The decision should be made according to whether nx or nz is larger.
Is it expected that the following test should fail?
The test compares results of a 2D and a 3D AffineTransformation. Both are constructed to have unit scaling and zero offsets in the y and z direction, but to have non-zero and non-unity scaling and offset in the x direction. All other off-diagonal elements are zero. It is my belief that these transformations are identical in the x and y directions, and hence should produce identical results.
Furthermore I have found that the test passes if I use this Kernel:
using K = CGAL::Exact_predicates_exact_constructions_kernel;
Is it to be expected that the test passes if I use this Kernel? Should the test fail with either kernel or pass with either kernel?
TEST(TransformerTest, testCGALAffine) {
using K = CGAL::Exact_predicates_inexact_constructions_kernel;
using Float = typename K::FT;
using Transformation_2 = K::Aff_transformation_2;
using Transformation_3 = K::Aff_transformation_3;
using Point_2 = typename K::Point_2;
using Point_3 = typename K::Point_3;
double lowerCorner(17.005142946538115);
double upperCorner(91.940521484752139);
int resolution = 48;
double tmpScaleX((upperCorner - lowerCorner) / resolution);
Float scaleX(tmpScaleX);
Float zero(0);
Float unit(1);
// create a 2D voxel to world transform
Transformation_2 transformV2W_2(scaleX, zero, Float(lowerCorner),
zero, unit, zero,
unit);
// create it's inverse: a 2D world to voxel transform
auto transformW2V_2 = transformV2W_2.inverse();
// create a 3D voxel to world transform
Transformation_3 transformV2W_3(scaleX, zero, zero, Float(lowerCorner),
zero, unit, zero, zero,
zero, zero, unit, zero,
unit);
// create it's inverse: a 3D world to voxel transform
auto transformW2V_3 = transformV2W_3.inverse();
for (int i = 0; i < 3; ++i) {
for (int j = 0; j < 2; ++j) {
EXPECT_EQ(transformV2W_2.cartesian(i, j), transformV2W_3.cartesian(i, j)) << i << ", " << j;
EXPECT_EQ(transformW2V_2.cartesian(i, j), transformW2V_3.cartesian(i, j)) << i << ", " << j;
}
}
std::mt19937_64 rng(0);
std::uniform_real_distribution<double> randReal(0, resolution);
// compare the results of 2D and 3D transformations of random locations
for (int i = 0; i < static_cast<int>(1e4); ++i) {
Float x(randReal(rng));
Float y(randReal(rng));
auto world_2 = transformV2W_2(Point_2(x, y));
auto world_3 = transformV2W_3(Point_3(x, y, 0));
EXPECT_EQ(world_2.x(), world_3.x()) << world_2 << ", " << world_3;
auto voxel_2 = transformW2V_2(world_2);
auto voxel_3 = transformW2V_3(world_3);
EXPECT_EQ(voxel_2.x(), voxel_3.x()) << voxel_2 << ", " << voxel_3;
}
}
I have the following code, but in this line of code I have warning x[i] = (rhs[i] - x[i - 1]) / b;, compiler is telling me that rhs[i] is a garbage value. Why it's happend? And how to remove this warning?
double* getFirstControlPoints(double* rhs, const int n) {
double *x;
x = (double*)malloc(n * sizeof(double));
double *tmp; // Temp workspace.
tmp = (double*)malloc(n * sizeof(double));
double b = 2.0;
x[0] = rhs[0] / b;
for (int i = 1; i < n; i++) // Decomposition and forward substitution.
{
tmp[i] = 1 / b;
b = (i < n - 1 ? 4.0 : 3.5) - tmp[i];
x[i] = (rhs[i] - x[i - 1]) / b; //The left operand of '-' is a garbage value
}
for (int i = 1; i < n; i++) {
x[n - i - 1] -= tmp[n - i] * x[n - i]; // Backsubstitution.
}
free(tmp);
return x;
}
All compiler warnings and calling getFirstControlPoints you may see on screenshots.
You need a check to make sure you have at least 4 points in the points array because this loop (line 333):
for (NSInteger i = 1 ; i < n - 1 ; ++i) {
// initialisation stuff
}
will not execute at all for n = 0, 1, 2.
Assume that points has 3 objects in it, At line 311 you set n to the count - 1 i.e. n == 2
Then the loop condition is i < 2 - 1 i.e. i < 1.
I think you need the loop condition to be i < n
if points.count is 0 or 1 you are facing some problems, because then, n is -1 or 0, and you access rhs[n-1]; and you malloc n* bytes;
maybe that can be the problem. that you put some rangechecks int o the code?
I want to calculate the product A^T*A ( A is 2000x1000 Matrix). Also i only want to solve the upper triangular Matrix. In the inner loop i have to solve the dot product of two vectors.
Now, here is the problem. Using cblas ddot() is not faster than calculating the dot product with a loop. How is this possible? (using Intel Core (TM)i7 CPU M620 #2,67GHz, 1,92GB RAM)
The problem is caused essentially by matrix size, not by ddot. Your matrices are so large that they do not fit in the cache memory. The solution is to rearrange the three nested loops such that as much as possible can be done with a line in cache, so reducing cache refreshes. A model implementation follows for both the ddot and an daxpy approach. On my computer the time consumption was about 15:1.
In other words: never, never, never program a matrix multiplication along the "row times column" scheme that we learned in school.
/*
Matrix product of A^T * A by two methods.
1) "Row times column" as we learned in school.
2) With rearranged loops such that need for cash refreshes is reduced
(this can be improved even more).
Compile: gcc -o aT_a aT_a.c -lgslcblas -lblas -lm
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>
#define ROWS 2000
#define COLS 1000
static double a[ROWS][COLS];
static double c[COLS][COLS];
static void dot() {
int i, j;
double *ai, *bj;
ai = a[0];
for (i=0; i<COLS; i++) {
bj = a[0];
for (j=0; j<COLS; j++) {
c[i][j] = cblas_ddot(ROWS,ai,COLS,bj,COLS);
bj += 1;
}
ai += 1;
}
}
static void axpy() {
int i, j;
double *ci, *bj, aij;
for (i=0; i<COLS; i++) {
ci = c[i];
for (j=0; j<COLS; j++) ci[j] = 0.;
for (j=0; j<ROWS; j++) {
aij = a[j][i];
bj = a[j];
cblas_daxpy(COLS,aij,bj,1,ci,1);
}
}
}
int main(int argc, char** argv) {
clock_t t0, t1;
int i, j;
for (i=0; i<ROWS; ++i)
for (j=0; j<COLS; ++j)
a[i][j] = i+j;
t0 = clock();
dot();
t0 = clock();
printf("Time for DOT : %f sec.\n",(double)t0/CLOCKS_PER_SEC);
axpy();
t1 = clock();
printf("Time for AXPY: %f sec.\n",(double)(t1-t0)/CLOCKS_PER_SEC);
return 0;
}
The CBLAS dot product is effectively just a computation in slightly unrolled loop. The netlib Fortran is just this:
DO I = MP1,N,5
DTEMP = DTEMP + DX(I)*DY(I) + DX(I+1)*DY(I+1) +
$ DX(I+2)*DY(I+2) + DX(I+3)*DY(I+3) + DX(I+4)*DY(I+4)
END DO
ie. just a loop unrolled to a stride of 5.
If you must use a ddot style dot product for your operation, you might get a performance boost by re-writing your loop to use SSE2 intrinsics:
#include <emmintrin.h>
double ddotsse2(const double *x, const double *y, const int n)
{
double result[2];
int n2 = 2 * (n/2);
__m128d dtemp;
if ( (n % 2) == 0) {
dtemp = _mm_setzero_pd();
} else {
dtemp = _mm_set_sd(x[n] * y[n]);
}
for(int i=0; i<n2; i+=2) {
__m128d x1 = _mm_loadr_pd(x+i);
__m128d y1 = _mm_loadr_pd(y+i);
__m128d xy = _mm_mul_pd(x1, y1);
dtemp = _mm_add_pd(dtemp, xy);
}
_mm_store_pd(&result[0],dtemp);
return result[0] + result[1];
}
(not tested, never been compiled, buyer beware).
This may or may be faster than the standard BLAS implementation. You may also want to investigate whether further loop unrolling could improve performance.
If you're not using SSE2 intrinsics or using a data type that may not boost performance with them, you can try to transpose the matrix for an easy improvement in performance for larger matrix multiplications with cblas_?dot. Performing the matrix multiplication in blocks also helps.
void matMulDotProduct(int n, float *A, float* B, int a_size, int b_size, int a_row, int a_col, int b_row, int b_col, float *C) {
int i, j, k;
MKL_INT incx, incy;
incx = 1;
incy = b_size;
//copy out multiplying matrix from larger matrix
float *temp = (float*) malloc(n * n * sizeof(float));
for (i = 0; i < n; ++i) {
cblas_scopy(n, &B[(b_row * b_size) + b_col + i], incy, &temp[i * n], 1);
}
//transpose
mkl_simatcopy('R', 'T', n, n, 1.0, temp, 1, 1);
for (i = 0; i < n; i+= BLOCK_SIZE) {
for (j = 0; j < n; j++) {
for (k = 0; k < BLOCK_SIZE; ++k) {
C[((i + k) * n) + j] = cblas_sdot(n, &A[(a_row + i + k) * a_size + a_col], incx, &temp[n * j], 1);
}
}
}
free(temp);
}
On my machine, this code is about 1 order of magnitude faster than the the 3 loop code (but also 1 order of magnitude slower than cblas_?gemm call) for single precision floats and 2K by 2K matrices. (I'm using Intel MKL).