BLAS DGEMV input error - blas

I'm having trouble figuring out why a BLAS call is throwing an error. The problem call is the last BLAS call. The code compiles without issue and runs fine up to this call, then fails with the following message:
** ACML error: on entry to DGEMV parameter number 6 had an illegal value
As far as I can tell, all the input types are correct and array a has
I would really appreciate any insight into the problem.
Thanks
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include "cblas.h"
#include "array_alloc.h"
int main( void )
{
double **a, **A;
double *b, *B, *C;
int *ipiv;
int n, nrhs;
int info;
int i, j;
printf( "How big a matrix?\n" );
fscanf( stdin, "%i", &n );
/* Allocate the matrix and set it to random values but
with a big value on the diagonal. This makes sure we don't
accidentally get a singular matrix */
a = alloc_2d_double( n, n );
A= alloc_2d_double( n, n );
for( i = 0; i < n; i++ ){
for( j = 0; j < n; j++ ){
a[ i ][ j ] = ( ( double ) rand() ) / RAND_MAX;
}
a[ i ][ i ] = a[ i ][ i ] + n;
}
memcpy(A[0],a[0],n*n*sizeof(double));
/* Allocate and initialise b */
b = alloc_1d_double( n );
B = alloc_1d_double( n );
C = alloc_1d_double( n );
for( i = 0; i < n; i++ ){
b[ i ] = 1;
}
cblas_dcopy(n,b,1,B,1);
/* the pivot array */
ipiv = alloc_1d_int( n );
/* Note we MUST pass pointers, so have to use a temporary var */
nrhs = 1;
/* Call the Fortran. We need one underscore on our system*/
dgesv_( &n, &nrhs, a[ 0 ], &n, ipiv, b, &n, &info );
/* Tell the world the results */
printf( "info = %i\n", info );
for( i = 0; i < n; i++ ){
printf( "%4i ", i );
printf( "%12.8f", b[ i ] );
printf( "\n" );
}
/* Want to check my lapack result with blas */
cblas_dgemv(CblasRowMajor,CblasTrans,n,n,1.0,A[0],1,B,1,0.0,C,1);
return 0;
}

The leading dimension (LDA) needs to be at least as large as the number of columns (n) for a RowMajor matrix. You’re passing a LDA of 1.
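To make the role of LDA concrete, here is a minimal plain-C sketch (not the ACML or CBLAS implementation) of y = A^T x for a row-major matrix. The function name gemv_trans_rowmajor is made up for illustration. Element (i, j) is read from a[i * lda + j], which is why lda has to be at least the number of columns. Assuming alloc_2d_double returns contiguous storage, the corrected call in the question would presumably be cblas_dgemv(CblasRowMajor, CblasTrans, n, n, 1.0, A[0], n, B, 1, 0.0, C, 1).

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative only: y = A^T * x for an m-by-n matrix stored row-major.
   Element (i, j) lives at a[i * lda + j], so lda must be >= n. */
void gemv_trans_rowmajor(int m, int n, const double *a, int lda,
                         const double *x, double *y)
{
    assert(lda >= n);   /* the condition the DGEMV error is complaining about */

    for (int j = 0; j < n; j++)
        y[j] = 0.0;

    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            y[j] += a[(size_t)i * lda + j] * x[i];
}
```

With lda equal to the column count the rows are packed back to back; a larger lda simply skips padding at the end of each row. An lda of 1 would make consecutive rows overlap, which is why the library rejects it.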
Separately, I’m slightly suspicious of your matrix types; without seeing how alloc_2d_double is implemented there’s no way to be sure if you’re laying out the matrix correctly or not. Generally speaking, intermixing pointer-to-pointer-style “matrices” with BLAS-style matrices (contiguous arrays with row or column stride) is something of a code smell. (However, it is possible to do correctly, and you may well be handling it properly; it’s just not possible to tell if this is the case from the code you posted).
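One layout that works with both a[i][j] indexing and BLAS is a single contiguous block plus an array of row pointers into it. The sketch below is only an assumption about how such an allocator could look; alloc_2d_contiguous and free_2d_contiguous are hypothetical names, and the question's alloc_2d_double may or may not work this way.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical allocator: one contiguous rows*cols block, plus a table of
   row pointers so that m[i][j] and m[0][i * cols + j] name the same element. */
double **alloc_2d_contiguous(int rows, int cols)
{
    assert(rows > 0 && cols > 0);

    double **m = malloc((size_t)rows * sizeof *m);
    if (m == NULL)
        return NULL;

    m[0] = malloc((size_t)rows * (size_t)cols * sizeof **m);
    if (m[0] == NULL) {
        free(m);
        return NULL;
    }

    /* Point each row at its slice of the single block. */
    for (int i = 1; i < rows; i++)
        m[i] = m[0] + (size_t)i * cols;

    return m;
}

/* Release both the data block and the row-pointer table. */
void free_2d_contiguous(double **m)
{
    if (m != NULL) {
        free(m[0]);
        free(m);
    }
}
```

With this layout, m[0] can be handed to a row-major BLAS routine with a leading dimension of cols. If each row were allocated separately instead, m[0] would only cover the first row and the BLAS call would read unrelated memory.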

Related

uninitialized local variable with cin use

I am working on this code (in C++) and I have finished, but I get 2 errors on line 19 when I use the variables y and m in for loops, saying that they are uninitialized local variables. I don't see how this is possible, because I declared them at the beginning as int and their values are assigned when the user enters them via cin.
#include <iostream>
#include <string>
#include <cmath>
#include <math.h>
#include <vector>
using namespace std;
int main()
{
int a, b, n, l = 0;
cin >> a, b, n;
for (int i = 0; i < 20; i++)
{
for (int j = 0; j < 20; j++)
{
if (l < (i*a + j*b) && (i*a + j*b) <= n)
l = i*a + j*b;
}
}
cout << l;
return 0;
}
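I'm not in a position to test this, but "Multiple inputs on one line" suggests that your syntax should be
cin >> a >> b >> n;
In cin >> a, b, n; the commas are the comma operator, so the statement parses as (cin >> a), b, n; and only a is ever read, leaving b and n uninitialized. Regardless, I think the compiler is telling you that assignment to all variables isn't guaranteed by cin, so without explicit initialisation at the point of declaration you're assuming too much.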

Gaussian Elimination in OpenMP - Performance Problems

I'm new to OpenMP and I'm trying to parallelize a Gaussian elimination, but I'm having trouble with performance. I'm compiling the code below using:
gcc -o gaussian_elimination gaussian_elimination.c -lm -lgsl -lgslcblas -fopenmp -Wall
and setting the number of threads in the terminal with export OMP_NUM_THREADS.
My problem is that the parallel version of this code runs far slower than the serial version. I believe this is because I put #pragma omp parallel for inside the outer loop, which would force OpenMP to create and destroy threads at each iteration, and that would be incredibly costly. However, I haven't seen any other clear way to do this kind of operation, and I don't think I can exchange the outer loop with the inner parallel ones.
I'm probably missing something, but I have not found any other threads here that comment on this particular problem. As far as correctness goes, my code seems to work; the problem is purely performance.
Thanks in advance
#include <stdlib.h>
#include <stdio.h>
#include <math.h>
#include <omp.h>
#include <stdbool.h>
#include <time.h>
#include <gsl/gsl_linalg.h>
#include <gsl/gsl_rng.h>
#define DEBUG_MODE false
int random_matrix(double *A, int N,long long int seed);
int print_matrix(double *A, int N);
int print_vector(float *b,int N);
int main(int argc, char **argv){
int N=1000;
int i,j,k,l,i_p,s,err,D=N+1;
long long int seed=9089123498274; // just a fixed seed only not to bother
double *A,pivot,sw,tmp,begin,end,time_spent;
double *Aref,*bref;
gsl_matrix_view gsl_m;
gsl_vector_view gsl_b;
gsl_vector *gsl_x;
gsl_permutation *gsl_p;
/* Input */
//scanf("%d",&N);
A = (double*)malloc(N*(N+1)*sizeof(double));
if(A==NULL){
printf("Matrix A not allocated\n");
return 1;
}
Aref = (double*)malloc(N*N*sizeof(double));
if(Aref==NULL){
printf("Matrix Aref not allocated\n");
return 1;
}
bref = (double*)malloc(N*sizeof(double));
if(bref==NULL){
printf("Vector B not allocated\n");
return 2;
}
/*
for(i=0;i<N;i+=1)
for(j=0;j<N;j+=1)
scanf("%f",&(A[i*N+j]));
for(i=0;i<N;i+=1)
scanf("%f",&(b[i]));
*/
/*
for(i=0;i<N*N;i++)
A[i]=(float) a_data[i];
for(i=0;i<N;i+=1)
b[i]=(float) b_data[i]; */
err= random_matrix(A,N,seed);
if(err!=0)
return err;
for(i=0;i<N;i++)
for(j=0;j<N;j+=1)
Aref[i*N+j]= A[i*D+j];
for(i=0;i<N;i+=1)
bref[i]= A[i*D+N];//b[i];
printf("GSL reference:\n");
gsl_m = gsl_matrix_view_array (Aref, N, N);
gsl_b = gsl_vector_view_array (bref, N);
gsl_x = gsl_vector_alloc (N);
gsl_p = gsl_permutation_alloc(N);
begin = clock();
gsl_linalg_LU_decomp(&gsl_m.matrix, gsl_p, &s);
gsl_linalg_LU_solve(&gsl_m.matrix, gsl_p, &gsl_b.vector, gsl_x);
end = clock();
time_spent = (double)(end - begin) / CLOCKS_PER_SEC;
printf("gsl matrix solver: %lf s\n",time_spent);
if(DEBUG_MODE==true)
gsl_vector_fprintf(stdout,gsl_x,"%f");
gsl_permutation_free(gsl_p);
gsl_vector_free(gsl_x);
begin = omp_get_wtime();
for(i=0;i<N;i+=1){
i_p = i;
pivot = fabs(A[i*D+i]);
for(j=i;j<N;j+=1)
if(pivot<fabs(A[j*D+i])){
pivot = fabs(A[j*D+i]);
i_p = j;
}
#pragma omp parallel for shared(i,N,A,i_p) private(j,sw)
for(j=i;j<D;j+=1){
sw = A[i*D+j];
A[i*D+j] = A[i_p*D+j];
A[i_p*D+j] = sw;
}
pivot=A[i*D+i];
#pragma omp parallel for shared(i,D,pivot,A) private(j)
for(j=0;j<D;j++)
A[i*D+j]=A[i*D+j]/pivot;
#pragma omp parallel for shared(i,A,N,D) private(tmp,j,k,l)
for(j=i+1;j<N+i;j++){
k=j%N;
tmp=A[k*D+i];
for(l=0;l<D;l+=1)
A[k*D+l]=A[k*D+l]-tmp*A[i*D+l];
}
}
end = omp_get_wtime();
time_spent = (end - begin);
printf("omp matrix solver: %lf s\n",time_spent);
/* Output */
if(DEBUG_MODE==true){
printf("\nCalculated: \n");
for(i=0;i<N;i+=1)
printf("%.6f \n",A[i*(N+1)+N]);
printf("\n");
}
free(A);
free(Aref);
free(bref);
return 0;
}
int random_matrix(double *A, int N,long long int seed){
int i,j;
const gsl_rng_type * T;
gsl_rng *r;
gsl_rng_env_setup();
T = gsl_rng_default;
r = gsl_rng_alloc (T);
for(i=0;i<N;i++)
for(j=0;j<=N;j++)
A[i*(N+1)+j]= gsl_rng_uniform (r);
gsl_rng_free (r);
return 0;
}
int print_matrix(double *A, int N){
int i,j;
/* Each row holds N coefficients plus the right-hand side: N+1 entries */
for(i=0;i<N;i++){
for(j=0;j<=N;j++){
if(j==0 || j==N)
printf(" | ");
printf("%.2f ",A[i*(N+1)+j]);
}
printf(" |\n");
}
return 0;
}
int print_vector(float *b,int N){
int i;
for(i=0;i<N;i+=1)
printf("%f\n", b[i]);
return 0;
}
I updated the code above to use omp_get_wtime(), and now the measured wall time drops as I add more threads, so it does behave as it should, although not as cleanly as I would like.
For 1000 x 1000 matrices I get 0.25 s for the GSL lib, 4.4 s for the serial OpenMP run and 1.5 s for the 4-thread run.
For 3000 x 3000 matrices I get ~9 s for the GSL lib, ~117 s for the serial OpenMP run and ~44 s for the 4-thread run, so adding more threads does speed up the program!
Thanks a lot, everyone

How to include freetype library to Keil uVision 4?

I have to add the FreeType library to Keil uVision 4 to handle TTF font files.
I followed the steps in the Simple Glyph Loading Tutorial.
I am trying to compile the code below, called example1.c. I tried the tutorial in an Ubuntu terminal with the help of "Undefined reference to 'FT_Init_FreeType'", and it compiled without error.
Unfortunately, I don't know how to link the library in Keil.
It shows "Error: L6218E: Undefined symbol FT_Init_FreeType (referred from example1.o)."
Can anyone help me?
example1.c:
/* example1.c */
/* */
/* This small program shows how to print a rotated string with the */
/* FreeType 2 library. */
#include <stdio.h>
#include <stdlib.h> /* for exit() */
#include <string.h>
#include <math.h>
#include <ft2build.h>
#include FT_FREETYPE_H
#define WIDTH 640
#define HEIGHT 480
/* origin is the upper left corner */
unsigned char image[HEIGHT][WIDTH];
/* Replace this function with something useful. */
void
draw_bitmap( FT_Bitmap* bitmap,
FT_Int x,
FT_Int y)
{
FT_Int i, j, p, q;
FT_Int x_max = x + bitmap->width;
FT_Int y_max = y + bitmap->rows;
for ( i = x, p = 0; i < x_max; i++, p++ )
{
for ( j = y, q = 0; j < y_max; j++, q++ )
{
if ( i < 0 || j < 0 ||
i >= WIDTH || j >= HEIGHT )
continue;
image[j][i] |= bitmap->buffer[q * bitmap->width + p];
}
}
}
void
show_image( void )
{
int i, j;
for ( i = 0; i < HEIGHT; i++ )
{
for ( j = 0; j < WIDTH; j++ )
putchar( image[i][j] == 0 ? ' '
: image[i][j] < 128 ? '+'
: '*' );
putchar( '\n' );
}
}
int
main( int argc,
char** argv )
{
FT_Library library;
FT_Face face;
FT_GlyphSlot slot;
FT_Matrix matrix; /* transformation matrix */
FT_Vector pen; /* untransformed origin */
FT_Error error;
char* filename;
char* text;
double angle;
int target_height;
int n, num_chars;
if ( argc != 3 )
{
fprintf ( stderr, "usage: %s font sample-text\n", argv[0] );
exit( 1 );
}
filename = argv[1]; /* first argument */
text = argv[2]; /* second argument */
num_chars = strlen( text );
angle = ( 25.0 / 360 ) * 3.14159 * 2; /* use 25 degrees */
target_height = HEIGHT;
error = FT_Init_FreeType( &library ); /* initialize library */
/* error handling omitted */
error = FT_New_Face( library, filename, 0, &face );/* create face object */
/* error handling omitted */
/* use 50pt at 100dpi */
error = FT_Set_Char_Size( face, 50 * 64, 0,
100, 0 ); /* set character size */
/* error handling omitted */
slot = face->glyph;
/* set up matrix */
matrix.xx = (FT_Fixed)( cos( angle ) * 0x10000L );
matrix.xy = (FT_Fixed)(-sin( angle ) * 0x10000L );
matrix.yx = (FT_Fixed)( sin( angle ) * 0x10000L );
matrix.yy = (FT_Fixed)( cos( angle ) * 0x10000L );
/* the pen position in 26.6 cartesian space coordinates; */
/* start at (300,200) relative to the upper left corner */
pen.x = 300 * 64;
pen.y = ( target_height - 200 ) * 64;
for ( n = 0; n < num_chars; n++ )
{
/* set transformation */
FT_Set_Transform( face, &matrix, &pen );
/* load glyph image into the slot (erase previous one) */
error = FT_Load_Char( face, text[n], FT_LOAD_RENDER );
if ( error )
continue; /* ignore errors */
/* now, draw to our target surface (convert position) */
draw_bitmap( &slot->bitmap,
slot->bitmap_left,
target_height - slot->bitmap_top );
/* increment pen position */
pen.x += slot->advance.x;
pen.y += slot->advance.y;
}
show_image();
FT_Done_Face ( face );
FT_Done_FreeType( library );
return 0;
}
Create a new project "freetype". In the project settings, change the "Output" to a static library.
Add the freetype sources to the project, and build. Do not use your "amalgamated" source file - that will destroy the library granularity and lead to excessively large code.
Add the resulting freetype.lib file to your application project. The linker will select only those modules from the library that are necessary to resolve references in your application thus keeping size to a minimum.
You may get smaller code size from including the freetype source directly in your application and using cross-module optimisation (this will work regardless of the use of separate compilation or the amalgamated file); however the build time may be excessive as it requires repeated full-builds to fully optimise. Note that unlike compiler-optimisation, cross-module optimisation does not affect the debugging experience - you can use the debugger normally even with it enabled.
EDIT :
The cross-module optimisation feature may not apply when using the GNU toolchain; it refers to the use of Keil MDK-ARM which uses ARM's RealView toolchain. Other aspects of this answer may also be applicable only to MDK-ARM.
After a lot of research I found an alternative solution to the problem: the freetype amalgamate project, which fits this case exactly.
There, all the source files are amalgamated into two files, one ".c" file and one ".h" file, so it can easily be integrated into any other project.
Here is the link for freetype amalgamate.
Thank you.

Simulating a card game. degenerate suits

The title may be a bit cryptic, but I have a very specific problem. First, my current setup:
In my card simulator I deal 32 cards to 4 players in sets of 8, so 8 cards per player,
with the 4 standard suits (spades, hearts, etc.).
My current implementation cycles through all combinations of 8 out of 32,
which gives me a large number of possibilities.
The first player can be dealt 10518300 different hands.
The second can then be dealt 735471 different hands.
The third player can then be dealt 12870 different hands,
and finally the fourth can be dealt only 1,
giving a grand total of 9.9561092e+16 unique ways to deal a deck of 32 cards to 4 players if the order of the cards doesn't matter.
On a 4 GHz processor, even at 1 tick per possibility, going through all of them would take about 288 days.
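The counts above can be checked with a few lines of C. This is just a sketch of the arithmetic; choose_u64, count_deals and days_at_rate are names made up for this example. At one deal per cycle on a 4 GHz clock the total works out to roughly 288 days.

```c
#include <assert.h>
#include <stdint.h>

/* Binomial coefficient C(n, k) in 64-bit arithmetic.
   After step i the running value is exactly C(n - k + i, i),
   so every division is exact. */
uint64_t choose_u64(unsigned n, unsigned k)
{
    assert(k <= n);
    if (k > n - k)
        k = n - k;              /* C(n, k) == C(n, n - k) */

    uint64_t r = 1;
    for (unsigned i = 1; i <= k; i++)
        r = r * (n - k + i) / i;
    return r;
}

/* Ways to deal 32 cards as four unordered hands of 8. */
uint64_t count_deals(void)
{
    return choose_u64(32, 8) * choose_u64(24, 8)
         * choose_u64(16, 8) * choose_u64(8, 8);
}

/* Days needed to visit every deal at the given rate (deals per second). */
double days_at_rate(uint64_t deals, double per_second)
{
    return (double)deals / per_second / 86400.0;
}
```

For example, days_at_rate(count_deals(), 4e9) gives the time estimate for the one-tick-per-deal assumption.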
However, I would like to simplify the dealing by treating diamonds, hearts and spades as interchangeable, meaning that dealing 8 hearts to player 1 is equivalent to dealing 8 spades. (Note that this doesn't apply to clubs.)
I am looking for a way to generate this, because it will cut down the possibilities for the first hand by close to a factor of 6 (the 3! orderings of the three interchangeable suits). My current implementation is in C++,
but feel free to answer in a different language.
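One possible way to express the suit symmetry, sketched here in C under assumptions that are not in the question: a hand is stored as one 8-bit rank mask per suit (the Hand struct is hypothetical), and a hand counts as canonical when the masks of the three interchangeable suits are in non-increasing order. Every hand then has exactly one canonical relabelling, so enumerating only canonical first hands skips the equivalent ones. This covers the first hand in isolation; for the later players the same relabelling would have to be applied to the whole deal, and it does not by itself provide the mapping back from a lexicographic index.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical hand layout: bit r of a mask is set if the hand holds
   rank r of that suit. */
typedef struct {
    uint8_t clubs;      /* the distinguished suit, never relabelled */
    uint8_t other[3];   /* diamonds, hearts, spades: interchangeable */
} Hand;

/* Put the larger mask first. */
static void order_pair(uint8_t *hi, uint8_t *lo)
{
    if (*hi < *lo) {
        uint8_t t = *hi;
        *hi = *lo;
        *lo = t;
    }
}

/* A hand is canonical when the three interchangeable masks are sorted. */
int is_canonical(Hand h)
{
    return h.other[0] >= h.other[1] && h.other[1] >= h.other[2];
}

/* Relabel the three interchangeable suits so the masks are sorted. */
Hand canonical_hand(Hand h)
{
    /* three compare-and-swap steps sort three values */
    order_pair(&h.other[0], &h.other[1]);
    order_pair(&h.other[1], &h.other[2]);
    order_pair(&h.other[0], &h.other[1]);
    assert(is_canonical(h));
    return h;
}
```

Two hands that differ only by swapping diamonds, hearts and spades map to the same canonical hand, so a simulation can either skip hands for which is_canonical is false or key its results on canonical_hand.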
/** http://stackoverflow.com/a/9331125 */
unsigned cjasMain::nChoosek( unsigned n, unsigned k )
{
//assert(k < n);
if (k > n) return 0;
if (k * 2 > n) k = n-k;
if (k == 0) return 1;
int result = n;
for( int i = 2; i <= k; ++i ) {
result *= (n-i+1);
result /= i;
}
return result;
}
/** [combination c n p x]
* get the [x]th lexicographically ordered set of [r] elements in [n]
* output is in [c], and should be sizeof(int)*[r]
* http://stackoverflow.com/a/794 */
void cjasMain::Combination(int8_t* c,unsigned n,unsigned r, unsigned x){
++x;
assert(x>0);
int i,p,k = 0;
for(i=0;i<r-1;i++){
c[i] = (i != 0) ? c[i-1] : 0;
do {
c[i]++;
p = nChoosek(n-c[i],r-(i+1));
k = k + p;
} while(k < x);
k = k - p;
}
c[r-1] = c[r-2] + x - k;
}
/**http://stackoverflow.com/a/9430993 */
template <unsigned n,std::size_t r>
void cjasMain::Combinations()
{
static_assert(n>=r,"error: n needs to be at least r");
std::vector<bool> v(n);
std::fill(v.begin() + r, v.end(), true);
do
{
for (int i = 0; i < n; ++i)
{
if (!v[i])
{
COUT << (i+1) << " ";
}
}
static int j=0;
COUT <<'\t'<< j++<< "\n";
}
while (std::next_permutation(v.begin(), v.end()));
return;
}
A requirement is that I can get back the original array from the lexicographical number.
Even the slightest optimization can help my Monte Carlo simulation, I hope.

Create a Fraction array

I have to create a dynamic array capable of holding 2*n Fractions.
If the dynamic array cannot be allocated, the function prints a message and calls exit(1).
It then fills the array with reduced random Fractions whose numerator
is between 1 and 20, inclusive, and whose initial denominator
is between 2 and 20, inclusive.
I already wrote the function that creates a fraction and reduces it; this is what I have so far. When I compile and run this program it crashes, and I can't find out why. If I put 1 instead of 10 in test.c it doesn't crash, but it gives me a crazy fraction. If I put 7, 8, or 11 in test.c it crashes. I would appreciate it if someone could help me.
FractionSumTester.c
Fraction randomFraction(int minNum, int minDenom, int max)
{
Fraction l;
Fraction m;
Fraction f;
l.numerator = randomInt(minNum, max);
l.denominator = randomInt(minDenom, max);
m = reduceFraction(l);
while (m.denominator <= 1)
{
l.numerator = randomInt(minNum, max);
l.denominator = randomInt(minDenom, max);
m = reduceFraction(l);
}
return m;
}
Fraction *createFractionArray(int n)
{
Fraction *p;
int i;
p = malloc(n * sizeof(Fraction));
if (p == NULL)
{
printf("error");
exit(1);
}
for(i=0; i < 2*n ; i++)
{
p[i] = randomFraction(1,2,20);
printf("%d/%d\n", p[i].numerator, p[i].denominator);
}
return p;
}
This is what I am using to test these two functions.
test.c
#include "Fraction.h"
#include "FractionSumTester.h"
#include <stdio.h>
int main()
{
createFractionArray(10);
return 0;
}
In your createFractionArray() function, you malloc() space for n items. Then, in the for loop, you write 2*n items into that space, which overruns your buffer and causes the crash. Allocate room for 2*n Fractions instead, i.e. malloc(2 * n * sizeof(Fraction)).
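A minimal sketch of the corrected allocation follows. The question does not show Fraction, randomInt or reduceFraction, so the versions here are stand-ins written for this example and may differ from the real ones; the point is only that the allocation size and the loop bound both use 2*n.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Stand-in definitions: assumptions, not the question's actual code. */
typedef struct {
    int numerator;
    int denominator;
} Fraction;

static int gcd(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

static int randomInt(int min, int max)
{
    return min + rand() % (max - min + 1);
}

static Fraction reduceFraction(Fraction f)
{
    int g = gcd(f.numerator, f.denominator);
    f.numerator /= g;
    f.denominator /= g;
    return f;
}

/* Draw fractions until the reduced one still has a denominator above 1. */
static Fraction randomFraction(int minNum, int minDenom, int max)
{
    Fraction f;
    do {
        f.numerator = randomInt(minNum, max);
        f.denominator = randomInt(minDenom, max);
        f = reduceFraction(f);
    } while (f.denominator <= 1);
    return f;
}

/* Allocate and fill 2*n Fractions; the caller must free() the result. */
Fraction *createFractionArray(int n)
{
    assert(n > 0);
    size_t count = 2 * (size_t)n;           /* same bound for malloc and loop */

    Fraction *p = malloc(count * sizeof *p);
    if (p == NULL) {
        fprintf(stderr, "error: could not allocate %zu Fractions\n", count);
        exit(1);
    }
    for (size_t i = 0; i < count; i++)
        p[i] = randomFraction(1, 2, 20);
    return p;
}
```

Note that test.c in the question discards the returned pointer; keeping it and calling free() on it when done avoids leaking the array.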