I have a qp problem:
Minimize: -5x0 - x1 - 4x2 - 5x5 + 1000x0x2 + 1000x1x2 + 1000x0x3
+ 1000x1x3 + 1000x0x4 +1000x1x4
Subject to: x0>=0 x1>=0 x2>=0 x3>=0 x4>=0 x5>=0
x0+x1+x5<=5
x2+x3+x4<=5
The answer should be X0=0 X1=0 X2=5 X3=0 X4=0 X5=5 and obj=-45.
But CGAL gives me X0=5 X1=0 X2=0 X3=0 X4=0 X5=0 and obj=-25.
The code is pasted as follows:
Any suggestion would be appreciated.
Kelly
#include <iostream>
#include <climits>
#include <cassert>
#include <CGAL/basic.h>
#include <CGAL/QP_models.h>
#include <CGAL/QP_functions.h>
// choose exact integral type
#ifdef CGAL_USE_GMP
#include <CGAL/Gmpz.h>
typedef CGAL::Gmpz ET;
#else
#include <CGAL/MP_Float.h>
typedef CGAL::MP_Float ET;
#endif
using namespace std;
// program and solution types
typedef CGAL::Quadratic_program<int> Program;
typedef CGAL::Quadratic_program_solution<ET> Solution;
int
main(){
Program qp (CGAL::SMALLER, true, 0.0, false, 0.0);
qp.set_c(0, -5);
qp.set_c(1, -1);
qp.set_c(2, -4);
qp.set_c(5, -5);
int g = 1000;
qp.set_d(2, 0, g);
qp.set_d(2, 1, g);
qp.set_d(3, 0, g);
qp.set_d(3, 1, g);
qp.set_d(4, 0, g);
qp.set_d(4, 1, g);
int nRow = 0;
qp.set_a(0, nRow, 1.0);
qp.set_a(1, nRow, 1.0);
qp.set_a(5, nRow, 1.0);
qp.set_b(nRow, 5);
nRow++;
qp.set_a(2, nRow, 1.0);
qp.set_a(3, nRow, 1.0);
qp.set_a(4, nRow, 1.0);
qp.set_b(nRow, 5);
Solution s = CGAL::solve_quadratic_program(qp, ET());
assert (s.solves_quadratic_program(qp));
CGAL::print_nonnegative_quadratic_program(std::cout, qp, "first_qp");
std::cout << s;
return 0;
}
Since you matrix D (quadratic objective function) is not positive semi-definite, your result isn't so surprising. CGAL does not guarantee convergence towards a global minimum but towards a local one. What you obtain is a local minimum respecting the constraints you imposed.
If you set minimum bounds for x2 and x5 at 1 by writing qp.set_l(2,true,1); qp.set_l(5,true,1);, you will see that you converge towards the solution that you computed.
Related
I am learning GLSL through Unity and I recently came across a problem involving the storing of variables.
Shader "Shader" {
Properties{
_Hole_Position("hole_Position", Vector) = (0., 0., 0., 1.0)
_Hole_EventHorizonDistance("hole_EventHorizonDistance", Float) = 1.0
_DebugValue("debugValue", Float) = 0.0
}
SubShader{
Pass{
GLSLPROGRAM
uniform mat4 _Object2World;
//Variables
varying float debugValue;
varying vec4 pos;
varying vec4 hole_Position;
varying float hole_EventHorizonDistance = 1;
#ifdef VERTEX
void main()
{
pos = _Object2World * gl_Vertex;
gl_Position = gl_ModelViewProjectionMatrix * gl_Vertex;
}
#endif
#ifdef FRAGMENT
void main()
{
float dist = distance(vec4(pos.x, 0.0,pos.z, 1.0), vec4(hole_Position.x, 0.0, hole_Position.z, 1.0));
debugValue = dist;
if (dist < hole_EventHorizonDistance)
{
gl_FragColor = vec4(0.3, 0.3, 0.3, 1.0);
}
else
{
gl_FragColor = vec4(0.4, 0.6, 1.0, 1.0);
}
//gl_FragColor = vec4(hole_EventHorizonDistance, 0, 0, 1.0);
}
#endif
ENDGLSL
}
}
}
Now Hole_Position and EventHorizonDistance are changed from an outside C#-script with:
g.GetComponent<Renderer>().sharedMaterial.SetVector("_Hole_Position", new Vector4(transform.position.x, transform.position.y, transform.position.z, 1));
g.GetComponent<Renderer>().sharedMaterial.SetFloat("_Hole_EventHorizonDistance", 2);
this does not work as I intend it too (by changing the fragments color if its position is within 2 units from Hole_Position. However debugging with:
gl_FragColor = vec4(hole_EventHorizonDistance, 0, 0, 1.0);
seemingly suggests that EventHorizon is 0 at all times (the mesh tested on remains completely black), however debugging by getting and printing the variable from an outside (via
print(g.GetComponent<Renderer>().sharedMaterial.GetFloat("_Hole_EventHorizonDistance"));
) tells me EventHorizonDistance = 2. I cannot wrap my head around why this is the case, why is it so?
I have a small piece of testing code which calculates the dot products of two vectors with a third vector using AVX instructions (A dot C and B dot C below). It also adds the two products, but that is just to make the function return something for this example.
#include <iostream>
#include <immintrin.h>
double compute(const double *x)
{
__m256d A = _mm256_loadu_pd(x);
__m256d B = _mm256_loadu_pd(x + 4);
__m256d C = _mm256_loadu_pd(x + 8);
__m256d c1 = _mm256_mul_pd(A, C);
__m256d c2 = _mm256_mul_pd(B, C);
__m256d tmp = _mm256_hadd_pd(c1, c2);
__m128d lo = _mm256_extractf128_pd(tmp, 0);
__m128d hi = _mm256_extractf128_pd(tmp, 1);
__m128d dotp = _mm_add_pd(lo, hi);
double y[2];
_mm_store_pd(y, dotp);
return y[0] + y[1];
}
int main(int argc, char *argv[])
{
const double v[12] = {0.3, 2.9, 1.3, 4.0, -1.0, -2.1, -3.0, -4.0, 0.0, 2.0, 1.3, 1.2};
double x = 0;
std::cout << "AVX" << std::endl;
x = compute(v);
std::cout << "x = " << x << std::endl;
return 0;
}
When I compile as
clang++ -O3 -mavx main.cc -o main
everything works fine. If I enable link time optimization:
clang++ -flto -O3 -mavx main.cc -o main
I get the following error "LLVM ERROR: Do not know how to split the result of this operator!". I have narrowed the culprit to the _mm256_hadd_pd statement. If this is exchanged with e.g. _m256_add_pd link time optimization works again. I realize that this is a silly example to use link-time optimization for, but the error ocurred in a different context where it link-time optimization is extremely helpful.
Can anyone explain what is going on here?
I am trying to port x11 to arm processor. so i am using Imlib2 library for jpeg pictures. I have successfully cross compiled Imlib2 library with x windows to arm.My sample program also built successfully.But when i run that binary jpeg is not displaying properly.(it shows image upload error);
#
include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <X11/Xlib.h>
#include <Imlib2.h>
int main(int argc, char **argv)
{
Imlib_Image img;
Display *dpy;
Pixmap pix;
Window root;
Screen *scn;
int width, height;
const char *filename = NULL;
if (argc < 2)
return 0;
filename = argv[1];
img = imlib_load_image(filename);
printf("img values %x",img);
if (!img) {
fprintf(stderr, "%s:Unable to load image\n", filename);
return 0;
}
imlib_context_set_image(img);
width = imlib_image_get_width();
height = imlib_image_get_height();
dpy = XOpenDisplay(NULL);
if (!dpy)
return 0;
scn = DefaultScreenOfDisplay(dpy);
root = DefaultRootWindow(dpy);
pix = XCreatePixmap(dpy, root, width, height,
DefaultDepthOfScreen(scn));
printf("caal");
imlib_context_set_display(dpy);
imlib_context_set_visual(DefaultVisualOfScreen(scn));
imlib_context_set_colormap(DefaultColormapOfScreen(scn));
imlib_context_set_drawable(pix);
imlib_render_image_on_drawable(0, 0);
Window w = XCreateSimpleWindow(dpy, root, 0, 0, width, height, 0, None, None);
XSelectInput(dpy, w, ExposureMask);
XMapWindow(dpy, w);
//XCopyPlane(dpy, pix, w, gc, 0, 0, width, height, 0, 0, DefaultDepthOfScreen(scn));
XFlush(dpy);
XSetWindowBackgroundPixmap(dpy, w, pix);
XClearWindow(dpy, root);
XEvent ev;
while (XNextEvent(dpy, &ev)) {
if( ev.type == Expose )
{
//XCopyPlane(dpy, pix, w, gc, 0, 0, width, height, 0, 0, DefaultDepthOfScreen(scn));
printf("Expose called\n");
}
}
sleep(20);
XFreePixmap(dpy, pix);
imlib_free_image();
XCloseDisplay(dpy);
return 0;
fprintf(stderr, "usage: %s <image_file>\n", argv[0]);
return 1;
}
I'm trying to multiply two matrices using vecLibs' cblas:
#include <stdlib.h>
#include <string.h>
#include <stdio.h>
#include <vecLib/cblas.h>
int main (void) {
float *A = malloc(sizeof(float) * 2 * 3);
float *B = malloc(sizeof(float) * 3 * 1);
float *C = malloc(sizeof(float) * 2 * 1);
cblas_sgemm(CblasRowMajor,
CblasNoTrans,
CblasNoTrans,
2,
1,
3,
1.0,
A, 2,
B, 3,
0.0,
C, 2);
printf ("[ %f, %f]\n", C[0], C[1]);
return 0;
}
According to the docs every argument seems to match yet I get this error:
lda must be >= MAX(K,1): lda=2 K=3BLAS error: Parameter number 9 passed to cblas_sgemm had an invalid value
The error you are seeing seems perfectly correct to my eyes.
LDA is always the pitch of the array A in linear memory. If you are using row major storage order, the pitch will be the number of columns, not the number of rows. So LDA should be 3 in this case.
I want to calculate the product A^T*A ( A is 2000x1000 Matrix). Also i only want to solve the upper triangular Matrix. In the inner loop i have to solve the dot product of two vectors.
Now, here is the problem. Using cblas ddot() is not faster than calculating the dot product with a loop. How is this possible? (using Intel Core (TM)i7 CPU M620 #2,67GHz, 1,92GB RAM)
The problem is caused essentially by matrix size, not by ddot. Your matrices are so large that they do not fit in the cache memory. The solution is to rearrange the three nested loops such that as much as possible can be done with a line in cache, so reducing cache refreshes. A model implementation follows for both the ddot and an daxpy approach. On my computer the time consumption was about 15:1.
In other words: never, never, never program a matrix multiplication along the "row times column" scheme that we learned in school.
/*
Matrix product of A^T * A by two methods.
1) "Row times column" as we learned in school.
2) With rearranged loops such that need for cash refreshes is reduced
(this can be improved even more).
Compile: gcc -o aT_a aT_a.c -lgslcblas -lblas -lm
*/
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <cblas.h>
#define ROWS 2000
#define COLS 1000
static double a[ROWS][COLS];
static double c[COLS][COLS];
static void dot() {
int i, j;
double *ai, *bj;
ai = a[0];
for (i=0; i<COLS; i++) {
bj = a[0];
for (j=0; j<COLS; j++) {
c[i][j] = cblas_ddot(ROWS,ai,COLS,bj,COLS);
bj += 1;
}
ai += 1;
}
}
static void axpy() {
int i, j;
double *ci, *bj, aij;
for (i=0; i<COLS; i++) {
ci = c[i];
for (j=0; j<COLS; j++) ci[j] = 0.;
for (j=0; j<ROWS; j++) {
aij = a[j][i];
bj = a[j];
cblas_daxpy(COLS,aij,bj,1,ci,1);
}
}
}
int main(int argc, char** argv) {
clock_t t0, t1;
int i, j;
for (i=0; i<ROWS; ++i)
for (j=0; j<COLS; ++j)
a[i][j] = i+j;
t0 = clock();
dot();
t0 = clock();
printf("Time for DOT : %f sec.\n",(double)t0/CLOCKS_PER_SEC);
axpy();
t1 = clock();
printf("Time for AXPY: %f sec.\n",(double)(t1-t0)/CLOCKS_PER_SEC);
return 0;
}
The CBLAS dot product is effectively just a computation in slightly unrolled loop. The netlib Fortran is just this:
DO I = MP1,N,5
DTEMP = DTEMP + DX(I)*DY(I) + DX(I+1)*DY(I+1) +
$ DX(I+2)*DY(I+2) + DX(I+3)*DY(I+3) + DX(I+4)*DY(I+4)
END DO
ie. just a loop unrolled to a stride of 5.
If you must use a ddot style dot product for your operation, you might get a performance boost by re-writing your loop to use SSE2 intrinsics:
#include <emmintrin.h>
double ddotsse2(const double *x, const double *y, const int n)
{
double result[2];
int n2 = 2 * (n/2);
__m128d dtemp;
if ( (n % 2) == 0) {
dtemp = _mm_setzero_pd();
} else {
dtemp = _mm_set_sd(x[n] * y[n]);
}
for(int i=0; i<n2; i+=2) {
__m128d x1 = _mm_loadr_pd(x+i);
__m128d y1 = _mm_loadr_pd(y+i);
__m128d xy = _mm_mul_pd(x1, y1);
dtemp = _mm_add_pd(dtemp, xy);
}
_mm_store_pd(&result[0],dtemp);
return result[0] + result[1];
}
(not tested, never been compiled, buyer beware).
This may or may be faster than the standard BLAS implementation. You may also want to investigate whether further loop unrolling could improve performance.
If you're not using SSE2 intrinsics or using a data type that may not boost performance with them, you can try to transpose the matrix for an easy improvement in performance for larger matrix multiplications with cblas_?dot. Performing the matrix multiplication in blocks also helps.
void matMulDotProduct(int n, float *A, float* B, int a_size, int b_size, int a_row, int a_col, int b_row, int b_col, float *C) {
int i, j, k;
MKL_INT incx, incy;
incx = 1;
incy = b_size;
//copy out multiplying matrix from larger matrix
float *temp = (float*) malloc(n * n * sizeof(float));
for (i = 0; i < n; ++i) {
cblas_scopy(n, &B[(b_row * b_size) + b_col + i], incy, &temp[i * n], 1);
}
//transpose
mkl_simatcopy('R', 'T', n, n, 1.0, temp, 1, 1);
for (i = 0; i < n; i+= BLOCK_SIZE) {
for (j = 0; j < n; j++) {
for (k = 0; k < BLOCK_SIZE; ++k) {
C[((i + k) * n) + j] = cblas_sdot(n, &A[(a_row + i + k) * a_size + a_col], incx, &temp[n * j], 1);
}
}
}
free(temp);
}
On my machine, this code is about 1 order of magnitude faster than the the 3 loop code (but also 1 order of magnitude slower than cblas_?gemm call) for single precision floats and 2K by 2K matrices. (I'm using Intel MKL).