Currently I'm trying to port Keith Rule's Texas Holdem Hand Evaluator to Omaha Hi:
Texas Holdem Evaluator and Analysis
More Analysis Part1
More Analysis Part 2
After thinking more about the algorithm, I found a solution that gives me the right percentages for the hands, and everything works.
But it's really slow. How can I speed things up?
Since the only thing I do right now is look up ordinary five-card hands, a lookup table (LUT) might be right for me. Has anyone integrated one before?
static void Main(string[] args)
{
long count = 0;
double player1win = 0.0, player2win=0.0;
ulong player1 = Hand.ParseHand("Ad Kd As Ks");
ulong player2 = Hand.ParseHand("Th 5c 2c 7d");
foreach (ulong board in Hand.Hands(0, player1 | player2, 5))
{
uint maxplayer1value = 0, maxplayer2value = 0;
foreach (ulong boardcards in Hand.Hands(0, ulong.MaxValue ^ board, 3))
{
foreach (ulong player1hand in Hand.Hands(0Ul, ulong.MaxValue ^ player1, 2))
{
uint player1value = Hand.Evaluate(player1hand | boardcards, 5);
if (player1value > maxplayer1value) maxplayer1value = player1value;
}
}
foreach (ulong boardcards in Hand.Hands(0, ulong.MaxValue ^ board, 3))
{
foreach (ulong player2hand in Hand.Hands(0UL, ulong.MaxValue ^ player2, 2))
{
uint player2value = Hand.Evaluate(player2hand | boardcards, 5);
if (player2value > maxplayer2value) maxplayer2value = player2value;
}
}
if (maxplayer1value > maxplayer2value)
{
player1win += 1.0;
}
else if (maxplayer2value > maxplayer1value)
{
player2win += 1.0;
}
else
{
player1win += 0.5;
player2win += 0.5;
}
count++;
}
Console.WriteLine("Player1: {0:0.0000} Player2: {1:0.0000} Count: {2}", player1win / count * 100, player2win / count * 100, count);
Console.ReadLine();
}
It looks like you're trying to create an equity calculator. I've done this as well, but for Texas Hold'em rather than Omaha. With ten players to evaluate, I get about 200K hands per second, which gives a sufficiently accurate result in no time. If there are only two players to evaluate, I can get up to 4 million evaluations per second.
I used bitmasks for hands: one 64-bit integer represents a card, a hand, or the entire board. You only actually need 52 of its bits, obviously. With bitwise operators, things go rather quickly. Here's a quick sample from my project (in C++, though). It uses the 2 + 2 evaluator for fast look-ups:
while (trial < trials) {
/** I use here a linked list over the hand-distributions (players).
* This is kind of natural as well, as circle is the basic
* shape of poker.
*/
pDist = pFirstDist;
unsigned __int64 usedCards = _deadCards;
bool collision;
/** Here, we choose random distributions for the comparison.
* There is a chance, that two separate distributions has
* the same card being picked-up. In that case, we have a collision,
* so do the choosing again.
*/
do {
pDist->Choose(usedCards, collision);
/** If there is only one hand in the distribution (unary),
* there is no need to check over collision, since it's been
* already done in the phase building them (distributions).
*/
if (pDist->_isUnary)
collision = false;
pDist = pDist->_pNext;
} while (pDist != pFirstDist && !collision);
if (collision) {
/** Oops! Collision occurred! Take the next player (hand
* distribution) and do this all over again.
*/
pFirstDist = pDist->_pNext;
continue;
}
unsigned __int64 board = 0;
/** Pick a board from the hashed ones, until it's unique compared to
* the distributions.
*
*/
do {
if (count == 1) {
board = boards[0];
collision = false;
} else {
board = boards[Random()];
collision = (board & usedCards) != 0;
}
} while (collision);
board |= _boardCards;
int best = 0, s = 1;
do {
pDist->_currentHand |= board;
unsigned long i, l = static_cast<unsigned long>(pDist->_currentHand >> 32);
int p;
bool f = false;
/** My solution to find out the set bits.
* Since I'm working on a 32-bit environment, the "64-bit"
* variable needs to be split in to parts.
*/
if (_BitScanForward(&i, l)) {
p = _evaluator->_handRanks[53 + i + 32]; // Initial entry to the 2 + 2 evaluator hash.
l &= ~(static_cast<unsigned long>(1) << i);
f = true;
}
if (f)
while (_BitScanForward(&i, l)) {
l &= ~(static_cast<unsigned long>(1) << i);
p = _evaluator->_handRanks[p + i + 32];
}
l = static_cast<unsigned long>(pDist->_currentHand & 0xffffffff);
if (!f) {
_BitScanForward(&i, l);
p = _evaluator->_handRanks[53 + i];
l &= ~(static_cast<unsigned long>(1) << i);
}
while (_BitScanForward(&i, l)) {
l &= ~(static_cast<unsigned long>(1) << i);
p = _evaluator->_handRanks[p + i];
}
pDist->_rank = p;
/** Keep the statistics up. Please do remember, that
* equity consist of ties as well, so it's not a percentual
* chance of winning.
*/
if (p > best) {
pWinner = pDist;
s = 1;
best = p;
} else if (p == best)
++s;
pDist = pDist->_pNext;
} while (pDist != pFirstDist);
if (s > 1) {
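for (unsigned int i = 0; i < _numDistributions; ++i) // _numDistributions is a placeholder; the original loop bound is not shown
if (_handDistributions[i]->_rank == best) {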
_handDistributions[i]->_ties += 1.0f / s;
_handDistributions[i]->_equity += 1.0f / s;
}
} else {
++pWinner->_wins;
++pWinner->_equity;
}
++trial;
pFirstDist = pDist->_pNext;
}
Please refer to the 2 + 2 evaluator, which is quite easy to adapt to your own needs.
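The bitmask representation described above can be sketched in a few lines. This is only a minimal illustration, not the answer's actual project code: the bit layout (suit * 13 + rank) and the names CardMask, cardBit, collides and cardCount are assumptions made for the example.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: bit index = suit * 13 + rank,
// with rank 0 = deuce ... 12 = ace and suit 0..3. Only 52 of the 64 bits are used.
using CardMask = std::uint64_t;

// Mask with exactly one bit set for the given card.
inline CardMask cardBit(int rank, int suit) {
    assert(rank >= 0 && rank < 13 && suit >= 0 && suit < 4);
    return CardMask{1} << (suit * 13 + rank);
}

// Two masks collide when they share at least one card.
inline bool collides(CardMask a, CardMask b) {
    return (a & b) != 0;
}

// Portable population count (Kernighan's method): one iteration per set bit.
inline int cardCount(CardMask m) {
    int n = 0;
    while (m != 0) {
        m &= m - 1;  // clear the lowest set bit
        ++n;
    }
    return n;
}
```

Hands and boards are combined with `|`, and a dealt card is checked against the used cards with `collides`, which mirrors the `(board & usedCards) != 0` test in the sample above.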
This might help:
An example of a ready-made Objective-C (and Java) Texas Hold'em 7- and 5-card evaluator can be found here and is further explained here. It "adds" up hands to generate an index that sufficiently characterises the hand for determining its rank.
All feedback is welcome at the e-mail address found therein.
For example, the JDK method java.lang.Integer.numberOfLeadingZeros(int):
public static int numberOfLeadingZeros(int i) {
// HD, Figure 5-6
if (i == 0)
return 32;
int n = 1;
if (i >>> 16 == 0) { n += 16; i <<= 16; }
if (i >>> 24 == 0) { n += 8; i <<= 8; }
if (i >>> 28 == 0) { n += 4; i <<= 4; }
if (i >>> 30 == 0) { n += 2; i <<= 2; }
n -= i >>> 31;
return n;
}
What does the code comment 'HD, Figure 5-6' mean?
HD = Hacker's Delight. See the Javadoc:
Implementation note: The implementations of the "bit twiddling" methods (such as highestOneBit and numberOfTrailingZeros) are based on material from Henry S. Warren, Jr.'s Hacker's Delight, (Addison Wesley, 2002).
There are also such comments in java.lang.Long and java.lang.Math.
For example, the addExact method in java.lang.Math:
public static int addExact(int x, int y) {
int r = x + y;
// HD 2-12 Overflow iff both arguments have the opposite sign of the result
if (((x ^ r) & (y ^ r)) < 0) {
throw new ArithmeticException("integer overflow");
}
return r;
}
For more information on Hacker's Delight, see http://hackersdelight.org/
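To see why the sign test quoted from addExact works, here is a minimal C++ sketch of the same check. It is an illustration only, not JDK code: the function names addOverflows and checkedAdd are made up for the example, and the addition is done in unsigned arithmetic because signed overflow is undefined behaviour in C++.

```cpp
#include <cassert>
#include <cstdint>

// Returns true when x + y would not fit in a 32-bit signed integer.
inline bool addOverflows(std::int32_t x, std::int32_t y) {
    // Unsigned addition wraps around in a well-defined way.
    const std::uint32_t ux = static_cast<std::uint32_t>(x);
    const std::uint32_t uy = static_cast<std::uint32_t>(y);
    const std::uint32_t ur = ux + uy;
    // The sign bit of (x ^ r) & (y ^ r) is set exactly when both
    // arguments have the opposite sign of the result.
    return (((ux ^ ur) & (uy ^ ur)) >> 31) != 0;
}

// Adds two values, asserting in debug builds that the sum is representable.
inline std::int32_t checkedAdd(std::int32_t x, std::int32_t y) {
    assert(!addOverflows(x, y));
    return x + y;
}
```

Two operands of different sign can never overflow, which is why the test only fires when both differ in sign from the wrapped result.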
On my Arduino Mega 2560, I'm trying to run a motor that turns a 20-vial container (accepting int input 1-20) while regulating temperature via PID of a separate cooler. I am generally new to this field of technology so bear with me. I also have an interrupt set up for an encoder to keep track of vial position.
The void serialEvent() and void loop() are the most important portions to look at, but I decided to put the rest of the code in there just in case you needed to see it.
#include <PID_v1.h>
#include <SPI.h>
#include <TMC26XStepper.h>
#define COOL_INPUT 0
#define PIN_OUTPUT 9
TMC26XStepper tmc26XStepper = TMC26XStepper(200,5,7,6,500);
int step = 6;
int value;
int i;
char junk = ' ';
volatile long enc_count = 0;
const byte interruptPinA = 2;
const byte interruptPinB = 3;
//Define Variables we'll be connecting to
int outMax = 255;
int outMin = -145;
double Setpoint, Input, Output;
double heatInput, heatOutput, originalInput;
//Specify the links and initial tuning parameters
// AGGRESSIVE VALUES (to get to 4 deg C)
double aggKp=8.0, aggKi=3.0, aggKd=0.15;
// CONSERVATIVE VALUES (to hover around 4 deg C)
double consKp=2.5, consKi = 0.0, consKd = 1.0;
PID myPID(&Input, &Output, &Setpoint, aggKp, aggKi, aggKd, REVERSE);
void setup()
{
pinMode(step, OUTPUT);
pinMode(interruptPinA, INPUT_PULLUP);
pinMode(interruptPinB, INPUT_PULLUP);
attachInterrupt(digitalPinToInterrupt(interruptPinA), encoder_isr, CHANGE);
attachInterrupt(digitalPinToInterrupt(interruptPinB), encoder_isr, CHANGE);
//initialize the variables we're linked to
Input = (5.0*analogRead(COOL_INPUT)*100.0) / 1024;
Setpoint = 10.75;
myPID.SetOutputLimits(outMin, outMax);
//turn the PID on
myPID.SetMode(AUTOMATIC);
Serial.begin(115200);
tmc26XStepper.setSpreadCycleChopper(2,24,8,6,0);
tmc26XStepper.setMicrosteps(32);
tmc26XStepper.setStallGuardThreshold(4,0);
Serial.println("...started...");
tmc26XStepper.start();
Serial.flush();
Serial.println("Enter vial numbers 1-20");
}
void loop() {
Input = (5.0*analogRead(COOL_INPUT)*100.0) / 1024;
// A BUNCH OF CODE FOR TEMP REGULATION
Serial.println(Input);
delay(150);
}
void serialEvent() {
while (Serial.available() == 0) {}
i = Serial.parseInt();
Serial.print("position: ");
Serial.print(i);
Serial.print(" ");
while (Serial.available() > 0) {
junk = Serial.read();
}
if (i == 1) {
value = 0;
} else {
int num = i - 1;
value = num * 72;
}
while (enc_count != value) {
digitalWrite(6, HIGH);
delayMicroseconds(100);
digitalWrite(6, LOW);
delayMicroseconds(100);
if (enc_count == 1440) {
enc_count = 0;
}
}
Serial.println(enc_count);
}
// INFO FOR ENCODER
void encoder_isr() {
static int8_t lookup_table[] = {0,-1,1,0,1,0,0,-1,-1,0,0,1,0,1,-1,0};
static uint8_t enc_val = 0;
enc_val = enc_val << 2;
enc_val = enc_val | ((PIND & 0b1100) >> 2);
enc_count = enc_count + lookup_table[enc_val & 0b1111];
}
So, originally I tested the two processes separately (vial position + encoder, then temperature regulation), and everything did exactly what it was supposed to. Now I have fused the code together and put the vial position entry in the serialEvent() method, to keep the temperature reading continuous and the vial position entry available whenever I decide to provide input. However, when I enter a value, the program stops altogether. I can see the number I entered (position: 5), but the Serial.println(enc_count) never gets printed. On top of that, the temperature readings stop being displayed.
Any thoughts? Need more information?
I just want to know the breakdown of the Big O growth rate for this code. I have tried to calculate it, but I got the for loops wrong, so I am completely stuck on this now.
void doInter(int setA[], int setB[], int sizeA, int sizeB)
{
const int MAX = 10;
int sizeR;
int results [MAX];
// validate sizeA and sizeB
if ((sizeA == 0) || (sizeB == 0))
{
cout << "one of the sets is empty\n";
}
// save elements common to both sets
for (int i = sizeR = 0; i < sizeA; i++ )
{
if (member(setB, setA[i],sizeB))
{
results[sizeR++] = setA[i];
}
}
// print the elements common to both sets
cout << "{ ";
for (int i = 0; i < sizeR; i++)
{
cout << results[i] << " ";
}
cout << "}" << endl;
}
bool member (int set[], int n, int size)
{
for (; size > 0; --size)
{
if (set[size-1] == n)
{
return true;
}
}
return false;
}
The complexity of this code is O(sizeA * sizeB). It is relatively easy to compute. First compute the complexity of the inner function member: it is a single loop that performs sizeB iterations in the worst case. The outer function calls member once per iteration of a loop of size sizeA, so the overall complexity is the product of the two. The remaining operations are cheap compared with these two loops.
The worst case is also easy to construct: use two arrays with no common elements.
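The worst case can be checked empirically by counting comparisons. The following is a minimal sketch, not the asker's code: memberCounted and countIntersectionComparisons are hypothetical helpers that repeat the same scan as member and the same outer loop as doInter, with a counter added.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Same backwards linear scan as member(), but counts element comparisons.
inline bool memberCounted(const std::vector<int>& set, int n, std::size_t& comparisons) {
    for (std::size_t k = set.size(); k > 0; --k) {
        ++comparisons;
        if (set[k - 1] == n) return true;
    }
    return false;
}

// Runs the intersection loop over a and b and returns the comparisons performed.
inline std::size_t countIntersectionComparisons(const std::vector<int>& a,
                                                const std::vector<int>& b) {
    std::size_t comparisons = 0;
    for (int value : a) memberCounted(b, value, comparisons);
    // The count can never exceed sizeA * sizeB.
    assert(comparisons <= a.size() * b.size());
    return comparisons;
}
```

With two disjoint arrays of sizes 4 and 3, every scan runs to the end, so the count is exactly 4 * 3; with overlapping arrays the scans stop early and the count is lower.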
I've been looking for a while for a decent implementation of one of those algorithms to use in my app.
I found this example: http://rosettacode.org/wiki/K-d_tree#C
But when I put the code in Xcode, I get errors, for example:
"use of undeclared identifier", "expected ';' at the end of declaration".
I guess a header file is missing?
I copied the code from the link and made a minor edit that moved "swap" from being an inline nested function to a static function. Nested functions are a GCC extension that Clang, the compiler Xcode uses, does not support, which may explain your errors.
Compiled with "gcc -std=c99 file.c", it compiled OK. So, no, it doesn't
need another include file. Maybe you mispasted it.
If you are happy with this answer, you could accept it. Thanks.
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>
#define MAX_DIM 3
struct kd_node_t{
double x[MAX_DIM];
struct kd_node_t *left, *right;
};
static inline double
dist(struct kd_node_t *a, struct kd_node_t *b, int dim)
{
double t, d = 0;
while (dim--) {
t = a->x[dim] - b->x[dim];
d += t * t;
}
return d;
}
static void swap(struct kd_node_t *x, struct kd_node_t *y) {
double tmp[MAX_DIM];
memcpy(tmp, x->x, sizeof(tmp));
memcpy(x->x, y->x, sizeof(tmp));
memcpy(y->x, tmp, sizeof(tmp));
}
/* see quickselect method */
struct kd_node_t*
find_median(struct kd_node_t *start, struct kd_node_t *end, int idx)
{
if (end <= start) return NULL;
if (end == start + 1)
return start;
struct kd_node_t *p, *store, *md = start + (end - start) / 2;
double pivot;
while (1) {
pivot = md->x[idx];
swap(md, end - 1);
for (store = p = start; p < end; p++) {
if (p->x[idx] < pivot) {
if (p != store)
swap(p, store);
store++;
}
}
swap(store, end - 1);
/* median has duplicate values */
if (store->x[idx] == md->x[idx])
return md;
if (store > md) end = store;
else start = store;
}
}
struct kd_node_t*
make_tree(struct kd_node_t *t, int len, int i, int dim)
{
struct kd_node_t *n;
if (!len) return 0;
if ((n = find_median(t, t + len, i))) {
i = (i + 1) % dim;
n->left = make_tree(t, n - t, i, dim);
n->right = make_tree(n + 1, t + len - (n + 1), i, dim);
}
return n;
}
/* global variable, so sue me */
int visited;
void nearest(struct kd_node_t *root, struct kd_node_t *nd, int i, int dim,
struct kd_node_t **best, double *best_dist)
{
double d, dx, dx2;
if (!root) return;
d = dist(root, nd, dim);
dx = root->x[i] - nd->x[i];
dx2 = dx * dx;
visited ++;
if (!*best || d < *best_dist) {
*best_dist = d;
*best = root;
}
/* if chance of exact match is high */
if (!*best_dist) return;
if (++i >= dim) i = 0;
nearest(dx > 0 ? root->left : root->right, nd, i, dim, best, best_dist);
if (dx2 >= *best_dist) return;
nearest(dx > 0 ? root->right : root->left, nd, i, dim, best, best_dist);
}
#define N 1000000
#define rand1() (rand() / (double)RAND_MAX)
#define rand_pt(v) { v.x[0] = rand1(); v.x[1] = rand1(); v.x[2] = rand1(); }
int main(void)
{
int i;
struct kd_node_t wp[] = {
{{2, 3}}, {{5, 4}}, {{9, 6}}, {{4, 7}}, {{8, 1}}, {{7, 2}}
};
struct kd_node_t this = {{9, 2}};
struct kd_node_t *root, *found, *million;
double best_dist;
root = make_tree(wp, sizeof(wp) / sizeof(wp[1]), 0, 2);
visited = 0;
found = 0;
nearest(root, &this, 0, 2, &found, &best_dist);
printf(">> WP tree\nsearching for (%g, %g)\n"
"found (%g, %g) dist %g\nseen %d nodes\n\n",
this.x[0], this.x[1],
found->x[0], found->x[1], sqrt(best_dist), visited);
million = calloc(N, sizeof(struct kd_node_t));
srand(time(0));
for (i = 0; i < N; i++) rand_pt(million[i]);
root = make_tree(million, N, 0, 3);
rand_pt(this);
visited = 0;
found = 0;
nearest(root, &this, 0, 3, &found, &best_dist);
printf(">> Million tree\nsearching for (%g, %g, %g)\n"
"found (%g, %g, %g) dist %g\nseen %d nodes\n",
this.x[0], this.x[1], this.x[2],
found->x[0], found->x[1], found->x[2],
sqrt(best_dist), visited);
/* search many random points in million tree to see average behavior.
tree size vs avg nodes visited:
10 ~ 7
100 ~ 16.5
1000 ~ 25.5
10000 ~ 32.8
100000 ~ 38.3
1000000 ~ 42.6
10000000 ~ 46.7 */
int sum = 0, test_runs = 100000;
for (i = 0; i < test_runs; i++) {
found = 0;
visited = 0;
rand_pt(this);
nearest(root, &this, 0, 3, &found, &best_dist);
sum += visited;
}
printf("\n>> Million tree\n"
"visited %d nodes for %d random findings (%f per lookup)\n",
sum, test_runs, sum/(double)test_runs);
// free(million);
return 0;
}
In my current project I need to find the pixel-exact position of an image contained in another, larger image. The smaller image is never rotated or stretched (so it should match pixel by pixel), but it may have different brightness, and some of its pixels may be distorted. My first attempt was to do it on the CPU, but it was too slow. The calculations are highly parallel, so I decided to use the GPU. I have just started to learn CUDA and wrote my first CUDA app. My code works, but it is still too slow even on the GPU. When the larger image is 1024x1280 and the smaller one is 128x128, the program takes 2000 ms on a GeForce GTX 560 Ti. I need to get results in less than 200 ms. In the future I'll probably need a more complex algorithm, so I'd rather have even more computational headroom. The question is: how can I optimise my code to achieve that speed-up?
CUDAImageLib.dll:
#include "cuda_runtime.h"
#include "device_launch_parameters.h"
#include <stdio.h>
#include <cutil.h>
//#define SUPPORT_ALPHA
__global__ void ImageSearch_kernel(float* BufferOut, float* BufferB, float* BufferS, unsigned int bw, unsigned int bh, unsigned int sw, unsigned int sh)
{
unsigned int bx = threadIdx.x + blockIdx.x * blockDim.x;
unsigned int by = threadIdx.y + blockIdx.y * blockDim.y;
float diff = 0;
for (unsigned int y = 0; y < sh; ++y)
{
for (unsigned int x = 0; x < sw; ++x)
{
unsigned int as = (x + y * sw) * 4;
unsigned int ab = (x + bx + (y + by) * bw) * 4;
#ifdef SUPPORT_ALPHA
diff += ((abs(BufferS[as] - BufferB[ab]) + abs(BufferS[as + 1] - BufferB[ab + 1]) + abs(BufferS[as + 2] - BufferB[ab + 2])) * BufferS[as + 3] * BufferB[ab + 3]);
#else
diff += abs(BufferS[as] - BufferB[ab]);
diff += abs(BufferS[as + 1] - BufferB[ab + 1]);
diff += abs(BufferS[as + 2] - BufferB[ab + 2]);
#endif
}
}
BufferOut[bx + (by * (bw - sw))] = diff;
}
extern "C" int __declspec(dllexport) __stdcall ImageSearchGPU(float* BufferOut, float* BufferB, float* BufferS, int bw, int bh, int sw, int sh)
{
int aBytes = (bw * bh) * 4 * sizeof(float);
int bBytes = (sw * sh) * 4 * sizeof(float);
int cBytes = ((bw - sw) * (bh - sh)) * sizeof(float);
dim3 threadsPerBlock(32, 32);
dim3 numBlocks((bw - sw) / threadsPerBlock.x, (bh - sh) / threadsPerBlock.y);
float *dev_B = 0;
float *dev_S = 0;
float *dev_Out = 0;
unsigned int timer = 0;
float sExecutionTime = 0;
cudaError_t cudaStatus;
// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaSetDevice failed! Do you have a CUDA-capable GPU installed?");
goto Error;
}
// Allocate GPU buffers for three vectors (two input, one output) .
cudaStatus = cudaMalloc((void**)&dev_Out, cBytes);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
cudaStatus = cudaMalloc((void**)&dev_B, aBytes);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
cudaStatus = cudaMalloc((void**)&dev_S, bBytes);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMalloc failed!");
goto Error;
}
// Copy input vectors from host memory to GPU buffers.
cudaStatus = cudaMemcpy(dev_B, BufferB, aBytes, cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
cudaStatus = cudaMemcpy(dev_S, BufferS, bBytes, cudaMemcpyHostToDevice);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
cutCreateTimer(&timer);
cutStartTimer(timer);
// Launch a kernel on the GPU with one thread for each element.
ImageSearch_kernel<<<numBlocks, threadsPerBlock>>>(dev_Out, dev_B, dev_S, bw, bh, sw, sh);
// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
goto Error;
}
cutStopTimer(timer);
sExecutionTime = cutGetTimerValue(timer);
// Copy output vector from GPU buffer to host memory.
cudaStatus = cudaMemcpy(BufferOut, dev_Out, cBytes, cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
fprintf(stderr, "cudaMemcpy failed!");
goto Error;
}
Error:
cudaFree(dev_Out);
cudaFree(dev_B);
cudaFree(dev_S);
return (int)sExecutionTime;
}
extern "C" int __declspec(dllexport) __stdcall FindMinCPU(float* values, int count)
{
int minIndex = 0;
float minValue = 3.4e+38F;
for (int i = 0; i < count; ++i)
{
if (values[i] < minValue)
{
minValue = values[i];
minIndex = i;
}
}
return minIndex;
}
C# test app:
using System;
using System.Collections.Generic;
using System.Text;
using System.Diagnostics;
using System.Drawing;
namespace TestCUDAImageSearch
{
class Program
{
static void Main(string[] args)
{
using(Bitmap big = new Bitmap("Big.png"), small = new Bitmap("Small.png"))
{
Console.WriteLine("Big " + big.Width + "x" + big.Height + " Small " + small.Width + "x" + small.Height);
Stopwatch sw = new Stopwatch();
sw.Start();
Point point = CUDAImageLIb.ImageSearch(big, small);
sw.Stop();
long t = sw.ElapsedMilliseconds;
Console.WriteLine("Image found at " + point.X + "x" + point.Y);
Console.WriteLine("total time=" + t + "ms kernel time=" + CUDAImageLIb.LastKernelTime + "ms");
}
Console.WriteLine("Hit key");
Console.ReadKey();
}
}
}
//#define SUPPORT_HSB
using System;
using System.Collections.Generic;
using System.Text;
using System.Runtime.InteropServices;
using System.Drawing;
using System.Drawing.Imaging;
namespace TestCUDAImageSearch
{
public static class CUDAImageLIb
{
[DllImport("CUDAImageLib.dll")]
private static extern int ImageSearchGPU(float[] bufferOut, float[] bufferB, float[] bufferS, int bw, int bh, int sw, int sh);
[DllImport("CUDAImageLib.dll")]
private static extern int FindMinCPU(float[] values, int count);
private static int _lastKernelTime = 0;
public static int LastKernelTime
{
get { return _lastKernelTime; }
}
public static Point ImageSearch(Bitmap big, Bitmap small)
{
int bw = big.Width;
int bh = big.Height;
int sw = small.Width;
int sh = small.Height;
int mx = (bw - sw);
int my = (bh - sh);
float[] diffs = new float[mx * my];
float[] b = ImageToFloat(big);
float[] s = ImageToFloat(small);
_lastKernelTime = ImageSearchGPU(diffs, b, s, bw, bh, sw, sh);
int minIndex = FindMinCPU(diffs, diffs.Length);
return new Point(minIndex % mx, minIndex / mx);
}
public static List<Point> ImageSearch(Bitmap big, Bitmap small, float maxDeviation)
{
int bw = big.Width;
int bh = big.Height;
int sw = small.Width;
int sh = small.Height;
int mx = (bw - sw);
int my = (bh - sh);
int nDiff = mx * my;
float[] diffs = new float[nDiff];
float[] b = ImageToFloat(big);
float[] s = ImageToFloat(small);
_lastKernelTime = ImageSearchGPU(diffs, b, s, bw, bh, sw, sh);
List<Point> points = new List<Point>();
for(int i = 0; i < nDiff; ++i)
{
if (diffs[i] < maxDeviation)
{
points.Add(new Point(i % mx, i / mx));
}
}
return points;
}
#if SUPPORT_HSB
private static float[] ImageToFloat(Bitmap img)
{
int w = img.Width;
int h = img.Height;
float[] pix = new float[w * h * 4];
int i = 0;
for (int y = 0; y < h; ++y)
{
for (int x = 0; x < w; ++x)
{
Color c = img.GetPixel(x, y);
pix[i] = c.GetHue() / 360;
pix[i + 1] = c.GetSaturation();
pix[i + 2] = c.GetBrightness();
pix[i + 3] = c.A;
i += 4;
}
}
return pix;
}
#else
private static float[] ImageToFloat(Bitmap bmp)
{
int w = bmp.Width;
int h = bmp.Height;
int n = w * h;
float[] pix = new float[n * 4];
System.Diagnostics.Debug.Assert(bmp.PixelFormat == PixelFormat.Format32bppArgb);
Rectangle r = new Rectangle(0, 0, w, h);
BitmapData bmpData = bmp.LockBits(r, ImageLockMode.ReadOnly, bmp.PixelFormat);
System.Diagnostics.Debug.Assert(bmpData.Stride > 0);
int[] pixels = new int[n];
System.Runtime.InteropServices.Marshal.Copy(bmpData.Scan0, pixels, 0, n);
bmp.UnlockBits(bmpData);
int j = 0;
for (int i = 0; i < n; ++i)
{
pix[j] = (pixels[i] & 255) / 255.0f;
pix[j + 1] = ((pixels[i] >> 8) & 255) / 255.0f;
pix[j + 2] = ((pixels[i] >> 16) & 255) / 255.0f;
pix[j + 3] = ((pixels[i] >> 24) & 255) / 255.0f;
j += 4;
}
return pix;
}
#endif
}
}
It looks like what you are talking about is a well-known problem: template matching. The easiest way forward is to convolve the image (the bigger image) with the template (the smaller image). You could implement the convolution in one of two ways.
1) Modify the convolution example from the CUDA SDK (similar to what you are doing anyway).
2) Use FFTs to implement the convolution (see the convolution theorem). You will need to remember the following:
% MATLAB format
L = size(A) + size(B) - 1;
conv2(A, B) = IFFT2(FFT2(A, L) .* FFT2(B, L));
You could use cuFFT to implement the two-dimensional FFTs (after padding the inputs appropriately). You will need to write a kernel that does the element-wise multiplication and then normalizes the result (because cuFFT does not normalize) before performing the inverse FFT.
For the sizes you mention (1024 x 1280 and 128 x 128), the inputs must be padded to at least (1024 + 128 - 1) x (1280 + 128 - 1) = 1151 x 1407. But FFTs are fastest when the padded sizes are powers of 2, so you will need to pad both the large and small images to 2048 x 2048.
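The padding arithmetic above can be written down as a small helper. This is just a sketch of the size calculation, with made-up function names (linearConvSize, nextPowerOfTwo, paddedFftSize); it does not call cuFFT.

```cpp
#include <cassert>
#include <cstddef>

// Minimum length along one dimension for a full (linear) convolution.
inline std::size_t linearConvSize(std::size_t imageDim, std::size_t templateDim) {
    assert(imageDim > 0 && templateDim > 0);
    return imageDim + templateDim - 1;
}

// Smallest power of two that is greater than or equal to n.
inline std::size_t nextPowerOfTwo(std::size_t n) {
    std::size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}

// Padded FFT length along one dimension: the minimum size rounded up to a power of two.
inline std::size_t paddedFftSize(std::size_t imageDim, std::size_t templateDim) {
    return nextPowerOfTwo(linearConvSize(imageDim, templateDim));
}
```

For the sizes in the question, `paddedFftSize(1024, 128)` and `paddedFftSize(1280, 128)` both give 2048, matching the 2048 x 2048 figure above.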
You could speed up your calculations by using faster memory access, for example by using:
- the texture cache for the big image
- shared memory or the constant cache for the small image, or parts of it
But your real problem is the whole approach of your comparison. Comparing the images pixel by pixel at every possible location will never be efficient. There is just too much work to do. First you should think about finding ways to:
- select the interesting regions of the big image where the small image might be contained, and search only those
- find a faster comparison mechanism that represents the images by something other than their pixel values. You should be able to compare the images by computing a representation with less data, e.g. a color histogram or integral images.
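As an illustration of the integral-image idea, here is a minimal CPU-side sketch for a single-channel image. The IntegralImage type and its members are assumptions made for this example. It only shows how the sum of any rectangle can be read in constant time; how that sum is used to reject candidate positions (for instance by comparing region means) is left open.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Summed-area table with an extra row and column of zeros, so that
// sums[y][x] holds the sum of all pixels above and to the left of (x, y).
struct IntegralImage {
    std::size_t width, height;
    std::vector<double> sums;  // (width + 1) * (height + 1) entries

    IntegralImage(const std::vector<float>& pixels, std::size_t w, std::size_t h)
        : width(w), height(h), sums((w + 1) * (h + 1), 0.0) {
        assert(pixels.size() == w * h);
        for (std::size_t y = 0; y < h; ++y)
            for (std::size_t x = 0; x < w; ++x)
                at(x + 1, y + 1) = pixels[y * w + x]
                                 + at(x, y + 1) + at(x + 1, y) - at(x, y);
    }

    double& at(std::size_t x, std::size_t y) { return sums[y * (width + 1) + x]; }
    double at(std::size_t x, std::size_t y) const { return sums[y * (width + 1) + x]; }

    // Sum of the w x h rectangle whose top-left pixel is (x0, y0): four lookups.
    double rectSum(std::size_t x0, std::size_t y0, std::size_t w, std::size_t h) const {
        assert(x0 + w <= width && y0 + h <= height);
        return at(x0 + w, y0 + h) - at(x0, y0 + h) - at(x0 + w, y0) + at(x0, y0);
    }
};
```

Building the table costs one pass over the big image; after that, the brightness sum of any 128 x 128 window takes four reads instead of 16384.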