OpenCL bad get_global_id output

I am trying to implement matrix multiplication, but get_global_id returns incorrect values.
This is the host code (n, m, TILE_SIZE = 4):
int dimention = 2;
size_t global_item_size[] = {n, m};
size_t local_item_size[] = {TILE_SIZE, TILE_SIZE};
ret = clEnqueueNDRangeKernel(command_queue, kernel, dimention, NULL, global_item_size, local_item_size, 0, NULL, &perf_event);
And part of the kernel:
kernel void mul_tile(uint n, uint m, uint k, global const float *a, global const float *b, global float *c) {
size_t i = get_global_id(0);
size_t j = get_global_id(1);
printf("aa %i %i\n", i, j);
}
This code prints this:
aa 0 0
aa 1 0
aa 2 0
aa 3 0
aa 0 0
aa 1 0
aa 2 0
aa 3 0
aa 0 0
aa 1 0
aa 2 0
aa 3 0
aa 0 0
aa 1 0
aa 2 0
aa 3 0
After some time I realized that get_global_id(0) returns the correct index the first time I call it and zero the second time:
kernel void mul_tile(uint n, uint m, uint k, global const float *a, global const float *b, global float *c) {
size_t i = get_global_id(0);
size_t j = get_global_id(0);
printf("aa %i %i\n", i, j);
}
So this kernel prints the same thing.
In some cases get_global_id(2) returns the second-dimension indices, but when I merely rename variables it starts printing zeros.
This looks like a driver bug. I use a GeForce GT 745M, Ubuntu 20.04, and the recommended driver (nvidia-driver-440).
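One possible explanation, offered as an assumption rather than something the question confirms: on a 64-bit device size_t is 8 bytes wide, while %i tells printf to consume a 4-byte int. If the arguments are packed back to back, the second %i would read the upper half of i (always zero for small indices) instead of j. The Python sketch below only imitates that reinterpretation with the struct module; the helper name misread_as_two_ints and the little-endian layout are illustrative assumptions, not a description of the actual driver.

```python
import struct


def misread_as_two_ints(i, j):
    """Imitate a "%i %i" format reading two 64-bit arguments as 32-bit ints."""
    # Assumed layout: two 8-byte unsigned values packed back to back, little-endian.
    raw = struct.pack("<QQ", i, j)
    # "%i %i" consumes only the first 8 bytes: the low and high halves of i.
    return struct.unpack_from("<ii", raw)


if __name__ == "__main__":
    for i in range(4):
        # j is never reached by the format, whatever its value is.
        print("aa %i %i" % misread_as_two_ints(i, 2))
```

If this is indeed the cause, casting the values before printing, for example printf("aa %d %d\n", (int)i, (int)j), should show both indices.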

How to program a MIP solver to find balanced Gray code for mixed radices?

The permutations of a mixed radix number can be ordered to achieve Grayness (in the sense of Gray code) with optimal balance and span length. Each of these constraints will be explained in turn. In my examples, I use a mixed radix number consisting of a base 2 digit, a base 3 digit, and a base 4 digit. This set is called [234], and it has 2 × 3 × 4 = 24 permutations. The permutations are listed below, in ascending order. For compactness, the digits are shown as rows, with the top row corresponding to the set’s first digit. The leftmost column is the first permutation 000, the next column is the second permutation 001, then 002, 003, 010, 011, 012, 013, and so on.
2: 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1 1 1 1 1 1 1 1 1 1
3: 0 0 0 0 1 1 1 1 2 2 2 2 0 0 0 0 1 1 1 1 2 2 2 2
4: 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3 0 1 2 3
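For readers who want to reproduce the listing, here is a minimal Python sketch of the ascending enumeration. The function name mixed_radix_tuples is made up for this illustration; it treats the first digit as the most significant one, as in the rows above.

```python
from itertools import product


def mixed_radix_tuples(radices):
    """List every digit tuple for the given radices in ascending order."""
    # product() varies the last digit fastest: 000, 001, 002, 003, 010, ...
    return list(product(*(range(r) for r in radices)))


if __name__ == "__main__":
    perms = mixed_radix_tuples([2, 3, 4])
    print(len(perms))  # number of tuples in the set
    for place in range(3):  # one row per digit, as in the listing
        print(" ".join(str(p[place]) for p in perms))
```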
In the above set, multiple digits may change from one permutation to the next. For example, between the fourth and fifth permutations (003 and 010), two digits change at once. To make a Gray set, we must reorder the permutations so that only one digit changes at a time. This constraint includes the wraparound from the last permutation back to the first. Below is [234] reordered to be Gray:
2: 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0
3: 0 0 1 1 2 2 2 2 0 0 1 1 1 1 1 1 0 2 2 2 2 0 0 0
4: 0 0 0 0 0 0 1 1 1 1 1 2 2 1 3 3 3 3 3 2 2 2 2 3
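The Gray constraint, including the wraparound, can be checked mechanically. The sketch below is only an illustration in Python; the helper names columns and is_gray_cycle are invented here, and the rows are copied from the listing above.

```python
from itertools import product


def columns(rows):
    """Turn per-digit rows of space-separated digits into a list of tuples."""
    return list(zip(*(tuple(int(d) for d in row.split()) for row in rows)))


def is_gray_cycle(seq):
    """True if each tuple occurs once and neighbours differ in exactly one place."""
    n = len(seq)
    for k in range(n):
        a, b = seq[k], seq[(k + 1) % n]  # (k + 1) % n covers the wraparound
        if sum(x != y for x, y in zip(a, b)) != 1:
            return False
    return len(set(seq)) == n


# Ascending order, first digit most significant.
ASCENDING = list(product(range(2), range(3), range(4)))

# The reordered [234] set from the listing above.
GRAY = columns([
    "0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0",
    "0 0 1 1 2 2 2 2 0 0 1 1 1 1 1 1 0 2 2 2 2 0 0 0",
    "0 0 0 0 0 0 1 1 1 1 1 2 2 1 3 3 3 3 3 2 2 2 2 3",
])

if __name__ == "__main__":
    print(is_gray_cycle(ASCENDING))
    print(is_gray_cycle(GRAY))
```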
The above set is Gray, but not balanced. To be balanced, each of the set’s digits must change the same number of times, or as close as possible. In the above set, the 2’s place changes 10 times, the 3’s place changes 7 times, and the 4’s place also changes 7 times. A set’s imbalance is the absolute value of the difference between its minimum and maximum digit changes, in this case 10 – 7 = 3. Below is [234] reordered to have optimal balance; each digit changes 8 times, so the imbalance is now zero:
2: 0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0
3: 0 0 1 1 2 2 2 2 0 0 1 1 1 1 1 1 2 0 0 2 2 2 0 0
4: 0 0 0 0 0 0 1 1 1 1 1 2 2 1 3 3 3 3 2 2 2 3 3 2
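To make the imbalance definition concrete, here is a small Python sketch that counts how often each digit changes around the cycle (wraparound included) and takes max minus min. The function names are hypothetical; the two data sets are the Gray and balanced orderings quoted in the question.

```python
def columns(rows):
    """Turn per-digit rows of space-separated digits into a list of tuples."""
    return list(zip(*(tuple(int(d) for d in row.split()) for row in rows)))


def transition_counts(seq):
    """Count, per place, how many times the digit changes around the cycle."""
    n = len(seq)
    counts = [0] * len(seq[0])
    for k in range(n):
        a, b = seq[k], seq[(k + 1) % n]  # last -> first is included
        for place, (x, y) in enumerate(zip(a, b)):
            if x != y:
                counts[place] += 1
    return counts


def imbalance(seq):
    """Difference between the largest and smallest transition count."""
    counts = transition_counts(seq)
    return max(counts) - min(counts)


GRAY = columns([
    "0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 1 1 1 0 0 1 1 0 0",
    "0 0 1 1 2 2 2 2 0 0 1 1 1 1 1 1 0 2 2 2 2 0 0 0",
    "0 0 0 0 0 0 1 1 1 1 1 2 2 1 3 3 3 3 3 2 2 2 2 3",
])
BALANCED = columns([
    "0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0",
    "0 0 1 1 2 2 2 2 0 0 1 1 1 1 1 1 2 0 0 2 2 2 0 0",
    "0 0 0 0 0 0 1 1 1 1 1 2 2 1 3 3 3 3 2 2 2 3 3 2",
])

if __name__ == "__main__":
    print(transition_counts(GRAY), imbalance(GRAY))          # [10, 7, 7] 3
    print(transition_counts(BALANCED), imbalance(BALANCED))  # [8, 8, 8] 0
```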
The above set is Gray and balanced, but digits get stuck for longer than we’d like. For example, the 4’s place stays zero for the first six permutations. This constitutes a span, with a length of six. In the above set, the maximum span length is six. For optimal granularity, the maximum span should be as short as possible. Below is [234] reordered so that the maximum span length is four instead of six:
2: 0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0
3: 0 0 1 1 1 1 0 0 0 0 1 2 2 2 2 1 1 1 0 2 2 2 2 0
4: 0 0 0 0 1 1 1 1 2 2 2 2 0 0 2 2 3 3 3 3 1 1 3 3
The above ordering of [234] is Gray, optimally balanced, and minimizes stuck digits. It’s the best we can do for this particular set. But for larger sets, such as [345], optimal solutions are much harder to find, because my code is too slow. Can a MIP solver do better? The solution should be coded in a language that’s supported by one of the solvers available at NEOS, because these are the only high-quality solvers I have access to (for example CPLEX via AMPL, GAMS, LP, MPS, or NL). The application is atonal music theory, hence only sets with ranges of twelve or less are relevant. The complete list of sets I’m trying to optimize is here.
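As an illustration of the span definition, the following Python sketch measures the longest run of an unchanged digit, treating the sequence as circular. It is not taken from the C++ program in this question; the names max_span, BALANCED and BEST are chosen for this example, and the rows are the last two [234] orderings quoted above.

```python
def columns(rows):
    """Turn per-digit rows of space-separated digits into a list of tuples."""
    return list(zip(*(tuple(int(d) for d in row.split()) for row in rows)))


def max_span(seq):
    """Longest circular run during which any single digit keeps its value."""
    n = len(seq)
    best = 1
    for place in range(len(seq[0])):
        digits = [s[place] for s in seq]
        # Indices where a run begins; digits[k - 1] wraps to the end for k == 0.
        starts = [k for k in range(n) if digits[k] != digits[k - 1]]
        if not starts:
            return n  # this digit never changes, so one run covers the cycle
        first = starts[0]  # scanning from a run start avoids straddling the seam
        run = 0
        for k in range(n):
            cur = digits[(first + k) % n]
            prev = digits[(first + k - 1) % n]
            run = 1 if (k == 0 or cur != prev) else run + 1
            best = max(best, run)
    return best


BALANCED = columns([
    "0 1 1 0 0 1 1 0 0 1 1 1 0 0 0 1 1 1 1 1 0 0 0 0",
    "0 0 1 1 2 2 2 2 0 0 1 1 1 1 1 1 2 0 0 2 2 2 0 0",
    "0 0 0 0 0 0 1 1 1 1 1 2 2 1 3 3 3 3 2 2 2 3 3 2",
])
BEST = columns([
    "0 1 1 0 0 1 1 0 0 1 1 1 1 0 0 0 0 1 1 1 1 0 0 0",
    "0 0 1 1 1 1 0 0 0 0 1 2 2 2 2 1 1 1 0 2 2 2 2 0",
    "0 0 0 0 1 1 1 1 2 2 2 2 0 0 2 2 3 3 3 3 1 1 3 3",
])

if __name__ == "__main__":
    print(max_span(BALANCED))  # 6
    print(max_span(BEST))      # 4
```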
EDIT: Some commenters asked about my code, so I'm enclosing it below. I use Visual Studio 2012, but this code should compile fairly easily in any C++ compiler. I use x64 (64-bit code).
// Copyleft 2023 Chris Korda
// This program is free software; you can redistribute it and/or modify it
// under the terms of the GNU General Public License as published by the Free
// Software Foundation; either version 2 of the License, or any later version.
// BalaGray.cpp : Defines the entry point for the console application.
// This app computes balanced Gray code sequences, for use in music theory.
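#include "stdafx.h" // precompiled header
#include <stdint.h> // standard sizes
#include <limits.h> // INT_MAX
#include <stdio.h> // printf, fgetc
#include <intrin.h> // _BitScanReverse (MSVC intrinsic)
#include <vector> // growable array
#include <fstream> // file I/O
#include <assert.h> // debugging
using namespace std;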
#define MORE_PLACES 0 // set non-zero to use more than four places
#define DO_PRUNING 1 // set non-zero to do branch pruning and reduce runtime
class CBalaGray {
public:
// Construction
CBalaGray();
// Attributes
int GetPermCount() const { return static_cast<int>(m_arrPerm.size()); }
// Operations
void Reset();
void Calc(int nPlaces, const uint8_t *arrRange);
protected:
// Constants
enum {
#if MORE_PLACES
MAX_PLACES = 8,
#else
MAX_PLACES = 4,
#endif
MAX_RANGE = 255,
ULONGLONG_BITS = 64,
};
enum { // pruning thresholds may require manual tuning; see notes in set list
PRUNE_MAXTRANS = 18,
PRUNE_IMBALANCE = 3,
};
// Types
union PERM { // permutation; size depends on MAX_PLACES
uint8_t b[MAX_PLACES]; // array of places
#if MORE_PLACES
uint64_t dw; // double word containing all places
#else
uint32_t dw; // double word containing all places
#endif
};
struct STATE { // crawler stack element
uint8_t iPerm; // permutation index
uint8_t iGray; // Gray neighbor index
PERM nTrans; // transition counts, one per place
};
typedef vector<PERM> CPermArray;
typedef vector<STATE> CStateArray;
typedef vector<uint8_t> CPlaceArray; // enough for atonal music theory
// Member data
int m_nPlaces; // number of places
int m_nGrayPerms; // number of Gray permutations reachable from a permutation
int m_nGrayStrideShift; // stride of Gray permutations array, as a shift in bits
CPlaceArray m_arrRange; // array of ranges, one for each place
CPermArray m_arrPerm; // array of permutations
CPlaceArray m_arrGray; // 2D table of permutations reachable from each permutation
CStateArray m_arrState; // array of states; crawler stack
ofstream m_fOut; // output file
// Helpers
int Pack(const PERM& perm) const;
void MakePerms(int nPlaces, const uint8_t *arrRange);
void MakeGrayTable();
void DumpGrayTablePerms() const;
void DumpPerm(const PERM& perm) const;
void DumpPerms() const;
void DumpSet() const;
void WriteBalanceToLog(int nImbalance, int nMaxTrans, int nMaxSpan);
void WriteSequenceToLog(int iDepth);
bool IsGray(PERM p1, PERM p2) const;
int ComputeBalance(int iDepth, int& nMaxTrans, PERM& nTransCounts) const;
int ComputeMaxSpan(int iDepth) const;
};
CBalaGray::CBalaGray()
{
m_fOut.open("BalaGrayIter.txt", ios_base::out); // open output file
assert(m_fOut.is_open()); // portable check; comparing a stream to NULL is not valid in C++11
Reset();
}
void CBalaGray::Reset()
{
m_nPlaces = 0;
m_arrRange.clear();
m_arrState.clear();
}
int CBalaGray::Pack(const PERM& perm) const
{
int nPacked = perm.b[m_nPlaces - 1]; // init total to first place
for (int iPlace = m_nPlaces - 2; iPlace >= 0; iPlace--) { // for each subsequent place
nPacked *= m_arrRange[iPlace]; // multiply total by place's range
nPacked += perm.b[iPlace]; // add place to total
}
return nPacked;
}
void CBalaGray::MakePerms(int nPlaces, const uint8_t *arrRange)
{
m_nPlaces = nPlaces;
m_arrRange.resize(nPlaces);
int nPerms = 1;
for (int iPlace = 0; iPlace < nPlaces; iPlace++) { // for each place
assert(arrRange[iPlace] > 1); // radix must be at least binary
m_arrRange[iPlace] = arrRange[iPlace]; // store range
nPerms *= arrRange[iPlace]; // update permutation count
}
m_arrPerm.resize(nPerms);
for (int iPerm = 0; iPerm < nPerms; iPerm++) {
PERM perm;
perm.dw = 0;
int nVal = iPerm;
for (int iPlace = 0; iPlace < nPlaces; iPlace++) { // for each place
int nRange = m_arrRange[iPlace];
perm.b[iPlace] = nVal % nRange;
nVal /= nRange;
}
m_arrPerm[iPerm] = perm;
}
}
void CBalaGray::MakeGrayTable()
{
// Build 2D table of permutations reachable from each permutation.
// One row for each permutation, one column for each Gray neighbor.
// Each element is a permutation index, and must be dereferenced.
int nPlaces = m_nPlaces;
int nGrayPerms = 0;
for (int iPlace = 0; iPlace < nPlaces; iPlace++) { // for each place
nGrayPerms += m_arrRange[iPlace] - 1; // one less than place's range
}
// Compute stride of Gray permutations array; to avoid multiplication,
// round up stride to nearest power of two and convert it to a shift.
unsigned long iFirstBitPos;
_BitScanReverse(&iFirstBitPos, nGrayPerms - 1);
int nStrideShift = static_cast<int>(iFirstBitPos) + 1; // smallest shift with (1 << shift) >= nGrayPerms
m_arrGray.resize(m_arrPerm.size() << nStrideShift);
int nPerms = GetPermCount();
for (int iPerm = 0; iPerm < nPerms; iPerm++) { // for each permutation
int iCol = 0;
PERM rowPerm, colPerm;
rowPerm.dw = m_arrPerm[iPerm].dw;
for (int iPlace = 0; iPlace < nPlaces; iPlace++) { // for each place
int nRange = m_arrRange[iPlace]; // place's range
for (int iVal = 0; iVal < nRange; iVal++) { // for each of place's values
if (iVal != rowPerm.b[iPlace]) { // if value differs from row value
colPerm.dw = rowPerm.dw; // column permutation is same as row
colPerm.b[iPlace] = iVal; // except one place differs (Gray)
m_arrGray[(iPerm << nStrideShift) + iCol] = Pack(colPerm);
iCol++; // next column
}
}
}
}
m_nGrayPerms = nGrayPerms; // save in member var
m_nGrayStrideShift = nStrideShift;
}
void CBalaGray::DumpPerm(const PERM& perm) const
{
printf("[");
for (int iPlace = 0; iPlace < m_nPlaces; iPlace++) { // for each place
printf("%d ", perm.b[iPlace]);
}
printf("]");
}
void CBalaGray::DumpPerms() const
{
int nPerms = GetPermCount();
for (int iPerm = 0; iPerm < nPerms; iPerm++) {
for (int iPlace = 0; iPlace < m_nPlaces; iPlace++) { // for each place
printf("%d ", m_arrPerm[iPerm].b[iPlace]);
}
printf("\n");
}
}
void CBalaGray::DumpGrayTablePerms() const
{
int nPerms = GetPermCount();
for (int iPerm = 0; iPerm < nPerms; iPerm++) { // for each permutation
DumpPerm(m_arrPerm[iPerm]);
printf(": ");
for (int iGray = 0; iGray < m_nGrayPerms; iGray++) { // for each Gray neighbor
int iPerm2 = m_arrGray[(iPerm << m_nGrayStrideShift) + iGray];
DumpPerm(m_arrPerm[iPerm2]);
}
printf("\n");
}
}
void CBalaGray::DumpSet() const
{
printf("[");
for (int iPlace = 0; iPlace < m_nPlaces; iPlace++) { // for each place
printf("%d", m_arrRange[iPlace]);
}
printf("]\n");
}
void CBalaGray::WriteBalanceToLog(int nImbalance, int nMaxTrans, int nMaxSpan)
{
m_fOut << "balance = " << nImbalance << ", maxtrans = " << nMaxTrans << ", maxspan = " << nMaxSpan << '\n';
}
void CBalaGray::WriteSequenceToLog(int iDepth)
{
int nPerms = GetPermCount();
for (int iPlace = 0; iPlace < m_nPlaces; iPlace++) { // for each place
for (int iPerm = 0; iPerm < nPerms; iPerm++) {
m_fOut << int(m_arrPerm[m_arrState[iPerm].iPerm].b[iPlace]) << ' ';
}
m_fOut << '\n';
}
m_fOut << '\n';
}
__forceinline bool CBalaGray::IsGray(PERM p1, PERM p2) const
{
// Returns true if the given permutations differ by exactly one place.
bool bDiff = false;
int nPlaces = m_nPlaces;
for (int iPlace = 0; iPlace < nPlaces; iPlace++) { // for each place
if (p1.b[iPlace] != p2.b[iPlace]) { // if places differ
if (!bDiff) { // if first difference
bDiff = true; // set flag
} else { // not first difference
return false; // not Gray; early out
}
}
}
return bDiff;
}
void CBalaGray::Calc(int nPlaces, const uint8_t *arrRange)
{
assert(nPlaces >= 0 && nPlaces <= MAX_PLACES);
Reset();
MakePerms(nPlaces, arrRange);
MakeGrayTable();
// DumpPerms();
// DumpGrayTablePerms();
int nPermGrays = m_nGrayPerms;
int nGrayStrideShift = m_nGrayStrideShift;
DumpSet();
int nPerms = GetPermCount();
printf("nPlaces=%d\n", nPlaces);
printf("nPerms=%d\n", nPerms);
int nBestImbalance = INT_MAX;
int nBestMaxTrans = INT_MAX;
int nBestMaxSpan = INT_MAX;
m_arrState.resize(nPerms);
uint64_t nPasses = 0;
uint64_t nPermUsedMask[2] = {0}; // need 128 bits, as number of permutations may exceed 64
int iDepth = 2; // first two levels are constant to save time; all sequences start with 0, 1
m_arrState[1].iPerm = 1;
m_arrState[1].nTrans.b[0] = 1;
nPermUsedMask[0] = 0x3;
int nStartDepth = iDepth;
while (1) {
nPasses++;
int iPrevPerm = m_arrState[iDepth - 1].iPerm;
int iGray = m_arrState[iDepth].iGray;
int iPerm = m_arrGray[(iPrevPerm << nGrayStrideShift) + iGray]; // optimized 2D table addressing
int iUsedMask = iPerm >= ULONGLONG_BITS; // index selects one of two 64-bit masks
uint64_t nPermMask = 1ull << (iPerm & (ULONGLONG_BITS - 1));
if (!(nPermUsedMask[iUsedMask] & nPermMask)) { // if this permutation hasn't been used yet on this branch
m_arrState[iDepth].iPerm = iPerm; // save permutation index on stack
int nMaxTrans;
PERM nTransCounts;
int nImbalance = ComputeBalance(iDepth, nMaxTrans, nTransCounts);
if (iDepth < nPerms - 1) { // if incomplete sequence
#if DO_PRUNING
// these constants may require tuning, see notes below
// if (nMaxTrans > PRUNE_MAXTRANS || nImbalance > PRUNE_IMBALANCE) { // slightly faster
if (nImbalance > PRUNE_IMBALANCE) {
goto lblPrune; // abandon this branch
}
#endif
// crawl one level deeper
nPermUsedMask[iUsedMask] |= nPermMask; // mark this permutation as used
m_arrState[iDepth].nTrans.dw = nTransCounts.dw; // save current transition counts on stack
iDepth++; // increment depth to next permutation
m_arrState[iDepth].iGray = 0; // reset index of Gray neighbors
m_arrState[iDepth].iPerm = 0; // reset permutation index
continue; // equivalent to recursion, but less overhead
} else { // reached a leaf: complete sequence, a potential winner
// if branch doesn't wrap around Gray
if (!IsGray(m_arrPerm[m_arrState[0].iPerm], m_arrPerm[m_arrState[nPerms - 1].iPerm])) {
goto lblPrune; // abandon this branch
}
// if max transition count or imbalance are worse than our current bests
if (nMaxTrans > nBestMaxTrans || nImbalance > nBestImbalance) {
goto lblPrune; // abandon this branch
}
int nMaxSpan = ComputeMaxSpan(iDepth); // compute maximum span length
// if max transition count and imbalance equal our current bests
if (nMaxTrans == nBestMaxTrans && nImbalance == nBestImbalance) {
if (nMaxSpan >= nBestMaxSpan) { // if max span didn't improve
goto lblPrune; // abandon this branch
}
}
// we have a winner, until something better comes along
nBestMaxTrans = nMaxTrans; // update best max transition count
nBestImbalance = nImbalance; // update best imbalance
nBestMaxSpan = nMaxSpan; // update best maximum span length
printf("balance = %d, maxtrans = %d, maxspan = %d\n", nImbalance, nMaxTrans, nMaxSpan);
WriteBalanceToLog(nImbalance, nMaxTrans, nMaxSpan);
WriteSequenceToLog(iDepth);
}
}
m_arrState[iDepth].iGray++; // increment Gray neighbor index
if (m_arrState[iDepth].iGray >= nPermGrays) { // if no more Gray neighbors for this permutation
lblPrune:
if (iDepth <= nStartDepth) { // if we're at same level where we started
break; // exit main loop
} else { // sufficient levels remain above us
iDepth--; // back up a level
// restore bitmask that keeps track of which permutations we've used on this branch
int iPerm = m_arrState[iDepth].iPerm; // number of permutations may exceed 64
int iUsedMask = iPerm >= ULONGLONG_BITS; // index selects one of two 64-bit masks
uint64_t nPermMask = 1ull << (iPerm & (ULONGLONG_BITS - 1));
nPermUsedMask[iUsedMask] &= ~nPermMask; // mark this permutation as available again
m_arrState[iDepth].iGray++; // increment was skipped by continue statement above
if (m_arrState[iDepth].iGray >= nPermGrays) { // if no more Gray neighbors
goto lblPrune; // keep backing up
}
}
}
}
printf("done!\n");
}
__forceinline int CBalaGray::ComputeBalance(int iDepth, int& nMaxTrans, PERM& nTransCounts) const
{
int nPlaces = m_nPlaces;
PERM nTrans;
nTrans.dw = m_arrState[iDepth - 1].nTrans.dw; // load latest transition counts from stack
// compare current state to previous state
PERM sPrev, sCur;
sPrev.dw = m_arrPerm[m_arrState[iDepth - 1].iPerm].dw;
sCur.dw = m_arrPerm[m_arrState[iDepth].iPerm].dw;
for (int iPlace = 0; iPlace < nPlaces; iPlace++) { // for each place
if (sCur.b[iPlace] != sPrev.b[iPlace]) { // if place transitioned
nTrans.b[iPlace]++; // increment place's transition count
}
}
nTransCounts = nTrans; // order matters; counts passed back to caller must exclude wraparound
// account for wraparound; compare current state to initial state, which is assumed to be zero
for (int iPlace = 0; iPlace < nPlaces; iPlace++) { // for each place
if (sCur.b[iPlace]) { // if place transitioned
nTrans.b[iPlace]++; // increment place's transition count
}
}
// now that we have latest transition counts, compute their min and max
int nMin = nTrans.b[0]; // initialize min and max to first transition count
int nMax = nTrans.b[0];
for (int iPlace = 1; iPlace < nPlaces; iPlace++) { // for each transition count, excluding first
int n = nTrans.b[iPlace];
if (n < nMin) // if less than min
nMin = n; // update min
if (n > nMax) // if greater than max
nMax = n; // update max
}
nMaxTrans = nMax;
return nMax - nMin; // return difference
}
__forceinline int CBalaGray::ComputeMaxSpan(int iDepth) const
{
int arrSpan[MAX_PLACES];
int arrFirstSpan[MAX_PLACES];
for (int iPlace = 0; iPlace < m_nPlaces; iPlace++) { // for each place
arrSpan[iPlace] = 1; // initial span length is one
arrFirstSpan[iPlace] = 0; // first span length not set
}
int nMaxSpan = 1;
PERM sFirst, sPrev;
sFirst.dw = m_arrPerm[m_arrState[0].iPerm].dw; // store first state
sPrev.dw = sFirst.dw;
for (int iState = 1; iState <= iDepth; iState++) { // for each state, excluding first
PERM s;
s.dw = m_arrPerm[m_arrState[iState].iPerm].dw; // compare this state to previous state
for (int iPlace = 0; iPlace < m_nPlaces; iPlace++) { // for each place
if (s.b[iPlace] != sPrev.b[iPlace]) { // if place transitioned
if (arrSpan[iPlace] > nMaxSpan) // if span length exceeds max
nMaxSpan = arrSpan[iPlace]; // update max span length
if (!arrFirstSpan[iPlace]) // if first span length hasn't been set
arrFirstSpan[iPlace] = arrSpan[iPlace]; // save first span length
arrSpan[iPlace] = 1; // reset span length
} else { // place didn't transition
arrSpan[iPlace]++; // increment span length
}
}
sPrev = s; // update previous state
}
// wrap around from last to first state
for (int iPlace = 0; iPlace < m_nPlaces; iPlace++) { // for each place
if (sFirst.b[iPlace] != sPrev.b[iPlace]) { // if place transitioned
if (arrSpan[iPlace] > nMaxSpan) // if span length exceeds max
nMaxSpan = arrSpan[iPlace]; // update max span length
} else { // place didn't transition
arrSpan[iPlace] += arrFirstSpan[iPlace]; // compute wrapped span length
if (arrSpan[iPlace] > nMaxSpan) // if span length exceeds max
nMaxSpan = arrSpan[iPlace]; // update max span length
}
}
return nMaxSpan;
}
void test()
{
// All cases want PRUNE_IMBALANCE = 3 unless specified otherwise below.
// Pruning greatly reduces runtime, but the results may not be optimal.
// Proven means exited normally with pruning disabled (DO_PRUNING = 0).
//
// const uint8_t arrRange[] = {2, 10}; // proven
// const uint8_t arrRange[] = {3, 9};
// const uint8_t arrRange[] = {4, 8};
// const uint8_t arrRange[] = {5, 7};
// const uint8_t arrRange[] = {6, 6};
// const uint8_t arrRange[] = {2, 9}; // proven
// const uint8_t arrRange[] = {3, 8};
// const uint8_t arrRange[] = {4, 7};
// const uint8_t arrRange[] = {5, 6};
// const uint8_t arrRange[] = {2, 8}; // proven
// const uint8_t arrRange[] = {3, 7};
// const uint8_t arrRange[] = {4, 6};
// const uint8_t arrRange[] = {5, 5};
// const uint8_t arrRange[] = {2, 7}; // proven
// const uint8_t arrRange[] = {3, 6}; // proven
// const uint8_t arrRange[] = {4, 5}; // proven
// const uint8_t arrRange[] = {2, 6}; // proven
// const uint8_t arrRange[] = {3, 5}; // proven
// const uint8_t arrRange[] = {4, 4}; // proven
// const uint8_t arrRange[] = {2, 5}; // proven
// const uint8_t arrRange[] = {3, 4}; // proven
// const uint8_t arrRange[] = {2, 4}; // proven
// const uint8_t arrRange[] = {3, 3}; // proven
// const uint8_t arrRange[] = {2, 3}; // proven
// const uint8_t arrRange[] = {2, 2}; // proven
// const uint8_t arrRange[] = {2, 2, 8};
// const uint8_t arrRange[] = {2, 3, 7};
// const uint8_t arrRange[] = {2, 4, 6};
// const uint8_t arrRange[] = {2, 5, 5};
// const uint8_t arrRange[] = {3, 3, 6};
// const uint8_t arrRange[] = {3, 4, 5};
// const uint8_t arrRange[] = {4, 4, 4};
// const uint8_t arrRange[] = {2, 2, 7};
// const uint8_t arrRange[] = {2, 3, 6};
// const uint8_t arrRange[] = {2, 4, 5};
// const uint8_t arrRange[] = {3, 3, 5};
// const uint8_t arrRange[] = {3, 4, 4};
// const uint8_t arrRange[] = {2, 2, 6}; // proven
// const uint8_t arrRange[] = {2, 3, 5};
// const uint8_t arrRange[] = {2, 4, 4};
// const uint8_t arrRange[] = {3, 3, 4};
// const uint8_t arrRange[] = {2, 2, 5}; // proven
// const uint8_t arrRange[] = {2, 3, 4}; // proven
// const uint8_t arrRange[] = {3, 3, 3};
// const uint8_t arrRange[] = {2, 2, 4}; // proven
// const uint8_t arrRange[] = {2, 3, 3}; // proven
// const uint8_t arrRange[] = {2, 2, 3}; // proven
// const uint8_t arrRange[] = {2, 2, 2}; // proven
// const uint8_t arrRange[] = {2, 2, 2, 6};
// const uint8_t arrRange[] = {2, 2, 3, 5}; // slow
// const uint8_t arrRange[] = {2, 2, 4, 4};
// const uint8_t arrRange[] = {2, 3, 3, 4}; // slow; wants PRUNE_IMBALANCE = 4
const uint8_t arrRange[] = {3, 3, 3, 3}; // slow
// const uint8_t arrRange[] = {2, 2, 2, 5};
// const uint8_t arrRange[] = {2, 2, 3, 4};
// const uint8_t arrRange[] = {2, 3, 3, 3};
// const uint8_t arrRange[] = {2, 2, 2, 4};
// const uint8_t arrRange[] = {2, 2, 3, 3}; // slow
// const uint8_t arrRange[] = {2, 2, 2, 3}; // proven
// const uint8_t arrRange[] = {2, 2, 2, 2}; // proven
//
// *** following cases require MORE_PLACES to be non-zero ***
//
// const uint8_t arrRange[] = {2, 2, 2, 2, 4}; // wants PRUNE_IMBALANCE = 2
// const uint8_t arrRange[] = {2, 2, 2, 3, 3}; // wants PRUNE_IMBALANCE = 4
// const uint8_t arrRange[] = {2, 2, 2, 2, 3}; // wants PRUNE_IMBALANCE = 2
// const uint8_t arrRange[] = {2, 2, 2, 2, 2};
// const uint8_t arrRange[] = {2, 2, 2, 2, 2, 2};
//
CBalaGray bg;
bg.Calc(_countof(arrRange), arrRange);
fgetc(stdin);
}
int _tmain(int argc, _TCHAR* argv[])
{
test();
return 0;
}

Within CPLEX, I would use CP Optimizer.
For instance, in OPL:
// 2 2 2 2
using CP;
int Size=4;
int r[1..Size]=[2,2,2,2];
int States=prod(i in 1..Size) r[i];
int fig[1..States][1..Size];
execute
{
var index=0;
for(var f1=1;f1<=r[1];f1++)
for(var f2=1;f2<=r[2];f2++)
for(var f3=1;f3<=r[3];f3++)
for(var f4=1;f4<=r[4];f4++)
{
index++;
fig[index][1]=f1;
fig[index][2]=f2;
fig[index][3]=f3;
fig[index][4]=f4;
}
}
dvar int x[1..States] in 1..States; // list of States in the right order
dvar int change[1..States] in 1..Size; // the figure that is different next time
dexpr int nbChanges[i in 1..Size]=count(change,i);
dexpr int inbalance=max(i in 1..Size) nbChanges[i]-min(i in 1..Size) nbChanges[i];
dvar int+ nochangeForThatManyTimes[1..States][1..Size] in 1..maxint;
dexpr int maxspan=max(i in 1..States,j in 1..Size) nochangeForThatManyTimes[i][j];
minimize staticLex(inbalance,maxspan);
subject to
{
x[1]==1;
allDifferent(x);
// Gray
forall(i in 1..States,j in 1..Size)
((fig[x[i]][j]==fig[x[(i<States)?(i+1):1]][j])==(j!=change[i]));
forall(i in 2..States,j in 1..Size)
{
(j==change[i-1]) => (nochangeForThatManyTimes[i][j]==1);
(j!=change[i-1]) => (nochangeForThatManyTimes[i][j]==1+nochangeForThatManyTimes[i-1][j]);
}
forall(j in 1..Size)
(j==change[States]) ==
(nochangeForThatManyTimes[1][j]==1);
}
execute
{
for(var i=1;i<=States;i++)
{
for(var j=1;j<=Size;j++) write(fig[x[i]][j]-1);
writeln();
}
writeln();
writeln("inbalance = ",inbalance);
writeln("maxspan = ",maxspan);
}
gives
0000
1000
1010
1011
1001
1101
0101
0001
0011
0010
0110
0111
1111
1110
1100
0100
inbalance = 0
maxspan = 6
and with 3,3,3,3 and a time limit of 12000 seconds I got
OBJECTIVE: 1; 14
0000
1000
2000
2100
2110
2010
2012
0012
0010
0011
2011
1011
1010
1012
1022
1002
0002
0102
0202
0212
0210
2210
2200
2220
0220
0222
1222
2222
2202
2212
2112
1112
0112
0110
0120
0020
0021
0022
0122
0121
0111
0211
0221
1221
1220
1210
1211
1212
1202
1102
1122
1120
1110
1111
1121
1021
1020
2020
2120
2122
2102
2002
2022
2021
2121
2221
2211
2111
2101
2001
1001
0001
0101
1101
1201
2201
0201
0200
1200
1100
0100
inbalance = 1
maxspan = 14
and after 10 hours
OBJECTIVE: 1; 10
0000
0001
0002
0012
0010
0011
0111
0121
0221
1221
1021
0021
0022
0222
0122
0102
0112
0212
0202
0201
0101
1101
1001
1011
1012
1112
1110
0110
2110
2010
2020
2000
2002
1002
1022
1020
1120
0120
2120
2121
2221
2222
2202
2200
0200
1200
1000
1010
1210
2210
0210
0211
1211
1201
1202
1102
1100
0100
2100
2101
2102
2112
2012
2022
2122
1122
1121
1111
2111
2011
2021
2001
2201
2211
2212
1212
1222
1220
2220
0220
0020
inbalance = 1
maxspan = 10
In order to get
inbalance = 1
maxspan = 9
with
0000
1000
1001
1201
1200
1210
0210
2210
2211
2201
2200
2000
2010
1010
0010
0110
0112
0102
0202
0212
0012
1012
1011
1111
0111
0101
1101
1100
1102
1002
1022
1020
1120
1121
1021
0021
0011
0211
0221
0222
0122
1122
2122
2102
2202
1202
1222
1212
1211
1221
2221
2222
2212
2012
2022
0022
0002
2002
2001
2101
2121
0121
0120
0020
0220
1220
2220
2020
2021
2011
2111
2112
1112
1110
2110
2120
2100
0100
0200
0201
0001
I slightly improved the model
execute
{
cp.param.timelimit=36000;
}
using CP;
int Size=4;
int r[1..Size]=[3,3,3,3];
int maxr=max(i in 1..Size) r[i];
int States=prod(i in 1..Size) r[i];
int fig[1..States][1..Size];
int which[i1 in 1..r[1]][i2 in 1..r[2]][i3 in 1..r[3]][i4 in 1..r[4]];
execute
{
var index=0;
for(var f1=1;f1<=r[1];f1++)
for(var f2=1;f2<=r[2];f2++)
for(var f3=1;f3<=r[3];f3++)
for(var f4=1;f4<=r[4];f4++)
{
index++;
fig[index][1]=f1;
fig[index][2]=f2;
fig[index][3]=f3;
fig[index][4]=f4;
which[f1][f2][f3][f4]=index;
}
}
dvar int x[1..States] in 1..States; // list of States in the right order
dvar int y[1..States] in 1..States;
dvar int change[1..States] in 1..Size; // the figure that is different next time
dvar int move[1..States] in 1..maxr;
dexpr int nbChanges[i in 1..Size]=count(change,i);
dexpr int inbalance=max(i in 1..Size) nbChanges[i]-min(i in 1..Size) nbChanges[i];
dvar int+ nochangeForThatManyTimes[1..States][1..Size] in 1..maxint;
dexpr int maxspan=max(i in 1..States,j in 1..Size) nochangeForThatManyTimes[i][j];
minimize staticLex(inbalance,maxspan);
subject to
{
// inverse(x,y);
// allDifferent(y);
x[1]==1;
forall(i in 1..States) move[i]<=r[change[i]];
//inverse(x,y);
change[1]==1;
allDifferent(x);
// Gray
// forall(i in 1..States,j in 1..Size)
// {
// (j!=change[i]) == (fig[x[i]][j]==fig[x[(i<States)?(i+1):1]][j]);
// (j==change[i]) == ((fig[x[i]][j]+move[i]-1) mod r[j]+1==fig[x[(i<States)?(i+1):1]][j]);
//}
forall(i in 1..States)
x[(i<States)?(i+1):1]
==which
[(fig[x[i]][1]+(1==change[i])*move[i]-1) mod r[1]+1]
[(fig[x[i]][2]+(2==change[i])*move[i]-1) mod r[2]+1]
[(fig[x[i]][3]+(3==change[i])*move[i]-1) mod r[3]+1]
[(fig[x[i]][4]+(4==change[i])*move[i]-1) mod r[4]+1]
;
inferred(change);
inferred(move);
inferred(nochangeForThatManyTimes);
inbalance>=States mod 2;
forall(i in 2..States,j in 1..Size)
{
(j==change[i-1]) => (nochangeForThatManyTimes[i][j]==1);
(j!=change[i-1]) => (nochangeForThatManyTimes[i][j]==1+nochangeForThatManyTimes[i-1][j]);
}
forall(j in 1..Size)
(j==change[States]) ==
(nochangeForThatManyTimes[1][j]==1);
}
execute
{
for(var i=1;i<=States;i++)
{
for(var j=1;j<=Size;j++) write(fig[x[i]][j]-1);
writeln();
}
writeln();
writeln("inbalance = ",inbalance);
writeln("maxspan = ",maxspan);
}
NB: You can use CPLEX for free in the cloud with this OPL API

Well, after some tinkering, I have an integer program running for this that I think is producing quality results. I tried a couple of approaches; each had different limitations.
It is a little grotesque in parts, as counting the repeated digits is quite cumbersome.
It really bogs down for anything with ~30 states or more, so it's not going to make it to the finish line. :) I think it would be much more nimble if I removed the repeat counting, and I'll tinker a bit more. In the interim, here are some results for the cases not marked as proven on your web page. The (4, 6) run (the second run) is an improvement; the other two are now "proven" as stated, perhaps with a different sequence (I didn't cross-check).
I'll update later with any other improvements.
starting run: (3, 7)
WARNING: Initializing ordered Set PR_flat with a fundamentally unordered data
source (type: set). This WILL potentially lead to nondeterministic
behavior in Pyomo
Problem:
- Name: unknown
Lower bound: 1.3
Upper bound: 1.3
Number of objectives: 1
Number of constraints: 3744
Number of variables: 3539
Number of binary variables: 3557
Number of integer variables: 3560
Number of nonzeros: 3
Sense: minimize
Solver:
- Status: ok
User time: -1.0
System time: 353.01
Wallclock time: 305.85
Termination condition: optimal
Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
Statistics:
Branch and bound:
Number of bounded subproblems: 1018
Number of created subproblems: 1018
Black box:
Number of iterations: 667719
Error rc: 0
Time: 306.00031781196594
Solution:
- number of solutions: 0
number of solutions displayed: 0
11
01
02
00
10
20
21
22
12
13
23
03
05
25
26
24
14
04
06
16
15
max imbalance: 1.0
max repeats: 3.0
starting run: (4, 6)
WARNING: Initializing ordered Set PR_flat with a fundamentally unordered data
source (type: set). This WILL potentially lead to nondeterministic
behavior in Pyomo
Problem:
- Name: unknown
Lower bound: 0.2
Upper bound: 0.2
Number of objectives: 1
Number of constraints: 4854
Number of variables: 4619
Number of binary variables: 4640
Number of integer variables: 4643
Number of nonzeros: 3
Sense: minimize
Solver:
- Status: ok
User time: -1.0
System time: 34.21
Wallclock time: 34.89
Termination condition: optimal
Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
Statistics:
Branch and bound:
Number of bounded subproblems: 1
Number of created subproblems: 1
Black box:
Number of iterations: 14167
Error rc: 0
Time: 34.923232078552246
Solution:
- number of solutions: 0
number of solutions displayed: 0
10
13
33
34
14
15
35
32
02
03
23
22
12
11
01
00
30
31
21
24
04
05
25
20
max imbalance: 0.0
max repeats: 2.0
starting run: (5, 5)
WARNING: Initializing ordered Set PR_flat with a fundamentally unordered data
source (type: set). This WILL potentially lead to nondeterministic
behavior in Pyomo
Problem:
- Name: unknown
Lower bound: 1.3
Upper bound: 1.3
Number of objectives: 1
Number of constraints: 5256
Number of variables: 5011
Number of binary variables: 5033
Number of integer variables: 5036
Number of nonzeros: 3
Sense: minimize
Solver:
- Status: ok
User time: -1.0
System time: 915.71
Wallclock time: 634.99
Termination condition: optimal
Termination message: Model was solved to optimality (subject to tolerances), and an optimal solution is available.
Statistics:
Branch and bound:
Number of bounded subproblems: 1764
Number of created subproblems: 1764
Black box:
Number of iterations: 1855323
Error rc: 0
Time: 635.0473001003265
Solution:
- number of solutions: 0
number of solutions displayed: 0
11
01
31
33
34
44
04
03
00
40
41
42
22
23
43
13
12
02
32
30
10
20
21
24
14
max imbalance: 1.0
max repeats: 3.0

OpenCL kernel function crash

I have written some OpenCL code that does not use local (shared) memory. It crashes during execution with error -5. The error goes away when I replace the global memory accesses to the cvt_img buffer (in the middle of the code) with constant values.
I do not understand why this happens, because I guard against out-of-bounds memory accesses with an if statement.
This code is part of a 3D pipeline, but for now I have separated it from my main application and put it in a separate project in which all of the buffers are initialized randomly.
The size of the grid (in terms of number of threads) is the same as size of the image (img_size.x, img_size.y) and size of the block is (16, 16). The application is running for 15 images.
void compute_cost_volume(
global float3 *cvt_img,
global float8 *spixl_map,
global float *disp_level,
global int *view_subset,
global int *subset_num,
int array_width, int2 map_size,
int2 img_size, float bl_ratio,
int sp_size, int num_disp, float2 step,
int x, int y, int z, int view_count
)
{
barrier(CLK_GLOBAL_MEM_FENCE);
int idx = map_size.x * map_size.y * z + map_size.x * y + x;
float8 spixl = spixl_map[idx];
float2 center = spixl.s12;
int2 camIdx = (int2)(z % array_width, z / array_width);
float cost_est = 1000000.0, disp_est = 0.0;
for (int dl = 0 ; dl < num_disp ; dl++)
{
float d = disp_level[dl];
float min_val = 1000000.0;
for (int n = 0 ; n < subset_num[z] ; n++)
{
int view = view_subset[n];
int2 viewIdx = (int2)(view % array_width, view / array_width);
float val = 0.0;
for (int i = -2 ; i <= 2 ; i++) for (int j = -2 ; j <= 2 ; j++)
{
//int2 xy_ref = (int2)(center.x - 2*step.x + i*step.x, center.y - 2*step.y + j*step.y);
int2 xy_ref = (int2)(center.x + i*step.x, center.y + j*step.y);
int2 xy_proj = (int2)((int)(xy_ref.x - d*(viewIdx.x - camIdx.x)), (int)(xy_ref.y - bl_ratio*d*(viewIdx.y - camIdx.y) ) );
if (xy_ref.x >= 0 && xy_ref.y >= 0 && xy_proj.x >= 0 && xy_proj.y >= 0 && xy_ref.x < img_size.x && xy_ref.y < img_size.y && xy_proj.x < img_size.x && xy_proj.y < img_size.y)
{
float3 color_ref = cvt_img[img_size.x*img_size.y*z + img_size.x*xy_ref.y + xy_ref.x];
float3 color_proj = cvt_img[img_size.x*img_size.y*view + img_size.x*xy_proj.y + xy_proj.x];
val += fabs(color_ref.x - color_proj.x) + fabs(color_ref.y - color_proj.y) + fabs(color_ref.z - color_proj.z);
}
else
val += 30;
}
if (val < min_val)
min_val = val;
}
if (min_val < cost_est)
{
cost_est = min_val;
disp_est = d;
}
}
spixl_map[idx].s7 = disp_est;
}
kernel void initial_depth_estimation(
global float3 *cvt_img,
global float8 *spixl_map,
global float *disp_level,
int array_width, int2 map_size,
int2 img_size, float bl_ratio,
int sp_size, int disp_num,
global int *view_subset, global int *subset_num
)
{
int x = get_global_id(0);
int y = get_global_id(1);
if (x >= map_size.x || y >= map_size.y)
return;
//float2 step = (float2)(1, 1);
for (int z = 0 ; z < 15 ; z++){
int idx = map_size.x*map_size.y*z + map_size.x*y + x;
// Set The Bounding Box
float2 step = (float2)(1.0, 1.0);
compute_cost_volume(cvt_img, spixl_map, disp_level, view_subset, subset_num,
array_width, map_size, img_size, bl_ratio, sp_size, disp_num, step, x, y, z, 15);
barrier(CLK_LOCAL_MEM_FENCE);
}
}

From the OpenCL 1.0 documentation
https://www.khronos.org/registry/OpenCL/sdk/1.0/docs/man/xhtml/vectorDataTypes.html
"The vector data type is defined with the type name i.e. char, uchar, short, ushort, int, uint, float, long, and ulong followed by a literal value n that defines the number of elements in the vector. Supported values of n are 2, 4, 8, and 16."
Therefore, OpenCL 1.0 has no float3. It was added in OpenCL 1.1, where a float3 has the same size and alignment as a float4 (16 bytes). Either way, maybe you can try to use float4 and make the last element zero?
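Also, assuming that float3 is available, note what this line of code actually reads:
float3 color_proj = cvt_img[img_size.x*img_size.y*view + img_size.x*xy_proj.y + xy_proj.x];
cvt_img is a pointer to float3, so the index selects one whole float3 element and no cast is needed. Each element occupies 16 bytes, though, so the host buffer must hold 4 floats per pixel. If it was filled with 3 tightly packed floats per pixel, this index runs past the end of the buffer even though the if statement passes. For packed data, declare the argument as global float * and read each pixel with vload3:
float3 color_proj = vload3(img_size.x*img_size.y*view + img_size.x*xy_proj.y + xy_proj.x, cvt_img);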
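If the host data is stored as 3 packed floats per pixel, one option is to repack it on the host before uploading, so that a kernel indexing 16-byte elements stays inside the buffer. The following is a minimal host-side sketch in C; the helper name pad_rgb_to_float4 and the constants are hypothetical, and it assumes the kernel reads 4 floats per pixel (float4, or float3 under OpenCL 1.1 or later).

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Floats per pixel in the packed host layout and in the padded layout that
   a kernel indexing 16-byte vector elements expects (assumption: float4,
   or float3 under OpenCL 1.1+). */
enum { PACKED_CHANNELS = 3, PADDED_CHANNELS = 4 };

/* Hypothetical helper: copy tightly packed RGB floats (3 per pixel) into a
   newly allocated buffer with 4 floats per pixel, zeroing the padding lane.
   Returns NULL on allocation failure; the caller frees the result. */
float *pad_rgb_to_float4(const float *packed, size_t pixel_count)
{
    assert(packed != NULL);
    float *padded = malloc(pixel_count * PADDED_CHANNELS * sizeof *padded);
    if (padded == NULL)
        return NULL;
    for (size_t p = 0; p < pixel_count; p++) {
        for (size_t c = 0; c < PACKED_CHANNELS; c++)
            padded[p * PADDED_CHANNELS + c] = packed[p * PACKED_CHANNELS + c];
        padded[p * PADDED_CHANNELS + 3] = 0.0f; /* unused fourth lane */
    }
    return padded;
}
```

The device buffer would then be created with pixel_count * 4 * sizeof(float) bytes rather than pixel_count * 3 * sizeof(float).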

How to pass a pointer argument to a function without knowing the size to be allocated for that pointer

I know this question is very noob. I am trying to understand how the pointer thing works. I studied basics of C but still did not understand this.
Given this piece of function:
+ (void)nv21ToRgbWithWidth:(unsigned int)width height:(unsigned int)height yuyv:(unsigned char *)yuyv rgb:(unsigned char *)rgb
{
const int nv_start = width * height ;
UInt32 i, j, index = 0, rgb_index = 0;
UInt8 y, u, v;
int r, g, b, nv_index = 0;
for(i = 0; i < height ; i++)
{
for(j = 0; j < width; j ++){
//nv_index = (rgb_index / 2 - width / 2 * ((i + 1) / 2)) * 2;
nv_index = i / 2 * width + j - j % 2;
y = yuyv[rgb_index];
u = yuyv[nv_start + nv_index ];
v = yuyv[nv_start + nv_index + 1];
r = y + (140 * (v-128))/100; //r
g = y - (34 * (u-128))/100 - (71 * (v-128))/100; //g
b = y + (177 * (u-128))/100; //b
if(r > 255) r = 255;
if(g > 255) g = 255;
if(b > 255) b = 255;
if(r < 0) r = 0;
if(g < 0) g = 0;
if(b < 0) b = 0;
index = rgb_index % width + (height - i - 1) * width;
rgb[index * 3+0] = b;
rgb[index * 3+1] = g;
rgb[index * 3+2] = r;
rgb_index++;
}
}
}
How am I supposed to know how the unsigned char * for rgb should be initialized before passing it in to the function?
I tried calling the function like this:
unsigned char *rgb = NULL;
[MyClass nv21ToRgbWithWidth:imageWidth height:imageHeight yuyv:yuyvValues rgb:rgb];
But the program crashes on this line:
rgb[index * 3+0] = b;
I see that rgb was initialized with NULL, so the function cannot write through it. So I thought of declaring an array and pointing rgb at it like this:
unsigned char rgbArr[10000];
unsigned char *rgb = rgbArr;
but the function still crashes. I really don't know how I should pass the rgb parameter to this function. Please help me understand this.

The expected size in bytes seems to be at least height*width*3. A 10000-byte array is only large enough for an image of about 3333 pixels, so for anything larger the function writes past its end. Simply enlarging the local array (as in unsigned char rgbArr[10000]) may exceed the stack limit, and the program likely crashes in that case too. I'd try to use the heap instead:
unsigned char* rgb = malloc(imageHeight*imageWidth*3);
[MyClass nv21ToRgbWithWidth:imageWidth height:imageHeight yuyv:yuyvValues rgb:rgb];
...
free(rgb);
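To make the size reasoning concrete, here is a small sketch in plain C. The helper names (rgb_buffer_size, nv21_buffer_size, alloc_rgb) are hypothetical; the input size assumes the layout the function indexes, namely a full-resolution Y plane followed by interleaved chroma at half resolution, with even width and height.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Bytes the conversion routine writes: 3 bytes (B, G, R) per pixel. */
size_t rgb_buffer_size(size_t width, size_t height)
{
    return width * height * 3;
}

/* Bytes the routine reads, assuming a full-resolution Y plane followed by
   interleaved chroma subsampled by 2 in both directions (even width and
   height assumed). */
size_t nv21_buffer_size(size_t width, size_t height)
{
    return width * height + width * height / 2;
}

/* Hypothetical helper: allocate a zeroed output buffer of the right size.
   Returns NULL on failure; the caller must free() the result. */
unsigned char *alloc_rgb(size_t width, size_t height)
{
    assert(width > 0 && height > 0);
    return calloc(rgb_buffer_size(width, height), 1);
}
```

For a 640x480 image the output needs 921600 bytes, far more than the 10000-byte array in the question, and the input must hold at least 460800 bytes.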

That is what the malloc(), calloc(), realloc() and free() functions are for. Don't forget to use the free() function to prevent memory leaks... I hope that helps.

go benchmark and gc: B/op alloc/op

benchmark code:
func BenchmarkSth(b *testing.B) {
var x []int
b.ResetTimer()
for i := 0; i < b.N; i++ {
x = append(x, i)
}
}
result:
BenchmarkSth-4 50000000 20.7 ns/op 40 B/op 0 allocs/op
question/s:
where did 40 B/op come from? (any way of tracing + instructions is greatly appreciated)
how is it possible to have 40 B/op while having 0 allocs?
which one affects GC and how? (B/op or allocs/op)
is it really possible to have 0 B/op using append?

The Go Programming Language Specification
Appending to and copying slices
The variadic function append appends zero or more values x to s of
type S, which must be a slice type, and returns the resulting slice,
also of type S.
append(s S, x ...T) S // T is the element type of S
If the capacity of s is not large enough to fit the additional values,
append allocates a new, sufficiently large underlying array that fits
both the existing slice elements and the additional values. Otherwise,
append re-uses the underlying array.
For your example, on average, [40, 41) bytes per operation are allocated to increase the capacity of the slice when necessary. The capacity is increased using an amortized constant time algorithm: in the runtime used here, cap doubles while the slice is shorter than 1024 elements and then grows by a factor of about 1.25 (the exact policy is an implementation detail and has changed in later Go releases). On average, there are [0, 1) allocations per operation.
For example,
func BenchmarkMem(b *testing.B) {
b.ReportAllocs()
var x []int64
var a, ac int64
b.ResetTimer()
for i := 0; i < b.N; i++ {
c := cap(x)
x = append(x, int64(i))
if cap(x) != c {
a++
ac += int64(cap(x))
}
}
b.StopTimer()
sizeInt64 := int64(8)
B := ac * sizeInt64 // bytes
b.Log("op", b.N, "B", B, "alloc", a, "lx", len(x), "cx", cap(x))
}
Output:
BenchmarkMem-4 50000000 26.6 ns/op 40 B/op 0 allocs/op
--- BENCH: BenchmarkMem-4
bench_test.go:32: op 1 B 8 alloc 1 lx 1 cx 1
bench_test.go:32: op 100 B 2040 alloc 8 lx 100 cx 128
bench_test.go:32: op 10000 B 386296 alloc 20 lx 10000 cx 12288
bench_test.go:32: op 1000000 B 45188344 alloc 40 lx 1000000 cx 1136640
bench_test.go:32: op 50000000 B 2021098744 alloc 57 lx 50000000 cx 50539520
For op = 50000000,
B/op = floor(2021098744 / 50000000) = floor(40.421974888) = 40
allocs/op = floor(57 / 50000000) = floor(0.00000114) = 0
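The arithmetic can be checked with a small simulation. The sketch below is written in C rather than Go and uses a deliberately simplified growth model (start at 1, double below 1024 elements, then grow by a quarter); it is an assumption for illustration, not the Go runtime's code, and it ignores the rounding to allocator size classes that the real runtime applies. The names growth_stats and simulate_appends are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

typedef struct {
    int64_t allocs; /* number of times a new backing array was allocated */
    int64_t bytes;  /* total bytes of all backing arrays allocated */
} growth_stats;

/* Simplified model (assumption): capacity starts at 1, doubles while below
   1024 elements, then grows by a quarter. Size-class rounding is ignored. */
growth_stats simulate_appends(int64_t n, int64_t elem_size)
{
    assert(n >= 0 && elem_size > 0);
    growth_stats s = {0, 0};
    int64_t len = 0, cap = 0;
    for (int64_t i = 0; i < n; i++) {
        if (len == cap) {            /* backing array is full: grow it */
            if (cap == 0)
                cap = 1;
            else if (cap < 1024)
                cap *= 2;
            else
                cap += cap / 4;
            s.allocs++;
            s.bytes += cap * elem_size;
        }
        len++;
    }
    return s;
}
```

For 100 appends of 8-byte elements the model gives 8 allocations and 2040 bytes, which agrees with the "op 100" log line above. Past 1024 elements the real capacities are rounded up to size classes, so the model's figures drift from the logged ones, but bytes divided by operations stays a small constant while allocations divided by operations tends to zero.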
Read:
Go Slices: usage and internals
Arrays, slices (and strings): The mechanics of 'append'
'append' complexity
To have zero B/op (and zero allocs/op) for append, allocate a slice with sufficient capacity before appending.
For example, with var x = make([]int64, 0, b.N),
func BenchmarkZero(b *testing.B) {
b.ReportAllocs()
var x = make([]int64, 0, b.N)
var a, ac int64
b.ResetTimer()
for i := 0; i < b.N; i++ {
c := cap(x)
x = append(x, int64(i))
if cap(x) != c {
a++
ac += int64(cap(x))
}
}
b.StopTimer()
sizeInt64 := int64(8)
B := ac * sizeInt64 // bytes
b.Log("op", b.N, "B", B, "alloc", a, "lx", len(x), "cx", cap(x))
}
Output:
BenchmarkZero-4 100000000 11.7 ns/op 0 B/op 0 allocs/op
--- BENCH: BenchmarkZero-4
bench_test.go:51: op 1 B 0 alloc 0 lx 1 cx 1
bench_test.go:51: op 100 B 0 alloc 0 lx 100 cx 100
bench_test.go:51: op 10000 B 0 alloc 0 lx 10000 cx 10000
bench_test.go:51: op 1000000 B 0 alloc 0 lx 1000000 cx 1000000
bench_test.go:51: op 100000000 B 0 alloc 0 lx 100000000 cx 100000000
Note the reduction in benchmark CPU time from around 26.6 ns/op to around 11.7 ns/op.

Generate all combinations of a char array inside of a CUDA device kernel

I need help please. I started to program a common brute forcer / password guesser with CUDA (2.3 / 3.0beta).
I tried different ways to generate all possible plain text "candidates" of a defined ASCII char set.
In this sample code I want to generate all 74^4 possible combinations (and just output the result back to host/stdout).
$ ./combinations
Total number of combinations : 29986576
Maximum output length : 4
ASCII charset length : 74
ASCII charset : 0x30 - 0x7a
"0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxy"
CUDA code (compiled with 2.3 and 3.0b - sm_10) - combinations.cu:
#include <stdio.h>
#include <cuda.h>
__device__ uchar4 charset_global = {0x30, 0x30, 0x30, 0x30};
__shared__ __device__ uchar4 charset[128];
__global__ void combo_kernel(uchar4 * result_d, unsigned int N)
{
int totalThreads = blockDim.x * gridDim.x ;
int tasksPerThread = (N % totalThreads) == 0 ? N / totalThreads : N/totalThreads + 1;
int myThreadIdx = blockIdx.x * blockDim.x + threadIdx.x ;
int endIdx = myThreadIdx + totalThreads * tasksPerThread ;
if( endIdx > N) endIdx = N;
const unsigned int m = 74 + 0x30;
for(int idx = myThreadIdx ; idx < endIdx ; idx += totalThreads) {
charset[threadIdx.x].x = charset_global.x;
charset[threadIdx.x].y = charset_global.y;
charset[threadIdx.x].z = charset_global.z;
charset[threadIdx.x].w = charset_global.w;
__threadfence();
if(charset[threadIdx.x].x < m) {
charset[threadIdx.x].x++;
} else if(charset[threadIdx.x].y < m) {
charset[threadIdx.x].x = 0x30; // = 0
charset[threadIdx.x].y++;
} else if(charset[threadIdx.x].z < m) {
charset[threadIdx.x].y = 0x30; // = 0
charset[threadIdx.x].z++;
} else if(charset[threadIdx.x].w < m) {
charset[threadIdx.x].z = 0x30;
charset[threadIdx.x].w++;; // = 0
}
charset_global.x = charset[threadIdx.x].x;
charset_global.y = charset[threadIdx.x].y;
charset_global.z = charset[threadIdx.x].z;
charset_global.w = charset[threadIdx.x].w;
result_d[idx].x = charset_global.x;
result_d[idx].y = charset_global.y;
result_d[idx].z = charset_global.z;
result_d[idx].w = charset_global.w;
}
}
#define BLOCKS 65535
#define THREADS 128
int main(int argc, char **argv)
{
const int ascii_chars = 74;
const int max_len = 4;
const unsigned int N = pow((float)ascii_chars, max_len);
size_t size = N * sizeof(uchar4);
uchar4 *result_d, *result_h;
result_h = (uchar4 *)malloc(size );
cudaMalloc((void **)&result_d, size );
cudaMemset(result_d, 0, size);
printf("Total number of combinations\t: %d\n\n", N);
printf("Maximum output length\t: %d\n", max_len);
printf("ASCII charset length\t: %d\n\n", ascii_chars);
printf("ASCII charset\t: 0x30 - 0x%02x\n ", 0x30 + ascii_chars);
for(int i=0; i < ascii_chars; i++)
printf("%c",i + 0x30);
printf("\n\n");
combo_kernel <<< BLOCKS, THREADS >>> (result_d, N);
cudaThreadSynchronize();
printf("CUDA kernel done\n");
printf("hit key to continue...\n");
getchar();
cudaMemcpy(result_h, result_d, size, cudaMemcpyDeviceToHost);
for (unsigned int i=0; i<N; i++)
printf("result[%06u]\t%c%c%c%c\n",i, result_h[i].x, result_h[i].y, result_h[i].z, result_h[i].w);
free(result_h);
cudaFree(result_d);
}
The code should compile without any problems, but the output is not what I expected.
On emulation mode:
CUDA kernel done
hit key to continue...
result[000000] 1000
...
result[000128] 5000
On release mode:
CUDA kernel done
hit key to continue...
result[000000] 1000
...
result[012288] 5000
I also tried __threadfence() and/or __syncthreads() on different lines of the code, also without success.
P.S. If possible, I want to generate everything inside the kernel function. I also tried pre-generating the possible plain text candidates inside the host main function and copying them to the device with memcpy, but this works only with a very limited charset size (because of limited device memory).
Any idea about the output? Why the repetition (even with __threadfence() or __syncthreads())?
Is there any other method to generate plain text candidates inside a CUDA kernel fast (~75^8)?
thanks a million
greets jan

Incidentally, your loop bound is overly complex. You don't need to do all that work to compute endIdx; the loop condition can test against N directly, which makes the code simpler:
for(int idx = myThreadIdx ; idx < N ; idx += totalThreads)

Let's see:
When filling your charset array, __syncthreads() will be sufficient as you are not interested in writes to global memory (more on this later)
Your if statements are not correctly resetting your loop iterators:
In the z < m branch, both x == m and y == m, so both must be reset to their starting value (0x30 in your kernel).
Similarly, in the w < m branch, x, y and z must all be reset.
Each thread is responsible for writing one set of 4 characters in charset, but every thread writes the same 4 values. No thread does any independent work.
You are writing each thread's results to global memory without atomics, which is unsafe. There is no guarantee that the results won't be clobbered by another thread before you read them back.
You are reading the results of computation back from global memory immediately after writing them to global memory. It's unclear why you are doing this and this is very unsafe.
Finally, in the CUDA versions discussed here there is no reliable way to do a synchronization between all blocks from inside a kernel, which seems to be what you are hoping for. Calling __threadfence only applies to blocks currently executing on the device, which can be a subset of all blocks that should run for a kernel call. Thus it doesn't work as a synchronization primitive.
It's probably easier to calculate initial values of x, y, z and w for each thread. Then each thread can start looping from its initial values until it has performed tasksPerThread iterations. Writing the values out can probably proceed more or less as you have it now.
EDIT: Here is a simple test program to demonstrate the logic errors in your loop iteration:
int m = 2;
int x = 0, y = 0, z = 0, w = 0;
for (int i = 0; i < m * m * m * m; i++)
{
printf("x: %d y: %d z: %d w: %d\n", x, y, z, w);
if(x < m) {
x++;
} else if(y < m) {
x = 0; // = 0
y++;
} else if(z < m) {
y = 0; // = 0
z++;
} else if(w < m) {
z = 0;
w++;; // = 0
}
}
The output of which is this:
x: 0 y: 0 z: 0 w: 0
x: 1 y: 0 z: 0 w: 0
x: 2 y: 0 z: 0 w: 0
x: 0 y: 1 z: 0 w: 0
x: 1 y: 1 z: 0 w: 0
x: 2 y: 1 z: 0 w: 0
x: 0 y: 2 z: 0 w: 0
x: 1 y: 2 z: 0 w: 0
x: 2 y: 2 z: 0 w: 0
x: 2 y: 0 z: 1 w: 0
x: 0 y: 1 z: 1 w: 0
x: 1 y: 1 z: 1 w: 0
x: 2 y: 1 z: 1 w: 0
x: 0 y: 2 z: 1 w: 0
x: 1 y: 2 z: 1 w: 0
x: 2 y: 2 z: 1 w: 0
