How do I model a chessboard when programming a computer to play chess?

What data structures would you use to represent a chessboard for a computer chess program?

For a serious chess engine, bitboards are an efficient way to represent a chess board in memory. Bitboards are generally faster than array-based representations, especially on 64-bit architectures where a bitboard fits in a single CPU register.
Bitboards
The basic idea of bitboards is to represent every chess piece type in 64 bits. In C# this is a ulong (UInt64); in C++ it is uint64_t. You maintain 12 such variables to represent your chess board: two (one black and one white) for each piece type, namely pawn, rook, knight, bishop, queen and king. Every bit in a UInt64 corresponds to a square on the chessboard. Typically, the least significant bit is the a1 square, the next b1, then c1 and so on in row-major fashion. The bit corresponding to a piece's location on the board is set to 1; all others are set to 0. For example, if you have two white rooks on a2 and h1, the white rooks bitboard will look like this:
0000000000000000000000000000000000000000000000000000000110000000
Now, for example, if you wanted to move your rook from a2 to g2 in the above example, all you need to do is XOR your bitboard with:
0000000000000000000000000000000000000000000000000100000100000000
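The rook example above can be sketched in a few lines of Python. This is only an illustration under the a1 = bit 0 layout described in this answer; the helper name `square_bit` is made up here and is not taken from any engine.

```python
def square_bit(square: str) -> int:
    """Return a bitboard with only the given square set (a1 = bit 0, b1 = bit 1, ...)."""
    file = "abcdefgh".index(square[0])
    rank = int(square[1]) - 1
    return 1 << (rank * 8 + file)

# White rooks on a2 and h1.
white_rooks = square_bit("a2") | square_bit("h1")

# Moving a2 -> g2 clears a2 and sets g2 with a single XOR.
move_mask = square_bit("a2") | square_bit("g2")
white_rooks_after = white_rooks ^ move_mask

print(format(white_rooks, "064b"))
print(format(move_mask, "064b"))
print(format(white_rooks_after, "064b"))
```

Python integers are arbitrary precision, so a real engine in C, C++ or C# would use a fixed 64-bit unsigned type instead.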
Bitboards have a performance advantage when it comes to move generation. There are other performance advantages too that spring naturally from bitboards representation. For example you could use lockless hash tables which are an immense advantage when parallelising your search algorithm.
Further Reading
The ultimate resource for chess engine development is the Chess Programming Wiki. I've recently written this chess engine which implements bitboards in C#. An even better open source chess engine is Stockfish, which also implements bitboards, but in C++.

Initially, use an 8 * 8 integer array to represent the chess board.
You can start programming using this notation. Give point values for the pieces. For example:
**White**
9 = white queen
5 = white rook
3 = bishop
3 = knight
1 = pawn
**Black**
-9 = black queen
-5 = black rook
-3 = bishop
-3 = knight
-1 = pawn
White King: very large positive number
Black King: very large negative number
etc. (Note that the points given above are approximations of trading power of each chess piece)
After you develop the basic backbones of your application and clearly understand the working of the algorithms used, try to improve the performance by using bit boards.
In bit boards, you use eight 8-bit words (64 bits in total) to represent a board. This representation needs a board for each chess piece type. In one bit board you store the positions of the rooks, while in another you store the positions of the knights, and so on.
Bit boards can improve the performance of your application considerably because manipulating the pieces with bit boards is very easy and fast.
As you pointed out,
"Most chess programs today, especially those that run on a 64 bit CPU, use a bitmapped approach to represent a chessboard and generate moves. 0x88 is an alternate board model for machines without 64 bit CPUs."

The simple approach is to use an 8x8 integer array. Use 0 for empty squares and assign values for the pieces:
1 white pawns
2 white knights
3 white bishops
4 white rooks
5 white queens
6 white king
Black pieces use negative values
-1 black pawn
-2 black knight
etc
8| -4 -2 -3 -5 -6 -3 -2 -4
7| -1 -1 -1 -1 -1 -1 -1 -1
6| 0 0 0 0 0 0 0 0
5| 0 0 0 0 0 0 0 0
4| 0 0 0 0 0 0 0 0
3| 0 0 0 0 0 0 0 0
2| 1 1 1 1 1 1 1 1
1| 4 2 3 5 6 3 2 4
-------------------------
1 2 3 4 5 6 7 8
Piece moves can be calculated by using the array indexes. For example the white pawns move by increasing the row index by 1, or by 2 if it's the pawn's first move. So the white pawn on [2][1] could move to [3][1] or [4][1].
However, this simple 8x8 array representation of a chessboard has several problems. Most notably, when you're moving 'sliding' pieces like rooks, bishops and queens, you constantly need to check the indexes to see whether the piece has moved off the board.
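To make the cost of those index checks concrete, here is a rough Python sketch of rook move generation on a plain 8x8 array. It assumes 0-based indices and ignores captures; `rook_moves` and `EMPTY` are illustrative names, not part of the answer.

```python
EMPTY = 0

def rook_moves(board, row, col):
    """Squares a rook on (row, col) can slide to on an 8x8 list-of-lists board.

    Indices are 0-based here, and captures are ignored to keep the sketch short.
    """
    moves = []
    for d_row, d_col in ((1, 0), (-1, 0), (0, 1), (0, -1)):
        r, c = row + d_row, col + d_col
        # The explicit range test below is the per-step check the answer refers to.
        while 0 <= r < 8 and 0 <= c < 8 and board[r][c] == EMPTY:
            moves.append((r, c))
            r, c = r + d_row, c + d_col
    return moves

board = [[EMPTY] * 8 for _ in range(8)]
board[0][0] = 4  # white rook, using the piece codes above
print(len(rook_moves(board, 0, 0)))  # 14 on an otherwise empty board
```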
Most chess programs today, especially those that run on a 64 bit CPU, use a bitmapped approach to represent a chessboard and generate moves. 0x88 is an alternate board model for machines without 64 bit CPUs.

Array of 120 bytes.
This is a chessboard of 8x8 surrounded by sentinel squares (e.g. a 255 to indicate that a piece can't move to this square). The sentinels have a depth of two so that a knight can't jump over.
To move right add 1. To move left add -1. Up is 10, down is -10, the up-and-right diagonal is 11, etc. Square A1 is index 21. H1 is index 28. H8 is index 98.
All designed for simplicity. But it's never going to be as fast as bitboards.
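A small Python sketch of this 120-entry layout, assuming A1 sits at index 21 and 255 marks an off-board square as described above. The function names are hypothetical and occupancy by other pieces is ignored.

```python
SENTINEL = 255  # off-board marker, as in the answer
EMPTY = 0

def make_board():
    """Build the 120-entry board: a 10x12 grid whose inner 8x8 block is playable."""
    board = [SENTINEL] * 120
    for rank in range(8):
        for file in range(8):
            board[21 + rank * 10 + file] = EMPTY
    return board

def slide(board, start, step):
    """Follow one direction until a sentinel is hit; no row/column arithmetic needed."""
    squares = []
    pos = start + step
    while board[pos] != SENTINEL:
        squares.append(pos)
        pos += step
    return squares

board = make_board()
A1 = 21
print(slide(board, A1, 1))   # along the first rank
print(slide(board, A1, 10))  # up the a-file
print(slide(board, A1, 11))  # up-and-right diagonal
```

The only test inside the loop is a comparison against the sentinel, which is what makes this layout simpler than a bare 8x8 array.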

Well, not sure if this helps, but Deep Blue used a single 6-bit number to represent a specific position on the board. This helped it save footprint on-chip in comparison to its competitor, which used a 64-bit bitboard.
This might not be relevant, since chances are, you might have 64 bit registers on your target hardware, already.

When creating my chess engine I originally went with the [8][8] approach, however recently I changed my code to represent the chess board using a 64 item array. I found that this implementation was about 1/3 more efficient, at least in my case.
One of the things you want to consider when doing the [8][8] approach is describing positions. For example if you wish to describe a valid move for a chess piece, you will need 2 bytes to do so. While with the [64] item array you can do it with one byte.
To convert from a position on the [64] item board to a [8][8] board you can simply use the following calculations:
Row= (byte)(index / 8)
Col = (byte)(index % 8)
Although I found that you never have to do that during the recursive move searching which is performance sensitive.
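The conversion above, and its inverse, might look like this in Python. This is a sketch only; the function names are illustrative and the original engine is written in C#.

```python
def to_row_col(index):
    """Split a 0-63 board index into (row, col), matching index / 8 and index % 8."""
    return divmod(index, 8)

def to_index(row, col):
    """Inverse mapping: pack (row, col) back into a single 0-63 index."""
    return row * 8 + col

print(to_row_col(27))  # (3, 3)
print(to_index(3, 3))  # 27
```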
For more information on building a chess engine, feel free to visit my blog that describes the process from scratch: www.chessbin.com
Adam Berent

There are of course a number of different ways to represent a chessboard, and the best way will depend on what is most important to you.
Your two main choices are between speed and code clarity.
If speed is your priority then you must use a 64 bit data type for each set of pieces on the board (e.g. white pawns, black queens, en passant pawns). You can then take advantage of native bitwise operations when generating moves and testing move legality.
If clarity of code is priority then forget bit shuffling and go for nicely abstracted data types like others have already suggested. Just remember that if you go this way you will probably hit a performance ceiling sooner.
To start you off, look at the code for Crafty (C) and SharpChess (C#).

An alternative to the standard 8x8 board is the 10x12 mailbox (so-called because, uh, I guess it looks like a mailbox). This is a one-dimensional array that includes sentinels around its "borders" to assist with move generation. It looks like this:
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, "a8", "b8", "c8", "d8", "e8", "f8", "g8", "h8", -1,
-1, "a7", "b7", "c7", "d7", "e7", "f7", "g7", "h7", -1,
-1, "a6", "b6", "c6", "d6", "e6", "f6", "g6", "h6", -1,
-1, "a5", "b5", "c5", "d5", "e5", "f5", "g5", "h5", -1,
-1, "a4", "b4", "c4", "d4", "e4", "f4", "g4", "h4", -1,
-1, "a3", "b3", "c3", "d3", "e3", "f3", "g3", "h3", -1,
-1, "a2", "b2", "c2", "d2", "e2", "f2", "g2", "h2", -1,
-1, "a1", "b1", "c1", "d1", "e1", "f1", "g1", "h1", -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
-1, -1, -1, -1, -1, -1, -1, -1, -1, -1
You can generate that array like this (in JavaScript):
function generateEmptyBoard() {
    var row = [];
    for (var i = 0; i < 120; i++) {
        row.push((i < 20 || i > 100 || !(i % 10) || i % 10 == 9)
            ? -1
            : i2an(i));
    }
    return row;
}
// converts an index in the mailbox into its corresponding value in algebraic notation
function i2an(i) {
    return "abcdefgh"[(i % 10) - 1] + (10 - Math.floor(i / 10));
}
Of course, in a real implementation you'd put actual piece objects where the board labels are. But you'd keep the negative ones (or something equivalent). Those locations make move generation a lot less painful because you can easily tell when you've run off the board by checking for that special value.
Let's first look at generating the legal moves for the knight (a non-sliding piece):
function knightMoves(square, board) {
    var i = an2i(square);
    var moves = [];
    [8, 12, 19, 21].forEach(function(offset) {
        [i + offset, i - offset].forEach(function(pos) {
            // make sure we're on the board
            if (board[pos] != -1) {
                // in a real implementation you'd also check whether
                // the squares you encounter are occupied
                moves.push(board[pos]);
            }
        });
    });
    return moves;
}
// converts a position in algebraic notation into its location in the mailbox
function an2i(square) {
    return "abcdefgh".indexOf(square[0]) + 1 + (10 - square[1]) * 10;
}
We know that the valid Knight moves are a fixed distance from the piece's starting point, so we only needed to check that those locations are valid (i.e. not sentinel squares).
The sliding pieces aren't much harder. Let's look at the bishop:
function bishopMoves(square, board) {
    var oSlide = function(direction) {
        return slide(square, direction, board);
    };
    return [].concat(oSlide(11), oSlide(-11), oSlide(9), oSlide(-9));
}
function slide(square, direction, board) {
    var moves = [];
    for (var pos = direction + an2i(square); board[pos] != -1; pos += direction) {
        // in a real implementation you'd also check whether
        // the squares you encounter are occupied
        moves.push(board[pos]);
    }
    return moves;
}
Here are some examples:
knightMoves("h1", generateEmptyBoard()) => ["f2", "g3"]
bishopMoves("a4", generateEmptyBoard()) => ["b3", "c2", "d1", "b5", "c6", "d7", "e8"]
Note that the slide function is a general implementation. You should be able to model the legal moves of the other sliding pieces fairly easily.

Using a bitboard would be an efficient way to represent the state of a chess board. The basic idea is that you use 64bit bitsets to represent each of the squares on the board, where first bit usually represents A1 (the lower left square), and the 64th bit represents H8 (the diagonally opposite square). Each type of piece (pawn, king, etc.) of each player (black, white) gets its own bit board and all 12 of these boards makes up the game state. For more information check out this Wikipedia article.

I would use a multidimensional array so that each element in an array is a grid reference to a square on the board.
Thus
board = array(A = array(1,2,3,4,5,6,7,8),
B = array(1,2,3,... etc...
etc...
)
Then board[A][1] is the board square A1.
In reality you would use numbers not letters to help keep your maths logic for where pieces are allowed to move to simple.

int[8][8]
0=no piece
1=king
2=queen
3=rook
4=knight
5=bishop
6=pawn
use positive ints for white and negative ints for black

I actually wouldn't model the chess board, I'd just model the position of the pieces.
You can have bounds for the chess board then.
Piece.x= x position of piece
Piece.y= y position of piece

I know this is a very old post which I have stumbled across a few times when googling chess programming, yet I feel I must mention it is perfectly feasible to model a chessboard with a 1D array e.g. chessboard[64];
I would say this is the simplest approach to chessboard representation...but of course it is a basic approach.
Is a 1D chessboard array structure more efficient than a 2D array (which needs a nested for loop to access and manipulate the indices)?
It is also possible to use a 1D array with more than 64 squares to represent OffBoard squares also e.g. chessboard[120]; (with the array sentinel and board playing squares correctly initialised).
Finally and again for completeness for this post I feel I must mention the 0x88 board array representation. This is quite a popular way to represent a chessboard which also accounts for offboard squares.
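For completeness, a brief Python sketch of the usual 0x88 idea: squares are stored in a 128-entry array with index 16 * rank + file, and an index is off the board whenever it has the 0x80 or 0x08 bit set. The function names are made up for this illustration.

```python
def square_0x88(file, rank):
    """Index of a square in a 128-entry 0x88 board (file and rank are 0-7)."""
    return 16 * rank + file

def on_board(index):
    """A square is on the board when neither the 0x80 nor the 0x08 bit is set."""
    return (index & 0x88) == 0

e4 = square_0x88(4, 3)
h1 = square_0x88(7, 0)
a8 = square_0x88(0, 7)
print(on_board(e4))       # True
print(on_board(h1 + 1))   # False: stepping right from h1 leaves the board
print(on_board(a8 + 16))  # False: stepping up from a8 leaves the board
```

A single AND replaces the separate row and column range checks that a plain 8x8 array needs.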

An array would probably be fine. If you wanted more convenient means of "traversing" the board, you could easily build methods to abstract away the details of the data structure implementation.

Related

numpy: Efficient lookup of multidimensional result for multidimensional key

To motivate the 'efficient' in the title: I am working with volumetric image data which can be up to 512x512x1000 pixels. So slow loops etc. are not really an option, particularly if the images need to be viewed in a GUI. Imagine sitting just 10s in front of a viewer waiting for images to load...
From two 3D input volumes x and y I calculate new 3D output volumes, currently up to three at a time e.g. by solving equation systems for each pixel. Since a lot of x,y combinations are actually repetitive and often even only a coherent meshgrid range is of interest, I am trying to speed up by creating a lookup table for this subregion. Works well, in my test case I need only ca. 3000 calculations instead of 30 million.
Now, to the problem: I am utterly failing at efficiently looking up the solutions of the 30 million x,y combinations from the 3000 solutions lookup table in a numpythonic way!
Let's try with an example:
# x y s1 s2
lookup = np.array([[ 4, 11, 23., 4. ],
[ 4, 12, 25., 13. ],
[ 5, 11, 21., 19. ],
[ 5, 12, 26., 56. ]])
I succeed in getting the index of one x,y pair following this post:
ii = np.where((lookup[:,0] == 4) & (lookup[:,1]==12))[0][0]
s1, s2 = lookup[ii,-2:]
print('At index',ii,':',s1,s2)
>>> At index 1 : 25.0 13.0
Q1: But how to vectorize this, i.e get full solutions arrays for the 30 million pixels?
s1, s2 = lookup[numpy_magic_with_xy, -2:]
Q2: And actually I'd like to set all solutions to zero for all x,y not within the region of interest. Where do I add that condition?
Q3: And what would really be the fastest way to achieve all this?
PS: I'm fine with using 1D image representations by working with x.ravel() etc. and reshaping at the end. Unless you tell me I don't need to and it's just slowing things down. Just doing it to still understand my own code I guess...

How to Make a Uniform Random Integer Generator from a Random Boolean Generator?

I have a hardware-based boolean generator that generates either 1 or 0 uniformly. How can I use it to make a uniform 8-bit integer generator? I'm currently using the collected booleans to create the binary string for the 8-bit integer. The generated integers aren't uniformly distributed. They follow the distribution explained on this page. Integers with the same number of 1's and 0's such as 85 (01010101) and -86 (10101010) have the highest chance of being generated, and integers with a lot of repeating bits such as 0 (00000000) and -1 (11111111) have the lowest chance.
Here's the page that I've annotated with probabilities for each possible 4-bit integer. We can see that they're not uniform. 3, 5, 6, -7, -6, and -4 that have the same number of 1's and 0's have ⁶/₁₆ probability while 0 and -1 that all of their bits are the same only have ¹/₁₆ probability.
And here's my implementation on Kotlin
Based on your edit, there appears to be a misunderstanding here. By "uniform 4-bit integers", you seem to have the following in mind:
1. Start at 0.
2. Generate a random bit. If it's 1, add 1, and otherwise subtract 1.
3. Repeat step 2 three more times.
4. Output the resulting number.
Although the random bit generator may generate bits where each outcome is as likely as the other to be randomly generated, and each 4-bit chunk may be just as likely as any other to be randomly generated, the number of 1 bits in each chunk is not uniformly distributed.
What range of integers do you want? Say you're generating 4-bit integers. Do you want a range of [-4, 4], as in the 4-bit random walk in your question, or do you want a range of [-8, 7], which is what you get when you treat a 4-bit chunk of bits as a two's complement integer?
If the former, the random walk won't generate a uniform distribution, and you will need to tackle the problem in a different way.
In this case, to generate a uniform random number in the range [-4, 4], do the following:
1. Take 4 bits of the random bit generator and treat them as an integer in [0, 15];
2. If the integer is greater than 8, go to step 1.
3. Subtract 4 from the integer and output it.
This algorithm uses rejection sampling, but is variable-time (thus is not appropriate whenever timing differences can be exploited in a security attack). Numbers in other ranges are similarly generated, but the details are too involved to describe in this answer. See my article on random number generation methods for details.
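The three steps above can be sketched in Python as follows. `random_bit` is only a stand-in for the hardware generator (here backed by Python's `random` module), and the function names are assumptions made for this example.

```python
import random

def random_bit():
    """Stand-in for the hardware generator; assumed to return 0 or 1 uniformly."""
    return random.getrandbits(1)

def uniform_minus4_to_4(bit_source=random_bit):
    """Draw 4 bits, reject values above 8, then shift the result into [-4, 4]."""
    while True:
        value = 0
        for _ in range(4):
            value = (value << 1) | bit_source()  # concatenate the bits, do not add them
        if value <= 8:
            return value - 4

print([uniform_minus4_to_4() for _ in range(10)])
```

The loop runs a variable number of times, which is the variable-time behaviour mentioned above.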
Based on the code you've shown me, your approach to building up bytes, ints, and longs is highly error-prone. For example, a better way to build up an 8-bit byte to achieve what you want is as follows (keeping in mind that I am not very familiar with Kotlin, so the syntax may be wrong):
var b = 0
for (i in 0 until 8) {
    b = b shl 1 // Shift old bits
    if (bitStringBuilder[i] == '1') {
        b = b or 1 // Set new bit
    }
    // A '0' character leaves the new low bit unset
}
value = b.toByte() as T
Also, if MediatorLiveData is not thread safe, then neither is your approach to gathering bits using a StringBuilder (especially because StringBuilder is not thread safe).
The approach you suggest, combining eight bits of the boolean generator to make one uniform integer, will work in theory. However, in practice there are several issues:
You don't mention what kind of hardware it is. In most cases, the hardware won't be likely to generate uniformly random Boolean bits unless the hardware is a so-called true random number generator designed for this purpose. For example, the hardware might generate uniformly distributed bits but have periodic behavior.
Entropy means how hard it is to predict the values a generator produces, compared to ideal random values. For example, a 64-bit data block with 32 bits of entropy is as hard to predict as an ideal random 32-bit data block. Characterizing a hardware device's entropy (or ability to produce unpredictable values) is far from trivial. Among other things, this involves entropy tests that have to be done across the full range of operating conditions suitable for the hardware (e.g., temperature, voltage).
Most hardware cannot produce uniform random values, so usually an additional step, called randomness extraction, entropy extraction, unbiasing, whitening, or deskewing, is done to transform the values the hardware generates into uniformly distributed random numbers. However, it works best if the hardware's entropy is characterized first (see previous point).
Finally, you still have to test whether the whole process delivers numbers that are "adequately random" for your purposes. There are several statistical tests that attempt to do so, such as NIST's Statistical Test Suite or TestU01.
For more information, see "Nondeterministic Sources and Seed Generation".
After your edits to this page, it seems you're going about the problem the wrong way. To produce a uniform random number, you don't add uniformly distributed random bits (e.g., bit() + bit() + bit()), but concatenate them (e.g., (bit() << 2) | (bit() << 1) | bit()). However, again, this will work in theory, but not in practice, for the reasons I mention above.
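The difference between adding and concatenating bits can be checked exhaustively for 4 bits. The following Python sketch enumerates all 16 equally likely bit sequences; it is an illustration only and does not model any particular hardware.

```python
from collections import Counter
from itertools import product

added = Counter()
concatenated = Counter()
for bits in product((0, 1), repeat=4):
    # Adding: many different bit sequences collapse onto the same sum.
    added[sum(bits)] += 1
    # Concatenating: every bit sequence maps to its own integer.
    value = 0
    for b in bits:
        value = (value << 1) | b
    concatenated[value] += 1

print(sorted(added.items()))         # [(0, 1), (1, 4), (2, 6), (3, 4), (4, 1)]
print(sorted(concatenated.items()))  # each of 0..15 appears exactly once
```

The sum of the bits lands on 2 in 6 of the 16 cases, while each concatenated value occurs exactly once.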

Efficiently implementing DXT1 texture decompression in hardware

DXT1 compression is designed to be fast to decompress in hardware, where it's used in texture samplers. The Wikipedia article says that under certain circumstances you can work out the coefficients of the interpolated colours as:
c2 = (2/3)*c0+(1/3)*c1
or rearranging that:
c2 = (1/3)*(2*c0+c1)
However you re-arrange the above equation, then you end up always having to multiply something by 1/3 (or dividing by 3, same deal even more expensive). And it seems weird to me that a texture format which is designed to be fast to decompress in hardware would require a multiplication or division. The FPGA I'm implementing my GPU on only has limited resources for multiplications and I want to save those for where they're really required.
So am I missing something? Is there an efficient way of avoiding the multiplications of the colour channels by a 1/3? Or should I just eat the cost of that multiplication?
This might be a bad way of imagining it, but could you implement it via the use of addition/subtraction of successive halves (shifts)?
As you have 16 bits this gives you the ability to get quite accurate with successive additions and subtractions.
A third could be represented as
a(n+1) = a(n) +/- A>>1, where, the list [0, 0, 1, 0, 1, etc] shows whether to add or subtract the shifted result.
I believe this is called fractional maths.
However, in FPGAs, it is difficult to know whether this is actually more power efficient than the native DSP blocks (e.g. DSP48E1) provided.
The best answer I can come up with is that I can use the identity:
x/3 = sum(n=1 to infinity) (x/2^(2n))
and then take the first n terms. Using 4 terms I get:
(x/4)+(x/16)+(x/64)+(x/256)
which equals
x*0.33203125
which is probably good enough.
This relies on multiplication by a fixed power of 2 being free in hardware, then 3 additions of which I can run 2 in parallel.
Any better answer is appreciated though.
** EDIT **: Using a combination of this and #dyslexicgruffalo's answer I made a simple C++ program which iterated over the various sequences, tried them all and recorded the average and max errors.
I did this for 0 <= x <= 189 (as 189 is the value of 2*c0.g + c1.g when g, which is 6 bits, maxes out).
The shortest good sequence (max error of 2, average error of 0.62, 4 ops) was:
1 + x/4 + x/16 + x/64.
The best sequence (max error of 1, average error of 0.32, but 6 ops) was:
x/2 - x/4 + x/8 - x/16 + x/32 - x/64.
For the 5 bit values (red and blue) the maximum value is 31*3. The above sequences are still good there, but not the best. The best ones are:
x/4 + x/8 - x/16 + x/32 [max error of 1, average 0.38]
and
1 + x/4 + x/16 [max error of 2, average of 0.68]
(And, luckily, none of the above sequences ever guesses an answer which is too big so no clamping is needed even though they're not perfect)
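A rough Python sketch of how the two 6-bit sequences above could be evaluated. It assumes each division is a truncating right shift and measures the error against the rounded value of x/3; the original program may have measured error differently, so the statistics it prints need not match the figures quoted above.

```python
def approx_short(x):
    """1 + x/4 + x/16 + x/64, with each division done as a right shift."""
    return 1 + (x >> 2) + (x >> 4) + (x >> 6)

def approx_long(x):
    """x/2 - x/4 + x/8 - x/16 + x/32 - x/64, again using only shifts."""
    return (x >> 1) - (x >> 2) + (x >> 3) - (x >> 4) + (x >> 5) - (x >> 6)

def error_stats(approx, x_max):
    """Return (max, mean) absolute error against round(x / 3) for 0 <= x <= x_max."""
    errors = [abs(approx(x) - round(x / 3)) for x in range(x_max + 1)]
    return max(errors), sum(errors) / len(errors)

# Four terms of the series x/4 + x/16 + x/64 + x/256 give this multiplier.
print(1 / 4 + 1 / 16 + 1 / 64 + 1 / 256)  # 0.33203125

for name, fn in (("short", approx_short), ("long", approx_long)):
    print(name, error_stats(fn, 189), max(fn(x) for x in range(190)))
```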

Extract transform and rotation matrices from homography?

I have 2 consecutive images from a camera and I want to estimate the change in camera pose:
I calculate the optical flow:
Const MAXFEATURES As Integer = 100
imgA = New Image(Of [Structure].Bgr, Byte)("pic1.bmp")
imgB = New Image(Of [Structure].Bgr, Byte)("pic2.bmp")
grayA = imgA.Convert(Of Gray, Byte)()
grayB = imgB.Convert(Of Gray, Byte)()
imagesize = cvGetSize(grayA)
pyrBufferA = New Emgu.CV.Image(Of Emgu.CV.Structure.Gray, Byte) _
(imagesize.Width + 8, imagesize.Height / 3)
pyrBufferB = New Emgu.CV.Image(Of Emgu.CV.Structure.Gray, Byte) _
(imagesize.Width + 8, imagesize.Height / 3)
features = MAXFEATURES
featuresA = grayA.GoodFeaturesToTrack(features, 0.01, 25, 3)
grayA.FindCornerSubPix(featuresA, New System.Drawing.Size(10, 10),
New System.Drawing.Size(-1, -1),
New Emgu.CV.Structure.MCvTermCriteria(20, 0.03))
features = featuresA(0).Length
Emgu.CV.OpticalFlow.PyrLK(grayA, grayB, pyrBufferA, pyrBufferB, _
featuresA(0), New Size(25, 25), 3, _
New Emgu.CV.Structure.MCvTermCriteria(20, 0.03D),
flags, featuresB(0), status, errors)
pointsA = New Matrix(Of Single)(features, 2)
pointsB = New Matrix(Of Single)(features, 2)
For i As Integer = 0 To features - 1
pointsA(i, 0) = featuresA(0)(i).X
pointsA(i, 1) = featuresA(0)(i).Y
pointsB(i, 0) = featuresB(0)(i).X
pointsB(i, 1) = featuresB(0)(i).Y
Next
Dim Homography As New Matrix(Of Double)(3, 3)
cvFindHomography(pointsA.Ptr, pointsB.Ptr, Homography, HOMOGRAPHY_METHOD.RANSAC, 1, 0)
and it looks right, the camera moved leftwards and upwards:
Now I want to find out how much the camera moved and rotated. If I declare my camera position and what it's looking at:
' Create camera location at origin and lookat (straight ahead, 1 in the Z axis)
Location = New Matrix(Of Double)(2, 3)
location(0, 0) = 0 ' X location
location(0, 1) = 0 ' Y location
location(0, 2) = 0 ' Z location
location(1, 0) = 0 ' X lookat
location(1, 1) = 0 ' Y lookat
location(1, 2) = 1 ' Z lookat
How do I calculate the new position and lookat?
If I'm doing this all wrong or if there's a better method, any suggestions would be very welcome, thanks!
For pure camera rotation R = A^-1 H A. To prove this, consider the image-to-plane homographies H1 = A and H2 = A R, where A is the camera intrinsic matrix. Then H12 = H2 * H1^-1 = A R A^-1, from which you can obtain R = A^-1 H12 A.
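As a sanity check of the pure-rotation relation, here is a small pure-Python sketch. The intrinsic values (focal length 500 px, principal point at 320, 240) and the 5 degree rotation are invented for illustration. Keep in mind that an estimated homography is only defined up to scale, so in practice it would need normalising before this step.

```python
import math

def matmul(a, b):
    """Multiply two 3x3 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]

# Hypothetical intrinsics: focal length 500 px, principal point (320, 240).
f, cx, cy = 500.0, 320.0, 240.0
A = [[f, 0.0, cx], [0.0, f, cy], [0.0, 0.0, 1.0]]
# Closed-form inverse of an intrinsic matrix with this (skew-free) shape.
A_inv = [[1 / f, 0.0, -cx / f], [0.0, 1 / f, -cy / f], [0.0, 0.0, 1.0]]

# Ground-truth rotation: 5 degrees about the camera's y axis.
t = math.radians(5.0)
R_true = [[math.cos(t), 0.0, math.sin(t)],
          [0.0, 1.0, 0.0],
          [-math.sin(t), 0.0, math.cos(t)]]

# Homography induced by a pure rotation, then the recovery R = A^-1 H A.
H = matmul(matmul(A, R_true), A_inv)
R = matmul(matmul(A_inv, H), A)

for row in R:
    print([round(v, 6) for v in row])
```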
Camera translation is harder to estimate. If the camera translates, you have to find a fundamental matrix first (not a homography): x'^T F x = 0 for corresponding points x and x', and then convert it into an essential matrix E = A^T F A. Then you can decompose E into rotation and translation, E = [t]x R, where [t]x means the vector (cross) product matrix of t. The decomposition is not obvious, see this.
The rotation you get will be exact while the translation vector can be found only up to scale. Intuitively this scaling means that from the two images alone you cannot really say whether the objects are close and small or far away and large. To disambiguate, we may use objects of a familiar size, a known distance between two points, etc.
Finally, note that the human visual system has a similar problem: though we "know" the distance between our eyes, when they are converged on the object the disparity is always zero, and from disparity alone we cannot say what the distance is. Human vision relies on triangulation from the eyes' vergence signal to figure out absolute distance.
Well, what you're looking at is, in simple terms, a Pythagorean theorem problem: a^2 + b^2 = c^2. However, when it comes to camera-based applications, things are not very easy to determine accurately. You have found half of the detail you need for "a"; however, finding "b" or "c" is much harder.
The Short Answer
Basically it can't be done with a single camera. But it can be done with two cameras.
The Long Winded Answer (Thought I'd explain in more depth, no pun intended)
I'll try to explain. Say we select two points within our image and move the camera left. We know the distance of each point from the camera: B1 is 20mm and B2 is 40mm. Now let's assume that we process the image and our measurements are A1 = (0,2) and A2 = (0,4), related to B1 and B2 respectively. Note that A1 and A2 are not real-world measurements; they are pixels of movement.
What we now have to do is multiply the change in A1 and A2 by a calculated constant, which will be the real-world distance per pixel at B1 and B2. NOTE: each one of these is different according to the measurement B*. This all relates to the angle of view, more commonly called the field of view in photography, at different distances. You can accurately calculate the constant if you know the size of each pixel on the camera CCD and the focal length of the lens you have inside the camera.
I would expect this isn't the case, so at different distances you have to place an object of known length and see how many pixels it takes up. Close up, you can use a ruler to make things easier. You take these measurements and fit a curve with a line of best fit, where the X-axis is the distance of the object and the Y-axis is the pixel-to-distance ratio that you must multiply your movement by.
So how do we apply this curve? Well, it's guesswork. In theory, the larger the measured movement A*, the closer the object is to the camera. In our example, say the ratios for A1 and A2 are 5mm and 3mm per pixel respectively; we would then know that point B1 has moved 10mm (2 x 5mm) and B2 has moved 12mm (4 x 3mm). But let's face it: we will never know B, and we will never be able to tell whether a movement of 20 pixels is an object close up not moving far or an object far away moving a much greater distance. This is why things like the Xbox Kinect use additional sensors to get depth information that can be tied to the objects within the image.
What you are attempting could be done with two cameras: as the distance between these cameras is known, the movement can be calculated more accurately (effectively without using a depth sensor). The maths behind this is extremely complex and I would suggest looking up some journal papers on the subject. If you would like me to explain the theory, I can attempt to.
All my experience comes from designing high speed video acquisition and image processing for my PhD so trust me, it can't be done with one camera, sorry. I hope some of this helps.
Cheers
Chris
[EDIT]
I was going to add a comment but this is easier due to the bulk of information:
Since it is the Kinect, I will assume you have some relevant depth information associated with each point; if not, you will need to figure out how to get this.
The equation you will need to start off with is for the Field of View (FOV):
o/d = i/f
Where:
f is equal to the focal length of the lens usually given in mm (i.e. 18 28 30 50 are standard examples)
d is the object distance from the lens gathered from kinect data
o is the object dimension (or "field of view" perpendicular to and bisected by the optical axis).
i is the image dimension (or "field stop" perpendicular to and bisected by the optical axis).
We need to calculate i first, since o is our unknown. For i (which is a diagonal measurement),
we will need the size of a pixel on the CCD. This will be in micrometres (µm); you will need to find this information out. For now we will take it as being 14um, which is standard for a midrange area scan camera.
So first we need to work out the horizontal dimension of i (ih), which is the width of the camera image in pixels multiplied by the size of a CCD pixel (we will use 640 x 320)
so: ih = 640*14um = 8960um
= 8960/1000 = 8.96mm
Now we need i vertical dimension (iv) same process but height
so: iv = (320 * 14um) / 1000 = 4.48mm
Now i is found by the Pythagorean theorem, a^2 + b^2 = c^2
so: i = sqrt(ih^2 + iv^2)
= 10.02 mm
Now we will assume we have a 28 mm lens. Again, this exact value will have to be found out. So our equation, rearranged to give us o, is:
o = (i * d) / f
Remember o will be diagonal (we will assume our object or point is 50mm away):
o = (10.02mm * 50mm) / 28mm
= 17.89mm
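The arithmetic up to this point can be reproduced with a few lines of Python. The pixel size, resolution, focal length and distance are the assumed example values from the text, not measured ones, and the variable names are chosen for this sketch.

```python
import math

pixel_size_um = 14.0           # assumed pixel pitch from the text
width_px, height_px = 640, 320
focal_length_mm = 28.0         # assumed lens focal length f
distance_mm = 50.0             # object distance d

ih = width_px * pixel_size_um / 1000.0   # sensor width in mm
iv = height_px * pixel_size_um / 1000.0  # sensor height in mm
i = math.hypot(ih, iv)                   # sensor diagonal in mm
o = i * distance_mm / focal_length_mm    # field-of-view diagonal at distance d

print(round(ih, 2), round(iv, 2), round(i, 2), round(o, 2))  # 8.96 4.48 10.02 17.89
```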
Now we need to work out o horizontal dimension (oh) and o vertical dimension (ov); dividing these by the pixel counts gives us the distance per pixel that the object has moved. As the FOV is directly proportional to the CCD size (i is directly proportional to o), we will work out a ratio k
k = i/o
= 10.02 / 17.89
= 0.56
so:
o horizontal dimension (oh):
oh = ih / k
= 8.96mm / 0.56 = 16mm across 640 pixels = 0.025mm per pixel
o vertical dimension (ov):
ov = iv / k
= 4.48mm / 0.56 = 8mm across 320 pixels = 0.025mm per pixel
Now we have the constants we require, let's use them in an example. If our object at 50mm moves from position (0,0) to (2,4) then the measurements in real life are:
(2*0.025mm , 4*0.025mm) = (0.05mm, 0.1mm)
Again, the Pythagorean theorem: a^2 + b^2 = c^2
Total distance = sqrt(0.05^2 + 0.1^2)
= 0.11mm
Complicated, I know, but once you have this in a program it's easier. For every point you will have to repeat at least half the process, as d will change, and therefore o, for every point you're examining.
Hope this gets you on your way,
Cheers
Chris

Shared Memory Bank Conflicts in CUDA: How memory is aligned to banks

As far as my understanding goes, shared memory is divided into banks and accesses by multiple threads to a single data element within the same bank will cause a conflict (or broadcast).
At the moment I allocate a fairly large array which conceptually represents several pairs of two matrices:
__shared__ float A[34*N]
Where N is the number of pairs and the first 16 floats of a pair are one matrix and the following 18 floats are the second.
The thing is, access to the first matrix is conflict free but access to the second one has conflicts. These conflicts are unavoidable, however, my thinking is that because the second matrix is 18 all future matrices will be misaligned to the banks and therefore more conflicts than necessary will occur.
Is this true, if so how can I avoid it?
Every time I allocate shared memory, does it start at a new bank? So potentially could I do
__shared__ float Apair1[34];
__shared__ float Apair2[34];
...
Any ideas?
Thanks
If your pairs of matrices are stored contiguously, and if you are accessing the elements linearly by thread index, then you will not have shared memory bank conflicts.
In other words if you have:
A[0] <- mat1 element1
A[1] <- mat1 element2
A[2] <- mat1 element3
A[15] <- mat1 element16
A[16] <- mat2 element1
A[17] <- mat2 element2
A[33] <- mat2 element18
And you access this using:
float element;
element = A[pairindex * 34 + matindex * 16 + threadIdx.x];
Then adjacent threads are accessing adjacent elements in the matrix and you do not have conflicts.
In response to your comments (below) it does seem that you are mistaken in your understanding. It is true that there are 16 banks (in current generations, 32 in the next generation, Fermi) but consecutive 32-bit words reside in consecutive banks, i.e. the address space is interleaved across the banks. This means that provided you always have an array index that can be decomposed to x + threadIdx.x (where x is not dependent on threadIdx.x, or at least is constant across groups of 16 threads) you will not have bank conflicts.
When you access the matrices further along the array, you still access them in a contiguous chunk and hence you will not have bank conflicts. It is only when you start accessing non-adjacent elements that you will have bank conflicts.
The reduction sample in the SDK illustrates bank conflicts very well by building from a naive implementation to an optimised implementation, possibly worth taking a look.
Banks are set up such that each successive 32 bits are in the next bank. So, if you declare an array of 4 byte floats, each subsequent float in the array will be in the next bank (modulo 16 or 32, depending on your architecture). I'll assume you're on compute capability 1.x, so you have 16 banks.
If you have arrays of 18 and 16, things can be funny. You can avoid bank conflicts in the 16x16 array by declaring it like
__shared__ float sixteen[16][16+1]
which avoids bank conflicts when accessing transpose elements using threadIdx.x (as I assume you're doing if you're getting conflicts). When accessing elements in, say, the first column of an unpadded 16x16 matrix, they'll all reside in the 1st bank. What you want to do is have each of these in a successive bank. Padding does this for you. You treat the array exactly as you would before, as sixteen[row][column], or similarly for a flattened matrix, as sixteen[row*(16+1)+column], if you want.
For the 18x18 case, when accessing in the transpose, you're moving at an even stride. The answer again is to pad by 1.
__shared__ float eighteens[18][18+1]
So now, when you access in the transpose (say accessing elements in the first column), it will access as (18+1)%16 = 3, and you'll access banks 3, 6, 9, 12, 15, 2, 5, 8 etc, so you should get no conflicts.
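The effect of padding on the bank pattern can be checked with a short host-side calculation. This Python sketch only models the word-index-modulo-16 rule stated above (4-byte words, 16 banks, array starting at bank 0); it is not CUDA code, and `column_banks` is a name made up for the example.

```python
NUM_BANKS = 16  # compute capability 1.x, as assumed in the answer

def column_banks(row_stride, rows, column=0):
    """Bank hit by each thread when thread k reads element [k][column] of a
    row-major float array whose rows are row_stride words long."""
    return [(k * row_stride + column) % NUM_BANKS for k in range(rows)]

print(column_banks(16, 16))  # unpadded 16x16: every access lands in bank 0
print(column_banks(17, 16))  # padded [16][17]: banks 0..15, no conflicts
print(column_banks(18, 16))  # unpadded 18x18: even stride, banks repeat
print(column_banks(19, 16))  # padded [18][19]: stride 3 mod 16, no conflicts
```

A non-zero starting offset for the array would shift every bank number by the same amount, so it does not change whether two threads collide.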
The particular alignment shift due to having a matrix of size 18 isn't the problem, because the starting point of the array makes no difference, it's only the order in which you access it. If you want to flatten the arrays I've proposed above, and merge them into 1, that's fine, as long as you access them in a similar fashion.