Shared Memory Bank Conflicts in CUDA: How memory is aligned to banks - optimization

As far as my understanding goes, shared memory is divided into banks and accesses by multiple threads to a single data element within the same bank will cause a conflict (or broadcast).
At the moment I allocate a fairly large array which conceptually represents several pairs of two matrices:
__shared__ float A[34*N];
Where N is the number of pairs and the first 16 floats of a pair are one matrix and the following 18 floats are the second.
The thing is, access to the first matrix is conflict-free but access to the second one has conflicts. These conflicts are unavoidable; however, my thinking is that because the second matrix is 18 floats long, all subsequent matrices will be misaligned relative to the banks and therefore more conflicts than necessary will occur.
Is this true, and if so, how can I avoid it?
Every time I allocate shared memory, does it start at a new bank? So potentially could I do
__shared__ float Apair1[34];
__shared__ float Apair2[34];
...
Any ideas?
Thanks

If your pairs of matrices are stored contiguously, and if you are accessing the elements linearly by thread index, then you will not have shared memory bank conflicts.
In other words if you have:
A[0]  <- mat1 element1
A[1]  <- mat1 element2
A[2]  <- mat1 element3
...
A[15] <- mat1 element16
A[16] <- mat2 element1
A[17] <- mat2 element2
...
A[33] <- mat2 element18
And you access this using:
float element;
element = A[pairindex * 34 + matindex * 16 + threadIdx.x];
Then adjacent threads are accessing adjacent elements in the matrix and you do not have conflicts.
In response to your comments (below) it does seem that you are mistaken in your understanding. It is true that there are 16 banks (in current generations, 32 in the next generation, Fermi) but consecutive 32-bit words reside in consecutive banks, i.e. the address space is interleaved across the banks. This means that provided you always have an array index that can be decomposed to x + threadIdx.x (where x is not dependent on threadIdx.x, or at least is constant across groups of 16 threads) you will not have bank conflicts.
When you access the matrices further along the array, you still access them in a contiguous chunk and hence you will not have bank conflicts. It is only when you start accessing non-adjacent elements that you will have bank conflicts.
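To make the banking concrete, here is a small Python model (not CUDA; it just assumes the 16 banks and 4-byte words of compute capability 1.x) showing that the indexing above is conflict-free for any base offset:

```python
NUM_BANKS = 16  # compute capability 1.x; 32 on Fermi

def bank(index):
    # A 4-byte float at a given element index lives in bank index mod 16.
    return index % NUM_BANKS

# For any fixed pairindex and matindex, the 16 threads of a half-warp
# reading A[pairindex*34 + matindex*16 + threadIdx.x] hit 16 distinct
# banks, so the access is conflict-free regardless of alignment.
for pairindex in range(4):
    for matindex in range(2):
        base = pairindex * 34 + matindex * 16
        banks = {bank(base + tid) for tid in range(16)}
        assert len(banks) == NUM_BANKS  # one bank per thread
```

The base offset drops out modulo 16: adding a constant to every thread's index permutes the banks but never makes two threads collide.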
The reduction sample in the SDK illustrates bank conflicts very well by building from a naive implementation to an optimised implementation, possibly worth taking a look.

Banks are set up such that each successive 32 bits are in the next bank. So, if you declare an array of 4-byte floats, each subsequent float in the array will be in the next bank (modulo 16 or 32, depending on your architecture). I'll assume you're on compute capability 1.x, so you have 16 banks.
If you have arrays of 18x18 and 16x16, things can be funny. You can avoid bank conflicts in the 16x16 array by declaring it like
__shared__ float sixteen[16][16+1];
which avoids bank conflicts when accessing transpose elements using threadIdx.x (as I assume you're doing if you're getting conflicts). When accessing elements in, say, the first column of a 16x16 matrix, they'll all reside in the 1st bank. What you want is to have each of these in a successive bank. Padding does this for you. You treat the array exactly as you would before, as sixteen[row][column], or similarly for a flattened matrix, as sixteen[row*(16+1)+column], if you want.
For the 18x18 case, when accessing in the transpose, you're moving at an even stride. The answer again is to pad by 1.
__shared__ float eighteens[18][18+1];
So now, when you access in the transpose (say, accessing elements in the first column), you'll move through the banks with a stride of (18+1) % 16 = 3, hitting banks 0, 3, 6, 9, 12, 15, 2, 5, and so on, so you should get no conflicts.
The particular alignment shift due to having a matrix of size 18 isn't the problem, because the starting point of the array makes no difference, it's only the order in which you access it. If you want to flatten the arrays I've proposed above, and merge them into 1, that's fine, as long as you access them in a similar fashion.
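The effect of the padding can be checked with a small Python model (again assuming the 16 banks of compute capability 1.x):

```python
NUM_BANKS = 16  # compute capability 1.x

def column_banks(stride):
    # Bank hit by each of 16 threads reading column 0, one row per
    # thread, from a row-major float array with the given row stride.
    return [(row * stride) % NUM_BANKS for row in range(16)]

# Unpadded 18-wide rows: stride 18 ≡ 2 (mod 16), so only 8 distinct
# banks are touched -- every access is a two-way conflict.
assert len(set(column_banks(18))) == 8

# Padded rows (18+1): stride 19 ≡ 3 (mod 16), and 3 is coprime to 16,
# so all 16 threads hit distinct banks -- no conflicts.
assert len(set(column_banks(19))) == 16
```

Any padded stride that is coprime to the bank count works; +1 is simply the cheapest choice.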


Is ZSKIPLIST_MAXLEVEL(64) enough for 2^64 elements?

I have learned about skip lists from Skip Lists: A Probabilistic Alternative to Balanced Trees recently. I think 64 levels with p=0.25 should be enough for 4^64 elements ("If p = 1/2, using MaxLevel = 16 is appropriate for data structures containing up to 2^16 elements." -- quote from the paper). But when I check the Redis server.h, I see
#define ZSKIPLIST_MAXLEVEL 64 /* Should be enough for 2^64 elements */
#define ZSKIPLIST_P 0.25 /* Skiplist P = 1/4 */
Am I wrong, or is the config wrong?
You are right. It is fair to say that server.h defines a MAXLEVEL higher than necessary.
The probability that a particular element reaches a level at least k is p^k, and the probability that any of the n elements reaches a level at least k is n·p^k.
Redis uses the skip list for sorted sets, and it uses it in conjunction with a hash map. The hash map (dict) has a capacity of 2^64-1 elements (ready for that many, though never reached in practice). Hence the comment on ZSKIPLIST_MAXLEVEL.
It follows that if we use N = 2^64 as an upper bound on the number of elements in the skip list, Redis could have used ZSKIPLIST_MAXLEVEL = 32.
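A quick back-of-the-envelope check of that bound, sketched in Python:

```python
import math

P = 0.25   # ZSKIPLIST_P
N = 2**64  # upper bound on the number of elements

# The probability that an element reaches level >= k is P**k, so the
# expected number of elements at level >= k is N * P**k.  That drops
# below 1 at k = log_{1/P}(N) = log_4(2^64) = 32.
level_needed = math.log(N) / math.log(1 / P)
assert abs(level_needed - 32.0) < 1e-9

# Beyond level 32, even 2**64 elements expect fewer than one node:
assert N * P**33 < 1   # 2^64 / 4^33 = 0.25
```

This matches the conclusion that ZSKIPLIST_MAXLEVEL = 32 would have sufficed.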
The nodes allocate memory up to the level they actually got when rolling the dice. Only the first (header) node uses the full ZSKIPLIST_MAXLEVEL. The impact: 512 bytes of extra memory per sorted set (of 129 entries or more; below that, Redis uses a ziplist -- see the zset-max-ziplist-entries config setting).
The other side effect is the chance that a node gets an unnecessarily high level, but the odds get really low really fast as we go beyond the 32nd level.
I think it is worth fixing, for those 512 bytes.

How to Make a Uniform Random Integer Generator from a Random Boolean Generator?

I have a hardware-based boolean generator that generates either 1 or 0 uniformly. How to use it to make a uniform 8-bit integer generator? I'm currently using the collected booleans to create the binary string for the 8-bit integer. The generated integers aren't uniformly distributed. It follows the distribution explained on this page. Integers with the same number of 1's and 0's such as 85 (01010101) and -86 (10101010) have the highest chance to be generated and integers with a lot of repeating bits such as 0 (00000000) and -1 (11111111) have the lowest chance.
Here's the page that I've annotated with probabilities for each possible 4-bit integer. We can see that they're not uniform. 3, 5, 6, -7, -6, and -4, which have the same number of 1's and 0's, have a 6/16 probability, while 0 and -1, whose bits are all the same, have only a 1/16 probability.
And here's my implementation in Kotlin:
Based on your edit, there appears to be a misunderstanding here. By "uniform 4-bit integers", you seem to have the following in mind:
Start at 0.
Generate a random bit. If it's 1, add 1, and otherwise subtract 1.
Repeat step 2 three more times.
Output the resulting number.
Although the random bit generator may generate bits where each outcome is as likely as the other to be randomly generated, and each 4-bit chunk may be just as likely as any other to be randomly generated, the number of one-bits in each chunk is not uniformly distributed, so neither is the result of the walk.
What range of integers do you want? Say you're generating 4-bit integers. Do you want a range of [-4, 4], as in the 4-bit random walk in your question, or do you want a range of [-8, 7], which is what you get when you treat a 4-bit chunk of bits as a two's complement integer?
If the former, the random walk won't generate a uniform distribution, and you will need to tackle the problem in a different way.
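Enumerating all sixteen 4-bit sequences in Python confirms that the walk produces a binomial distribution over the even values in [-4, 4], not a uniform one:

```python
from itertools import product
from collections import Counter

# Apply the described random walk (+1 for a 1 bit, -1 for a 0 bit)
# to every equally likely 4-bit sequence.
counts = Counter(sum(+1 if b else -1 for b in bits)
                 for bits in product([0, 1], repeat=4))

# The walk can only land on even values, with binomial weights 1,4,6,4,1:
assert counts == {-4: 1, -2: 4, 0: 6, 2: 4, 4: 1}
```

Zero is six times as likely as -4 or 4, and the odd values in [-4, 4] can never occur at all.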
In this case, to generate a uniform random number in the range [-4, 4], do the following:
Take 4 bits from the random bit generator and treat them as an integer in [0, 15];
If the integer is greater than 8, go to step 1.
Subtract 4 from the integer and output it.
This algorithm uses rejection sampling, but is variable-time (thus is not appropriate whenever timing differences can be exploited in a security attack). Numbers in other ranges are similarly generated, but the details are too involved to describe in this answer. See my article on random number generation methods for details.
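Here's a minimal Python sketch of that rejection-sampling loop; the `randbit` parameter stands in for your hardware bit generator (simulated here with `random.getrandbits`):

```python
import random

def uniform_in_range(randbit, lo=-4, hi=4):
    span = hi - lo + 1  # 9 accepted values
    while True:
        # Take 4 bits and treat them as an integer in [0, 15].
        x = (randbit() << 3) | (randbit() << 2) | (randbit() << 1) | randbit()
        # Reject and retry unless it lands in [0, span - 1] = [0, 8].
        if x < span:
            return x + lo

random.seed(0)
counts = {v: 0 for v in range(-4, 5)}
for _ in range(9000):
    counts[uniform_in_range(lambda: random.getrandbits(1))] += 1

# All nine values in [-4, 4] occur, with roughly equal frequency.
assert set(counts) == set(range(-4, 5))
assert all(c > 0 for c in counts.values())
```

Each iteration accepts with probability 9/16, so the expected number of 4-bit draws per output is 16/9.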
Based on the code you've shown me, your approach to building up bytes, ints, and longs is highly error-prone. For example, a better way to build up an 8-bit byte to achieve what you want is as follows (keeping in mind that I am not very familiar with Kotlin, so the syntax may be wrong):
var b = 0
for (i in 0 until 8) {
    b = b shl 1 // Shift old bits
    if (bitStringBuilder[i] == '1') {
        b = b or 1 // Set new bit
    }
}
value = b.toByte() as T
Also, if MediatorLiveData is not thread safe, then neither is your approach to gathering bits using a StringBuilder (especially because StringBuilder is not thread safe).
The approach you suggest, combining eight bits of the boolean generator to make one uniform integer, will work in theory. However, in practice there are several issues:
You don't mention what kind of hardware it is. In most cases, the hardware won't be likely to generate uniformly random Boolean bits unless the hardware is a so-called true random number generator designed for this purpose. For example, the hardware might generate bits that are uniformly distributed overall but follow a periodic pattern.
Entropy means how hard it is to predict the values a generator produces, compared to ideal random values. For example, a 64-bit data block with 32 bits of entropy is as hard to predict as an ideal random 32-bit data block. Characterizing a hardware device's entropy (or ability to produce unpredictable values) is far from trivial. Among other things, this involves entropy tests that have to be done across the full range of operating conditions suitable for the hardware (e.g., temperature, voltage).
Most hardware cannot produce uniform random values, so usually an additional step, called randomness extraction, entropy extraction, unbiasing, whitening, or deskewing, is done to transform the values the hardware generates into uniformly distributed random numbers. However, it works best if the hardware's entropy is characterized first (see previous point).
Finally, you still have to test whether the whole process delivers numbers that are "adequately random" for your purposes. There are several statistical tests that attempt to do so, such as NIST's Statistical Test Suite or TestU01.
For more information, see "Nondeterministic Sources and Seed Generation".
After your edits to this page, it seems you're going about the problem the wrong way. To produce a uniform random number, you don't add uniformly distributed random bits (e.g., bit() + bit() + bit()), but concatenate them (e.g., (bit() << 2) | (bit() << 1) | bit()). However, again, this will work in theory, but not in practice, for the reasons I mention above.
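A short Python enumeration makes the contrast explicit (3 bits for brevity):

```python
from itertools import product
from collections import Counter

def concat(bits):
    # Build the number by concatenation: (b0 << 2) | (b1 << 1) | b2.
    v = 0
    for b in bits:
        v = (v << 1) | b
    return v

streams = list(product([0, 1], repeat=3))            # all 8 bit sequences
added  = Counter(sum(bits)    for bits in streams)   # adding the bits
joined = Counter(concat(bits) for bits in streams)   # concatenating them

assert added  == {0: 1, 1: 3, 2: 3, 3: 1}   # binomial: not uniform
assert joined == {v: 1 for v in range(8)}   # each of 0..7 exactly once
```

Concatenation is a bijection from bit sequences to integers, which is exactly why it preserves uniformity and addition does not.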

Mutable array, Objective-c, or variable array, c. Any difference in performance?

I have a multidimensional array (3D matrix) of unknown size, where each element in this matrix is of type short int.
The size of the matrix can be approximated to be around 10 x 10 x 1,000,000.
As I see it I have two options: Mutable Array (Objective-c) or Variable Array (c).
Is there any difference in reading and writing to these arrays?
How large will these files become when I save to file?
Any advice would be gratefully accepted.
Provided you know the size of the array at the point of creation, i.e. you don't need to dynamically change the bounds, then a C array of short int with these dimensions will win easily - for reasons such as no encoding of values as objects and direct indexing.
If you write the array in binary to a file then it will just be the number of elements multiplied by sizeof(short int), without any overhead. If you also need to store the dimensions, that adds 3 * sizeof(int), i.e. 12 or 24 bytes depending on the platform.
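For the sizes in the question, the arithmetic works out like this (assuming the common 2-byte short int and 4-byte int):

```python
sizeof_short = 2                    # assumption: short int is 2 bytes
elements = 10 * 10 * 1_000_000
payload = elements * sizeof_short   # raw binary data
header = 3 * 4                      # three 4-byte ints for the dimensions

assert payload == 200_000_000       # about 191 MiB
assert payload + header == 200_000_012
```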
The mutable array will be slower (albeit not by much) since it's built on a C array. How large will the file be when you save this array?
It will take more than 10x10x1,000,000 bytes because you'll have to encode it in a way that lets you recall the matrix structure. This part is really up to you. For a 3D array, you'll have to use a special character/format to denote the third dimension. It depends on how you want to do this, but as text it will take 1 byte for every digit of every number, plus 1 character for the space between elements in the same row, plus 1 newline per row of each 2D slice, plus 1 more separator character between slices.
It might be easier to stick each row into its own file, and then stick the columns below it like you normally would. Then, in a new file, I would start putting the 3D elements such that each line lines up with the column number of the 2nd dimension. That's just me though; it's up to you.

What's the best way to store a bi-dimensional sparse array (2d sparse matrix) ? How much size will it have in VoltDB?

Question one: Are there specialized databases to store dense and sparse matrices ? I googled but didn't find any...
The matrix in question is huge (10^5 by 10^5) but it's sparse, which means that most of its values are zeros and I only need to store the non-zero values. So I thought of making a table like this:
2D Matrix
---------------
X Y val
---------------
1 2 4.2
5 1 91.0
9 3 139.1
And so on. 3 columns, two for the coordinates, the third for the value of that cell in the sparse matrix. Question 2: Is this the best way to store a sparse matrix? I also considered MongoDB but it seems that making one document per cell of the matrix would be too much overhead. Table-oriented databases are slow but I can use VoltDB :) Side-note: I thought of a Redis Hash but can't make it bi-dimensional (found a way to serialize 2D matrices and make them 1D, that way I can store them in a Redis Hash or even a List)
Question 3: How many bytes per line will VoltDB use ? The coordinates will be integers ranging from 0 to 10^5 maybe more, the values of the cell will be floats.
Regarding Question 3, based on your example, the X and Y columns could be the INTEGER datatype in VoltDB, which is 4 bytes. The value column could be a FLOAT datatype, which is 8 bytes.
Each record would therefore be 16 bytes, so the nominal size in memory would be 16 bytes * row count. In general, you add 30% for overhead, and then 1GB per server for heap size to determine the overall memory needed. See the references below for more detail.
You will probably want to index this table, so assuming you wanted a compound index of (x,y), the size would be as follows:
Tree index:
(sum-of-column-sizes + 8 + 32) * rowcount
Hash index:
(((2 * rowcount) + 1) * 8) + ((sum-of-column-sizes + 32) * rowcount)
sum-of-column-sizes for (x,y) is 8 bytes.
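Plugging a hypothetical row count of one million into those formulas, as a Python sketch:

```python
def tree_index_bytes(sum_of_column_sizes, rowcount):
    # Tree index: (sum-of-column-sizes + 8 + 32) * rowcount
    return (sum_of_column_sizes + 8 + 32) * rowcount

def hash_index_bytes(sum_of_column_sizes, rowcount):
    # Hash index: (((2 * rowcount) + 1) * 8) + ((sum-of-column-sizes + 32) * rowcount)
    return (((2 * rowcount) + 1) * 8) + ((sum_of_column_sizes + 32) * rowcount)

rows = 1_000_000   # hypothetical row count
cols = 8           # sum-of-column-sizes for two 4-byte INTEGERs (x, y)

assert tree_index_bytes(cols, rows) == 48_000_000   # 48 MB
assert hash_index_bytes(cols, rows) == 56_000_008   # ~56 MB
```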
References:
The available datatypes are listed in Appendix A of Using VoltDB:
http://community.voltdb.com/docs/UsingVoltDB/ddlref_createtable#TabDatatypes
Guidelines and formulas for estimating the memory size are in the VoltDB Planning Guide:
http://community.voltdb.com/docs/PlanningGuide/ChapMemoryRecs
The two most relevant questions are 1) how sparse? and 2) how do you want to use the data?
First, I'm assuming you want to read/write/process the data from within the database. If not, then there are many sparse matrix encodings that could even be packed into a blob and optionally compressed.
Assuming your data is fairly sparse, and assuming you want to use the data within the database, the x,y,value tuple storage is probably best. Chapter 4 of the VoltDB Planning Guide covers estimating memory usage and sizing your hardware.
http://community.voltdb.com/docs/PlanningGuide/ChapMemoryRecs
The short answer is that tables with numeric data are packed pretty tight. You've got 12 bytes of real data per row (short, short, double). You should see an average of just over 1 byte beyond that in overhead per row. You'll also need to add in the size of any indexes. The documentation describes the worst case. I think for an index on two short integers such as the X and Y columns, the storage per key will be close to 28 bytes, including overhead.

How to make a start on the "crackless wall" problem

Here's the problem statement:
Consider the problem of building a wall out of 2x1 and 3x1 bricks (horizontal×vertical dimensions) such that, for extra strength, the gaps between horizontally-adjacent bricks never line up in consecutive layers, i.e. never form a "running crack".
There are eight ways of forming a crack-free 9x3 wall, written W(9,3) = 8.
Calculate W(32,10). < Generalize it to W(x,y) >
http://www.careercup.com/question?id=67814&form=comments
The above link gives a few solutions, but I'm unable to understand the logic behind them. I'm trying to code this in Perl; this is what I have so far:
input : W(x,y)
find all possible i's and j's such that x == 3*i + 2*j
for each pair (i,j),
    find n = (i+j)C(j)  # C: combinations
Adding all these n's should give the count of all possible rows. But I have no idea how to generate the actual combinations for one row, or how to proceed further.
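For what it's worth, the n-summing step above can be checked in Python:

```python
from math import comb

def count_rows(x):
    # Sum of (i+j choose j) over all i, j >= 0 with 3*i + 2*j == x:
    # the number of distinct single rows of 2x1 and 3x1 bricks of width x.
    total = 0
    for i in range(x // 3 + 1):
        rem = x - 3 * i
        if rem % 2 == 0:
            total += comb(i + rem // 2, rem // 2)
    return total

assert count_rows(9) == 5     # five distinct rows of width 9
assert count_rows(32) == 3329
```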
Based on the claim that W(9,3)=8, I'm inferring that a "running crack" means any continuous vertical crack of height two or more. Before addressing the two-dimensional problem as posed, I want to discuss an analogous one-dimensional problem and its solution. I hope this will make it more clear how the two-dimensional problem is thought of as one-dimensional and eventually solved.
Suppose you want to count the number of lists of length, say, 40, whose symbols come from a reasonably small set of, say, the five symbols {a,b,c,d,e}. Certainly there are 5^40 such lists. If we add an additional constraint that no letter can appear twice in a row, the mathematical solution is still easy: There are 5*4^39 lists without repeated characters. If, however, we instead wish to outlaw consonant combinations such as bc, cb, bd, etc., then things are more difficult. Of course we would like to count the number of ways to choose the first character, the second, etc., and multiply, but the number of ways to choose the second character depends on the choice of the first, and so on. This new problem is difficult enough to illustrate the right technique. (though not difficult enough to make it completely resistant to mathematical methods!)
To solve the problem of lists of length 40 without consonant combinations (let's call this f(40)), we might imagine using recursion. Can you calculate f(40) in terms of f(39)? No, because some of the lists of length 39 end with consonants and some end with vowels, and we don't know how many of each type we have. So instead of computing, for each length n<=40, f(n), we compute, for each n and for each character k, f(n,k), the number of lists of length n ending with k. Although f(40) cannot be calculated from f(39) alone, f(40,a) can be calculated in terms of f(39,a), f(39,b), etc.
The strategy described above can be used to solve your two-dimensional problem. Instead of characters, you have entire horizontal brick-rows of length 32 (or x). Instead of 40, you have 10 (or y). Instead of a no-consonant-combinations constraint, you have the no-adjacent-cracks constraint.
You specifically ask how to enumerate all the brick-rows of a given length, and you're right that this is necessary, at least for this approach. First, decide how a row will be represented. Clearly it suffices to specify the locations of the 3-bricks, and since each has a well-defined center, it seems natural to give a list of locations of the centers of the 3-bricks. For example, with a wall length of 15, the sequence (1,8,11) would describe a row like this: (ooo|oo|oo|ooo|ooo|oo). This list must satisfy some natural constraints:
The initial and final positions cannot be the centers of a 3-brick. Above, 0 and 14 are invalid entries.
Consecutive differences between numbers in the sequence must be odd, and at least three.
The position of the first entry must be odd.
The difference between the last entry and the final position (the wall length minus one) must also be odd.
There are various ways to compute and store all such lists, but the conceptually easiest is a recursion on the length of the wall, ignoring condition 4 until you're done. Generate a table of all lists for walls of length 2, 3, and 4 manually, then for each n, deduce a table of all lists describing walls of length n from the previous values. Impose condition 4 when you're finished, because it doesn't play nice with recursion.
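A small Python checker for those constraints (reading constraint 4 as an odd distance from the last centre to the final cell, consistent with the length-15 example):

```python
def valid_row(centers, wall_len):
    # Check the constraints above for a list of 3-brick centre positions
    # (cells numbered 0 .. wall_len-1, as in the length-15 example).
    if not centers:
        return wall_len % 2 == 0          # all 2x1 bricks: width must be even
    if centers[0] % 2 == 0:               # first centre must be odd
        return False
    for a, b in zip(centers, centers[1:]):
        d = b - a
        if d < 3 or d % 2 == 0:           # gaps: odd and at least three
            return False
    return (wall_len - 1 - centers[-1]) % 2 == 1  # odd distance to final cell

assert valid_row([1, 8, 11], 15)          # the (ooo|oo|oo|ooo|ooo|oo) example
assert not valid_row([0, 8, 11], 15)      # 0 cannot be a centre
assert not valid_row([1, 9, 12], 15)      # even gap between centres
```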
You'll also need a way, given any brick-row S, to quickly describe all brick-rows S' which can legally lie beneath it. For simplicity, let's assume the length of the wall is 32. A little thought should convince you that
S' must satisfy the same constraints as S, above.
1 is in S' if and only if 1 is not in S.
30 is in S' if and only if 30 is not in S.
For each entry q in S, S' must have a corresponding entry q+1 or q-1, and conversely every element of S' must be q-1 or q+1 for some element q in S.
For example, the list (1,8,11) can legally be placed on top of (7,10,30), (7,12,30), or (9,12,30), but not (9,10,30) since this doesn't satisfy the "at least three" condition. Based on this description, it's not hard to write a loop which calculates the possible successors of a given row.
Now we put everything together:
First, for fixed x, make a table of all legal rows of length x. Next, write a function W(y,S), which is to calculate (recursively) the number of walls of width x, height y, and top row S. For y=1, W(y,S)=1. Otherwise, W(y,S) is the sum over all S' which can be related to S as above, of the values W(y-1,S').
This solution is efficient enough to solve the problem W(32,10), but would fail for large x. For example, W(100,10) would almost certainly be infeasible to calculate as I've described. If x were large but y were small, we would break all sensible brick-laying conventions and consider the wall as being built up from left-to-right instead of bottom-to-top. This would require a description of a valid column of the wall. For example, a column description could be a list whose length is the height of the wall and whose entries are among five symbols, representing "first square of a 2x1 brick", "second square of a 2x1 brick", "first square of a 3x1 brick", etc. Of course there would be constraints on each column description and constraints describing the relationship between consecutive columns, but the same approach as above would work this way as well, and would be more appropriate for long, short walls.
I found this python code online here and it works fast and correctly. I do not understand how it all works though. I could get my C++ to the last step (count the total number of solutions) and could not get it to work correctly.
def brickwall(w,h):
    # generate single brick layer of width w (by recursion)
    def gen_layers(w):
        if w in (0,1,2,3):
            return {0:[], 1:[], 2:[[2]], 3:[[3]]}[w]
        return [(layer + [2]) for layer in gen_layers(w-2)] + \
               [(layer + [3]) for layer in gen_layers(w-3)]

    # precompute info about whether pairs of layers are compatible
    def gen_conflict_mat(layers, nlayers, w):
        # precompute internal brick positions for easy comparison
        def get_internal_positions(layer, w):
            acc = 0; intpos = set()
            for brick in layer:
                acc += brick; intpos.add(acc)
            intpos.remove(w)
            return intpos
        intpos = [get_internal_positions(layer, w) for layer in layers]
        mat = []
        for i in range(nlayers):
            mat.append([j for j in range(nlayers) \
                        if intpos[i].isdisjoint(intpos[j])])
        return mat

    layers = gen_layers(w)
    nlayers = len(layers)
    mat = gen_conflict_mat(layers, nlayers, w)

    # dynamic programming to recursively compute wall counts
    nwalls = nlayers*[1]
    for i in range(1,h):
        nwalls = [sum(nwalls[k] for k in mat[j]) for j in range(nlayers)]
    return sum(nwalls)

print(brickwall(9,3))    #8
print(brickwall(9,4))    #10
print(brickwall(18,5))   #7958
print(brickwall(32,10))  #806844323190414