How to parallelize collision constraints in position-based dynamics (GPU)

I'm trying to implement the PBD algorithm in OpenCL. In the original paper the algorithm is stated as
(1)  forall vertices i
(2)      initialize x[i]=x0[i], v[i]=v0[i], w[i]=1/m[i]
(3)  endfor
(4)  loop
(5)      forall vertices i do v[i] ← v[i] + ∆t*w[i]*f_ext(x[i])
(6)      dampVelocities(v1,...,vN)
(7)      forall vertices i do p[i] ← x[i] + ∆t*v[i]
(8)      forall vertices i do generateCollisionConstraints(x[i] → p[i])
(9)      loop solverIterations times
(10)         projectConstraints(C1,...,C(M+Mcoll), p1,...,pN)
(11)     endloop
(12)     forall vertices i
(13)         v[i] ← (p[i] − x[i])/∆t
(14)         x[i] ← p[i]
(15)     endfor
(16)     velocityUpdate(v1,...,vN)
(17) endloop
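To make the structure of steps (5)-(14) concrete, here is a minimal Python sketch of one time step for 1D point masses, with simple distance constraints standing in for the collision constraints (dampVelocities and velocityUpdate omitted; all names are illustrative, not from the paper):

```python
def pbd_step(x, v, w, f_ext, constraints, dt, iters):
    """One PBD time step for 1D point masses (steps 5-14 of the pseudocode).

    x, v, w     : positions, velocities, inverse masses
    constraints : distance constraints as (i, j, rest_length) tuples
    """
    # (5) integrate external forces, (7) predict positions
    v = [vi + dt * wi * f_ext(xi) for xi, vi, wi in zip(x, v, w)]
    p = [xi + dt * vi for xi, vi in zip(x, v)]
    # (9)-(11) Gauss-Seidel projection of the constraints
    for _ in range(iters):
        for i, j, rest in constraints:
            d = p[j] - p[i]
            if d == 0 or w[i] + w[j] == 0:
                continue
            c = abs(d) - rest                        # constraint violation
            corr = c / (w[i] + w[j]) * (1 if d > 0 else -1)
            p[i] += w[i] * corr                      # move i toward/away from j
            p[j] -= w[j] * corr
    # (13)-(14) derive velocities from positions, then commit positions
    v = [(pi - xi) / dt for pi, xi in zip(p, x)]
    return p, v
```

With two unit-mass particles and a single rest-length-1 constraint, a few solver iterations pull them to distance 1 while preserving the center of mass.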
Step 8 is the most difficult to parallelize efficiently. What I'm currently doing is using a regular 3D grid (represented as a single buffer) into which I first insert all the particles. Next I iterate over neighboring cells and check for collisions. But what do I do when a collision is actually detected? I cannot hold a variable-size vector, nor can I "append" constraints to a buffer, because that would require atomic operations, which serialize. What are some efficient approaches for implementing PBD on the GPU?

What I figured out so far is that probably the best approach is one of the following:
- use atomic_add to maintain an append index into a shared buffer (simulating a stack), similarly to the way it's done in GPU-driven culling https://vkguide.dev/docs/gpudriven/compute_culling/
- preallocate a small stack per thread and let each thread append its constraints to that private stack. Then you can efficiently join all those stacks into one large array in logarithmic time. To show it more visually, consider 3 stacks, each of maximum length 2:
Stack 1: |_|_| (empty stack)
Stack 2: |A|_| (stack with one element)
Stack 3: |B|C| (stack fully filled up to its maximum length)
Then concatenate these stacks like so
|A|_|_|_|B|C|
and now iteratively count how far each element should be shifted, which is equal to the number of vacant slots to its left (inclusive). Create a second buffer that holds the number of those vacant slots. Its initial state is
|0|1|1|1|0|0| (each cell is 1 if it is empty, 0 otherwise)
In the second iteration you sum the values of the cells at positions i and i+1 and store the result in i+1.
|0|1|2|2|1|0| (each cell holds the number of empty cells up to 2 places to the left, including itself)
In the third iteration you sum the values of cells i and i+2.
|0|1|2|3|3|2| (each cell holds the number of empty cells up to 4 places to the left, including itself)
In the fourth iteration you sum the values of cells i and i+4.
|0|1|2|3|3|3| (each cell holds the number of empty cells up to 8 places to the left, including itself)
Now you know that A should be shifted left by 0, B by 3, and C by 3, arriving at the final result
|A|B|C|_|_|_|
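This is a standard parallel prefix sum (Hillis-Steele scan) followed by a scatter, often called stream compaction. A sequential Python sketch of the same computation (counting filled rather than vacant slots, which gives each element its compacted index directly and is equivalent to the shift computation above; the `compact` name and `None`-as-vacant convention are mine):

```python
def compact(cells):
    """Compact a buffer of per-thread stack slots with a Hillis-Steele scan.

    cells is the concatenation of all fixed-size stacks, with None marking
    a vacant slot. On a GPU each pass of the while-loop below is a single
    parallel step over the whole buffer; here it runs sequentially.
    """
    n = len(cells)
    # count[i] starts as 1 if slot i is filled, 0 if vacant
    count = [0 if c is None else 1 for c in cells]
    step = 1
    while step < n:
        prev = count[:]                     # double-buffer, as on a GPU
        for i in range(step, n):
            count[i] = prev[i] + prev[i - step]
        step *= 2
    # scatter: the element in slot i lands at compacted index count[i] - 1
    out = [None] * n
    for i, c in enumerate(cells):
        if c is not None:
            out[count[i] - 1] = c
    return out
```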


Efficient way to create masking kreg values
One of the benefits of Intel's AVX-512 extension is that nearly all operations can be masked by providing in addition to the vector register a kreg which specifies a mask to apply to the operation: elements excluded by the mask may either be set to zero or retain their previous value.
A particularly common use of the kreg is to create a mask that excludes N contiguous elements at the beginning or end of a vector, e.g., as the first or final iteration in a vectorized loop where less than a full vector would be processed. E.g., for a loop over 121 int32_t values, the first 112 elements could be handled by 7 full 512-bit vectors, but that leaves 9 elements left over which could be handled by masked operations which operate only on the first 9 elements.
So the question is, given a (runtime valued) integer r which is some value in the range 0 - 16 representing remaining elements, what's the most efficient way to load a 16-bit kreg such that the low r bits are set and the remaining bits unset? KSHIFTLW seems unsuitable for the purpose because it only takes an immediate.
BMI2 bzhi does exactly what you want: Zero High Bits Starting with Specified Bit Position. Every CPU with AVX512 so far has BMI2.
__mmask16 k = _bzhi_u32(-1UL, r);
This costs 2 instructions, both single-uop: mov-immediate and bzhi. It's even single-cycle latency. (Or 3 cycles on KNL)
For r=0, it zeros all the bits, giving 0.
For r=1, it leaves only the low bit (bit #0), giving 1.
For r=12, it zeros bit #12 and higher, leaving 0x0FFF (12 bits set).
For r>=32, BZHI leaves all 32 bits set (and sets CF).
The INDEX is specified by bits 7:0 of the second source operand
If you had a single-vector-at-a-time cleanup loop that runs after an unrolled vector loop, you could even use this every loop iteration, counting the remaining length down towards zero, instead of a separate last-vector cleanup. It leaves all bits set for large lengths. But this costs 2 uops inside the loop, including a port-5 kmovw, and means your main loop would have to use masked instructions. It only works for r<=255 because BZHI only looks at the low byte, not the full integer index. But the mov reg, -1 can be hoisted out of the loop because bzhi doesn't destroy it.
PS. Normally I think you'd want to arrange your cleanup to handle 1..16 elements, (or 0..15 if you branch to maybe skip it). But the full 17-possibility 0..16 makes sense if this cleanup also handles small lengths that never enter the main loop at all, and len=0 is possible. (And your main loop exits with length remaining = 1..16 so the final iteration can be unconditional)
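For reference, the mask semantics can be modeled in a few lines of Python (a behavioral sketch of `_bzhi_u32(-1, r)` truncated to a 16-bit kreg, matching the instruction description above, not the actual intrinsic):

```python
def bzhi_mask16(r):
    """Behavioral model of: __mmask16 k = _bzhi_u32(-1UL, r).

    BZHI zeroes the bits at position idx and above, where idx is taken
    from bits 7:0 of the index operand; for idx >= 32 all 32 bits of the
    source are kept (and CF is set). The result is truncated to 16 bits
    when moved into a kreg.
    """
    idx = r & 0xFF                      # only bits 7:0 of the index are used
    full = (1 << idx) - 1 if idx < 32 else 0xFFFFFFFF
    return full & 0xFFFF                # low 16 bits end up in the kreg
```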

Dealing with simultaneous collisions in N-body sim

I have written a 2D N-body simulation in Python which allows collisions between the bodies.
A body is modeled as a circle whose area is proportional to its mass. Each time the sim advances by one time-step, I identify which bodies have overlapping radii and I replace them by one particle (respecting conservation of momentum and treating the collision as inelastic).
This worked well for N<100, but for higher N I encountered (not systematically, but depending on the initial conditions of the sim) a weird error: the number of particles N actually increased, and quickly grew beyond anything tractable.
I believe what is occurring is that, after the new positions of the bodies have been calculated, (at least) 4 bodies mutually overlap. Each of the 6 pairs creates a new body, and we delete the original 4 bodies, so N increases by 6-4=2. But the 6 new bodies are spawned in roughly the same positions as the original 4, so now we have even more overlap (N can now increase by up to (6 choose 2)-6=9 bodies). Hence N can grow very fast.
How can I avoid this error?
The only thing I can think of is to reduce the radii of the bodies to reduce the chance of 4 particles mutually overlapping, but this does not definitively fix the problem.
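One robust fix is to merge each connected component of overlapping bodies into a single body in one pass, instead of creating one body per overlapping pair; then N can only stay the same or decrease. A Python sketch using union-find (the dict representation and field names are illustrative, not from the original sim):

```python
def merge_overlaps(bodies):
    """Merge every connected component of overlapping circles into one body.

    bodies: list of dicts with keys 'm', 'x', 'y', 'vx', 'vy', 'r'.
    Momentum is conserved, mass is summed, and the new radius keeps
    area proportional to mass.
    """
    n = len(bodies)
    parent = list(range(n))

    def find(i):                        # union-find with path halving
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):                  # union all overlapping pairs
        for j in range(i + 1, n):
            dx = bodies[i]['x'] - bodies[j]['x']
            dy = bodies[i]['y'] - bodies[j]['y']
            rsum = bodies[i]['r'] + bodies[j]['r']
            if dx * dx + dy * dy <= rsum * rsum:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(bodies[i])

    merged = []
    for group in groups.values():
        m = sum(b['m'] for b in group)
        merged.append({
            'm': m,
            'x': sum(b['m'] * b['x'] for b in group) / m,    # center of mass
            'y': sum(b['m'] * b['y'] for b in group) / m,
            'vx': sum(b['m'] * b['vx'] for b in group) / m,  # momentum / mass
            'vy': sum(b['m'] * b['vy'] for b in group) / m,
            'r': sum(b['r'] ** 2 for b in group) ** 0.5,     # area prop. to mass
        })
    return merged
```

In the 4-mutually-overlapping scenario above, this produces one body instead of six, so the runaway growth cannot start.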

Isosurface tracking in high dimensions

How can I trace an isosurface in a higher-dimensional space efficiently?
You have a scalar cost function in N dimensions,
f(y0, y1, .., yN) ∊ ℝ, yk ∊ ℝ
but sampled only in a regular rectangular grid,
yk = Ψk + ψk xk, constants Ψk ∊ ℝ and ψk ∊ ℝ, and grid coordinates xk ∊ ℕ
and the problem is to locate the isosurface(s) i,
f(y0, y1, .., yN) = Ci
The direct approach would be to just loop over each cell in the grid, and check if the current isosurface intersects the current cell, and if so, describe the part of the isosurface within the current cell. (Marching Cubes is one approach to describing how the isosurface intersects each grid cell.)
The restriction here is to use a neighborhood based search instead of examining every single cell.
OP had a previous question specifically for the 3D case, to which I posted a link to example code, grid.h and grid.c (at Pastebin.com, because they were too long to include inline).
That implementation is completely different to OP's slicing method. Mine is a direct, simple walk over the grid cells intersecting the current isosurface. It caches the grid samples, and uses a separate map (one char per grid cell) to keep track which grid cells have been cached, walked, and/or pushed to a stack to be walked later. This approach is easily extended to more than three dimensions. Although the code is written for exactly three dimensions, the approach itself is not specific to three dimensions at all; all you need to do is to adjust the data structures to accommodate any (sensible) number of dimensions.
The isosurface walk itself is trivial. You start from any grid cell the isosurface intersects, then examine all 2N nearest neighbor cells to see if the isosurface intersects those too. In practice, you use a stack of grid cell locations to be examined, and a map of grid cell flags to avoid re-examining already examined grid cells.
Because the number of grid point samples per grid cell is 2^N, my example code is not optimal: a lot of nearby grid points end up being evaluated to see if the neighboring grid cells do intersect the isosurface. (Instead of examining only the grid points delimiting the isosurface, grid points belonging to any grid cells surrounding the isosurface are examined.) This extra work grows exponentially as N increases.
A better approach would be to consider each of the 2N possible (N-1)-faces separately, to avoid examining cells the isosurface does not intersect at all.
In an N-dimensional regular rectangular grid, each cell is an N-dimensional cuboid, defined by the 2^N grid points at its vertices (corners). The N-cuboid cells have N(N-1)·2^(N-3) two-dimensional faces, and 2N (N-1)-dimensional faces.
To examine each (N-1)-face, you need to examine the cost function at the 2^(N-1) grid points defining that (N-1)-face. If the cost function at those points spans the isosurface value, then the isosurface intersects the (N-1)-face, and the isosurface intersects the next grid cell in that direction also.
There are two (N-1)-faces perpendicular to each axis. If the isosurface intersects the (N-1)-face closer to negative infinity, then the isosurface intersects the next grid cell along that axis towards negative infinity too. Similarly, if the isosurface intersects the (N-1)-face closer to positive infinity, then it also intersects the next grid cell along that axis towards positive infinity too. Thus, the (N-1)-faces are perfect for deciding which neighboring cells should be examined or not. This is true because the (N-1)-face is exactly the set of grid points the two cells share.
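The walk and the (N-1)-face test described above can be sketched in Python (a readability-oriented sketch, not the C implementation from the linked grid.h/grid.c; the cell addressing and start-cell argument are my choices):

```python
import itertools

def walk_isosurface(f, shape, start, C):
    """Stack-based walk over the grid cells intersected by f = C.

    f     : maps an N-tuple of integer grid coordinates to a sample value
    shape : number of cells per axis (cell (i,..) owns corners (i,..)..(i+1,..))
    start : index tuple of one cell known to intersect the isosurface
    Returns the set of visited cell index tuples.
    """
    N = len(shape)
    offsets = list(itertools.product((0, 1), repeat=N))   # the 2^N corners

    def face_spans(cell, axis, side):
        # sample the 2^(N-1) corners of the (N-1)-face on `side` of `axis`
        vals = [f(tuple(c + o for c, o in zip(cell, off)))
                for off in offsets if off[axis] == side]
        return min(vals) <= C <= max(vals)

    visited = {start}
    stack = [start]
    while stack:
        cell = stack.pop()
        for axis in range(N):
            for side, step in ((0, -1), (1, +1)):
                nb = list(cell)
                nb[axis] += step
                nb = tuple(nb)
                if not 0 <= nb[axis] < shape[axis] or nb in visited:
                    continue
                # the shared (N-1)-face decides whether to visit the neighbor
                if face_spans(cell, axis, side):
                    visited.add(nb)
                    stack.append(nb)
    return visited
```

Because the shared face's corners are a subset of both cells' corners, a spanning face guarantees both cells span the isovalue, exactly as argued above.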
I'm very hesitant to provide example C code, because the example code of the same approach for the 3D case does not seem to have helped anyone thus far. I fear a longer explanation with 2- and 3-dimensional example images for illustration would be needed to describe the approach in easily understandable terms; and without a firm grasp of the logic, any example code would just look like gobbledygook.
For two dimensions you are better off using a library: try the CONREC algorithm from Prof. Paul Bourke. It's similar to Marching Cubes (in its 2D form, Marching Squares).

How to make a start on the "crackless wall" problem

Here's the problem statement:
Consider the problem of building a wall out of 2x1 and 3x1 bricks (horizontal×vertical dimensions) such that, for extra strength, the gaps between horizontally-adjacent bricks never line up in consecutive layers, i.e. never form a "running crack".
There are eight ways of forming a crack-free 9x3 wall, written W(9,3) = 8.
Calculate W(32,10). < Generalize it to W(x,y) >
http://www.careercup.com/question?id=67814&form=comments
The above link gives a few solutions, but I'm unable to understand the logic behind them. I'm trying to code this in Perl and have got this far:
input : W(x,y)
find all possible i's and j's such that x == 3*i + 2*j
for each pair (i,j),
find n = C(i+j, j)   # C: combinations
Adding all these n's should give the count of all possible rows. But I have no idea how to find the actual combinations for one row and how to proceed further.
Based on the claim that W(9,3)=8, I'm inferring that a "running crack" means any continuous vertical crack of height two or more. Before addressing the two-dimensional problem as posed, I want to discuss an analogous one-dimensional problem and its solution. I hope this will make it more clear how the two-dimensional problem is thought of as one-dimensional and eventually solved.
Suppose you want to count the number of lists of length, say, 40, whose symbols come from a reasonably small set of, say, the five symbols {a,b,c,d,e}. Certainly there are 5^40 such lists. If we add an additional constraint that no letter can appear twice in a row, the mathematical solution is still easy: there are 5*4^39 lists without repeated characters. If, however, we instead wish to outlaw consonant combinations such as bc, cb, bd, etc., then things are more difficult. Of course we would like to count the number of ways to choose the first character, the second, etc., and multiply, but the number of ways to choose the second character depends on the choice of the first, and so on. This new problem is difficult enough to illustrate the right technique (though not difficult enough to make it completely resistant to mathematical methods!).
To solve the problem of lists of length 40 without consonant combinations (let's call this f(40)), we might imagine using recursion. Can you calculate f(40) in terms of f(39)? No, because some of the lists of length 39 end with consonants and some end with vowels, and we don't know how many of each type we have. So instead of computing, for each length n<=40, f(n), we compute, for each n and for each character k, f(n,k), the number of lists of length n ending with k. Although f(40) cannot be calculated from f(39) alone, f(40,a) can be calculated in terms of f(39,a), f(39,b), etc.
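The f(n, k) recursion can be sketched directly in Python (the symbol set and the adjacency rule are parameters; the consonant-pair example below is my illustration of the constraint described above):

```python
def count_lists(n, symbols, allowed):
    """The f(n, k) recursion: counts length-n lists over `symbols`,
    where allowed(j, k) says whether symbol k may follow symbol j.

    f maps each final symbol k to the number of valid lists ending in k.
    """
    f = {k: 1 for k in symbols}                  # all length-1 lists are valid
    for _ in range(n - 1):
        f = {k: sum(f[j] for j in symbols if allowed(j, k)) for k in symbols}
    return sum(f.values())
```

With allowed(j, k) = (j != k) this reproduces the 5*4^(n-1) count for lists without repeated characters.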
The strategy described above can be used to solve your two-dimensional problem. Instead of characters, you have entire horizontal brick-rows of length 32 (or x). Instead of 40, you have 10 (or y). Instead of a no-consonant-combinations constraint, you have the no-adjacent-cracks constraint.
You specifically ask how to enumerate all the brick-rows of a given length, and you're right that this is necessary, at least for this approach. First, decide how a row will be represented. Clearly it suffices to specify the locations of the 3-bricks, and since each has a well-defined center, it seems natural to give a list of locations of the centers of the 3-bricks. For example, with a wall length of 15, the sequence (1,8,11) would describe a row like this: (ooo|oo|oo|ooo|ooo|oo). This list must satisfy some natural constraints:
The initial and final positions cannot be the centers of a 3-brick. Above, 0 and 14 are invalid entries.
Consecutive differences between numbers in the sequence must be odd, and at least three.
The position of the first entry must be odd.
The difference between the last entry and the length of the wall must be even (above, 15 − 11 = 4), so that the remaining space can be tiled with 2-bricks.
There are various ways to compute and store all such lists, but the conceptually easiest is a recursion on the length of the wall, ignoring condition 4 until you're done. Generate a table of all lists for walls of length 2, 3, and 4 manually, then for each n, deduce a table of all lists describing walls of length n from the previous values. Impose condition 4 when you're finished, because it doesn't play nice with recursion.
You'll also need a way, given any brick-row S, to quickly describe all brick-rows S' which can legally lie beneath it. For simplicity, let's assume the length of the wall is 32. A little thought should convince you that
S' must satisfy the same constraints as S, above.
1 is in S' if and only if 1 is not in S.
30 is in S' if and only if 30 is not in S.
For each entry q in S, S' must have a corresponding entry q+1 or q-1, and conversely every element of S' must be q-1 or q+1 for some element q in S.
For example, the list (1,8,11) can legally be placed on top of (7,10,30), (7,12,30), or (9,12,30), but not (9,10,30) since this doesn't satisfy the "at least three" condition. Based on this description, it's not hard to write a loop which calculates the possible successors of a given row.
Now we put everything together:
First, for fixed x, make a table of all legal rows of length x. Next, write a function W(y,S), which is to calculate (recursively) the number of walls of width x, height y, and top row S. For y=1, W(y,S)=1. Otherwise, W(y,S) is the sum over all S' which can be related to S as above, of the values W(y-1,S').
This solution is efficient enough to solve the problem W(32,10), but would fail for large x. For example, W(100,10) would almost certainly be infeasible to calculate as I've described. If x were large but y were small, we would break all sensible brick-laying conventions and consider the wall as being built up from left-to-right instead of bottom-to-top. This would require a description of a valid column of the wall. For example, a column description could be a list whose length is the height of the wall and whose entries are among five symbols, representing "first square of a 2x1 brick", "second square of a 2x1 brick", "first square of a 3x1 brick", etc. Of course there would be constraints on each column description and constraints describing the relationship between consecutive columns, but the same approach as above would work this way as well, and would be more appropriate for long, short walls.
I found this Python code online here and it works fast and correctly. I do not understand how it all works, though. I got my C++ version to the last step (counting the total number of solutions) but could not get it to work correctly.
def brickwall(w,h):
    # generate single brick layer of width w (by recursion)
    def gen_layers(w):
        if w in (0,1,2,3):
            return {0:[], 1:[], 2:[[2]], 3:[[3]]}[w]
        return [(layer + [2]) for layer in gen_layers(w-2)] + \
               [(layer + [3]) for layer in gen_layers(w-3)]
    # precompute info about whether pairs of layers are compatible
    def gen_conflict_mat(layers, nlayers, w):
        # precompute internal brick positions for easy comparison
        def get_internal_positions(layer, w):
            acc = 0; intpos = set()
            for brick in layer:
                acc += brick; intpos.add(acc)
            intpos.remove(w)
            return intpos
        intpos = [get_internal_positions(layer, w) for layer in layers]
        mat = []
        for i in range(nlayers):
            mat.append([j for j in range(nlayers) \
                        if intpos[i].isdisjoint(intpos[j])])
        return mat
    layers = gen_layers(w)
    nlayers = len(layers)
    mat = gen_conflict_mat(layers, nlayers, w)
    # dynamic programming to recursively compute wall counts
    nwalls = nlayers*[1]
    for i in range(1,h):
        nwalls = [sum(nwalls[k] for k in mat[j]) for j in range(nlayers)]
    return sum(nwalls)

print(brickwall(9,3))    #8
print(brickwall(9,4))    #10
print(brickwall(18,5))   #7958
print(brickwall(32,10))  #806844323190414

Shared Memory Bank Conflicts in CUDA: How memory is aligned to banks

As far as my understanding goes, shared memory is divided into banks, and simultaneous accesses by multiple threads to different addresses within the same bank cause a conflict (while accesses to the same address are broadcast).
At the moment I allocate a fairly large array which conceptually represents several pairs of two matrices:
__shared__ float A[34*N];
Where N is the number of pairs and the first 16 floats of a pair are one matrix and the following 18 floats are the second.
The thing is, access to the first matrix is conflict-free, but access to the second one has conflicts. These conflicts are unavoidable; however, my thinking is that because the second matrix is 18 floats long, all subsequent matrices will be misaligned relative to the banks, and therefore more conflicts than necessary will occur.
Is this true, if so how can I avoid it?
Every time I allocate shared memory, does it start at a new bank? So potentially could I do
__shared__ float Apair1[34];
__shared__ float Apair2[34];
...
Any ideas?
Thanks
If your pairs of matrices are stored contiguously, and if you are accessing the elements linearly by thread index, then you will not have shared memory bank conflicts.
In other words if you have:
A[0]  <- mat1 element1
A[1]  <- mat1 element2
A[2]  <- mat1 element3
...
A[15] <- mat1 element16
A[16] <- mat2 element1
A[17] <- mat2 element2
...
A[33] <- mat2 element18
And you access this using:
float element;
element = A[pairindex * 34 + matindex * 16 + threadIdx.x];
Then adjacent threads are accessing adjacent elements in the matrix and you do not have conflicts.
In response to your comments (below) it does seem that you are mistaken in your understanding. It is true that there are 16 banks (in current generations, 32 in the next generation, Fermi) but consecutive 32-bit words reside in consecutive banks, i.e. the address space is interleaved across the banks. This means that provided you always have an array index that can be decomposed to x + threadIdx.x (where x is not dependent on threadIdx.x, or at least is constant across groups of 16 threads) you will not have bank conflicts.
When you access the matrices further along the array, you still access them in a contiguous chunk and hence you will not have bank conflicts. It is only when you start accessing non-adjacent elements that you will have bank conflicts.
The reduction sample in the SDK illustrates bank conflicts very well by building from a naive implementation to an optimised implementation, possibly worth taking a look.
Banks are set up such that each successive 32-bit word is in the next bank. So, if you declare an array of 4-byte floats, each subsequent float in the array will be in the next bank (modulo 16 or 32, depending on your architecture). I'll assume you're on compute capability 1.x, so you have 16 banks.
If you have arrays of 18 and 16, things can be funny. You can avoid bank conflicts in the 16x16 array by declaring it like
__shared__ float sixteen[16][16+1];
which avoids bank conflicts when accessing transposed elements using threadIdx.x (as I assume you're doing if you're getting conflicts). Without padding, when accessing the elements in, say, the first column of a 16x16 matrix, they all reside in the same bank. What you want is for each of them to be in a successive bank, and the padding does this for you. You treat the array exactly as you would before, as sixteen[row][column], or similarly for a flattened matrix, as sixteen[row*(16+1)+column], if you want.
For the 18x18 case, when accessing in the transpose, you're moving at an even stride. The answer again is to pad by 1.
__shared__ float eighteens[18][18+1];
So now, when you access in the transpose (say, the elements of the first column), successive threads step through the banks with stride (18+1) % 16 = 3, hitting banks 0, 3, 6, 9, 12, 15, 2, 5, 8, etc., so you should get no conflicts.
The particular alignment shift due to having a matrix of size 18 isn't the problem, because the starting point of the array makes no difference, it's only the order in which you access it. If you want to flatten the arrays I've proposed above, and merge them into 1, that's fine, as long as you access them in a similar fashion.
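The effect of the padding can be checked with a small Python model of the bank mapping (16 banks of 32-bit words assumed, per compute capability 1.x; `column_banks` is an illustrative helper, not a CUDA API):

```python
NBANKS = 16  # compute capability 1.x; 32 on Fermi and later

def column_banks(width, pad, rows=16):
    """Banks hit when `rows` consecutive threads each read one element of
    a column in a row-major [rows][width + pad] shared-memory float array
    (the transposed access pattern discussed above)."""
    stride = width + pad
    # consecutive 32-bit words fall into consecutive banks
    return [(r * stride) % NBANKS for r in range(rows)]
```

With no padding, a 16-wide array puts a whole column into one bank (a 16-way conflict); padding by one float spreads both the 16-wide and 18-wide cases across all 16 banks, because the padded strides 17 and 19 are coprime with 16.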