I am trying to change the elements of a C-style array. Using an NSArray/NSMutableArray is not an option for me.
My code is as follows:
int losingPositionsX[] = {0, 4, 8...};
but when I enter this code
losingPositionsX = {8, 16, 24};
to change the array's elements, I get the error "expected expression". How can I make the copy?
In C (and by extension, in Objective-C) you cannot assign C arrays to each other like that. You copy C arrays with memcpy, like this:
int losingPositionsX[] = {0, 4, 8};
memcpy(losingPositionsX, (int[3]){8, 16, 24}, sizeof(losingPositionsX));
Important: this solution requires that the sizes of the two arrays be equal.
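The size requirement above can be checked before copying. This is only a minimal sketch: copy_int_array and demo_copy are hypothetical helper names that do not appear in the question, and the replacement values are the ones from the question.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: copies src into dst only when both byte sizes match.
   Returns 1 on success, 0 if the sizes differ (dst is left untouched). */
static int copy_int_array(int *dst, size_t dst_bytes,
                          const int *src, size_t src_bytes)
{
    if (dst_bytes != src_bytes) {
        return 0;
    }
    memcpy(dst, src, dst_bytes);
    return 1;
}

/* Applies the helper to a three-element array and returns its last
   element after the copy, or -1 if the sizes did not match. */
static int demo_copy(void)
{
    int losingPositionsX[] = {0, 4, 8};
    const int replacement[] = {8, 16, 24};

    /* sizeof yields the full array size here because both names are
       real arrays in this scope, not pointers. */
    if (!copy_int_array(losingPositionsX, sizeof losingPositionsX,
                        replacement, sizeof replacement)) {
        return -1;
    }
    return losingPositionsX[2];
}
```

Note that sizeof only reports the array size while the name still refers to an array; once the array has been passed to a function as a pointer, the byte count has to be passed along explicitly, as the helper does.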
You have to use something like memcpy() or a loop.
#define ARRAY_SIZE 3
const int VALUES[ARRAY_SIZE] = {8, 16, 24};
for (int i = 0; i < ARRAY_SIZE; i++)
    losingPositionsX[i] = VALUES[i];
Alternatively, with memcpy(),
// Assuming VALUES has the same type and size as losingPositionsX
memcpy(losingPositionsX, VALUES, sizeof(VALUES));
// Same thing
memcpy(losingPositionsX, VALUES, sizeof(losingPositionsX));
// Same thing (but don't use this one)
memcpy(losingPositionsX, VALUES, sizeof(int) * 3);
Since you are on OS X, which supports C99, you can use compound literals:
memcpy(losingPositionsX, (int[3]){8, 16, 24}, sizeof(losingPositionsX));
The loop is the safest, and will probably be optimized into the same machine code as memcpy() by the compiler. It's relatively easy to make typos with memcpy().
I do not know whether this helps you with regard to memory management, but you can do:
int * losingPositionsX = (int[]){ 0, 4, 8 };
losingPositionsX = (int[]){ 8, 16, 32 };
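A small sketch of what this pointer-based variant implies, assuming it is used inside a function. sum_after_reseat is a hypothetical name. At block scope a compound literal has automatic storage duration, so the pointer is only valid until the enclosing block ends, and sizeof on the pointer no longer gives the array size.

```c
/* Reseats a pointer from one compound literal to another and sums the
   three elements it then refers to. Both literals live until the end of
   this function's block, so the reads below are valid. */
static int sum_after_reseat(void)
{
    int *losingPositionsX = (int[]){ 0, 4, 8 };

    /* No elements are copied here: the pointer simply refers to a
       different unnamed array from now on. */
    losingPositionsX = (int[]){ 8, 16, 32 };

    return losingPositionsX[0] + losingPositionsX[1] + losingPositionsX[2];
}
```

Returning losingPositionsX itself from such a function would hand out a dangling pointer, which is why this sketch returns a computed value instead.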
I'm trying to mesh a piece of a beam with a stirrup and bars, but I'm having trouble with the stirrup: it lies inside the main domain, and I do not know how to handle that. I'm attaching the .geo file, hoping someone can help. Maybe there is another way to mesh it.
SetFactory("OpenCASCADE");
// Input
Rectangle(1) = {0, 0, 0, 300, 300, 0};
Disk(2) = {50, 50, 0, 10, 10};
Disk(3) = {50,250,0,10,10};
Disk(4) = {250,250,0,10,10};
Disk(5) = {250,50,0,10,10};
Rectangle(6) = {30,30,146,240,240,10};
Rectangle(7) = {40,40,146,220,220,10};
// Start Operations
s() = BooleanFragments{ Surface{1}; Delete; }{ Surface{2,3,4,5}; Delete;};
ext() = Extrude{0,0,300} {Surface{s()}; Layers{10}; Recombine;};
st() = BooleanFragments{ Surface{6}; Delete;}{Surface{7}; Delete;};
Recursive Delete {Surface{7}; }
Extrude{0,0,10} {Surface{22}; Layers{10}; Recombine;}
BooleanFragments{ Volume{5}; Delete;}{Volume{6}; Delete;}
// Mesh Options all elements needs to be Hexa
Mesh.RecombineAll = 2;
Not a complete answer; however, I think I identified the problem that probably causes the major troubles:
The circular extrusions (cylinders) touch the stirrup exactly at its vertices, which complicates the OpenCASCADE-based BooleanFragments operation.
The following code:
SetFactory("OpenCASCADE");
// Input
Rectangle(1) = {0, 0, 0, 300, 300, 0};
Disk(2) = {52, 52, 0, 10, 10};
Disk(3) = {52,248,0,10,10};
Disk(4) = {248,248,0,10,10};
Disk(5) = {248,52,0,10,10};
Rectangle(6) = {30,30,146,240,240,10};
Rectangle(7) = {40,40,146,220,220,10};
// Start Operations
s() = BooleanFragments{ Surface{1}; Delete; }{ Surface{2,3,4,5}; Delete;};
ext() = Extrude{0,0,300} {Surface{s()}; Layers{10}; Recombine;};
st() = BooleanFragments{ Surface{6}; Delete;}{Surface{7}; Delete;};
Recursive Delete {Surface{7}; }
Extrude{0,0,10} {Surface{22}; Layers{10}; Recombine;}
BooleanFragments{ Volume{5}; Delete;}{Volume{6}; Delete;}
// Mesh Options all elements needs to be Hexa
Mesh.RecombineAll = 2;
where I slightly shifted the cylinders to the inside (50->52 and 250 -> 248) should not have the meshing problem.
However, this disconnects the cylinders from the loop and modifies the problem drastically. Here is a zoom on the problematic part in the original, unmodified setup.
So what you are asking of the CAD tool is to merge those two surfaces (the loop and the cylinder) automatically using BooleanFragments, which can be problematic, especially once floating-point arithmetic is taken into account.
Having a little trouble tracking down the Swift equivalent of:
//timeArray and locationArray are NSMutableArrays
NSRange removalRange = NSMakeRange(0, i);
[timeArray removeObjectsInRange:removalRange];
[locationArray removeObjectsInRange:removalRange];
I see that Swift does have a call in the API: typealias NSRange = _NSRange but I haven't got past that part. Any help?
In addition to Antonio's answer, you can also just use the range operator:
var array = [0, 1, 2, 3, 4, 5]
array.removeRange(1..<3)
// array is now [0, 3, 4, 5]
The half-open range operator (1..<3) includes 1, up to but not including 3 (so 1-2).
The closed range operator (1...3) includes 3 (so 1-3).
Use the removeRange method of Swift arrays, which requires an instance of the Range struct to define the range:
var array = [1, 2, 3, 4]
let range = Range(start: 0, end: 1)
array.removeRange(range)
This code removes all array elements from index 0 (inclusive) up to index 1 (exclusive).
Swift 3
As suggested by @bitsand, the above code is deprecated. It can be replaced with:
let range = 0..<1
array.removeSubrange(range)
or, more concisely:
array.removeSubrange(0..<1)
I was trying to enumerate filetypes with bitmasking for fast and easy distinguishing on bitwise OR:
typedef enum {
FileTypeDirectory = 1,
FileTypePIX = 2,
FileTypeJPG = 4,
FileTypePNG = 8,
FileTypeGIF = 16,
FileTypeHTML = 32,
FileTypeXML = 64,
FileTypeTXT = 128,
FileTypePDF = 256,
FileTypePPTX = 512,
FileTypeAll = 1023
} FileType;
My OR operations worked up to 128 and failed after that. Are enums on 64-bit Mac OS X limited to byte-sized data types? (2^7 = 128)
All enum constants in C are of type int and not of the type of the enumeration itself. So the restriction is not in the storage size for enum variables, but only in the number of bits for an int.
I don't know much about Objective-C (which this is also tagged with), but it shouldn't deviate much from C.
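As a rough illustration of that point, here is the enum from the question with two hypothetical helpers, example_mask and has_file_type, that are not part of the original code. The combined mask is kept in an int, which is an assumption about how the flags are stored.

```c
typedef enum {
    FileTypeDirectory = 1,
    FileTypePIX = 2,
    FileTypeJPG = 4,
    FileTypePNG = 8,
    FileTypeGIF = 16,
    FileTypeHTML = 32,
    FileTypeXML = 64,
    FileTypeTXT = 128,
    FileTypePDF = 256,
    FileTypePPTX = 512,
    FileTypeAll = 1023
} FileType;

/* Combines three flags; the enumeration constants are plain ints, so
   the result is an int that can hold bits above 128. */
static int example_mask(void)
{
    return FileTypeGIF | FileTypePDF | FileTypePPTX;
}

/* Returns nonzero when every bit of `flag` is set in `mask`. */
static int has_file_type(int mask, FileType flag)
{
    int bits = (int)flag;
    return (mask & bits) == bits;
}
```

If the combined value were stored in an 8-bit variable somewhere, the bits from 256 upward would be lost. That would match the symptom described, but it is only a guess, since the failing code is not shown.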
I'm not quite sure how you used the OR operator but it works for me well with your typedef.
FileType _fileType = FileTypeGIF | FileTypePDF | FileTypePPTX;
NSLog(@"filetype is : %d", _fileType);
the result is:
filetype is : 784
which is the correct value, because 16 + 256 + 512 is exactly 784.
(Tested on a real device only.)
I have started learning OpenCL and I currently try to test how much I can improve performance for a simple skeletal animation algorithm. To do this I have written a program that performs skeletal animation from randomly generated vertices and transformation matrices twice, once with an SSE-optimized linear algebra library in plain C++, and once using my own OpenCL kernel on GPU (I'm testing on an Nvidia GTX 460).
I started off with a simple kernel where each work-item transforms exactly one vertex, with all values read from global memory. Because I was not satisfied with the performance of this kernel, I tried to optimize a little. My current kernel looks like this:
inline float4 MultiplyMatrixVector(float16 m, float4 v)
{
return (float4) (
dot(m.s048C, v),
dot(m.s159D, v),
dot(m.s26AE, v),
dot(m.s37BF, v)
);
}
kernel void skelanim(global const float16* boneMats, global const float4* vertices, global const float4* weights, global const uint4* indices, global float4* resVertices)
{
int gid = get_global_id(0);
int lid = get_local_id(0);
local float16 lBoneMats[NUM_BONES];
async_work_group_copy(lBoneMats, boneMats, NUM_BONES, 0);
barrier(CLK_LOCAL_MEM_FENCE);
for (int i = 0 ; i < NUM_VERTICES_PER_WORK_ITEM ; i++) {
int vidx = gid*NUM_VERTICES_PER_WORK_ITEM + i;
float4 vertex = vertices[vidx];
float4 w = weights[vidx];
uint4 idx = indices[vidx];
resVertices[vidx] = (MultiplyMatrixVector(lBoneMats[idx.x], vertex * w.x)
+ MultiplyMatrixVector(lBoneMats[idx.y], vertex * w.y)
+ MultiplyMatrixVector(lBoneMats[idx.z], vertex * w.z)
+ MultiplyMatrixVector(lBoneMats[idx.w], vertex * w.w));
}
}
Now I process a constant number of vertices per work-item, and I prefetch all the bone matrices into local memory only once for each work-item, which I believed would lead to way better performance because the matrices for multiple vertices could be read from the faster local memory afterwards. Unfortunately, this kernel performs worse than my first attempt, and even worse than the CPU-only implementation.
Why is performance so bad with this should-be optimization?
If it helps, here is how I execute the kernel:
#define NUM_BONES 50
#define NUM_VERTICES 30000
#define NUM_VERTICES_PER_WORK_ITEM 100
#define NUM_ANIM_REPEAT 1000
uint64_t PerformOpenCLSkeletalAnimation(Matrix4* boneMats, Vector4* vertices, float* weights, uint32_t* indices, Vector4* resVertices)
{
File kernelFile("/home/alemariusnexus/test/skelanim.cl");
char opts[256];
sprintf(opts, "-D NUM_VERTICES=%u -D NUM_REPEAT=%u -D NUM_BONES=%u -D NUM_VERTICES_PER_WORK_ITEM=%u", NUM_VERTICES, NUM_ANIM_REPEAT, NUM_BONES, NUM_VERTICES_PER_WORK_ITEM);
cl_program prog = BuildOpenCLProgram(kernelFile, opts);
cl_kernel kernel = clCreateKernel(prog, "skelanim", NULL);
cl_mem boneMatBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_BONES*sizeof(Matrix4), boneMats, NULL);
cl_mem vertexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*sizeof(Vector4), vertices, NULL);
cl_mem weightBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(float), weights, NULL);
cl_mem indexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(uint32_t), indices, NULL);
cl_mem resVertexBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, NUM_VERTICES*sizeof(Vector4), NULL, NULL);
uint64_t s, e;
s = GetTickcount();
clSetKernelArg(kernel, 0, sizeof(cl_mem), &boneMatBuf);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &vertexBuf);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &weightBuf);
clSetKernelArg(kernel, 3, sizeof(cl_mem), &indexBuf);
clSetKernelArg(kernel, 4, sizeof(cl_mem), &resVertexBuf);
size_t globalWorkSize[] = { NUM_VERTICES / NUM_VERTICES_PER_WORK_ITEM };
size_t localWorkSize[] = { NUM_BONES };
for (size_t i = 0 ; i < NUM_ANIM_REPEAT ; i++) {
clEnqueueNDRangeKernel(cq, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
}
clEnqueueReadBuffer(cq, resVertexBuf, CL_TRUE, 0, NUM_VERTICES*sizeof(Vector4), resVertices, 0, NULL, NULL);
e = GetTickcount();
return e-s;
}
I guess there are more things that could be optimized, maybe batching some of the other global reads together, but first I would really like to know why this first optimization didn't work.
Two things are affecting the performance in your exercise.
1) OpenCL C is based on C99, where inline is only a hint: the clcc compiler may ignore the keyword and emit a regular call, or it may inline the function silently. It is not required to inline.
So it is safer to define MultiplyMatrixVector as a preprocessor macro, though this is not the major problem in your case.
2) You treat the local memory (the LDM) incorrectly.
Although its latency is many times lower than that of global memory when it is accessed properly, local memory is subject to bank conflicts.
Your vertex index is calculated with a stride of 100 per work-item. The number of banks depends on the GPU in use, but it is usually 16 or 32, i.e. you may access up to 16 (or 32) four-byte LDM variables in one cycle without penalty if all of them are in different banks. Otherwise you get a bank conflict (two or more threads accessing the same bank), and the accesses are serialized.
Your 100 threads in a work group access the array in LDM with no special arrangement to avoid bank conflicts. Moreover, the array elements are float16, i.e. a single element spans all 16 banks (or half of 32 banks). Thus, you have a bank conflict in each row of the MultiplyMatrixVector function. The cumulative degree of that conflict is at least 16x32 (here 16 is the number of vector elements you access and 32 is the size of a half-wavefront or half-warp).
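To make the bank arithmetic concrete, here is a deliberately simplified host-side model in C. It is an assumption for illustration, not a vendor formula: local memory is treated as num_banks banks of four-byte words, thread t reads word t * stride_words, and broadcast of identical addresses is ignored. max_bank_conflict is a hypothetical name.

```c
#define MAX_BANKS 64

/* Returns the largest number of threads that land on the same bank,
   i.e. the factor by which one access is serialized in this model.
   Returns 0 for invalid arguments. */
static int max_bank_conflict(int num_threads, int stride_words, int num_banks)
{
    int hits[MAX_BANKS] = {0};
    int worst = 0;

    if (num_threads <= 0 || stride_words < 0 ||
        num_banks <= 0 || num_banks > MAX_BANKS) {
        return 0;
    }
    for (int t = 0; t < num_threads; t++) {
        /* Word index of this thread's access, mapped onto a bank. */
        int bank = (int)(((long)t * stride_words) % num_banks);
        hits[bank]++;
        if (hits[bank] > worst) {
            worst = hits[bank];
        }
    }
    return worst;
}
```

In this model a stride of one word across 32 threads and 32 banks gives a factor of 1 (no conflict), while a stride of 16 words, the size of one float16 element, puts the 32 threads on only two banks. The real kernel indexes the matrices through idx rather than with a fixed stride, so the model only shows the trend.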
The solution here is not to copy that array to LDM, but to allocate it on the host with CL_MEM_READ_ONLY (which you already did) and declare the boneMats kernel argument with the __constant qualifier.
Then the OpenCL library would allocate the memory in the constant area inside GPU and the access to that array would be fast:
kernel void skelanim(__constant const float16* boneMats,
global const float4* vertices,
global const float4* weights,
global const uint4* indices,
global float4* resVertices)
It looks like EACH thread in a work group is copying the same 50 bone matrices before the computation starts. This will saturate the global memory bandwidth.
Try this:
if ( lid == 0 )
{
async_work_group_copy(lBoneMats, boneMats, NUM_BONES, 0);
}
This does the copy only once per work group.
Is it just me, or is there no binary search function in Phobos? I have a pre-sorted array that I want to search with my own comparator function, but I can't find anything in std.algorithm or std.container.
Thanks!
Use SortedRange from std.range:
Cribbed from http://www.digitalmars.com/d/2.0/phobos/std_range.html#SortedRange:
auto a = [ 1, 2, 3, 42, 52, 64 ];
auto r = assumeSorted(a);
assert(r.canFind(3));
assert(!r.canFind(32));
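For comparison only, and not as a Phobos API: the same lookup with a user-supplied comparator can be sketched in C with the standard bsearch function. compare_ints and sorted_contains are hypothetical helper names.

```c
#include <stddef.h>
#include <stdlib.h>

/* Comparator in the form bsearch expects: negative, zero or positive
   depending on how the key compares with the array element. */
static int compare_ints(const void *lhs, const void *rhs)
{
    int a = *(const int *)lhs;
    int b = *(const int *)rhs;
    return (a > b) - (a < b);
}

/* Returns 1 if key occurs in the array, 0 otherwise. The array must
   already be sorted according to compare_ints. */
static int sorted_contains(const int *sorted, size_t count, int key)
{
    return bsearch(&key, sorted, count, sizeof sorted[0], compare_ints) != NULL;
}
```

As with assumeSorted, the array is trusted to be sorted already; if it is not, the search may miss elements that are present.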