Enum bitwise masking limitations - Objective-C

I was trying to enumerate file types as bit masks so they can be combined and distinguished quickly with bitwise OR:
typedef enum {
FileTypeDirectory = 1,
FileTypePIX = 2,
FileTypeJPG = 4,
FileTypePNG = 8,
FileTypeGIF = 16,
FileTypeHTML = 32,
FileTypeXML = 64,
FileTypeTXT = 128,
FileTypePDF = 256,
FileTypePPTX = 512,
FileTypeAll = 1023
} FileType;
My OR operations worked up to 128, but failed beyond that. Are enums on 64-bit Mac OS X limited to byte-sized data types? (2^7 = 128)

All enum constants in C are of type int and not of the type of the enumeration itself. So the restriction is not in the storage size for enum variables, but only in the number of bits for an int.
I don't know much Objective-C (which this is also tagged with), but it shouldn't deviate much from C here.
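For what it's worth, here is a minimal standalone C sketch (just a made-up subset of your flags) showing that values well beyond 128 combine and test fine, because the constants are plain ints:
#include <assert.h>

typedef enum {
    FileTypeTXT  = 128,
    FileTypePDF  = 256,
    FileTypePPTX = 512
} FileType;

int main(void)
{
    int mask = FileTypeTXT | FileTypePDF | FileTypePPTX; /* 128 + 256 + 512 = 896 */
    assert(mask == 896);
    assert((mask & FileTypePDF) != 0);  /* PDF flag is set         */
    assert((mask & 1) == 0);            /* Directory flag is clear */
    return 0;
}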

I'm not quite sure how you used the OR operator, but it works fine for me with your typedef.
FileType _fileType = FileTypeGIF | FileTypePDF | FileTypePPTX;
NSLog(@"filetype is : %d", _fileType);
the result is:
filetype is : 784
which is the correct value, because 16 + 256 + 512 is precisely 784.
(It has been tested on a real device only.)

Represent a 3d vector as a single numerical value

Is it possible to convert a 3d vector representing a colour into a single numerical value (x)? Ideally something that is a float value between 0 and 1. Maths is not my strong suit at all, so from my googling I think I either need to use vectorization or convert the value to a tensor to achieve my objective. Would that be correct?
An example of what I am trying to achieve is:
labColour = (112, 48, 0)
labAsFloat = colour_to_float(labColour, cspace='LAB')
print(labAsFloat) # outputs something like 0.74673543
def colour_to_float(colour, cspace):
    return ???  # somehow vectorise??
I'm not quite sure I understand your question correctly. If the objective is merely a unique floating-point representation, then this might work.
import numpy as np

def colour_to_float(colour):
    int_arr = list(colour)
    int_arr.append(0)
    data_bytes = np.array(int_arr, dtype=np.uint8)
    return (data_bytes.view(dtype=np.float32))[0]

def float_to_colour(num):
    return np.array([num], dtype=np.float32).view(dtype=np.uint8)[:3].tolist()
Results:
labColour = (230, 140, 50)
f = colour_to_float(labColour)
print(f)
4.64232e-39
lab = float_to_colour(f)
print(lab)
[230, 140, 50]

Calling a Fortran DLL from Python using cffi with multidimensional arrays

I use a dll that contains differential equation solvers among other useful mathematical tools. Unfortunately, this dll is written in Fortran. My program is written in Python 3.7 and I use Spyder as an IDE.
I successfully called easy functions from the dll. However, I can't seem to get functions to work that require multidimensional arrays.
This is the online documentation to the function I am trying to call:
https://www.nag.co.uk/numeric/fl/nagdoc_fl26/html/f01/f01adf.html
The kernel dies without an error message if I execute the following code:
import numpy as np
import cffi as cf
ffi=cf.FFI()
lib=ffi.dlopen("C:\Windows\SysWOW64\DLL20DDS")
ffi.cdef("""void F01ADF (const int *n, double** a, const int *lda, int *ifail);""")
#Integer
nx = 4
n = ffi.new('const int*', nx)
lda = nx + 1
lda = ffi.new('const int*', lda)
ifail = 0
ifail = ffi.new('int*', ifail)
#matrix to be inversed
ax1 = np.array([5,7,6,5],dtype = float, order = 'F')
ax2 = np.array([7,10,8,7],dtype = float, order = 'F')
ax3 = np.array([6,8,10,9],dtype = float, order = 'F')
ax4 = np.array([5,7,9,10], dtype = float, order = 'F')
ax5 = np.array([0,0,0,0], dtype = float, order = 'F')
ax = (ax1,ax2,ax3,ax4,ax5)
#Array
zx = np.zeros(nx, dtype = float, order = 'F')
a = ffi.cast("double** ", zx.__array_interface__['data'][0])
for i in range(lda[0]):
    a[i] = ffi.cast("double* ", ax[i].__array_interface__['data'][0])
lib.F01ADF(n, a, lda, ifail)
Since functions with 1D arrays work, I assume that the multidimensional array is the issue.
Any kind of help is greatly appreciated,
Thilo
Not having access to the dll you refer to makes it hard to give a definitive answer; however, the documentation of the dll and the provided Python script may be enough to diagnose the problem. There are at least two issues in your example:
The C header interface:
Your documentation link clearly states what the function's C header interface should look like. I'm not very well versed in C, Python's cffi or cdef, but the parameter declaration for a in your function interface seems wrong: double** a (pointer to pointer to double) should most likely be double a[] or double* a (pointer to double), as stated in the documentation.
Defining a 2d Numpy array with Fortran ordering:
Note that your Numpy arrays ax1..ax5 are one-dimensional arrays. Since they have only one dimension, order='F' and order='C' are equivalent in terms of memory layout and access, so specifying order='F' here probably does not have the intended effect (Fortran uses column-major ordering for multi-dimensional arrays).
The variable ax is a tuple of Numpy arrays, not a 2d Numpy array, and will therefore have a very different representation in memory (which is of utmost importance when passing data to the Fortran dll) than a 2d array.
Towards a solution
My first step would be to correct the C header interface. Next, I would declare ax as a proper Numpy array with two dimensions, using Fortran ordering, and then cast it to the appropriate data type, as in this example:
#file: test.py
import numpy as np
import cffi as cf
ffi=cf.FFI()
lib=ffi.dlopen("./f01adf.dll")
ffi.cdef("""void f01adf_ (const int *n, double a[], const int *lda, int *ifail);""")
# integers
nx = 4
n = ffi.new('const int*', nx)
lda = nx + 1
lda = ffi.new('const int*', lda)
ifail = 0
ifail = ffi.new('int*', ifail)
# matrix to be inversed
ax = np.array([[5, 7, 6, 5],
               [7, 10, 8, 7],
               [6, 8, 10, 9],
               [5, 7, 9, 10],
               [0, 0, 0, 0]], dtype=float, order='F')
# operation on matrix using dll
print("BEFORE:")
print(ax.astype(int))
a = ffi.cast("double* ", ax.__array_interface__['data'][0])
lib.f01adf_(n, a, lda, ifail)
print("\nAFTER:")
print(ax.astype(int))
For testing purposes, consider the following Fortran subroutine as a substitute for your dll; it has the same interface as the function you are trying to call. It simply adds 10**(i-1) to the i'th column of the input array a. This allows checking that the interface between Python and Fortran works as intended, and that the intended elements of array a are operated on:
!file: f01adf.f90
Subroutine f01adf(n, a, lda, ifail)
  Integer, Intent (In) :: n, lda
  Integer, Intent (Inout) :: ifail
  Real(Kind(1.d0)), Intent (Inout) :: a(lda,*)
  Integer :: i
  print *, "Fortran DLL says: Hello world!"
  If ((n < 1) .or. (lda < n+1)) Then
    ! Input variables not conforming to requirements
    ifail = 2
  Else
    ! Input variables acceptable
    ifail = 0
    ! add 10**(i-1) to the i'th column of 2d array 'a'
    Do i = 1, n
      a(:, i) = a(:, i) + 10**(i-1)
    End Do
  End If
End Subroutine
Compiling the Fortran code, and then running the suggested Python script, gives me the following output:
> gfortran -O3 -shared -fPIC -fcheck=all -Wall -Wextra -std=f2008 -o f01adf.dll f01adf.f90
> python test.py
BEFORE:
[[ 5 7 6 5]
[ 7 10 8 7]
[ 6 8 10 9]
[ 5 7 9 10]
[ 0 0 0 0]]
Fortran DLL says: Hello world!
AFTER:
[[ 6 17 106 1005]
[ 8 20 108 1007]
[ 7 18 110 1009]
[ 6 17 109 1010]
[ 1 10 100 1000]]

Why write to BinaryWriter twice?

I'm implementing this tone-generator program and it works great:
https://social.msdn.microsoft.com/Forums/vstudio/en-US/c2b953b6-3c85-4eda-a478-080bae781319/beep-beep?forum=vbgeneral
What I can't figure out, is why the following two lines of code:
BW.Write(Sample)
BW.Write(Sample)
One "write" makes sense, but why the second "write"?
The example is a bit cryptic, but the wave file is configured for 2 channels, so the two writes simply send the same audio data to both channels.
The wave header is this hardcoded bit:
Dim Hdr() As Integer = {&H46464952, 36 + Bytes, &H45564157, _
&H20746D66, 16, &H20001, 44100, _
176400, &H100004, &H61746164, Bytes}
Which, decoded, means:
&H46464952 = 'RIFF' (little endian)
36 + Bytes = length of header + length of data
&H45564157 = 'WAVE' (little endian)
&H20746D66 = 'fmt ' (little endian)
16 = length of the fmt chunk (16 for PCM)
&H20001 = low word 0x0001: PCM format; high word 0x0002: 2 channels
44100 = sampleRate
176400 = byteRate = sampleRate*numChannels*bytesPerSample = 44100*2*2
&H100004 = low word 0x0004: blockAlign = numChannels*bytesPerSample; high word 0x0010: bitsPerSample (16)
&H61746164 = 'data' (little endian)
Bytes = size of the data chunk
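For illustration, here is a minimal C sketch (not the original VB program; the names and output file are made up) of how those header fields relate and why each sample frame takes two writes when there are 2 channels:
#include <stdint.h>
#include <stdio.h>

enum {
    SAMPLE_RATE      = 44100,
    NUM_CHANNELS     = 2,
    BYTES_PER_SAMPLE = 2,                                             /* 16-bit PCM */
    BLOCK_ALIGN      = NUM_CHANNELS * BYTES_PER_SAMPLE,               /* 4 = 0x0004 */
    BYTE_RATE        = SAMPLE_RATE * NUM_CHANNELS * BYTES_PER_SAMPLE  /* 176400     */
};

/* One sample frame for 2-channel PCM: the same 16-bit value is written
   twice, once per channel (WAV data is little-endian, as on x86). */
static void write_stereo_frame(FILE *f, int16_t sample)
{
    fwrite(&sample, sizeof sample, 1, f); /* left channel  */
    fwrite(&sample, sizeof sample, 1, f); /* right channel */
}

int main(void)
{
    FILE *f = fopen("frames.raw", "wb"); /* raw sample frames only; no RIFF header here */
    if (f == NULL)
        return 1;
    for (int16_t s = 0; s < 1000; s += 100)
        write_stereo_frame(f, s);
    fclose(f);
    return 0;
}
With mono (1 channel) you would write each sample once, and the byte rate and block align fields in the header would change accordingly.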

Changing the elements of a C-style array in Objective-C

I am trying to change the elements of a C-style array. Using an NSArray/NSMutableArray is not an option for me.
My code is as so:
int losingPositionsX[] = {0, 4, 8...};
but when I enter this code
losingPositionsX = {8, 16, 24};
to change the elements of the array, I get an error: "expected expression". How can I make the copy?
In C (and by extension, in Objective-C) you cannot assign C arrays to each other like that. You copy C arrays with memcpy, like this:
int losingPositionsX[] = {0, 4, 8};
memcpy(losingPositionsX, (int[3]){8, 16, 24}, sizeof(losingPositionsX));
Important: this solution requires that the sizes of the two arrays be equal.
You have to use something like memcpy() or a loop.
#define ARRAY_SIZE 3
const int VALUES[ARRAY_SIZE] = {8, 16, 24};
for (int i = 0; i < ARRAY_SIZE; i++)
    losingPositionsX[i] = VALUES[i];
Alternatively, with memcpy(),
// Assuming VALUES has same type and size as losingPositions
memcpy(losingPositionsX, VALUES, sizeof(VALUES));
// Same thing
memcpy(losingPositionsX, VALUES, sizeof(losingPositionsX));
// Same thing (but don't use this one)
memcpy(losingPositionsX, VALUES, sizeof(int) * 3);
Since you are on OS X, which supports C99, you can use compound literals:
memcpy(losingPositionsX, (int[3]){8, 16, 24}, sizeof(losingPositionsX));
The loop is the safest, and will probably be optimized into the same machine code as memcpy() by the compiler. It's relatively easy to make typos with memcpy().
I don't know whether this is a help for you in relation to memory management, but you can do:
int * losingPositionsX = (int[]){ 0, 4, 8 };
losingPositionsX = (int[]){ 8, 16, 32 };

OpenCL Performance Optimization

I have started learning OpenCL and am currently trying to test how much I can improve performance for a simple skeletal animation algorithm. To do this I have written a program that performs skeletal animation on randomly generated vertices and transformation matrices twice: once with an SSE-optimized linear algebra library in plain C++, and once using my own OpenCL kernel on the GPU (I'm testing on an Nvidia GTX 460).
I started off with a simple kernel where each work-item transforms exactly one vertex, with all values read from global memory. Because I was not satisfied with the performance of this kernel, I tried to optimize a little. My current kernel looks like this:
inline float4 MultiplyMatrixVector(float16 m, float4 v)
{
    return (float4) (
        dot(m.s048C, v),
        dot(m.s159D, v),
        dot(m.s26AE, v),
        dot(m.s37BF, v)
    );
}

kernel void skelanim(global const float16* boneMats, global const float4* vertices, global const float4* weights, global const uint4* indices, global float4* resVertices)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);

    local float16 lBoneMats[NUM_BONES];
    async_work_group_copy(lBoneMats, boneMats, NUM_BONES, 0);

    barrier(CLK_LOCAL_MEM_FENCE);

    for (int i = 0 ; i < NUM_VERTICES_PER_WORK_ITEM ; i++) {
        int vidx = gid*NUM_VERTICES_PER_WORK_ITEM + i;

        float4 vertex = vertices[vidx];
        float4 w = weights[vidx];
        uint4 idx = indices[vidx];

        resVertices[vidx] = (MultiplyMatrixVector(lBoneMats[idx.x], vertex * w.x)
            + MultiplyMatrixVector(lBoneMats[idx.y], vertex * w.y)
            + MultiplyMatrixVector(lBoneMats[idx.z], vertex * w.z)
            + MultiplyMatrixVector(lBoneMats[idx.w], vertex * w.w));
    }
}
Now I process a constant number of vertices per work-item, and I prefetch all the bone matrices into local memory only once for each work-item, which I believed would lead to way better performance because the matrices for multiple vertices could be read from the faster local memory afterwards. Unfortunately, this kernel performs worse than my first attempt, and even worse than the CPU-only implementation.
Why is performance so bad with this should-be optimization?
If it helps, here is how I execute the kernel:
#define NUM_BONES 50
#define NUM_VERTICES 30000
#define NUM_VERTICES_PER_WORK_ITEM 100
#define NUM_ANIM_REPEAT 1000
uint64_t PerformOpenCLSkeletalAnimation(Matrix4* boneMats, Vector4* vertices, float* weights, uint32_t* indices, Vector4* resVertices)
{
    File kernelFile("/home/alemariusnexus/test/skelanim.cl");

    char opts[256];
    sprintf(opts, "-D NUM_VERTICES=%u -D NUM_REPEAT=%u -D NUM_BONES=%u -D NUM_VERTICES_PER_WORK_ITEM=%u", NUM_VERTICES, NUM_ANIM_REPEAT, NUM_BONES, NUM_VERTICES_PER_WORK_ITEM);

    cl_program prog = BuildOpenCLProgram(kernelFile, opts);
    cl_kernel kernel = clCreateKernel(prog, "skelanim", NULL);

    cl_mem boneMatBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_BONES*sizeof(Matrix4), boneMats, NULL);
    cl_mem vertexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*sizeof(Vector4), vertices, NULL);
    cl_mem weightBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(float), weights, NULL);
    cl_mem indexBuf = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, NUM_VERTICES*4*sizeof(uint32_t), indices, NULL);
    cl_mem resVertexBuf = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_ALLOC_HOST_PTR, NUM_VERTICES*sizeof(Vector4), NULL, NULL);

    uint64_t s, e;
    s = GetTickcount();

    clSetKernelArg(kernel, 0, sizeof(cl_mem), &boneMatBuf);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &vertexBuf);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &weightBuf);
    clSetKernelArg(kernel, 3, sizeof(cl_mem), &indexBuf);
    clSetKernelArg(kernel, 4, sizeof(cl_mem), &resVertexBuf);

    size_t globalWorkSize[] = { NUM_VERTICES / NUM_VERTICES_PER_WORK_ITEM };
    size_t localWorkSize[] = { NUM_BONES };

    for (size_t i = 0 ; i < NUM_ANIM_REPEAT ; i++) {
        clEnqueueNDRangeKernel(cq, kernel, 1, NULL, globalWorkSize, localWorkSize, 0, NULL, NULL);
    }

    clEnqueueReadBuffer(cq, resVertexBuf, CL_TRUE, 0, NUM_VERTICES*sizeof(Vector4), resVertices, 0, NULL, NULL);

    e = GetTickcount();

    return e-s;
}
I guess there are more things that could be optimized, maybe batching some of the other global reads together, but first I would really like to know why this first optimization didn't work.
Two things are affecting the performance in your exercise.
1) The OpenCL C compiler is not required to honor the inline keyword: it may inline the function silently, or it may simply ignore the keyword and emit a regular call; actual inlining is not mandated.
So it is better to define your MultiplyMatrixVector as a preprocessor macro, though this is not a major problem in your case.
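For illustration, a sketch of what that macro could look like with your dot-product formulation (the macro name is made up; each argument is expanded four times, so pass only simple expressions without side effects):
/* Macro form of MultiplyMatrixVector: m is a float16 holding a column-major
   4x4 matrix, v is a float4. Each argument appears four times in the
   expansion, so it must be a plain variable or similar side-effect-free
   expression. */
#define MULTIPLY_MATRIX_VECTOR(m, v)   \
    ((float4)( dot((m).s048C, (v)),    \
               dot((m).s159D, (v)),    \
               dot((m).s26AE, (v)),    \
               dot((m).s37BF, (v)) ))
At the call site you would first compute the weighted vertex into a local variable and then pass that variable to the macro, so it is not evaluated four times.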
2) You are treating the local memory (the LDM) incorrectly.
Although its latency is many times lower than that of global memory when it is accessed properly, local memory is subject to bank conflicts.
Your vertex index is calculated with a stride of 100 per work item. The number of banks depends on the GPU in use, but usually it is 16 or 32, i.e. you may access up to 16 (or 32) four-byte LDM variables in one cycle without penalty if all of them are in different banks. Otherwise you get a bank conflict (two or more threads accessing the same bank), which is serialized.
The threads in your work group access the array in LDM with no special arrangement to avoid bank conflicts. Moreover, the array elements are float16, i.e. a single element spans all 16 banks (or half of 32 banks). Thus you have a bank conflict in each row of the MultiplyMatrixVector function. The cumulative degree of that conflict is at least 16x32 (here 16 is the number of vector elements you access and 32 is the size of a half wavefront or half warp).
The solution here is not to copy that array to LDM, but to allocate it on the host with CL_MEM_READ_ONLY (which you already did) and to declare the boneMats argument of your kernel with the __constant specifier.
Then the OpenCL runtime will allocate the memory in the constant area of the GPU, and access to that array will be fast:
kernel void skelanim(__constant const float16* boneMats,
                     global const float4* vertices,
                     global const float4* weights,
                     global const uint4* indices,
                     global float4* resVertices)
It looks like EACH work item in a work group is copying the same 50 bone matrices before the computation starts. This will saturate the global memory bandwidth.
Try this:
if (lid == 0)
{
    async_work_group_copy(lBoneMats, boneMats, NUM_BONES, 0);
}
This does the copy only once per work group.