How do I access an integer array within a struct/class from in-line assembly (blackfin dialect) using gcc? - g++

Not very familiar with in-line assembly to begin with, and much less with that of the blackfin processor. I am in the process of migrating a legacy C application over to C++, and ran into a problem this morning regarding the following routine:
//
void clear_buffer ( short * buffer, int len ) {
__asm__ (
"/* clear_buffer */\n\t"
"LSETUP (1f, 1f) LC0=%1;\n"
"1:\n\t"
"W [%0++] = %2;"
:: "a" ( buffer ), "a" ( len ), "d" ( 0 )
: "memory", "LC0", "LT0", "LB0"
);
}
I have a class that contains an array of shorts that is used for audio processing:
class AudProc
{
enum { buffer_size = 512 };
short M_samples[ buffer_size * 2 ];
// remaining part of class omitted for brevity
};
Within the AudProc class I have a method that calls clear_buffer, passing it the samples array:
clear_buffer ( M_samples, sizeof ( M_samples ) / 2 );
This generates a "Bus Error" and aborts the application.
I have tried making the array public, and that produces the same result. I have also tried making it static; that allows the call to go through without error, but no longer allows for multiple instances of my class as each needs its own buffer to work with. Now, my first thought is, it has something to do with where the buffer is in memory, or from where it is being accessed. Does something need to be changed in the in-line assembly to make this work, or in the way it is being called?
Thought that this was similar to what I was trying to accomplish, but it is using a different dialect of asm, and I can't figure out if it is the same problem I am experiencing or not:
GCC extended asm, struct element offset encoding
Anyone know why this is occurring and how to correct it?
Does anyone know where there is helpful documentation regarding the blackfin asm instruction set? I've tried looking on the ADSP site, but to no avail.

I would suspect that you could define your clear_buffer as
inline void clear_buffer (short * buffer, int len) {
memset (buffer, 0, sizeof(short)*len);
}
and probably GCC is able to optimize (when invoked with -O2 or -O3) that cleverly (because GCC knows about memset).
To understand assembly code, I suggest running gcc -S -O -fverbose-asm on some small C file, then to look inside the produced .s file.

I would have take a guess, because I don't know Blackfin assembler:
That LC0 sounds like "loop counter", LSETUP looks like a macro/insn, which, well, setups a loop between two labels and with a certain loop counter.
The "%0" operands is apparently the address to write to and we can safely guess it's incremented in the loop, in other words it's both an input and output operand and should be described as such.
Thus, I suggest describing it as in input-output operand, using "+" constraint modifier, as follows:
void clear_buffer ( short * buffer, int len ) {
__asm__ (
"/* clear_buffer */\n\t"
"LSETUP (1f, 1f) LC0=%1;\n"
"1:\n\t"
"W [%0++] = %2;"
: "+a" ( buffer )
: "a" ( len ), "d" ( 0 )
: "memory", "LC0", "LT0", "LB0"
);
}
This is, of course, just a hypothesis, but you could disassemble the code and check if by any chance GCC allocated the same register for "%0" and "%2".
PS. Actually, only "+a" should be enough, early-clobber is irrelevant.

For anyone else who runs into a similar circumstance, the problem here was not with the in-line assembly, nor with the way it was being called: it was with the classes / structs in the program. The class that I believed to be the offender was not the problem - there was another class that held an instance of it, and due to other members of that outer class, the inner one was not aligned on a word boundary. This was causing the "Bus Error" that I was experiencing. I had not come across this before because the classes were not declared with __attribute__((packed)) in other code, but they are in my implementation.
Giving Type Attributes - Using the GNU Compiler Collection (GCC) a read was what actually sparked the answer for me. Two particular attributes that affect memory alignment (and, thus, in-line assembly such as I am using) are packed and aligned.
As taken from the aforementioned link:
aligned (alignment)
This attribute specifies a minimum alignment (in bytes) for variables of the specified type. For example, the declarations:
struct S { short f[3]; } __attribute__ ((aligned (8)));
typedef int more_aligned_int __attribute__ ((aligned (8)));
force the compiler to ensure (as far as it can) that each variable whose type is struct S or more_aligned_int is allocated and aligned at least on a 8-byte boundary. On a SPARC, having all variables of type struct S aligned to 8-byte boundaries allows the compiler to use the ldd and std (doubleword load and store) instructions when copying one variable of type struct S to another, thus improving run-time efficiency.
Note that the alignment of any given struct or union type is required by the ISO C standard to be at least a perfect multiple of the lowest common multiple of the alignments of all of the members of the struct or union in question. This means that you can effectively adjust the alignment of a struct or union type by attaching an aligned attribute to any one of the members of such a type, but the notation illustrated in the example above is a more obvious, intuitive, and readable way to request the compiler to adjust the alignment of an entire struct or union type.
As in the preceding example, you can explicitly specify the alignment (in bytes) that you wish the compiler to use for a given struct or union type. Alternatively, you can leave out the alignment factor and just ask the compiler to align a type to the maximum useful alignment for the target machine you are compiling for. For example, you could write:
struct S { short f[3]; } __attribute__ ((aligned));
Whenever you leave out the alignment factor in an aligned attribute specification, the compiler automatically sets the alignment for the type to the largest alignment that is ever used for any data type on the target machine you are compiling for. Doing this can often make copy operations more efficient, because the compiler can use whatever instructions copy the biggest chunks of memory when performing copies to or from the variables that have types that you have aligned this way.
In the example above, if the size of each short is 2 bytes, then the size of the entire struct S type is 6 bytes. The smallest power of two that is greater than or equal to that is 8, so the compiler sets the alignment for the entire struct S type to 8 bytes.
Note that although you can ask the compiler to select a time-efficient alignment for a given type and then declare only individual stand-alone objects of that type, the compiler's ability to select a time-efficient alignment is primarily useful only when you plan to create arrays of variables having the relevant (efficiently aligned) type. If you declare or use arrays of variables of an efficiently-aligned type, then it is likely that your program also does pointer arithmetic (or subscripting, which amounts to the same thing) on pointers to the relevant type, and the code that the compiler generates for these pointer arithmetic operations is often more efficient for efficiently-aligned types than for other types.
The aligned attribute can only increase the alignment; but you can decrease it by specifying packed as well. See below.
Note that the effectiveness of aligned attributes may be limited by inherent limitations in your linker. On many systems, the linker is only able to arrange for variables to be aligned up to a certain maximum alignment. (For some linkers, the maximum supported alignment may be very very small.) If your linker is only able to align variables up to a maximum of 8-byte alignment, then specifying aligned(16) in an __attribute__ still only provides you with 8-byte alignment. See your linker documentation for further information.
.
packed
This attribute, attached to struct or union type definition, specifies that each member (other than zero-width bit-fields) of the structure or union is placed to minimize the memory required. When attached to an enum definition, it indicates that the smallest integral type should be used.
Specifying this attribute for struct and union types is equivalent to specifying the packed attribute on each of the structure or union members. Specifying the -fshort-enums flag on the line is equivalent to specifying the packed attribute on all enum definitions.
In the following example struct my_packed_struct's members are packed closely together, but the internal layout of its s member is not packed—to do that, struct my_unpacked_struct needs to be packed too.
struct my_unpacked_struct
{
char c;
int i;
};
struct __attribute__ ((__packed__)) my_packed_struct
{
char c;
int i;
struct my_unpacked_struct s;
};
You may only specify this attribute on the definition of an enum, struct or union, not on a typedef that does not also define the enumerated type, structure or union.
The problem which I was experiencing was specifically due to the use of packed. I attempted to simply add the aligned attribute to the structs and classes, but the error persisted. Only removing the packed attribute resolved the problem. For now, I am leaving the aligned attribute on them and testing to see if I find any improvements in the efficiency of the code as mentioned above, simply due to their being aligned on word boundaries. The application makes use of arrays of these structures, so perhaps there will be better performance, but only profiling the code will say for certain.

Related

Why is the pCode type const uint32_t*? (pCode in VkShaderModuleCreateInfo )?

I just read the Shader Modules Vulkan tutorial, and I didn't understand something.
Why is createInfo.pCode a uint32_t rather than unsigned char or uint8_t? Is it faster? (because moving pointer is now 4 bytes)?
VkShaderModule createShaderModule(const std::vector<char>& code) {
VkShaderModuleCreateInfo createInfo{};
createInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
createInfo.codeSize = code.size();
createInfo.pCode = reinterpret_cast<const uint32_t*>(code.data());
VkShaderModule shaderModule;
if (vkCreateShaderModule(device, &createInfo, nullptr, &shaderModule) != VK_SUCCESS) {
throw std::runtime_error("failed to create shader module!");
}
}
A SPIR-V module is defined a stream of 32-bit words. Passing in data using a uint32_t pointer tells the driver that the data is 32-bit aligned and allows the shader compiler in the driver to directly access the data using aligned 32-bit loads.
This is usually faster (in most CPU designs) than random unaligned access, in particular for cases that cross cache lines.
This is also more portable C/C++. Using direct unaligned memory access instructions is possible in most CPU architectures, but not standard in the language. The portable alternative using a byte stream requires assembling 4 byte loads and merging them, which is less efficient than just making a direct aligned word load.
Note using reinterpret_cast here assumes the data is aligned correctly. For the base address of std::vector it will work (data allocated via new must be aligned enough for the largest supported primitive type), but it's one to watch out for if you change where the code comes from in future.
According to VUID-VkShaderModuleCreateInfo-pCode-parameter, the specification requires that:
pCode must be a valid pointer to an array of 4/codeSize uint32_t values
pCode being a uint32_t* allows this requirement to be partially validated by the compiler, making it one less thing to worry about for users of the API.
Because SPIR-V word is 32 bits.
It should not matter for performance. It is just a type.

Specifying push constant block offset in HLSL

I am trying to write a Vulkan renderer, I use glslangValidator with HLSL for shaders and am trying to implement push constants.
[[vk::push_constant]]
cbuffer cbFragment {
float4 imageColor;
float4 aaaa;
};
[[vk::push_constant]]
cbuffer cbMatrices {
float4 bbbb;
};
The annotation "[[vk::push_constant]]" works, I use spirv_reflect for reflection and both push constants show up and they work as intended.
The problem I'm having is that they seemingly overlap, if I assign "bbbb" a value, "imageColor" is affected in exactly the same way and vice versa. In the reflection data both push constant blocks have the offset 0, which explains the issue. However, I seem to be completely unable to change the offset of either of the push constants.
[[vk::offset(x)]] does not work at all, it neither affects the individual member offsets nor the offset of the push constants. The only offset that works at all is HLSL's built in "packoffset", which only applies to the buffer members. And although it might actually be a solution to just offset the members of one of the push constants to be outside the range of the other, I hardly believe that can be a sensible solution as it's also causing the validation layer to fail because offsetting the individual member simply increases the size of the push constant unnecessarily and the overlap itself is still present.
I would greatly appreciate any help on this matter and am willing to provide any necessary clarification, thank you very much!
Push constants live in a single chunk of contiguous memory. The compiler doesn't try to append multiple blocks into that memory; like with the GLSL syntax, it's intended to just have one block containing all the push constant data.
This is consistent with other places where the compiler has to pack variables in a block: it only packs within a block, not across multiple blocks. Two separate non-pushconstant cbuffers would refer to two distinct buffers in memory, with contents that begin at offset zero within their individual buffer. There's only one "push constant buffer", hence you should only decorate one cbuffer with vk::push_constant.

Is there any reason to use NSInteger instead of uint8_t with NS_ENUM?

The general standard appears to use NS_ENUM with NSInteger as the base type. Why is this the case? Assuming less than 256 cases (which covers almost any enumeration), is there any reason to use that instead of uint8_t, which could use less memory space? Either imports into Swift fine.
This is different than NS_OPTIONS, where a larger type makes sense, since you shouldn't be doing any bit math with enumerations, and you can use every number representable by the base type as a value.
The answer to the question in the title:
Is there any reason to use NSInteger instead of uint8_t with NS_ENUM?
is probably not.
When declaring an enum in C if no underlying type is specified the compiler is free to choose any suitable type from char and the signed and unsigned integer types which can at least represent all the values required. The current Xcode/Clang compiler picks a 4-byte integer. One could reasonably assume the compiler writers made an informed choice - some balance of performance and storage.
Smaller types, such as uint8_t, will usually be aligned on smaller boundaries in memory (or on disc) - but that is only of benefit if the adjacent field matches the alignment e.g. if a 2-byte size typed field follows a 1-byte sized typed field then unless otherwise specified (e.g. with a #pragma packed) there will probably be an intervening unused byte.
Whether any performance or storage differences are significant will be heavily dependent on the application. Follow the usual rule of thumb - don't optimise until an issue is found.
However if you find semantic benefit in limiting the size then certainly do so - there is no general reason you shouldn't. The choice is similar to picking signed vs. unsigned integers, some programmers avoid unsigned types for values that will be ≥ 0 unless absolutely required for the extra range, while others appreciate the semantic benefit.
Summary: There is no right answer, its largely a subjective issue.
HTH
First of all: The memory footprint is close to completely meaningless. You are talking about 1 Byte vs. 4/8 Bytes. (If the memory alignment does not force the usage of 4/8 bytes whatever you chosed.) How many NS_ENUM (C) objects do you want to have in your running app?
I guess that the reason is pretty easy: NSInteger is akin of "catch all" integer type in Cocoa. That makes assignments easier, especially you do not have to care about assigning a bigger integer type to a smaller one. Without casting this would lead to warnings.
Having more than one integer type in a desktop app with a 32/64 bit model is akin of an anachronism. Nor a Mac neither a MacBook neither an iPhone is an embedded micro controller …
You can use any integer data type including uint8_t with NS_ENUM as.
typedef NS_ENUM(uint8_t, eEnumAddEditViewMode) {
eWBEnumAddMode,
eWBEnumEditMode
};
In old c style standard NSInteger is default, because NSInteger is akin of "catch all" integer type in objective c. and developer can easily type boxing and unboxing with their own variable. This is just developer friendly best practise.

RenderScript Variable types and Element types, simple example

I clearly see the need to deepen my knowledge in RenderScript memory allocation and data types (I'm still confused about the sheer number of data types and finding the correct corresponding types on either side - allocations and elements. (or when to refer the forEach to input, to output or to both, etc.) Therefore I will read and re-read the documentation, which is really not bad - but it needs some time to get the necessary "intuition" how to use it correctly. But for now, please help me with this basic one (and I will return later with hopefully less stupid questions...). I need a very simple kernel that takes an ARGB Color Bitmap and returns an integer Array of gray-values. My attempt was the following:
#pragma version(1)
#pragma rs java_package_name(com.example.xxxx)
#pragma rs_fp_relaxed
uint __attribute__((kernel)) grauInt(uchar4 in) {
uint gr= (uint) (0.2125*in.r + 0.7154*in.g + 0.0721*in.b);
return gr;
}
and Java side:
int[] data1 = new int[width*height];
ScriptC_gray graysc;
graysc=new ScriptC_gray(rs);
Type.Builder TypeOut = new Type.Builder(rs, Element.U8(rs));
TypeOut.setX(width).setY(height);
Allocation outAlloc = Allocation.createTyped(rs, TypeOut.create());
Allocation inAlloc = Allocation.createFromBitmap(rs, bmpfoto1,
Allocation.MipmapControl.MIPMAP_NONE, Allocation.USAGE_SCRIPT);
graysc.forEach_grauInt(inAlloc, outAlloc);
outAlloc.copyTo(data1);
This crashed with the message cannot locate symbol "convert_uint". What's wrong with this conversion? Is the code otherwise correct?
UPDATE: isn't that ridiculous? I don't get this "easy one" run, even after 2 hours trying. I still struggle with the different Element- and variable-types. Let's recap: Input is a Bitmap. Output is an int[] Array. So, why doesnt it work when I use U8 in the Java-side Out-allocation, createFromBitmap in the Java-side In-allocation, uchar4 as kernel Input and uint as the kernel Output (RSRuntimeException: Type mismatch with U32) ?
There is no convert_uint() function. How about simple casting? Other than that, the code looks alright (assuming width and height have correct values).
UPDATE: I have just noticed that you allocate Element.I32 (i.e. signed integer type), but return uint from the kernel. These should match. And in any case, unless you need more than 8-bit precision, you should be able to fit your result in U8.
UPDATE: If you are changing the output type, make sure you change it in all places, e.g. if the kernel returns an uint, the allocation should use U32. If the kernel returns a char, the allocation should use I8. And so on...
You can't use a Uint[] directly because the input Bitmap is actually 2-dimensional. Can you create the output Allocation with a proper width/height and try that? You should still be able to extract the values into a Java array when you are finished.

Memcpy and Memset on structures of Short Type in C

I have a query about using memset and memcopy on structures and their reliablity. For eg:
I have a code looks like this
typedef struct
{
short a[10];
short b[10];
}tDataStruct;
tDataStruct m,n;
memset(&m, 2, sizeof(m));
memcpy(&n,&m,sizeof(m));
My question is,
1): in memset if i set to 0 it is fine. But when setting 2 i get m.a and m.b as 514 instead of 2. When I make them as char instead of short it is fine. Does it mean we cannot use memset for any initialization other than 0? Is it a limitation on short for eg
2): Is it reliable to do memcopy between two structures above of type short. I have a huge
strings of a,b,c,d,e... I need to make sure copy is perfect one to one.
3): Am I better off using memset and memcopy on individual arrays rather than collecting in a structure as above?
One more query,
In the structue above i have array of variables. But if I am passed pointer to these arrays
and I want to collect these pointers in a structure
typedef struct
{
short *pa[10];
short *pb[10];
}tDataStruct;
tDataStruct m,n;
memset(&m, 2, sizeof(m));
memcpy(&n,&m,sizeof(m));
In this case if i or memset of memcopy it only changes the address rather than value. How do i change the values instead? Is the prototype wrong?
Please suggest. Your inputs are very imp
Thanks
dsp guy
memset set's bytes, not shorts. always. 514 = (256*2) + (1*2)... 2s appearing on byte boundaries.
1.a. This does, admittedly, lessen it's usefulness for purposes such as you're trying to do (array fill).
reliable as long as both structs are of the same type. Just to be clear, these structures are NOT of "type short" as you suggest.
if I understand your question, I don't believe it matters as long as they are of the same type.
Just remember, these are byte level operations, nothing more, nothing less. See also this.
For the second part of your question, try
memset(m.pa, 0, sizeof(*(m.pa));
memset(m.pb, 0, sizeof(*(m.pb));
Note two operations to copy from two different addresses (m.pa, m.pb are effectively addresses as you recognized). Note also the sizeof: not sizeof the references, but sizeof what's being referenced. Similarly for memcopy.