How to demonstrate a memory misalignment error in C on a macbook pro (Intel 64 bit processor) - objective-c

In an attempt to understand C memory alignment or whatever the term is (data structure alignment?), I'm trying to write code that results in an alignment error. The original reason that brought me to learning about this is that I'm writing data parsing code that reads binary data received over the network. The data contains some uint32s, uint64s, floats, and doubles, and I'd like to make sure they are never corrupted due to errors in my parsing code.
An unsuccessful attempt at causing some problem due to misalignment:
uint32_t integer = 1027;
uint8_t * pointer = (uint8_t *)&integer;
uint8_t * bytes = malloc(5);
bytes[0] = 23; // extra byte to misalign uint32_t data
bytes[1] = pointer[0];
bytes[2] = pointer[1];
bytes[3] = pointer[2];
bytes[4] = pointer[3];
uint32_t integer2 = *(uint32_t *)(bytes + 1);
printf("integer: %u\ninteger2: %u\n", integer, integer2);
On my machine both integers print out the same. (MacBook Pro with an Intel 64-bit processor; not sure what exactly determines alignment behaviour: is it the architecture? the exact CPU model? or maybe the compiler? I use Xcode, so Clang.)
I guess my processor/machine/setup supports unaligned reads, so it accepts the above code without any problems.
What would be a case where parsing of, say, a uint32_t would fail because of code not taking alignment into account? Is there a way to make it fail on a modern Intel 64-bit system? Or am I safe from alignment errors when using simple datatypes like integers and floats (no structs)?
Edit: If anyone's reading this later, I found a similar question with interesting info: Mis-aligned pointers on x86
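For reference, the alignment-safe way to pull a uint32_t out of a byte buffer seems to be to memcpy it out instead of casting the pointer. A minimal sketch (my own, endianness handling left out):
#include <stdint.h>
#include <string.h>

uint32_t read_u32(const uint8_t *buffer, size_t offset) {
    uint32_t value;
    memcpy(&value, buffer + offset, sizeof value);   // compiler emits a safe (possibly unaligned) load
    return value;
}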

Normally, the x86 architecture doesn't have alignment requirements (except for some SIMD instructions like movdqa).
However, since you're trying to write code to cause such an exception ...
There is an alignment check exception bit that can be set in the x86 flags register. If you turn it on, an unaligned access will generate an exception, which will show up [under Linux at least] as a bus error (i.e. SIGBUS).
See my answer here: any way to stop unaligned access from c++ standard library on x86_64? for details and some sample programs to generate an exception.
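To make that concrete, here is a minimal sketch of my own (not taken from the linked answer) that sets the AC bit (bit 18 of EFLAGS) with inline asm and then performs a deliberately misaligned load. On Linux x86-64 with GCC or Clang this typically dies with SIGBUS at the marked line; behaviour on macOS may differ, and note that library code such as printf or memcpy often performs intentional unaligned accesses, so keep the window where AC is set as small as possible:
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

int main(void) {
    uint8_t *bytes = malloc(8);
    uint32_t value = 1027;
    memcpy(bytes + 1, &value, sizeof value);        // store at a misaligned offset

    // set EFLAGS.AC: misaligned data accesses now fault
    __asm__ volatile ("pushf\n\t"
                      "orl $0x40000, (%%rsp)\n\t"
                      "popf" ::: "memory", "cc");

    uint32_t read_back = *(uint32_t *)(bytes + 1);  // misaligned load -> SIGBUS here (with AC set)

    // clear EFLAGS.AC again before calling into library code
    __asm__ volatile ("pushf\n\t"
                      "andl $0xfffbffff, (%%rsp)\n\t"
                      "popf" ::: "memory", "cc");

    printf("read back: %u\n", read_back);
    free(bytes);
    return 0;
}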

Related

STM32CubeMX I2C code writing to reserved register bits

I'm developing an I2C driver on the STM32F74 family processors. I'm using the STM32CubeMX Low Level drivers and I can't make sense of the generated defines for I2C start and stop register values (CR2).
The code is generated in stm32f7xx_ll_i2c.h and is as follows.
/** @defgroup I2C_LL_EC_GENERATE Start And Stop Generation
  * @{
  */
#define LL_I2C_GENERATE_NOSTARTSTOP 0x00000000U
/*!< Don't Generate Stop and Start condition. */
#define LL_I2C_GENERATE_STOP (uint32_t)(0x80000000U | I2C_CR2_STOP)
/*!< Generate Stop condition (Size should be set to 0). */
#define LL_I2C_GENERATE_START_READ (uint32_t)(0x80000000U | I2C_CR2_START | I2C_CR2_RD_WRN)
/*!< Generate Start for read request. */
My question is: why is bit 31 (0x80000000U) included in these defines? The reference manual (RM0385) states "Bits 31:27 Reserved, must be kept at reset value." I can't decide between modifying the generated code and keeping bit 31. I'll happily take recommendations, simply on whether it's more likely that this is something needed or that I'm going to break things by writing to a reserved bit.
Thanks in advance!
I am guessing here, because who knows what was on the minds of the library authors? (Not a lot, if you look at the source code!) But I would guess that it is a "dirty trick" to check that, when calling LL functions, you are using the specified macros.
However it is severely flawed because the "trick" is only valid for Cortex-M3/4 STM32 variants (e.g. F1xx, F2xx, F4xx) where the I2C peripheral is very different and registers such as I2C_CR2 are only 15 bits wide.
The trick is that the library functions have parameter checking asserts such as:
assert_param(IS_TRANSFER_REQUEST(Request));
Where the IS_TRANSFER_REQUEST is defined thus:
#define IS_TRANSFER_REQUEST(REQUEST) (((REQUEST) == I2C_GENERATE_STOP) || \
((REQUEST) == I2C_GENERATE_START_READ) || \
((REQUEST) == I2C_GENERATE_START_WRITE) || \
((REQUEST) == I2C_NO_STARTSTOP))
This forces you to use the LL defined macros as parameters and not some self-defined or calculated mask because they all have that "unused" check bit in them.
If that truly is the reason, it is an ill-advised practice that did not envisage the newer I2C peripheral. You might think that the bit is stripped from the parameter before being written to the register. I have checked; it is not. And if it were, you would be paying for that overhead on every call, which is also undesirable.
As an error detection technique, if that is what it is, it is not even applied consistently; for example, all the GPIO_PIN_xx macros are 16 bits wide and, since they are masks not pin numbers, using bit 31 could for example guard against passing a literal pin number 10 where the mask 1<<10 is in fact required. Passing 10 would refer to pins 3 and 1, not pin 10. And to be honest, that mistake is far more likely than passing an incorrect I2C transfer request type.
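A quick illustration of that mistake (my own example, not from the question's code):
#include "stm32f7xx_hal.h"

void gpio_example(void)
{
    HAL_GPIO_WritePin(GPIOA, GPIO_PIN_10, GPIO_PIN_SET);  // correct: GPIO_PIN_10 is the mask 1 << 10
    HAL_GPIO_WritePin(GPIOA, 10, GPIO_PIN_SET);           // wrong: 10 == 0b1010, i.e. pins 3 and 1
}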
In the end, however, "Reserved" generally means "unused but may be used in future implementations", and requiring you to keep the "reset value" is a way of ensuring forward binary compatibility. If you had such a device, no doubt there would be a corresponding library update to support it - but it would require re-compilation of the code. The risk is low, and probably only a problem if you attempt to run this binary on a newer, incompatible part that used these bits.
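If it still bothers you, one low-risk option (a sketch of my own, not ST code) is to strip the reserved bits yourself in a thin wrapper before the value ever reaches CR2, rather than editing the generated header. LL_I2C_HandleTransfer() is the real LL call; the mask value is my assumption based on RM0385 (bits 31:27 reserved):
#include "stm32f7xx_ll_i2c.h"

#define I2C_CR2_RESERVED_MASK  0xF8000000U   /* bits 31:27 of I2C CR2 are reserved on the F7 */

static void i2c_start_transfer(I2C_TypeDef *i2c, uint32_t slave_addr, uint32_t nbytes,
                               uint32_t end_mode, uint32_t request)
{
    /* pass the LL macro through unchanged except for the reserved bits */
    LL_I2C_HandleTransfer(i2c, slave_addr, LL_I2C_ADDRSLAVE_7BIT, nbytes,
                          end_mode, request & ~I2C_CR2_RESERVED_MASK);
}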
I agree with Clifford; the ST CubeMX / HAL / LL library code is, in places, some of the worst written code imaginable. I have a particular issue with lines such as "TIMx->CCER &= ~TIM_CCER_CC1E", where TIM_CCER_CC1E is defined as 0x0001 and the CCER register contains reserved bits that should remain at 0. There are hundreds of such examples all throughout the library code. ST remain silent to my requests for advice.

Why is the pCode type const uint32_t*? (pCode in VkShaderModuleCreateInfo )?

I just read the Shader Modules Vulkan tutorial, and I didn't understand something.
Why is createInfo.pCode a uint32_t* rather than an unsigned char* or uint8_t*? Is it faster (because advancing the pointer now moves 4 bytes at a time)?
VkShaderModule createShaderModule(const std::vector<char>& code) {
    VkShaderModuleCreateInfo createInfo{};
    createInfo.sType = VK_STRUCTURE_TYPE_SHADER_MODULE_CREATE_INFO;
    createInfo.codeSize = code.size();
    createInfo.pCode = reinterpret_cast<const uint32_t*>(code.data());

    VkShaderModule shaderModule;
    if (vkCreateShaderModule(device, &createInfo, nullptr, &shaderModule) != VK_SUCCESS) {
        throw std::runtime_error("failed to create shader module!");
    }
    return shaderModule;
}
A SPIR-V module is defined as a stream of 32-bit words. Passing in data using a uint32_t pointer tells the driver that the data is 32-bit aligned and allows the shader compiler in the driver to directly access the data using aligned 32-bit loads.
This is usually faster (in most CPU designs) than random unaligned access, in particular for cases that cross cache lines.
This is also more portable C/C++. Using direct unaligned memory access instructions is possible on most CPU architectures, but it is not standard in the language. The portable alternative for a byte stream requires assembling four separate byte loads and merging them, which is less efficient than just making a direct aligned word load.
Note that using reinterpret_cast here assumes the data is aligned correctly. For the base address of a std::vector it will work (data allocated via new must be aligned enough for the largest supported primitive type), but it's one to watch out for if you change where the code comes from in future.
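If you would rather not rely on the vector's base alignment at all, one portable route is to read the SPIR-V blob into a uint32_t buffer from the start, since malloc returns storage suitably aligned for any fundamental type. A sketch in plain C (the Vulkan API itself is C; the function name and minimal error handling are my own):
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

uint32_t *load_spirv(const char *path, size_t *word_count) {
    FILE *f = fopen(path, "rb");
    if (!f) return NULL;
    fseek(f, 0, SEEK_END);
    long size = ftell(f);                  // a valid SPIR-V blob is a multiple of 4 bytes
    fseek(f, 0, SEEK_SET);

    uint32_t *words = malloc((size_t)size);
    if (words && fread(words, 1, (size_t)size, f) != (size_t)size) {
        free(words);
        words = NULL;
    }
    fclose(f);
    if (words && word_count) *word_count = (size_t)size / 4;
    return words;
}
createInfo.pCode can then point at words directly, with codeSize still given in bytes (word_count * 4).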
According to VUID-VkShaderModuleCreateInfo-pCode-parameter, the specification requires that:
pCode must be a valid pointer to an array of codeSize / 4 uint32_t values
pCode being a uint32_t* allows this requirement to be partially validated by the compiler, making it one less thing to worry about for users of the API.
Because a SPIR-V word is 32 bits.
It should not matter for performance. It is just a type.

(STM32) Erasing flash and writing to flash gives HAL_FLASH_ERROR_PGP error (using HAL)

Trying to write to flash to store some configuration. I am using an STM32F446ze where I want to use the last 16kb sector as storage.
I specified VOLTAGE_RANGE_3 when I erased my sector. VOLTAGE_RANGE_3 is mapped to:
#define FLASH_VOLTAGE_RANGE_3 0x00000002U /*!< Device operating range: 2.7V to 3.6V */
I am getting an error when writing to flash when I use FLASH_TYPEPROGRAM_WORD. The error is HAL_FLASH_ERROR_PGP. Reading the reference manual I read that this has to do with using wrong parallelism/voltage levels.
From the reference manual I can read the table that maps the supply voltage range to the allowed program/erase parallelism (x8/x16/x32/x64). Furthermore, in the reference manual I can read:
Programming errors
It is not allowed to program data to the Flash memory that would cross the 128-bit row boundary. In such a case, the write operation is not performed and a program alignment error flag (PGAERR) is set in the FLASH_SR register. The write access type (byte, half-word, word or double word) must correspond to the type of parallelism chosen (x8, x16, x32 or x64). If not, the write operation is not performed and a program parallelism error flag (PGPERR) is set in the FLASH_SR register.
So I thought:
I erased the sector in voltage range 3
That gives me 2.7 to 3.6v specification
That gives me x32 parallelism size
I should be able to write WORDs to flash.
But this line gives me an error (after unlocking the flash):
uint32_t sizeOfStorageType = ....; // Some uint I want to write to flash as test
HAL_StatusTypeDef flashStatus = HAL_FLASH_Program(TYPEPROGRAM_WORD, address++, (uint64_t) sizeOfStorageType);
auto err= HAL_FLASH_GetError(); // err == 4 == HAL_FLASH_ERROR_PGP: FLASH Programming Parallelism error flag
while (flashStatus != HAL_OK)
{
}
But when I start to write bytes instead, it goes fine.
uint8_t *arr = (uint8_t*) &sizeOfStorageType;
HAL_StatusTypeDef flashStatus;
for (uint8_t i=0; i<4; i++)
{
flashStatus = HAL_FLASH_Program(TYPEPROGRAM_BYTE, address++, (uint64_t) *(arr+i));
while (flashStatus != HAL_OK)
{
}
}
My questions:
Am I understanding it correctly that after erasing a sector, I can only write one TYPEPROGRAM? Thus, after erasing I can only write bytes, OR, half-words, OR, words, OR double words?
What am I missing / doing wrong in above context. Why can I only write bytes, while I erased with VOLTAGE_RANGE_3?
This looks like a data alignment error, but not the one related to 128-bit flash memory rows which is mentioned in the reference manual. That one is probably related to double-word writes only, and is irrelevant in your case.
If you want to program 4 bytes at a time, your address needs to be word aligned, meaning that it needs to be divisible by 4. Also, address is not a uint32_t* (pointer), it's a raw uint32_t so address++ increments it by 1, not 4. As far as I know, Cortex M4 core converts unaligned accesses on the bus into multiple smaller size aligned accesses automatically, but this violates the flash parallelism rule.
BTW, it's perfectly valid to perform a mixture of byte, half-word and word writes as long as they are properly aligned. Also, unlike the flash hardware of F0, F1 and F3 series, you can try to overwrite a previously written location without causing an error. 0->1 bit changes are just ignored.
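Putting those points together, a minimal sketch of a word write (the helper name and the alignment check are mine; HAL_FLASH_Program, HAL_FLASH_Unlock and HAL_FLASH_Lock are the real HAL calls) could look like this, stepping the destination by 4 bytes instead of 1:
#include "stm32f4xx_hal.h"

HAL_StatusTypeDef write_word(uint32_t address, uint32_t value)
{
    HAL_StatusTypeDef status;

    if ((address % 4u) != 0u)             // word programming needs a word-aligned address
        return HAL_ERROR;

    HAL_FLASH_Unlock();
    status = HAL_FLASH_Program(FLASH_TYPEPROGRAM_WORD, address, (uint64_t)value);
    HAL_FLASH_Lock();
    return status;
}

// usage: advance the address by sizeof(uint32_t), not by 1
// for (uint32_t i = 0; i < count; ++i)
//     write_word(base_address + 4u * i, data[i]);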

Objective-c: convert array of uint8 to int32

I'm looking for a function which can quickly convert an array of uint8s to int32s (keeping the same count of numbers).
There is already such a function to convert uint8 to double in the vDSP library:
vDSP_vfltu8D
How can an analogous function be implemented in Objective-C (iOS, ARM arch)? Pure C solutions are accepted too.
In that case, based on the comments above:
ARM's Neon SIMD/Vector library is what you're looking for, but I'm not 100% sure it's supported on iOS. Even if it was, I wouldn't recommend it. You've got a 64-bit architecture on iOS, meaning you would only be able to DOUBLE the speed of your process (because you're converting to int32s).
Now that is if there was a single command that could do this. There isn't. There are a few commands that would allow you to, when used in succession, load the uint8s into a 64-bit register, shift them and zero out the other bytes, and then store those as int32s. Those commands will have more overhead because it takes several operations to do it.
If you really want to look into the commands available, check them out here (again, not sure if they're supported on iOS): http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489e/CJAJIIGG.html
The iOS architecture isn't really built for this kind of processing. Vector commands in most cases only become useful when a computer has 256-bit registers, allowing you to load in 32 bytes at a time and operate on them simultaneously. I would recommend you go with the normal approach of converting one at a time in a loop (or maybe unroll the loop to remove a bit of overhead, like so):
// loop unrolled by 4; assumes lengthOfArray is a multiple of 4
for (size_t i = 0; i < lengthOfArray; i += 4) {
    int32Array[i]     = (int32_t)uint8Array[i];
    int32Array[i + 1] = (int32_t)uint8Array[i + 1];
    int32Array[i + 2] = (int32_t)uint8Array[i + 2];
    int32Array[i + 3] = (int32_t)uint8Array[i + 3];
}
While it's a small optimization, it removes 3/4s of the looping overhead. It won't do much, but hey, it's something.
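For completeness, here's what that might look like as a self-contained C function (my own sketch; it also handles the leftover elements when the count isn't a multiple of 4):
#include <stdint.h>
#include <stddef.h>

void convert_u8_to_i32(const uint8_t *src, int32_t *dst, size_t count) {
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {      // unrolled by 4
        dst[i]     = (int32_t)src[i];
        dst[i + 1] = (int32_t)src[i + 1];
        dst[i + 2] = (int32_t)src[i + 2];
        dst[i + 3] = (int32_t)src[i + 3];
    }
    for (; i < count; ++i)                // remaining 0-3 elements
        dst[i] = (int32_t)src[i];
}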
Source: I worked on Intel's SIMD/Vector team, converting C functions to optimize on 256-bit registers. Some things just couldn't be done efficiently, unfortunately.

Assembly code for optimized bitshifting of a vector

I'm trying to write a routine that will logically bitshift by n positions to the right all elements of a vector, in the most efficient way possible, for the following vector types: BYTE->BYTE, WORD->WORD, DWORD->DWORD and WORD->BYTE (assuming that only 8 bits are present in the result). I would like to have three routines for each type, depending on the type of processor (SSE2 supported, only MMX supported, only the standard instruction set supported). Therefore I need 12 functions in total.
I have already found by myself how to back up and restore the registers that I need, how to make a loop, how to copy data into regular registers or MMX registers, and how to shift by 1 position logically.
Because I'm not familiar with assembly language, that's about it.
Which registers should I use for each instruction set?
How can I make the best use of the L1 cache, given that the vector (an image) is large?
How do I find the next element of the vector (a pointer kind of thing)? I know I can do a mov by address, and I assume I have to increment the address by 1, 2 or 4 depending on my type of data.
Although I have all the ideas, writing the code is a bit difficult at this point.
Thank you.
Arnaud.
Edit:
Here is what I'm trying to do for MMX, for a shift by 1 on a DWORD:
__asm("push mm"); // backup register
__asm("push cx"); // backup register
__asm("mov %cx, length"); // initialize loop
__asm("loopstart_shift1:"); // start label
__asm("movd %xmm0, r/m32"); // get 32 bits data
__asm("psrlq %xmm0, 1"); // right shift 32 bits data logically (stuffs 0 on the left) by 1
__asm("mov r/m32,%xmm0"); // set 32 bits data
__asm("dec %cx"); // decrement index
__asm("cmp %cx,0");
__asm("jnz loopstart_shift1");
__asm("pop cx"); // restore register
__asm("pop mm"); // restore register
__asm("emms"); // leave MMX state
I strongly suggest you pause and take a look at using intrinsics with C or C++ instead of trying to write raw asm - that way the C/C++ compiler will take care of all the register allocation, instruction scheduling and general housekeeping tasks and you can just focus on the important parts, e.g. instead of using psrlq see _m_psrlq in mmintrin.h. (Better yet, look at using 128 bit SSE intrinsics.)
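As a rough illustration of what the intrinsics route can look like, here is a sketch of my own (SSE2 only, covering just the DWORD->DWORD case, with the tail handled in plain C):
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdint.h>
#include <stddef.h>

void shift_right_u32(uint32_t *data, size_t count, unsigned n) {
    __m128i shift = _mm_cvtsi32_si128((int)n);               // shift count in the low bits
    size_t i = 0;
    for (; i + 4 <= count; i += 4) {
        __m128i v = _mm_loadu_si128((__m128i *)(data + i));  // unaligned 128-bit load
        v = _mm_srl_epi32(v, shift);                         // logical >> n on each DWORD
        _mm_storeu_si128((__m128i *)(data + i), v);
    }
    for (; i < count; ++i)                                   // scalar tail
        data[i] >>= n;
}
The compiler takes care of the register allocation and loop bookkeeping; the MMX and plain-C variants would follow the same structure.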
Sounds like you'd benefit from either using or looking into BitMagic's source. It's entirely intrinsics-based too, which makes it far more portable (though from the looks of it you're using GCC, so you might have to get an MSVC-to-GCC intrinsics mapping).