Valgrind: examine memory by patching the lackey tool

I would like to patch Valgrind's lackey example tool. I want to
examine the memory of the instrumented binary for the appearance
of a certain string sequence around the pointer of a store instruction.
Alternatively, I could scan all memory regions on each store for the appearance
of such a sequence. Does anyone know of an adequate
example? Basically I'd like to do:
for (i = -8; i <= 8; i++) {
    if (strncmp(ptr + i, "needle", 6) == 0)
        printf("Here ip: %x\n", ip);
}
But how can I verify that ptr + i is valid over the whole range i in [-8, 8]? Is there
a function that tracks the heap regions, or do I have to parse /proc/pid/maps each time?
// Konrad

It turns out that the exp-dhat tool in Valgrind works for me:
static VG_REGPARM(3)
void dh_handle_write ( Addr addr, UWord szB )
{
   Block* bk = find_Block_containing(addr);
   if (bk) {
      /* The scan below reads 6 bytes starting at each of addr-10 .. addr+10,
         so the containment check must cover 2*10 + 6 bytes. */
      if (is_subinterval_of(bk->payload, bk->req_szB, addr - 10, 2*10 + 6)) {
         /* Instruction pointer of the thread doing the write. */
         Addr ip = VG_(get_IP)( VG_(get_running_tid)() );
         int i;
         for (i = -10; i <= 10; i++) {
            if (VG_(memcmp)((char*)addr + i, searchfor, 6) == 0) {
               ExeContext *ec = VG_(record_ExeContext)( VG_(get_running_tid)(), 0 );
               VG_(pp_ExeContext)( ec );
               VG_(printf)(" ---------------- ----------- found %08lx # %08lx --------\n", addr, ip);
            }
         }
      }
      bk->n_writes += szB;
      if (bk->histoW)
         inc_histo_for_block(bk, addr, szB);
   }
}
On each write I search for an occurrence of the array searchfor and print a stack trace if it is found...
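For reference, searchfor is assumed to be a tool-global byte array matching the 6-byte compare above; a hypothetical definition:
/* Hypothetical needle scanned for on every write (6 bytes, no terminating NUL needed). */
static const char searchfor[6] = { 'n', 'e', 'e', 'd', 'l', 'e' };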

Related

STM32 Crash on Flash Sector Erase

I'm trying to write 4 uint32's of data into the flash memory of my STM32F767ZI. I've looked at some examples and at the reference manual, but I still cannot do it. My goal is to write 4 uint32's into flash, read them back, compare them with the original data, and light different LEDs depending on the result of the comparison.
My code is as follows:
void flash_write(uint32_t offset, uint32_t *data, uint32_t size) {
    FLASH_EraseInitTypeDef EraseInitStruct = {0};
    uint32_t SectorError = 0;
    HAL_FLASH_Unlock();
    EraseInitStruct.TypeErase = FLASH_TYPEERASE_SECTORS;
    EraseInitStruct.VoltageRange = FLASH_VOLTAGE_RANGE_3;
    EraseInitStruct.Sector = FLASH_SECTOR_11;
    EraseInitStruct.NbSectors = 1;
    //EraseInitStruct.Banks = FLASH_BANK_1; // or FLASH_BANK_2 or FLASH_BANK_BOTH
    st = HAL_FLASHEx_Erase(&EraseInitStruct, &SectorError);
    if (st == HAL_OK) {
        for (int i = 0; i < size; i += 4) {
            st = HAL_FLASH_Program(FLASH_TYPEPROGRAM_WORD, FLASH_USER_START_ADDR + offset + i, *(data + i)); // This is what's giving me trouble
            if (st != HAL_OK) {
                // handle the error
                break;
            }
        }
    } else {
        // handle the error
    }
    HAL_FLASH_Lock();
}
void flash_read(uint32_t offset, uint32_t *data, uint32_t size) {
    for (int i = 0; i < size; i += 4) {
        *(data + i) = *(__IO uint32_t*)(FLASH_USER_START_ADDR + offset + i);
    }
}
int main(void) {
    uint32_t data[] = {'a', 'b', 'c', 'd'};
    uint32_t read_data[] = {0, 0, 0, 0};
    HAL_Init();
    SystemClock_Config();
    MX_GPIO_Init();
    flash_write(0, data, sizeof(data));
    flash_read(0, read_data, sizeof(read_data));
    if (compareArrays(data, read_data, 4))
    {
        HAL_GPIO_WritePin(GPIOB, GPIO_PIN_7, SET);
    }
    else
    {
        HAL_GPIO_WritePin(GPIOB, GPIO_PIN_14, SET);
    }
    return 0;
}
The problem is that before writing data I must erase a sector, and when I do that with the HAL_FLASHEx_Erase(&EraseInitStruct, &SectorError) function, the program always crashes, and sometimes it even corrupts my code space, forcing me to re-flash the firmware.
I've selected the sector farthest from the code space, but it still crashes when I try to erase it.
I've read in the reference manual that
Any attempt to read the Flash memory while it is being written or erased, causes the bus to
stall. Read operations are processed correctly once the program operation has completed.
This means that code or data fetches cannot be performed while a write/erase operation is
ongoing.
which I believe means the code should ideally run from RAM while we operate on the flash. However, I've seen other people online not have this issue, so I wanted to confirm: is this my only problem, or am I doing something else wrong?
In your loop you add multiples of 4 to i, but then you add i to data, which is a uint32_t pointer. Additions to a pointer are automatically scaled by the size of the pointed-to type, so you are actually stepping in multiples of 16 bytes and reading past the end of your input buffer.
Also, make sure you initialize all members of EraseInitStruct. Uncomment that Banks line and set the correct value! A corrected sketch of both loops is below.
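A minimal corrected sketch of the two functions, assuming the same FLASH_USER_START_ADDR and HAL setup as in the question, and that FLASH_BANK_1 is the right bank for sector 11 on this part; the loop index now advances one element at a time, and the byte offset is computed separately:
void flash_write(uint32_t offset, uint32_t *data, uint32_t size) {
    FLASH_EraseInitTypeDef EraseInitStruct = {0};
    uint32_t SectorError = 0;
    HAL_FLASH_Unlock();
    EraseInitStruct.TypeErase = FLASH_TYPEERASE_SECTORS;
    EraseInitStruct.VoltageRange = FLASH_VOLTAGE_RANGE_3;
    EraseInitStruct.Sector = FLASH_SECTOR_11;
    EraseInitStruct.NbSectors = 1;
    EraseInitStruct.Banks = FLASH_BANK_1; // initialize every member; pick the bank containing the sector
    if (HAL_FLASHEx_Erase(&EraseInitStruct, &SectorError) == HAL_OK) {
        // size is in bytes; program one 32-bit word (4 bytes) per iteration
        for (uint32_t i = 0; i < size / 4; i++) {
            if (HAL_FLASH_Program(FLASH_TYPEPROGRAM_WORD,
                                  FLASH_USER_START_ADDR + offset + i * 4,
                                  data[i]) != HAL_OK) {
                break; // handle the error
            }
        }
    }
    HAL_FLASH_Lock();
}

void flash_read(uint32_t offset, uint32_t *data, uint32_t size) {
    for (uint32_t i = 0; i < size / 4; i++) {
        data[i] = *(__IO uint32_t*)(FLASH_USER_START_ADDR + offset + i * 4);
    }
}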

Creating threads with pthread_create() doesn't work on my Linux

I have this piece of C/C++ code:
void * myThreadFun(void *vargp)
{
    int start = atoi((char*)vargp) % nFracK;
    printf("Thread start = %d, dQ = %d\n", start, dQ);
    pthread_mutex_lock(&nItermutex);
    nIter++;
    pthread_mutex_unlock(&nItermutex);
}
void Opt() {
    pthread_t thread[200];
    char start[100];
    for(int i = 0; i < 10; i++) {
        sprintf(start, "%d", i);
        int ret = pthread_create(&thread[i], NULL, myThreadFun, (void*) start);
        printf("ret = %d on thread %d\n", ret, i);
    }
    for(int i = 0; i < 10; i++)
        pthread_join(thread[i], NULL);
}
It should create 10 threads, but I don't understand why it instead creates n < 10 threads.
The ret value is always 0 (all 10 times).
Your program contains at least one data race, therefore its behavior is undefined.
The provided source is also incomplete, so I can't be sure I'm testing the same thing you are. Nevertheless, I performed the minimum augmentation needed for g++ to compile it without warnings, and tested that:
#include <cstdlib>
#include <cstdio>
#include <pthread.h>
pthread_mutex_t nItermutex = PTHREAD_MUTEX_INITIALIZER;
const int nFracK = 100;
const int dQ = 4;
int nIter = 0;
void * myThreadFun(void *vargp)
{
    int start = atoi((char*)vargp) % nFracK;
    printf("Thread start = %d, dQ = %d\n", start, dQ);
    pthread_mutex_lock(&nItermutex);
    nIter++;
    pthread_mutex_unlock(&nItermutex);
    return NULL;
}
void Opt() {
    pthread_t thread[200];
    char start[100];
    for(int i = 0; i < 10; i++) {
        sprintf(start, "%d", i);
        int ret = pthread_create(&thread[i], NULL, myThreadFun, (void*) start);
        printf("ret = %d on thread %d\n", ret, i);
    }
    for(int i = 0; i < 10; i++)
        pthread_join(thread[i], NULL);
}
int main(void) {
    Opt();
    return 0;
}
The fact that its behavior is undefined notwithstanding, when I run this program on my Linux machine, it invariably prints exactly ten "Thread start" lines, albeit not all with distinct numbers. The most plausible conclusion is that the program indeed does start ten (additional) threads, which is consistent with the fact that the output also seems to indicate that each call to pthread_create() indicates success by returning 0. I therefore reject your assertion that fewer than ten threads are actually started.
Presumably, the follow-up question would be why the program does not print the expected output, and here we return to the data race and the accompanying undefined behavior. The main thread writes a text representation of the iteration variable i into the local array start of function Opt, and passes a pointer to that same array to each call to pthread_create(). When it then cycles back to do it again, there is a race between the newly created thread trying to read the data and the main thread overwriting the array's contents with new data. I suppose your idea was to avoid passing &i, but this is neither better nor fundamentally different.
You have several options for avoiding a data race in such a situation, prominent among them being:
initialize each thread indirectly from a different object, for example:
int start[10];
for(int i = 0; i < 10; i++) {
    start[i] = i;
    int ret = pthread_create(&thread[i], NULL, myThreadFun, &start[i]);
}
Note that each thread is passed a pointer to a different array element, which the main thread does not subsequently modify. The thread function must be adjusted to match, since it now receives a pointer to an int rather than a string; see the sketch below.
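With that change the thread function reads an int instead of parsing a string; a minimal matching version, assuming the same globals as the listing above:
void * myThreadFun(void *vargp)
{
    int start = *(int *)vargp % nFracK; // read the per-thread slot directly
    printf("Thread start = %d, dQ = %d\n", start, dQ);
    pthread_mutex_lock(&nItermutex);
    nIter++;
    pthread_mutex_unlock(&nItermutex);
    return NULL;
}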
initialize each thread directly from the value passed to it. This is not always a viable alternative, but it is possible in this case:
for(int i = 0; i < 10; i++) {
    int ret = pthread_create(&thread[i], NULL, myThreadFun,
            reinterpret_cast<void *>(static_cast<std::intptr_t>(i)));
}
accompanied by corresponding code in the thread function:
int start = reinterpret_cast<std::intptr_t>(vargp) % nFracK;
This is a fairly common idiom, though more often used when writing in pthreads's native language, C, where it's less verbose.
Use a mutex, semaphore, or other synchronization object to prevent the main thread from modifying the array before the child has read it; a sketch follows.
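For that third option, a minimal sketch using an unnamed POSIX semaphore from <semaphore.h> (assuming Linux and the same globals as the listing above); the main thread blocks until each child signals that it has read its argument:
#include <semaphore.h>

sem_t arg_read; // signaled by each child once it has copied its argument

void * myThreadFun(void *vargp)
{
    int start = atoi((char*)vargp) % nFracK;
    sem_post(&arg_read); // the buffer may now be reused by the main thread
    printf("Thread start = %d, dQ = %d\n", start, dQ);
    pthread_mutex_lock(&nItermutex);
    nIter++;
    pthread_mutex_unlock(&nItermutex);
    return NULL;
}

void Opt() {
    pthread_t thread[10];
    char start[100];
    sem_init(&arg_read, 0, 0);
    for (int i = 0; i < 10; i++) {
        sprintf(start, "%d", i);
        pthread_create(&thread[i], NULL, myThreadFun, (void*) start);
        sem_wait(&arg_read); // don't overwrite start until the child has read it
    }
    for (int i = 0; i < 10; i++)
        pthread_join(thread[i], NULL);
    sem_destroy(&arg_read);
}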
Any of those options can be used to write a program that produces the expected output, with each thread responsible for printing one line. That supposes, of course, that the expected output does not require the threads' lines to appear in the same relative order in which the threads were started; if you want that, then only the option of synchronizing the parent and child threads will achieve it.

Debug data/NEON performance hazards in ARM NEON code

Originally the problem appeared when I tried to optimize an algorithm for ARM NEON, and some minor part of it was taking 80% of the execution time according to the profiler. I tried to test what could be done to improve it, and for that I created an array of function pointers to different versions of my optimized function and ran them in a loop to see in the profiler which one performs better:
typedef unsigned(*CalcMaxFunc)(const uint16_t a[8][4], const uint16_t b[4][4]);
CalcMaxFunc CalcMaxFuncs[] =
{
    CalcMaxFunc_NEON_0,
    CalcMaxFunc_NEON_1,
    CalcMaxFunc_NEON_2,
    CalcMaxFunc_NEON_3,
    CalcMaxFunc_C_0
};
int N = sizeof(CalcMaxFuncs) / sizeof(CalcMaxFuncs[0]);
for (int i = 0; i < 10 * N; ++i)
{
    auto f = CalcMaxFuncs[i % N];
    unsigned retI = f(a, b);
    // just random code to ensure that the cpu waits for the results
    // and the compiler doesn't optimize it away
    if (retI > 1000000)
        break;
    ret |= retI;
}
I got surprising results: the performance of a function depended entirely on its position within the CalcMaxFuncs array. For example, when I swapped CalcMaxFunc_NEON_3 to be first, it would be 3-4 times slower, and according to the profiler it would stall at the last bit of the function, where it tried to move data from a NEON to an ARM register.
So, what makes it stall sometimes and not other times? By the way, I profile on an iPhone 6 in Xcode, if that matters.
When I intentionally introduced NEON pipeline stalls by mixing in some floating-point division between calls to these functions in the loop, the unreliable behavior disappeared; now all of them perform the same regardless of the order in which they were called. So why did I have that problem in the first place, and what can I do to eliminate it in actual code?
Update:
I tried to create a simple test function and then optimize it in stages to see how I could avoid the NEON-to-ARM stalls.
Here's the test runner function:
void NeonStallTest()
{
    int findMinErr(uint8_t* var1, uint8_t* var2, int size);
    srand(0);
    uint8_t var1[1280];
    uint8_t var2[1280];
    for (int i = 0; i < sizeof(var1); ++i)
    {
        var1[i] = rand();
        var2[i] = rand();
    }
#if 0 // early exit?
    for (int i = 0; i < 16; ++i)
        var1[i] = var2[i];
#endif
    int ret = 0;
    for (int i = 0; i < 10000000; ++i)
        ret += findMinErr(var1, var2, sizeof(var1));
    exit(ret);
}
And findMinErr is this:
int findMinErr(uint8_t* var1, uint8_t* var2, int size)
{
    int ret = 0;
    int ret_err = INT_MAX;
    for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
    {
        int err = 0;
        for (int j = 0; j < 16; ++j)
        {
            int x = var1[j] - var2[j];
            err += x * x;
        }
        if (ret_err > err)
        {
            ret_err = err;
            ret = i;
        }
    }
    return ret;
}
Basically it computes the sum of squared differences between each uint8_t[16] block and returns the index of the block pair with the lowest squared difference. I then rewrote it with NEON intrinsics (no particular attempt was made to make it fast, as that's not the point):
int findMinErr_NEON(uint8_t* var1, uint8_t* var2, int size)
{
    int ret = 0;
    int ret_err = INT_MAX;
    for (int i = 0; i < size / 16; ++i, var1 += 16, var2 += 16)
    {
        int err;
        uint8x8_t var1_0 = vld1_u8(var1 + 0);
        uint8x8_t var1_1 = vld1_u8(var1 + 8);
        uint8x8_t var2_0 = vld1_u8(var2 + 0);
        uint8x8_t var2_1 = vld1_u8(var2 + 8);
        int16x8_t s0 = vreinterpretq_s16_u16(vsubl_u8(var1_0, var2_0));
        int16x8_t s1 = vreinterpretq_s16_u16(vsubl_u8(var1_1, var2_1));
        uint16x8_t u0 = vreinterpretq_u16_s16(vmulq_s16(s0, s0));
        uint16x8_t u1 = vreinterpretq_u16_s16(vmulq_s16(s1, s1));
#ifdef __aarch64__1 // note: "__aarch64__1" never matches, so the armv7 path below is always used
        err = vaddlvq_u16(u0) + vaddlvq_u16(u1);
#else
        uint32x4_t err0 = vpaddlq_u16(u0);
        uint32x4_t err1 = vpaddlq_u16(u1);
        err0 = vaddq_u32(err0, err1);
        uint32x2_t err00 = vpadd_u32(vget_low_u32(err0), vget_high_u32(err0));
        err00 = vpadd_u32(err00, err00);
        err = vget_lane_u32(err00, 0);
#endif
        if (ret_err > err)
        {
            ret_err = err;
            ret = i;
#if 0 // enable early exit?
            if (ret_err == 0)
                break;
#endif
        }
    }
    return ret;
}
Now, if (ret_err > err) is clearly a data hazard. I then manually "unrolled" the loop by two and modified the code to use err0 and err1, checking them only after performing the next round of computation. According to the profiler this brought some improvement: in the simple NEON loop, roughly 30% of the entire function was spent in the two lines vget_lane_u32 followed by if (ret_err > err); after unrolling by two, those operations took 25% (roughly a 10% overall speedup). Also, checking the armv7 version, there are only 8 instructions between the point where err0 is set (vmov.32 r6, d16[0]) and the point where it is accessed (cmp r12, r6).
Note that in the code the early exit is ifdef'ed out; enabling it would make the function even slower. If I unrolled by four, used four errN variables, and deferred the check by two rounds, I still saw vget_lane_u32 taking too much time in the profiler. When I checked the generated asm, it appears the compiler defeats all these attempts because it reuses some of the errN registers, which effectively makes the CPU access the results of vget_lane_u32 much earlier than I want (I aim to delay the access by 10-20 instructions). Only when I unrolled by four and marked all four errN as volatile did vget_lane_u32 disappear from the profiler entirely; however, the if (ret_err > errN) checks then became very slow, since those values probably ended up as regular stack variables, and the 4 checks in the 4x manual unroll started to take 40% overall. It looks like with proper manual asm it would be possible to make this work (keep an early loop exit while avoiding NEON-to-ARM stalls, with some ARM logic in the loop), but the extra maintenance required to deal with ARM asm makes that kind of code 10x more complex to maintain in a large project that otherwise has no armasm. A sketch of the two-way unrolling is below.
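For illustration, a sketch of the two-way unrolling described above (assuming size is a multiple of 32 and the same armv7-style horizontal adds as in the listing); both per-block sums stay in NEON registers until the end of the iteration, so each vget_lane_u32 is separated from the instructions that produced its value by the other block's arithmetic:
#include <arm_neon.h>
#include <limits.h>
#include <stdint.h>

// Sum of squared differences of one 16-byte block; both lanes of the
// result hold the same sum after the final pairwise add.
static inline uint32x2_t blockErr(const uint8_t* p1, const uint8_t* p2)
{
    int16x8_t s0 = vreinterpretq_s16_u16(vsubl_u8(vld1_u8(p1 + 0), vld1_u8(p2 + 0)));
    int16x8_t s1 = vreinterpretq_s16_u16(vsubl_u8(vld1_u8(p1 + 8), vld1_u8(p2 + 8)));
    uint16x8_t u0 = vreinterpretq_u16_s16(vmulq_s16(s0, s0));
    uint16x8_t u1 = vreinterpretq_u16_s16(vmulq_s16(s1, s1));
    uint32x4_t e = vaddq_u32(vpaddlq_u16(u0), vpaddlq_u16(u1));
    uint32x2_t e2 = vpadd_u32(vget_low_u32(e), vget_high_u32(e));
    return vpadd_u32(e2, e2);
}

int findMinErr_NEON_unroll2(uint8_t* var1, uint8_t* var2, int size)
{
    int ret = 0;
    int ret_err = INT_MAX;
    for (int i = 0; i < size / 16; i += 2, var1 += 32, var2 += 32)
    {
        // Issue both blocks' NEON work before touching ARM registers.
        uint32x2_t e0 = blockErr(var1, var2);
        uint32x2_t e1 = blockErr(var1 + 16, var2 + 16);
        int err0 = vget_lane_u32(e0, 0);
        int err1 = vget_lane_u32(e1, 0);
        if (ret_err > err0) { ret_err = err0; ret = i; }
        if (ret_err > err1) { ret_err = err1; ret = i + 1; }
    }
    return ret;
}
As noted above, whether this actually delays the NEON-to-ARM moves depends on the compiler's register allocation, so the generated asm still needs to be checked.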
Update:
Here's a sample stall when moving data from a NEON to an ARM register. To implement the early exit I need to move from NEON to ARM once per loop. This move alone takes more than 50% of the entire function according to the sampling profiler that comes with Xcode. I tried adding lots of no-ops before and/or after the mov, but nothing seems to affect the results in the profiler. I tried using vorr d0,d0,d0 for the no-ops: no difference. What's the reason for the stall, or is the profiler simply showing wrong results?

Saving randomly generated passwords to a text file in order to display them later

I'm currently in a traineeship and I have two pieces of software I'm working on. The most important one was requested yesterday, and I'm stuck on the failure of its main feature: saving passwords.
The application is developed in C++/CLR using Visual Studio 2013 (I couldn't install the MFC libraries somehow; the installation kept failing and crashing even after multiple reboots) and aims to generate a password from a seed provided by the user. The generated password is saved to a .txt file. If the seed has already been used, the previously generated password shows up.
Unfortunately I can't save the password and seed to the file, though I can write the seed if I don't get to the end of the document. I went for the "if line is empty then write this to the document" approach, but it doesn't work and I can't find out why. However, I can read the passwords without any problem.
Here's the interesting part of the source:
int seed;
char genRandom() {
    static const char letters[] =
        "0123456789"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz";
    int stringLength = sizeof(letters) - 1;
    return letters[rand() % stringLength];
}
System::Void OK_Click(System::Object^ sender, System::EventArgs^ e) {
    fstream passwords;
    if (!(passwords.is_open())) {
        passwords.open("passwords.txt", ios::in | ios::out);
    }
    string gen = msclr::interop::marshal_as<std::string>(GENERATOR->Text), line, genf = gen;
    bool empty_line_found = false;
    while (empty_line_found == false) {
        getline(passwords, line);
        if (gen == line) {
            getline(passwords, line);
            PASSWORD->Text = msclr::interop::marshal_as<System::String^>(line);
            break;
        }
        if (line.empty()) {
            for (unsigned int i = 0; i < gen.length(); i++) {
                seed += gen[i];
            }
            srand(seed);
            string pass;
            for (int i = 0; i < 10; ++i) {
                pass += genRandom();
            }
            passwords << pass << endl << gen << "";
            PASSWORD->Text = msclr::interop::marshal_as<System::String^>(pass);
            empty_line_found = true;
        }
    }
}
I've also tried replacing ios::in with ios::app and it doesn't work. And yes, I have included fstream, iostream, etc.
Thanks in advance!
[EDIT]
Just solved this problem. Thanks Rook for putting me on the right track. It feels like a silly way to do it, but I've closed the file and re-opened it using ios::app to write at the end of it. I also fixed a silly mistake: I was writing the password before the seed and not inserting a final newline, which broke the main loop. Here's the code in case someone ends up with the same problem:
int seed;
char genRandom() {
    static const char letters[] =
        "0123456789"
        "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
        "abcdefghijklmnopqrstuvwxyz";
    int stringLength = sizeof(letters) - 1;
    return letters[rand() % stringLength];
}
System::Void OK_Click(System::Object^ sender, System::EventArgs^ e) {
    fstream passwords;
    if (!(passwords.is_open())) {
        passwords.open("passwords.txt", ios::in | ios::out);
    }
    string gen = msclr::interop::marshal_as<std::string>(GENERATOR->Text), line, genf = gen;
    bool empty_line_found = false;
    while (empty_line_found == false) {
        getline(passwords, line);
        if (gen == line) {
            getline(passwords, line);
            PASSWORD->Text = msclr::interop::marshal_as<System::String^>(line);
            break;
        }
        if (line.empty()) {
            // Reopen in append mode so the new entry lands at the end of the file
            passwords.close();
            passwords.open("passwords.txt", ios::app);
            for (unsigned int i = 0; i < gen.length(); i++) {
                seed += gen[i];
            }
            srand(seed);
            string pass;
            for (int i = 0; i < 10; ++i) {
                pass += genRandom();
            }
            // Seed first, then password, then a final newline so the loop
            // can still find an empty line next time
            passwords << gen << endl << pass << endl << "";
            PASSWORD->Text = msclr::interop::marshal_as<System::String^>(pass);
            empty_line_found = true;
        }
    }
    passwords.close();
}
So, here's an interesting thing:
passwords << pass << endl << gen << "";
You're not ending that with a newline. This means the very end of your file could be missing a newline too. This has an interesting effect when you do this on the final line:
getline(passwords, line);
getline will read until it sees a line ending, or an EOF. If there's no newline, it'll hit that EOF and then set the EOF bit on the stream. That means the next time you try to do this:
passwords << pass << endl << gen << "";
the stream will refuse to write anything, because it is in an EOF state. There are various things you can do here, but the simplest would be to call passwords.clear() to remove any error flags such as eofbit, as sketched below. I'd be very cautious about accidentally clearing genuine error flags, though; read the docs for fstream carefully.
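For instance, a minimal sketch of that recovery before the write (the seekp to the end is an assumption about where new entries should go):
passwords.clear();                 // drop eofbit/failbit so the stream accepts I/O again
passwords.seekp(0, std::ios::end); // position the write cursor at the end of the file
passwords << gen << endl << pass << endl;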
I also reiterate my comment that C++/CLR is a glue language, and not a great language for general-purpose development, which would be better done in standard C++ or a .NET language such as C#. If you're absolutely wedded to C++/CLR for some reason, you may as well make use of the extensive .NET library so you don't have to pointlessly marshal managed types back and forth. See System::IO::FileStream, for example; a sketch follows.
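To illustrate that route, a sketch of the same handler written against the .NET file API from C++/CLR; MakePassword is a hypothetical stand-in for the existing generator, and the file is assumed to exist and to store seed/password pairs on alternating lines:
System::Void OK_Click(System::Object^ sender, System::EventArgs^ e) {
    System::String^ gen = GENERATOR->Text;
    // Read all lines and look for an existing seed; no marshaling needed.
    array<System::String^>^ lines = System::IO::File::ReadAllLines("passwords.txt");
    for (int i = 0; i + 1 < lines->Length; i += 2) {
        if (lines[i] == gen) {
            PASSWORD->Text = lines[i + 1];
            return;
        }
    }
    System::String^ pass = MakePassword(gen); // hypothetical generator
    System::IO::File::AppendAllLines("passwords.txt",
        gcnew array<System::String^> { gen, pass });
    PASSWORD->Text = pass;
}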

Vulkan depth image binding error

Hi, I am trying to bind a depth memory buffer, but I get the error below and I have no idea why it pops up.
The depth format is VK_FORMAT_D16_UNORM and the usage is VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT. I have read online that the tiling shouldn't be linear, but then I get a different error. Thanks!!!
The code for creating and binding the image is below.
VkImageCreateInfo imageInfo = {};
// If the depth format is undefined, fall back to a 16-bit depth format
if (Depth.format == VK_FORMAT_UNDEFINED) {
    Depth.format = VK_FORMAT_D16_UNORM;
}
const VkFormat depthFormat = Depth.format;
VkFormatProperties props;
vkGetPhysicalDeviceFormatProperties(*deviceObj->gpu, depthFormat, &props);
if (props.linearTilingFeatures & VK_FORMAT_FEATURE_DEPTH_STENCIL_ATTACHMENT_BIT) {
    imageInfo.tiling = VK_IMAGE_TILING_LINEAR;
}
else if (props.optimalTilingFeatures & VK_FORMAT_FEATURE_DEPTH_STENCIL_ATTACHMENT_BIT) {
    imageInfo.tiling = VK_IMAGE_TILING_OPTIMAL;
}
else {
    std::cout << "Unsupported Depth Format, try other Depth formats.\n";
    exit(-1);
}
imageInfo.sType = VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO;
imageInfo.pNext = NULL;
imageInfo.imageType = VK_IMAGE_TYPE_2D;
imageInfo.format = depthFormat;
imageInfo.extent.width = width;
imageInfo.extent.height = height;
imageInfo.extent.depth = 1;
imageInfo.mipLevels = 1;
imageInfo.arrayLayers = 1;
imageInfo.samples = NUM_SAMPLES;
imageInfo.queueFamilyIndexCount = 0;
imageInfo.pQueueFamilyIndices = NULL;
imageInfo.sharingMode = VK_SHARING_MODE_EXCLUSIVE;
imageInfo.usage = VK_IMAGE_USAGE_DEPTH_STENCIL_ATTACHMENT_BIT;
imageInfo.flags = 0;
// Use the image create info to create the image object
result = vkCreateImage(deviceObj->device, &imageInfo, NULL, &Depth.image);
assert(result == VK_SUCCESS);
// Get the image memory requirements
VkMemoryRequirements memRqrmnt;
vkGetImageMemoryRequirements(deviceObj->device, Depth.image, &memRqrmnt);
VkMemoryAllocateInfo memAlloc = {};
memAlloc.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
memAlloc.pNext = NULL;
memAlloc.allocationSize = memRqrmnt.size;
memAlloc.memoryTypeIndex = 0;
// Determine the type of memory required with the help of memory properties
pass = deviceObj->memoryTypeFromProperties(memRqrmnt.memoryTypeBits, 0 /* No requirements */, &memAlloc.memoryTypeIndex);
assert(pass);
// Allocate the memory for the image object
result = vkAllocateMemory(deviceObj->device, &memAlloc, NULL, &Depth.mem);
assert(result == VK_SUCCESS);
// Bind the allocated memory
result = vkBindImageMemory(deviceObj->device, Depth.image, Depth.mem, 0);
assert(result == VK_SUCCESS);
Yes, linear tiling may not be supported for depth-usage images.
Consult the specification and the Valid Usage section of VkImageCreateInfo. The capability is queried with the vkGetPhysicalDeviceFormatProperties and vkGetPhysicalDeviceImageFormatProperties commands. Depth formats are "opaque" anyway, so there is not much reason to use linear tiling.
This you seem to be doing in your code.
But the error informs you that you are trying to use a memory type that is not allowed for the given image. Use the vkGetImageMemoryRequirements command to query which memory types are allowed.
Possibly you have some error there (you are using 0x1, which is obviously not part of 0x84 per the message). You may want to reuse the example code from the Device Memory chapter of the specification. Provide your memoryTypeFromProperties implementation for a more specific answer.
I accidentally set the typeIndex to 1 instead of i, and it works now. In my defense, I have been writing Vulkan code the whole day and my eyes are bleeding :). Thanks for the help.
bool VulkanDevice::memoryTypeFromProperties(uint32_t typeBits, VkFlags requirementsMask, uint32_t *typeIndex)
{
    // Search memtypes to find the first index with those properties
    for (uint32_t i = 0; i < 32; i++) {
        if ((typeBits & 1) == 1) {
            // Type is available, does it match user properties?
            if ((memoryProperties.memoryTypes[i].propertyFlags & requirementsMask) == requirementsMask) {
                *typeIndex = i; // was set to 1 :(
                return true;
            }
        }
        typeBits >>= 1;
    }
    // No memory types matched, return failure
    return false;
}
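For completeness, a minimal sketch of how this helper is typically called when allocating the depth image's memory; requiring VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT here is an assumption (passing 0, as in the question, accepts the first allowed type):
VkMemoryRequirements memRqrmnt;
vkGetImageMemoryRequirements(deviceObj->device, Depth.image, &memRqrmnt);

VkMemoryAllocateInfo memAlloc = {};
memAlloc.sType = VK_STRUCTURE_TYPE_MEMORY_ALLOCATE_INFO;
memAlloc.allocationSize = memRqrmnt.size;

// Pick a memory type that the image allows (memoryTypeBits) and that is
// device-local; fail loudly if none matches.
bool pass = deviceObj->memoryTypeFromProperties(memRqrmnt.memoryTypeBits,
                                                VK_MEMORY_PROPERTY_DEVICE_LOCAL_BIT,
                                                &memAlloc.memoryTypeIndex);
assert(pass);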