I am using the Apache Arrow Java API, which accesses direct memory.
I am also using Redis; while this Java API is accessing direct memory, a Redis stream (xstream) keeps growing in memory.
I found that occasionally Arrow calculates the wrong result in the following operation:
public static int bytesToInt(byte[] bytes) {
    // reconstructs a 32-bit int from 4 bytes in little-endian order
    return ((bytes[3] & 255) << 24) +
           ((bytes[2] & 255) << 16) +
           ((bytes[1] & 255) << 8) +
           ((bytes[0] & 255));
}
This method returns a negative result, and the following code then raises an error:
messageLength = MessageSerializer.bytesToInt(buffer.array());
ByteBuffer.allocate(messageLength);
I also found that if I restart Redis (deleting all previous data), the error disappears for a while; if not, the error occurs much sooner when I call the Arrow method.
Reproducing it is tricky; in my experience, the Arrow direct memory access has to be frequent and the Redis stream has to grow fast enough.
So I am not asking anyone to reproduce and solve the issue completely. I am asking whether it is possible that these two kinds of memory access conflict, and whether the behaviour above has a reasonable theoretical explanation.
I am trying to write to flash to store some configuration. I am using an STM32F446ZE, where I want to use the last 16 KB sector as storage.
I specified VOLTAGE_RANGE_3 when I erased my sector. VOLTAGE_RANGE_3 is mapped to:
#define FLASH_VOLTAGE_RANGE_3 0x00000002U /*!< Device operating range: 2.7V to 3.6V */
I am getting an error when writing to flash with FLASH_TYPEPROGRAM_WORD. The error is HAL_FLASH_ERROR_PGP. Reading the reference manual, I see that this has to do with using the wrong parallelism/voltage levels.
From the reference manual I can read which program parallelism (x8, x16, x32, x64) each voltage range allows.
Furthermore, in the reference manual I can read:
Programming errors
It is not allowed to program data to the Flash
memory that would cross the 128-bit row boundary. In such a case, the
write operation is not performed and a program alignment error flag
(PGAERR) is set in the FLASH_SR register. The write access type (byte,
half-word, word or double word) must correspond to the type of
parallelism chosen (x8, x16, x32 or x64). If not, the write operation
is not performed and a program parallelism error flag (PGPERR) is set
in the FLASH_SR register
So I thought:
I erased the sector in voltage range 3
That gives me 2.7 to 3.6v specification
That gives me x32 parallelism size
I should be able to write WORDs to flash.
But this line gives me an error (after unlocking the flash):
uint32_t sizeOfStorageType = ....; // Some uint I want to write to flash as test
HAL_StatusTypeDef flashStatus = HAL_FLASH_Program(TYPEPROGRAM_WORD, address++, (uint64_t) sizeOfStorageType);
auto err= HAL_FLASH_GetError(); // err == 4 == HAL_FLASH_ERROR_PGP: FLASH Programming Parallelism error flag
while (flashStatus != HAL_OK)
{
}
But when I write bytes instead, it works fine:
uint8_t *arr = (uint8_t*) &sizeOfStorageType;
HAL_StatusTypeDef flashStatus;
for (uint8_t i = 0; i < 4; i++)
{
    flashStatus = HAL_FLASH_Program(TYPEPROGRAM_BYTE, address++, (uint64_t) *(arr + i));
    while (flashStatus != HAL_OK)
    {
    }
}
My questions:
Am I understanding correctly that after erasing a sector I can only use one TYPEPROGRAM? That is, after erasing I can only write bytes, OR half-words, OR words, OR double words?
What am I missing / doing wrong in the above context? Why can I only write bytes, even though I erased with VOLTAGE_RANGE_3?
This looks like a data alignment error, but not the one related to the 128-bit flash memory rows mentioned in the reference manual. That one probably applies to double-word writes only and is irrelevant in your case.
If you want to program 4 bytes at a time, your address needs to be word-aligned, meaning it must be divisible by 4. Also, address is not a uint32_t* (pointer); it's a raw uint32_t, so address++ increments it by 1, not by 4. As far as I know, the Cortex-M4 core automatically converts unaligned accesses on the bus into multiple smaller aligned accesses, but this violates the flash parallelism rule.
BTW, it's perfectly valid to perform a mixture of byte, half-word and word writes as long as they are properly aligned. Also, unlike the flash hardware of the F0, F1 and F3 series, you can try to overwrite a previously written location without causing an error; 0->1 bit changes are just ignored.
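For illustration only (the write_word helper, store_words, and the start address are made up for this sketch, not taken from your code), a word write that respects the alignment rule and advances the address by 4 bytes could look like this:

#include "stm32f4xx_hal.h"
#include <stddef.h>

/* Hypothetical helper: program one 32-bit word at a word-aligned address. */
HAL_StatusTypeDef write_word(uint32_t address, uint32_t value)
{
    /* FLASH_TYPEPROGRAM_WORD requires an address divisible by 4 */
    if ((address % sizeof(uint32_t)) != 0)
        return HAL_ERROR;

    HAL_FLASH_Unlock();
    HAL_StatusTypeDef status =
        HAL_FLASH_Program(FLASH_TYPEPROGRAM_WORD, address, (uint64_t)value);
    HAL_FLASH_Lock();
    return status;
}

/* Usage sketch: advance the address by sizeof(uint32_t), not by 1. */
void store_words(uint32_t start, const uint32_t *data, size_t count)
{
    for (size_t i = 0; i < count; i++)
        write_word(start + i * sizeof(uint32_t), data[i]);
}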
Perl internally uses a dedicated hash, PL_strtab, as shared storage for hash keys, but in a forking environment like apache/mod_perl this creates a big issue. Best practice says to preload modules in the parent process, but nobody mentions that this eventually allocates memory for PL_strtab, and these pages of memory tend to be implicitly modified in child processes. There seem to be two major reasons for the modification:
Reason 1: reallocation (hsplit()) may happen when PL_strtab grows in a child process.
Reason 2: the REFCNT is updated every time a new reference is created.
The example below shows a 16 MB copy-on-write leak from an attempt to use a hash. Attempts to recompile perl with -DNODEFAULT_SHAREKEYS fail (https://rt.perl.org/SelfService/Display.html?id=133384). I was able to get access to PL_strtab via an XS module.
Ideally I'm looking for a way to downgrade all hashes created in the parent so that they keep hash keys within the hash itself (HE entries) rather than in PL_strtab, i.e. turn off the SHAREKEYS flag. This should allow PL_strtab to shrink to the minimum possible size; ideally it would have 0 keys in the parent.
Please let me know whether you think this is theoretically possible via XS.
#!/usr/bin/env perl
use strict;
use warnings;
use Linux::Smaps;

$SIG{CHLD} = sub { waitpid(-1, 1) };

# comment this block
{
    my %h;
    # pre-growth PL_strtab hash, kind of: keys %$PL_strtab = 2_000_000;
    foreach my $x (1 .. 2_000_000) {
        $h{$x} = undef;
    }
}

my $pid = fork // die "Cannot fork: $!";
unless ($pid) {
    # child
    my $s = Linux::Smaps->new($$)->all;
    my $before = $s->shared_clean + $s->shared_dirty;
    {
        my %h;
        foreach my $x (1 .. 2_000_000) {
            $h{$x} = undef;
        }
    }
    my $s2 = Linux::Smaps->new($$)->all;
    my $after = $s2->shared_clean + $s2->shared_dirty;
    warn 'COPY-ON-WRITE: ' . ($before - $after) . ' KB';
    exit 0;
}

sleep 1000;
print "DONE\n";
Note that the sample %h in the parent gets destroyed and is not accessible in the child. Its only purpose is to preallocate more memory for PL_strtab and make the copy-on-write issue more noticeable.
The problem is that PL_strtab is the shared data structure (not %h). It is solely controlled by Perl, and there is no way to manage it with IPC::Shareable or any other CPAN module known to me.
Real life example:
In apache/mod_perl, Starman, or any other prefork environment, everybody tries to preload as many modules as possible in the parent process. Right?
If any of the preloaded modules creates a hash (even a temporary one) with a big number of keys, Perl silently allocates more and more memory for the internal PL_strtab hash.
PL_strtab then silently gets touched in the children on any attempt to use hashes.
The problem is even worse because a huge percentage of the modules we preload are CPAN modules, so there is no way to know which of them overuse hashes, resulting in an increased memory footprint of the parent process.
Is there a way to avoid a memory leak in a simple expression evaluation like this?
inter.SetVariable("tick", tick++);
if (inter.Eval<bool>("(tick%2)==1"))
{
    odd++;
    if ((odd % 100) == 0)
        System.GC.Collect();
}
else
    even++;
I'm running this code periodically in a WinForms application on a Linux machine with Mono (5.0.1.1), and the memory usage continuously increases.
Tested on Windows, Process.WorkingSet64 was increasing at a lower rate than on Linux.
GC.GetTotalMemory is always stable.
If possible, it is better to use the Parse method and then just Invoke the expression multiple times.
Something like:
// One time only
string expression = "(tick%2)==1";
Lambda parsedExpression = interpreter.Parse(expression, new Parameter("tick", typeof(int)));
// Call invoke for each cycle...
var result = parsedExpression.Invoke(tick++);
But from my previous tests I haven't seen any memory leak; are you sure that this is the problem?
I'm looking for a function that can quickly convert an array of uint8s to int32s (keeping the same number of elements).
There is already such a function to convert uint8 to double in the vDSP library:
vDSP_vfltu8D
How can an analogous function be implemented in Objective-C (iOS, ARM arch)? Pure C solutions are accepted too.
In that case, based on the comments above:
ARM's NEON SIMD/vector instruction set is what you're looking for, but I'm not 100% sure it's supported on iOS. Even if it were, I wouldn't recommend it. You've got a 64-bit architecture on iOS, meaning you would only be able to DOUBLE the speed of your process (because you're converting to int32s).
Now, that is if there were a single command that could do this. There isn't. There are a few commands that, when used in succession, would let you load the uint8s into a 64-bit register, shift them and zero out the other bytes, and then store the results as int32s. Those commands carry more overhead because it takes several operations to do it.
If you really want to look into the commands available, check them out here (again, not sure if they're supported on iOS): http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.dui0489e/CJAJIIGG.html
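For illustration only, that succession of operations might look roughly like the following with NEON intrinsics in plain C (whether arm_neon.h is usable in your iOS toolchain is an assumption, and the helper name u8_to_s32 is made up):

#include <arm_neon.h>
#include <stdint.h>

/* Widen count uint8s into int32s; values 0..255 always fit, so the
 * unsigned-to-signed reinterpret at the end is safe. */
void u8_to_s32(const uint8_t *src, int32_t *dst, int count)
{
    int i = 0;
    for (; i + 8 <= count; i += 8) {
        uint8x8_t  v8  = vld1_u8(src + i);                 /* load 8 bytes     */
        uint16x8_t v16 = vmovl_u8(v8);                     /* widen u8  -> u16 */
        uint32x4_t lo  = vmovl_u16(vget_low_u16(v16));     /* widen u16 -> u32 */
        uint32x4_t hi  = vmovl_u16(vget_high_u16(v16));
        vst1q_s32(dst + i,     vreinterpretq_s32_u32(lo)); /* store as int32   */
        vst1q_s32(dst + i + 4, vreinterpretq_s32_u32(hi));
    }
    for (; i < count; i++)                                 /* scalar tail      */
        dst[i] = (int32_t)src[i];
}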
The iOS architecture isn't really built for this kind of processing. Vector commands mostly become useful when a machine has 256-bit registers, allowing you to load 32 bytes at a time and operate on them simultaneously. I would recommend the normal approach of converting one element at a time in a loop (or maybe unrolling the loop to remove a bit of overhead), like so:
// unrolled conversion loop (assumes lengthOfArray is a multiple of 4)
for (int i = 0; i < lengthOfArray; i += 4) {
    int32Array[i]     = (int32_t)uint8Array[i];
    int32Array[i + 1] = (int32_t)uint8Array[i + 1];
    int32Array[i + 2] = (int32_t)uint8Array[i + 2];
    int32Array[i + 3] = (int32_t)uint8Array[i + 3];
}
While it's a small optimization, it removes three quarters of the looping overhead. It won't do much, but hey, it's something.
Source: I worked on Intel's SIMD/Vector team, converting C functions to optimize on 256-bit registers. Some things just couldn't be done efficiently, unfortunately.
Working on a filter, I am having a problem porting this piece of image-processing code to the GPU:
for (int h = 0; h < height; h++) {
    for (int w = 1; w < width; w++) {
        image[h][w] = (1 - a) * image[h][w] + a * image[h][w - 1];
    }
}
If I define:
dim3 threads_perblock(32, 32)
then in each block I have 32 threads that can communicate with each other. The threads of one block cannot communicate with the threads of other blocks.
Within a thread block, I can translate that piece of code using shared memory. However, at the block edges (I would say), image[0][31] and image[0][32] fall into different thread blocks: image[0][32] needs the value of image[0][31] to compute its own value, but they live in different thread blocks.
So that is the problem.
How would I solve this?
Thanks in advance.
If image is in global memory then there is no problem: you don't need to use shared memory and you can just access the pixels directly from image.
However, if you have already done some processing prior to this and a block of the image is already in shared memory, then you do have a problem, since you need to perform neighbourhood operations that fall outside the range of your block. You can do one of the following, either:
write shared memory back to global memory so that it is accessible to neighbouring blocks (disadvantage: performance, synchronization between blocks can be tricky)
or:
process additional edge pixels per block with an overlap (1 pixel in this case) so that you have additional pixels in each block to handle the edge cases, e.g. work with a 34x34 block size but store only the 32x32 central output pixels (disadvantage: requires additional logic within kernel, branches may result in warp divergence, not all threads in block are fully used)
Unfortunately neighbourhood operations can be really tricky in CUDA and there is always a down-side whatever method you use to handle edge cases.
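For what it's worth, in your particular filter each pixel only depends on its left-hand neighbour in the same row, so one simple arrangement matching the first option (direct global-memory access) is to let one thread walk one row. This is only a sketch under that assumption; the flat float* layout and the kernel name row_filter are made up for the example:

// One thread processes one full row, reading and writing global memory directly.
__global__ void row_filter(float *image, int width, int height, float a)
{
    int h = blockIdx.x * blockDim.x + threadIdx.x;
    if (h >= height)
        return;

    float prev = image[h * width];              // image[h][0] is left unchanged
    for (int w = 1; w < width; w++) {
        float cur = (1.0f - a) * image[h * width + w] + a * prev;
        image[h * width + w] = cur;
        prev = cur;                             // carries the w-1 dependency
    }
}

// launch example: row_filter<<<(height + 255) / 256, 256>>>(d_image, width, height, a);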
You can just use a busy spin (no joke). Just make the thread processing a[32] execute:
while(!variable);
before starting to compute, and make the thread processing a[31] execute:
variable = 1;
when it finishes. It's up to you to generalize this. I know this is considered "rogue programming" in CUDA, but it seems to be the only way to achieve what you want. I had a very similar problem and it worked for me. Your performance might suffer, though...
Be careful, however, that
dim3 threads_perblock(32, 32)
means you have 32 x 32 = 1024 threads per block.