PL_strtab/SHAREKEYS and copy-on-write leak - apache

Perl internally uses a dedicated hash, PL_strtab, as shared storage for hash keys, but in a forking environment like apache/mod_perl this creates a big issue. Best practice says to preload modules in the parent process, but nobody mentions that doing so also allocates memory for PL_strtab, and that these memory pages tend to be implicitly modified in the child processes. There seem to be two major reasons for the modification:
Reason 1: reallocation (hsplit()) may happen when PL_strtab grows in a child process.
Reason 2: REFCNT is updated every time a new reference is created.
The example below shows a 16MB copy-on-write leak caused by merely using a hash. Attempts to recompile perl with -DNODEFAULT_SHAREKEYS fail (https://rt.perl.org/SelfService/Display.html?id=133384). I was able to get access to PL_strtab via an XS module.
Ideally I'm looking for a way to downgrade all hashes created in the parent so that they keep their keys within the hash itself (the HE object) rather than in PL_strtab, i.e. to turn off the SHAREKEYS flag. This should make it possible to shrink PL_strtab to the minimum possible size. Ideally it should have 0 keys in the parent.
Please let me know if you think this is theoretically possible via XS.
#!/usr/bin/env perl
use strict;
use warnings;
use Linux::Smaps;
$SIG{CHLD} = sub { waitpid(-1, 1) };
# comment this block
{
    my %h;
    # pre-growth PL_strtab hash, kind of: keys %$PL_strtab = 2_000_000;
    foreach my $x (1 .. 2_000_000) {
        $h{$x} = undef;
    }
}
my $pid = fork // die "Cannot fork: $!";
unless ($pid) {
    # child
    my $s = Linux::Smaps->new($$)->all;
    my $before = $s->shared_clean + $s->shared_dirty;
    {
        my %h;
        foreach my $x (1 .. 2_000_000) {
            $h{$x} = undef;
        }
    }
    my $s2 = Linux::Smaps->new($$)->all;
    my $after = $s2->shared_clean + $s2->shared_dirty;
    warn 'COPY-ON-WRITE: ' . ($before - $after) . ' KB';
    exit 0;
}
sleep 1000;
print "DONE\n";
Note that the sample %h in the parent gets destroyed and is not accessible in the child. Its only purpose is to preallocate more memory for PL_strtab and make the copy-on-write issue more noticeable.
The problem is that PL_strtab is the shared data structure (not %h). It is controlled solely by Perl, and there is no way to control it or to use IPC::Shareable or any other CPAN module known to me.
Real-life example:
In apache/mod_perl, Starman or any other prefork environment, everybody tries to preload as many modules as possible in the parent process. Right?
If any of the preloaded modules creates a hash (even a temporary one) with a large number of keys, Perl silently allocates more and more memory for the internal PL_strtab hash.
PL_strtab then silently gets touched in the children on any attempt to use hashes.
The problem is even worse because a huge percentage of the modules we preload are CPAN modules, so there is no way to know which of them overuse hashes and increase the memory footprint of the parent process.

Megamorphic virtual call - trying to optimize it

We know that there are techniques that make virtual calls less expensive in the JVM, such as the inline cache and the polymorphic inline cache.
Let's consider the following situation:
Base is an interface.
public void f(Base[] b) {
    for (int i = 0; i < b.length; i++) {
        b[i].m();
    }
}
I see from my profiler that calling the virtual (interface) method m is relatively expensive.
f is on the hot path and was compiled to machine code (C2), but I see that the call to m is a real virtual call. This means it was not optimized by the JVM.
The question is how to deal with such a situation. Obviously, I cannot make the method m non-virtual here because that would require a serious redesign.
Can I do anything, or do I have to accept it? I was thinking about how to "force" or "convince" the JVM to:
use a polymorphic inline cache here; the number of different types in b is quite low, around 4-5 types.
unroll this loop; the length of b is also relatively small. After unrolling, it is possible that an inline cache will be helpful here.
Thanks in advance for any advice.
Regards,
HotSpot JVM can inline up to two different targets of a virtual call, for more receivers there will be a call via vtable/itable [1].
To force inlining of more receivers, you may try to devirtualize the call manually, e.g.
if (b.getClass() == X.class) {
    ((X) b).m();
} else if (b.getClass() == Y.class) {
    ((Y) b).m();
} ...
During execution of profiled code (in the interpreter or C1), the JVM collects receiver type statistics per call site. These statistics are then used by the optimizing compiler (C2). There is just one call site in your example, so the statistics will be aggregated throughout the entire execution.
However, for example, if b[0] always has just two receivers X or Y, and b[1] always has another two receivers Z or W, JIT compiler may benefit from splitting the code into multiple call sites, i.e. manual unrolling:
int len = b.length;
if (len > 0) b[0].m();
if (len > 1) b[1].m();
if (len > 2) b[2].m();
...
This will split the type profile, so that b[0].m() and b[1].m() can be optimized individually.
These are low-level tricks that rely on the particular JVM implementation. In general, I would not recommend them for production code: the gains from these optimizations are fragile, while they definitely make the source code harder to read. After all, megamorphic calls are not that bad [2].
[1] https://shipilev.net/blog/2015/black-magic-method-dispatch/
[2] https://shipilev.net/jvm/anatomy-quarks/16-megamorphic-virtual-calls/

Read binary files without having them buffered in the volume block cache

Older, now deprecated, macOS file system APIs provided flags to read a file unbuffered.
I seek a modern way to accomplish the same, so that I can read a file's data into memory without it being cached needlessly somewhere else in memory (such as the volume cache).
Reading with fread and first calling setvbuf (fp, NULL, _IONBF, 0) is not having the desired effect in my tests, for example. I am seeking other low-level functions that let me read into a prepared memory buffer and that let me avoid buffering of the whole data.
Background
I am writing a file search program. It reads large amounts of file content (many GBs) that isn't and won't be used by the user otherwise. It would be a waste to have all this data cached in the volume cache as it'll soon get purged by further reads again, anyway. It'll also likely lead to purging file data that's actually in use by the user or system, causing more cache misses.
Therefore, I should be able to tell the system that I do not need the file data cached. The little caching needed for cluster boundaries is not an issue. What does not need to be cached is the many large chunks that I read briefly into memory in order to search them.
Two suggestions:
Use the read() system call instead of stdio.
Disable data caching with the F_NOCACHE option for fcntl().
In Swift that would be something like (error checking omitted for brevity):
import Foundation
let path = "/path/to/file"
let fd = open(path, O_RDONLY)
fcntl(fd, F_NOCACHE, 1)
var buffer = Data(count: 1024 * 1024)
buffer.withUnsafeMutableBytes { ptr in
    let amount = read(fd, ptr.baseAddress, ptr.count)
}
close(fd)
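For comparison, here is a minimal Python sketch of the same two suggestions (raw read() calls plus F_NOCACHE). The function name read_uncached and the 1 MiB chunk size are illustrative assumptions, not part of the answer above. Python's fcntl module only defines F_NOCACHE on platforms that have it (macOS), so the sketch looks it up with getattr and falls back to a plain unbuffered read loop elsewhere.

```python
import fcntl
import os


def read_uncached(path, chunk_size=1024 * 1024):
    """Yield the file's bytes in chunks, asking macOS not to cache them.

    Hypothetical helper for illustration only.
    """
    fd = os.open(path, os.O_RDONLY)
    try:
        # F_NOCACHE exists only where the OS defines it (macOS);
        # on other systems this degrades to an ordinary read loop.
        nocache = getattr(fcntl, "F_NOCACHE", None)
        if nocache is not None:
            fcntl.fcntl(fd, nocache, 1)
        while True:
            chunk = os.read(fd, chunk_size)  # direct read(2), no stdio buffer
            if not chunk:
                break
            yield chunk
    finally:
        os.close(fd)
```

A search tool would iterate over read_uncached(path) and scan each chunk, taking care of matches that straddle a chunk boundary.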

Is it safe, to share an array between threads?

Is it safe to share an array between promises the way I did in the following code?
#!/usr/bin/env perl6
use v6;
sub my_sub ( $string, $len ) {
    my ( $s, $l );
    if $string.chars > $len {
        $s = $string.substr( 0, $len );
        $l = $len;
    }
    else {
        $s = $string;
        $l = $s.chars;
    }
    return $s, $l;
}
my @orig = <length substring character subroutine control elements now promise>;
my $len = 7;
my @copy;
my @length;
my $cores = 4;
my $p = @orig.elems div $cores;
my @vb = ( 0..^$cores ).map: { [ $p * $_, $p * ( $_ + 1 ) ] };
@vb[@vb.end][1] = @orig.elems;
my @promise;
for @vb -> $r {
    @promise.push: start {
        for $r[0]..^$r[1] -> $i {
            ( @copy[$i], @length[$i] ) = my_sub( @orig[$i], $len );
        }
    };
}
await @promise;
It depends how you define "array" and "share". So far as array goes, there are two cases that need to be considered separately:
Fixed size arrays (declared my @a[$size]); this includes multi-dimensional arrays with fixed dimensions (such as my @a[$xs, $ys]). These have the interesting property that the memory backing them never has to be resized.
Dynamic arrays (declared my @a), which grow on demand. These are, under the hood, actually using a number of chunks of memory over time as they grow.
So far as sharing goes, there are also three cases:
The case where multiple threads touch the array over its lifetime, but only one can ever be touching it at a time, due to some concurrency control mechanism or the overall program structure. In this case the arrays are never shared in the sense of "concurrent operations using the arrays", so there's no possibility to have a data race.
The read-only, non-lazy case. This is where multiple concurrent operations access a non-lazy array, but only to read it.
The read/write case (including when reads actually cause a write because the array has been assigned something that demands lazy evaluation; note this can never happen for fixed size arrays, as they are never lazy).
Then we can summarize the safety as follows:
                     | Fixed size     | Variable size |
---------------------+----------------+---------------+
Read-only, non-lazy  | Safe           | Safe          |
Read/write or lazy   | Safe *         | Not safe      |
The * indicating the caveat that while it's safe from Perl 6's point of view, you of course have to make sure you're not doing conflicting things with the same indices.
So in summary, fixed size arrays you can safely share and assign to elements of from different threads "no problem" (but beware false sharing, which might make you pay a heavy performance penalty for doing so). For dynamic arrays, it is only safe if they will only be read from during the period they are being shared, and even then if they're not lazy (though given array assignment is mostly eager, you're not likely to hit that situation by accident). Writing, even to different elements, risks data loss, crashes, or other bad behavior due to the growing operation.
So, considering the original example, we see my @copy; and my @length; are dynamic arrays, so we must not write to them in concurrent operations. However, that is exactly what happens, so the code is not safe.
The other posts already here do a decent job of pointing in better directions, but none nailed the gory details.
Just have the code that is marked with the start statement prefix return the values so that Perl 6 can handle the synchronization for you. Which is the whole point of that feature.
Then you can wait for all of the Promises, and get all of the results using an await statement.
my @promise = do for @vb -> $r {
    start
        do # to have the 「for」 block return its values
            for $r[0]..^$r[1] -> $i {
                $i, my_sub( @orig[$i], $len )
            }
}
my @results = await @promise;
for @results -> ($i,$copy,$len) {
    @copy[$i] = $copy;
    @length[$i] = $len;
}
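The same pattern (tasks return their results, and only the waiting code writes them into the arrays) is not specific to Perl 6. As a rough illustration in Python, assuming a stand-in truncate function for my_sub and a hypothetical truncate_all driver, it could look like this:

```python
from concurrent.futures import ThreadPoolExecutor


def truncate(string, length):
    """Stand-in for my_sub: return the (possibly cut) string and its length."""
    s = string[:length]
    return s, len(s)


def process_chunk(words, start, stop, length):
    # Each task only reads `words` and returns its own results.
    return [(i, *truncate(words[i], length)) for i in range(start, stop)]


def truncate_all(words, length, workers=4):
    step = max(1, len(words) // workers)
    bounds = [(lo, min(lo + step, len(words))) for lo in range(0, len(words), step)]
    copy = [None] * len(words)
    lengths = [None] * len(words)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(process_chunk, words, lo, hi, length) for lo, hi in bounds]
        # Only the calling thread writes to the result lists.
        for future in futures:
            for i, s, n in future.result():
                copy[i], lengths[i] = s, n
    return copy, lengths
```

Because no worker ever touches copy or lengths, there is nothing to race on, whatever the runtime does underneath.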
The start statement prefix is only sort-of tangentially related to parallelism.
When you use it you are saying, “I don't need these results right now, but probably will later”.
That is the reason it returns a Promise (asynchrony), and not a Thread (concurrency).
The runtime is allowed to delay actually running that code until you finally ask for the results, and even then it could just do all of them sequentially in the same thread.
If the implementation actually did that, it could result in something like a deadlock if you instead poll the Promise by continually calling its .status method waiting for it to change from Planned to Kept or Broken, and only then ask for its result.
This is part of the reason the default scheduler will start to work on any Promise code if it has any spare threads.
I recommend watching jnthn's talk “Parallelism, Concurrency, and Asynchrony in Perl 6”.
slides
This answer applies to my understanding of the situation on MoarVM, not sure what the state of art is on the JVM backend (or the Javascript backend fwiw).
Reading a scalar from several threads can be done safely.
Modifying a scalar from several threads can be done without having to fear for a segfault, but you may miss updates:
$ perl6 -e 'my $i = 0; await do for ^10 { start { $i++ for ^10000 } }; say $i'
46785
The same applies to more complex data structures like arrays (e.g. missing values being pushed) and hashes (missing keys being added).
So, if you don't mind missing updates, changing shared data structures from several threads should work. If you do mind missing updates, which I think is what you generally want, you should look at setting up your algorithm in a different way, as suggested by @Zoffix Znet and @raiph.
No.
Seriously. Other answers seem to make too many assumptions about the implementation, none of which are tested by the spec.

File reading and checksums in go. Difference between methods

Recently I'm into creating checksums for files in go. My code is working with small and big files. I tried two methods, the first uses ioutil.ReadFile("filename") and the second is working with os.Open("filename").
Examples:
The first function uses io/ioutil and works for small files. When I try to copy a big file, my RAM gets blasted: for a 1.5GB iso it uses 3GB of RAM.
func byteCopy(fileToCopy string) {
    file, err := ioutil.ReadFile(fileToCopy) // 1.5GB file
    omg(err)                                 // error handling function
    ioutil.WriteFile("2.iso", file, 0777)
    os.Remove("2.iso")
}
Even worse when I want to create a checksum with crypto/sha512 and io/ioutil.
It will never finish and abort because it runs out of memory.
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}
When using the function below everything works fine.
func ioHash() {
    f, err := os.Open(iso) // iso is a big ~1.5GB file
    omg(err)               // error handling function
    defer f.Close()
    h := sha512.New()
    io.Copy(h, f)
    fmt.Printf("%x", h.Sum(nil))
}
My Question:
Why is the ioutil.ReadFile() function not working right? The 1.5GB file should not fill my 16GB of ram. I don't know where to look right now.
Could somebody explain the differences between the methods? I don't get it with reading the go-doc and examples.
Having usable code is nice, but understanding why its working is way above that.
Thanks in advance!
The following code doesn't do what you think it does.
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    fmt.Printf("%x", h.Sum(file))
}
This first reads your 1.5GB iso. As jnml pointed out, it continuously makes bigger and bigger buffers to hold it. In the end, the total buffer size is no less than 1.5GB and no greater than 1.875GB (by the current implementation).
However, after that you then make another buffer! h.Sum(file) doesn't hash file. It appends the current hash to file! This may or may not cause yet another allocation.
The real problem is that you are taking that file, now with the hash appended, and printing it with %x. fmt actually pre-computes its output using the same kind of method jnml pointed out that ioutil.ReadAll uses, so it keeps allocating bigger and bigger buffers to store the hex of your file. Since each hex character encodes only 4 bits, every byte of input takes two characters of output, which means a buffer of no less than 3GB and no greater than 3.75GB.
This means your active buffers may be as big as 5.625GB. Combine that with the GC not being perfect and not removing all the intermediate buffers, and it could very easily fill your space.
The correct way to write that code would have been:
func ioutilHash() {
    file, _ := ioutil.ReadFile(iso)
    h := sha512.New()
    h.Write(file)
    fmt.Printf("%x", h.Sum(nil))
}
This doesn't do nearly as many allocations.
The bottom line is that ReadFile is rarely what you want to use. IO streaming (using readers and writers) is always the best way when it is an option. Not only do you allocate much less when you use io.Copy, you also hash and read the disk concurrently. In your ReadFile example, the two resources are used one after the other even though they don't depend on each other.
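The streaming idea is language-independent. As a hedged illustration outside Go, here is a minimal Python sketch that hashes a file in fixed-size chunks, so memory use stays at roughly one chunk regardless of file size. The function name sha512_streaming and the 1 MiB chunk size are assumptions made for this example.

```python
import hashlib


def sha512_streaming(path, chunk_size=1024 * 1024):
    """Hash a file by feeding fixed-size chunks to the digest object."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        # iter() with a sentinel calls f.read(chunk_size) until it returns b"".
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()
```

The result is the same digest you would get from hashing the whole file in one call; only the peak memory differs.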
ioutil.ReadFile is working right. It's your fault to abuse the system resources by using that function for things you know are huge.
ioutil.ReadFile is a handy helper for files that you're pretty sure in advance are going to be small, like configuration files, most source code files, etc. (Actually it's optimized for files <= 1e9 bytes, but that's an implementation detail and not part of the API contract. Your 1.5GB file forces it to use slice growing and thus to allocate more than one big buffer for your data in the process of reading the file.)
Even your other approach using os.File is not okay. You definitely should be using the "bufio" package for sequential processing of large files, see bufio.NewReader.

Uniqueness of global Python objects void in sub-interpreters?

I have a question about the inner workings of Python sub-interpreter initialization (from the Python/C API) and the Python id() function. More precisely, about the handling of global module objects in WSGI Python containers (like uWSGI used with nginx, and mod_wsgi on Apache).
The following code works as expected (isolated) in both of the mentioned environments, but I cannot explain to myself why the id() function always returns the same value per variable, regardless of the process/sub-interpreter in which it is executed.
from __future__ import print_function
import os, sys
def log(*msg):
    print(">>>", *msg, file=sys.stderr)
class A:
    def __init__(self, x):
        self.x = x
    def __str__(self):
        return self.x
    def set(self, x):
        self.x = x
a = A("one")
log("class instantiated.")
def application(environ, start_response):
    output = "pid = %d\n" % os.getpid()
    output += "id(A) = %d\n" % id(A)
    output += "id(a) = %d\n" % id(a)
    output += "str(a) = %s\n\n" % a
    a.set("two")
    status = "200 OK"
    response_headers = [
        ('Content-type', 'text/plain'), ('Content-Length', str(len(output)))
    ]
    start_response(status, response_headers)
    return [output]
I have tested this code in uWSGI with one master process and 2 workers, and in mod_wsgi using daemon mode with two processes and one thread per process. The typical output is:
pid = 15278
id(A) = 139748093678128
id(a) = 139748093962360
str(a) = one
on first load, then:
pid = 15282
id(A) = 139748093678128
id(a) = 139748093962360
str(a) = one
on second, and then
pid = 15278 | pid = 15282
id(A) = 139748093678128
id(a) = 139748093962360
str(a) = two
on every other. As you can see, id() (memory location) of both the class and the class instance remains the same in both processes (first/second load above), while at the same time class instances live in a separate context (otherwise the second request would show "two" instead of "one")!
I suspect the answer might be hinted by Python docs:
id(object):
Return the “identity” of an object. This is an integer (or long integer) which
is guaranteed to be unique and constant for this object during its lifetime. Two
objects with non-overlapping lifetimes may have the same id() value.
But if that indeed is the reason, I'm troubled by the next statement that claims the id() value is object's address!
While I appreciate the fact this could very well be just a Python/C API "clever" feature that solves (or rather fixes) a problem of caching object references (pointers) in 3rd party extension modules, I still find this behavior to be inconsistent with... well, common sense. Could someone please explain this?
I've also noticed mod_wsgi imports the module in each process (i.e. twice), while uWSGI is importing the module only once for both processes. Since the uWSGI master process does the importing, I suppose it seeds the children with copies of that context. Both workers work independently afterwards (deep copy?), while at the same time using the same object addresses, seemingly. (Also, a worker gets reinitialized to the original context upon reload.)
I apologize for such a long post, but I wanted to give enough details.
Thank you!
It's not entirely clear what you're asking; I'd give a more concise answer if the question was more specific.
First, the id of an object is, in fact--at least in CPython--its address in memory. That's perfectly normal: two objects in the same process at the same time can't share an address, and an object's address never changes in CPython, so the address works neatly as an id. I don't know how this violates common sense.
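A small sketch can make the "id is the address" point concrete. It relies on a CPython implementation detail and on ctypes, and the helper name object_at is made up for this illustration; it is not something to use in real code.

```python
import ctypes


def object_at(address):
    """CPython-only: treat an integer as a PyObject* and return that object."""
    return ctypes.cast(address, ctypes.py_object).value


if __name__ == "__main__":
    a = ["one"]
    print(hex(id(a)))             # an address-like number; varies from run to run
    print(object_at(id(a)) is a)  # the very same object, reached via its address
```

The address is only meaningful inside the process that owns the object, which is why equal id() values in two different processes say nothing about the objects being shared.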
Next, note that a backend process may be spawned in two very distinct ways:
A generic WSGI backend handler will fork processes, and then each of the processes will start a backend. This is simple and language-agnostic, but wastes a lot of memory and wastes time loading the backend code repeatedly.
A more advanced backend will load the Python code once, and then fork copies of the server after it's loaded. This causes the code to be loaded only once, which is much faster and reduces memory waste significantly. This is how production-quality WSGI servers work.
However, the end result in both of these cases is identical: separate, forked processes.
So, why are you ending up with the same IDs? That depends on which of the above methods is in use.
With a generic WSGI handler, it's happening simply because each process is doing essentially the same thing. So long as processes are doing the same thing, they'll tend to end up with the same IDs; at some point they'll diverge and this will no longer happen.
With a pre-loading backend, it's happening because this initial code happens only once, before the server forks, so it's guaranteed to have the same ID.
However, either way, once the fork happens they're separate objects, in separate contexts. There's no significance to objects in separate processes having the same ID.
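The pre-loading case can be reproduced in a few lines. The following is a minimal Python 3 sketch for POSIX systems; the class A mirrors the question, while fork_and_compare and the pipe-based reporting are assumptions added for the demonstration. The object is created before the fork, the child mutates it, and both sides report what they see.

```python
import os


class A:
    def __init__(self, x):
        self.x = x


def fork_and_compare():
    """Fork once; return (parent_id, child_id, parent_value, child_value)."""
    a = A("one")                  # created before the fork, like a preloaded global
    read_fd, write_fd = os.pipe()
    pid = os.fork()
    if pid == 0:                  # child process
        try:
            os.close(read_fd)
            a.x = "two"           # changes only the child's private copy
            os.write(write_fd, f"{id(a)} {a.x}".encode())
            os.close(write_fd)
        finally:
            os._exit(0)           # never fall through into the parent's code
    os.close(write_fd)            # parent process
    with os.fdopen(read_fd) as reader:
        child_id, child_value = reader.read().split()
    os.waitpid(pid, 0)
    return id(a), int(child_id), a.x, child_value


if __name__ == "__main__":
    parent_id, child_id, parent_value, child_value = fork_and_compare()
    print(parent_id == child_id)      # same virtual address in both processes
    print(parent_value, child_value)  # yet the two objects have diverged
```

The ids match because fork duplicates the address space, while the values differ because each process writes to its own copy of the page.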
This is simple to explain by way of a demonstration. You see, when uwsgi creates a new process, it forks the interpreter. Now, forks have interesting memory properties:
import os, time
if os.fork() == 0:
    print "child first " + str(hex(id(os)))
    time.sleep(2)
    os.attr = 'test'
    print "child second " + str(hex(id(os)))
else:
    time.sleep(1)
    print "parent first " + str(hex(id(os)))
    time.sleep(2)
    print "parent second " + str(hex(id(os)))
    print os.attr
Output:
child first 0xb782414cL
parent first 0xb782414cL
child second 0xb782414cL
parent second 0xb782414cL
Traceback (most recent call last):
File "test.py", line 13, in <module>
print os.attr
AttributeError: 'module' object has no attribute 'attr'
Although the objects appear to reside at the same memory address, they are different objects; this is the doing of the OS, not of Python.
edit: I suspect the reason that mod_wsgi imports twice is that it creates further processes via calling python rather than forking. uwsgi's approach is better because it can use less memory. fork's page sharing is COW (copy on write).