Uniqueness of global Python objects void in sub-interpreters? - mod-wsgi

I have a question about the inner workings of Python sub-interpreter initialization (from the Python/C API) and the Python id() function. More precisely, it's about the handling of global module objects in WSGI Python containers (like uWSGI used with nginx, and mod_wsgi on Apache).
The following code works as expected (isolated) in both of the mentioned environments, but I cannot explain to myself why the id() function always returns the same value per variable, regardless of the process/sub-interpreter in which it is executed.
from __future__ import print_function
import os, sys

def log(*msg):
    print(">>>", *msg, file=sys.stderr)

class A:
    def __init__(self, x):
        self.x = x

    def __str__(self):
        return self.x

    def set(self, x):
        self.x = x

a = A("one")
log("class instantiated.")

def application(environ, start_response):
    output = "pid = %d\n" % os.getpid()
    output += "id(A) = %d\n" % id(A)
    output += "id(a) = %d\n" % id(a)
    output += "str(a) = %s\n\n" % a

    a.set("two")

    status = "200 OK"
    response_headers = [
        ('Content-type', 'text/plain'), ('Content-Length', str(len(output)))
    ]
    start_response(status, response_headers)

    return [output]
I have tested this code in uWSGI with one master process and 2 workers, and in mod_wsgi using daemon mode with two processes and one thread per process. The typical output is:
pid = 15278
id(A) = 139748093678128
id(a) = 139748093962360
str(a) = one
on first load, then:
pid = 15282
id(A) = 139748093678128
id(a) = 139748093962360
str(a) = one
on second, and then
pid = 15278 | pid = 15282
id(A) = 139748093678128
id(a) = 139748093962360
str(a) = two
on every other. As you can see, the id() (memory location) of both the class and the class instance remains the same in both processes (first/second load above), while at the same time the class instances live in separate contexts (otherwise the second request would show "two" instead of "one")!
I suspect the answer might be hinted by Python docs:
id(object):
Return the “identity” of an object. This is an integer (or long integer) which
is guaranteed to be unique and constant for this object during its lifetime. Two
objects with non-overlapping lifetimes may have the same id() value.
But if that is indeed the reason, I'm troubled by the next statement, which claims the id() value is the object's address!
While I appreciate the fact this could very well be just a Python/C API "clever" feature that solves (or rather fixes) a problem of caching object references (pointers) in 3rd party extension modules, I still find this behavior to be inconsistent with... well, common sense. Could someone please explain this?
I've also noticed mod_wsgi imports the module in each process (i.e. twice), while uWSGI is importing the module only once for both processes. Since the uWSGI master process does the importing, I suppose it seeds the children with copies of that context. Both workers work independently afterwards (deep copy?), while at the same time using the same object addresses, seemingly. (Also, a worker gets reinitialized to the original context upon reload.)
I apologize for such a long post, but I wanted to give enough details.
Thank you!

It's not entirely clear what you're asking; I'd give a more concise answer if the question was more specific.
First, the id of an object is, in fact--at least in CPython--its address in memory. That's perfectly normal: two objects in the same process at the same time can't share an address, and an object's address never changes in CPython, so the address works neatly as an id. I don't know how this violates common sense.
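For example, in CPython you can see this directly. A quick, hedged demonstration (the ctypes trick below relies on CPython implementation details and is for illustration only):

import ctypes

x = object()
print(hex(id(x)))  # in CPython this is the object's memory address, e.g. 0x7f...
# reinterpret that integer as a PyObject pointer: we get the very same object back
print(ctypes.cast(id(x), ctypes.py_object).value is x)  # True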
Next, note that a backend process may be spawned in two very distinct ways:
A generic WSGI backend handler will fork processes, and then each of the processes will start a backend. This is simple and language-agnostic, but wastes a lot of memory and wastes time loading the backend code repeatedly.
A more advanced backend will load the Python code once, and then fork copies of the server after it's loaded. This causes the code to be loaded only once, which is much faster and reduces memory waste significantly. This is how production-quality WSGI servers work.
However, the end result in both of these cases is identical: separate, forked processes.
So, why are you ending up with the same IDs? That depends on which of the above methods is in use.
With a generic WSGI handler, it's happening simply because each process is doing essentially the same thing. So long as processes are doing the same thing, they'll tend to end up with the same IDs; at some point they'll diverge and this will no longer happen.
With a pre-loading backend, it's happening because this initial code happens only once, before the server forks, so it's guaranteed to have the same ID.
However, either way, once the fork happens they're separate objects, in separate contexts. There's no significance to objects in separate processes having the same ID.

This is simple to explain by way of a demonstration. You see, when uwsgi creates a new process, it forks the interpreter. Now, forks have interesting memory properties:
import os, time

if os.fork() == 0:
    print "child first " + str(hex(id(os)))
    time.sleep(2)
    os.attr = 'test'
    print "child second " + str(hex(id(os)))
else:
    time.sleep(1)
    print "parent first " + str(hex(id(os)))
    time.sleep(2)
    print "parent second " + str(hex(id(os)))
    print os.attr
Output:
child first 0xb782414cL
parent first 0xb782414cL
child second 0xb782414cL
parent second 0xb782414cL
Traceback (most recent call last):
  File "test.py", line 13, in <module>
    print os.attr
AttributeError: 'module' object has no attribute 'attr'
Although the objects appear to reside at the same memory address, they are different objects; that is not Python's doing but the operating system's.
edit: I suspect the reason mod_wsgi imports the module twice is that it creates further processes by invoking python rather than by forking. uWSGI's approach is better because it can use less memory; fork's page sharing is copy-on-write (COW).

Related

Modify the stack of a PLY parser

Trying to modify a value already pushed onto the stack of a PLY/Yacc parser. I'm using PLY on Python 3.
Basically I want to invert the previous two values when a token SWAP is used.
Imagine we have this stack:
1, 2, 3, 4, SWAP
I need it to reduce to:
1, 2, 4, 3
The value you write to p[0] will be pushed onto the stack, but how can I push more than one value?
# this fails because it consumes two values and pushes only one
# results in: `1`, `2`, `4`
def p_swap(p):
    'value : value value SWAP'
    p[0] = p[2]

# this was just a try... fails as well
def p_swap(p):
    'value : value value SWAP'
    p[0] = p[2]
    p[1] = p[1]

# this looked like a good idea since it consumes only one value and modifies the second in place
# it fails because the stack (negative indexes) is immutable:
# https://github.com/dabeaz/ply/blob/master/ply/yacc.py#L234
# results in: `1`, `2`, `3`, `3`
def p_swap(p):
    'value : value SWAP'
    p[0] = p[-1]
    p[-1] = p[1]  # this is a NOP
p is an instance of this class
I guess it was designed to be immutable to enforce the parsing to be done a certain way (the correct way), but I'm missing it: what's the correct way to modify the stack or to design a parser?
It sounds like you're trying to create a stack-based language like Forth or Joy. If so, you shouldn't need a bottom-up parser, and you shouldn't be surprised that a bottom-up parser-generator doesn't work the way you want it to.
Stack-based languages are mostly simply streams of tokens. Each token has some kind of stack effect, and they are just applied in sequence; there's usually little or no syntactic structure beyond that. Consequently, the languages really aren't parsed; at best, they are tokenised.
Most stack-based languages contain some kind of nested control structures which are not strictly conformant with the above (but not all; see Postscript, for example). But even these are so simple that a real parser is unnecessary.
Of course, nothing stops you from using a generated parser to parse a trivial language. But if you do that, you should certainly not expect to be able to gain access to the parser's internal datastructures. The parser stack is used by the parser in ways which might not be fully obvious, and which certainly must not be interfered with. If you want to implement a stack-based language interpreter, you need to use your own value stack. (Or stacks; many stack-based languages have several different stacks, each with its own semantics.)
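For illustration, a minimal sketch of that idea (plain Python, no PLY; the run helper and the integer-only tokens are assumptions made just for this example): each token is applied in turn to an explicit value stack, and SWAP simply exchanges the top two entries.

def run(tokens):
    stack = []
    for tok in tokens:
        if tok == 'SWAP':
            # SWAP's stack effect: exchange the two topmost values
            stack[-2], stack[-1] = stack[-1], stack[-2]
        else:
            stack.append(int(tok))
    return stack

print(run("1 2 3 4 SWAP".split()))  # [1, 2, 4, 3]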

PL_strtab/SHAREKEYS and copy-on-write leak

Perl internally uses a dedicated hash, PL_strtab, as shared storage for hash keys, but in a fork environment like apache/mod_perl this creates a big issue. Best practice says to preload modules in the parent process, but nobody mentions that this eventually allocates memory for PL_strtab, and these pages of memory tend to be implicitly modified in child processes. There seem to be two major reasons for the modification:
Reason 1: reallocation (hsplit()) may happen when PL_strtab grows in a child process.
Reason 2: REFCNT is updated every time a new reference is created.
The example below shows a 16 MB copy-on-write leak on an attempt to use a hash. Attempts to recompile perl with -DNODEFAULT_SHAREKEYS fail (https://rt.perl.org/SelfService/Display.html?id=133384). I was able to get access to PL_strtab via an XS module.
Ideally I'm looking for a way to downgrade all hashes created in the parent so that they keep their keys within the hash itself (HE object) rather than in PL_strtab, i.e. to turn off the SHAREKEYS flag. This should allow shrinking PL_strtab to the minimum possible size; ideally it should have 0 keys in the parent.
Please let me know if you think this is theoretically possible via XS.
#!/usr/bin/env perl

use strict;
use warnings;

use Linux::Smaps;

$SIG{CHLD} = sub { waitpid(-1, 1) };

# comment this block
{
    my %h;
    # pre-grow PL_strtab hash, kind of: keys %$PL_strtab = 2_000_000;
    foreach my $x (1 .. 2_000_000) {
        $h{$x} = undef;
    }
}

my $pid = fork // die "Cannot fork: $!";

unless ($pid) {
    # child
    my $s      = Linux::Smaps->new($$)->all;
    my $before = $s->shared_clean + $s->shared_dirty;

    {
        my %h;
        foreach my $x (1 .. 2_000_000) {
            $h{$x} = undef;
        }
    }

    my $s2    = Linux::Smaps->new($$)->all;
    my $after = $s2->shared_clean + $s2->shared_dirty;

    warn 'COPY-ON-WRITE: ' . ($before - $after) . ' KB';

    exit 0;
}

sleep 1000;

print "DONE\n";
Note that the sample %h in the parent gets destroyed and is not accessible in the child. Its only purpose is to preallocate more memory for PL_strtab and make the copy-on-write issue more noticeable.
The problem is that PL_strtab is a shared data structure (not %h). It is controlled solely by Perl, and there is no way to manage it with IPC::Shareable or any other CPAN module known to me.
Real-life example:
In apache/mod_perl, Starman or any other prefork environment, everybody tries to preload as many modules as possible in the parent process. Right?
If any of the preloaded modules creates a hash (even a temporary one) with a big number of keys, Perl silently allocates more and more memory for the internal PL_strtab hash.
PL_strtab then silently gets touched in the children on any attempt to use hashes.
The problem is even worse because a huge percentage of the modules we preload are CPAN modules, so there is no way to know which of them overuse hashes, resulting in an increased memory footprint of the parent process.

How to find large objects in ZODB

I'm trying to analyze my ZODB because it grew really large (it's also large after packing).
The package zodbbrowser has a feature that displays the amount of bytes of an object. It does so by getting the length of the pickled state (name of the variable), but it also does a bit of magic which I don't fully understand.
How would I go to find the largest objects in my ZODB?
I've written a method which should do exactly this. Feel free to use it, but be aware that this is very memory consuming. The package zodbbrowser must be installed.
def zodb_objects_by_size(self):
    """
    Recurse over the ZODB tree starting from self.aq_parent. For
    each object, use zodbbrowser's implementation to get the raw
    object state. Put each length into a Counter object and
    return a list of the biggest objects, specified by path and
    size.
    """
    from zodbbrowser.history import ZodbObjectHistory
    from collections import Counter

    def recurse(obj, results):
        # Retrieve state pickle from ZODB, get length
        history = ZodbObjectHistory(obj)
        pstate = history.loadStatePickle()
        length = len(pstate)

        # Add length to Counter instance
        path = '/'.join(obj.getPhysicalPath())
        results[path] = length

        # Recursion
        for child in obj.contentValues():
            # Play around portal tools and other weird objects which
            # seem to contain themselves
            if child.contentValues() == obj.contentValues():
                continue
            # Rolling in the deep
            try:
                recurse(child, results)
            except (RuntimeError, AttributeError), err:
                import pdb; pdb.set_trace()  ## go debug

    results = Counter()
    recurse(self.aq_parent, results)
    return results.most_common()
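A usage sketch (hypothetical; portal is just a placeholder for whatever acquisition-aware object the method is bound to, since it relies on self.aq_parent):

# print the ten largest objects by pickle size
for path, size in portal.zodb_objects_by_size()[:10]:
    print('%10d bytes  %s' % (size, path))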

QThread how to use readwritelock on shared list in a tableview model?

I have a QTableView which stores its data as a list; the list is the backing data for a model.
self.shots = [{'name': 'abc010', 'taskdir': '/show/abc/abc010', 'file': 'xxx.ma'},
              {'name': 'abc020', 'taskdir': '/show/abc/abc020', 'file': 'yyy.ma'},
              ... ]
The name, taskdir and file attributes come from 3 separate QThreads. When I press a button, 3 threads are created and they fetch the result for the first element of my self.shots list, running one by one. When I hit the button a second time, another 3 threads fill in the second element (a dictionary) of my list, and so on.
So essentially my question is: do I need to use a read/write lock in this case? My threads are writing to the same list (because I may press the button a second time while the first 3 threads are still running).
Currently I'm getting segfaults randomly without using any read/write locks. Is this the reason for the crashes?
Thanks if anyone can give me pseudocode showing how to use the read/write lock.
I'm using this generic thread class to create my threads.
class GenericThread(QThread):
    def __init__(self, function, *args, **kwargs):
        QThread.__init__(self)
        # super(GenericThread, self).__init__()
        self.function = function
        self.args = args
        self.kwargs = kwargs

    def __del__(self):
        self.wait()

    def run(self, *args):
        self.function(*self.args, **self.kwargs)
Can you provide some more code? It's not clear to me how the threads communicate.
Some remarks:
You should store the data in the model, not in the view. That's the whole point of using a model/view design, it eliminates data consistency problems and makes it possible to have multiple views of the same model. Qt Model/View Tutorial
You should not subclass QThread, but instead use moveToThread(QThread*) to move an object with the required functionality to the thread. How To Really, Truly Use QThreads
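A rough sketch of that worker-object pattern (hedged: it assumes PyQt4, and the Worker class, its resultReady signal and the model's updateShot slot are names invented for this example, not part of your code):

from PyQt4.QtCore import QObject, QThread, pyqtSignal

class Worker(QObject):
    # carries the row index and a dict like {'name': ..., 'taskdir': ..., 'file': ...}
    resultReady = pyqtSignal(int, dict)

    def __init__(self, row, shot_name):
        super(Worker, self).__init__()
        self.row = row
        self.shot_name = shot_name

    def doWork(self):
        # runs in the worker thread; look up name/taskdir/file here
        data = {'name': self.shot_name, 'taskdir': '...', 'file': '...'}
        self.resultReady.emit(self.row, data)

# in the GUI code (main thread):
#   worker = Worker(row, 'abc010')
#   thread = QThread()
#   worker.moveToThread(thread)
#   thread.started.connect(worker.doWork)
#   worker.resultReady.connect(model.updateShot)  # delivered as a queued call in the GUI thread
#   worker.resultReady.connect(thread.quit)
#   thread.start()
#
# in the model, updateShot(row, data) updates self.shots[row] and emits
# dataChanged for that row.

Because the signal crosses threads, Qt queues the call, so every write to self.shots happens in the GUI thread and the race (and the need for a read/write lock) goes away.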

Why is iterating through a large Django QuerySet consuming massive amounts of memory?

The table in question contains roughly ten million rows.
for event in Event.objects.all():
    print event
This causes memory usage to increase steadily to 4 GB or so, at which point the rows print rapidly. The lengthy delay before the first row printed surprised me – I expected it to print almost instantly.
I also tried Event.objects.iterator() which behaved the same way.
I don't understand what Django is loading into memory or why it is doing this. I expected Django to iterate through the results at the database level, which'd mean the results would be printed at roughly a constant rate (rather than all at once after a lengthy wait).
What have I misunderstood?
(I don't know whether it's relevant, but I'm using PostgreSQL.)
Nate C was close, but not quite.
From the docs:
You can evaluate a QuerySet in the following ways:
Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:
for e in Entry.objects.all():
    print e.headline
So your ten million rows are retrieved, all at once, when you first enter that loop and get the iterating form of the queryset. The wait you experience is Django loading the database rows and creating objects for each one, before returning something you can actually iterate over. Then you have everything in memory, and the results come spilling out.
From my reading of the docs, iterator() does nothing more than bypass the QuerySet's internal caching mechanisms. I think it might make sense for it to do a one-by-one thing, but that would conversely require ten million individual hits on your database. Maybe not all that desirable.
Iterating over large datasets efficiently is something we still haven't gotten quite right, but there are some snippets out there you might find useful for your purposes:
Memory Efficient Django QuerySet iterator
batch querysets
QuerySet Foreach
It might not be the fastest or most efficient, but as a ready-made solution why not use Django core's Paginator and Page objects, documented here:
https://docs.djangoproject.com/en/dev/topics/pagination/
Something like this:
from django.core.paginator import Paginator
from djangoapp.models import model

paginator = Paginator(model.objects.all(), 1000)  # chunks of 1000, you can
                                                  # change this to desired chunk size

for page in range(1, paginator.num_pages + 1):
    for row in paginator.page(page).object_list:
        # here you can do whatever you want with the row
        pass
    print "done processing page %s" % page
Django's default behavior is to cache the whole result of the QuerySet when it evaluates the query. You can use the QuerySet's iterator method to avoid this caching:
for event in Event.objects.all().iterator():
    print event
https://docs.djangoproject.com/en/stable/ref/models/querysets/#iterator
The iterator() method evaluates the queryset and then reads the results directly without doing caching at the QuerySet level. This method results in better performance and a significant reduction in memory when iterating over a large number of objects that you only need to access once. Note that caching is still done at the database level.
Using iterator() reduces memory usage for me, but it is still higher than I expected. Using the paginator approach suggested by mpaf uses much less memory, but is 2-3x slower for my test case.
from django.core.paginator import Paginator
def chunked_iterator(queryset, chunk_size=10000):
paginator = Paginator(queryset, chunk_size)
for page in range(1, paginator.num_pages + 1):
for obj in paginator.page(page).object_list:
yield obj
for event in chunked_iterator(Event.objects.all()):
print event
For large numbers of records, a database cursor performs even better. You do need raw SQL in Django; the Django cursor is something different from a SQL cursor.
The LIMIT - OFFSET method suggested by Nate C might be good enough for your situation. For large amounts of data it is slower than a cursor because it has to run the same query over and over again and has to jump over more and more results.
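A rough sketch of the cursor approach (hedged: it bypasses the ORM and talks to PostgreSQL through psycopg2 directly; the connection parameters, table and column names are placeholders):

import psycopg2

conn = psycopg2.connect("dbname=mydb user=myuser")
cur = conn.cursor(name='event_stream')  # a named cursor is server-side in psycopg2
cur.itersize = 2000                     # rows fetched per network round trip
cur.execute("SELECT id, title FROM myapp_event")
for row in cur:
    print(row)                          # only a small window of rows is in memory
cur.close()
conn.close()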
Django doesn't have a good solution for fetching a large number of items from the database.
import gc

# Get the events in reverse order
eids = Event.objects.order_by("-id").values_list("id", flat=True)

for index, eid in enumerate(eids):
    event = Event.objects.get(id=eid)
    # do necessary work with event
    if index % 100 == 0:
        gc.collect()
        print("completed 100 items")
values_list can be used to fetch all the ids in the database and then fetch each object separately. Over time, large objects are created in memory and won't be garbage collected until the for loop exits. The code above does manual garbage collection after every 100th item is consumed.
This is from the docs:
http://docs.djangoproject.com/en/dev/ref/models/querysets/
No database activity actually occurs until you do something to evaluate the queryset.
So when print event is run, the query fires (which is a full table scan according to your command) and loads the results. You're asking for all the objects, and there is no way to get the first object without getting all of them.
But if you do something like:
Event.objects.all()[300:900]
http://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets
Then it will add OFFSET and LIMIT to the SQL internally.
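For instance, a small sketch of how you might apply that (just an assumption: slicing inside a loop, so each iteration issues its own LIMIT/OFFSET query instead of loading everything at once):

chunk_size = 1000
total = Event.objects.count()
for offset in range(0, total, chunk_size):
    for event in Event.objects.all()[offset:offset + chunk_size]:
        print event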
A massive amount of memory gets consumed before the queryset can be iterated, because all the database rows for the whole query get processed into objects at once, and that can be a lot of processing depending on the number of rows.
You can chunk up your queryset into smaller digestible bits. I call the pattern to do this "spoonfeeding". Here's an implementation with a progress bar that I use in my management commands; first pip3 install tqdm:
from tqdm import tqdm

def spoonfeed(qs, func, chunk=1000, start=0):
    """
    Chunk up a large queryset and run func on each item.

    Works with automatic primary key fields.

    chunk -- how many objects to take on at once
    start -- PK to start from

    >>> spoonfeed(Spam.objects.all(), nom_nom)
    """
    end = qs.order_by('pk').last()
    progressbar = tqdm(total=qs.count())
    if not end:
        return
    while start < end.pk:
        for o in qs.filter(pk__gt=start, pk__lte=start + chunk):
            func(o)
            progressbar.update(1)
        start += chunk
    progressbar.close()
To use this you write a function that does operations on your object:
def set_population(town):
    town.population = calculate_population(...)
    town.save()
and then run that function on your queryset:
spoonfeed(Town.objects.all(), set_population)
Here is a solution including len and count:
class GeneratorWithLen(object):
    """
    Generator that includes len and count for given queryset
    """
    def __init__(self, generator, length):
        self.generator = generator
        self.length = length

    def __len__(self):
        return self.length

    def __iter__(self):
        return self.generator

    def __getitem__(self, item):
        return self.generator.__getitem__(item)

    def next(self):
        return next(self.generator)

    def count(self):
        return self.__len__()


def batch(queryset, batch_size=1024):
    """
    returns a generator that does not cache results on the QuerySet
    Aimed to use with expected HUGE/ENORMOUS data sets, no caching, no memory used more than batch_size

    :param batch_size: Size for the maximum chunk of data in memory
    :return: generator
    """
    total = queryset.count()

    def batch_qs(_qs, _batch_size=batch_size):
        """
        Returns a (start, end, total, queryset) tuple for each batch in the given
        queryset.
        """
        for start in range(0, total, _batch_size):
            end = min(start + _batch_size, total)
            yield (start, end, total, _qs[start:end])

    def generate_items():
        # order_by() returns a new queryset, so bind it:
        # clearing the ordering (ordered by id if the PK is autoincremental)
        unordered_qs = queryset.order_by()
        for start, end, total, qs in batch_qs(unordered_qs):
            for item in qs:
                yield item

    return GeneratorWithLen(generate_items(), total)
Usage:
events = batch(Event.objects.all())
len(events) == events.count()
for event in events:
    # Do something with the Event
    pass
There are a lot of outdated results here. Not sure when it was added, but Django's QuerySet.iterator() method uses a server-side cursor with a chunk size, to stream results from the database. So if you're using postgres, this should now be handled out of the box for you.
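For example (assuming Django 2.0+, where iterator() accepts a chunk_size, and a PostgreSQL backend so a server-side cursor is used):

for event in Event.objects.all().iterator(chunk_size=2000):
    print(event)  # rows are streamed from the database in chunks of 2000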
I usually use a raw MySQL query instead of the Django ORM for this kind of task.
MySQL supports streaming mode, so we can loop through all records safely and quickly without running out of memory.
import MySQLdb

db_config = {}  # config your db here
connection = MySQLdb.connect(
    host=db_config['HOST'], user=db_config['USER'],
    port=int(db_config['PORT']), passwd=db_config['PASSWORD'], db=db_config['NAME'])
cursor = MySQLdb.cursors.SSCursor(connection)  # SSCursor for streaming mode
cursor.execute("SELECT * FROM event")
while True:
    record = cursor.fetchone()
    if record is None:
        break
    # Do something with record here

cursor.close()
connection.close()
Ref:
Retrieving million of rows from MySQL
How does MySQL result set streaming perform vs fetching the whole JDBC ResultSet at once