Why is iterating through a large Django QuerySet consuming massive amounts of memory?

The table in question contains roughly ten million rows.
for event in Event.objects.all():
    print(event)
This causes memory usage to increase steadily to 4 GB or so, at which point the rows print rapidly. The lengthy delay before the first row printed surprised me – I expected it to print almost instantly.
I also tried Event.objects.iterator() which behaved the same way.
I don't understand what Django is loading into memory or why it is doing this. I expected Django to iterate through the results at the database level, which would mean the results would be printed at roughly a constant rate (rather than all at once after a lengthy wait).
What have I misunderstood?
(I don't know whether it's relevant, but I'm using PostgreSQL.)

Nate C was close, but not quite.
From the docs:
You can evaluate a QuerySet in the following ways:
Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:
for e in Entry.objects.all():
    print(e.headline)
So your ten million rows are retrieved, all at once, when you first enter that loop and get the iterating form of the queryset. The wait you experience is Django loading the database rows and creating objects for each one, before returning something you can actually iterate over. Then you have everything in memory, and the results come spilling out.
From my reading of the docs, iterator() does nothing more than bypass the QuerySet's internal caching mechanisms. I think it might make sense for it to do a one-by-one thing, but that would conversely require ten million individual hits on your database. Maybe not all that desirable.
Iterating over large datasets efficiently is something we still haven't gotten quite right, but there are some snippets out there you might find useful for your purposes (a minimal chunking sketch follows the list):
Memory Efficient Django QuerySet iterator
batch querysets
QuerySet Foreach
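In the same spirit as those snippets, here is a minimal, hypothetical sketch of chunking on the primary key; the helper name is mine and it assumes an auto-incrementing integer pk:
def pk_chunked(queryset, chunk_size=1000):
    """Yield objects in chunks keyed on the primary key, avoiding large OFFSETs."""
    last_pk = 0
    while True:
        chunk = list(queryset.filter(pk__gt=last_pk).order_by('pk')[:chunk_size])
        if not chunk:
            break
        for obj in chunk:
            yield obj
        last_pk = chunk[-1].pk

for event in pk_chunked(Event.objects.all()):
    print(event)
Each chunk is a separate LIMIT-ed query, so only chunk_size objects live in memory at a time.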

It might not be the fastest or most efficient approach, but as a ready-made solution why not use Django core's Paginator and Page objects, documented here:
https://docs.djangoproject.com/en/dev/topics/pagination/
Something like this:
from django.core.paginator import Paginator
from djangoapp.models import model

paginator = Paginator(model.objects.all(), 1000)  # chunks of 1000; change this to the desired chunk size
for page in range(1, paginator.num_pages + 1):
    for row in paginator.page(page).object_list:
        # here you can do whatever you want with the row
        pass
    print("done processing page %s" % page)

Django's default behavior is to cache the whole result of the QuerySet when it evaluates the query. You can use the QuerySet's iterator method to avoid this caching:
for event in Event.objects.all().iterator():
    print(event)
https://docs.djangoproject.com/en/stable/ref/models/querysets/#iterator
The iterator() method evaluates the queryset and then reads the results directly without doing caching at the QuerySet level. This method results in better performance and a significant reduction in memory when iterating over a large number of objects that you only need to access once. Note that caching is still done at the database level.
Using iterator() reduces memory usage for me, but it is still higher than I expected. Using the paginator approach suggested by mpaf uses much less memory, but is 2-3x slower for my test case.
from django.core.paginator import Paginator

def chunked_iterator(queryset, chunk_size=10000):
    paginator = Paginator(queryset, chunk_size)
    for page in range(1, paginator.num_pages + 1):
        for obj in paginator.page(page).object_list:
            yield obj

for event in chunked_iterator(Event.objects.all()):
    print(event)

For large numbers of records, a database cursor performs even better. You do need raw SQL in Django; the Django cursor is something different from a SQL cursor.
The LIMIT/OFFSET method suggested by Nate C might be good enough for your situation. For large amounts of data it is slower than a cursor because it has to run the same query over and over again and has to jump over more and more results.
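As a rough illustration (my own sketch, not part of the original answer), one way to reach a PostgreSQL server-side cursor from Django is to drop down to the underlying psycopg2 connection; the cursor name, the column list, and the table name (myapp_event, a guess at Django's default naming) are assumptions:
from django.db import connection

def stream_events(chunk_size=2000):
    connection.ensure_connection()
    # connection.connection is the underlying psycopg2 connection;
    # giving the cursor a name makes it a server-side cursor.
    with connection.connection.cursor(name='event_stream') as cursor:
        cursor.itersize = chunk_size  # rows fetched from the server per round trip
        cursor.execute('SELECT * FROM myapp_event')
        for row in cursor:
            yield row
The rows come back as plain tuples rather than Event instances, which is part of why this stays so lean.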

Django doesn't have a good solution for fetching large numbers of items from the database.
import gc

# Get the events in reverse order
eids = Event.objects.order_by("-id").values_list("id", flat=True)
for index, eid in enumerate(eids):
    event = Event.objects.get(id=eid)
    # do necessary work with event
    if index % 100 == 0:
        gc.collect()
        print("completed 100 items")
values_list can be used to fetch all the ids from the database and then fetch each object separately. Over time, large objects accumulate in memory and won't be garbage collected until the for loop exits. The code above performs manual garbage collection after every 100th item is consumed.

This is from the docs:
http://docs.djangoproject.com/en/dev/ref/models/querysets/
No database activity actually occurs until you do something to evaluate the queryset.
So when print(event) runs, the query fires (which is a full table scan, given your command) and loads the results. You're asking for all the objects, and there is no way to get the first object without getting all of them.
But if you do something like:
Event.objects.all()[300:900]
http://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets
Then it will add OFFSET and LIMIT to the SQL internally.
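Building on that, a minimal sketch (mine, not the answerer's) of walking the table in LIMIT/OFFSET slices:
chunk_size = 1000
offset = 0
while True:
    batch = list(Event.objects.all().order_by('id')[offset:offset + chunk_size])  # becomes LIMIT/OFFSET in SQL
    if not batch:
        break
    for event in batch:
        print(event)
    offset += chunk_size
The explicit order_by() keeps the slices stable; without it, PostgreSQL is free to return rows in a different order for each query.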

A massive amount of memory gets consumed before the queryset can be iterated over, because all the database rows for the whole query are processed into objects at once, and that can be a lot of processing depending on the number of rows.
You can chunk up your queryset into smaller, digestible bits. I call the pattern for doing this "spoonfeeding". Here's an implementation with a progress bar that I use in my management commands; first pip3 install tqdm:
from tqdm import tqdm

def spoonfeed(qs, func, chunk=1000, start=0):
    """
    Chunk up a large queryset and run func on each item.
    Works with automatic primary key fields.
    chunk -- how many objects to take on at once
    start -- PK to start from
    >>> spoonfeed(Spam.objects.all(), nom_nom)
    """
    end = qs.order_by('pk').last()
    progressbar = tqdm(total=qs.count())
    if not end:
        return
    while start < end.pk:
        for o in qs.filter(pk__gt=start, pk__lte=start + chunk):
            func(o)
            progressbar.update(1)
        start += chunk
    progressbar.close()
To use this you write a function that does operations on your object:
def set_population(town):
    town.population = calculate_population(...)
    town.save()
and then run that function on your queryset:
spoonfeed(Town.objects.all(), set_population)

Here is a solution that includes len and count:
class GeneratorWithLen(object):
    """
    Generator that includes len and count for a given queryset
    """
    def __init__(self, generator, length):
        self.generator = generator
        self.length = length

    def __len__(self):
        return self.length

    def __iter__(self):
        return self.generator

    def __getitem__(self, item):
        return self.generator.__getitem__(item)

    def __next__(self):
        return next(self.generator)

    next = __next__  # Python 2 compatibility

    def count(self):
        return self.__len__()


def batch(queryset, batch_size=1024):
    """
    Returns a generator that does not cache results on the QuerySet.
    Aimed at expected HUGE/ENORMOUS data sets; no more memory is used than batch_size allows.
    :param batch_size: size of the maximum chunk of data kept in memory
    :return: generator
    """
    total = queryset.count()

    def batch_qs(_qs, _batch_size=batch_size):
        """
        Returns a (start, end, total, queryset) tuple for each batch in the given
        queryset.
        """
        for start in range(0, total, _batch_size):
            end = min(start + _batch_size, total)
            yield (start, end, total, _qs[start:end])

    def generate_items():
        ordered_qs = queryset.order_by('pk')  # clear any custom ordering and order by pk so slices are stable
        for start, end, total, qs in batch_qs(ordered_qs):
            for item in qs:
                yield item

    return GeneratorWithLen(generate_items(), total)
Usage:
events = batch(Event.objects.all())
len(events) == events.count()
for event in events:
    # Do something with the Event
    pass

There are a lot of outdated answers here. I'm not sure when it was added, but Django's QuerySet.iterator() method now uses a server-side cursor with a chunk size to stream results from the database. So if you're using Postgres, this should be handled out of the box for you.
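For example (assuming Django 2.0 or later, where iterator() accepts a chunk_size argument):
for event in Event.objects.all().iterator(chunk_size=2000):
    print(event)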

I usually use a raw MySQL query instead of the Django ORM for this kind of task.
MySQL supports a streaming mode, so we can loop through all records safely and quickly without running out of memory.
import MySQLdb
import MySQLdb.cursors

db_config = {}  # configure your db here
connection = MySQLdb.connect(
    host=db_config['HOST'], user=db_config['USER'],
    port=int(db_config['PORT']), passwd=db_config['PASSWORD'], db=db_config['NAME'])
cursor = MySQLdb.cursors.SSCursor(connection)  # SSCursor for streaming mode
cursor.execute("SELECT * FROM event")
while True:
    record = cursor.fetchone()
    if record is None:
        break
    # Do something with record here
cursor.close()
connection.close()
Ref:
Retrieving million of rows from MySQL
How does MySQL result set streaming perform vs fetching the whole JDBC ResultSet at once

Related

What is the difference between SeedSequence.spawn and SeedSequence.generate_state

I am trying to use numpy's SeedSequence to seed RNGs in different processes. However I am not sure whether I should use ss.generate_state or ss.spawn:
import concurrent.futures
import numpy as np

def worker(seed):
    rng = np.random.default_rng(seed)
    return rng.random(1)

num_repeats = 1000
ss = np.random.SeedSequence(243799254704924441050048792905230269161)
with concurrent.futures.ProcessPoolExecutor() as pool:
    result1 = np.hstack(list(pool.map(worker, ss.generate_state(num_repeats))))

ss = np.random.SeedSequence(243799254704924441050048792905230269161)
with concurrent.futures.ProcessPoolExecutor() as pool:
    result2 = np.hstack(list(pool.map(worker, ss.spawn(num_repeats))))
What are the differences between the two approaches and which should I use?
Using ss.generate_state is ~10% faster for the basic example above, likely because we are serializing floats instead of objects.
Well, SeedSequence is intended to generate good-quality seeds from not-so-good seeds.
Performance
The generate_state method is much faster than spawn. The difference in time for transferring the object or the state should not be the main reason for the difference you notice. You can test this without any multiprocessing at all:
%%timeit
ss.generate_state(num_repeats)
is roughly 100 times faster than
%%timeit
ss.spawn(num_repeats)
Seed size
In map(worker, ss.generate_state(num_repeats)) the RNGs are seeded with integers, while in map(worker, ss.spawn(num_repeats)) they are seeded with SeedSequence objects, which can potentially give a higher-quality initialization: a seed sequence can be used internally to generate a state vector with as many bits as required to completely initialize the RNG. To be honest, I expect the integer seed to undergo some expansion as well when initializing the RNG, not simply padding with zeros, for example.
Repeated use
The most important difference is that generate_state gives the same result if called multiple times, while spawn gives different results on each call.
For illustration, check the following example
ss = np.random.SeedSequence(243799254704924441050048792905230269161)
print('With generate_state')
print(np.hstack([worker(s) for s in ss.generate_state(5)]))
print(np.hstack([worker(s) for s in ss.generate_state(5)]))
print('With spawn')
print(np.hstack([worker(s) for s in ss.spawn(5)]))
print(np.hstack([worker(s) for s in ss.spawn(5)]))
With generate_state
[0.6625651 0.17654256 0.25323331 0.38250588 0.52670541]
[0.6625651 0.17654256 0.25323331 0.38250588 0.52670541]
With spawn
[0.06988312 0.40886412 0.55733136 0.43249601 0.53394111]
[0.64885573 0.16788206 0.12435154 0.14676836 0.51876499]
As you can see, the arrays generated by seeding different RNGs with generate_state are the same, not only immediately after construction but every time the method is called. spawn should give you the same results with a newly constructed SeedSequence (I am using numpy 1.19.2); however, if you call it twice on the same instance, the second call will produce different seeds.

pandas read_sql_table sqlalchemy chunksize issue

I am very new to python and I have an issue with the read_sql_table part of pandas.
If I simply provide the table name and the engine from sqlalchemy, it reads the data and I am able to print the head of the dataframe.
If I add index_col and columns, it works as well.
As soon as I add chunksize=10000, it fails to print the head, with the error: 'generator' object has no attribute 'head'.
Just iterate over it:
for chunk in pd.read_sql_table(table_name, con=engine, chunksize=10000):
    print(chunk.head())
A generator in Python is a way to lazily evaluate things.
So there simply isn't anything to get the .head() of when you provide input to the chunksize keyword argument.
What you'll need to become familiar with is iterating over those results.
Example:
generator_object = pd.read_sql_table('your_table', con=your_connection_string,
                                     chunksize=CHUNKSIZE)
for chunk in generator_object:
    print(chunk)
Another thing you can do is to request the first chunk of your table with next():
generator_object = pd.read_sql_table('your_table', con=your_connection_string,
                                     chunksize=CHUNKSIZE)
next(generator_object).head()
But please note that this consumes the chunk, and generator_object will no longer return that chunk.
You can also get multiple chunks using itertools.islice:
import itertools as it

CHUNKSIZE = 10
iterable_slice = it.islice(generator_object, 3)  # get 3*10 == 30 records
for chunk in iterable_slice:
    print(chunk)

How to cache data during the first epoch correctly (Tensorflow, dataset)?

I'm trying to use the cache transformation for a dataset. Here is my current code (simplified):
dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=1)
dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=5000, count=1))
dataset = dataset.map(_parser_a, num_parallel_calls=12)
dataset = dataset.padded_batch(
    20,
    padded_shapes=padded_shapes,
    padding_values=padding_values
)
dataset = dataset.prefetch(buffer_size=1)
dataset = dataset.cache()
After the first epoch, I received the following error message:
The calling iterator did not fully read the dataset we were attempting
to cache. In order to avoid unexpected truncation of the sequence, the
current [partially cached] sequence will be dropped. This can occur if
you have a sequence similar to dataset.cache().take(k).repeat().
Instead, swap the order (i.e. dataset.take(k).cache().repeat())
Then, the code proceeded and still read data from the hard drive instead of the cache. So, where should I place dataset.cache() to avoid the error?
Thanks.
The implementation of the Dataset.cache() transformation is fairly simple: it builds up a list of the elements that pass through it as you iterate over it completely the first time, and it returns elements from that list on subsequent attempts to iterate over it. If the first epoch only performs a partial pass over the data, the list is incomplete, and TensorFlow doesn't try to use the cached data, because it doesn't know whether the remaining elements will be needed; in general it might need to reprocess all the preceding elements to compute the remaining elements.
If you modify your program to consume the entire dataset, iterating over it until tf.errors.OutOfRangeError is raised, the cache will have a complete list of the elements in the dataset, and it will be used on all subsequent iterations.
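As a rough sketch of that advice (my own rearrangement of the question's pipeline, assuming the parsed examples fit in memory and that the first epoch is consumed in full; num_epochs is a placeholder), caching right after the expensive map step lets later epochs read from the cache:
dataset = tf.data.TFRecordDataset(filenames, num_parallel_reads=1)
dataset = dataset.map(_parser_a, num_parallel_calls=12)
dataset = dataset.cache()  # finalized once the first full pass over the parsed data completes
dataset = dataset.apply(tf.contrib.data.shuffle_and_repeat(buffer_size=5000, count=num_epochs))
dataset = dataset.padded_batch(
    20,
    padded_shapes=padded_shapes,
    padding_values=padding_values
)
dataset = dataset.prefetch(buffer_size=1)
This matches the ordering hinted at by the error message: cache() before repeat, not after.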

Building relationships in Neo4j via Py2neo is very slow

We have 5 different types of nodes in the database. The largest type has ~290k nodes, the smallest only ~3k. Each node type has an id field, and they are all indexed. I am using py2neo to build relationships, but it is very slow (~2 relationships inserted per second).
I used pandas to read from a relationship CSV and iterated over each row to create a relationship wrapped in a transaction. I tried batching 10k creation statements in one transaction, but that did not seem to improve the speed much.
Below is the code:
df = pd.read_csv(r"C:\relationship.csv", dtype=datatype, skipinitialspace=True, usecols=fields)
df.fillna('', inplace=True)

def f(node_1, rel_type, node_2):
    try:
        tx = graph.begin()
        tx.evaluate('MATCH (a {node_id:$label1}),(b {node_id:$label2}) MERGE (a)-[r:' + rel_type + ']->(b)',
                    parameters={'label1': node_1, 'label2': node_2})
        tx.commit()
    except Exception as e:
        print(str(e))

for index, row in df.iterrows():
    if index % 1000000 == 0:
        print(index)
    try:
        f(row["node_1"], row["rel_type"], row["node_2"])
    except:
        print("error index: " + str(index))
Can someone help me see what I did wrong here? Thanks!
You state that there are "5 different types of nodes" (which I interpret to mean 5 node labels, in neo4j terminology). And, furthermore, you state that their id properties are already indexed.
But your f() function is not generating a Cypher query that uses the labels at all, and neither does it use the id property. In order to take advantage of your indexes, your Cypher query has to specify the node label and the id value.
Since there is currently no efficient way to parameterize the label when performing a MATCH, the following version of the f() function generates a Cypher query that has hardcoded labels (as well as a hardcoded relationship type):
def f(label_1, id_1, rel_type, label_2, id_2):
    try:
        tx = graph.begin()
        tx.evaluate(
            'MATCH' +
            '(a:' + label_1 + '{id:$id1}),' +
            '(b:' + label_2 + '{id:$id2}) ' +
            'MERGE (a)-[r:' + rel_type + ']->(b)',
            parameters={'id1': id_1, 'id2': id_2})
        tx.commit()
    except Exception as e:
        print(str(e))
The code that calls f() will also have to be changed to pass in both the label names and the id values for a and b. Hopefully, your df rows will contain that data (or enough info for you to derive that data).
If your aim is better performance, you will need to consider a different pattern for loading these, i.e. batching.
Batching these by looking at multiple statements per transaction or per function call will reduce the number of network hops and should improve performance.
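A rough sketch of what that batching could look like (my own illustration, assuming all rows in a batch share the same labels and relationship type, and that rows is a list of {'id1': ..., 'id2': ...} dicts built from the dataframe):
def merge_batch(graph, label_1, rel_type, label_2, rows, batch_size=10000):
    # One Cypher statement per transaction handles a whole batch via UNWIND,
    # so the round trips drop from one per relationship to one per batch.
    cypher = (
        'UNWIND $rows AS row '
        'MATCH (a:' + label_1 + ' {id: row.id1}), (b:' + label_2 + ' {id: row.id2}) '
        'MERGE (a)-[r:' + rel_type + ']->(b)'
    )
    for i in range(0, len(rows), batch_size):
        tx = graph.begin()
        tx.run(cypher, rows=rows[i:i + batch_size])
        tx.commit()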

How to find large objects in ZODB

I'm trying to analyze my ZODB because it grew really large (it's also large after packing).
The package zodbbrowser has a feature that displays the amount of bytes of an object. It does so by getting the length of the pickled state (name of the variable), but it also does a bit of magic which I don't fully understand.
How would I go to find the largest objects in my ZODB?
I've written a method which should do exactly this. Feel free to use it, but be aware that it consumes a lot of memory. The package zodbbrowser must be installed.
def zodb_objects_by_size(self):
    """
    Recurse over the ZODB tree starting from self.aq_parent. For
    each object, use zodbbrowser's implementation to get the raw
    object state. Put each length into a Counter object and
    return a list of the biggest objects, specified by path and
    size.
    """
    from zodbbrowser.history import ZodbObjectHistory
    from collections import Counter

    def recurse(obj, results):
        # Retrieve state pickle from ZODB, get length
        history = ZodbObjectHistory(obj)
        pstate = history.loadStatePickle()
        length = len(pstate)

        # Add length to Counter instance
        path = '/'.join(obj.getPhysicalPath())
        results[path] = length

        # Recursion
        for child in obj.contentValues():
            # Play around portal tools and other weird objects which
            # seem to contain themselves
            if child.contentValues() == obj.contentValues():
                continue
            # Rolling in the deep
            try:
                recurse(child, results)
            except (RuntimeError, AttributeError) as err:
                import pdb; pdb.set_trace()  # go debug

    results = Counter()
    recurse(self.aq_parent, results)
    return results.most_common()