I am using the groovy.sql.Sql class to query a database and process the results. My problem is that the ResultSet can be very large; so large that I risk running out of memory if I try to process the whole ResultSet at once. I know the Sql.rows() method supports paging using offset and max results parameters but I haven't been able to find a good example of how to use it (and I'm not certain that paging is what I'm looking for).
Basically, here's what I'm trying to do:
def endOfResultSet = false
for (int x = 1; !endOfResultSet; x += 1000) {
    def result = sql.rows("Select * from table", x, 1000)
    processResult(result)
    endOfResultSet = result.size() != 1000
}
My question is whether Groovy is smart enough to reuse the same result set for each sql.rows("Select * from table", x, 1000) call, or whether it will repeatedly run the same statement on the database and then page to where the offset starts.
Your help is appreciated, Thanks!
Edit: What I'm trying to avoid is running the same query on the database multiple times. I'd like to run the query once, get the first 1,000 rows, process them, get the next 1,000 rows, etc... until all the rows are processed.
I assume you've seen this blog post about paging?
To answer your question, if we look at the code for the Sql class in Groovy, we can see that the code for rows(String,int,int) calls rows(String,int,int,null)
And the code for that is:
AbstractQueryCommand command = createQueryCommand(sql);
ResultSet rs = null;
try {
    rs = command.execute();
    List<GroovyRowResult> result = asList(sql, rs, offset, maxRows, metaClosure);
    rs = null;
    return result;
} finally {
    command.closeResources(rs);
}
So as you can see, it gets the full ResultSet, then steps through this inside the asList method, filling a List<GroovyRowResult> object with just the results you requested.
Edit (after the question was edited)
As I said in my comment below, I think you're going to need to write your own paging query for the specific database you are using... For example, with MySQL, your above query can be changed to:
def result = sql.rows( "SELECT * FROM table LIMIT ${Sql.expand x}, 1000" )
Other databases have different syntax for this sort of thing; I don't believe there is a standard implementation.
Answer from above is not correct. If you dig deeper, you'll find that if the ResultSet is not TYPE_FORWARD_ONLY, then the "absolute" method of the ResultSet is invoked to position a server side cursor. Then maxRows are returned. If the ResultSet is TYPE_FORWARD_ONLY, then ResultSet.next() is invoked offset number of times, then maxRows are returned. The exact performance characteristics will depend on the underlying jdbc driver implementation, but usually you want a scrollable result set when using the paging feature.
The resultset is not reused between invocations. Sounds like you want something like streaming, not paging.
Also, I wrote the patch, btw.
http://jira.codehaus.org/browse/GROOVY-4622
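To illustrate the difference between paging and streaming: with streaming, the statement is executed once and rows are pulled from the open cursor in batches. The following is a minimal sketch in Python using the standard-library sqlite3 module, not Groovy code; the table name items, the helper stream_in_batches and the batch size are assumptions made up for the example. How much memory this really saves depends on the database driver. In Groovy, Sql.eachRow is probably the closest counterpart, since it hands rows to a closure one at a time.

```python
import sqlite3

def stream_in_batches(cursor, batch_size=1000):
    """Yield lists of up to batch_size rows from an already-executed cursor."""
    while True:
        batch = cursor.fetchmany(batch_size)
        if not batch:
            break
        yield batch

# Hypothetical in-memory table standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items (name) VALUES (?)",
                 [("item%d" % i,) for i in range(2500)])

# The SELECT runs exactly once; each batch is read from the same cursor.
cursor = conn.execute("SELECT id, name FROM items ORDER BY id")
batch_sizes = [len(batch) for batch in stream_in_batches(cursor, 1000)]
conn.close()
```

Each batch can be processed and discarded before the next one is fetched, so only one batch needs to be held in memory at a time.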
I'm creating a small script for automation, but I ran into a problem.
Suppose I use a GET API method to fetch results and then want to add one to each result.
The basic case is not difficult:
def numbers = get(endpoint)
numbers.each { int number ->
    log.info(number + 1)
}
I am having a hard time, however, figuring out the correct approach to pagination. The response limit for a query is 100. Before submitting a query, I don't know how many responses to expect. (There might be more than 100 and then I have to use pagination).
In this case, should I first determine how many results I can get, and with this knowledge only then create for loops for each "page"?
Or should I use a while loop and keep sending GET requests until a page returns fewer than 100 results?
Something like:
boolean hasMore = true
int startAt = 0
while (hasMore) {
    // Request the next page of up to 100 results
    def numbers = get(endpoint)
        .queryString('startAt', startAt)
    numbers.each { int number ->
        log.info(number + 1)
    }
    startAt += 100
    // A page with fewer than 100 results is the last one
    if (numbers.size() < 100) hasMore = false
}
Until now I was using a for loop, but with two different endpoints: one showed me the maximum number of results, and the second returned the details for each result. Because the second one was limited to 100 results, I worked out how many iterations I needed by dividing the total number of results by 100.
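The while-loop idea can be sketched without knowing the total in advance: keep requesting pages and stop as soon as a page comes back shorter than the page size. The sketch below is in Python rather than Groovy, and fetch_page is a hypothetical stand-in for the real GET request (here it just slices a local list); only the loop structure is the point.

```python
PAGE_SIZE = 100

# Hypothetical data standing in for what the remote API would return.
_ALL_RESULTS = list(range(250))

def fetch_page(start_at, limit=PAGE_SIZE):
    """Stand-in for the GET call: return up to `limit` results from `start_at`."""
    return _ALL_RESULTS[start_at:start_at + limit]

def fetch_all(page_size=PAGE_SIZE):
    """Request pages until one comes back shorter than page_size."""
    start_at = 0
    while True:
        page = fetch_page(start_at, page_size)
        for number in page:
            yield number
        if len(page) < page_size:
            break  # short (or empty) page: nothing left to fetch
        start_at += page_size

incremented = [number + 1 for number in fetch_all()]
```

If the total happens to be an exact multiple of the page size, this makes one extra request that returns an empty page and then stops.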
I have an xsjs service that fills some tables with data from another table.
After running for some time, the service gives the following error:
InternalError: dberror(Connection.prepareStatement): 608 - exceed maximum number of prepared statements: the number of prepared statements per connection cannot exceed the max statements
I open a $.db.getConnection() at the beginning and close it only at the end, with a prepareStatement call inside a for loop. (There are several loops like the one below for other tables.)
var aSQL = "select field from table";
var conn = $.hdb.getConnection();
var connInsert = $.db.getConnection();
var rsLevel1 = conn.executeQuery(aSQL);
var s = {};
var loc_descr_group = [];
var row = {};
for (i = 0; i < rsLevel1.length; i++) {
    var entry = rsLevel1[i].field;
    var split = entry.split(",");
    for (var j = 0; j < split.length; j++) {
        if (loc_descr_group.indexOf(split[j]) == -1) {
            loc_descr_group.push(split[j]);
            var value = split[j].replace(/'/g, "''");
            sSQL = "insert into another_table "
                + " values ('" + value + "')";
            pstmt = connInsert.prepareStatement(sSQL);
            pstmt.execute();
            connInsert.commit();
        }
    }
}
connInsert.close();
conn.close();
I couldn't find any information about the max number of prepareStatement used on xsjs. Is there one?
Thank you.
The problem here is not that there is a per-connection limit of prepared statements, but that the code needlessly creates new prepared statements in a loop.
The whole idea of prepared statements is reuse. When running several statements that are structurally the same and differ only in the actual values, prepared statements allow the database to parse, check and optimise the query structure once and reuse it over and over again.
Instead of creating the prepared statement object for every insert, it's much better to create it once before the nested loop construct.
And instead of pasting quoted and comma-delimited values into the SQL string, using bind variables can improve both the execution speed and the security of the insert statement.
Furthermore, there is a COMMIT after each insert. If that is really required, then using an autocommit connection might be the better choice. If it's not required, the COMMIT should only be sent once, after the loops have finished.
This is not just a question of performance (COMMITs are always synchronous - your code waits for it) but also of possibly half inserted records.
Finally, the code uses two different connection methods, $.db.getConnection and $.hdb.getConnection, to create two separate connection objects. For the given context that is unnecessary and rather confusing.
Just using the newer $.hdb.getConnection and a single connection would suffice.
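The points above (one parameterized statement, bind variables instead of string concatenation, one commit at the end) can be sketched as follows. This is a Python illustration using the standard-library sqlite3 module, not xsjs code; the sample entries and the column name descr are assumptions invented for the example.

```python
import sqlite3

# Hypothetical comma-separated source values, including one with a quote.
entries = ["a,b", "b,c'd", "a,e"]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE another_table (descr TEXT)")

# Collect the distinct values first, preserving first-seen order.
seen = set()
distinct_values = []
for entry in entries:
    for value in entry.split(","):
        if value not in seen:
            seen.add(value)
            distinct_values.append(value)

# One parameterized statement reused for every row; the driver binds the
# values, so no manual quote escaping is needed.
conn.executemany("INSERT INTO another_table VALUES (?)",
                 [(value,) for value in distinct_values])
conn.commit()  # single commit after all inserts

inserted = [row[0] for row in
            conn.execute("SELECT descr FROM another_table ORDER BY rowid")]
conn.close()
```

The same shape applies in xsjs: prepare the insert once before the loops, set the bind value and execute inside the loop, then commit and close once after the loops.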
Because of performance issues when fetching about 30k results from the DB as entities with Hibernate JPA, I tried writing a namedQuery instead, to have more control over the query and its runtime. What I end up with is almost 20 seconds just for those few entities, and those 20 seconds are needed both for the "old" query and for my own namedQuery (which takes less than a second to return the result when executed in a SQL client), so basically it doesn't make any difference whether I use a namedQuery or the Hibernate-generated query.
Is it safe to assume that 98% of the time is used for mapping those results to their corresponding entities? And if so, how should I speed this up? Below is the query that I wrote myself (note that I explicitly have to state all the columns in the SELECT).
SELECT exp.ID
,exp.CREATEDBY
,exp.CREATEDTIME
,exp.DELETED
,exp.LASTCHANGE
,exp.LASTCHANGEBY
,exp.STATUS
,exp.BRIXFIGURE
,exp.GRAMMAGE
,exp.INDIVIDUALPACKAGING
,exp.MINORDERQUANTITY
,exp.PACKAGINGHEIGHT
,exp.PACKAGINGLENGTH
,exp.PACKAGINGWIDTH
,exp.PALETTESIZE
,exp.QUANTITY
,exp.UNIT
,exp.VALIDUNTIL
,exp.EXPORTELEMENT_START
,exp.EXPORTSTATUS
,exp.webServiceResponse
,exp.CATEGORYID
,exp.COMMENTID
,exp.SUPPLIERID
,exp.TRANSPORTPACKAGINGID
,exp.LocationId
,exp.PriceRowId
,exp.EXPORTELEMENT_ENDDATE
,exp.BASEPRICE
,exp.BASEUNIT
,exp.BARCODES
,exp.EXPIRYDATE
,exp.PREORDERPERIOD
,exp.EXPORTWEEKID
,exp.EXPORT_TENDER_UID
,exp.EXPORT_UID
,exp.CURRENCY_ID
,exp.WEIGHT_PER_BOX
FROM EXPORTELEMENT AS exp
JOIN EXPORTELEMENT_LOCATION as exlo ON exlo.EXPORTELEMENTID = exp.ID
WHERE exlo.LOCATIONID = :locationId
AND exp.EXPORTELEMENT_ENDDATE <= :endDate
AND exp.EXPORTELEMENT_START >= :startDate
AND exp.DELETED = :deleted
Writing raw SQL instead of letting Hibernate/JPA generate it doesn't improve performance by itself. The reason might be that your object is mapped to other objects (fetched eagerly as opposed to lazily) that map to other objects, and so on, so you could potentially be pulling in your whole DB. You might think your query is the only one being executed, but in reality the other mappings might be creating and executing more SQL queries. In my case, for 10,000 rows, doing the mapping myself took 100 milliseconds, but letting Hibernate/JPA do the mapping took 10 s, a whole 100x slower.
What improves the performance is doing the mapping yourself. Something like this:
@Query(nativeQuery = true, value = "your_raw_sql_here")
List<Object[]> yourNativeQueryMethod();
Then you can map the object yourself:
for (Object[] objectArray : results) {
    BigInteger id = (BigInteger) objectArray[0];
    //etc...
}
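The idea of mapping raw rows by hand is not specific to JPA. As a rough, language-neutral illustration, here is a minimal Python sketch using the standard-library sqlite3 module: the query returns plain tuples and each tuple is turned into a simple object with no relationship loading involved. The ExportElement class and its three columns are assumptions chosen for brevity, not the full table from the question.

```python
import sqlite3
from dataclasses import dataclass

@dataclass
class ExportElement:
    """Plain value object; only three illustrative columns are mapped."""
    id: int
    status: str
    quantity: int

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE exportelement (id INTEGER, status TEXT, quantity INTEGER)")
conn.executemany("INSERT INTO exportelement VALUES (?, ?, ?)",
                 [(1, "OPEN", 10), (2, "CLOSED", 5)])

# Raw rows come back as tuples in SELECT column order.
rows = conn.execute(
    "SELECT id, status, quantity FROM exportelement ORDER BY id").fetchall()

# Manual mapping: positional tuple fields become object attributes.
elements = [ExportElement(*row) for row in rows]
conn.close()
```

Because nothing beyond the selected columns is touched, no additional queries are triggered for related objects.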
In Lucene, a query can be composed of many sub-queries. (such as TermQuery objects)
I'd like a way to iterate over the documents returned by a search, and for each document, to then iterate over the sub-queries.
For each sub-query, I'd like to get the number of times it matched. (I'm also interested in the fieldNorm, etc.)
I can get access to that data by using indexSearcher.explain, but that feels quite hacky because I would then need to parse the "description" member of each nested Explanation object to try and find the term frequency, etc. (also, calling "explain" is very slow, so I'm hoping for a faster approach)
The context here is that I'd like to experiment with re-ranking Lucene's top N search results, and to do that it's obviously helpful to extract as many "features" as possible about the matches.
Via looking at the source code for classes like TermQuery, the following appears to be a basic approach:
// For each document... (scoreDoc.doc is an integer)
Weight weight = weightCache.get(query);
if (weight == null)
{
    weight = query.createWeight(indexSearcher, true);
    weightCache.put(query, weight);
}
IndexReaderContext context = indexReader.getContext();
List<LeafReaderContext> leafContexts = context.leaves();
int n = ReaderUtil.subIndex(scoreDoc.doc, leafContexts);
LeafReaderContext leafReaderContext = leafContexts.get(n);
Scorer scorer = weight.scorer(leafReaderContext);
int deBasedDoc = scoreDoc.doc - leafReaderContext.docBase;
int thisDoc = scorer.iterator().advance(deBasedDoc);
float freq = 0;
if (thisDoc == deBasedDoc)
{
    freq = scorer.freq();
}
The 'weightCache' is of type Map<Query, Weight> and is useful so that you don't have to re-create the Weight object for every document you process. (Otherwise, the code runs about 10x slower.)
Is this approximately what I should be doing? Are there any obvious ways to make this run faster? (it takes approx 2 ms for 280 documents, as compared to about 1 ms to perform the query itself)
Another challenge with this approach is that it requires code to navigate through your Query object to try and find the sub-queries. For example, if it's a BooleanQuery, you call query.clauses() and recurse on them to look for all leaf TermQuery objects, etc. Not sure if there is a more elegant / less brittle way to do that.
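The recursive walk described above can be sketched independently of Lucene. The Python example below uses two hypothetical stand-in classes, TermNode and BooleanNode, which are not Lucene types; it only shows the shape of the depth-first traversal that collects leaf term queries from a nested boolean query.

```python
from dataclasses import dataclass, field

@dataclass
class TermNode:
    """Hypothetical stand-in for a leaf query such as TermQuery."""
    field_name: str
    text: str

@dataclass
class BooleanNode:
    """Hypothetical stand-in for a composite query holding sub-queries."""
    clauses: list = field(default_factory=list)

def collect_terms(query):
    """Depth-first walk returning every leaf TermNode under `query`."""
    if isinstance(query, TermNode):
        return [query]
    terms = []
    for clause in query.clauses:
        terms.extend(collect_terms(clause))
    return terms

tree = BooleanNode([
    TermNode("title", "lucene"),
    BooleanNode([TermNode("body", "search"), TermNode("body", "rank")]),
])
leaf_texts = [term.text for term in collect_terms(tree)]
```

The brittleness mentioned above shows up here too: every additional composite query type needs its own branch in the walk.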
The table in question contains roughly ten million rows.
for event in Event.objects.all():
    print(event)
This causes memory usage to increase steadily to 4 GB or so, at which point the rows print rapidly. The lengthy delay before the first row printed surprised me – I expected it to print almost instantly.
I also tried Event.objects.iterator() which behaved the same way.
I don't understand what Django is loading into memory or why it is doing this. I expected Django to iterate through the results at the database level, which'd mean the results would be printed at roughly a constant rate (rather than all at once after a lengthy wait).
What have I misunderstood?
(I don't know whether it's relevant, but I'm using PostgreSQL.)
Nate C was close, but not quite.
From the docs:
You can evaluate a QuerySet in the following ways:
Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:
for e in Entry.objects.all():
    print(e.headline)
So your ten million rows are retrieved, all at once, when you first enter that loop and get the iterating form of the queryset. The wait you experience is Django loading the database rows and creating objects for each one, before returning something you can actually iterate over. Then you have everything in memory, and the results come spilling out.
From my reading of the docs, iterator() does nothing more than bypass QuerySet's internal caching mechanisms. I think it might make sense for it to do a one-by-one thing, but that would conversely require ten million individual hits on your database. Maybe not all that desirable.
Iterating over large datasets efficiently is something we still haven't gotten quite right, but there are some snippets out there you might find useful for your purposes:
Memory Efficient Django QuerySet iterator
batch querysets
QuerySet Foreach
It might not be the fastest or most efficient approach, but as a ready-made solution why not use Django core's Paginator and Page objects, documented here:
https://docs.djangoproject.com/en/dev/topics/pagination/
Something like this:
from django.core.paginator import Paginator
from djangoapp.models import model
# Chunks of 1000; change this to the desired chunk size
paginator = Paginator(model.objects.all(), 1000)
for page in range(1, paginator.num_pages + 1):
    for row in paginator.page(page).object_list:
        # here you can do whatever you want with the row
        pass
    print("done processing page %s" % page)
Django's default behavior is to cache the whole result of the QuerySet when it evaluates the query. You can use the QuerySet's iterator method to avoid this caching:
for event in Event.objects.all().iterator():
    print(event)
https://docs.djangoproject.com/en/stable/ref/models/querysets/#iterator
The iterator() method evaluates the queryset and then reads the results directly without doing caching at the QuerySet level. This method results in better performance and a significant reduction in memory when iterating over a large number of objects that you only need to access once. Note that caching is still done at the database level.
Using iterator() reduces memory usage for me, but it is still higher than I expected. Using the paginator approach suggested by mpaf uses much less memory, but is 2-3x slower for my test case.
from django.core.paginator import Paginator
def chunked_iterator(queryset, chunk_size=10000):
    # Yield objects one at a time while loading only one page into memory
    paginator = Paginator(queryset, chunk_size)
    for page in range(1, paginator.num_pages + 1):
        for obj in paginator.page(page).object_list:
            yield obj
for event in chunked_iterator(Event.objects.all()):
    print(event)
For large numbers of records, a database cursor performs even better. You do need raw SQL in Django for this; the Django cursor is something different from an SQL cursor.
The LIMIT - OFFSET method suggested by Nate C might be good enough for your situation. For large amounts of data it is slower than a cursor because it has to run the same query over and over again and has to jump over more and more results.
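One common way to avoid the growing cost of skipping rows with OFFSET is keyset pagination, which is a different technique from both the cursor and the LIMIT/OFFSET approaches mentioned here: each query starts after the last primary key already seen, so the database can seek directly via the index. Below is a minimal sketch using Python's standard-library sqlite3 module; the event table layout and the iter_keyset helper are assumptions for the example, and it assumes positive integer primary keys.

```python
import sqlite3

def iter_keyset(conn, chunk_size=1000):
    """Yield rows in primary-key order, fetching chunk_size rows per query."""
    last_id = 0  # assumes positive integer primary keys
    while True:
        rows = conn.execute(
            "SELECT id, name FROM event WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, chunk_size),
        ).fetchall()
        if not rows:
            break
        for row in rows:
            yield row
        last_id = rows[-1][0]  # resume after the last key seen

# Hypothetical in-memory table standing in for the real one.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE event (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO event (id, name) VALUES (?, ?)",
                 [(i, "event%d" % i) for i in range(1, 2501)])

ids = [row[0] for row in iter_keyset(conn, chunk_size=1000)]
conn.close()
```

In Django terms this corresponds roughly to filtering on pk__gt with an order_by('pk') and a slice, instead of slicing with an ever-growing start index.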
Django doesn't have a good solution for fetching a large number of items from the database.
import gc
# Get the event ids in reverse order
eids = Event.objects.order_by("-id").values_list("id", flat=True)
for index, eid in enumerate(eids):
    event = Event.objects.get(id=eid)
    # do necessary work with event
    if index % 100 == 0:
        gc.collect()
        print("completed 100 items")
values_list can be used to fetch all the ids in the database, and each object is then fetched separately. Over time, large objects accumulate in memory and won't be garbage collected until the for loop exits. The code above runs garbage collection manually after every 100th item is consumed.
This is from the docs:
http://docs.djangoproject.com/en/dev/ref/models/querysets/
No database activity actually occurs until you do something to evaluate the queryset.
So when the print event line runs, the query fires (a full table scan, given your command) and loads the results. You're asking for all the objects, and there is no way to get the first object without getting all of them.
But if you do something like:
Event.objects.all()[300:900]
http://docs.djangoproject.com/en/dev/topics/db/queries/#limiting-querysets
Then it will add offsets and limits to the sql internally.
A massive amount of memory is consumed before the queryset can be iterated because all database rows for the whole query are processed into objects at once, which can be a lot of processing depending on the number of rows.
You can chunk up your queryset into smaller, digestible bits. I call this pattern "spoonfeeding". Here's an implementation with a progress bar that I use in my management commands; first run pip3 install tqdm:
from tqdm import tqdm
def spoonfeed(qs, func, chunk=1000, start=0):
    """
    Chunk up a large queryset and run func on each item.
    Works with automatic primary key fields.
    chunk -- how many objects to take on at once
    start -- PK to start from
    >>> spoonfeed(Spam.objects.all(), nom_nom)
    """
    end = qs.order_by('pk').last()
    if not end:
        # Empty queryset: nothing to process
        return
    progressbar = tqdm(total=qs.count())
    while start < end.pk:
        # Process one PK range at a time so only `chunk` objects are loaded
        for o in qs.filter(pk__gt=start, pk__lte=start+chunk):
            func(o)
            progressbar.update(1)
        start += chunk
    progressbar.close()
To use this you write a function that does operations on your object:
def set_population(town):
    town.population = calculate_population(...)
    town.save()
and then run that function on your queryset:
spoonfeed(Town.objects.all(), set_population)
Here is a solution that also supports len and count:
class GeneratorWithLen(object):
    """
    Generator that includes len and count for given queryset
    """
    def __init__(self, generator, length):
        self.generator = generator
        self.length = length
    def __len__(self):
        return self.length
    def __iter__(self):
        return self.generator
    def __getitem__(self, item):
        return self.generator.__getitem__(item)
    def __next__(self):
        return next(self.generator)
    next = __next__  # Python 2 name for the same method
    def count(self):
        return self.__len__()
def batch(queryset, batch_size=1024):
    """
    returns a generator that does not cache results on the QuerySet
    Aimed to use with expected HUGE/ENORMOUS data sets, no caching, no memory used more than batch_size
    :param batch_size: Size for the maximum chunk of data in memory
    :return: generator
    """
    total = queryset.count()
    def batch_qs(_qs, _batch_size=batch_size):
        """
        Returns a (start, end, total, queryset) tuple for each batch in the given
        queryset.
        """
        for start in range(0, total, _batch_size):
            end = min(start + _batch_size, total)
            yield (start, end, total, _qs[start:end])
    def generate_items():
        queryset.order_by()  # Clearing... ordering by id if PK autoincremental
        for start, end, total, qs in batch_qs(queryset):
            for item in qs:
                yield item
    return GeneratorWithLen(generate_items(), total)
Usage:
events = batch(Event.objects.all())
len(events) == events.count()
for event in events:
    # Do something with the Event
    pass
There are a lot of outdated results here. Not sure when it was added, but Django's QuerySet.iterator() method uses a server-side cursor with a chunk size, to stream results from the database. So if you're using postgres, this should now be handled out of the box for you.
I usually use a raw MySQL query instead of the Django ORM for this kind of task.
MySQL supports a streaming mode, so we can loop through all records safely and quickly without running out of memory.
import MySQLdb
import MySQLdb.cursors
db_config = {}  # config your db here
connection = MySQLdb.connect(
    host=db_config['HOST'], user=db_config['USER'],
    port=int(db_config['PORT']), passwd=db_config['PASSWORD'], db=db_config['NAME'])
cursor = MySQLdb.cursors.SSCursor(connection)  # SSCursor for streaming mode
cursor.execute("SELECT * FROM event")
while True:
    record = cursor.fetchone()
    if record is None:
        break
    # Do something with record here
cursor.close()
connection.close()
Ref:
Retrieving million of rows from MySQL
How does MySQL result set streaming perform vs fetching the whole JDBC ResultSet at once