Efficient Django query filter first n records - sql

I'm trying to filter for the first (or last) n Message objects which match a certain filter. Right now I need to filter for all matches then slice.
def get(self, request, chat_id, n):
    last_n_messages = Message.objects.filter(chat=chat_id).order_by('-id')[:n]
    last_n_sorted = reversed(last_n_messages)
    serializer = MessageSerializer(last_n_sorted, many=True)
    return Response(serializer.data, status=status.HTTP_200_OK)
This is clearly not efficient. Is there a way to get the first (or last) n items without exhaustively loading every match?

I found the answer myself. Basically the [:n] slice puts a SQL "LIMIT" clause in the query, which makes the query itself NOT exhaustive. So it's fine in terms of efficiency.
Link for similar answer below.
https://stackoverflow.com/a/6574137/4775212
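A quick way to convince yourself is to inspect the generated SQL via str(last_n_messages.query) and look for the LIMIT clause. The effect can also be sketched with plain sqlite3; the table and column names below are hypothetical stand-ins for the Message model:

```python
import sqlite3

# Hypothetical stand-in for the Message table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE message (id INTEGER PRIMARY KEY, chat INTEGER)")
conn.executemany("INSERT INTO message (chat) VALUES (?)", [(1,)] * 100)

# Django's [:n] slice compiles to the same LIMIT clause, so the
# database returns at most n rows; Python never sees the rest.
rows = conn.execute(
    "SELECT id FROM message WHERE chat = ? ORDER BY id DESC LIMIT 5", (1,)
).fetchall()
print([r[0] for r in rows])  # [100, 99, 98, 97, 96]
```

The bound work happens inside the database, which is exactly why the slice-based queryset is efficient.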

If the limit is known, it can also be done using range:
Message.objects.filter(chat__range=[chat_id, chat_id+n]).order_by('-id')

Related

ArraySum only a certain rows in the cfoutput of a query

I am trying to add only certain rows to this output. Currently the output adds all rows, which is used for a Total row at the very end:
<cfoutput query="qrySummary">
#numberFormat(ArraySum(ListToArray(ValueList(qrySummary.secThreeCount))), ",")#
</cfoutput>
This obviously totals all of the secThreeCount values, but what if I want to exclude the last 2 rows from that list?
Is that possible in coldfusion?
(If this makes a difference: there are 13 rows returned; I want the first 11 rows and to exclude the last 2.)
I know that I can limit the SQL return of that query to exclude those 2 rows, but I wanted less code to write and to keep things neat. And also to learn if it is possible :)
Thanks in advance.
Well I think if you don't need those two rows, you should not be returning them in the first place. That would be the best answer. From your statement "but I wanted less code to write" you're optimising in the wrong place: don't optimise for yourself, optimise for the solution.
Leigh's come in underneath me as I've been testing the code for this, but here's a proof of concept using subList():
numbers = queryNew("");
queryAddColumn(numbers, "id", "integer", [1,2,3,4,5,6]);
queryAddColumn(numbers, "maori", "varchar", ["tahi", "rua", "toru", "wha", "rima", "ono"]);
maori = listToArray(valueList(numbers.maori));
subset = maori.subList(2,5);
writeDump([numbers, subset]);
This returns an array with elements ["toru","wha","rima"].
If you are running CF10, one option is using ArraySlice. Grab only the first eleven elements, then apply arraySum.
<cfset allValues = ListToArray(ValueList(qrySummary.secThreeCount))>
<cfset subTotal = arraySum( arraySlice(allValues, 1, 11))>
For earlier versions, there is the undocumented subList(...) approach. It takes advantage of the fact that CF arrays are java.util.List objects under the hood, and uses List.subList(..) to grab a subset of the array.
Another approach. You don't want to add up the values for the last 2 elements, so remove them from your array:
<cfset values = ListToArray(ValueList(qrySummary.secThreeCount))>
<!--- delete the last element --->
<cfset arrayDeleteAt(values, arrayLen(values))>
<!--- delete the last element again --->
<cfset arrayDeleteAt(values, arrayLen(values))>
#numberFormat(ArraySum(values), ",")#
Alternatively, given that you're looping over the query anyway, you could simply add the totals up as you go (with a tiny bit of logic to not bother if you're on the last or penultimate row)
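For comparison, the same slice-then-sum idea is a one-liner in Python; the values below are hypothetical stand-ins for the 13 secThreeCount rows:

```python
# Hypothetical secThreeCount values for the 13 rows.
values = [3, 1, 4, 1, 5, 9, 2, 6, 5, 3, 5, 8, 9]

# Keep the first 11 rows, excluding the last 2, then total them,
# mirroring arraySlice(allValues, 1, 11) followed by arraySum().
subtotal = sum(values[:11])
print(subtotal)  # 44
```

The CF10 arraySlice approach and the Python slice behave the same way: both take a bounded prefix of the array before summing.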

Getting N related models given M models

Given two models:
class Post(models.Model):
    ...

class Comment(models.Model):
    post = models.ForeignKey(Post)
    ...
Is there a good way to get the last 3 Comments on a set of Posts (i.e. in a single roundtrip to the DB instead of once per post)? A naive implementation to show what I mean:
for post in Post.objects.filter(id__in=some_post_ids):
    post.latest_comments = list(Comment.objects.filter(post=post).order_by('-id')[:3])
Given some_post_ids == [1, 2], the above will result in 3 queries:
[{'sql': 'SELECT "myapp_post"."id" FROM "myapp_post" WHERE "myapp_post"."id" IN (1, 2)', 'time': '0.001'},
{'sql': 'SELECT "myapp_comment"."id", "myapp_comment"."post_id" FROM "myapp_comment" WHERE "myapp_comment"."post_id" = 1 LIMIT 3', 'time': '0.001'},
{'sql': 'SELECT "myapp_comment"."id", "myapp_comment"."post_id" FROM "myapp_comment" WHERE "myapp_comment"."post_id" = 2 LIMIT 3', 'time': '0.001'}]
From Django's docs:
Slicing. As explained in Limiting QuerySets, a QuerySet can be sliced, using Python’s array-slicing syntax. Slicing an unevaluated QuerySet usually returns another unevaluated QuerySet, but Django will execute the database query if you use the “step” parameter of slice syntax, and will return a list. Slicing a QuerySet that has been evaluated (partially or fully) also returns a list.
Your naive implementation is correct and should only make one DB query per post. However, don't call list on it; I believe that will cause the DB to be hit immediately (although it should still only be a single query). The queryset is already iterable and there really shouldn't be any need to call list. More on calling list from the same doc page:
list(). Force evaluation of a QuerySet by calling list() on it. For example:
entry_list = list(Entry.objects.all())
Be warned, though, that this could have a large memory overhead, because Django will load each element of the list into memory. In contrast, iterating over a QuerySet will take advantage of your database to load data and instantiate objects only as you need them.
UPDATE:
With your added explanation I believe the following should work (however, it's untested so report back!):
post.latest_comments = Comment.objects.filter(post__in=some_post_ids).order_by('-id')
Admittedly it doesn't do the limit of 3 comments per post; I'm sure that's possible but can't think of the syntax off the top of my head. Also, remember you can always do a raw query on any Model to get better optimisation, e.g. Comment.objects.raw("SELECT ...")
Given the information here on the "select top N from group" problem, if your IN clause will contain a small number of posts, it may just be cheaper to either (a) do the multiple queries or (b) select all comments for the posts and then filter in Python. I'd suggest using (a) if it's a small number of posts with lots of comments, and (b) if there will be relatively few comments per post.
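Option (b), grabbing all comments for the posts in one query and keeping only the top 3 per post in Python, can be sketched like this; the dicts below are hypothetical stand-ins for what Comment.objects.filter(post_id__in=some_post_ids).order_by('-id').values('id', 'post_id') would return:

```python
from collections import defaultdict

# Hypothetical comment rows, already ordered by -id (newest first),
# as a single query across all posts would return them.
rows = [
    {"id": 10, "post_id": 1},
    {"id": 9, "post_id": 2},
    {"id": 8, "post_id": 1},
    {"id": 7, "post_id": 1},
    {"id": 6, "post_id": 1},
    {"id": 5, "post_id": 2},
]

def latest_per_post(rows, n=3):
    """Keep only the first n rows seen for each post_id."""
    grouped = defaultdict(list)
    for row in rows:
        bucket = grouped[row["post_id"]]
        if len(bucket) < n:
            bucket.append(row["id"])
    return dict(grouped)

print(latest_per_post(rows))  # {1: [10, 8, 7], 2: [9, 5]}
```

Since the input is already sorted newest-first, the first n rows seen per post are its n latest comments, and the whole thing costs exactly one round trip to the database.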

Get the last element of the list in Django

I have a model:
class List(models.Model):
    data = ...
    previous = models.ForeignKey('List', related_name='r1')
    obj = models.ForeignKey('Obj', related_name='nodes')
This is a singly linked list containing a reference to some obj of the Obj class. I can reverse the relation and get all of a list's elements referring to obj with:
obj.nodes
But how can I get the very last node, without using raw SQL and with Django generating as few SQL queries as possible?
obj.nodes is a RelatedManager, not a list. As with any manager, you can get the last queried element by
obj.nodes.all().reverse()[0]
This makes sense only if a default ordering is defined on the Node's Meta class, because otherwise the semantics of 'reverse' don't make any sense. If you don't have an ordering specified, set it explicitly:
obj.nodes.order_by('-pk')[0]
len(obj.nodes) - 1
should give you the index of the last element (counting from 0) of your list, so something like
obj.nodes[len(obj.nodes) - 1]
should give the last element of the list.
I'm not sure it's good for your case, just give it a try :)
I see this question is quite old, but in newer versions of Django there are first() and last() methods on querysets now.
Well, you can just use the [-1] index and it will return the last element from the list. This question may be close to yours:
Getting the last element of a list in Python
For further reading: Django does not support negative indexing, and using something like
obj.nodes.all()[-1]
will raise an error.
In newer versions of Django you can use the last() function on a queryset to get the last item of your list.
obj.nodes.last()
Another approach is to use the len() function to get the index of the last item of a list:
obj.nodes[len(obj.nodes)-1]

In Django, what is the most efficient way to check for an empty query set?

I've heard suggestions to use the following:
if qs.exists():
...
if qs.count():
...
try:
qs[0]
except IndexError:
...
Copied from comment below: "I'm looking for a statement like "In MySQL and PostgreSQL count() is faster for short queries, exists() is faster for long queries, and use QuerySet[0] when it's likely that you're going to need the first element and you want to check that it exists. However, when count() is faster it's only marginally faster so it's advisable to always use exists() when choosing between the two."
query.exists() is the most efficient way.
Especially on Postgres, count() can be very expensive, sometimes more expensive than a normal select query.
exists() runs a query with no select_related, field selections or sorting, and only fetches a single record. This is much faster than counting the entire query with table joins and sorting.
qs[0] would still include select_related, field selections and sorting, so it would be more expensive.
The Django source code is here (django/db/models/sql/query.py, Query.has_results):
https://github.com/django/django/blob/60e52a047e55bc4cd5a93a8bd4d07baed27e9a22/django/db/models/sql/query.py#L499
def has_results(self, using):
    q = self.clone()
    if not q.distinct:
        q.clear_select_clause()
    q.clear_ordering(True)
    q.set_limits(high=1)
    compiler = q.get_compiler(using=using)
    return compiler.has_results()
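In other words, exists() boils down to a bare SELECT with LIMIT 1 and no real column list, ordering, or joins. The shape of that query can be sketched with plain sqlite3; the table and column names below are hypothetical:

```python
import sqlite3

# Hypothetical stand-in for a sessions table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE session (id INTEGER PRIMARY KEY, ip TEXT)")
conn.executemany("INSERT INTO session (ip) VALUES (?)", [("10.0.0.1",)] * 1000)

def row_exists(conn, ip):
    # Roughly what qs.exists() sends: a constant select, no ORDER BY,
    # and LIMIT 1, so the database can stop at the first matching row.
    row = conn.execute(
        "SELECT 1 FROM session WHERE ip = ? LIMIT 1", (ip,)
    ).fetchone()
    return row is not None

print(row_exists(conn, "10.0.0.1"))  # True
print(row_exists(conn, "10.9.9.9"))  # False
```

Because the database can short-circuit at the first match, the cost is largely independent of how many rows actually match, unlike count().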
Another gotcha that got me the other day is invoking a QuerySet in an if statement. That executes and returns the whole query!
If the variable query_set may be None (an unset argument to your function) then use:
if query_set is None:
    # ...
not:
if query_set:
    # you just hit the database
exists() is generally faster than count(), though not always (see test below). count() can be used to check for both existence and length.
Only use qs[0] if you actually need the object. It's significantly slower if you're just testing for existence.
On Amazon SimpleDB, 400,000 rows:
bare qs: 325.00 usec/pass
qs.exists(): 144.46 usec/pass
qs.count() 144.33 usec/pass
qs[0]: 324.98 usec/pass
On MySQL, 57 rows:
bare qs: 1.07 usec/pass
qs.exists(): 1.21 usec/pass
qs.count(): 1.16 usec/pass
qs[0]: 1.27 usec/pass
I used a random query for each pass to reduce the risk of db-level caching. Test code:
import timeit

base = """
import random
from plum.bacon.models import Session
ip_addr = str(random.randint(0, 255)) + '.' + str(random.randint(0, 255)) + '.' + str(random.randint(0, 255)) + '.' + str(random.randint(0, 255))
try:
    session = Session.objects.filter(ip=ip_addr)%s
    if session:
        pass
except:
    pass
"""
query_variations = [
    base % "",
    base % ".exists()",
    base % ".count()",
    base % "[0]",
]
for s in query_variations:
    t = timeit.Timer(stmt=s)
    print("%.2f usec/pass" % (1000000 * t.timeit(number=100) / 100000))
It depends on the context of use.
According to documentation:
Use QuerySet.count()
...if you only want the count, rather than doing len(queryset).
Use QuerySet.exists()
...if you only want to find out if at least one result exists, rather than doing if queryset.
But:
Don't overuse count() and exists()
If you are going to need other data from the QuerySet, just evaluate it.
So, I think that QuerySet.exists() is the most recommended way if you just want to check for an empty QuerySet. On the other hand, if you want to use results later, it's better to evaluate it.
I also think that your third option is the most expensive, because you need to retrieve all records just to check if any exists.
@Sam Odio's solution was a decent starting point, but there are a few flaws in the methodology, namely:
The random IP address could end up matching 0 or very few results
An exception would skew the results, so we should aim to avoid handling exceptions
So instead of filtering something that might match, I decided to exclude something that definitely won't match, hopefully still avoiding the DB cache, but also ensuring the same number of rows.
I only tested against a local MySQL database, with the dataset:
>>> Session.objects.all().count()
40219
Timing code:
import timeit

base = """
import random
import string
from django.contrib.sessions.models import Session
never_match = ''.join(random.choice(string.ascii_uppercase) for _ in range(10))
sessions = Session.objects.exclude(session_key=never_match){}
if sessions:
    pass
"""
query_variations = [
    "",
    ".exists()",
    ".count()",
    "[0]",
]
for variation in query_variations:
    t = timeit.Timer(stmt=base.format(variation))
    print("{} => {:02f} usec/pass".format(variation.ljust(10), 1000000 * t.timeit(number=100) / 100000))
outputs:
=> 1390.177710 usec/pass
.exists() => 2.479579 usec/pass
.count() => 22.426991 usec/pass
[0] => 2.437079 usec/pass
So you can see that count() is roughly 9 times slower than exists() for this dataset.
[0] is also fast, but it needs exception handling.
I would imagine that the first method is the most efficient way (you could easily implement it in terms of the second method, so perhaps they are almost identical). The last one requires actually getting a whole object from the database, so it is almost certainly the most expensive.
But, like all of these questions, the only way to know for your particular database, schema and dataset is to test it yourself.
I was also in this trouble. Yes, exists() is faster for most cases, but it depends a lot on the type of queryset you are trying to run. For example, for a simple query like my_objects = MyObject.objects.all(), you would use my_objects.exists(). But for a query like MyObject.objects.filter(some_attr='anything').exclude(something='what').distinct('key').values(), you probably need to test which one fits better (exists(), count(), len(my_objects)). Remember the DB engine is the one that will perform the query, and getting good performance depends a lot on the data structure and how the query is formed. One thing you can do is audit the queries, test them yourself against the DB engine, and compare the results; you will be surprised by how naive Django sometimes is. Try QueryCountMiddleware to see all the queries executed, and you will see what I am talking about.

In django, how do I sort a model on a field and then get the last item?

Specifically, I have a model that has a field like this
pub_date = models.DateField("date published")
I want to be able to easily grab the object with the most recent pub_date. What is the easiest/best way to do this?
Would something like the following do what I want?
Edition.objects.order_by('pub_date')[:-1]
obj = Edition.objects.latest('pub_date')
You can also simplify things by putting get_latest_by in the model's Meta, then you'll be able to do
obj = Edition.objects.latest()
See the docs for more info. You'll probably also want to set the ordering Meta option.
Harley's answer is the way to go for the case where you want the latest according to some ordering criteria for particular Models, as you do, but the general solution is to reverse the ordering and retrieve the first item:
Edition.objects.order_by('-pub_date')[0]
Note:
Normal Python lists accept negative indexes, which signify an offset from the end of the list rather than the beginning, as positive numbers do. However, QuerySet objects will raise AssertionError: Negative indexing is not supported. if you use a negative index, which is why you have to do what insin said: reverse the ordering and grab the 0th element.
Be careful of using
Edition.objects.order_by('-pub_date')[0]
as you might be indexing an empty QuerySet. I'm not sure what the correct Pythonic approach is, but the simplest would be to wrap it in an if/else or try/except:
try:
    last = Edition.objects.order_by('-pub_date')[0]
except IndexError:
    # Didn't find anything...
    pass
But, as @Harley said, when you're ordering by date, latest() is the djangonic way to do it.
This has already been answered, but for more reference, this is what Django Book has to say about Slicing Data on QuerySets:
Note that negative slicing is not supported:
>>> Publisher.objects.order_by('name')[-1]
Traceback (most recent call last):
...
AssertionError: Negative indexing is not supported.
This is easy to get around, though. Just change the order_by()
statement, like this:
>>> Publisher.objects.order_by('-name')[0]
Refer to the link for more such details. Hope that helps!