I use Peewee as my ORM, and while trying to use .first() to get the first element in a list of objects, I ran into this weird behaviour where the in-memory list of objects gets modified.
In [7]: events = Event.select()
In [8]: len(events)
Out[8]: 10000 # correct count of events in my local DB
In [9]: first_event = events.first()
In [10]: first_event
Out[10]: <Event: 311718318>
In [11]: len(events)
Out[11]: 1 # the in-memory list of events has somehow been changed
I initially had 10k Event objects, and just by calling .first(), that in-memory list of events gets modified somehow. This seems very weird, and I faced a lot of customer issues in production because of it. So I wanted to know whether this is a known issue with Peewee, or whether my understanding isn't right.
From my previous experience with Django, .first() simply returns the first element in the list or None, and doesn't do anything to the in-memory list of objects. Why is the behaviour different in Peewee? I couldn't find anything relevant in the documentation.
Yes, Peewee's .first() explicitly applies a LIMIT 1 so that multiple invocations of .first() do not result in multiple queries being executed when you only want the first result. This behavior is a bit different from .get(), which executes a fresh query every time you call it.
You can explicitly clone your query before calling .first() if this behavior is undesirable:
events = Event.select()
first = events.clone().first()
Alternatively, I'd suggest using .get() and catching Event.DoesNotExist, or using a plain old first_event = events[0] and catching the possible IndexError.
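For completeness, a minimal sketch of those two alternatives (using the Event model from the question; as noted above, Peewee's .get() raises Event.DoesNotExist and indexing raises a plain IndexError on an empty result set):

events = Event.select()

# .get() runs a fresh query each time and raises if nothing matches.
try:
    first_event = events.get()
except Event.DoesNotExist:
    first_event = None

# Plain indexing; raises IndexError when the result set is empty.
try:
    first_event = events[0]
except IndexError:
    first_event = None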
I'm trying to filter for the first (or last) n Message objects which match a certain filter. Right now I need to filter for all matches then slice.
def get(self, request, chat_id, n):
    last_n_messages = Message.objects.filter(chat=chat_id).order_by('-id')[:n]
    last_n_sorted = reversed(last_n_messages)
    serializer = MessageSerializer(last_n_sorted, many=True)
    return Response(serializer.data, status=status.HTTP_200_OK)
This is clearly not efficient. Is there a way to get the first (or last) n items without exhaustively loading every match?
I found the answer myself. Basically, slicing with [:n] puts a SQL LIMIT in the query, which makes the query itself NOT exhaustive. So it's fine in terms of efficiency.
Link to a similar answer below.
https://stackoverflow.com/a/6574137/4775212
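To verify this, you can print the SQL Django generates for the sliced queryset (a quick check reusing the view's Message query; the exact SQL depends on your backend):

qs = Message.objects.filter(chat=chat_id).order_by('-id')[:n]
print(qs.query)  # the generated SQL ends with a LIMIT clause, so only n rows are fetched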
If the limit is known, it can also be done using range:
Message.objects.filter(chat__range=[chat_id, chat_id+n]).order_by('-id')
class Seller(models.Model):
    type = ...
    name = ...
    cars = models.ManyToManyField(Car)

class PotentialBuyer(models.Model):
    name = ...
    cars = models.ManyToManyField(Car)

class Car(models.Model):
    extra_field = ...
    extra_field2 = ...
Suppose I have a relationship like this. I would like to use the extra queryset modifier to get the list of cars that have already been picked out by PotentialBuyers when I fetch a seller object. I suppose the queryset will look something like this:
def markPending(self):
    return self.extra(select={'pending': 'select images from PotentialBuyer as t ...'})
How can I accomplish this? Is there a better way? I could fetch the seller object and the potential buyers and do set operations, but I think it would be cleaner to have this handled by the database. I am using PostgreSQL 9.5.
I think the Exists subquery expression will do what you want, or at least it'll get you started on the right path (see the docs). Or you might want to use an aggregate to count the number of them.
Edit: If you need to select the full objects rather than the count, existence or a single entity, then use a Prefetch instance in prefetch_related. https://docs.djangoproject.com/en/2.0/ref/models/querysets/#django.db.models.Prefetch
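For illustration, a minimal sketch of both suggestions (model and field names are taken from the question; the picked and cars_with_pending names are just placeholders):

from django.db.models import Exists, OuterRef, Prefetch

# Annotate each car with whether any PotentialBuyer has already picked it.
picked = PotentialBuyer.objects.filter(cars=OuterRef('pk'))
cars = Car.objects.annotate(pending=Exists(picked))

# Or attach the annotated cars to each seller in one extra query.
sellers = Seller.objects.prefetch_related(
    Prefetch(
        'cars',
        queryset=Car.objects.annotate(pending=Exists(picked)),
        to_attr='cars_with_pending',
    )
)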
Not quite the answer, but this is the solution I ended up with, and I am satisfied with the performance. Perhaps someone can answer the question later:
from django.db.models import BooleanField, Case, Value, When
from api.models import PotentialBuyer

potentials = PotentialBuyer.objects.filter(owner=user_id, default=True).first().cars.all()
Car.objects.filter(....).annotate(pending=Case(When(id__in=potentials, then=Value(True)), default=Value(False), output_field=BooleanField()))
I am trying to find a complement using ActiveRecord and/or SQL.
I have a collection of 'annotations' that each have two relevant fields:
session_datum_id, which corresponds to the user who performed the annotation. Null means it has not yet been done.
post_id, which represents the post the annotation is 'about'. Cannot be null.
There are potentially multiple annotations per post_id.
I would like to efficiently find an annotation that satisfies two constraints:
session_datum_id is null. This means this particular annotation hasn't already been performed.
the session_datum passed in as an arg has not already performed another annotation with the same post_id.
Here is a very naive version which does a join outside the DB. It finds all annotations this user has already performed and removes those post_ids from the exhaustive list of annotations that still need to be performed. It then picks at random from the resulting list:
def self.random_empty_unseen(session_datum)
  mine = where('session_datum_id = ?', session_datum)
  eligible = where('session_datum_id IS NULL')
  mine.each do |i|
    eligible.each do |j|
      if (i.post_id == j.post_id)
        eligible.delete(j)
      end
    end
  end
  eligible[rand(eligible.count)]
end
As the list of annotations gets large, this will bog down terribly. I can imagine a probabilistic algorithm where we select an eligible annotation at random and then check whether the user has already performed it (retrying if so), but there are degenerate cases where that won't work (a large number of annotations where the user has performed all but one of them).
Is there a closed form query for this, perhaps using NOT EXISTS?
-- anti-join: unassigned annotations whose post this session_datum has not already annotated
SELECT a1.*
FROM annotations AS a1
LEFT JOIN annotations AS a2
  ON a1.post_id = a2.post_id
 AND a2.session_datum_id = session_datum
WHERE a1.session_datum_id IS NULL
  AND a2.id IS NULL
I've heard suggestions to use the following:
if qs.exists():
    ...

if qs.count():
    ...

try:
    qs[0]
except IndexError:
    ...
Copied from a comment below: "I'm looking for a statement like 'In MySQL and PostgreSQL count() is faster for short queries, exists() is faster for long queries, and use QuerySet[0] when it's likely that you're going to need the first element and you want to check that it exists. However, when count() is faster it's only marginally faster, so it's advisable to always use exists() when choosing between the two.'"
query.exists() is the most efficient way.
Especially on Postgres, count() can be very expensive, sometimes more expensive than a normal select query.
exists() runs a query with no select_related, field selections or sorting, and only fetches a single record. This is much faster than counting the entire query with table joins and sorting.
qs[0] would still include select_related, field selections and sorting, so it would be more expensive.
The Django source code is here (django/db/models/sql/query.py, Query.has_results):
https://github.com/django/django/blob/60e52a047e55bc4cd5a93a8bd4d07baed27e9a22/django/db/models/sql/query.py#L499
def has_results(self, using):
    q = self.clone()
    if not q.distinct:
        q.clear_select_clause()
    q.clear_ordering(True)
    q.set_limits(high=1)
    compiler = q.get_compiler(using=using)
    return compiler.has_results()
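Putting that together, here is roughly what each idiom asks the database for (the Entry/author/created names are hypothetical, and the SQL shapes are approximate and backend-dependent):

qs = Entry.objects.select_related('author').order_by('-created')  # hypothetical queryset

qs.exists()   # SELECT (1) ... LIMIT 1  -- drops select_related, the column list and the ordering
qs.count()    # SELECT COUNT(*) ...     -- still has to count every matching row
qs[0]         # full SELECT with the select_related join and ORDER BY ... LIMIT 1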
Another gotcha that got me the other day is using a QuerySet in an if statement. That evaluates the whole queryset and pulls every matching row into memory!
If the variable query_set may be None (an unset argument to your function), then use:

if query_set is None:
    # ...

not:

if query_set:
    # you just hit the database
exists() is generally faster than count(), though not always (see test below). count() can be used to check for both existence and length.
Only use qs[0] if you actually need the object. It's significantly slower if you're just testing for existence.
On Amazon SimpleDB, 400,000 rows:
bare qs: 325.00 usec/pass
qs.exists(): 144.46 usec/pass
qs.count(): 144.33 usec/pass
qs[0]: 324.98 usec/pass
On MySQL, 57 rows:
bare qs: 1.07 usec/pass
qs.exists(): 1.21 usec/pass
qs.count(): 1.16 usec/pass
qs[0]: 1.27 usec/pass
I used a random query for each pass to reduce the risk of db-level caching. Test code:
import timeit

base = """
import random
from plum.bacon.models import Session

ip_addr = str(random.randint(0,256))+'.'+str(random.randint(0,256))+'.'+str(random.randint(0,256))+'.'+str(random.randint(0,256))
try:
    session = Session.objects.filter(ip=ip_addr)%s
    if session:
        pass
except:
    pass
"""

query_variations = [
    base % "",
    base % ".exists()",
    base % ".count()",
    base % "[0]",
]

for s in query_variations:
    t = timeit.Timer(stmt=s)
    print "%.2f usec/pass" % (1000000 * t.timeit(number=100)/100000)
It depends on the usage context.
According to the documentation:
Use QuerySet.count()
...if you only want the count, rather than doing len(queryset).
Use QuerySet.exists()
...if you only want to find out if at least one result exists, rather than using if queryset.
But:
Don't overuse count() and exists()
If you are going to need other data from the QuerySet, just evaluate it.
So, I think that QuerySet.exists() is the most recommended way if you just want to check for an empty QuerySet. On the other hand, if you want to use the results later, it's better to evaluate the queryset.
I also think that your third option is the most expensive, because it retrieves a full object (with all of its columns and any joins) just to check whether anything exists.
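To illustrate the "don't overuse" advice, a small sketch (reusing the Message model from the earlier question as a stand-in; serialize is a placeholder):

# If you are going to use the objects anyway, evaluate once instead of probing first.
messages = list(Message.objects.filter(chat=chat_id))   # one query
if messages:                                             # no extra query
    serialize(messages)

# By contrast, this hits the database twice:
qs = Message.objects.filter(chat=chat_id)
if qs.exists():                                          # query 1
    serialize(list(qs))                                  # query 2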
Sam Odio's solution was a decent starting point, but there are a few flaws in the methodology, namely:
The random IP address could end up matching 0 or very few results
An exception would skew the results, so we should aim to avoid handling exceptions
So instead of filtering something that might match, I decided to exclude something that definitely won't match, hopefully still avoiding the DB cache, but also ensuring the same number of rows.
I only tested against a local MySQL database, with the dataset:
>>> Session.objects.all().count()
40219
Timing code:
import timeit

base = """
import random
import string
from django.contrib.sessions.models import Session

never_match = ''.join(random.choice(string.ascii_uppercase) for _ in range(10))
sessions = Session.objects.exclude(session_key=never_match){}
if sessions:
    pass
"""

query_variations = [
    "",
    ".exists()",
    ".count()",
    "[0]",
]

for variation in query_variations:
    t = timeit.Timer(stmt=base.format(variation))
    print "{} => {:02f} usec/pass".format(variation.ljust(10), 1000000 * t.timeit(number=100)/100000)
outputs:
=> 1390.177710 usec/pass
.exists() => 2.479579 usec/pass
.count() => 22.426991 usec/pass
[0] => 2.437079 usec/pass
So you can see that count() is roughly 9 times slower than exists() for this dataset.
[0] is also fast, but it needs exception handling.
I would imagine that the first method is the most efficient way (you could easily implement it in terms of the second method, so perhaps they are almost identical). The last one requires actually getting a whole object from the database, so it is almost certainly the most expensive.
But, like all of these questions, the only way to know for your particular database, schema and dataset is to test it yourself.
I ran into this issue as well. Yes, exists() is faster in most cases, but it depends a lot on the kind of queryset you are running.
For example, for a simple query like my_objects = MyObject.objects.all(), you would use my_objects.exists(). But for a query like MyObject.objects.filter(some_attr='anything').exclude(something='what').distinct('key').values(), you probably need to test which one fits better (exists(), count(), len(my_objects)). Remember that the DB engine is the one that actually performs the query, and getting good performance depends a lot on the data structure and how the query is formed. One thing you can do is audit the queries, run them yourself against the DB engine, and compare the results; you may be surprised by how naive Django can sometimes be. Try QueryCountMiddleware to see all the queries executed, and you will see what I am talking about.
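If you don't want to install extra middleware, a quick way to inspect the SQL Django actually ran is django.db.connection.queries (only populated when DEBUG=True; MyObject and some_attr are stand-in names from the example above):

from django.db import connection, reset_queries

reset_queries()
qs = MyObject.objects.filter(some_attr='anything')
qs.exists()

# Inspect how many statements were executed and what the last one looked like.
print(len(connection.queries))
print(connection.queries[-1]['sql'])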