Django Really Simple Aggregation (or Group By) - sql

Imagine I just want to know the number of users with the same first_name in Django's auth app.
I know how to do this really easily in SQL:
SELECT first_name, COUNT(1) as num_users
FROM auth_user
GROUP BY first_name
ORDER BY num_users DESC;
And I also know how to get the desired output in Django (e.g. by going through all the users, getting each one's email, and doing a filter and count).
Isn't there a simpler way to do this via Django's ORM? I can accomplish it if I'm aggregating with a foreign key but not with one of the table fields. I'm pretty sure I'm missing something.
Thanks.

I blogged about this very issue a couple of years ago. Contrary to the other answers, it's perfectly possible in Django, with no need for raw SQL.

Django's annotations allow you to attach some basic calculations to each object in your queryset (or aggregations across the entire queryset), but you can't filter those annotations (i.e. in your case, you only want to count those users who share your name).
Django also has F() objects, which allow you to use a field's value within a query. Ideally you could use these in conjunction with annotations to filter the objects you are annotating, but that's not currently possible (there's a fix on the way).
So, an easy solution is to perform the annotation manually:
# Alias the inner table (u2) so the subquery can refer to the outer auth_user row.
users = User.objects.all().extra(select={
    'same_name_count': """
        SELECT COUNT(*)
        FROM auth_user AS u2
        WHERE u2.first_name = auth_user.first_name
    """
})
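The correlated count behind this annotation can be checked outside Django. Below is a minimal sketch using Python's sqlite3 module with a made-up in-memory auth_user table; the sample names and the u2 alias are assumptions for illustration only.

```python
import sqlite3

# Hypothetical in-memory stand-in for Django's auth_user table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auth_user (id INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany(
    "INSERT INTO auth_user (first_name) VALUES (?)",
    [("Ann",), ("Bob",), ("Ann",), ("Cid",), ("Ann",), ("Bob",)],
)

# For every user row, count how many rows share its first_name.
# The inner table is aliased (u2) so the subquery can refer back to
# the row of the outer auth_user query.
same_name_rows = conn.execute(
    """
    SELECT id, first_name,
           (SELECT COUNT(*)
            FROM auth_user AS u2
            WHERE u2.first_name = auth_user.first_name) AS same_name_count
    FROM auth_user
    ORDER BY id
    """
).fetchall()

for row in same_name_rows:
    print(row)
```

Each user keeps its own row here, which is the difference from a GROUP BY: the count is attached per object rather than per distinct name.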

Check this: https://docs.djangoproject.com/en/dev/topics/db/aggregation/
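from django.contrib.auth.models import User
from django.db.models import Count
# values() before annotate() groups by first_name instead of by user.
User.objects.values('first_name').annotate(num_users=Count('id')).order_by('-num_users')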
For more complex queries you can use plain SQL, but try to avoid it.
UPD: Code fixed. Thanks to Timmy O'Mahony for the mention!
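To check what any ORM version of the query should return, the grouped count from the question can be reproduced outside Django. This is a minimal sketch using Python's sqlite3 module; the in-memory auth_user table and the sample names are assumptions.

```python
import sqlite3

# Hypothetical in-memory stand-in for Django's auth_user table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE auth_user (id INTEGER PRIMARY KEY, first_name TEXT)")
conn.executemany(
    "INSERT INTO auth_user (first_name) VALUES (?)",
    [("Ann",), ("Bob",), ("Ann",), ("Cid",), ("Ann",), ("Bob",)],
)

# The GROUP BY query from the question; first_name is added as a
# tie-breaker so the ordering is deterministic.
name_counts = conn.execute(
    """
    SELECT first_name, COUNT(1) AS num_users
    FROM auth_user
    GROUP BY first_name
    ORDER BY num_users DESC, first_name
    """
).fetchall()

for first_name, num_users in name_counts:
    print(first_name, num_users)
```

An ORM query that groups by first_name should yield one entry per distinct name, like the rows printed here.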

Related

Django ORM filter multiple fields using 'IN' statement

So I have the following model in Django:
class MemberLoyalty(models.Model):
    date_time = models.DateField(primary_key=True)
    member = models.ForeignKey(Member, models.DO_NOTHING)
    loyalty_value = models.IntegerField()
My goal is to get, for each member, the row with the most recent date. There are many ways to do it; one of them is using a subquery that groups by member with the max date_time and filtering member_loyalty with its results. The working SQL for this solution is as follows:
SELECT
*
FROM
member_loyalty
WHERE
(date_time , member_id) IN (SELECT
max(date_time), member_id
FROM
member_loyalty
GROUP BY member_id);
Another way to do this would be by joining with the subquery.
How could I translate this into a Django query? I could not find a way to filter on two fields using IN, nor a way to join with a subquery using a specific ON clause.
I've tried:
cls.objects.values('member_id', 'loyalty_value').annotate(latest_date=Max('date_time'))
But it starts grouping by the loyalty_value.
I also tried building the subquery, but can't find how to join it or use it in a filter:
subquery = cls.objects.values('member_id').annotate(max_date=Max('date_time'))
Also, I am using MySQL, so I cannot use the .distinct('param') method.
This is a typical greatest-n-per-group query. Stack Overflow even has a tag for it.
I believe the most efficient way to do it with recent versions of Django is via a window query. Something along these lines should do the trick.
MemberLoyalty.objects.all().annotate(my_max=Window(
expression=Max('date_time'),
partition_by=F('member')
)).filter(my_max=F('date_time'))
Update: This actually won't work on Django versions before 4.2, because Window annotations are not filterable there. I think that in order to filter on a window annotation you need to wrap it inside a Subquery, but with a Subquery you are not obligated to use a Window function at all; there is another way to do it, which is my next example.
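In plain SQL, filtering on a window result means computing it in an inner query and filtering in an outer one. Here is a minimal sketch with Python's sqlite3 module, assuming an SQLite build with window function support (3.25 or later); the table contents are made up, and date_time is deliberately not a primary key so a member can have several rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE member_loyalty "
    "(date_time TEXT, member_id INTEGER, loyalty_value INTEGER)"
)
conn.executemany(
    "INSERT INTO member_loyalty VALUES (?, ?, ?)",
    [
        ("2019-01-01", 1, 10),
        ("2019-02-01", 1, 20),
        ("2019-01-15", 2, 5),
        ("2019-03-01", 2, 7),
        ("2019-02-10", 3, 1),
    ],
)

# The inner query attaches each member's latest date to every row;
# the outer query keeps only the rows that carry that latest date.
latest_rows = conn.execute(
    """
    SELECT date_time, member_id, loyalty_value
    FROM (
        SELECT *, MAX(date_time) OVER (PARTITION BY member_id) AS my_max
        FROM member_loyalty
    )
    WHERE date_time = my_max
    ORDER BY member_id
    """
).fetchall()

for row in latest_rows:
    print(row)
```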
If either MySQL or Django does not support window queries, then a Subquery comes into play.
MemberLoyalty.objects.filter(
date_time=Subquery(
(MemberLoyalty.objects
.filter(member=OuterRef('member'))
.values('member')
.annotate(max_date=Max('date_time'))
.values('max_date')[:1]
)
)
)
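The ORM expression above corresponds roughly to a correlated subquery in SQL. The sketch below runs that shape of query with Python's sqlite3 module on made-up data; the aliases outer_ml and inner_ml are hypothetical, and the SQL Django actually generates may differ in detail.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE member_loyalty "
    "(date_time TEXT, member_id INTEGER, loyalty_value INTEGER)"
)
conn.executemany(
    "INSERT INTO member_loyalty VALUES (?, ?, ?)",
    [
        ("2019-01-01", 1, 10),
        ("2019-02-01", 1, 20),
        ("2019-01-15", 2, 5),
        ("2019-03-01", 2, 7),
        ("2019-02-10", 3, 1),
    ],
)

# For each outer row, the subquery looks up the latest date of the
# same member; the row is kept only if it carries that date.
latest_rows = conn.execute(
    """
    SELECT date_time, member_id, loyalty_value
    FROM member_loyalty AS outer_ml
    WHERE date_time = (
        SELECT MAX(inner_ml.date_time)
        FROM member_loyalty AS inner_ml
        WHERE inner_ml.member_id = outer_ml.member_id
    )
    ORDER BY member_id
    """
).fetchall()

for row in latest_rows:
    print(row)
```

This form needs neither window functions nor row-value IN comparisons, so it is a reasonable fallback on older database versions.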
If even Subqueries are not available (pre Django 1.11), then this should also work:
MemberLoyalty.objects.annotate(
    max_date=Max('member__memberloyalty__date_time')
).filter(max_date=F('date_time'))

How to simulate ActiveRecord Model.count.to_sql

I want to display the SQL used in a count. However, Model.count.to_sql will not work because count returns a Fixnum, which doesn't have a to_sql method. I think the simplest solution is to do this:
Model.where(nil).to_sql.sub(/SELECT.*FROM/, "SELECT COUNT(*) FROM")
This creates the same SQL as is used in Model.count, but is it going to cause a problem further down the line, for example if I add a complicated where clause and some joins?
Is there a better way of doing this?
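One concrete risk of the substitution is that `.*` is greedy, so it runs to the last FROM in the statement. The sketch below shows this in Python, whose regex engine treats this pattern the same way Ruby's does; the SQL strings are made-up examples, not output captured from Rails.

```python
import re


def to_count_sql(select_sql: str) -> str:
    """Greedy version: replaces everything from SELECT to the LAST FROM."""
    return re.sub(r"SELECT.*FROM", "SELECT COUNT(*) FROM", select_sql, count=1)


def to_count_sql_lazy(select_sql: str) -> str:
    """Lazy version: replaces everything from SELECT to the FIRST FROM."""
    return re.sub(r"SELECT.*?FROM", "SELECT COUNT(*) FROM", select_sql, count=1)


simple = 'SELECT "models".* FROM "models" WHERE "models"."active" = 1'
nested = (
    'SELECT "models".* FROM "models" WHERE "models"."id" IN '
    '(SELECT "child_models"."model_id" FROM "child_models")'
)

simple_count = to_count_sql(simple)
nested_count = to_count_sql(nested)            # subquery swallows the outer query
lazy_nested_count = to_count_sql_lazy(nested)  # outer query survives

print(simple_count)
print(nested_count)
print(lazy_nested_count)
```

The lazy pattern fixes a subquery in the WHERE clause, but it would break in the opposite case of a subquery inside the select list, so string rewriting stays fragile either way.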
You can try
Model.select("count(*) as model_count").to_sql
You may want to dip into Arel:
Model.select(Arel.star.count).to_sql
ASIDE:
I find I often want to find sub counts, so I embed the count(*) into another query:
child_counts = ChildModel.select(Arel.star.count)
.where(Model.arel_attribute(:id).eq(
ChildModel.arel_attribute(:model_id)))
Model.select(Arel.star).select(child_counts.as("child_count"))
.order(:id).limit(10).to_sql
which then gives you all the child counts for each of the models:
SELECT *,
(
SELECT COUNT(*)
FROM "child_models"
WHERE "models"."id" = "child_models"."model_id"
) child_count
FROM "models"
ORDER BY "models"."id" ASC
LIMIT 10
Best of luck
UPDATE:
Not sure if you are trying to solve this in a generic way or not. Also not sure what kind of scopes you are using on your Model.
We do have a method that automatically runs a count for a query that is passed to the UI layer. I found count(:all) to be more stable than plain count, but it sounds like that does not overlap with your use case. Maybe you can improve your solution using the except clause that we use:
scope.except(:select, :includes, :references, :offset, :limit, :order)
.count(:all)
The where clause and the joins necessary for the where clause work just fine for us. We tend to keep the joins and the where clause, since they need to be part of the count. You definitely want to remove the includes (which Rails should remove automatically, in my opinion), but the references are much trickier, especially when one references a has_many and requires a distinct; that starts to throw a wrench in there. If you need to use references, you may be able to convert them to a left_join.
You may want to double check the parameters that these "join" methods take. Some of them take table names and others take relation names. Later Rails versions have gotten better and take relation names; be sure you are looking at the docs for the right version of Rails.
Also, in our case, we spend more time trying to get sub selects with more complicated relationships, so we have to do some munging. It looks like we are not dealing with where clauses as much.

Join with subquery in Django ORM

I want to run a filter using Django's ORM such that I get a distinct set of users with each user's most recent session. I have the tables set up so that a user has many sessions; there is a User and Session model with the Session model having a user = models.ForeignKey(User).
What I've tried so far is Users.objects.distinct('username').order_by('session__last_accessed'), but I know that this won't work because Django puts the session.last_accessed column into the selection, and so it's returning me, for example, 5 duplicate usernames with 5 distinct sessions rather than the single recent session and user.
Is it possible to query this via Django's ORM?
Edit: Okay, after some testing with SQL I've found that the SQL I want to use is:
select user.username, sub_query.last_accessed from (
select user_id, max(last_accessed) as last_accessed
from session
group by user_id
) sub_query
join user on
user.id = sub_query.user_id
order by sub_query.last_accessed desc
limit 5
And I can do sub_query via Session.objects.values('user').annotate(last_accessed=Max('last_accessed')). How can I use this sub_query to get the data I want with the ORM?
Edit 2: Specifically, I want to do this by performing one query only, like the SQL above does. Of course, I can query twice and do some processing in Python, but I'd prefer to hit the database once while using the ORM.
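The SQL above can be tried on a toy dataset to pin down the expected result. This is a minimal sketch with Python's sqlite3 module; the rows are made up, and the second query (a plain join with GROUP BY) is included only as an assumption-labelled alternative that returns the same rows in one statement.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE "user" (id INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE "session" (
        id INTEGER PRIMARY KEY,
        user_id INTEGER REFERENCES "user"(id),
        last_accessed TEXT
    );
    INSERT INTO "user" (id, username)
        VALUES (1, 'alice'), (2, 'bob'), (3, 'carol');
    INSERT INTO "session" (user_id, last_accessed) VALUES
        (1, '2013-01-01'), (1, '2013-03-01'),
        (2, '2013-02-01'),
        (3, '2013-01-15'), (3, '2013-02-20');
    """
)

# The derived-table join from the question.
join_subquery = conn.execute(
    """
    SELECT u.username, sub_query.last_accessed
    FROM (
        SELECT user_id, MAX(last_accessed) AS last_accessed
        FROM "session"
        GROUP BY user_id
    ) AS sub_query
    JOIN "user" AS u ON u.id = sub_query.user_id
    ORDER BY sub_query.last_accessed DESC
    LIMIT 5
    """
).fetchall()

# Same rows from a plain join grouped by user.
join_group_by = conn.execute(
    """
    SELECT u.username, MAX(s.last_accessed) AS last_accessed
    FROM "user" AS u
    JOIN "session" AS s ON s.user_id = u.id
    GROUP BY u.id, u.username
    ORDER BY MAX(s.last_accessed) DESC
    LIMIT 5
    """
).fetchall()

print(join_subquery)
print(join_group_by)
```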
If you are using the MySQL backend, the following solution can be useful:
users_in_session = Session.objects.values_list('user_id', flat=True)
sessions_by_the_user_list = Session.objects \
.filter(user__in=set(users_in_session)) \
.order_by('last_accessed').distinct()
If you use the sub_query, then the order_by('last_accessed') call should be good enough to get the data as an ordered list, although as far as I have tested the results seemed unstable.
Update:
You can try:
Session.objects.values('user') \
.annotate(last_accessed=Max('last_accessed')) \
.order_by('last_accessed').distinct()
Calling distinct('username') shouldn't ever return duplicate usernames. Are you sure you are using a Django version that supports .distinct(fields), that is, Django 1.4 or later? Prior to Django 1.4, .distinct(fields) was accepted by the ORM, but it didn't actually produce the correct DISTINCT ON query.
Another hint that things aren't working as expected is that .distinct('username').order_by('session__last_accessed') isn't a valid query: order_by should have username as its first argument, because the order_by fields must start with the field names given in the .distinct() call. See https://docs.djangoproject.com/en/1.4/ref/models/querysets/#django.db.models.query.QuerySet.distinct for details.

MongoDB custom field query

I am not sure whether this is a duplicate question (I don't think so), but it's a very interesting question for me:
In SQL we can create a custom field and put it in the result:
SELECT p.*, totalOrder=(SELECT sum(price) from orders where id=p.id)
FROM products p;
so the result is a list of products with a totalOrder value.
What is the best approach in NoSQL (MongoDB)?
I am sure we should have two types of documents (products and orders), and I know we don't have joins, but the question is: do we have custom field assignment in find queries?
When you use aggregation, you have the $project stage, which is exactly that. It is used to rename fields or derive field values through some simple operators. But as usual with MongoDB, you cannot get any data from another collection (at least before the $lookup stage was added in MongoDB 3.2).
When you need to do something which is too complex to express with aggregation, you can use MapReduce and build your output documents with JavaScript. But again, no breaking out of the collection.
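When the totals are not computed inside the database, one workaround is to fetch both sets of documents and combine them in application code. The sketch below uses plain Python lists of dicts as stand-ins for the results of two separate queries; the field names product_id, price and totalOrder are assumptions based on the question's SQL, not a real schema.

```python
from collections import defaultdict

# Hypothetical documents standing in for two MongoDB collections.
products = [
    {"_id": 1, "name": "pen"},
    {"_id": 2, "name": "ink"},
    {"_id": 3, "name": "pad"},
]
orders = [
    {"product_id": 1, "price": 2.5},
    {"product_id": 1, "price": 2.5},
    {"product_id": 2, "price": 10.0},
]

# First pass: sum order prices per product.
totals = defaultdict(float)
for order in orders:
    totals[order["product_id"]] += order["price"]

# Second pass: attach the derived field to each product document.
products_with_totals = [
    {**product, "totalOrder": totals.get(product["_id"], 0.0)}
    for product in products
]

for product in products_with_totals:
    print(product)
```

This costs two round trips and some memory on the client, which is the trade-off for keeping the two collections separate.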

Any way to merge two queries in solr?

In my project, we use Solr to index many different kinds of documents, for example Books and Persons, with some common fields (like the name) and some type-specific fields (like the category, or the group people belong to).
We would like to do queries that can find both books and persons, with some filters applied for each document type. Something like:
find all Books and Persons with "Jean" in the name and/or content
but only Books from category "fiction" and "fantasy"
and only Persons from the group "pangolin"
everything sorted by score
A very simple way to do that would be:
q = name:jean content:jean
&
fq=
(type:book AND category:(fiction fantasy))
OR
(type:person AND group:pangolin)
But alas, as fq are cached, I'd prefer something that allows simpler and therefore more reusable fq, like:
fq=type:book,
fq=type:person,
fq=category:(fiction fantasy),
fq=group:pangolin.
Is there a way to tell Solr to merge or combine many queries? Something like 'grouping' fq together.
I read a bit about nested queries with _query_, but the scant documentation about it makes me think it's not the solution I'm looking for.
As Geert-Jan mentioned in his answer, the ability to OR fq together is a requested Solr feature, but one with very little support so far: https://issues.apache.org/jira/browse/SOLR-1223
So I managed to simulate what I want in a simple way:
for each field a document type can have, we always have to define a value (so if, in my own example, Books can have no category, at index time we still have to define something like category=noCategoryCode)
when using a filter on one of these fields in a query on multiple types, we add a non-present condition to the filter, so fq=category:fiction becomes fq=category:fiction (*:* AND -category:*)
This way, all other types (like Person) will pass through this filter, and the filter stays quite atomic and frequently used, so caching is still useful.
So, my full example becomes:
q = name:jean content:jean
&
fq= type:(book person)
&
fq= category:(fiction fantasy) (*:* AND -category:*)
&
fq= group:(pangolin) (*:* AND -group:*)
Still, can't wait for SOLR-1223 to be patched :)
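The logic of the workaround can be checked without a Solr instance. The sketch below models each fq as a Python predicate over made-up documents (a missing key stands for a field that is not indexed) and compares the intersection of the atomic filters with the single combined OR filter; it simulates the boolean logic only and says nothing about Solr's parsing or caching.

```python
# Hypothetical documents; a missing key means the field is not indexed.
docs = [
    {"id": 1, "type": "book", "category": "fiction"},
    {"id": 2, "type": "book", "category": "cooking"},
    {"id": 3, "type": "person", "group": "pangolin"},
    {"id": 4, "type": "person", "group": "aardvark"},
    {"id": 5, "type": "book", "category": "fantasy"},
]


def field_in_or_absent(field, allowed):
    """Model fq=field:(a b) (*:* AND -field:*): allowed value, or no field at all."""
    return lambda doc: field not in doc or doc[field] in allowed


# Each entry models one reusable fq; Solr intersects all fq results.
atomic_filters = [
    lambda doc: doc["type"] in {"book", "person"},
    field_in_or_absent("category", {"fiction", "fantasy"}),
    field_in_or_absent("group", {"pangolin"}),
]


def combined_filter(doc):
    """Model the single large fq with the OR between the two types."""
    is_wanted_book = doc["type"] == "book" and doc.get("category") in {"fiction", "fantasy"}
    is_wanted_person = doc["type"] == "person" and doc.get("group") == "pangolin"
    return is_wanted_book or is_wanted_person


via_atomic = [d["id"] for d in docs if all(f(d) for f in atomic_filters)]
via_combined = [d["id"] for d in docs if combined_filter(d)]

print(via_atomic, via_combined)
```

The two lists agree only because every book here has a category and every person a group; a book without a category would slip through the atomic filters, which is why a placeholder value has to be indexed.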
You can apply multiple filter queries at the same time:
q=name:jean content:jean&fq=type:book&fq=type:person&fq=category:(fiction fantasy)&fq=group:pangolin
Perhaps I am not understanding your issue, but the only difference between a query and a filter is that the filter is cached. If you don't care about the caching, just modify the query:
real query +((type:book category:fiction) (type:person group:pangolin))