Aggregation of an annotation in GROUP BY in Django - sql

UPDATE
Thanks to the posted answer, I found a much simpler way to formulate the problem. The original question can be seen in the revision history.
The problem
I am trying to translate an SQL query into Django, but am getting an error that I don't understand.
Here is the Django model I have:
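from django.db import models

class Title(models.Model):
    title_id = models.CharField(primary_key=True, max_length=12)
    title = models.CharField(max_length=80)
    publisher = models.CharField(max_length=100)
    # DecimalField requires max_digits; 10 is an assumed value, not from the original schema.
    price = models.DecimalField(max_digits=10, decimal_places=2, blank=True, null=True)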
I have the following data:
publisher             title_id  price  title
--------------------  --------  -----  -----------------------------------
New Age Books         PS2106        7  Life Without Fear
New Age Books         PS2091    10.95  Is Anger the Enemy?
New Age Books         BU2075     2.99  You Can Combat Computer Stress!
New Age Books         TC7777    14.99  Sushi, Anyone?
Binnet & Hardley      MC3021     2.99  The Gourmet Microwave
Binnet & Hardley      MC2222    19.99  Silicon Valley Gastronomic Treats
Algodata Infosystems  PC1035    22.95  But Is It User Friendly?
Algodata Infosystems  BU1032    19.99  The Busy Executive's Database Guide
Algodata Infosystems  PC8888       20  Secrets of Silicon Valley
Here is what I want to do: introduce an annotated field dbl_price which is twice the price, then group the resulting queryset by publisher, and for each publisher, compute the total of all dbl_price values for all titles published by that publisher.
The SQL query that does this is as follows:
SELECT SUM(dbl_price) AS total_dbl_price, publisher
FROM (
    SELECT price * 2 AS dbl_price, publisher
    FROM title
) AS A
GROUP BY publisher
The desired output would be:
publisher             tot_dbl_prices
--------------------  --------------
Algodata Infosystems          125.88
Binnet & Hardley               45.96
New Age Books                  71.86
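To check the SQL and the expected totals without a Django project, here is a minimal sketch that loads the sample rows into an in-memory SQLite database and runs the same nested query. The use of SQLite, the REAL column type for price, and the helper name total_dbl_price_by_publisher are assumptions made for illustration; they are not part of the original setup.

```python
import sqlite3

# Sample rows from the question: (publisher, title_id, price, title).
ROWS = [
    ("New Age Books", "PS2106", 7, "Life Without Fear"),
    ("New Age Books", "PS2091", 10.95, "Is Anger the Enemy?"),
    ("New Age Books", "BU2075", 2.99, "You Can Combat Computer Stress!"),
    ("New Age Books", "TC7777", 14.99, "Sushi, Anyone?"),
    ("Binnet & Hardley", "MC3021", 2.99, "The Gourmet Microwave"),
    ("Binnet & Hardley", "MC2222", 19.99, "Silicon Valley Gastronomic Treats"),
    ("Algodata Infosystems", "PC1035", 22.95, "But Is It User Friendly?"),
    ("Algodata Infosystems", "BU1032", 19.99, "The Busy Executive's Database Guide"),
    ("Algodata Infosystems", "PC8888", 20, "Secrets of Silicon Valley"),
]


def total_dbl_price_by_publisher(rows):
    """Run the nested query from the question against an in-memory table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE title ("
        "publisher TEXT, title_id TEXT PRIMARY KEY, price REAL, title TEXT)"
    )
    conn.executemany("INSERT INTO title VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        """
        SELECT publisher, SUM(dbl_price) AS total_dbl_price
        FROM (
            SELECT price * 2 AS dbl_price, publisher
            FROM title
        ) AS A
        GROUP BY publisher
        ORDER BY publisher
        """
    ).fetchall()
    conn.close()
    return result


totals = total_dbl_price_by_publisher(ROWS)
for publisher, total in totals:
    # Prices are stored as floats here, so format to two decimals for display.
    print(f"{publisher}: {total:.2f}")
```

The ORDER BY is added only to make the output order deterministic; the question's query does not have it.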
Django query
The query would look like:
from django.db.models import F, Sum
(Title.objects
    .annotate(dbl_price=2*F('price'))
    .values('publisher')
    .annotate(tot_dbl_prices=Sum('dbl_price')))
but gives an error:
KeyError: 'dbl_price'.
which indicates that it can't find the field dbl_price in the queryset.
The reason for the error
Here is why this error happens: the documentation says
You should also note that average_rating has been explicitly included
in the list of values to be returned. This is required because of the ordering of the values() and annotate() clause.
If the values() clause precedes the annotate() clause, any annotations
will be automatically added to the result set. However, if the
values() clause is applied after the annotate() clause, you need to explicitly include the aggregate column.
So, the dbl_price could not be found in aggregation, because it was created by a prior annotate, but wasn't included in values().
However, I can't include it in values either, because I want to use values (followed by another annotate) as a grouping device, since
If the values() clause precedes the annotate(), the annotation will be computed using the grouping described by the values() clause.
which is the basis of how Django implements SQL GROUP BY. This means that I can't include dbl_price inside values(), because then the grouping will be based on unique combinations of both fields publisher and dbl_price, whereas I need to group by publisher only.
So, the following query, which only differs from the above in that I aggregate over model's price field rather than annotated dbl_price field, actually works:
(Title.objects
    .annotate(dbl_price=2*F('price'))
    .values('publisher')
    .annotate(sum_of_prices=Sum('price')))
because the price field is in the model rather than being an annotated field, and so we don't need to include it in values to keep it in the queryset.
The question
So, here we have it: I need to include the annotated field in values() to keep it in the queryset, but I can't do that because values() is also used for grouping (which will be wrong with an extra field). The problem is essentially due to the two very different ways that values() is used in Django, depending on the context (whether or not values() is followed by annotate()): (1) value extraction (a plain SQL SELECT list) and (2) grouping plus aggregation over the groups (SQL GROUP BY). In this case these two uses seem to conflict.
My question is: is there any way to solve this problem (without things like falling back to raw sql)?
Please note: the specific example in question can be solved by moving all annotate statements after values, which was noted by several answers. However, I am more interested in solutions (or discussion) which would keep the annotate statement(s) before values(), for three reasons:
1. There are also more complex examples, where the suggested workaround would not work.
2. I can imagine situations where the annotated queryset has been passed to another function, which actually does GROUP BY, so that the only thing we know is the set of names of annotated fields, and their types.
3. The situation seems to be pretty straightforward, and it would surprise me if this clash of two distinct uses of values() has not been noticed and discussed before.

Update: Since Django 2.1, everything works out of the box. No workarounds needed and the produced query is correct.
This is maybe a bit too late, but I have found the solution (tested with Django 1.11.1).
The problem is that the call to .values('publisher'), which is required for grouping, removes all annotations that are not included in the fields passed to .values().
And we can't add dbl_price to those fields, because that would add another column to the GROUP BY clause.
The solution is to first compute every aggregation that depends on annotated fields, then call .values() and include those aggregations in its fields (this won't add to the GROUP BY, because they are aggregations).
Then we should call .annotate() with ANY expression; this makes Django add a GROUP BY clause to the SQL query using the only non-aggregate field in the query, publisher.
(Title.objects
    .annotate(dbl_price=2*F('price'))
    .annotate(sum_of_prices=Sum('dbl_price'))
    .values('publisher', 'sum_of_prices')
    # 'pk' is used because the Title model's primary key is title_id, so there is no 'id' field
    .annotate(titles_count=Count('pk')))
The only drawback of this approach: if you don't need any aggregation other than the one over the annotated field, you have to include one anyway. Without the last call to .annotate() (and it must include at least one expression!), Django will not add GROUP BY to the SQL query. One way to deal with this is to create a copy of your field:
(Title.objects
    .annotate(dbl_price=2*F('price'))
    .annotate(_sum_of_prices=Sum('dbl_price'))  # note the underscore!
    .values('publisher', '_sum_of_prices')
    .annotate(sum_of_prices=F('_sum_of_prices')))
Also, note that you should be careful with QuerySet ordering. It is best to call .order_by() either without parameters, to clear the ordering, or with your GROUP BY field. If the resulting query contains ordering by any other field, the grouping will be wrong.
https://docs.djangoproject.com/en/1.11/topics/db/aggregation/#interaction-with-default-ordering-or-order-by
Also, you might want to remove that fake annotation from your output, so call .values() again.
So, final code looks like:
(Title.objects
    .annotate(dbl_price=2*F('price'))
    .annotate(_sum_of_prices=Sum('dbl_price'))
    .values('publisher', '_sum_of_prices')
    .annotate(sum_of_prices=F('_sum_of_prices'))
    .values('publisher', 'sum_of_prices')
    .order_by('publisher'))

This is expected from the way group_by works in Django. All annotated fields are added in GROUP BY clause. However, I am unable to comment on why it was written this way.
You can get your query to work like this:
(Title.objects
    .values('publisher')
    .annotate(total_dbl_price=Sum(2*F('price'))))
which produces following SQL:
SELECT publisher, SUM((2 * price)) AS total_dbl_price
FROM title
GROUP BY publisher
which just happens to work in your case.
I understand this might not be the complete solution you were looking for, but even some complex annotations can be accommodated in this approach by using combined expressions (I hope!).

Your problem comes from values() followed by annotate(). The order is important.
This is explained in the documentation under [order of annotate and values clauses](https://docs.djangoproject.com/en/1.10/topics/db/aggregation/#order-of-annotate-and-values-clauses).
.values('pub_id') limits the queryset's fields to pub_id, so you can't annotate on income:
The values() method takes optional positional arguments, *fields,
which specify field names to which the SELECT should be limited.

This solution by @alexandr addresses it properly.
https://stackoverflow.com/a/44915227/6323666
What you require is this:
from django.db.models import Sum
Title.objects.values('publisher').annotate(tot_dbl_prices=2*Sum('price'))
Note that I reversed the order here: this sums the prices first and then doubles the total, whereas you were trying to double each price and then sum. The result is the same. Hope this is fine.
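The two orders give the same totals because multiplying by a constant distributes over a sum. A small sketch in plain Python, using Decimal to avoid float rounding, illustrates this on the sample data. The function names sum_then_double and double_then_sum are hypothetical and exist only for this illustration.

```python
from collections import defaultdict
from decimal import Decimal

# (publisher, price) pairs taken from the question's sample data.
PRICES = [
    ("New Age Books", "7"),
    ("New Age Books", "10.95"),
    ("New Age Books", "2.99"),
    ("New Age Books", "14.99"),
    ("Binnet & Hardley", "2.99"),
    ("Binnet & Hardley", "19.99"),
    ("Algodata Infosystems", "22.95"),
    ("Algodata Infosystems", "19.99"),
    ("Algodata Infosystems", "20"),
]


def sum_then_double(rows):
    """Mirror 2 * Sum('price'): total per publisher, then double the total."""
    totals = defaultdict(Decimal)
    for publisher, price in rows:
        totals[publisher] += Decimal(price)
    return {publisher: 2 * total for publisher, total in totals.items()}


def double_then_sum(rows):
    """Mirror Sum(2 * F('price')): double each price, then total per publisher."""
    totals = defaultdict(Decimal)
    for publisher, price in rows:
        totals[publisher] += 2 * Decimal(price)
    return dict(totals)


a = sum_then_double(PRICES)
b = double_then_sum(PRICES)
print(a == b)
```

This equivalence holds for a constant factor; for an annotation that is not linear in the price, the two orders would generally differ.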

Related

Multiple subtotals - Rollup order of fields

I am trying to run a query that aggregates data, groups the results by several different fields, and extracts all relevant "SubTotal" permutations (similar to CUBE() in MSSQL).
When using GROUP BY ROLLUP(), I get only the permutations that follow the order of the GROUP BY fields in the ROLLUP function.
For example, the query below (which runs on a public dataset) returns subtotals by year, by year and month, and by year, month and medallion, but it doesn't subtotal by medallion.
SELECT
trip_year,
trip_month,
medallion,
SUM(trip_count) AS Sum_trip_count
FROM
[nyc-tlc:yellow.Trips_ByMonth_ByMedallion]
WHERE
medallion IN ("2R76", "8J82", "3B85", "4L79", "5D59", "6H75", "7P60", "8V48", "1H12", "2C69", "2F38", "5Y86", "5j90", "8A75", "8V41", "9J24", "9J55", "1E13", "1J82")
GROUP BY
ROLLUP(trip_year,
trip_month,
medallion)
My question is:
What should I do in order to get all the different permutations of "Sub Totals" in a single query result?
Already tried: a UNION with a similar query but with a different field order. It works, but it is not elegant (it would require too many unions).
Thanks
You are correct on both counts. In BigQuery, ROLLUP respects the hierarchy treating the listed fields as a strictly ordered list. Their order will not be changed during aggregation.
The CUBE aggregate commonly found in other SQL environments is unordered and in fact aggregates every possible order/subset of its listed fields. At this time, CUBE has not been implemented in BigQuery. The workaround you suggest is also what I would suggest. UNION all result sets from ROLLUP using each permutation of its contained fields. Albeit not ideal, you should get the same results.
In short, UNIONs of several queries with different permutations of ROLLUP fields is the only way to achieve this at the moment. The downsides are as you state that this may be difficult to maintain and can be more expensive in queries.
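To make the maintenance burden smaller, the UNION query can be generated rather than written by hand. The sketch below is a variation on the workaround above, not the exact ROLLUP-based form: it emits one plain GROUP BY per subset of the grouping fields and joins them with UNION ALL, which yields the same set of subtotals that CUBE would. It runs against SQLite for the sake of a self-contained example; the table name trips, the toy rows, and the helper build_cube_sql are all hypothetical, and the generated SQL would need adapting to BigQuery's dialect.

```python
import sqlite3
from itertools import combinations


def build_cube_sql(table, dims, measure):
    """Return one UNION ALL query with a GROUP BY per subset of dims.

    Identifiers are interpolated directly, so they must come from trusted code.
    """
    parts = []
    for size in range(len(dims), -1, -1):
        for subset in combinations(dims, size):
            # Columns outside the current subset are reported as NULL.
            cols = [d if d in subset else f"NULL AS {d}" for d in dims]
            sql = f"SELECT {', '.join(cols)}, SUM({measure}) AS total FROM {table}"
            if subset:
                sql += " GROUP BY " + ", ".join(subset)
            parts.append(sql)
    return "\nUNION ALL\n".join(parts)


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE trips ("
    "trip_year INTEGER, trip_month INTEGER, medallion TEXT, trip_count INTEGER)"
)
conn.executemany(
    "INSERT INTO trips VALUES (?, ?, ?, ?)",
    [
        (2014, 1, "2R76", 10),
        (2014, 1, "8J82", 20),
        (2014, 2, "2R76", 30),
        (2015, 1, "8J82", 40),
    ],
)

query = build_cube_sql("trips", ["trip_year", "trip_month", "medallion"], "trip_count")
rows = conn.execute(query).fetchall()
conn.close()

# Subtotals by medallion alone, which ROLLUP in the question's field order does not produce.
by_medallion = [r for r in rows if r[0] is None and r[1] is None and r[2] is not None]
print(sorted(by_medallion, key=lambda r: r[2]))
```

With n grouping fields this produces 2^n sub-queries, so the cost concern mentioned above still applies.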
If you would like to see CUBE implemented in BigQuery, I strongly encourage you to file a feature request on the BigQuery public issue tracker. Be sure to include a thorough use case in this request.
UPDATE: To support the feature request filed by the OP, please star it and you'll receive notifications with updates.

How to simulate ActiveRecord Model.count.to_sql

I want to display the SQL used in a count. However, Model.count.to_sql will not work because count returns a Fixnum, which doesn't have a to_sql method. I think the simplest solution is to do this:
Model.where(nil).to_sql.sub(/SELECT.*FROM/, "SELECT COUNT(*) FROM")
This creates the same SQL as is used in Model.count, but is it going to cause a problem further down the line? For example, if I add a complicated where clause and some joins.
Is there a better way of doing this?
You can try
Model.select("count(*) as model_count").to_sql
You may want to dip into Arel:
Model.select(Arel.star.count).to_sql
ASIDE:
I find I often want to find sub counts, so I embed the count(*) into another query:
child_counts = ChildModel.select(Arel.star.count)
                         .where(Model.arel_attribute(:id).eq(
                                ChildModel.arel_attribute(:model_id)))
Model.select(Arel.star).select(child_counts.as("child_count"))
     .order(:id).limit(10).to_sql
which then gives you all the child counts for each of the models:
SELECT *,
(
SELECT COUNT(*)
FROM "child_models"
WHERE "models"."id" = "child_models"."model_id"
) child_count
FROM "models"
ORDER BY "models"."id" ASC
LIMIT 10
Best of luck
UPDATE:
Not sure if you are trying to solve this in a generic way or not. Also not sure what kind of scopes you are using on your Model.
We do have a method that automatically calls a count for a query that is put into the ui layer. I found using count(:all) is more stable than the simple count, but sounds like that does not overlap your use case. Maybe you can improve your solution using the except clause that we use:
scope.except(:select, :includes, :references, :offset, :limit, :order)
.count(:all)
The where clause and the joins necessary for the where clause work just fine for us. We tend to want to keep the joins and where clause since they need to be part of the count. You definitely want to remove the includes (which should be removed by Rails automatically, in my opinion), but the references (much trickier, especially when one references a has_many and requires a distinct) start to throw a wrench in there. If you need to use references, you may be able to convert these over to a left_join.
You may want to double check the parameters that these "join" methods take. Some of them take table names and others take relation names. Later Rails versions have gotten better and take relation names, so be sure you are looking at the docs for the right version of Rails.
Also, in our case, we spend more time trying to get sub selects with more complicated relationships, we have to do some munging. Looks like we are not dealing with where clauses as much.

MongoDB custom field query

I am not sure whether this is a duplicate question (I don't think so), but it is a very interesting question for me:
In SQL we can create a custom field and put it in the result:
SELECT p.*, totalOrder=(SELECT sum(price) from orders where id=p.id)
FROM products p;
so the result is a list of products with a totalOrder value.
What is the best approach in NoSQL (MongoDB)?
I am sure we should have two types of documents (products and orders), and I know we don't have joins, but the question is: do we have custom field assignment in find queries?
When you use aggregation, you have the $project operation which is exactly that. It is used to rename fields or derive field values through some simple operators. But as usual with MongoDB, you can not get any data from another collection.
When you need to do something which is too complex to express with aggregation, you can use MapReduce and build your output-documents with Javascript. But again, no breaking out of the collection.

Django distinct on a specific field

class A:
    name = Char...
class B:
    base = ForeignKey(A)
    value = Integer..
B.objects.values('a__name','value').distinct('a__name')
As you can see above, I am trying to get the B objects grouped by their related object's name. However, the distinct function doesn't take a parameter.
I have tried by annotation and aggregation but I couldn't group by a__name
I have also tried values_list with flat=True but it only takes one column name but I need both a__name and value fields.
How can I do that in Django?
Thanks
First, you need Django 1.4+. If you're running a lesser version, you're out of luck. Then, you must be using PostgreSQL. Passing a parameter to distinct does not work with other databases.
See the documentation for distinct and pay attention to the "Note" lines.
You could always issue a raw query, I suppose, as well, if you don't meet the above conditions.

Can scalar functions be applied before filtering when executing a SQL Statement?

I suppose I have always naively assumed that scalar functions in the select part of a SQL query will only get applied to the rows that meet all the criteria of the where clause.
Today I was debugging some code from a vendor and had that assumption challenged. The only reason I can think of for this code failing is that the Substring() function is getting called on data that should have been filtered out by the WHERE clause. It appears that the substring call is being applied before the filtering happens, so the query fails.
Here is an example of what I mean. Let's say we have two tables, each with 2 columns and having 2 rows and 1 row respectively. The first column in each is just an id. NAME is just a string, and NAME_LENGTH tells us how many characters in the name with the same ID. Note that only names with more than one character have a corresponding row in the LONG_NAMES table.
NAMES: ID, NAME
1, "Peter"
2, "X"
LONG_NAMES: ID, NAME_LENGTH
1, 5
If I want a query to print each name with the last 3 letters cut off, I might first try something like this (assuming SQL Server syntax for now):
SELECT substring(NAME,1,len(NAME)-3)
FROM NAMES;
I would soon find out that this gives me an error, because when it reaches "X" it will try to use a negative number for the length in the substring call, and it will fail.
The way my vendor decided to solve this was by filtering out rows where the strings were too short for the len - 3 query to work. He did it by joining to another table:
SELECT substring(NAMES.NAME,1,len(NAMES.NAME)-3)
FROM NAMES
INNER JOIN LONG_NAMES
ON NAMES.ID = LONG_NAMES.ID;
At first glance, this query looks like it might work. The join condition will eliminate any rows that have NAME fields short enough for the substring call to fail.
However, from what I can observe, SQL Server will sometimes try to calculate the substring expression for everything in the table, and then apply the join to filter out rows. Is this supposed to happen this way? Is there a documented order of operations where I can find out when certain things will happen? Is it specific to a particular database engine or part of the SQL standard? If I decided to include some predicate on my NAMES table to filter out short names (like len(NAME) > 3), could SQL Server also choose to apply that after trying to apply the substring? If so, then it seems the only safe way to do a substring would be to wrap it in a "case when" construct in the select?
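For reference, the "case when" guard mentioned above could take roughly the following shape. This is only a sketch run against SQLite from Python so that it is self-contained: SQLite's substr() and length() stand in for SQL Server's substring() and len(), and SQLite does not raise an error on a negative length, so the example shows the shape of the guard rather than reproducing the failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO names VALUES (?, ?)", [(1, "Peter"), (2, "X")])

# The length check sits in the same expression as the substring call,
# so the call is only reached for names that are long enough.
rows = conn.execute(
    """
    SELECT id,
           CASE WHEN length(name) > 3
                THEN substr(name, 1, length(name) - 3)
           END AS trimmed
    FROM names
    ORDER BY id
    """
).fetchall()
conn.close()

print(rows)  # → [(1, 'Pe'), (2, None)]
```

Short names come back as NULL instead of being dropped, so an outer filter on trimmed IS NOT NULL would be needed to match the join-based result.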
Martin gave this link that pretty much explains what is going on: the query optimizer has free rein to reorder things however it likes. I am including this as an answer so I can accept something. Martin, if you create an answer with your link in it, I will gladly accept that instead of this one.
I do want to leave my question here because I think it is a tricky one to search for, and my particular phrasing of the issue may be easier for someone else to find in the future.
TSQL divide by zero encountered despite no columns containing 0
EDIT: As more responses have come in, I am again confused. It does not seem clear yet when exactly the optimizer is allowed to evaluate things in the select clause. I guess I'll have to go find the SQL standard myself and see if I can make sense of it.
Joe Celko, who helped write early SQL standards, has posted something similar to this several times in various USENET newsgroups. (I'm skipping over the clauses that don't apply to your SELECT statement.) He usually said something like "This is how statements are supposed to act like they work". In other words, SQL implementations should behave exactly as if they did these steps, without actually being required to do each of these steps.
1. Build a working table from all of the table constructors in the FROM clause.
2. Remove from the working table those rows that do not satisfy the WHERE clause.
3. Construct the expressions in the SELECT clause against the working table.
So, following this, no SQL dbms should act like it evaluates functions in the SELECT clause before it acts like it applies the WHERE clause.
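The "as if" order described above can be mimicked in a few lines of plain Python on the question's two tables. This is only a model of the logical steps, not of what any engine actually executes. The substring helper is a hypothetical stand-in that raises on a negative length, and the WHERE predicate is an assumed addition, since the vendor query relies on the join alone.

```python
def substring(text, start, length):
    """Stand-in for a SQL substring that rejects a negative length."""
    if length < 0:
        raise ValueError("invalid length passed to substring")
    return text[start - 1 : start - 1 + length]


names = [{"id": 1, "name": "Peter"}, {"id": 2, "name": "X"}]
long_names = [{"id": 1, "name_length": 5}]

# Step 1: FROM. Build the working table (here, the inner join on id).
working = [
    {**n, **ln} for n in names for ln in long_names if n["id"] == ln["id"]
]

# Step 2: WHERE. Drop rows that fail the predicate (assumed: name_length > 3).
filtered = [row for row in working if row["name_length"] > 3]

# Step 3: SELECT. Evaluate expressions only on the rows that are left.
result = [substring(row["name"], 1, len(row["name"]) - 3) for row in filtered]

print(result)  # → ['Pe']
```

Followed in this order, substring is never called on "X". The failure in the question corresponds to an engine evaluating step 3 before rows have been removed.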
In a recent posting, Joe expands the steps to include CTEs.
CJ Date and Hugh Darwen say essentially the same thing in chapter 11 ("Table Expressions") of their book A Guide to the SQL Standard. They also note that this chapter corresponds to the "Query Specification" section (sections?) in the SQL standards.
You are thinking about something called the query execution plan. It's based on query optimization rules, indexes, temporary buffers and execution time statistics. If you are using SQL Server Management Studio, there is a toolbar above the query editor where you can look at the estimated execution plan; it shows how your query will be rewritten to gain some speed. So if you have just used your NAMES table and it is in the buffer, the engine might first try to process that data and then join it with the other table.