How to simulate ActiveRecord Model.count.to_sql - sql

I want to display the SQL used in a count. However, Model.count.to_sql will not work because count returns a FixNum that doesn't have a to_sql method. I think the simplest solution is to do this:
Model.where(nil).to_sql.sub(/SELECT.*FROM/, "SELECT COUNT(*) FROM")
This creates the same SQL as is used in Model.count, but is it going to cause a problem further down the line? For example, if I add a complicated where clause and some joins.
Is there a better way of doing this?

You can try
Model.select("count(*) as model_count").to_sql

You may want to dip into Arel:
Model.select(Arel.star.count).to_sql
ASIDE:
I find I often want to find sub counts, so I embed the count(*) into another query:
child_counts = ChildModel.select(Arel.star.count)
.where(Model.arel_attribute(:id).eq(
ChildModel.arel_attribute(:model_id)))
Model.select(Arel.star).select(child_counts.as("child_count"))
.order(:id).limit(10).to_sql
which then gives you all the child counts for each of the models:
SELECT *,
(
SELECT COUNT(*)
FROM "child_models"
WHERE "models"."id" = "child_models"."model_id"
) child_count
FROM "models"
ORDER BY "models"."id" ASC
LIMIT 10
Best of luck
UPDATE:
Not sure if you are trying to solve this in a generic way or not. Also not sure what kind of scopes you are using on your Model.
We do have a method that automatically calls a count for a query that is put into the ui layer. I found using count(:all) is more stable than the simple count, but sounds like that does not overlap your use case. Maybe you can improve your solution using the except clause that we use:
scope.except(:select, :includes, :references, :offset, :limit, :order)
.count(:all)
The where clause and the joins necessary for the where clause work just fine for us. We tend to want to keep the joins and where clause since that needs to be part of the count. While you definitely want to remove the includes (which should be removed by rails automatically in my opinion), but the references (much trickier especially in the case where it references a has_many and requires a distinct) that starts to throw a wrench in there. If you need to use references, you may be able to convert these over to a left_join.
You may want to double check the parameters that these "join" methods take. Some of them take table names and others take relation names. Later rails version have gotten better and take relation names - be sure you are looking at the docs for the right version of rails.
Also, in our case, we spend more time trying to get sub selects with more complicated relationships, we have to do some munging. Looks like we are not dealing with where clauses as much.
ref2

Related

Aggregation of an annotation in GROUP BY in Django

UPDATE
Thanks to the posted answer, I found a much simpler way to formulate the problem. The original question can be seen in the revision history.
The problem
I am trying to translate an SQL query into Django, but am getting an error that I don't understand.
Here is the Django model I have:
class Title(models.Model):
title_id = models.CharField(primary_key=True, max_length=12)
title = models.CharField(max_length=80)
publisher = models.CharField(max_length=100)
price = models.DecimalField(decimal_places=2, blank=True, null=True)
I have the following data:
publisher title_id price title
--------------------------- ---------- ------- -----------------------------------
New Age Books PS2106 7 Life Without Fear
New Age Books PS2091 10.95 Is Anger the Enemy?
New Age Books BU2075 2.99 You Can Combat Computer Stress!
New Age Books TC7777 14.99 Sushi, Anyone?
Binnet & Hardley MC3021 2.99 The Gourmet Microwave
Binnet & Hardley MC2222 19.99 Silicon Valley Gastronomic Treats
Algodata Infosystems PC1035 22.95 But Is It User Friendly?
Algodata Infosystems BU1032 19.99 The Busy Executive's Database Guide
Algodata Infosystems PC8888 20 Secrets of Silicon Valley
Here is what I want to do: introduce an annotated field dbl_price which is twice the price, then group the resulting queryset by publisher, and for each publisher, compute the total of all dbl_price values for all titles published by that publisher.
The SQL query that does this is as follows:
SELECT SUM(dbl_price) AS total_dbl_price, publisher
FROM (
SELECT price * 2 AS dbl_price, publisher
FROM title
) AS A
GROUP BY publisher
The desired output would be:
publisher tot_dbl_prices
--------------------------- --------------
Algodata Infosystems 125.88
Binnet & Hardley 45.96
New Age Books 71.86
Django query
The query would look like:
Title.objects
.annotate(dbl_price=2*F('price'))
.values('publisher')
.annotate(tot_dbl_prices=Sum('dbl_price'))
but gives an error:
KeyError: 'dbl_price'.
which indicates that it can't find the field dbl_price in the queryset.
The reason for the error
Here is why this error happens: the documentation says
You should also note that average_rating has been explicitly included
in the list of values to be returned. This is required because of the ordering of the values() and annotate() clause.
If the values() clause precedes the annotate() clause, any annotations
will be automatically added to the result set. However, if the
values() clause is applied after the annotate() clause, you need to explicitly include the aggregate column.
So, the dbl_price could not be found in aggregation, because it was created by a prior annotate, but wasn't included in values().
However, I can't include it in values either, because I want to use values (followed by another annotate) as a grouping device, since
If the values() clause precedes the annotate(), the annotation will be computed using the grouping described by the values() clause.
which is the basis of how Django implements SQL GROUP BY. This means that I can't include dbl_price inside values(), because then the grouping will be based on unique combinations of both fields publisher and dbl_price, whereas I need to group by publisher only.
So, the following query, which only differs from the above in that I aggregate over model's price field rather than annotated dbl_price field, actually works:
Title.objects
.annotate(dbl_price=2*F('price'))
.values('publisher')
.annotate(sum_of_prices=Count('price'))
because the price field is in the model rather than being an annotated field, and so we don't need to include it in values to keep it in the queryset.
The question
So, here we have it: I need to include annotated property into values to keep it in the queryset, but I can't do that because values is also used for grouping (which will be wrong with an extra field). The problem essentially is due to the two very different ways that values is used in Django, depending on the context (whether or not values is followed by annotate) - which is (1) value extraction (SQL plain SELECT list) and (2) grouping + aggregation over the groups (SQL GROUP BY) - and in this case these two ways seem to conflict.
My question is: is there any way to solve this problem (without things like falling back to raw sql)?
Please note: the specific example in question can be solved by moving all annotate statements after values, which was noted by several answers. However, I am more interested in solutions (or discussion) which would keep the annotate statement(s) before values(), for three reasons: 1. There are also more complex examples, where the suggested workaround would not work. 2. I can imagine situations, where the annotated queryset has been passed to another function, which actually does GROUP BY, so that the only thing we know is the set of names of annotated fields, and their types. 3. The situation seems to be pretty straightforward, and it would surprise me if this clash of two distinct uses of values() has not been noticed and discussed before.
Update: Since Django 2.1, everything works out of the box. No workarounds needed and the produced query is correct.
This is maybe a bit too late, but I have found the solution (tested with Django 1.11.1).
The problem is, call to .values('publisher'), which is required to provide grouping, removes all annotations, that are not included in .values() fields param.
And we can't include dbl_price to fields param, because it will add another GROUP BY statement.
The solution in to make all aggregation, which requires annotated fields firstly, then call .values() and include that aggregations to fields param(this won't add GROUP BY, because they are aggregations).
Then we should call .annotate() with ANY expression - this will make django add GROUP BY statement to SQL query using the only non-aggregation field in query - publisher.
Title.objects
.annotate(dbl_price=2*F('price'))
.annotate(sum_of_prices=Sum('dbl_price'))
.values('publisher', 'sum_of_prices')
.annotate(titles_count=Count('id'))
The only minus with this approach - if you don't need any other aggregations except that one with annotated field - you would have to include some anyway. Without last call to .annotate() (and it should include at least one expression!), Django will not add GROUP BY to SQL query. One approach to deal with this is just to create a copy of your field:
Title.objects
.annotate(dbl_price=2*F('price'))
.annotate(_sum_of_prices=Sum('dbl_price')) # note the underscore!
.values('publisher', '_sum_of_prices')
.annotate(sum_of_prices=F('_sum_of_prices')
Also, mention, that you should be careful with QuerySet ordering. You'd better call .order_by() either without parameters to clear ordering or with you GROUP BY field. If the resulting query will contain ordering by any other field, the grouping will be wrong.
https://docs.djangoproject.com/en/1.11/topics/db/aggregation/#interaction-with-default-ordering-or-order-by
Also, you might want to remove that fake annotation from your output, so call .values() again.
So, final code looks like:
Title.objects
.annotate(dbl_price=2*F('price'))
.annotate(_sum_of_prices=Sum('dbl_price'))
.values('publisher', '_sum_of_prices')
.annotate(sum_of_prices=F('_sum_of_prices'))
.values('publisher', 'sum_of_prices')
.order_by('publisher')
This is expected from the way group_by works in Django. All annotated fields are added in GROUP BY clause. However, I am unable to comment on why it was written this way.
You can get your query to work like this:
Title.objects
.values('publisher')
.annotate(total_dbl_price=Sum(2*F('price'))
which produces following SQL:
SELECT publisher, SUM((2 * price)) AS total_dbl_price
FROM title
GROUP BY publisher
which just happens to work in your case.
I understand this might not be the complete solution you were looking for, but some even complex annotations can also be accommodated in this solution by using CombinedExpressions(I hope!).
Your problem comes from values() follow by annotate(). Order are important.
This is explain in documentation about [order of annotate and values clauses](
https://docs.djangoproject.com/en/1.10/topics/db/aggregation/#order-of-annotate-and-values-clauses)
.values('pub_id') limit the queryset field with pub_id. So you can't annotate on income
The values() method takes optional positional arguments, *fields,
which specify field names to which the SELECT should be limited.
This solution by #alexandr addresses it properly.
https://stackoverflow.com/a/44915227/6323666
What you require is this:
from django.db.models import Sum
Title.objects.values('publisher').annotate(tot_dbl_prices=2*Sum('price'))
Ideally I reversed the scenario here by summing them up first and then doubling it up. You were trying to double it up then sum up. Hope this is fine.

Query to Find Adjacent Date Records

There exists in my database a page_history table; the idea is that whenever a record in the page table is changed, that record's old values are stored in the history table.
My job now is to find occasions in which a record was changed, and retrieve the pre- and post-conditions of that change. Specifically, I want to know when a page changed groups, and what groups were involved in the change. The query I have below can find these instances, but with the use of the min function, I can only get back the values that match between the two records:
select page_id,
original_group,
min(created2) change_date
from (select h.page_id,
h.group_id original_group,
i.group_id new_group,
h.created_dttm created1,
i.created_dttm created2
from page_history h,
page_history i
where h.page_id = i.page_id
and h.created_dttm < i.created_dttm
and h.group_id != i.group_id)
group by page_id, original_group, created1
order by page_id
When I try to get, say, any details of the second record, like new_group, I'm hit with a ORA-00979: not a GROUP BY expression error. I don't want to group by new_group, though, because that's going to destroy the logic (I think it would find records displaying times a page changed from a group to another group, regardless of any changes to other groups in between).
My question, then, is how can I modify this query, or go about writing a new one, that achieves a similar end, but with the added availability of columns that do not match between the two records? In essence, how can I find that min record without sacrificing all the other columns I'm not trying to compare? I don't exactly need a complete answer, any suggestions that point me in the right direction would be appreciated.
I use PL/SQL Developer, and it looks like version 11.2.0.2.0 of Oracle.
EDIT: I have found a solution. It's not pretty, and I'd still like to see some alternatives, but if helping me out would threaten to explode your brain, I would advise relocating to an easier question.
Without seeing your table structure it's hard to re-write the query but when you have a min function used like that it invariably seems better to put it into a separate sub select to get what you want and then compare the result of that.

Active Record embed Table.where('x').count inside of select statement

I'm setting up an AR query that is basically meant to find an average of a few values that span three different tables. I'm getting hung up on how to embed the result of a particular Count query inside of the Active Record select statement.
Just by itself, this query returns "3":
Order.where(user_id: 319).count => 3
My question is, can I embed this into a select statement as a SQL alias similar to below:
Table.xxxxxx.select("Order.where(user_id: 319).count AS count,user_id, SUM(quantity*current_price) AS revenue").xxxxx
It seems to be throwing an error and generally not recognizing what I'm trying to do when I declare that first count alias. Any ideas on the syntax?
Well, after examining a bit, I cleared my mind into the ActiveRecord select() syntax.
It's a method that can take a variable length of parameters. So, your failing :
Table.xxxxxx.select("Order.where(user_id: 319).count AS count,user_id, SUM(quantity*current_price) AS revenue").xxxxx
After replacing proper SQL for your misplaced ActiveRecord statement, should be more of like this [be careful, you can't use as count in most cases, count is reserved]:
Table.xxxxx.select("(SELECT count(id) from orders where user_id=319) as usercount", "user_id","SUM(quantity*current_price) AS revenue").xxxx
But I guess you should need more a per-user_id-table.
So, I'd skip Models and go to direct SQL, always being careful to avoid injections:
ActiveRecord::Base.connection.execute('SELECT COUNT(orders.id) as usercount, users.id from users, orders where users.id=orders.user_id group by users.id')
This is simplified of course, you can apply the rest of the data (which I currently do not know) accordingly. The above simplified, not full solution, could be written also as:
Order.joins(:user).select("count(orders.id) as usercount, users.id").group(:user_id)

Rails doesn't respect my select fields using includes

I want write a query with active record and seems it never respect what I want to do. So, here are my example:
Phonogram.preload(:bpms).includes(:bpms).select("phonograms.id", "bpms.bpm")
This query returns all my fields from phonograms and bpms. The problem is that I need put more 15 relationships in this query.
I also tried use joins but didn't work properly. I've 10 phonograms and returns just 3.
Someone experienced that? How did you solve it properly?
Cheers.
select with includes does not produce consistent behavior. It appears that if the included association returns no results, select will work properly, if it returns results, the select statement will have no effect. In fact, it will be completely ignored, such that your select statement could reference invalid table names and no error would be produced. select with joins will produce consistent behavior.
That's why you better go with joins like:
Phonogram.joins(:bpms).select("phonograms.id", "bpms.bpm")

Beginner SQL section: avoiding repeated expression

I'm entirely new at SQL, but let's say that on the StackExchange Data Explorer, I just want to list the top 15 users by reputation, and I wrote something like this:
SELECT TOP 15
DisplayName, Id, Reputation, Reputation/1000 As RepInK
FROM
Users
WHERE
RepInK > 10
ORDER BY Reputation DESC
Currently this gives an Error: Invalid column name 'RepInK', which makes sense, I think, because RepInK is not a column in Users. I can easily fix this by saying WHERE Reputation/1000 > 10, essentially repeating the formula.
So the questions are:
Can I actually use the RepInK "column" in the WHERE clause?
Do I perhaps need to create a virtual table/view with this column, and then do a SELECT/WHERE query on it?
Can I name an expression, e.g. Reputation/1000, so I only have to repeat the names in a few places instead of the formula?
What do you call this? A substitution macro? A function? A stored procedure?
Is there an SQL quicksheet, glossary of terms, language specification, anything I can use to quickly pick up the syntax and semantics of the language?
I understand that there are different "flavors"?
Can I actually use the RepInK "column" in the WHERE clause?
No, but you can rest assured that your database will evaluate (Reputation / 1000) once, even if you use it both in the SELECT fields and within the WHERE clause.
Do I perhaps need to create a virtual table/view with this column, and then do a SELECT/WHERE query on it?
Yes, a view is one option to simplify complex queries.
Can I name an expression, e.g. Reputation/1000, so I only have to repeat the names in a few places instead of the formula?
You could create a user defined function which you can call something like convertToK, which would receive the rep value as an argument and returns that argument divided by 1000. However it is often not practical for a trivial case like the one in your example.
Is there an SQL quicksheet, glossary of terms, language specification, anything I can use to quickly pick up the syntax and semantics of the language?
I suggest practice. You may want to start following the mysql tag on Stack Overflow, where many beginner questions are asked every day. Download MySQL, and when you think there's a question within your reach, try to go for the solution. I think this will help you pick up speed, as well as awareness of the languages features. There's no need to post the answer at first, because there are some pretty fast guns on the topic over here, but with some practice I'm sure you'll be able to bring home some points :)
I understand that there are different "flavors"?
The flavors are actually extensions to ANSI SQL. Database vendors usually augment the SQL language with extensions such as Transact-SQL and PL/SQL.
You could simply re-write the WHERE clause
where reputation > 10000
This won't always be convenient. As an alternativly, you can use an inline view:
SELECT
a.DisplayName, a.Id, a.Reputation, a.RepInK
FROM
(
SELECT TOP 15
DisplayName, Id, Reputation, Reputation/1000 As RepInK
FROM
Users
ORDER BY Reputation DESC
) a
WHERE
a.RepInK > 10
Regarding something like named expressions, while there are several possible alternatives, the query optimizer is going to do best just writing out the formula Reputation / 1000 long-hand. If you really need to run a whole group of queries using the same evaluated value, your best bet is to create view with the field defined, but you wouldn't want to do that for a one-off query.
As an alternative, (and in cases where performance is not much of an issue), you could try something like:
SELECT TOP 15
DisplayName, Id, Reputation, RepInk
FROM (
SELECT DisplayName, Id, Reputation, Reputation / 1000 as RepInk
FROM Users
) AS table
WHERE table.RepInk > 10
ORDER BY Reputation DESC
though I don't believe that's supported by all SQL dialects and, again, the optimizer is likely to do a much worse job which this kind of thing (since it will run the SELECT against the full Users table and then filter that result). Still, for some situations this sort of query is appropriate (there's a name for this... I'm drawing a blank at the moment).
Personally, when I started out with SQL, I found the W3 schools reference to be my constant stopping-off point. It fits my style for being something I can glance at to find a quick answer and move on. Eventually, however, to really take advantage of the database it is necessary to delve into the vendors documentation.
Although SQL is "standarized", unfortunately (though, to some extent, fortunately), each database vendor implements their own version with their own extensions, which can lead to quite different syntax being the most appropriate (for a discussion of the incompatibilities of various databases on one issue see the SQLite documentation on NULL handling. In particular, standard functions, e.g., for handling DATEs and TIMEs tend to differ per vendor, and there are other, more drastic differences (particularly in not support subselects or properly handling JOINs). If you care for some of the details, this document provides both the standard forms and deviations for several major databases.
You CAN refer to RepInK in the Order By clause, but in the Where clause you must repeat the expression. But, as others have said, it will only be executed once.
There are good answers for the technical problem already, so I'll only address some of the rest of your questions.
If you're just working with the DataExplorer, you'll want to familiarize yourself with SQL Server syntax since that's what it's running. The best place to find that, of course, is MSDN's reference.
Yes, there are different variations in SQL syntax. For example, the TOP clause in the query you gave is SQL Server specific; in MySQL you'd use the LIMIT clause instead (and these keywords don't necessarily appear in the same spot in the query!).