Django - finding the extreme member of each group - sql

I've been playing around with the new aggregation functionality in the Django ORM, and there's a class of problem I think should be possible, but I can't seem to get it to work. The type of query I'm trying to generate is described here.
So, let's say I have the following models -
class ContactGroup(models.Model):
.... whatever ....
class Contact(models.Model):
group = models.ForeignKey(ContactGroup)
name = models.CharField(max_length=20)
email = models.EmailField()
...
class Record(models.Model):
contact = models.ForeignKey(Contact)
group = models.ForeignKey(ContactGroup)
record_date = models.DateTimeField(default=datetime.datetime.now)
... name, email, and other fields that are in Contact ...
So, each time a Contact is created or modified, a new Record is created that saves the information as it appears in the contact at that time, along with a timestamp. Now, I want a query that, for example, returns the most recent Record instance for every Contact associated to a ContactGroup. In pseudo-code:
group = ContactGroup.objects.get(...)
records_i_want = group.record_set.most_recent_record_for_every_contact()
Once I get this figured out, I just want to be able to throw a filter(record_date__lt=some_date) on the queryset, and get the information as it existed at some_date.
Anybody have any ideas?
edit: It seems I'm not really making myself clear. Using models like these, I want a way to do the following with pure django ORM (no extra()):
ContactGroup.record_set.extra(where=["history_date = (select max(history_date) from app_record r where r.id=app_record.id and r.history_date <= '2009-07-18')"])
Putting the subquery in the where clause is only one strategy for solving this problem, the others are pretty well covered by the first link I gave above. I know where-clause subselects are not possible without using extra(), but I thought perhaps one of the other ways was made possible by the new aggregation features.

It sounds like you want to keep records of changes to objects in Django.
Pro Django has a section in chapter 11 (Enhancing Applications) in which the author shows how to create a model that uses another model as a client that it tracks for inserts/deletes/updates.The model is generated dynamically from the client definition and relies on signals. The code shows most_recent() function but you could adapt this to obtain the object state on a particular date.
I assume it is the tracking in Django that is problematic, not the SQL to obtain this, right?

First of all, I'll point out that:
ContactGroup.record_set.extra(where=["history_date = (select max(history_date) from app_record r where r.id=app_record.id and r.history_date <= '2009-07-18')"])
will not get you the same effect as:
records_i_want = group.record_set.most_recent_record_for_every_contact()
The first query returns every record associated with a particular group (or associated with any of the contacts of a particular group) that has a record_date less than the date/ time specified in the extra. Run this on the shell and then do this to review the query django created:
from django.db import connection
connection.queries[-1]
which reveals:
'SELECT "contacts_record"."id", "contacts_record"."contact_id", "contacts_record"."group_id", "contacts_record"."record_date", "contacts_record"."name", "contacts_record"."email" FROM "contacts_record" WHERE "contacts_record"."group_id" = 1 AND record_date = (select max(record_date) from contacts_record r where r.id=contacts_record.id and r.record_date <= \'2009-07-18\')
Not exactly what you want, right?
Now the aggregation feature is used to retrieve aggregated data and not objects associated with aggregated data. So if you're trying to minimize number of queries executed using aggregation when trying to obtain group.record_set.most_recent_record_for_every_contact() you won't succeed.
Without using aggregation, you can get the most recent record for all contacts associated with a group using:
[x.record_set.all().order_by('-record_date')[0] for x in group.contact_set.all()]
Using aggregation, the closest I could get to that was:
group.record_set.values('contact').annotate(latest_date=Max('record_date'))
The latter returns a list of dictionaries like:
[{'contact': 1, 'latest_date': somedate }, {'contact': 2, 'latest_date': somedate }]
So one entry for for each contact in a given group and the latest record date associated with it.
Anyway, the minimum query number is probably 1 + # of contacts in a group. If you are interested obtaining the result using a single query, that is also possible, but you'll have to construct your models in a different way. But that's a totally different aspect of your problem.
I hope this will help you understand how to approach the problem using aggregation/ the regular ORM functions.

Related

Complex SQL Query in Rails 4

I have a complicated query I need for a scope in my Rails app and have tried a lot of things with no luck. I've resorted to raw SQL via find_by_sql but wondering if any gurus wanted to take a shot. I will simplify the verbiage a bit for clarity, but the problem should be stated accurately.
I have Users. Users own many Records. One of them is marked current (#is_current = true) and the rest are not. Each CoiRecord has many Relationships. Relationships have a value for when they were active (active_when) which takes four values, [1..4].
Values 1 and 2 are considered recent. Values 3 and 4 are not.
The problem was ultimately to have a scopes (has_recent_relationships and has_no_recent_relationships) on User that filters on whether or not they have recent Relationships on current Record. (Old Records are irrelevant for this.) I tried create a recent and not_recent scope on Relationship, and then building the scopes on Record, combining with checking for is_current == 1. Here is where I failed. I have to move on with the app but have no choice but to use raw SQL and continue the app, hoping to revisit this later. I put that on User, the only context I really need it, and set aside the code for the scopes on the other objects.
The SQL that works, that correctly finds the Users who have recent relationships is below. The other just uses "= 0" instead "> 0" in the HAVING clause.
SELECT * FROM users WHERE `users`.`id` IN (
SELECT
records.owner_id
FROM `coi_records`
LEFT OUTER JOIN `relationships` ON `relationships`.`record_id` = `records`.`id`
WHERE `records`.`is_current` = 1
HAVING (
SELECT count(*)
FROM relationships
WHERE ((record_id = records.id) AND ((active_when = 1) OR (active_when = 2)))
) > 0
)
My instincts tell me this is complicated enough that my modeling probably could be redesigned and simplified, but the individual objects are pretty simple, just getting at this specific data from two objects away has become complicated.
Anyway, I'd appreciate any thoughts. I'm not expecting a full solution because, ick. Just thought the masochists among you might find this amusing.
Have you tried using Arel directly and this website?
Just copy-and-pasting your query you get this:
User.select(Arel.star).where(
User.arel_table[:id].in(
Relationship.select(Arel.star.count).where(
Arel::Nodes::Group.new(
Relationship.arel_table[:record_id].eq(Record.arel_table[:id]).and(
Relationship.arel_table[:active_when].eq(1).or(Relationship.arel_table[:active_when].eq(2))
)
)
).joins(
CoiRecord.arel_table.join(Relationship.arel_table, Arel::Nodes::OuterJoin).on(
Relationship.arel_table[:record_id].eq(Record.arel_table[:id])
).join_sources
).ast
)
)
I managed to find a way to create what I needed which returns ActiveRelationship objects, which simplifies a lot of other code. Here's what I came up with. This might not scale well, but this app will probably not end up with so much data that it will be a problem.
I created two scope methods. The second depends on the first to simplify things:
def self.has_recent_relationships
joins(records_owned: :relationships)
.merge(Record.current)
.where("(active_when = 1) OR (active_when = 2)")
.distinct
end
def self.has_no_recent_relationships
users_with_recent_relationships = User.has_recent_relationships.pluck(:id)
if users_with_recent_relationships.length == 0
User.all
else
User.where("id not in (?)", users_with_recent_relationships.to_a)
end
end
The first finds Users with recent relationships by just joining Record, merging with a scope that selects current records (should be only one), and looks for the correct active_when values. Easy enough.
The second method finds Users who DO have recent relationships (using the first method.) If there are none, then all Users are in the set of those with no recent relationships, and I return User.all (this will really never happen in the wild, but in theory it could.) Otherwise I return the inverse of those who do have recent relationships, using the SQL keywords NOT IN and an array. It's this part that could be non-performant if the array gets large, but I'm going with it for the moment.

Rails .joins doesn't load the association

Helo,
My query:
#county = County.joins(:state)
.where("counties.slug = ? AND states.slug = ?", params[:county_slug])
.select('states.*, counties.*')
.first!
From the log, the SQL looks like this:
SELECT states.*, counties.* FROM "counties" INNER JOIN "states" ON "states"."id" = "counties"."state_id" LIMIT 1
My problem is that is doesn't eager load the data from the associated table (states), because when I do, for example, #county.state.name, it runs another query, although, as you can see from the log, it had already queried the database for the data in that table as well. But it doesn't pre populate #county.state
Any idea how i can get all the data from the database in just ONE query?
Thx
I think you need to use include instead of joins to get the eager loading. There's a good railscasts episode about the differences: http://railscasts.com/episodes/181-include-vs-joins , in particular:
The question we need to ask is “are we using any of the related model’s attributes?” In our case the answer is “yes” as we’re showing the user’s name against each comment. This means that we want to get the users at the same time as we retrieve the comments and so we should be using include here.

Django distinct on a specific field

class A:
name = Char...
class B:
base = ForeignKey(A)
value = Integer..
B.objects.values('a__name','value').distinct('a__name')
As you understand above, I try to get the B objects grouping by its related object's name. However, distinct function doesn't take parameter.
I have tried by annotation and aggregation but I couldn't group by a__name
I have also tried values_list with flat=True but it only takes one column name but I need both a__name and value fields.
How can I do that in Django?
Thanks
First, you need Django 1.4+. If you're running a lesser version, you're out of luck. Then, you must be using PostgreSQL. Passing a parameter to distinct does not work with other databases.
See the documentation for distinct and pay attention to the "Note" lines.
You could always issue a raw query, I suppose, as well, if you don't meet the above conditions.

Counting occurence of each distinct element in a table

I am writing a log viewer app in ASP.NET / C#. There is a report window, where it will be possible to check some information about the whole database. One kind of information there I want to display on the screen is the number of times each generator (an entity in my domain, not Firebirds sequence) appears in the table. How do I do that using COUNT ?
Do I have to :
Gather the key for each different generator
Run one query for each generator key using count
Display it somehow
Is there any way that I can do it without having to do two queries to the database? The database size can be HUGE, and having to query it "X" times where "X" is the number of generators would just suck.
I am using a Firebird database, is there any way to fetch this information from any metadata schema or there is no such thing available?
Basically, what I want is to count each occurrence of each generator in the table. Result would be something like : GENERATOR A:10 times,GENERATOR B:7 Times,Generator C:0 Times and so on.
If I understand your question correctly, it is a simple matter of using the GROUP BY clause, e.g.:
select
key,
count(*)
from generators
group by key;
Something like the query below should be sufficient (depending on your exact structure and requirements)
SELECT KEY, COUNT(*)
FROM YOUR_TABLE
GROUP BY KEY
I solved my problem using this simple Query:
SELECT GENERATOR_,count(*)
FROM EVENTSGENERAL GROUP BY GENERATOR_;
Thanks for those who helped me.
It took me 8 hours to come back and post the answer,because of the StackOverflow limitation to answer my own questions based in my reputation.

Django query for large number of relationships

I have Django models setup in the following manner:
model A has a one-to-many relationship to model B
each record in A has between 3,000 to 15,000 records in B
What is the best way to construct a query that will retrieve the newest (greatest pk) record in B that corresponds to a record in A for each record in A? Is this something that I must use SQL for in lieu of the Django ORM?
Create a helper function for safely extracting the 'top' item from any queryset. I use this all over the place in my own Django apps.
def top_or_none(queryset):
"""Safely pulls off the top element in a queryset"""
# Extracts a single element collection w/ top item
result = queryset[0:1]
# Return that element or None if there weren't any matches
return result[0] if result else None
This uses a bit of a trick w/ the slice operator to add a limit clause onto your SQL.
Now use this function anywhere you need to get the 'top' item of a query set. In this case, you want to get the top B item for a given A where the B's are sorted by descending pk, as such:
latest = top_or_none(B.objects.filter(a=my_a).order_by('-pk'))
There's also the recently added 'Max' function in Django Aggregation which could help you get the max pk, but I don't like that solution in this case since it adds complexity.
P.S. I don't really like relying on the 'pk' field for this type of query as some RDBMSs don't guarantee that sequential pks is the same as logical creation order. If I have a table that I know I will need to query in this fashion, I usually have my own 'creation' datetime column that I can use to order by instead of pk.
Edit based on comment:
If you'd rather use queryset[0], you can modify the 'top_or_none' function thusly:
def top_or_none(queryset):
"""Safely pulls off the top element in a queryset"""
try:
return queryset[0]
except IndexError:
return None
I didn't propose this initially because I was under the impression that queryset[0] would pull back the entire result set, then take the 0th item. Apparently Django adds a 'LIMIT 1' in this scenario too, so it's a safe alternative to my slicing version.
Edit 2
Of course you can also take advantage of Django's related manager construct here and build the queryset through your 'A' object, depending on your preference:
latest = top_or_none(my_a.b_set.order_by('-pk'))
I don't think Django ORM can do this (but I've been pleasantly surprised before...). If there's a reasonable number of A record (or if you're paging), I'd just add a method to A model that would return this 'newest' B record. If you want to get a lot of A records, each with it's own newest B, I'd drop to SQL.
remeber that no matter which route you take, you'll need a suitable composite index on B table, maybe adding an order_by=('a_fk','-id') to the Meta subclass