Django query for large number of relationships - sql

I have Django models setup in the following manner:
model A has a one-to-many relationship to model B
each record in A has between 3,000 and 15,000 records in B
What is the best way to construct a query that, for each record in A, retrieves the newest (greatest pk) record in B that corresponds to that record in A? Is this something I must use raw SQL for, in lieu of the Django ORM?

Create a helper function for safely extracting the 'top' item from any queryset. I use this all over the place in my own Django apps.
def top_or_none(queryset):
    """Safely pulls off the top element in a queryset"""
    # Extract a single-element collection with the top item
    result = queryset[0:1]
    # Return that element, or None if there weren't any matches
    return result[0] if result else None
This uses a bit of a trick w/ the slice operator to add a limit clause onto your SQL.
Now use this function anywhere you need to get the 'top' item of a query set. In this case, you want to get the top B item for a given A where the B's are sorted by descending pk, as such:
latest = top_or_none(B.objects.filter(a=my_a).order_by('-pk'))
There's also the recently added 'Max' function in Django Aggregation which could help you get the max pk, but I don't like that solution in this case since it adds complexity.
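For completeness, a rough sketch of that aggregation route (assuming B's ForeignKey to A is named 'a', so the reverse lookup from A is 'b'; names here are illustrative):

from django.db.models import Max

# One query annotating each A with the greatest pk among its B records.
# This yields the max pk itself, not the B instance.
for a in A.objects.annotate(newest_b_pk=Max('b__pk')):
    print(a.pk, a.newest_b_pk)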
P.S. I don't really like relying on the 'pk' field for this type of query, as some RDBMSs don't guarantee that sequential pks match logical creation order. If I have a table that I know I will need to query in this fashion, I usually add my own 'creation' datetime column that I can order by instead of pk.
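For example, a minimal sketch of that pattern (field names are illustrative):

class B(models.Model):
    a = models.ForeignKey(A)
    # Explicit creation timestamp, so ordering doesn't depend on pk sequence
    created = models.DateTimeField(auto_now_add=True)

latest = top_or_none(B.objects.filter(a=my_a).order_by('-created'))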
Edit based on comment:
If you'd rather use queryset[0], you can modify the 'top_or_none' function thusly:
def top_or_none(queryset):
    """Safely pulls off the top element in a queryset"""
    try:
        return queryset[0]
    except IndexError:
        return None
I didn't propose this initially because I was under the impression that queryset[0] would pull back the entire result set, then take the 0th item. Apparently Django adds a 'LIMIT 1' in this scenario too, so it's a safe alternative to my slicing version.
Edit 2
Of course you can also take advantage of Django's related manager construct here and build the queryset through your 'A' object, depending on your preference:
latest = top_or_none(my_a.b_set.order_by('-pk'))

I don't think the Django ORM can do this (but I've been pleasantly surprised before...). If there's a reasonable number of A records (or if you're paging), I'd just add a method to the A model that returns this 'newest' B record. If you want to get a lot of A records, each with its own newest B, I'd drop to SQL.
Remember that no matter which route you take, you'll need a suitable composite index on the B table, maybe adding ordering = ('a_fk', '-id') to the Meta subclass (the Meta option is named ordering, not order_by).
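A minimal sketch of the method approach, under those assumptions (model and field names are illustrative):

class A(models.Model):
    def newest_b(self):
        """Return this A's newest related B, or None if it has none."""
        qs = self.b_set.order_by('-pk')[:1]
        return qs[0] if qs else None

With a composite index on (a_fk, id) at the database level, the ORDER BY ... LIMIT 1 query this generates can typically be satisfied straight from the index.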

Related

Django - SQL bulk get_or_create possible?

I am using get_or_create to insert objects into the database, but the problem is that doing 1000 at once takes too long.
I tried bulk_create, but it doesn't provide the functionality I need (it creates duplicates, ignores unique values, and doesn't trigger the post_save signals I need).
Is it even possible to do get_or_create in bulk via a customized SQL query?
Here is my example code:
related_data = json.loads(urllib2.urlopen(final_url).read())
for item in related_data:
    kw = item['keyword']
    e, c = KW.objects.get_or_create(KWuser=kw, author=author)
    e.project.add(id)  # add m2m link to the parent project
related_data contains 1000 rows looking like this:
[{"cmp":0,"ams":3350000,"cpc":0.71,"keyword":"apple."},
{"cmp":0.01,"ams":3350000,"cpc":1.54,"keyword":"apple -10810"}......]
The KW model also sends a signal I use to create another parent model:
@receiver(post_save, sender=KW)
def grepw(sender, **kwargs):
    if kwargs.get('created', False):
        id = kwargs['instance'].id
        kww = kwargs['instance'].KWuser
        # KeyO
        a, b = KeyO.objects.get_or_create(defaults={'keyword': kww}, keyword__iexact=kww)
        KW.objects.filter(id=id).update(KWF=a.id)
This works, but as you can imagine doing thousands of rows at once takes a long time and even crashes my tiny server. What bulk options do I have?
As of Django 2.2, bulk_create has an ignore_conflicts flag. Per the docs:
On databases that support it (all but Oracle), setting the ignore_conflicts parameter to True tells the database to ignore failure to insert any rows that fail constraints such as duplicate unique values.
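A rough sketch of the insert step using that flag, with field names taken from the question (note that bulk_create still does not send post_save signals, so the KeyO-creating receiver above would need to be handled separately):

objs = [KW(KWuser=item['keyword'], author=author) for item in related_data]
# Rows that violate a unique constraint are silently skipped.
KW.objects.bulk_create(objs, ignore_conflicts=True)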
This post may be of use to you:
stackoverflow.com/questions/3395236/aggregating-saves-in-django
Note that the answer recommends using the commit_on_success decorator, which is deprecated. It is replaced by the transaction.atomic decorator. Documentation is here:
transactions
from django.db import transaction

@transaction.atomic
def lot_of_saves(queryset):
    for item in queryset:
        modify_item(item)
        item.save()
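Applied to the loop from the question, that would look roughly like this (import_keywords and project_id are illustrative names; project_id stands in for the ambiguous id in the original code):

from django.db import transaction

@transaction.atomic
def import_keywords(related_data, author, project_id):
    # One transaction for the whole batch instead of one per save.
    for item in related_data:
        e, created = KW.objects.get_or_create(KWuser=item['keyword'], author=author)
        e.project.add(project_id)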
If I understand correctly, "get_or_create" means SELECT or INSERT on the Postgres side.
You have a table with a UNIQUE constraint or index and a large number of rows to either INSERT (if not yet there), returning the newly created ID, or else SELECT the ID of the existing row. Not as simple as it may seem from the outside. With concurrent write load, the matter is even more complicated.
And there are various parameters that need to be defined (how to handle conflicts exactly):
How to use RETURNING with ON CONFLICT in PostgreSQL?
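For illustration only, here is one hedged way to do the batch "insert or get" from Django with a raw cursor on PostgreSQL 9.5+. The table and column names are made up, and the unique constraint is assumed to be on (kwuser, author_id):

from django.db import connection

def bulk_get_or_create_ids(keywords, author_id):
    """Insert missing keywords, then fetch all ids in one more query."""
    with connection.cursor() as cursor:
        # psycopg2 adapts the Python list to a Postgres array.
        cursor.execute(
            """
            INSERT INTO app_kw (kwuser, author_id)
            SELECT unnest(%s), %s
            ON CONFLICT (kwuser, author_id) DO NOTHING
            """,
            [keywords, author_id],
        )
        # RETURNING only reports newly inserted rows, so a second
        # SELECT picks up the pre-existing ids as well.
        cursor.execute(
            "SELECT kwuser, id FROM app_kw WHERE kwuser = ANY(%s) AND author_id = %s",
            [keywords, author_id],
        )
        return dict(cursor.fetchall())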

pig - transform data from rows to columns while inserting placeholders for non-existent fields in specific rows

Suppose I have the following flat file on HDFS (let's call this key_value):
1,1,Name,Jack
1,1,Title,Junior Accountant
1,1,Department,Finance
1,1,Supervisor,John
2,1,Title,Vice President
2,1,Name,Ron
2,1,Department,Billing
Here is the output I'm looking for:
(1,1,Department,Finance,Name,Jack,Supervisor,John,Title,Junior Accountant)
(2,1,Department,Billing,Name,Ron,,,Title,Vice President)
In other words, the first two columns form a unique identifier (similar to a composite key in db terminology), and for a given value of this identifier we want one row in the output (i.e., the last two columns, which are effectively key-value pairs, are condensed onto the same row as long as the identifier is the same). Also notice the nulls in the second row: they are placeholders for the Supervisor piece that is missing when the unique identifier is (2, 1).
Towards this end, I started putting together this pig script:
data = LOAD 'key_value' USING PigStorage(',') as (i1:int, i2:int, key:chararray, value:chararray);
data_group = GROUP data by (i1, i2);
expected = FOREACH data_group {
    sorted = ORDER data BY key, value;
    GENERATE FLATTEN(BagToTuple(sorted));
};
dump expected;
The above script gives me the following output:
(1,1,Department,Finance,1,1,Name,Jack,1,1,Supervisor,John,1,1,Title,Junior Accountant)
(2,1,Department,Billing,2,1,Name,Ron,2,1,Title,Vice President)
Notice that the null placeholders for the missing Supervisor are not present in the second record (which is expected). If I can get those nulls into place, then it seems just a matter of another projection to get rid of the redundant columns (the first two, which are replicated multiple times, once per key-value pair).
Short of using a UDF, is there a way to accomplish this in pig using the built-in functions?
UPDATE: As WinnieNicklaus correctly pointed out, the names in the output are redundant. So the output can be condensed to:
(1,1,Finance,Jack,John,Junior Accountant)
(2,1,Billing,Ron,,Vice President)
First of all, let me point out that if, for most rows, most of the columns are not filled out, a better solution IMO would be to use a map. The builtin TOMAP UDF combined with a custom UDF to combine maps would enable you to do this.
I am sure there is a way to solve your original question by computing a list of all possible keys, exploding it out with null values, and then throwing away the instances where a non-null value also exists... but this would involve a lot of MR cycles, really ugly code, and I suspect it is no better than organizing your data in some other way.
You could also write a UDF that takes in a bag of key/value pairs and another bag of all possible keys, and generates the tuple you're looking for. That would be clearer and simpler.

What's the reasoning behind result columns being excluded from auto-select statements in PetaPoco

If I have a POCO class with the ResultColumn attribute set, then when I do a Single<Entity>() call, my result column isn't mapped. I've set my column to be a result column because its value should always be generated by the SQL column's default constraint. I don't want this column to be injected or updated from the business layer. What I'm trying to say is that my column's type is a simple SQL data type and not a related entity type (as I've seen ResultColumn being used mostly on those).
Looking at code I can see this line in PetaPoco:
// Build column list for automatic select
QueryColumns = (from c in Columns
                where !c.Value.ResultColumn
                select c.Key).ToArray();
Why are result columns excluded from the automatic select statement? As I understand it, their nature is to be read-only, so they should be used in selects only. I can see the reasoning when a column is actually a related entity type (complex). OK, but then we should have a separate attribute like ComputedColumnAttribute that would always be returned in selects but never used in inserts or updates...
Why did PetaPoco team decide to omit result columns from selects then?
How am I supposed to read result columns then?
I can't answer why the creator did not add them to auto-selects, though I would assume it's because your particular use case is not the main one they were considering. If you look at the examples and explanation for that feature on their site, it's geared more towards extra columns you bring back in a join or calculation (like a description from a lookup table for a code value). In those situations, the columns could not be added to the select automatically because they are not part of the underlying table.
So if you want to use that attribute, and get a value for the property, you'll have to use your own manual select statement rather than relying on the auto-select.
Of course, the beauty of using PetaPoco is that you can easily modify it to suit your needs, by either creating a new attribute, like you suggest above, or modifying the code you showed to not exclude those fields from the select (assuming you are not using ResultColumn in other join-type situations).

Django: how to filter for rows whose fields are contained in passed value?

MyModel.objects.filter(field__icontains=value) returns all the rows whose field contains value. How to do the opposite? Namely, construct a queryset that returns all the rows whose field is contained in value?
Preferably without using custom SQL (i.e., only using the ORM), and without using backend-dependent SQL.
field__icontains and similar lookups are coded right into the ORM. The reverse version simply doesn't exist.
You could use the where param of extra(), described in the QuerySet reference.
In this case, you would use something like:
MyModel.objects.extra(where=["%s LIKE CONCAT('%%',field,'%%')"], params=[value])
Of course, do keep in mind that there is no standard method of concatenation across DBMSs. So as far as I know, there is no way to satisfy your requirement of avoiding backend-dependent SQL.
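For example, on PostgreSQL or SQLite you could lean on the standard || operator instead (this variant will not work on MySQL unless PIPES_AS_CONCAT is enabled):

MyModel.objects.extra(
    where=["%s LIKE '%%' || field || '%%'"],
    params=[value],
)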
If you're okay with working with a list of dictionaries rather than a queryset, you could always do this instead:
qs = MyModel.objects.all().values()
matches = [r for r in qs if value in r['field']]
although this is of course not ideal for huge data sets.

Django - finding the extreme member of each group

I've been playing around with the new aggregation functionality in the Django ORM, and there's a class of problem I think should be possible, but I can't seem to get it to work. The type of query I'm trying to generate is described here.
So, let's say I have the following models -
class ContactGroup(models.Model):
    pass  # ... whatever ...

class Contact(models.Model):
    group = models.ForeignKey(ContactGroup)
    name = models.CharField(max_length=20)
    email = models.EmailField()
    # ...

class Record(models.Model):
    contact = models.ForeignKey(Contact)
    group = models.ForeignKey(ContactGroup)
    record_date = models.DateTimeField(default=datetime.datetime.now)
    # ... name, email, and other fields that are in Contact ...
So, each time a Contact is created or modified, a new Record is created that saves the information as it appears in the contact at that time, along with a timestamp. Now, I want a query that, for example, returns the most recent Record instance for every Contact associated to a ContactGroup. In pseudo-code:
group = ContactGroup.objects.get(...)
records_i_want = group.record_set.most_recent_record_for_every_contact()
Once I get this figured out, I just want to be able to throw a filter(record_date__lt=some_date) on the queryset, and get the information as it existed at some_date.
Anybody have any ideas?
edit: It seems I'm not really making myself clear. Using models like these, I want a way to do the following with pure django ORM (no extra()):
ContactGroup.record_set.extra(where=["history_date = (select max(history_date) from app_record r where r.id=app_record.id and r.history_date <= '2009-07-18')"])
Putting the subquery in the where clause is only one strategy for solving this problem, the others are pretty well covered by the first link I gave above. I know where-clause subselects are not possible without using extra(), but I thought perhaps one of the other ways was made possible by the new aggregation features.
It sounds like you want to keep records of changes to objects in Django.
Pro Django has a section in chapter 11 (Enhancing Applications) in which the author shows how to create a model that uses another model as a client that it tracks for inserts/deletes/updates. The model is generated dynamically from the client definition and relies on signals. The code shows a most_recent() function, but you could adapt this to obtain the object state on a particular date.
I assume it is the tracking in Django that is problematic, not the SQL to obtain this, right?
First of all, I'll point out that:
ContactGroup.record_set.extra(where=["history_date = (select max(history_date) from app_record r where r.id=app_record.id and r.history_date <= '2009-07-18')"])
will not get you the same effect as:
records_i_want = group.record_set.most_recent_record_for_every_contact()
The first query returns every record associated with a particular group (or with any of the contacts of a particular group) that has a record_date less than the date/time specified in the extra. Run this in the shell and then do the following to review the query django created:
from django.db import connection
connection.queries[-1]
which reveals:
'SELECT "contacts_record"."id", "contacts_record"."contact_id", "contacts_record"."group_id", "contacts_record"."record_date", "contacts_record"."name", "contacts_record"."email" FROM "contacts_record" WHERE "contacts_record"."group_id" = 1 AND record_date = (select max(record_date) from contacts_record r where r.id=contacts_record.id and r.record_date <= \'2009-07-18\')
Not exactly what you want, right?
Now, the aggregation feature is used to retrieve aggregated data, not the objects associated with aggregated data. So if you're trying to minimize the number of queries executed by using aggregation to obtain group.record_set.most_recent_record_for_every_contact(), you won't succeed.
Without using aggregation, you can get the most recent record for all contacts associated with a group using:
[x.record_set.all().order_by('-record_date')[0] for x in group.contact_set.all()]
Using aggregation, the closest I could get to that was:
group.record_set.values('contact').annotate(latest_date=Max('record_date'))
The latter returns a list of dictionaries like:
[{'contact': 1, 'latest_date': somedate }, {'contact': 2, 'latest_date': somedate }]
So, one entry for each contact in a given group, together with the latest record date associated with it.
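If you need the Record instances themselves rather than just the dates, one hedged follow-up is to filter on those pairs, at the cost of one more query per contact (this assumes record_date is unique per contact; otherwise get() may raise MultipleObjectsReturned):

from django.db.models import Max

pairs = group.record_set.values('contact').annotate(latest_date=Max('record_date'))
latest_records = [
    Record.objects.get(contact=p['contact'], record_date=p['latest_date'])
    for p in pairs
]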
Anyway, the minimum query count is probably 1 + the number of contacts in a group. If you are interested in obtaining the result using a single query, that is also possible, but you'll have to construct your models in a different way. That's a totally different aspect of your problem, though.
I hope this will help you understand how to approach the problem using aggregation and the regular ORM functions.