Django: how to effectively use select_related() with Paginator?

I have two related models with 10 million rows each and want to perform an efficient paginated request for 50,000 items of one of them, accessing related data on the other:
class RnaPrecomputed(models.Model):
    id = models.CharField(max_length=22, primary_key=True)
    rna = models.ForeignKey('Rna', db_column='upi', to_field='upi', related_name='precomputed')
    description = models.CharField(max_length=250)

class Rna(models.Model):
    id = models.IntegerField(db_column='id')
    upi = models.CharField(max_length=13, db_index=True, primary_key=True)
    timestamp = models.DateField()
    userstamp = models.CharField(max_length=30)
As you can see, RnaPrecomputed is related to Rna via a foreign key. Now I want to fetch a specific page of 50,000 RnaPrecomputed items together with the Rna objects related to them. Without a select_related() call I expect the N+1 queries problem. Here are the timings:
First, for reference, I won't touch the related model at all:
rna_paginator = paginator.Paginator(RnaPrecomputed.objects.all(), 50000)
message = ""
for object in rna_paginator.page(400).object_list:
    message = message + str(object.id)
Takes:
real 0m12.614s
user 0m1.073s
sys 0m0.188s
Now, I'll try accessing data on related model:
rna_paginator = paginator.Paginator(RnaPrecomputed.objects.all(), 50000)
message = ""
for object in rna_paginator.page(400).object_list:
    message = message + str(object.rna.upi)
it takes:
real 2m27.655s
user 1m20.194s
sys 0m4.315s
Which is a lot, so I probably have the N+1 queries problem.
But now, if I use select_related(),
rna_paginator = paginator.Paginator(RnaPrecomputed.objects.all().select_related('rna'), 50000)
message = ""
for object in rna_paginator.page(400).object_list:
    message = message + str(object.rna.upi)
it takes even more:
real 7m9.720s
user 0m1.948s
sys 0m0.337s
So somehow select_related() made things three times slower instead of faster. And without it I presumably have N+1 queries, meaning that for each RnaPrecomputed entry the Django ORM has to issue an additional query to fetch the corresponding Rna?
What am I doing wrong, and how can I make select_related() perform well with a paginated queryset?

It's worth checking that you're not missing an index in your database. You have db_index=True for the Rna.upi field, but are you sure the index exists in the database?
If select_related is making the paginator's count() query slow, you could try applying select_related to the paginated object_list instead:
for object in rna_paginator.page(400).object_list.select_related('rna'):
    message = message + str(object.rna.upi)
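To see where the time goes, it helps to compare the two access patterns outside Django. Below is a minimal sketch using Python's built-in sqlite3 module (the table layout mirrors the models above, but the tables and data are made up): the N+1 pattern issues one extra query per row, while a single JOIN, which is essentially what select_related() generates, fetches everything at once.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE rna (upi TEXT PRIMARY KEY);
    CREATE TABLE rna_precomputed (
        id TEXT PRIMARY KEY,
        upi TEXT REFERENCES rna(upi)
    );
""")
con.executemany("INSERT INTO rna VALUES (?)",
                [(f"UPI{i}",) for i in range(1000)])
con.executemany("INSERT INTO rna_precomputed VALUES (?, ?)",
                [(f"P{i}", f"UPI{i}") for i in range(1000)])

# N+1 pattern: one extra query per row to fetch the related record.
rows = con.execute(
    "SELECT id, upi FROM rna_precomputed ORDER BY id").fetchall()
n_plus_one = [
    con.execute("SELECT upi FROM rna WHERE upi = ?", (upi,)).fetchone()[0]
    for _, upi in rows
]  # 1000 extra round-trips

# JOIN pattern: the same data in a single query.
joined = [upi for _, upi in con.execute("""
    SELECT p.id, r.upi FROM rna_precomputed p
    JOIN rna r ON r.upi = p.upi
    ORDER BY p.id
""")]

assert n_plus_one == joined
```

With a networked database the per-row round-trips of the first pattern are the likely cause of a 2m27s run; the JOIN variant pays the query cost once per page.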

Related

Slow Active Record query in Rails 5.1: selects all messages to get the most recent one and a count

I have a very slow query that ends up scanning all messages in order to show a user's most recent one.
def followed_members(limit = 6)
  page = helpers.param_helper(:page, 1)
  ids = @user.follows_by_type('User')
             .to_a
             .pluck(:followable_id)
  @top_members = User
                 .where(User.arel_table[:id].in(ids))
                 .order(ranking_points: :desc)
                 .includes([:messages])
                 .offset((page - 1) * limit)
                 .limit(limit)
                 .to_a
  @next_members = helpers.next_helper(@top_members, limit, page)
  if page > 1
    @next_page = @next_members
    @ajaxcontent = @top_members
  end
end
The :messages is only used to populate a user card to show how many messages a user has posted and the date of their most recent message.
user_cards.messages.length > 0 ? user_cards.messages[0].created_at.strftime("%m/%d/%Y") : ""
This is done for all the users followed by a member whenever they go to their following list. It is super inefficient, but I'm a Ruby noob and inherited this legacy project. How would you tackle this so I don't need to grab every message?

Sorting out data from SQL in Rails

So I'm trying to get some specific data out of my database, but I've been searching online and can't find how to do this (probably because I'm searching for the wrong terms).
I start by getting all the participants with a specific id, like this:
contributions = Participant.where(user_id: params[:id])
This will give me a JSON result like this:
0: {id_request: "1", user_id: "titivermeesch@gmail.com"}
1: {id_request: "2", user_id: "titivermeesch@gmail.com"}
So here I have all the requests (there is a Request class) that have that specific user_id.
Now I want to do this :
all = Request.where(id: id_request)
This obviously doesn't work, but how would I get all the requests whose ids come from the first database query?
So with the results above I should get Requests 1 and 2, but how? Can anyone guide me?
How about
contributions = Participant.where(user_id: params[:id])
# Assuming the above is an active record query and id_request is a property of Participant
all = Request.where(id: contributions.map(&:id_request))
This is the equivalent of the SQL
select * from requests where id in (array_of_request_ids)
If you have added the associations in your models, it's very easy to retrieve the records.
This should work:
Request.joins(:participants).where("participants.user_id = ?", params[:id])
Also you might want to read the following part (on joins)

Django Q Queries & on the same field?

So here are my models:
class Event(models.Model):
    user = models.ForeignKey(User, blank=True, null=True, db_index=True)
    name = models.CharField(max_length=200, db_index=True)
    platform = models.CharField(choices=(("ios", "ios"), ("android", "android")), max_length=50)

class User(AbstractUser):
    email = models.CharField(max_length=50, null=False, blank=False, unique=True)
Event is like an analytics event, so it's very possible that I could have multiple events for one user, some with platform=ios and some with platform=android, if a user has logged in on multiple devices. I want to query to see how many users have both ios and android devices. So I wrote a query like this:
User.objects.filter(Q(event__platform="ios") & Q(event__platform="android")).count()
Which returns 0 results. I know this isn't correct. I then thought I would try to just query for iOS users:
User.objects.filter(Q(event__platform="ios")).count()
Which returned 6,717,622 results, which is unexpected because I only have 39,294 users. I'm guessing it's not counting the Users, but counting the Event instances, which seems like incorrect behavior to me. Does anyone have any insights into this problem?
You can use annotations instead:
from django.db.models import Count
User.objects.annotate(platforms=Count('event__platform', distinct=True)).filter(platforms=2)
This keeps only the users whose events cover both platforms. (Note that a plain Count('event') would count every event row, so a user with just two iOS events would also match.)
You can also use chained filters:
User.objects.filter(event__platform='android').filter(event__platform='ios')
The first filter gets all users with an Android event, and the second narrows those down to the users that also have an iOS event.
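The zero-result query can be reproduced outside Django. Here is a self-contained sketch using Python's sqlite3 module (the data is made up): putting both conditions in one WHERE clause asks a single event row to have two platform values at once, while chained filters join the event table twice, one alias per condition.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE event (user_id INTEGER, platform TEXT)")
con.executemany("INSERT INTO event VALUES (?, ?)",
                [(1, "ios"), (1, "android"), (2, "ios")])

# Single filter with both conditions: one row would need two platform
# values at the same time, so nothing matches (the 0-result query).
single = con.execute(
    "SELECT DISTINCT user_id FROM event "
    "WHERE platform = 'ios' AND platform = 'android'").fetchall()
assert single == []

# Chained filters join the event table twice, one alias per condition.
chained = con.execute("""
    SELECT DISTINCT e1.user_id FROM event e1
    JOIN event e2 ON e2.user_id = e1.user_id
    WHERE e1.platform = 'ios' AND e2.platform = 'android'
""").fetchall()
assert chained == [(1,)]
```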
This is a general answer for any queryset with two or more conditions on related child objects.
Solution: A simple solution with two subqueries is possible, even without any join:
base_subq = Event.objects.values('user_id').order_by().distinct()
user_qs = User.objects.filter(
Q(pk__in=base_subq.filter(platform="android")) &
Q(pk__in=base_subq.filter(platform="ios"))
)
The method .order_by() is important if the model Event has a default ordering (see the notes in the docs on the distinct() method).
Notes:
Verify the single SQL query that will be executed (simplified by removing the "app_" prefix):
>>> print(str(user_qs.query))
SELECT user.id, user.email FROM user WHERE (
user.id IN (SELECT DISTINCT U0.user_id FROM event U0 WHERE U0.platform = 'android')
AND
user.id IN (SELECT DISTINCT U0.user_id FROM event U0 WHERE U0.platform = 'ios')
)
Q() objects are used because the same condition parameter (pk__in) cannot be repeated within one filter() call; chained filters could be used instead: .filter(...).filter(...). (The order of the filter conditions is not important; it is outweighed by the preferences estimated by the SQL server's query optimizer.)
The temporary variable base_subq is an "alias" queryset that exists only to avoid repeating the same subexpression; it is never evaluated on its own.
One join between User (parent) and Event (child) wouldn't be a problem, and a solution with one subquery is also possible, but a join of Event with Event (a join with a repeated child table, or with two child tables) should be avoided by using subqueries in any case. Two subqueries also read well, since they show the symmetry of the two filter conditions.
Another solution with two nested subqueries: this non-symmetric variant can be faster if we know that one subquery (the one we put innermost) has a much more restrictive filter than the other necessary subquery with a huge result set (for example, if the number of Android users were huge).
ios_user_ids = (Event.objects.filter(platform="ios")
.values('user_id').order_by().distinct())
user_ids = (Event.objects.filter(platform="android", user_id__in=ios_user_ids)
.values('user_id').order_by().distinct())
user_qs = User.objects.filter(pk__in=user_ids)
Verify how it is compiled to SQL (simplified again by removing the app_ prefix and the quotation marks):
>>> print(str(user_qs.query))
SELECT user.id, user.email FROM user
WHERE user.id IN (
SELECT DISTINCT V0.user_id FROM event V0
WHERE V0.platform = 'ios' AND V0.user_id IN (
SELECT DISTINCT U0.user_id FROM event U0
WHERE U0.platform = 'android'
)
)
(These solutions also work in older Django versions, e.g. 1.8. The dedicated Subquery() expression has existed since Django 1.11 for more complicated cases, but we don't need it for this simple question.)
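The generated SQL can be exercised directly. Here is a small self-contained check using Python's sqlite3 module (hypothetical data) showing that the two IN subqueries keep exactly the users that have events on both platforms:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE user (id INTEGER PRIMARY KEY, email TEXT);
    CREATE TABLE event (id INTEGER PRIMARY KEY, user_id INTEGER, platform TEXT);
""")
con.executemany("INSERT INTO user VALUES (?, ?)",
                [(1, "a@x.com"), (2, "b@x.com"), (3, "c@x.com")])
con.executemany("INSERT INTO event (user_id, platform) VALUES (?, ?)", [
    (1, "ios"), (1, "android"),   # user 1: both platforms
    (2, "ios"), (2, "ios"),       # user 2: two iOS events only
    (3, "android"),               # user 3: Android only
])

# The same shape as the query Django compiles from the two Q(pk__in=...) filters.
both = [row[0] for row in con.execute("""
    SELECT id FROM user
    WHERE id IN (SELECT DISTINCT user_id FROM event WHERE platform = 'android')
      AND id IN (SELECT DISTINCT user_id FROM event WHERE platform = 'ios')
""")]
assert both == [1]  # only user 1 has events on both platforms
```

Note that user 2, who has two events but only one platform, is correctly excluded.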

How can I improve performance when using Django querysets?

I'm trying to make a news feed. Each time the page is called, the server must send multiple items. One item contains a post, its number of likes, number of comments, number of comment children, the comments data, the comment children data, etc.
My problem is that each time my page is called, it takes more than 5 seconds to load. I've already implemented a caching system, but it's still slow.
posts = Posts.objects.filter(page="feed").order_by('-likes')[:10].cache()
posts = PostsSerializer(posts, many=True)
hasPosted = Posts.objects.filter(page="feed", author="me").cache()
hasPosted = PostsSerializer(hasPosted, many=True)
for post in posts.data:
    commentsNum = Comments.objects.filter(parent=post["id"]).cache(ops=['count'])
    post["comments"] = len(commentsNum)
    comments = Comments.objects.filter(parent=post["id"]).order_by('-likes')[:10].cache()
    liked = Likes.objects.filter(post_id=post["id"], author="me").cache()
    comments = CommentsSerializer(comments, many=True)
    commentsObj[post["id"]] = {}
    for comment in comments.data:
        children = CommentChildren.objects.filter(parent=comment["id"]).order_by('date')[:10].cache()
        numChildren = CommentChildren.objects.filter(parent=comment["id"]).cache(ops=['count'])
        post["comments"] = post["comments"] + len(numChildren)
        children = CommentChildrenSerializer(children, many=True)
        liked = Likes.objects.filter(post_id=comment["id"], author="me").cache()
        for child in children.data:
            if child["parent"] == comment["id"]:
                liked = Likes.objects.filter(post_id=child["id"], author="me").cache()
I'm trying to find a simple way to fetch all this data faster and without unnecessary database hits. I need to reduce the loading time from 5 seconds to less than 1 if possible.
Any suggestions?
Add the number of children as an integer field on the comment that gets updated every time a child comment is added or removed. That way you won't have to query for that value; you can do this using signals.
Add an ArrayField (if you're using Postgres) or something similar on your Profile model that stores the primary keys of liked posts. Instead of querying the Likes model, you would be able to do this:
profile = Profile.objects.get(name='me')
liked = comment_pk in profile.liked_posts
Use select_related for CommentChildren instead of making an extra query for it.
Implementing these three items will get rid of all the DB queries executed inside the "for comment in comments.data" loop, which is probably taking up the majority of the processing time.
If you're interested, check out django-debug-toolbar, which shows you exactly which queries are executed on every page.
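As an alternative to a denormalized counter, the per-comment count queries in the loop can be collapsed into one GROUP BY query. A minimal sketch with Python's sqlite3 module (hypothetical table name; with the Django ORM the equivalent would be a single .values('parent').annotate(Count('id')) queryset):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE comment_children (id INTEGER PRIMARY KEY, parent INTEGER)")
con.executemany("INSERT INTO comment_children (parent) VALUES (?)",
                [(1,), (1,), (1,), (2,), (3,), (3,)])

# One query covering every parent, instead of one COUNT query per comment.
counts = dict(con.execute(
    "SELECT parent, COUNT(*) FROM comment_children GROUP BY parent"))
assert counts == {1: 3, 2: 1, 3: 2}
```

The resulting dict can then be consulted in the loop without touching the database again.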

How to send table names as parameters to a function which performs a join on them?

Currently I use the following code for joining tables.
Booking.joins(:table1, :table2, :table3, :table4).other_queries
However, the number of tables to be joined depends on certain conditions, and the other_queries part also forms a very long chain. So I am duplicating a lot of code just because I need to perform the joins differently.
So, I want to implement something like this
def method(params)
  Booking.joins(params).other_queries
end
How can this be done?
Maybe just Booking.joins(*params).other_queries is what you need?
The * operator transforms an array into a list of arguments, for example:
arr = [1,2,3]
any_method(*arr) # is equal to any_method(1,2,3)
However, if params is something that comes from the user, I recommend that you don't trust it; it could be a security issue. But if you trust it or filter it, why not:
SAFE_JOINS = [:table1, :table2, :table3]

def method(params)
  booking = Booking.scoped # or Booking.all if you are on Rails 5
  (params[:joins] & SAFE_JOINS.map(&:to_s)).each do |j|
    booking = booking.joins(j.intern)
  end
  booking
end