rails ancestry pagination - ruby-on-rails-3

I've just followed the Railscast tutorial:
http://railscasts.com/episodes/262-trees-with-ancestry
Is it possible to paginate results from Ancestry which have been arranged?
e.g. given I have the following in my Message controller:
def index
  @messages = Message.arrange(:order => :name)
end
Then how would I paginate this as it's going to result in a hash?
Update
I found that if I use .keys then it will paginate, but only the top level not the children.
Message.scoped.arrange(:order => :name).keys
Update
Each message has a code and some content, and messages can be nested. Suppose I have:
code - name
1 - Test1
    1 - test1 sub1
    2 - test1 sub2
2 - Test2
    1 - test2 sub1
    2 - test2 sub2
    3 - test2 sub3
This is how I want to display the listing, but I also want to paginate this sorted tree.

It is possible but I've only managed to do it using two database trips.
The main issue stems from not being able to set limits on a node's children, which leads to either a node's children being truncated or children being orphaned on subsequent pages.
An example:
id: 105, Ancestry: Null
id: 117, Ancestry: 105
id: 118, Ancestry: 105/117
id: 119, Ancestry: 105/117/118
A LIMIT 0,3 (for the sake of the example above) would return the first three records, which would render everything except id: 119. The subsequent LIMIT 3,3 would return id: 119, which will not render correctly as its parents are not present.
One solution I've employed is using two queries:
The first returns root nodes only. These can be sorted and it is this query that is paginated.
A second query is issued, based on the first, which returns all children of the paginated parents. You should be able to sort children per level.
In my case, I have a Post model (which has_ancestry). Each post can have any level of replies. A post also has a replies_count column, which is a counter cache for its immediate children.
In the controller:
roots  = @topic.posts.roots_only.paginate :page => params[:page]
@posts = Post.fetch_children_for_roots(@topic, roots)
In the Post model:
named_scope :roots_only, :conditions => 'posts.ancestry is null'

def self.fetch_children_for_roots(postable, roots)
  unless roots.blank?
    condition = roots.select { |r| r.replies_count > 0 }.collect { |r| "(ancestry like '#{r.id}%')" }.join(' or ')
    unless condition.blank?
      children = postable.posts.scoped(:from => 'posts FORCE INDEX (index_posts_on_ancestry)', :conditions => condition).all
      roots.concat children
    end
  end
  roots
end
Some notes:
MySQL will stop using the ancestry column index if multiple LIKE statements are used. The FORCE INDEX forces MySQL to use the index and prevents a full table scan.
LIKE statements are only built for nodes with direct children, which is where the replies_count column comes in handy.
What the class method does is append the children to roots, which is a WillPaginate::Collection.
Finally, these can be managed in your view:
= will_paginate @posts
- Post.arrange_nodes(@posts).each do |post, replies|
  -# do stuff here
The key method here is arrange_nodes, which the ancestry plugin mixes into your model. It takes a sorted Array of nodes and returns a sorted, hierarchical Hash.
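If you need to walk that nested hash recursively, for example in a view helper, a minimal sketch could look like this (walk_arranged is a hypothetical name, not something ancestry provides):

# Recursively walks the hash returned by arrange_nodes, yielding each
# node together with its depth, depth-first.
def walk_arranged(nodes, depth = 0, &block)
  nodes.each do |node, children|
    block.call(node, depth)
    walk_arranged(children, depth + 1, &block)
  end
end

# e.g. walk_arranged(Post.arrange_nodes(@posts)) { |post, depth| ... }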
I appreciate that this method does not directly address your question but I hope that the same method, with tweaks, can be applied for your case.
There is probably a more elegant way of doing this but overall I'm happy with the solution (until a better one comes along).

Related

Improve query performance in Rails when creating a json

I'm working with a Rails 5 API. I have a simple model of a store, with:
order has_one checkout
checkout has_one transaction
checkout belongs_to order
transaction belongs_to checkout
checkout has_many items
      1      1           1       1
order -----> checkout ------> transaction
                |  1        *
                +--------> item
I want an endpoint that, given a range of transactions, returns JSON with data from those transactions.
I have code that works, but it takes a long time; for example, a month's worth of transactions takes about a minute.
def get_all_transactions
  transactions = Transaction.where.not(status: 'error')
  data = transactions.map do |transaction|
    checkout = transaction.checkout
    order = Order.find(checkout.order_id)
    checkout.items.map do |item|
      {
        checkout_id: checkout.id,
        order_id: checkout.order_id,
        item_id: item.id,
        client_name: checkout.client.full_name,
        order_created_at: order.created_at
      }
    end
  end
  data.flatten!
end
How can I improve this code to get better performance?
I have also noticed that removing, for example, checkout.client.full_name shaves off about 20 seconds.
With full_name being in the client model:
def full_name
  "#{first_name} #{last_name}".strip
end
Why would that take 20 seconds?
The problem here is that you have layers upon layers of N+1 queries. Every time you call an association that hasn't been eager loaded, you cause another round trip to the database. Even if you add includes or eager_load, the next issue is that you're loading tons of data from tables you're not using and creating model instances in memory just to read a single attribute off them.
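For comparison, an eager-loading version would look roughly like the sketch below (association names are assumed from the question); it removes the N+1 round trips but still builds full model objects for every row:

# Hypothetical eager-loading rewrite: one query per association
# instead of one per record, but whole models are still instantiated.
transactions = Transaction.where.not(status: 'error')
                          .includes(checkout: [:client, :order, :items])

data = transactions.flat_map do |transaction|
  checkout = transaction.checkout
  checkout.items.map do |item|
    {
      checkout_id: checkout.id,
      order_id: checkout.order_id,
      item_id: item.id,
      client_name: checkout.client.full_name,
      # going through the association (instead of Order.find) lets the
      # eager loading above do its job
      order_created_at: checkout.order.created_at
    }
  end
end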
The most efficient way to do this is most likely going to be to simply perform a join and just select the columns you're actually interested in:
sql = Item.joins(order: { checkout: :client })
          .select(
            Item.arel_table[:id].as('item_id'),
            Order.arel_table[:id].as('order_id'),
            "TRIM(CONCAT(clients.first_name, ' ', clients.last_name)) AS client_name",
            Order.arel_table[:created_at].as('order_created_at')
          )
result = Item.connection.select_all(sql.arel).map(&:to_h)
This avoids creating entire models instances in memory in multiple levels when all you need is a single column.
However, it's very unclear what the actual expected result is here, or why you're basing the query on the Transaction model when you're actually getting an array of items in the result.

Why are all my SQL queries being duplicated 4 times for Django using "Prefetch_related" for nested MPTT children?

I have a Child MPTT model that has a ForeignKey to itself:
class Child(MPTTModel):
    title = models.CharField(max_length=255)
    parent = TreeForeignKey(
        "self", on_delete=models.CASCADE, null=True, blank=True, related_name="children"
    )
I have a recursive Serializer as I want to show all levels of children for any given Child:
class ChildrenSerializer(serializers.HyperlinkedModelSerializer):
    url = HyperlinkedIdentityField(
        view_name="app:children-detail", lookup_field="pk"
    )

    class Meta:
        model = Child
        fields = ("url", "title", "children")

    def get_fields(self):
        fields = super(ChildrenSerializer, self).get_fields()
        fields["children"] = ChildrenSerializer(many=True)
        return fields
I am trying to reduce the number of duplicate/similar queries made when accessing a Child's DetailView.
The view below works for a depth of 2 - however, the "depth" is not always known or static.
class ChildrenDetailView(generics.RetrieveUpdateDestroyAPIView):
    queryset = Child.objects.prefetch_related(
        "children",
        "children__children",
        # A depth of 3 will additionally require "children__children__children",
        # a depth of 4 will additionally require "children__children__children__children",
        # etc.
    )
    serializer_class = ChildrenSerializer
    lookup_field = "pk"
Note: If I don't use prefetch_related and simply set the queryset to Child.objects.all(), every SQL query is duplicated four times... and I have no idea why.
How do I leverage a Child's depth (i.e. the Child's MPTT level field) to optimize prefetching? Should I be overwriting the view's get_object and/or retrieve?
Does it even matter if I add a ridiculous number of depths to the prefetch? E.g. children__children__children__children__children__children__children__children? It doesn't seem to increase the number of queries for Children objects that don't require that level of depth.
Edit:
Hm, not sure why but when I try to serialize any Child's top parent (i.e. MPTT's get_root), it duplicates the SQL query four times???
class Child(MPTTModel):
    ...

    @property
    def top_parent(self):
        return self.get_root()


class ChildrenSerializer(serializers.HyperlinkedModelSerializer):
    ...
    top_parent = ParentSerializer()

    fields = ("url", "title", "children", "top_parent")
Edit 2
Adding an arbitrary SerializerMethodField confirms it's being queried four times... for some reason? e.g.
class ChildrenSerializer(serializers.HyperlinkedModelSerializer):
    ...
    foo = serializers.SerializerMethodField()

    def get_foo(self, obj):
        print("bar")
        return obj.get_root().title
This will print "bar" four times. The SQL query is also repeated four times according to django-debug-toolbar:
SELECT ••• FROM "app_child" WHERE ("app_child"."parent_id" IS NULL AND "app_child"."tree_id" = '7') LIMIT 21
4 similar queries. Duplicated 4 times.
Are you using DRF's browsable API? It initializes the serializer 3 more times for HTML forms, in rest_framework.renderers.BrowsableAPIRenderer.get_context.
If you do the same request with, say, Postman, "bar" should get printed only once.

Get records with no related data using activerecord and RoR3?

I am making scopes for a model that looks something like this:
class PressRelease < ActiveRecord::Base
  has_many :publications
end
What I want to get is all press_releases that do not have publications, but from a scope method, so it can be chained with other scopes. Any ideas?
Thanks!
NOTE: I know that there are methods like present? or any? and so on, but these methods do not return an ActiveRecord::Relation the way a scope does.
NOTE: I am using RoR 3
Avoid eager_loading if you do not need it (it adds overhead). Also, there is no need for subselect statements.
scope :without_publications, -> { joins("LEFT OUTER JOIN publications ON publications.press_release_id = press_releases.id").where(publications: { id: nil }) }
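Because it is a scope, it chains like any other relation; for example (published_recently here is just a hypothetical second scope):

PressRelease.without_publications.published_recently.order(:created_at)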
Explanation and response to comments
My initial thought about eager-loading overhead was that ActiveRecord would instantiate all the child records (publications) for each press release. Then I realized that the query will never return press release records that have publications, so that is a moot point.
There are some points and observations to be made about the way ActiveRecord works. Some things I had previously learned from experience, and some things I learned exploring your question.
The query from includes(:publications).where(publications: {id: nil}) is actually different from my example. It will return all columns from the publications table in addition to the columns from press_releases. The publication columns are completely unnecessary because they will always be null. However, both queries ultimately result in the same set of PressRelease objects.
With the includes method, if you add any sort of limit, for example chaining .first, .last or .limit(), then ActiveRecord (4.2.4) will resort to executing two queries. The first query returns IDs, and the second query uses those IDs to get results. Using the SQL snippet method, ActiveRecord is able to use just one query. Here is an example of this from one of my applications:
Profile.includes(:positions).where(positions: { id: nil }).limit(5)
# SQL (0.8ms) SELECT DISTINCT "profiles"."id" FROM "profiles" LEFT OUTER JOIN "positions" ON "positions"."profile_id" = "profiles"."id" WHERE "positions"."id" IS NULL LIMIT 5
# SQL (0.8ms) SELECT "profiles"."id" AS t0_r0, ..., "positions"."end_year" AS t1_r11 FROM "profiles" LEFT OUTER JOIN "positions" ON "positions"."profile_id" = "profiles"."id" WHERE "positions"."id" IS NULL AND "profiles"."id" IN (107, 24, 7, 78, 89)
Profile.joins("LEFT OUTER JOIN positions ON positions.profile_id = profiles.id").where(positions: { id: nil }).limit(5)
# Profile Load (1.0ms) SELECT "profiles".* FROM "profiles" LEFT OUTER JOIN positions ON positions.profile_id = profiles.id WHERE "positions"."id" IS NULL LIMIT 5
Most importantly
eager loading and includes were not intended to solve the problem at hand, and for this particular case I think you are much more aware of what is needed than ActiveRecord is. You can therefore make better decisions about how to structure the query.
You can do the following in your PressRelease model:
scope :your_scope, -> { where('id NOT IN(select press_release_id from publications)') }
This will return all PressRelease records without publications.
A couple of ways to do this; the first one requires two DB queries:
PressRelease.where.not(id: Publication.uniq.pluck(:press_release_id))
or, if you don't want to hardcode the association's foreign key:
PressRelease.where.not(id: PressRelease.uniq.joins(:publications).pluck(:id))
Another one is to do a left join and pick those without associated elements - you get a relation object, but it will be tricky to work with it as it already has a join on it:
PressRelease.eager_load(:publications).where(publications: {id: nil})
Another one is to use the counter_cache feature. You will need to add a publications_count column to your press_releases table.
class Publication < ActiveRecord::Base
  belongs_to :press_release, counter_cache: true
end
Rails will keep this column in sync with the number of records associated with a given model, so you can simply do:
PressRelease.where(publications_count: [nil, 0])
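The column itself can be added with a small migration along these lines (a sketch; the column name follows Rails' counter-cache convention):

class AddPublicationsCountToPressReleases < ActiveRecord::Migration
  def self.up
    add_column :press_releases, :publications_count, :integer, :default => 0
    # Backfill the counter for existing rows.
    PressRelease.reset_column_information
    PressRelease.find_each do |pr|
      PressRelease.reset_counters(pr.id, :publications)
    end
  end

  def self.down
    remove_column :press_releases, :publications_count
  end
end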

Rails ignores columns from second table when using .select

For example:
r = Model.arel_table
s = SomeOtherModel.arel_table
Model.select(r[:id], s[:othercolumn].as('othercolumn')).
  joins(:someothermodel)
Will produce the SQL:
SELECT `model`.`id`, `someothermodel`.`othercolumn` AS othercolumn FROM `model` INNER JOIN `someothermodel` ON `model`.`id` = `someothermodel`.`model_id`
Which is correct. However, when the models are loaded, the attribute othercolumn is ignored because it is not an attribute of Model.
It's similar to eager loading and includes, but I don't want all the columns, only the ones specified, so includes is no good.
There must be an easy way of getting columns from other models? I'd prefer the items to be returned as instances of Model rather than as plain arrays/hashes.
When you do a select with joins or includes, you get back an ActiveRecord::Relation. This relation is composed only of objects of the class you called select on, and the selected columns from the joined models are added to the objects returned. Because these attributes are not Model's own attributes, they don't show up when you inspect these objects, and I believe this is the primary reason for confusion.
You could try this out in your rails console:
> result = Model.select(r[:id], s[:othercolumn].as('othercolumn')).joins(:someothermodel)
=> #<ActiveRecord::Relation [#<Model id: 1>]>
# "othercolumn" is not shown in the result but doing the following will yield correct result
> result.first.othercolumn
=> "myothercolumnvalue"

Rails/Sql - order/group search results such that repetition of entities occurs only after appearance of others

In my application, say, animals have many photos. I'm querying photos such that all photos of all animals are displayed, but I want each animal to appear once before any animal is repeated.
Example:
Animal instance 1, 'cat', has four photos; animal instance 2, 'dog', has two photos.
The photos should appear ordered like so:
(photo belongs_to animal)
tiddles.jpg, cat
fido.jpg, dog
meow.jpg, cat
rover.jpg, dog
puss.jpg, cat
felix.jpg, cat (no more dogs so two consecutive cats)
Pagination is required, so I can't order on an array.
Filename structure/convention provides no help, though the animal_id exists on each photo.
Though there are two types of animal in this example, this is an ActiveRecord model with hundreds of records.
Animals may be selectively queried.
If this isn't possible with active_record then I'll happily use sql; I'm using postgresql.
My brain is frazzled so if anyone can come up with a better title, please go ahead and edit it or suggest in comments.
Here is a PostgreSQL specific solution:
batch_id_sql = "RANK() OVER (PARTITION BY animal_id ORDER BY id ASC)"
Photo.paginate(
:select => "DISTINCT photos.*, (#{batch_id_sql}) batch_id",
:order => "batch_id ASC, photos.animal_id ASC",
:page => 1)
Here is a DB agnostic solution:
batch_id_sql = "
SELECT COUNT(bm.*)
FROM photos bm
WHERE bm.animal_id = photos.animal_id AND
bm.id <= photos.id
"
Photo.paginate(
:select => "photos.*, (#{batch_id_sql}) batch_id",
:order => "batch_id ASC, photos.animal_id ASC",
:page => 1)
Both queries work even when you have a where condition. Benchmark the query using the expected data set to check whether it meets your throughput and latency requirements.
Reference
PostgreSQL Window function
I have no experience with ActiveRecord, but using plain PostgreSQL I would try something like this: define a window function over all previous rows which counts how many times the current animal has appeared, then order by this count.
SELECT
  filename,
  animal_id,
  COUNT(*) OVER (PARTITION BY animal_id ORDER BY filename) AS cnt
FROM
  photos
ORDER BY
  cnt,
  animal_id,
  filename
Filtering on certain animal_id's will work. This will always order the same way. I don't know if you want something random in there, but it should be easily added.
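If you want to run this from ActiveRecord despite it being plain SQL, find_by_sql is one straightforward bridge (a sketch; it returns Photo objects exposing the selected columns plus cnt, but it does not paginate by itself):

# Sketch: run the window-function query through ActiveRecord.
ordered_photos = Photo.find_by_sql(<<-SQL)
  SELECT filename, animal_id,
         COUNT(*) OVER (PARTITION BY animal_id ORDER BY filename) AS cnt
  FROM photos
  ORDER BY cnt, animal_id, filename
SQL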
New solution
Add an integer column called batch_id to the photos table.
class AddBatchIdToPhotos < ActiveRecord::Migration
  def self.up
    add_column :photos, :batch_id, :integer
    set_batch_id
    change_column :photos, :batch_id, :integer, :null => false
    add_index :photos, :batch_id
  end

  def self.down
    remove_column :photos, :batch_id
  end

  def self.set_batch_id
    # set the batch id on existing rows
    # implement this
  end
end
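One possible backfill for set_batch_id, purely as a sketch (it walks every photo in Ruby, so it is only reasonable for small tables):

def self.set_batch_id
  # Number each animal's existing photos 1..n in id order.
  Photo.order("animal_id, id").group_by(&:animal_id).each_value do |photos|
    photos.each_with_index do |photo, i|
      Photo.update_all({:batch_id => i + 1}, {:id => photo.id})
    end
  end
end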
Now add a before_create on the Photo model to set the batch id.
class Photo
  belongs_to :animal

  before_create :batch_photo_add
  after_update  :batch_photo_update
  after_destroy :batch_photo_remove

  private

  def batch_photo_add
    self.batch_id = next_batch_id_for_animal(animal_id)
    true
  end

  def batch_photo_update
    return true unless animal_id_changed?
    batch_photo_remove(batch_id, animal_id_was)
    batch_photo_add
  end

  def batch_photo_remove(b_id = batch_id, a_id = animal_id)
    Photo.update_all("batch_id = batch_id - 1",
                     ["animal_id = ? AND batch_id > ?", a_id, b_id])
    true
  end

  def next_batch_id_for_animal(a_id)
    (Photo.maximum(:batch_id, :conditions => {:animal_id => a_id}) || 0) + 1
  end
end
Now you can get the desired result by issuing a simple paginate call:
@animal_photos = Photo.paginate(:page => 1, :per_page => 10,
                                :order => :batch_id)
How does this work?
Let's say we have the data set given below:
id   Photo Description   Batch Id
1    Cat_photo_1         1
2    Cat_photo_2         2
3    Dog_photo_1         1
4    Cat_photo_3         3
5    Dog_photo_2         2
6    Lion_photo_1        1
7    Cat_photo_4         4
Now if we were to execute a query ordered by batch_id we get this
# batch 1 (cat, dog, lion)
Cat_photo_1
Dog_photo_1
Lion_photo_1
# batch 2 (cat, dog)
Cat_photo_2
Dog_photo_2
# batch 3,4 (cat)
Cat_photo_3
Cat_photo_4
The batch distribution is not random; the animals are filled in from the top. The number of records displayed on a page is governed by the per_page parameter passed to paginate (not by the batch size).
Old solution
Have you tried this?
If you are using the will_paginate gem:
# assuming you want to order by animal name
animal_photos = Photo.paginate(:include => :animal, :page => 1,
                               :order => "animals.name")

animal_photos.each do |animal_photo|
  puts animal_photo.file_name
  puts animal_photo.animal.name
end
I'd recommend something hybrid/corrected based on KandadaBoggu's input.
First off, the correct way to do it on paper is with row_number() over (partition by animal_id order by id). The suggested rank() will generate a global row number, but you want the one within its partition.
Using a window function is also the most flexible solution (in fact, the only solution) if you want to plan to change the sort order here and there.
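Applied to the earlier paginate call, that would look roughly like this (a sketch of the same query with row_number() swapped in for rank()):

batch_id_sql = "ROW_NUMBER() OVER (PARTITION BY animal_id ORDER BY id ASC)"
Photo.paginate(
  :select => "photos.*, (#{batch_id_sql}) AS batch_id",
  :order  => "batch_id ASC, photos.animal_id ASC",
  :page   => 1)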
Take note that this won't necessarily scale well, however, because in order to sort the results you'll need to:
fetch the whole result set that matches your criteria
sort the whole result set to create the partitions and obtain a rank_id
top-n sort/limit over the result set a second time to get them in their final order
The correct way to do this in practice, if your sort order is immutable, is to maintain a pre-calculated rank_id. KandadaBoggu's other suggestion points in the correct direction in this sense.
When it comes to deletes (and possibly updates, if you don't want them sorted by id), you may run into issues because you end up trading faster reads for slower writes. If deleting the cat with an index of 1 leads to updating the next 50k cats, you're going to be in trouble.
If you have very small sets, the overhead might be perfectly acceptable (don't forget to index animal_id).
If not, there's a workaround if you find the order in which specific animals appear is irrelevant. It goes like this:
Start a transaction.
If the rank_id is going to change (i.e. insert or delete), obtain an advisory lock to ensure that two sessions can't impact the rank_id of the same animal class, e.g.:
SELECT pg_try_advisory_lock('the_table'::regclass, the_animal_id);
(Sleep for .05s if you don't obtain it.)
On insert, find max(rank_id) for that animal_id. Assign it rank_id + 1. Then insert it.
On delete, select the animal with the same animal_id and the largest rank_id. Delete your animal, and assign its old rank_id to the fetched animal (unless you were deleting the last one, of course).
Release the advisory lock.
Commit the work.
Note that the above will make good use of an index on (animal_id, rank_id) and can be done using plpgsql triggers:
create trigger "__animals_rank_id__ins"
before insert on animals
for each row execute procedure lock_animal_id_and_assign_rank_id();
create trigger "_00_animals_rank_id__ins"
after insert on animals
for each row execute procedure unlock_animal_id();
create trigger "__animals_rank_id__del"
before delete on animals
for each row execute procedure lock_animal_id();
create trigger "_00_animals_rank_id__del"
after delete on animals
for each row execute procedure reassign_rank_id_and_unlock_animal_id();
You can then create a multi-column index on your sort criteria if you're not joining all over the place, e.g. (rank_id, name). And you'll end up with a snappy site for reads and writes.
You should be able to get the pictures (or filenames, anyway) using ActiveRecord, ordered by name.
Then you can use Enumerable#group_by and Enumerable#zip to zip all the arrays together.
If you give me more information about how your filenames are really arranged (i.e., are they all for sure with an underscore before the number and a constant name before the underscore for each "type"? etc.), then I can give you an example. I'll write one up momentarily showing how you'd do it for your current example.
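A rough in-memory sketch of that group_by/zip idea (it operates on an already-loaded array, so it does not combine naturally with SQL-level pagination):

# Interleave photos so that each animal appears once per "round":
# cat1, dog1, cat2, dog2, cat3, cat4, ...
photos  = Photo.order("animal_id, id").to_a
groups  = photos.group_by(&:animal_id).values
longest = groups.max_by(&:size)
rest    = groups - [longest]
interleaved = longest.zip(*rest).flatten.compact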
You could run two sorts and build one array as follows:
result1 = the first photo of each animal type only; use Ruby's find method for this search.
result2 = all photos, sorted by animal; use find again to locate the first occurrence of each animal, then drop those first occurrences from result2.
Then:
markCustomResult = result1 + result2
You can then use will_paginate on markCustomResult.