How can I speed up this query in a Rails app?

I need help optimizing a series of queries in a Rails 5 app. The following explains what I am doing, but if it isn't clear let me know and I will try to go into better detail.
I have the following methods in my models:
In my IncomeReport model:
class IncomeReport < ApplicationRecord
  def self.net_incomes_2015_totals_collection
    all.map(&:net_incomes_2015).compact
  end

  def net_incomes_2015
    incomes - producer.expenses_2015
  end

  def incomes
    total_yield * 1.15
  end
end
In my Producer model I have the following:
class Producer < ApplicationRecord
  def expenses_2015
    expenses.sum(&:expense_per_ha)
  end
end
In the Expense model I have:
class Expense < ApplicationRecord
  def expense_per_ha
    total_cost / area
  end
end
In the controller I have this
(I am using a gem called descriptive_statistics to get min, max, quartiles, etc in case you are wondering about that part at the end)
@income_reports_2015 = IncomeReport.net_incomes_2015_totals_collection.extend(DescriptiveStatistics)
Then in my view I use
<%= @income_reports_2015.descriptive_statistics[:min] %>
This code works when there are only a few objects in the database. However, now that there are thousands the query takes forever to give a result. It takes so long that it times out!
How can I optimize this to get the most performant outcome?

One approach might be to architect your application differently. I think a service-oriented architecture might be of use in this circumstance.
Instead of querying when the user goes to this view, you might want to use a worker to query intermittently and then write to a CSV, so that when a user navigates to this view you read from the CSV instead. This runs much faster because instead of doing the query then and there (when the user navigates to this page), you're simply reading from a file that was created earlier by a background process.
Obviously, this has its own set of challenges, but I've done this in the past to solve a similar problem. I wrote an app that fetched data from 10 different external APIs once a minute. The 10 fetches resulted in 10 objects in the db. 10 * 60 * 24 = 14,400 records in the DB per day. When a user loaded the page requiring this data, they would load 7 days' worth of records, 100,800 database rows. I ran into the same problem where the query done at runtime resulted in a timeout, so I wrote to a CSV and read from it as a workaround.
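As a rough sketch of that idea (the job class, schedule, and file path below are made up for the example, not part of your app):
# A hypothetical ActiveJob, run nightly or hourly by whatever scheduler you use.
require 'csv'

class IncomeReportExportJob < ApplicationJob
  queue_as :default

  def perform
    values = IncomeReport.net_incomes_2015_totals_collection
    CSV.open(Rails.root.join('tmp', 'income_reports_2015.csv'), 'w') do |csv|
      values.each { |v| csv << [v] }
    end
  end
end

# In the controller, read the pre-computed file instead of hitting the database:
# values = CSV.read(Rails.root.join('tmp', 'income_reports_2015.csv')).flatten.map(&:to_f)
# @income_reports_2015 = values.extend(DescriptiveStatistics)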

What's the structure of IncomeReport? Looking at the code, your problem lies in the all inside net_incomes_2015_totals_collection: all hits the database, returns every record, and then you map over them in Ruby. Overkill. Try to filter the data, query less, select less, and get the info you want directly with ActiveRecord. Looping in Ruby slows things down.
So, without knowing the table structure and its data, I'd do something like the following:
def self.net_incomes_2015_totals_collection
  where(created_at: Date.new(2015, 1, 1)..Date.new(2015, 12, 31))
    .where.not(net_incomes_2015: nil)
    .pluck(:net_incomes_2015)
end
Plus I'd make sure there's a composite index on created_at and net_incomes_2015.
It will probably still be slow, but better than it is now. You should also think about aggregating the data in the background (Resque, Sidekiq, etc.) at midnight and caching the result; a rough sketch of the cached version follows below.
Hope it helps.
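As an illustration of the caching idea (the job class, cache key, and schedule are invented for the example):
class IncomeStatsJob < ApplicationJob
  # Schedule this nightly with whatever you use (sidekiq-cron, whenever, clockwork, ...).
  def perform
    values = IncomeReport.net_incomes_2015_totals_collection
    Rails.cache.write('income_reports_2015', values, expires_in: 24.hours)
  end
end

# In the controller, only compute on a cache miss:
# @income_reports_2015 = Rails.cache.fetch('income_reports_2015') do
#   IncomeReport.net_incomes_2015_totals_collection
# end.extend(DescriptiveStatistics)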

It looks like you have a few N+1 queries here. Each report grabs its producer in an individual query. Then, each producer grabs each of its expenses in a different query.
You could avoid the first issue by throwing a preload(:producer) instead of the all. However, the sums later will be harder to avoid since sum will automatically fire a query.
You can avoid that issue with something like
def self.net_incomes_2015_totals_collection
  joins(producer: :expenses).
    select('income_reports.id', 'income_reports.total_yield * 1.15 - SUM(expenses.total_cost / expenses.area) AS net_incomes_2015').
    group('income_reports.id').
    map(&:net_incomes_2015).
    compact
end
to get everything in one query.

Related

How can I make these Active Record queries faster?

I'm using Active Record to send a few queries to a postgres database.
I have a User model, a Business model and a joiner named UserBiz.
The queries I mentioned go through the entire UserBiz collection and then filter out the businesses that match user provided categories and searches.
if !params[:category].blank?
  dataset = UserBiz.all.includes(:business, :user).select { |ub| ub.business.category.downcase == params[:category].downcase }
else
  dataset = UserBiz.all.includes(:business, :user)
end

if !params[:search].blank?
  dataset = dataset.select { |ub| ub.business.summary.downcase.include?(params[:search].downcase) || ub.business.name.downcase.include?(params[:search].downcase) }
end
These "work" but the problem is when I threw a quarter million UserBizs into my database to see what happens, one search or category change takes 15 seconds. How can I make these queries faster?
select in your code loads everything into memory, which is very bad for performance when dealing with a lot of records.
You have to do the filtering in the database, with something like this:
UserBiz
  .includes(:business, :user)
  .references(:businesses)  # needed so the string condition on businesses is joined rather than preloaded
  .where("LOWER(businesses.category) = LOWER(?)", params[:category])
It's slow because you're selecting all the data from UserBiz.
Try out pagination: pagy, will_paginate, etc.
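With will_paginate, for example, that might look something like this (the page size is arbitrary):
# Gemfile: gem 'will_paginate'
dataset = UserBiz.includes(:business, :user)
                 .paginate(page: params[:page], per_page: 50)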

Better way to pull N queries after limit with ActiveRecord

So, here is my problem.
I've got a database that imports data from a CSV that is huge. It contains around 32,000 entries but has around 200 columns, hence the standard select is slow.
When I do:
MyModel.all or MyModel.eager_load.all it takes anywhere from 45 seconds up to a minute to load all the entries.
The idea was to use limit to pull maybe 1000 entries like:
my_model = MyModel.limit(1000)
This way I can get the last id like:
last_id = my_model.last.id
To load the next 1000 records I literally use
my_model = MyModel.where('id > ?', last_id).limit(1000)
# then I set last_id again and keep repeating the process
last_id = my_model.last.id
But this seems like overkill, and it doesn't seem right.
Is there any better or easier way to do this?
Thank you in advance.
Ruby on Rails has the find_each method that does exactly what you're trying to do manually. It loads records from the database in batches of 1000.
MyModel.find_each do |instance|
  # do something with this instance, for example, write it into the CSV file
end
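If 1000 records of ~200 columns per batch is still too heavy, find_each also accepts a batch_size option, e.g.:
MyModel.find_each(batch_size: 200) do |instance|
  # smaller batches keep memory usage down at the cost of more queries
end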
Rails has an offset method that you can combine with limit.
my_model = MyModel.limit(1000).offset(1000)
You can see the API documentation here: https://apidock.com/rails/v6.0.0/ActiveRecord/QueryMethods/offset
Hope that helps :)

Rails - speed and time

I have a question about the "speed of logic" across MVC.
Suppose you have the same code, something like the following, in a model, in a view, and in a controller.
1) Is the speed of evaluating the logic and the query the same in all three (M, V, C)?
Pseudocode
x = model.where(:a > 3, b < 9).a.first
y = model.sum(:a)
z = (x / y) * 2310.0
Date.today - 5
This is a "stupid" pseudocode, but I want to know the performance of line of code most used by my app (calling a where query, calling a sum (aggregate) query, do some math, playing with date)
The problem is that my pages are slow to load. I have moved everything that manages queries into the models and added indexes. Maybe adding caching can help a little (but I use Highcharts, which I think can't be cached).
2) How can I find where the code bottleneck is (what slows the loading of pages)?
You can use some known tools to profile your controllers/actions/views/models.
NewRelic (a good agent to track your time in a distributed manner; I'd prefer this one)
Librato (an agent to which you can pass your metrics whenever a controller/action is hit; it can give you results over a period of time)
The Rails log output shows the distribution of time spent in the controller, views, and ActiveRecord for each request. You can definitely track some good things there.
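If you want to narrow a suspect block down yourself, you can also time it directly in the Rails console with Ruby's built-in Benchmark (the query below is just a placeholder for your own):
require 'benchmark'

elapsed = Benchmark.measure do
  Model.where('a > ?', 3).sum(:a)   # the suspect query or calculation
end
puts elapsed.real   # wall-clock seconds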

Memory leak in Rails 3.0.11 migration

A migration contains the following:
Service.find_by_sql("select
service_id,
registrations.regulator_given_id,
registrations.regulator_id
from
registrations
order by
service_id, updated_at desc").each do |s|
this_service_id = s["service_id"]
if this_service_id != last_service_id
Service.find(this_service_id).update_attributes!(:regulator_id => s["regulator_id"],
:regulator_given_id => s["regulator_given_id"])
last_service_id = this_service_id
end
end
and it is eating up memory, to the point where it will not run in the 512MB allowed in Heroku (the registrations table has 60,000 items). Is there a known problem? Workaround? Fix in a later version of Rails?
Thanks in advance
Edit following request to clarify:
That is all the relevant source - the rest of the migration creates the two new columns that are being populated. The situation is that I have data about services from multiple sources (regulators of the services) in the registrations table. I have decided to 'promote' some of the data ([prime]regulator_id and [prime]regulator_given_key) into the services table for the prime regulators to speed up certain queries.
This will load all 60000 items in one go and keep those 60000 AR objects around, which will consume a fair amount of memory. Rails does provide a find_each method for breaking down a query like that into chunks of 1000 objects at a time, but it doesn't allow you to specify an ordering as you do.
You're probably best off implementing your own paging scheme. Using limit/offset is a possibility, but large OFFSET values are usually inefficient because the database server has to generate a bunch of results that it then discards.
An alternative is to add conditions to your query that ensure you don't return already processed items, for example specifying that service_id be greater than the last one already processed (given the ascending order by service_id). This is more complicated if, compared in this manner, some items are equal. With both of these paging-type schemes you probably need to think about what happens if a row gets inserted into your registrations table while you are processing it (probably not a problem with migrations, assuming you run them with access to the site disabled).
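A minimal sketch of that second approach, assuming service_id is what you page on (the batch size is illustrative only):
last_seen = 0
loop do
  batch = Registration.where('service_id > ?', last_seen)
                      .order('service_id, updated_at DESC')
                      .limit(1000)
                      .to_a
  break if batch.empty?

  batch.each do |r|
    # pick out the first (most recently updated) row per service_id here;
    # note that rows for one service_id can straddle a batch boundary
  end

  last_seen = batch.last.service_id
end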
(Note: OP reports this didn't work)
Try something like this:
previous = nil
Registration.select('service_id, regulator_id, regulator_given_id')
            .order('service_id, updated_at DESC')
            .each do |r|
  if previous != r.service_id
    service = Service.find(r.service_id)
    service.update_attributes(:regulator_id => r.regulator_id, :regulator_given_id => r.regulator_given_id)
    previous = r.service_id
  end
end
This is a kind of hacky way of getting the most recent record per service from registrations -- there's undoubtedly a better way to do it with DISTINCT or GROUP BY in SQL, all in a single query, which would not only be a lot faster, but also more elegant (a possible single-query version is sketched below). But this is just a migration, right? And I didn't promise elegant. I also am not sure it will work and resolve the problem, but I think so :-)
The key change is that instead of using SQL, this uses AREL, meaning (I think) the update operation is performed once on each associated record as AREL returns them. With SQL, you return them all and store in an array, then update them all. I also don't think it's necessary to use the .select(...) clause.
Very interested in the result, so let me know if it works!
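For what it's worth, on PostgreSQL (which is what Heroku runs) the single-query version alluded to above might look roughly like this; it's untested, with the column list taken from the original migration:
# DISTINCT ON keeps only the most recently updated registration per service_id,
# so far fewer rows (and AR objects) come back.
rows = Service.find_by_sql(<<-SQL)
  SELECT DISTINCT ON (service_id)
         service_id, regulator_id, regulator_given_id
  FROM registrations
  ORDER BY service_id, updated_at DESC
SQL

rows.each do |row|
  Service.find(row['service_id']).update_attributes!(:regulator_id => row['regulator_id'],
                                                     :regulator_given_id => row['regulator_given_id'])
end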

Rails : How to build statistics per day/month/year or How database agnostic SQL functions are missing (ex. : STRFTIME, DATE_FORMAT, DATE_TRUNC)

I have been searching all over the web and I have no clue.
Suppose you have to build a dashboard in the admin area of your Rails app and you want to have the number of subscriptions per day.
Suppose that you are using SQLite3 for development, MySQL for production (pretty standard setup)
Basically, there are two options:
1) Retrieve all rows from the database using Subscriber.all and aggregate by day in the Rails app using Enumerable#group_by:
@subscribers = Subscriber.all
@subscriptions_per_day = @subscribers.group_by { |s| s.created_at.beginning_of_day }
I think this is a really bad idea. Retrieving all rows from the database can be acceptable for a small application, but it will not scale at all. Database aggregate and date functions to the rescue!
2) Run a SQL query in the database using aggregate and date functions:
Subscriber.select('STRFTIME("%Y-%m-%d", created_at) AS day, COUNT(*) AS subscriptions').group('day')
Which will run this SQL query:
SELECT STRFTIME("%Y-%m-%d", created_at) AS day, COUNT(*) AS subscriptions
FROM subscribers
GROUP BY day
Much better. Now aggregates are done in the database which is optimized for this kind of task, and only one row per day is returned from the database to the Rails app.
... but wait... now the app has to go live in my production env, which uses MySQL!
Replace STRFTIME() with DATE_FORMAT().
What if tomorrow I switch to PostgreSQL ?
Replace DATE_FORMAT() with DATE_TRUNC().
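In other words, staying agnostic means maintaining something like this adapter switch (just a sketch of the workaround, and exactly what I'd like to avoid):
day_expr =
  case ActiveRecord::Base.connection.adapter_name
  when /sqlite/i     then "STRFTIME('%Y-%m-%d', created_at)"
  when /mysql/i      then "DATE_FORMAT(created_at, '%Y-%m-%d')"
  when /postgresql/i then "DATE_TRUNC('day', created_at)"  # returns a timestamp rather than a string
  end

Subscriber.select("#{day_expr} AS day, COUNT(*) AS subscriptions").group('day')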
I like to develop with SQLite. Simple and easy.
I also like the idea that Rails is database agnostic.
But why doesn't Rails provide a way to translate SQL functions that do the exact same thing but have different syntax in each RDBMS (this difference is really stupid, but hey, it's too late to complain about it)?
I can't believe that I find so few answers on the Web for such a basic feature of a Rails app : count the subscriptions per day, month or year.
Tell me I'm missing something :)
EDIT
It's been a few years since I posted this question.
Experience has shown that I should use the same DB for dev and prod. So I now consider the database agnostic requirement irrelevant.
Dev/prod parity FTW.
I ended up writing my own gem. Check it out and feel free to contribute:
https://github.com/lakim/sql_funk
It allows you to make calls like:
Subscriber.count_by("created_at", :group_by => "day")
You speak of some pretty difficult problems that Rails, unfortunately, completely overlooks. The ActiveRecord::Calculations docs are written like they're all you ever need, but databases can do much more advanced things. As Donal Fellows mentioned in his comment, the problem is much trickier than it seems.
I've developed a Rails application over the last two years that makes heavy use of aggregation, and I've tried a few different approaches to the problem. I unfortunately don't have the luxury of ignoring things like daylight saving time because the statistics are "only trends". The calculations I generate are tested by my customers to exact specifications.
To expand upon the problem a bit, I think you'll find that your current solution of grouping by dates is inadequate. It seems like a natural option to use STRFTIME. The primary problem is that it doesn't let you group by arbitrary time periods. If you want to aggregate by year, month, day, hour, and/or minute, STRFTIME will work fine. If not, you'll find yourself looking for another solution. Another huge problem is aggregation upon aggregation. Say, for example, you want to group by month, but you want to do it starting from the 15th of every month. How would you do it using STRFTIME? You'd have to group by each day, then by month, and then somehow account for the starting offset of the 15th day of each month. The final straw is that grouping by STRFTIME necessitates grouping by a string value, which you'll find very slow when performing aggregation upon aggregation.
The most performant and best designed solution I've come to is one based upon integer time periods. Here is an excerpt from one of my mysql queries:
SELECT
  field1, field2, field3,
  CEIL((UNIX_TIMESTAMP(CONVERT_TZ(date, '+0:00', @@session.time_zone)) + :begin_offset) / :time_interval) AS time_period
FROM
  some_table
GROUP BY
  time_period
In this case, :time_interval is the number of seconds in the grouping period (e.g. 86400 for daily) and :begin_offset is the number of seconds to offset the period start. The CONVERT_TZ() business accounts for the way MySQL interprets dates. MySQL always assumes that the date field is in the MySQL local time zone. But because I store times in UTC, I must convert them from UTC to the session time zone if I want the UNIX_TIMESTAMP() function to give me a correct response. The time period ends up being an integer that describes the number of time intervals since the start of Unix time. This solution is much more flexible because it lets you group by arbitrary periods and doesn't require aggregation upon aggregation.
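For reference, those :begin_offset and :time_interval placeholders can be bound from the Rails side with named parameters, roughly like this (the model, table, and field names are just the placeholders from the query above):
sql = <<-SQL
  SELECT field1, field2, field3,
         CEIL((UNIX_TIMESTAMP(CONVERT_TZ(date, '+0:00', @@session.time_zone)) + :begin_offset) / :time_interval) AS time_period
  FROM some_table
  GROUP BY time_period
SQL

# daily buckets (86,400 seconds), periods starting at midnight UTC
rows = SomeModel.find_by_sql([sql, { :begin_offset => 0, :time_interval => 86_400 }])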
Now, to get to my real point. For a robust solution, I'd recommend that you consider not using Rails at all to generate these queries. The biggest issue is that the performance characteristics and subtleties of aggregation are different across the databases. You might find one design that works well in your development environment but not in production, or vice-versa. You'll jump through a lot of hoops to get Rails to play nicely with both databases in query construction.
Instead I'd recommend that you generate database-specific views in your chosen database and bring those along to the correct environment. Try to model the view as you would any other ActiveRecord table (id's and all), and of course make the fields in the view identical across databases. Because these statistics are read-only queries, you can use a model to back them and pretend like they're full-fledged tables. Just raise an exception if somebody tries to save, create, update, or destroy.
Not only will you get simplified model management by doing things the Rails way, you'll also find that you can write unit tests for your aggregation features in ways you wouldn't dream of in pure SQL. And if you decide to switch databases, you'll have to rewrite those views, but your tests will tell you where you're wrong, and make life so much easier.
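A minimal sketch of such a view-backed, read-only model, with invented names:
class SubscriptionStat < ActiveRecord::Base
  # Backed by a database view (e.g. daily_subscription_stats) created separately
  # in each environment; keep the view's columns identical across databases.
  self.table_name = 'daily_subscription_stats'

  # Any attempt to save or destroy raises ActiveRecord::ReadOnlyRecord.
  def readonly?
    true
  end
end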
I just released a gem that allows you to do this easily with MySQL. https://github.com/ankane/groupdate
You should really try to run MySQL in development, too. Your development and production environments should be as close as possible; there's less of a chance for something to work in development and totally break in production.
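With groupdate, the per-day count from the question collapses to something like this (going from the gem's README, so double-check the exact options):
# gem 'groupdate'  (on MySQL the time zone tables need to be loaded for time zone support)
@subscriptions_per_day = Subscriber.group_by_day(:created_at).count
# => { 2013-05-01 => 50, 2013-05-02 => 100, ... }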
If db agnosticism is what you're after, I can think of a couple of options:
Create a new field (we'll call it day_str) for the Subscriber that stores either the formatted date or a timestamp and use ActiveRecord.count:
daily_subscriber_counts = Subscriber.count(:group => "day_str")
The trade-off is of course a slightly larger record size, but this would all but eliminate performance worries.
You could also, depending on how granular the data that's being visualized is, just call .count several times with the date set as desired...
daily_subscriber_counts = {}
((Date.today - 7)..Date.today).each do |d|
  daily_subscriber_counts[d] = Subscriber.count(:conditions => ["created_at >= ? AND created_at < ?", d.to_time, (d + 1).to_time])
end
This could also be customized to account for varying granularities (per month, per year, per day, per hour). It's not the most efficient solution if you want to group by day across all of your subscribers (and I haven't had a chance to run it either), but I would imagine you'd want to group by month, day, or hour when viewing a year's worth, a month's worth, or a day's worth of data respectively.
If you're willing to commit to mysql and sqlite you could use...
daily_subscriber_counts = Subscriber.count(:group => "date(created_at)")
...as they share similar date() functions.
I'd refine/expand PBaumann's answer slightly, and include a Dates table in your database. You'd need a join in your query:
SELECT D.DateText AS Day, COUNT(*) AS Subscriptions
FROM subscribers AS S
INNER JOIN Dates AS D ON S.created_at = D.Date
GROUP BY D.DateText
...but you'd have a nicely-formatted value available without calling any functions. With a PK on Dates.Date, you can merge join and it should be very fast.
If you have an international audience, you could use DateTextUS, DateTextGB, DateTextGer, etc., but obviously this would not be a perfect solution.
Another option: cast the date to text on the database side using CONVERT(), which is ANSI and may be available across databases; I'm too lazy to confirm that right now.
Here's how I do it:
I have a class Stat which allows storing raw events.
(Code is from the first few weeks I started coding in Ruby so excuse some of it :-))
class Stat < ActiveRecord::Base
  belongs_to :statable, :polymorphic => true
  attr_accessible :statable_id, :statable_type, :statable_stattype_id, :source_url, :referral_url, :temp_user_guid

  # you can replace this with a cron job for better performance
  # the reason I have it here is because I care about real-time stats
  after_save :aggregate

  def aggregate
    aggregateinterval(1.hour)
    #aggregateinterval(10.minutes)
  end

  # will aggregate an interval with the following properties:
  # take t = 1.hour as an example
  # it's 5:21 pm now, it will aggregate everything between 5 and 6
  # and put them in the interval with start time 5:00 pm and 6:00 pm for today's date
  # if you wish to create a cron job for this, you can specify the start time, and t
  def aggregateinterval(t = 1.hour)
    aggregated_stat = AggregatedStat.where('start_time = ? and end_time = ? and statable_id = ? and statable_type = ? and statable_stattype_id = ?', Time.now.utc.floor(t), Time.now.utc.floor(t) + t, self.statable_id, self.statable_type, self.statable_stattype_id)
    if (aggregated_stat.nil? || aggregated_stat.empty?)
      aggregated_stat = AggregatedStat.new
    else
      aggregated_stat = aggregated_stat.first
    end
    aggregated_stat.statable_id = self.statable_id
    aggregated_stat.statable_type = self.statable_type
    aggregated_stat.statable_stattype_id = self.statable_stattype_id
    aggregated_stat.start_time = Time.now.utc.floor(t)
    aggregated_stat.end_time = Time.now.utc.floor(t) + t
    # in minutes
    aggregated_stat.interval_size = t / 60
    if (!aggregated_stat.count)
      aggregated_stat.count = 0
    end
    aggregated_stat.count = aggregated_stat.count + 1
    aggregated_stat.save
  end
end
And here's the AggregatedStat class:
class AggregatedStat < ActiveRecord::Base
  belongs_to :statable, :polymorphic => true
  attr_accessible :statable_id, :statable_type, :statable_stattype_id, :start_time, :end_time
end
Every statable item that gets added to the db has a statable_type and a statable_stattype_id and some other generic stat data. The statable_type and statable_stattype_id are for the polymorphic classes and can hold values like (the string) "User" and 1, which means you're storing stats about User number 1.
You can add more columns and have mappers in the code extract the right columns when you need them. Creating multiple tables makes things harder to manage.
In the code above, StatableStattypes is just a table that contains "events" you'd like to log... I use a table because prior experience taught me that I don't want to look for what type of stats a number in the database refers to.
class StatableStattype < ActiveRecord::Base
  attr_accessible :name, :description
  has_many :stats
end
Now go to the classes you'd like to have some stats for and do the following:
class User < ActiveRecord::Base
  # first line isn't too useful except for testing
  has_many :stats, :as => :statable, :dependent => :destroy
  has_many :aggregated_stats, :as => :statable, :dependent => :destroy
end
You can then query the aggregated stats for a certain User (or Location in the example below) with this code:
Location.first.aggregated_stats.where("start_time > ?", DateTime.now - 8.month)