Better way to pull N queries after limit with ActiveRecord - sql

So, here is my problem.
I've got a database table that is populated from a huge CSV import. It contains around 32,000 entries but has around 200 columns, so a standard select is slow.
When I do:
MyModel.all or MyModel.eager_load.all it takes anywhere from 45 seconds up to a minute to load all the entries.
The idea was to use limit to pull maybe 1000 entries like:
my_model = MyModel.limit(1000)
This way I can get the last id like:
last_id = my_model.last.id
To load the next 1000 records I literally use
my_model = MyModel.where('id > ?', last_id).limit(1000)
# then I set last_id again, and keep repeating the process
last_id = my_model.last.id
But this seems like overkill, and doesn't seem right.
Is there any better or easier way to do this?
Thank you in advance.

Ruby on Rails has the find_each method that does exactly what you try to do manually. It loads all records from the database in batches of 1000.
MyModel.find_each do |instance|
  # do something with this instance, for example, write it into the CSV file
end
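For reference, find_each pages through the table by primary key rather than by offset. Here is a minimal plain-Ruby sketch of that loop, assuming an in-memory array stands in for the table. ROWS, fetch_batch and each_in_batches are hypothetical names for illustration, not ActiveRecord API:

```ruby
# Hypothetical in-memory stand-in for a table, sorted by primary key.
ROWS = (1..2500).map { |id| { id: id, name: "row #{id}" } }.freeze

# Stand-in for: SELECT * FROM rows WHERE id > last_id ORDER BY id LIMIT size
def fetch_batch(rows, last_id, size)
  rows.select { |row| row[:id] > last_id }.first(size)
end

# Yields every row while holding at most `size` rows at a time.
def each_in_batches(rows, size: 1000)
  last_id = 0
  loop do
    batch = fetch_batch(rows, last_id, size)
    break if batch.empty?

    batch.each { |row| yield row }
    # Remember where this batch ended so the next one starts after it.
    last_id = batch.last[:id]
  end
end

seen = 0
each_in_batches(ROWS) { |_row| seen += 1 }
puts seen # 2500
```

With ActiveRecord itself, the batch size can be changed with MyModel.find_each(batch_size: 500).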

Rails has an offset method that you can combine with limit.
my_model = MyModel.limit(1000).offset(1000)
You can see the API documentation here: https://apidock.com/rails/v6.0.0/ActiveRecord/QueryMethods/offset
Hope that helps :)
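To walk through all the pages this way, the offset has to grow with each page. Here is a small sketch of that arithmetic, assuming 1-based page numbers. page_window is a hypothetical helper, and the array only stands in for the table:

```ruby
# Returns the limit/offset pair for a 1-based page number.
def page_window(page, per_page = 1000)
  raise ArgumentError, 'page must be >= 1' if page < 1

  { limit: per_page, offset: (page - 1) * per_page }
end

rows = (1..32_000).to_a # hypothetical stand-in for 32,000 ordered ids
window = page_window(2)

# Stand-in for: ORDER BY id LIMIT 1000 OFFSET 1000
page = rows.drop(window[:offset]).first(window[:limit])
puts page.first # 1001
puts page.last  # 2000
```

In ActiveRecord the same window would be something like MyModel.order(:id).limit(window[:limit]).offset(window[:offset]). The explicit order matters, because without it the database is free to return rows in a different order on each page.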

Related

How can I speed up this query in a Rails app?

I need help optimizing a series of queries in a Rails 5 app. The following explains what I am doing, but if it isn't clear let me know and I will try to go into better detail.
I have the following methods in my models:
In my IncomeReport model:
class IncomeReport < ApplicationRecord
  def self.net_incomes_2015_totals_collection
    all.map(&:net_incomes_2015).compact
  end

  def net_incomes_2015
    (incomes) - producer.expenses_2015
  end

  def incomes
    total_yield * 1.15
  end
end
In my Producer model I have the following:
class Producer < ApplicationRecord
  def expenses_2015
    expenses.sum(&:expense_per_ha)
  end
end
In the Expense model I have:
class Expense < ApplicationRecord
  def expense_per_ha
    total_cost / area
  end
end
In the controller I have this
(I am using a gem called descriptive_statistics to get min, max, quartiles, etc in case you are wondering about that part at the end)
@income_reports_2015 = IncomeReport.net_incomes_2015_totals_collection.extend(DescriptiveStatistics)
Then in my view I use
<%= @income_reports_2015.descriptive_statistics[:min] %>
This code works when there are only a few objects in the database. However, now that there are thousands the query takes forever to give a result. It takes so long that it times out!
How can I optimize this to get the most performant outcome?
One approach might be to architect your application differently. I think a service-oriented architecture might be of use in this circumstance.
Instead of querying when the user goes to this view, you might want to use a worker to run the query periodically and write the results to a CSV. When a user navigates to this view, you read from the CSV instead. This would run much faster because, instead of running the query then and there (when the user navigates to this page), you're simply reading a file that a background process created earlier.
Obviously, this has its own set of challenges, but I've done this in the past to solve a similar problem. I wrote an app that fetched data from 10 different external APIs once a minute. The 10 fetches resulted in 10 objects in the DB each minute: 10 * 60 * 24 = 14,400 records per day. When a user loaded the page requiring this data, they loaded 7 days' worth of records, or 100,800 database rows. I ran into the same problem, where running the query at request time resulted in a timeout, so I wrote the data to a CSV and read from it as a workaround.
What's the structure of IncomeReport? Looking at the code, your problem lies in the all call in net_incomes_2015_totals_collection. all hits the database and returns every record, and then you map over them in Ruby. That's overkill. Try to filter the data, query less, select less, and get all the info you want directly with ActiveRecord. Ruby loops slow things down.
So, without knowing the table structure and its data, I'd do the following:
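def self.net_incomes_2015_totals_collection
  # Covers all of 2015 in the application's time zone.
  year_2015 = Time.zone.local(2015, 1, 1)..Time.zone.local(2015, 12, 31).end_of_day
  where(created_at: year_2015).where.not(net_incomes_2015: nil).pluck(:net_incomes_2015)
end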
Plus I'd make sure there's a composite index on created_at and net_incomes_2015.
It will probably be slow but better than it is now. You should think about aggregating the data in the background (resque, sidekiq, etc) at midnight (and cache it?).
Hope it helps.
It looks like you have a few N+1 queries here. Each report grabs its producer in an individual query. Then each producer grabs its expenses in another query.
You could avoid the first issue by throwing a preload(:producer) instead of the all. However, the sums later will be harder to avoid since sum will automatically fire a query.
You can avoid that issue with something like
def self.net_incomes_2015_totals_collection
  joins(producer: :expenses).
    select(:id, 'income_reports.total_yield * 1.15 - SUM(expenses.total_cost/expenses.area) AS net_incomes_2015').
    group(:id).
    map(&:net_incomes_2015).
    compact
end
to get everything in one query.
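The grouped query sums the expenses once per producer instead of once per report. Here is the same idea as a plain-Ruby sketch, assuming Structs stand in for the tables. Report, Expense and net_incomes are hypothetical names for illustration, not the models from the question:

```ruby
# Hypothetical plain-Ruby stand-ins for the income_reports and expenses tables.
Report  = Struct.new(:id, :producer_id, :total_yield)
Expense = Struct.new(:producer_id, :total_cost, :area)

# Returns one net income per report without a per-report expense lookup.
def net_incomes(reports, expenses)
  # Single pass over expenses: producer_id => sum of total_cost / area.
  totals = Hash.new(0.0)
  expenses.each { |e| totals[e.producer_id] += e.total_cost / e.area }

  reports.map { |r| r.total_yield * 1.15 - totals[r.producer_id] }
end

reports  = [Report.new(1, 10, 100.0), Report.new(2, 20, 200.0)]
expenses = [Expense.new(10, 50.0, 5.0), Expense.new(10, 30.0, 3.0),
            Expense.new(20, 80.0, 4.0)]

p net_incomes(reports, expenses) # roughly [95.0, 210.0], up to float rounding
```

One difference to keep in mind: the inner join in the query drops reports whose producer has no expenses, whereas this sketch treats them as having zero expenses.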

tableUnavailable dependent upon size of search

I'm experiencing something rather strange with some queries that I'm performing in BigQuery.
Firstly, I'm using an externally backed table (csv.gz) with about 35 columns. The total data in the location is around 5 GB, with an average file size of 350 MB. The reason I'm doing this is that I continually add data to and remove data from the table on a rolling basis, to give me a view of the last 7 days of our activity.
When querying, if I perform something simple like:
select * from X limit 10
everything works fine. It continues to work fine if you increase the limit up to 1 million rows. As soon as you up the limit to ten million:
select * from X limit 10000000
I end up with a tableUnavailable error "Something went wrong with the table you queried. Contact the table owner for assistance. (error code: tableUnavailable)"
Now according to any literature on this, this usually results from using some externally owned table (I'm not). I can't find any other enlightening information for this error code.
Basically, if I do anything slightly complex on the data, I get the same result. There's a column called eventType that has maybe a couple hundred different values in the entire dataset. If I perform the following:
select eventType, count(1) from X group by eventType
I get the same error.
I'm getting the feeling that this might be related to limits on external tables? Can anybody clarify or shed any light on this?
Thanks in advance!
Doug

Pig: how to loop through all fields/columns?

I'm new to Pig. I need to do some calculation for all fields/columns in a table. However, I can't find a way to do it by searching online. It would be great if someone here can give some help!
For example: I have a table with 100 fields/columns, most of them numeric. I need to find the average of each field/column. Is there an elegant way to do it without repeating AVG(column_xxx) 100 times?
If there are just one or two columns, then I can do
B = GROUP A ALL;
C = FOREACH B GENERATE AVG(A.column_1), AVG(A.column_2);
However, if there are 100 fields, it's really tedious to write AVG 100 times, and it's easy to make errors.
One way I can think of is embed Pig in Python and use Python to generate a string like that and put into compile. However, that still sounds weird even if it works.
Thank you in advance for help!
I don't think there is a nice way to do this with pig. However, this should work well enough and can be done in 5 minutes:
1. Describe the table (or alias) in question.
2. Copy the output and reorganize it manually into the script part you need (for example, with Excel).
3. Finish and store the script.
If you need to be able to cope with columns that can suddenly change etc., there is probably no good way to do it in Pig. Perhaps you could read all the columns into another tool (R, for example) and do your operation there.
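The manual reorganizing step can also be scripted. Below is a rough sketch that builds the Pig statements from a list of column names, using Pig's built-in AVG function. pig_average_script is a hypothetical helper, and the relation and column names are placeholders; the real ones would come from the DESCRIBE output:

```ruby
# Builds the Pig statements that average every listed column.
# `relation` and the column names are placeholders for illustration.
def pig_average_script(relation, columns)
  averages = columns.map { |col| "AVG(#{relation}.#{col}) AS avg_#{col}" }
  <<~PIG
    B = GROUP #{relation} ALL;
    C = FOREACH B GENERATE #{averages.join(', ')};
  PIG
end

puts pig_average_script('A', %w[column_1 column_2 column_3])
```

The generated text can be pasted into the script or written to a file. Non-numeric columns would need to be filtered out of the list first.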

Need help in optimizing the query in ruby on rails

I have some 30,000 records in my raw_deals table, the raw_cities table has some 30 records, and each deal is linked with some 5-8 cities.
Now I want to fetch a random deal within some specific cities.
The list of those cities can be fetched like this:
@raw_cities = RawCity.where('disabled = ?', 0).map(&:id)
Now I need a deal. I wrote a query, but it's taking too much time.
@raw_deal = RawDeal.order("RAND()").find(:first,:joins=>[:raw_cities], :conditions=>["raw_cities.id IN (?)",@raw_cities])
The order("RAND()") is probably what's slowing your query down, and since you're only looking for one single deal, you can use a combination of limit and offset to simulate a random order.
Try something like this:
# Count the filtered rows so the random offset never runs past the end.
scope = RawDeal.joins(:raw_cities).where(raw_cities: { id: @raw_cities })
@raw_deal = scope.offset(rand(scope.count)).first
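One thing to watch with this approach: the random offset has to be drawn from the number of rows that match the city filter, otherwise it can point past the end and return nothing. Here is a plain-Ruby sketch of the idea, assuming an array stands in for the joined tables. Deal and random_deal are hypothetical names for illustration:

```ruby
# Hypothetical stand-in for deals joined to their cities.
Deal = Struct.new(:id, :city_ids)

# Picks one random deal linked to at least one enabled city, or nil.
def random_deal(deals, enabled_city_ids, rng: Random.new)
  # Stand-in for the join plus the filter on city ids.
  matching = deals.select { |deal| (deal.city_ids & enabled_city_ids).any? }
  return nil if matching.empty?

  # Stand-in for offset(rand(count)).first on the filtered set.
  offset = rng.rand(matching.size)
  matching.drop(offset).first
end

deals = [Deal.new(1, [1, 2]), Deal.new(2, [3]), Deal.new(3, [2, 4])]
deal = random_deal(deals, [2, 4])
puts deal.id # 1 or 3, never 2
```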

When Does Django Perform the Database Lookup?

From the following code:
dvdList = Dvd.objects.filter(title = someDvdTitle)[:10]
for dvd in dvdList:
    result = "Title: "+dvd.title+" # "+dvd.price+"."
When does Django do the lookup? Maybe it's just paranoia, but it seems if I comment out the for loop, it returns a lot quicker. Is the first line setting up a filter and then the for loop executes it, or am I completely muddled up? What actually happens with those lines of code?
EDIT:
What would happen if I limited the objects.filter to '1000' and then implemented a counter in the for loop that broke out of it after 10 iterations? Would that effectively only get 10 values or 1000?
Django querysets are evaluated lazily, so yes, the query won't actually be executed until you try and get values out of it (as you're doing in the for loop).
From the docs:
You can evaluate a QuerySet in the following ways:
Iteration. A QuerySet is iterable, and it executes its database query the first time you iterate over it. For example, this will print the headline of all entries in the database:
for e in Entry.objects.all():
    print e.headline
...(snip)...
See When Querysets are evaluated.
Per your edit:
If you limited the filter to 1000 and then implemented a counter in the for loop that broke out of it after 10 iterations, then you'd hit the database for all 1000 rows - Django has no way of knowing ahead of time exactly what you're going to do with the Queryset - it just knows that you want some data out of it, so evaluates the query string it's built up.
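To make that behaviour concrete, here is a toy model of a lazily evaluated query. It is written in Ruby to match the other sketches on this page and is only an analogy, not Django's implementation. LazyQuery and rows_fetched are hypothetical names:

```ruby
# Toy stand-in for a lazily evaluated query; an analogy only.
class LazyQuery
  include Enumerable

  attr_reader :rows_fetched

  def initialize(table, limit)
    @table = table
    @limit = limit
    @rows_fetched = 0
    @cache = nil
  end

  # The "query" runs on first iteration and the result is cached.
  def each(&block)
    run_query if @cache.nil?
    @cache.each(&block)
  end

  private

  # Fetches every row allowed by the limit, however many get consumed.
  def run_query
    @cache = @table.first(@limit)
    @rows_fetched = @cache.size
  end
end

query = LazyQuery.new((1..5000).to_a, 1000)
puts query.rows_fetched # 0, nothing has run yet

query.each_with_index do |_row, index|
  break if index == 9 # stop after 10 rows
end
puts query.rows_fetched # 1000, the whole limit was fetched anyway
```

If only 10 rows are needed, putting that limit into the query itself (the [:10] slice in the question) is what keeps the fetch small.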
It may also be good to evaluate the whole queryset at once using list() or any other method that forces evaluation. I find it boosts performance sometimes (no paying for the DB connection every time).
Find more info about when Django evaluates querysets here.