Active Record: delete_all with limit - ruby-on-rails-3

Trying to get a definitive answer on whether it's possible to limit a delete_all to X number of records.
I'm trying the following:
Model.where(:account_id => account).order(:id).limit(1000).delete_all
but it doesn't seem to respect the limit and instead just deletes all Model records where :account_id => account.
I would expect it to generate the following:
delete from model where account_id = ? order by id limit 1000
This seems to work fine when using destroy_all but I want to delete in bulk.

This one also worked pretty well for me (and my needs):
Model.connection.exec_delete('DELETE FROM models ORDER BY id LIMIT 10000', 'DELETE', [])
I know it might seem a bit cumbersome, but it'll return the number of affected rows AND also log the query through the Rails logger. ;)
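For example, a quick sketch of using the return value (names are taken from the snippet above):
deleted = Model.connection.exec_delete('DELETE FROM models ORDER BY id LIMIT 10000', 'DELETE', [])
Rails.logger.info "removed #{deleted} rows"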

Try:
Model.delete(Model.where(:account_id => account).order(:id).limit(1000).pluck(:id))
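If the account can have far more rows than the limit, a possible variation (just a sketch; the batch size of 1,000 comes from the question) is to repeat the pluck-and-delete until nothing matches:
# Delete the matching rows 1,000 ids at a time.
loop do
  ids = Model.where(:account_id => account).order(:id).limit(1000).pluck(:id)
  break if ids.empty?
  Model.where(:id => ids).delete_all
end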

This was the best solution I used to delete millions of rows:
# Repeat the LIMITed delete until no rows are affected.
# (DELETE ... LIMIT needs MySQL or another database that supports it;
# exec_delete returns the number of affected rows.)
sql = %{ DELETE FROM model WHERE where_clause LIMIT 1000 }
results = 1
while results > 0
  results = ActiveRecord::Base.connection.exec_delete(sql)
end
This performed much faster than deleting in batches where the IDs were being used in the SQL.

ActiveRecord::Base.connection.send(:delete_sql,'delete from table where account_id = <account_id> limit 1000')
You have to use send because 'delete_sql' is protected, but this works.
I found that removing the 'order by' significantly sped it up too.
I do think it's weird that using .limit works with destroy_all but not delete_all.

Or Model.where(:account_id => account).order(:id).limit(1000).map(&:delete), although it is not the best approach if you have thousands of records to delete/destroy.
Model.delete_all seems to be the best option, as it delegates to SQL the task of selecting the records and mass-deleting them.

Related

Complex SQL Query in Rails 4

I have a complicated query I need for a scope in my Rails app and have tried a lot of things with no luck. I've resorted to raw SQL via find_by_sql but wondering if any gurus wanted to take a shot. I will simplify the verbiage a bit for clarity, but the problem should be stated accurately.
I have Users. Users own many Records. One of them is marked current (#is_current = true) and the rest are not. Each Record has many Relationships. Relationships have a value for when they were active (active_when) which takes four values, [1..4].
Values 1 and 2 are considered recent. Values 3 and 4 are not.
The problem was ultimately to have two scopes (has_recent_relationships and has_no_recent_relationships) on User that filter on whether or not they have recent Relationships on the current Record. (Old Records are irrelevant for this.) I tried creating recent and not_recent scopes on Relationship, and then building the scopes on Record, combining them with a check for is_current == 1. Here is where I failed. I have to move on with the app, so I have no choice but to use raw SQL for now, hoping to revisit this later. I put that on User, the only context I really need it in, and set aside the code for the scopes on the other objects.
The SQL that works, that correctly finds the Users who have recent relationships, is below. The other scope just uses "= 0" instead of "> 0" in the HAVING clause.
SELECT * FROM users WHERE `users`.`id` IN (
  SELECT
    records.owner_id
  FROM `records`
  LEFT OUTER JOIN `relationships` ON `relationships`.`record_id` = `records`.`id`
  WHERE `records`.`is_current` = 1
  HAVING (
    SELECT count(*)
    FROM relationships
    WHERE ((record_id = records.id) AND ((active_when = 1) OR (active_when = 2)))
  ) > 0
)
My instincts tell me this is complicated enough that my modeling probably could be redesigned and simplified, but the individual objects are pretty simple, just getting at this specific data from two objects away has become complicated.
Anyway, I'd appreciate any thoughts. I'm not expecting a full solution because, ick. Just thought the masochists among you might find this amusing.
Have you tried using Arel directly and this website?
Just copy-and-pasting your query you get this:
User.select(Arel.star).where(
  User.arel_table[:id].in(
    Relationship.select(Arel.star.count).where(
      Arel::Nodes::Group.new(
        Relationship.arel_table[:record_id].eq(Record.arel_table[:id]).and(
          Relationship.arel_table[:active_when].eq(1).or(Relationship.arel_table[:active_when].eq(2))
        )
      )
    ).joins(
      Record.arel_table.join(Relationship.arel_table, Arel::Nodes::OuterJoin).on(
        Relationship.arel_table[:record_id].eq(Record.arel_table[:id])
      ).join_sources
    ).ast
  )
)
I managed to find a way to create what I needed, which returns ActiveRecord::Relation objects and simplifies a lot of other code. Here's what I came up with. This might not scale well, but this app will probably not end up with so much data that it will be a problem.
I created two scope methods. The second depends on the first to simplify things:
def self.has_recent_relationships
  joins(records_owned: :relationships)
    .merge(Record.current)
    .where("(active_when = 1) OR (active_when = 2)")
    .distinct
end

def self.has_no_recent_relationships
  users_with_recent_relationships = User.has_recent_relationships.pluck(:id)
  if users_with_recent_relationships.length == 0
    User.all
  else
    User.where("id not in (?)", users_with_recent_relationships.to_a)
  end
end
The first finds Users with recent relationships by just joining Record, merging with a scope that selects the current record (there should be only one), and looking for the correct active_when values. Easy enough.
The second method finds Users who DO have recent relationships (using the first method). If there are none, then all Users are in the set of those with no recent relationships, and I return User.all (this will really never happen in the wild, but in theory it could). Otherwise I return the inverse of those who do have recent relationships, using the SQL keywords NOT IN and an array. It's this part that could be non-performant if the array gets large, but I'm going with it for the moment.
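If the NOT IN array ever does become a problem, a hedged alternative (assuming Rails 4's where.not and relation subqueries, which the question's title suggests are available) is to keep the exclusion inside the database:
def self.has_no_recent_relationships
  # Rails turns the inner relation into an IN subquery, so the ids never
  # have to be loaded into Ruby. A sketch only -- check the generated SQL.
  where.not(id: User.has_recent_relationships.select(:id))
end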

Pagination with SQLite using LIMIT

I'm writing my own SQLiteBrowser and I have one final problem, which is apparently quite often discussed on the web but doesn't seem to have a good general solution.
So currently I store the SQL which the user entered. Whenever I need to fetch rows, I execute the SQL, adding `LIMIT n, m` at the end.
For the normal SQL statements I mostly use, this seems good enough. However, if I want to use LIMIT myself in the query, this will obviously give an error, because the resulting SQL can look like this:
select * from table limit 30 limit 1,100
which is obviously wrong. Is there some better way to do this?
My idea was that I could scan the SQL and check if there is a LIMIT clause already used and then ignore it. Of course it's not as simple as that, because if I have an SQL statement like this:
select * from a where a.b = ( select x from z limit 1)
it obviously should still apply my limit in such a case, so I could scan the string from the end and look whether there is a LIMIT somewhere. My question now is how feasible this is. As I don't know how the SQL parser works, I'm not sure if LIMIT has to be at the end of the SQL or if there can be other clauses after it as well.
I tested it with ORDER BY and GROUP BY and I get SQL errors if LIMIT is not at the end, so my assumption seems to be true.
I have now found a much better solution which is quite simple and doesn't require me to parse the SQL.
The user can enter an arbitrary SQL statement. The result is loaded into a table. Since we don't want to load the whole result at once, as this can return millions of records, only N records are retrieved. When the user scrolls to the bottom of the table, the next N items are fetched and loaded into the table.
The solution is to wrap the user's SQL in an outer query with my page-size limits:
select * from (arbitrary UserSQL) limit PageSize offset CurrentOffset
I tested it with the SQL statements I regularly use, and this seems to work quite nicely and is also fast enough for my purpose.
However, I don't know whether SQLite has a mechanism to fetch the new rows faster or whether the SQL has to be rerun every time. In that case it might not be a good solution for really complex queries with a long response time.
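To make the mechanics concrete, a minimal Ruby sketch of that wrapping (the paged_sql helper name and the to_i guards are my own; the approach is the one described above):
# Wrap whatever the user typed in an outer query so our paging
# LIMIT/OFFSET never collides with a LIMIT inside the user's SQL.
def paged_sql(user_sql, page_size, offset)
  "SELECT * FROM (#{user_sql}) LIMIT #{page_size.to_i} OFFSET #{offset.to_i}"
end

paged_sql("select * from a where a.b = (select x from z limit 1)", 100, 200)
# => "SELECT * FROM (select * from a where ...) LIMIT 100 OFFSET 200"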

Ruby followers query

We have an SQL query in our Rails 3 app.
@followers returns an array of IDs of users following the current_user.
@followers = current_user.following
@feed_items = Micropost.where("belongs_to_id IN (?)", @followers)
Is there a more efficient way to do this query?
The query you have can't really be optimized any more than it is. It could be made faster by adding an index on belongs_to_id (which you should almost always do for foreign keys anyway), but that doesn't change the actual query.
There is a cleaner way to write IN queries though:
Micropost.where(:belongs_to_id => @followers)
where @followers is an array of values for belongs_to_id.
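For illustration, the hash form with an array produces a single IN query (table and column names taken from the question):
# Generates roughly:
#   SELECT "microposts".* FROM "microposts" WHERE "microposts"."belongs_to_id" IN (1, 2, 3)
Micropost.where(:belongs_to_id => [1, 2, 3])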
It looks good to me.
However, if you're really after the minimum number of characters in the code, you could change:
Micropost.where("belongs_to_id IN (?)", @followers)
to the hash form
Micropost.where(:belongs_to_id => @followers)
which reads a little easier.
Rails will see the array and generate the IN for you.
As always, one of the main goals of Ruby is readability, so little improvements help.
As for the query being inefficient, you should look into indexes on that field.
They tend to be a little database-specific, and you have only shown generic SQL in your question.
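For example, a minimal migration sketch (the class name is my own; the table and column come from the question):
class AddIndexToMicropostsBelongsToId < ActiveRecord::Migration
  def self.up
    add_index :microposts, :belongs_to_id
  end

  def self.down
    remove_index :microposts, :belongs_to_id
  end
end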

Rails/Active Record .save! efficiency question

New to rails/ruby (using rails 3 and ruby 1.9.2), and am trying to get rid of some unnecessary queries being executed.
When I'm running an each do:
apples.to_a.each do |apple|
  new_apple = apple.clone
  new_apple.save!
end
and I check the SQL log, I see three SELECT statements followed by one INSERT statement. The SELECT statements seem completely unnecessary. For example, they're something like:
SELECT Fruit.* from Fruits where Fruit.ID = 5 LIMIT 1;
SELECT Color.* from Colors where Color.ID = 6 LIMIT 1;
SELECT TreeType.* from TreeTypes where TreeType.ID = 7 LIMIT 1;
INSERT into Apples (Fruit_id, color_id, treetype_id) values (6, 7, 8) RETURNING "id";
Seemingly, this wouldn't take much time, but when I've got 70k inserts to run, I'm betting those three SELECTs for each INSERT will take up a decent amount of time.
So I'm wondering the following:
Is this typical of ActiveRecord/Rails .save! method, or did the previous developer add some sort of custom code?
Would those three select statements, being executed for each item, cause a noticeable amount of extra time?
If it is built into rails/active record, would it be easily bypassed, if that would make it run more efficiently?
You must be validating your associations on save for such a thing to occur, something like this:
class Apple < ActiveRecord::Base
  validates :fruit, :presence => true
end
In order to validate the relationship, the associated record must be loaded, and this needs to happen for each validation individually, for each record in turn. That's the standard behavior of save!
You could save without validations if you feel like living dangerously:
apples.to_a.each do |apple|
  new_apple = apple.clone
  new_apple.save(:validate => false)
end
The better approach is to manipulate the records directly in SQL by doing a mass insert, if your RDBMS supports it. For instance, MySQL will let you insert thousands of rows with one INSERT call. You can usually do this by making use of the Apple.connection access layer, which allows you to make arbitrary SQL calls with things like execute.
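For example, a rough sketch of a multi-row INSERT through the connection (MySQL-style syntax assumed; the table and column names are taken from the question's log, so treat this as illustrative only):
# Build one VALUES list from the records instead of saving them one by one.
values = apples.map do |apple|
  "(#{apple.fruit_id}, #{apple.color_id}, #{apple.treetype_id})"
end.join(", ")

Apple.connection.execute("INSERT INTO apples (fruit_id, color_id, treetype_id) VALUES #{values}")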
I'm guessing that there is a before_save callback (or a validation, as suggested above) being called that looks up the color and type of the fruit and stores them with the rest of the attributes when the fruit is saved, in which case these lookups are necessary ...
Normally I wouldn't expect ActiveRecord to do unnecessary lookups, though that does not mean it is always efficient ...

In which order Rails does the DB queries

In "Select n objects randomly with condition in Rails", Anurag kindly proposed this answer to randomly select n posts with votes >= x:
Post.all(:conditions => ["votes >= ?", x], :order => "rand()", :limit => n)
My concern is that the number of posts that have more than x votes is very big.
In what order does the DB apply these criteria to the query?
Does it
(a) select n posts with votes > x and then randomise them? or
(b) select all posts with votes > x, randomise them, and then select the first n posts?
(c) other?
The recommendation to check the development log is very useful.
However, in this case, the randomisation is happening on the MySQL end, not inside Active Record. In order to see how the query is being run inside MySQL, you can copy the query from the log and paste it into your MySQL tool of choice (console, GUI, whatever) and add "EXPLAIN" to the front of it.
You should end up with something like:
EXPLAIN SELECT * FROM posts WHERE votes >= 'x' ORDER BY rand() LIMIT n
When I try a similar query in MySQL, I am told:
select_type: SIMPLE
Extra: Using where; Using temporary; Using filesort
Then you should search for some of the excellent advice on SO on how to optimise MySQL queries. If there is an issue, adding an index on the votes column may improve performance.
As Toby already pointed out, this is purely up to the SQL server, with everything being done in the query itself.
However, I am afraid that you can't get truly randomised output unless the database fetches the whole result set first and then randomises it. Still, you should check the EXPLAIN output anyway.
Look in development.log for the generated query; it should give you a clue.
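If you'd rather see it directly in a console session, a quick sketch (pointing ActiveRecord's logger at STDOUT is a common trick; x and n are placeholders as in the question):
# Route ActiveRecord's log to the terminal, then run the query and read
# the generated SQL straight off the console output.
ActiveRecord::Base.logger = Logger.new(STDOUT)
Post.all(:conditions => ["votes >= ?", x], :order => "rand()", :limit => n)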