Rails 3 how to reduce queries - ruby-on-rails-3

c = Conversation.joins(:messages).random
>> Conversation Load (8.1ms) SELECT `conversations`.* FROM `conversations` INNER JOIN `messages` ON `messages`.`conversation_id` = `conversations`.`id` ORDER BY created_at DESC, RAND() LIMIT 1
Count:
c.messages.count
>> (5.7ms) SELECT COUNT(*) FROM `messages` WHERE `messages`.`conversation_id` = 74
Length:
c.messages.length
>> Message Load (1.2ms) SELECT `messages`.* FROM `messages` WHERE `messages`.`conversation_id` = 82
How to not perform additional queries above? I thought I have already joined messages for the random conversation using INNER JOIN, and yet a new query is performed to count those results?
The second issue is attempting to get all joined messages with a specific user_id:
u = User with id 27 # has messages in 'c' results above
c.messages.where('user_id = ?', u.id).all
>> Message Load (5.3ms) SELECT `messages`.* FROM `messages` WHERE `messages`.`conversation_id` = 82 AND (user_id = 27)
I used select to do this without additional queries:
c.messages.select { |msg| msg.user_id == u.id }
>> returns messages without query logged
I'd still appreciate how I might reduce or optimize these queries.

Try using includes instead of joins to avoid the n+1 problem.
Railscast #181 By Ryan Bates, http://railscasts.com/episodes/181-include-vs-joins?view=asciicast, explains the difference in an easy way.

Related

Using LATERAL joins in Ecto v2.0

I'm trying to join the latest comment on a post record, like so:
comment = from c in Comment, order_by: [desc: c.inserted_at], limit: 1
post = Repo.all(
from p in Post,
where: p.id == 123,
join: c in subquery(comment), on: c.post_id == p.id,
select: [p.title, c.body],
limit: 1
)
Which generates this SQL:
SELECT p0."title",
c1."body"
FROM "posts" AS p0
INNER JOIN (SELECT p0."id",
p0."body",
p0."inserted_at",
p0."updated_at"
FROM "comments" AS p0
ORDER BY p0."inserted_at" DESC
LIMIT 1) AS c1
ON c1."post_id" = p0."id"
WHERE ( p0."id" = 123 )
LIMIT 1
It just returns nil. If I remove the on: c.post_id == p.id it'll return data, but obviously it'll return the lastest comment for all posts, not the post in question.
What am I doing wrong? A fix could be to use a LATERAL join subquery, but I can't figure out whether it's possible to pass the p reference into a subquery.
Thanks!
The issue was caused by the limit: 1 here:
comment = from c in Comment, order_by: [desc: c.inserted_at], limit: 1
Since the resulting query was SELECT * FROM "comments" AS p0 ORDER BY p0."inserted_at" DESC LIMIT 1, it was only returning the most recent comment on ANY post, not the post I was querying against.
FYI the query was >150ms with ~200,000 comment rows, but that was brought down to ~12ms with a simple index:
create index(:comments, ["(inserted_at::date) DESC"])
It's worth noting that while this query works in returning the post in question and only the most recent comment, it'll actually return $number_of_comments rows if you remove the limit: 1. So say if you wanted to retrieve all 100 posts in your database with the most recent comment of each, and you had 200,000 comments in the database, this query would return 200,000 rows. Instead you should use a LATERAL join as discussed below.
.
Update
Unfortunately ecto doesn't support LATERAL joins right now.
An ecto fragment would work great here, however the join query wraps the fragment in additional parentheses (i.e. INNER JOIN (LATERAL (SELECT …))), which isn't valid SQL, so you'd have to use raw SQL for now:
sql = """
SELECT p."title",
c."body"
FROM "posts" AS p
INNER JOIN LATERAL (SELECT c."id",
c."body",
c."inserted_at"
FROM "comments" AS c
WHERE ( c."post_id" = p."id" )
ORDER BY c."inserted_at" DESC
LIMIT 1) AS c
ON true
WHERE ( p."id" = 123 )
LIMIT 1
"""
res = Ecto.Adapters.SQL.query!(Repo, sql, [])
This query returns in <1ms on the same database.
Note this doesn't return your Ecto model struct, just the raw response from Postgrex.

Bad performance of SQL query due to ORDER BY clause

I have a query joining 4 tables with a lot of conditions in the WHERE clause. The query also includes ORDER BY clause on a numeric column. It takes 6 seconds to return which is too long and I need to speed it up. Surprisingly I found that if I remove the ORDER BY clause it takes 2 seconds. Why the order by makes so massive difference and how to optimize it? I am using SQL server 2005. Many thanks.
I cannot confirm that the ORDER BY makes big difference since I am clearing the execution plan cache. However can you shed light at how to speed this up a little bit? The query is as follows (for simplicity there is "SELECT *" but I am only selecting the ones I need).
SELECT *
FROM View_Product_Joined j
INNER JOIN [dbo].[OPR_PriceLookup] pl on pl.siteID = NodeSiteID and pl.skuid = j.skuid
LEFT JOIN [dbo].[OPR_InventoryRules] irp on irp.ID = pl.SkuID and irp.InventoryRulesType = 'Product'
LEFT JOIN [dbo].[OPR_InventoryRules] irs on irs.ID = pl.siteID and irs.InventoryRulesType = 'Store'
WHERE (((((SiteName = N'EcommerceSite') AND (Published = 1)) AND (DocumentCulture = N'en-GB')) AND (NodeAliasPath LIKE N'/Products/Cats/Computers/Computer-servers/%')) AND ((NodeSKUID IS NOT NULL) AND (SKUEnabled = 1) AND pl.PriceLookupID in (select TOP 1 PriceLookupID from OPR_PriceLookup pl2 where pl.skuid = pl2.skuid and (pl2.RoleID = -1 or pl2.RoleId = 13) order by pl2.RoleID desc)))
ORDER BY NodeOrder ASC
Why the order by makes so massive difference and how to optimize it?
The ORDER BY needs to sort the resultset which may take long if it's big.
To optimize it, you may need to index the tables properly.
The index access path, however, has its drawbacks so it can even take longer.
If you have something other than equijoins in your query, or the ranged predicates (like <, > or BETWEEN, or GROUP BY clause), then the index used for ORDER BY may prevent the other indexes from being used.
If you post the query, I'll probably be able to tell you how to optimize it.
Update:
Rewrite the query:
SELECT *
FROM View_Product_Joined j
LEFT JOIN
[dbo].[OPR_InventoryRules] irp
ON irp.ID = j.skuid
AND irp.InventoryRulesType = 'Product'
LEFT JOIN
[dbo].[OPR_InventoryRules] irs
ON irs.ID = j.NodeSiteID
AND irs.InventoryRulesType = 'Store'
CROSS APPLY
(
SELECT TOP 1 *
FROM OPR_PriceLookup pl
WHERE pl.siteID = j.NodeSiteID
AND pl.skuid = j.skuid
AND pl.RoleID IN (-1, 13)
ORDER BY
pl.RoleID desc
) pl
WHERE SiteName = N'EcommerceSite'
AND Published = 1
AND DocumentCulture = N'en-GB'
AND NodeAliasPath LIKE N'/Products/Cats/Computers/Computer-servers/%'
AND NodeSKUID IS NOT NULL
AND SKUEnabled = 1
ORDER BY
NodeOrder ASC
The relation View_Product_Joined, as the name suggests, is probably a view.
Could you please post its definition?
If it is indexable, you may benefit from creating an index on View_Product_Joined (SiteName, Published, DocumentCulture, SKUEnabled, NodeOrder).

How would I do this JOIN in Rails?

Here's my SQL statement:
SELECT *
FROM `message_users`
LEFT JOIN `messages` ON message_users.message_id = messages.id
WHERE (message_users.user_id = 1 AND message_users.hidden = 0) AND message_users.last_read_at > messages.updated_at
ORDER BY messages.updated_at DESC LIMIT 0, 20
How would I pull that off with proper Rails joins/includes/whatever?
You don't typically load all the data from multiple tables into one model in rails. More common is to replace the joins below with include, which will preload the associated model so you hit the cache when calling message.message_users. At any rate, this should duplicate what your sql was doing, as long as there are no column name clashes between messages and messages_users.
If you don't need the data from messages_users after the query is performed, you can remove the select fragment.
Message.find(:all,
:joins=>:message_users,
:select=>"message_users.*, messages.*",
:conditions=>['message_users.user_id = ? and message_users.hidden = ? and message_users.last_read_at > messages.updated_at', 1,0],
:order=>"messages.updated_at desc",
:limit=>20)
There's a good screencast about the difference between joins and include here

LINQ To SQL Paging

I've been using .Skip() and .Take() extension methods with LINQ To SQL for a while now with no problems, but in all the situations I've used them it has always been for a single table - such as:
database.Users.Select(c => c).Skip(10).Take(10);
My problem is that I am now projecting a set of results from multiple tables and I want to page on the overall set (and still get the benefit of paging at the DB).
My entity model looks like this:
A campaign [has many] groups, a group [has many] contacts
this is modelled through a relationship in the database like
Campaign -> CampaignToGroupMapping -> Group -> GroupToContactMapping -> Contact
I need to generate a data structure holding the details of a campaign and also a list of each contact associated to the campaign through the CampaignToGroupMapping, i.e.
Campaign
CampaignName
CampaignFrom
CampaignDate
Recipients
Recipient 1
Recipient 2
Recipient n...
I had tried to write a LINQ query using .SelectMany to project the set of contacts from each group into one linear data set, in the hope I could .Skip() .Take() from that.
My attempt was:
var schedule = (from c in database.Campaigns
where c.ID == highestPriority.CampaignID
select new PieceOfCampaignSchedule
{
ID = c.ID,
UserID = c.UserID,
Name = c.Name,
Recipients = c.CampaignGroupsMappings.SelectMany(d => d.ContactGroup.ContactGroupMappings.Select(e => new ContactData() { /*Contact Data*/ }).Skip(c.TotalSent).Take(totalRequired)).ToList()
}).SingleOrDefault();
The problem is that the paging (with regards to Skip() and Take()) is happening for each group, not the entire data set.
This means if I use the value 200 for the parameter totalRequired (passed to .Take()) and I have 3 groups associated with this campaign, it will take 200 from each group - not 200 from the total data from each group associated with the campaign.
In SQL, I could achieve this with a query such as:
select * from
(
select [t1].EmailAddress, ROW_NUMBER() over(order by CampaignID desc) as [RowNumber] from contacts as [t1]
inner join contactgroupmapping as [t2] on [t1].ID = [t2].ContactID
inner join campaigngroupsmapping as [t3] on [t3].ContactGroupID = [t2].GroupID
where [t3].CampaignID = #HighestPriorityCampaignID
) as [Results] where [Results].[RowNumber] between 500 and 3000
With this query, I'm paging over the combined set of contacts from each group associated with the particular campaign. So my question is, how can I achieve this using LINQ To SQL syntax instead?
To mimic the SQL query you provided you would do this:
var schedule = (from t1 in contacts
join t2 in contactgroupmapping on t1.ID equals t2.GroupID
join t3 in campaigngroupsmapping on t3.ContactGroupID = t2.GroupID
where t3.CampaignID = highestPriority.CampaignID
select new PieceOfCampaignSchedule
{
Email = t1.EmailAddress
}).Skip(500).Take(2500).ToList()
Are you trying to page over campaigns, recipients, or both?
Use a view to aggregate the results from the multiple tables and then use LINQ over the view
I think your attempt is really close; Maybe I'm missing something, but I think you just need to close your SelectMany() before the Skip/Take:
Recipients = c.CampaignGroupsMappings.SelectMany(d => d.ContactGroup.ContactGroupMappings.Select(e => new ContactData() { /*Contact Data*/ })).Skip(c.TotalSent).Take(totalRequired).ToList()
Note: added ")" after "/* Contact Data */ })" and removed ")" from after ".Take(totalRequired)"

Problem with adding custom sql to finder condition

I am trying to add the following custom sql to a finder condition and there is something not quite right.. I am not an sql expert but had this worked out with a friend who is..(yet they are not familiar with rubyonrails or activerecord or finder)
status_search = "select p.*
from policies p
where exists
(select 0 from status_changes sc
where sc.policy_id = p.id
and sc.status_id = '"+search[:status_id].to_s+"'
and sc.created_at between "+status_date_start.to_s+" and "+status_date_end.to_s+")
or exists
(select 0 from status_changes sc
where sc.created_at =
(select max(sc2.created_at)
from status_changes sc2
where sc2.policy_id = p.id
and sc2.created_at < "+status_date_start.to_s+")
and sc.status_id = '"+search[:status_id].to_s+"'
and sc.policy_id = p.id)" unless search[:status_id].blank?
My find statement:
Policy.find(:all,:include=>[{:client=>[:agent,:source_id,:source_code]},{:status_changes=>:status}],
:conditions=>[status_search])
and I am getting this error message in my log:
ActiveRecord::StatementInvalid (Mysql::Error: Operand should contain 1 column(s): SELECT DISTINCT `policies`.id FROM `policies` LEFT OUTER JOIN `clients` ON `clients`.id = `policies`.client_id WHERE ((((policies.created_at BETWEEN '2009-01-01' AND '2009-03-10' OR policies.created_at = '2009-01-01' OR policies.created_at = '2009-03-10')))) AND (select p.*
from policies p
where exists
(select 0 from status_changes sc
where sc.policy_id = p.id
and sc.status_id = '2'
and sc.created_at between 2009-03-10 and 2009-03-10)
or exists
(select 0 from status_changes sc
where sc.created_at =
(select max(sc2.created_at)
from status_changes sc2
where sc2.policy_id = p.id
and sc2.created_at < 2009-03-10)
and sc.status_id = '2'
and sc.policy_id = p.id)) ORDER BY clients.created_at DESC LIMIT 0, 25):
what is the major malfunction here - why is it complaining about the columns?
The conditions modifier is expecting a condition (e.g. a boolean expression that could go in a where clause) and you are passing it an entire query (a select statement).
It looks as if you are trying to do too much in one go here, and should break it down into smaller steps. A few suggestions:
use the query with find_by_sql and don't mess with the conditions.
use the rails finders and filter the records in the rails code
Also, note that constructing a query this way isn't secure if the values like status_date_start can come from users. Look up "sql injection attacks" to see what the problem is, and read the rails documentation & examples for find_by_sql to see how to avoid them.
Ok, I've managed to retool this so it is more friendly to a conditions modifier and I think it is doing the sql query correctly.. however, it is returning policies that when I try to list the current status (the policy.status_change.last.status) it is set to the same status used in the query - which is not correct
here is my updated condition string..
status_search = "status_changes.created_at between ? and ? and status_changes.status_id = ?) or
(status_changes.created_at = (SELECT MAX(sc2.created_at) FROM status_changes sc2
WHERE sc2.policy_id = policies.id and sc2.created_at < ?) and status_changes.status_id = ?"
is there something obvious to this that is not returning all of the remaining associated status changes once it finds the one in the query?
here is the updated find..
Policy.find(:all,:include=>[{:client=>[:agent,:source_id,:source_code]},:status_changes],
:conditions=>[status_search,status_date_start,status_date_end,search[:status_id].to_s,status_date_start,search[:status_id].to_s])