Complex SQL Query in Rails 4 - sql

I have a complicated query I need for a scope in my Rails app and have tried a lot of things with no luck. I've resorted to raw SQL via find_by_sql but wondering if any gurus wanted to take a shot. I will simplify the verbiage a bit for clarity, but the problem should be stated accurately.
I have Users. Users own many Records. One of them is marked current (#is_current = true) and the rest are not. Each Record has many Relationships. Relationships have a value for when they were active (active_when) which takes one of four values, [1..4].
Values 1 and 2 are considered recent. Values 3 and 4 are not.
The problem was ultimately to have scopes (has_recent_relationships and has_no_recent_relationships) on User that filter on whether or not they have recent Relationships on the current Record. (Old Records are irrelevant for this.) I tried creating recent and not_recent scopes on Relationship, and then building the scopes on Record, combining with a check for is_current == 1. Here is where I failed. I have to move on with the app, so I have no choice but to use raw SQL for now, hoping to revisit this later. I put that on User, the only context where I really need it, and set aside the code for the scopes on the other objects.
The SQL that works, that correctly finds the Users who have recent relationships, is below. The other just uses "= 0" instead of "> 0" in the HAVING clause.
SELECT * FROM users WHERE `users`.`id` IN (
  SELECT
    records.owner_id
  FROM `records`
  LEFT OUTER JOIN `relationships` ON `relationships`.`record_id` = `records`.`id`
  WHERE `records`.`is_current` = 1
  HAVING (
    SELECT count(*)
    FROM relationships
    WHERE ((record_id = records.id) AND ((active_when = 1) OR (active_when = 2)))
  ) > 0
)
My instincts tell me this is complicated enough that my modeling probably could be redesigned and simplified, but the individual objects are pretty simple, just getting at this specific data from two objects away has become complicated.
Anyway, I'd appreciate any thoughts. I'm not expecting a full solution because, ick. Just thought the masochists among you might find this amusing.

Have you tried using Arel directly and this website?
Just copy-and-pasting your query you get this:
User.select(Arel.star).where(
  User.arel_table[:id].in(
    Relationship.select(Arel.star.count).where(
      Arel::Nodes::Group.new(
        Relationship.arel_table[:record_id].eq(Record.arel_table[:id]).and(
          Relationship.arel_table[:active_when].eq(1).or(Relationship.arel_table[:active_when].eq(2))
        )
      )
    ).joins(
      Record.arel_table.join(Relationship.arel_table, Arel::Nodes::OuterJoin).on(
        Relationship.arel_table[:record_id].eq(Record.arel_table[:id])
      ).join_sources
    ).ast
  )
)

I managed to find a way to create what I needed, which returns ActiveRecord::Relation objects and so simplifies a lot of other code. Here's what I came up with. This might not scale well, but this app will probably not end up with so much data that it will be a problem.
I created two scope methods. The second depends on the first to simplify things:
def self.has_recent_relationships
  joins(records_owned: :relationships)
    .merge(Record.current)
    .where("(active_when = 1) OR (active_when = 2)")
    .distinct
end

def self.has_no_recent_relationships
  users_with_recent_relationships = User.has_recent_relationships.pluck(:id)
  if users_with_recent_relationships.length == 0
    User.all
  else
    User.where("id not in (?)", users_with_recent_relationships.to_a)
  end
end
The first finds Users with recent relationships by just joining Record, merging with a scope that selects current records (should be only one), and looks for the correct active_when values. Easy enough.
The second method finds Users who DO have recent relationships (using the first method.) If there are none, then all Users are in the set of those with no recent relationships, and I return User.all (this will really never happen in the wild, but in theory it could.) Otherwise I return the inverse of those who do have recent relationships, using the SQL keywords NOT IN and an array. It's this part that could be non-performant if the array gets large, but I'm going with it for the moment.
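The pluck-then-NOT IN step can usually be pushed into the database as one query, which also removes the special case for an empty list. Below is a minimal sketch of the underlying SQL, run through Python's sqlite3 module only so that it is self-contained. The table and column names follow the question (users, records.owner_id, records.is_current, relationships.record_id, relationships.active_when); the sample rows and the user_ids helper are made up for illustration.

```python
import sqlite3

# Hypothetical in-memory schema mirroring the tables described in the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE records (id INTEGER PRIMARY KEY, owner_id INTEGER, is_current INTEGER);
CREATE TABLE relationships (id INTEGER PRIMARY KEY, record_id INTEGER, active_when INTEGER);

INSERT INTO users VALUES
  (1, 'recent'), (2, 'old only'), (3, 'no relationships'), (4, 'recent on old record');
INSERT INTO records VALUES (10, 1, 1), (20, 2, 1), (30, 3, 1), (40, 4, 0), (41, 4, 1);
INSERT INTO relationships VALUES
  (100, 10, 2), (101, 10, 4), (200, 20, 3), (400, 40, 1), (410, 41, 4);
""")

# True when the user's current record has at least one recent relationship.
RECENT = """
EXISTS (
  SELECT 1
  FROM records
  JOIN relationships ON relationships.record_id = records.id
  WHERE records.owner_id = users.id
    AND records.is_current = 1
    AND relationships.active_when IN (1, 2)
)
"""

def user_ids(sql):
    """Run a query and return the first column of every row."""
    return [row[0] for row in conn.execute(sql)]

has_recent = user_ids(f"SELECT id FROM users WHERE {RECENT} ORDER BY id")
has_no_recent = user_ids(f"SELECT id FROM users WHERE NOT {RECENT} ORDER BY id")
```

In Rails the same shape could probably be reached with something along the lines of User.where.not(id: User.has_recent_relationships.select("users.id")), which keeps the id list as a subquery instead of an array. I have not tested that against this schema.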

Related

Having vs. Where in SQL, using the ORM in Laravel

I think my question is more related to SQL than to Laravel or its ORM, but I'm having the problem while programming in Laravel, so that's why I tagged it in the question.
My problem is as follows, I have the following model (sorry for the Spanglish):
I have the users table, nothing special here,
Then the juegos (games) table, which has a jornada column (it's like the week, to know which games are played in a certain week)
And finally the pronosticos (who the user says will win, which is stored in the diferencia column)
So I want to make a form where the user can make his bet. Basically this form will take its data from the pronosticos table, like this:
$juegos = Juego::where('jornada', $jor)
    ->orderBy('expira')
    ->get();
This produces what I want, a collection of models that I can iterate to show all the games for a given jornada (week).
Now, if the user has already made a bet, I also want to bring back the score values the user is betting on, in the same query, so I thought I could use something like:
$juegos = Juego::where('jornada', $jor)
    ->leftJoin('pronosticos', 'juegos.id', '=', 'pronosticos.juego_id')
    ->addSelect(['pronosticos.user_id', 'juegos.id', 'expira', 'visitante', 'local', 'diferencia'])
    ->having('pronosticos.user_id', $uid)
    ->orderBy('expira')
    ->get();
Now, the problem is that it returns an empty set, and the reason is quite obvious: if the user has made his bet, it works, but if he hasn't, the having filters out everything, giving the empty set.
So I think I'm not getting clearly how to make the having or where to work correctly. Maybe what I want is to do a leftJoin not with the pronosticos table, but from the pronosticos table already filtered with a where clause.
Maybe I'm doing everything wrong and should do the leftJoin to a subselect? If that's so, I have no idea how to do it.
Or maybe my expectations are outside what can be done to SQL and I may return two different sets, and process them in the app?
EDIT
This is the query I want to express in Laravel's ORM:
SELECT * from juegos
LEFT JOIN (SELECT * FROM pronosticos WHERE user_id=1) AS p
ON p.juego_id = juegos.id
WHERE jornada = 2 ORDER BY expira
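One way to get the effect of joining against the pre-filtered subselect is to put the user filter in the ON clause of the LEFT JOIN instead of in WHERE or HAVING. Conditions in ON only decide which pronosticos rows match, so games without a bet still come back with NULLs. A small sketch using Python's sqlite3 module so it runs on its own. Table and column names come from the question; the sample rows and the juegos_with_bets function are hypothetical.

```python
import sqlite3

# Hypothetical sample data: jornada 2 has two games, user 1 has bet on only one.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE juegos (id INTEGER PRIMARY KEY, jornada INTEGER, expira TEXT);
CREATE TABLE pronosticos (
  id INTEGER PRIMARY KEY, juego_id INTEGER, user_id INTEGER, diferencia INTEGER
);

INSERT INTO juegos VALUES (1, 2, '2015-01-10'), (2, 2, '2015-01-11'), (3, 3, '2015-01-17');
INSERT INTO pronosticos VALUES (1, 1, 1, 3), (2, 1, 2, -1), (3, 3, 1, 7);
""")

def juegos_with_bets(jornada, user_id):
    """All games of a jornada, with the user's diferencia or None when no bet exists."""
    sql = """
    SELECT juegos.id, p.diferencia
    FROM juegos
    LEFT JOIN pronosticos AS p
      ON p.juego_id = juegos.id AND p.user_id = ?
    WHERE juegos.jornada = ?
    ORDER BY juegos.expira
    """
    return conn.execute(sql, (user_id, jornada)).fetchall()

rows_user_1 = juegos_with_bets(2, 1)  # user with one bet in this jornada
rows_user_9 = juegos_with_bets(2, 9)  # user with no bets at all
```

In Laravel's query builder, extra join conditions like this are normally attached by passing a closure to leftJoin and chaining on(...) and where(...) on the join object. Check the docs for your version for the exact form.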

How to simulate ActiveRecord Model.count.to_sql

I want to display the SQL used in a count. However, Model.count.to_sql will not work because count returns a Fixnum, which doesn't have a to_sql method. I think the simplest solution is to do this:
Model.where(nil).to_sql.sub(/SELECT.*FROM/, "SELECT COUNT(*) FROM")
This creates the same SQL as is used in Model.count, but is it going to cause a problem further down the line? For example, if I add a complicated where clause and some joins.
Is there a better way of doing this?
You can try
Model.select("count(*) as model_count").to_sql
You may want to dip into Arel:
Model.select(Arel.star.count).to_sql
ASIDE:
I find I often want to find sub counts, so I embed the count(*) into another query:
child_counts = ChildModel.select(Arel.star.count)
.where(Model.arel_attribute(:id).eq(
ChildModel.arel_attribute(:model_id)))
Model.select(Arel.star).select(child_counts.as("child_count"))
.order(:id).limit(10).to_sql
which then gives you all the child counts for each of the models:
SELECT *,
(
SELECT COUNT(*)
FROM "child_models"
WHERE "models"."id" = "child_models"."model_id"
) child_count
FROM "models"
ORDER BY "models"."id" ASC
LIMIT 10
Best of luck
UPDATE:
Not sure if you are trying to solve this in a generic way or not. Also not sure what kind of scopes you are using on your Model.
We do have a method that automatically calls a count for a query that is put into the ui layer. I found using count(:all) is more stable than the simple count, but sounds like that does not overlap your use case. Maybe you can improve your solution using the except clause that we use:
scope.except(:select, :includes, :references, :offset, :limit, :order)
.count(:all)
The where clause and the joins necessary for the where clause work just fine for us. We tend to want to keep the joins and where clause since they need to be part of the count. You definitely want to remove the includes (which should be removed by Rails automatically, in my opinion), but the references are much trickier, especially when one references a has_many and requires a distinct, and that starts to throw a wrench in there. If you need to use references, you may be able to convert these over to a left_join.
You may want to double check the parameters that these "join" methods take. Some of them take table names and others take relation names. Later Rails versions have gotten better and take relation names. Be sure you are looking at the docs for the right version of Rails.
Also, in our case, we spend more time trying to get sub selects with more complicated relationships, we have to do some munging. Looks like we are not dealing with where clauses as much.

SQL, to loop or not to loop?

the problem story goes like:
consider a program to manage bank accounts with balance limits for each customer
{table Customers, table Limits} where for each Customer.id there's one Limit record
then the client said to store a history for the limits' changes, it's not a problem since I've already had date column for Limit but the active/latest limits's view-query needs to be changed
before: Customer-Limit was 1 to 1 so a simple select did the job
now: it would show all the Limits' records which means multiple records for each Customers and I need the latest Limits only so I thought of something like this pseudo code
foreach( id in Customers)
{
select top 1 *
from Limits
where Limits.customer_id = id
order by Limits.date desc
}
but while looking through SO for similar issues, I came across stuff like
"95% of the time when you need a looping structure in tSQL you are probably doing it wrong"-JohnFx
and
"SQL is primarily a set-orientated language - it's generally a bad idea to use a loop in it."-Mark Bannister
can anyone confirm/explain why is it wrong to loop? and in the explained problem above, what am I getting wrong that I need to loop?
thanks in advance
update : my solution
in light of TomTom's answer & suggested link here and before Dean kindly answered with code I came up with this
SELECT *
FROM Customers c
LEFT JOIN Limits a ON a.customer_id = c.id
AND a.date =
(
SELECT MAX(date)
FROM Limits z
WHERE z.customer_id = a.customer_id
)
thought I'd share :>
thanks for your response,
happy coding
Will this do?
;with l as (
    select *, row_number() over(partition by customer_id order by date desc) as rn
    from limits
)
select *
from customers c
left join l on c.id = l.customer_id and l.rn = 1
I am assuming that earlier (i.e. before implementing the history functionality) you were updating the Limits table in place. Now, to implement the history functionality, you have started inserting new records. Doesn't this trigger a lot of changes in your database and code?
Instead of inserting new records, how about keeping the original functionality as is and creating a new table say Limits_History which will store all the old values from Limits table before updating it? Then all you need to do is fetch records from this table if you want to show history. This will not cause any changes in your existing SPs and code hence will be less error prone.
To insert record in the Limits_History table, you can simply create an AFTER TRIGGER and use the deleted magic table. Hence you need not worry about calling an SP or something to maintain history. The trigger will do this for you. Good examples of trigger are here
Hope this helps
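To make the trigger idea concrete, here is a rough sketch using SQLite through Python's sqlite3 module, chosen only so it runs without a server. Note the dialect difference: SQL Server exposes the pre-update rows through the deleted table, while SQLite exposes them as OLD inside the trigger body. The column names (amount, date) and the trigger name limits_keep_history are assumptions, not taken from the question.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One current row per customer, as in the original 1-to-1 design.
CREATE TABLE limits (customer_id INTEGER PRIMARY KEY, amount INTEGER, date TEXT);
-- Old values are copied here and never updated.
CREATE TABLE limits_history (customer_id INTEGER, amount INTEGER, date TEXT);

-- Hypothetical trigger: save the previous values every time a limit is updated.
CREATE TRIGGER limits_keep_history
AFTER UPDATE ON limits
BEGIN
  INSERT INTO limits_history (customer_id, amount, date)
  VALUES (OLD.customer_id, OLD.amount, OLD.date);
END;

INSERT INTO limits VALUES (1, 500, '2012-01-01');
""")

# Existing code keeps doing plain UPDATEs; the trigger maintains the history.
conn.execute("UPDATE limits SET amount = 800, date = '2012-06-01' WHERE customer_id = 1")
conn.execute("UPDATE limits SET amount = 900, date = '2012-09-01' WHERE customer_id = 1")

current = conn.execute("SELECT amount FROM limits WHERE customer_id = 1").fetchone()
history = conn.execute("SELECT amount, date FROM limits_history ORDER BY date").fetchall()
```

The view of current limits stays a plain 1-to-1 join, and history is only read when someone asks for it.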
It is wrong. You can do the same by querying customers and limits with a subquery limiting to the most recent record on limit.
This is similar in concept to the query presented in Most recent record in a left join
You may have to do so in 2 joins - get most recent date, then get limit for the date. While this may look complex - it is a beginner issue, talk complex when you have sql statements reaching 2 printed pages and more ;)
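The two-join version could look roughly like the sketch below: first a derived table with the latest date per customer, then a join back to Limits to fetch the row for that date. It runs on SQLite via Python's sqlite3 module purely for convenience. The amount and name columns and the sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE limits (id INTEGER PRIMARY KEY, customer_id INTEGER, amount INTEGER, date TEXT);

INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO limits (customer_id, amount, date) VALUES
  (1, 100, '2012-01-01'), (1, 250, '2012-03-01'),
  (2, 400, '2012-02-01');
""")

# Join 1: most recent date per customer. Join 2: the limit row for that date.
LATEST_LIMITS = """
SELECT c.id, c.name, l.amount, l.date
FROM customers c
LEFT JOIN (
  SELECT customer_id, MAX(date) AS max_date
  FROM limits
  GROUP BY customer_id
) latest ON latest.customer_id = c.id
LEFT JOIN limits l
  ON l.customer_id = latest.customer_id AND l.date = latest.max_date
ORDER BY c.id
"""

rows = conn.execute(LATEST_LIMITS).fetchall()
```

Customers without any limit still appear, with NULLs. If two limits for one customer share the same date, both rows come back, so a tie-breaker would be needed in that case.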
Now, for an operational system the table design is broken - limits should contain the most recent limit, and a LimitHistory table the historical (or: all) entries, allowing fast retrieval of the CURRENT limit (which will be the one to apply to all transactions) without the overhead of the history. The table design you have assumes all limits are identical - that may be the truth (is the truth) for a reporting data warehouse, but is wrong for a transactional system as the history is not transacted.
Confirmation for why loop is wrong is exactly in the quoted parts in your question - SQL is a set-orientated language.
This means when you work on sets there's no reason to loop through the single rows, because you already have the 'result' (set) of data you want to work on.
Then the work you are doing should be done on the set of rows, because otherwise your selection is wrong.
That being said, there are of course situations where looping is done in SQL, and it will generally be done via cursors if on data, or via a while loop if calculating stuff (generally; there are always exceptions).
However, as also mentioned in the quotes, often when you feel like using a loop you either shouldn't (it performs poorly) or you're doing logic in the wrong part of your application.
Basically - it is similar to how object orientated languages work on objects and references to said objects. Set based languages work on - well, sets of data.
SQL is basically made to function in that manner - query relational data into result sets - so when working with the language, you should let it do what it can do and work on that. Just as if it was Java or any other language.

Rails/Active Record .save! efficiency question

New to rails/ruby (using rails 3 and ruby 1.9.2), and am trying to get rid of some unnecessary queries being executed.
When I'm running an each do:
apples.to_a.each do |apple|
new_apple = apple.clone
new_apple.save!
end
and I check the sql LOG, I see three select statements followed by one insert statement. The select statements seem completely unnecessary. For example, they're something like:
SELECT Fruit.* from Fruits where Fruit.ID = 5 LIMIT 1;
SELECT Color.* from Colors where Color.ID = 6 LIMIT 1;
SELECT TreeType.* from TreeTypes where TreeType.ID = 7 LIMIT 1;
INSERT into Apples (Fruit_id, color_id, treetype_id) values (6, 7, 8) RETURNING "id";
Seemingly, this wouldn't take much time, but when I've got 70k inserts to run, I'm betting those three selects for each insert will take up a decent amount of time.
So I'm wondering the following:
Is this typical of ActiveRecord/Rails .save! method, or did the previous developer add some sort of custom code?
Would those three select statements, being executed for each item, cause a noticeable amount of extra time?
If it is built into rails/active record, would it be easily bypassed, if that would make it run more efficiently?
You must be validating your associations on save for such a thing to occur, something like this:
class Apple < ActiveRecord::Base
  validates :fruit,
    :presence => true
end
In order to validate the relationship, the associated record must be loaded, and this needs to happen for each validation individually, for each record in turn. That's the standard behavior of save!
You could save without validations if you feel like living dangerously:
apples.to_a.each do |apple|
  new_apple = apple.clone
  new_apple.save(:validate => false)
end
The better approach is to manipulate the records directly in SQL by doing a mass insert if your RDBMS supports it. For instance, MySQL will let you insert thousands of rows with one INSERT call. You can usually do this by making use of the Apple.connection access layer which allows you to make arbitrary SQL calls with things like execute
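As a rough illustration of the mass-insert idea, the sketch below packs many rows into each INSERT statement instead of issuing one statement per row. It uses Python's sqlite3 module so it can run standalone. The bulk_insert function, the batch size of 100 and the sample values are assumptions for the example, and the batch size should stay within whatever parameter limit your database imposes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE apples ("
    "id INTEGER PRIMARY KEY, fruit_id INTEGER, color_id INTEGER, treetype_id INTEGER)"
)

def bulk_insert(rows, batch_size=100):
    """Insert rows in batches, one multi-row INSERT statement per batch."""
    statements = 0
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        # One "(?, ?, ?)" group per row, with all values flattened into one list.
        placeholders = ", ".join(["(?, ?, ?)"] * len(batch))
        params = [value for row in batch for value in row]
        conn.execute(
            "INSERT INTO apples (fruit_id, color_id, treetype_id) VALUES " + placeholders,
            params,
        )
        statements += 1
    conn.commit()
    return statements

rows = [(5, 6, 7)] * 250
statement_count = bulk_insert(rows)
inserted = conn.execute("SELECT COUNT(*) FROM apples").fetchone()[0]
```

Bear in mind that going around the model this way skips validations and callbacks entirely, so it only makes sense when the source rows are already known to be valid.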
I'm guessing that there is a before_save EDIT: (or a validation as suggested above) method being called that is looking up the color and type of the fruit and storing that with the rest of the attributes when the fruit is saved - in which case these lookups are necessary ...
Normally I wouldn't expect activerecord to do unnecessary lookups - though that does not mean it is always efficient ...

Django - finding the extreme member of each group

I've been playing around with the new aggregation functionality in the Django ORM, and there's a class of problem I think should be possible, but I can't seem to get it to work. The type of query I'm trying to generate is described here.
So, let's say I have the following models -
class ContactGroup(models.Model):
    .... whatever ....

class Contact(models.Model):
    group = models.ForeignKey(ContactGroup)
    name = models.CharField(max_length=20)
    email = models.EmailField()
    ...

class Record(models.Model):
    contact = models.ForeignKey(Contact)
    group = models.ForeignKey(ContactGroup)
    record_date = models.DateTimeField(default=datetime.datetime.now)
    ... name, email, and other fields that are in Contact ...
So, each time a Contact is created or modified, a new Record is created that saves the information as it appears in the contact at that time, along with a timestamp. Now, I want a query that, for example, returns the most recent Record instance for every Contact associated to a ContactGroup. In pseudo-code:
group = ContactGroup.objects.get(...)
records_i_want = group.record_set.most_recent_record_for_every_contact()
Once I get this figured out, I just want to be able to throw a filter(record_date__lt=some_date) on the queryset, and get the information as it existed at some_date.
Anybody have any ideas?
edit: It seems I'm not really making myself clear. Using models like these, I want a way to do the following with pure django ORM (no extra()):
ContactGroup.record_set.extra(where=["history_date = (select max(history_date) from app_record r where r.id=app_record.id and r.history_date <= '2009-07-18')"])
Putting the subquery in the where clause is only one strategy for solving this problem, the others are pretty well covered by the first link I gave above. I know where-clause subselects are not possible without using extra(), but I thought perhaps one of the other ways was made possible by the new aggregation features.
It sounds like you want to keep records of changes to objects in Django.
Pro Django has a section in chapter 11 (Enhancing Applications) in which the author shows how to create a model that uses another model as a client that it tracks for inserts/deletes/updates. The model is generated dynamically from the client definition and relies on signals. The code shows a most_recent() function, but you could adapt this to obtain the object state on a particular date.
I assume it is the tracking in Django that is problematic, not the SQL to obtain this, right?
First of all, I'll point out that:
ContactGroup.record_set.extra(where=["history_date = (select max(history_date) from app_record r where r.id=app_record.id and r.history_date <= '2009-07-18')"])
will not get you the same effect as:
records_i_want = group.record_set.most_recent_record_for_every_contact()
The first query returns every record associated with a particular group (or associated with any of the contacts of a particular group) that has a record_date less than the date/ time specified in the extra. Run this on the shell and then do this to review the query django created:
from django.db import connection
connection.queries[-1]
which reveals:
'SELECT "contacts_record"."id", "contacts_record"."contact_id", "contacts_record"."group_id", "contacts_record"."record_date", "contacts_record"."name", "contacts_record"."email" FROM "contacts_record" WHERE "contacts_record"."group_id" = 1 AND record_date = (select max(record_date) from contacts_record r where r.id=contacts_record.id and r.record_date <= \'2009-07-18\')'
Not exactly what you want, right?
Now the aggregation feature is used to retrieve aggregated data and not objects associated with aggregated data. So if you're trying to minimize number of queries executed using aggregation when trying to obtain group.record_set.most_recent_record_for_every_contact() you won't succeed.
Without using aggregation, you can get the most recent record for all contacts associated with a group using:
[x.record_set.all().order_by('-record_date')[0] for x in group.contact_set.all()]
Using aggregation, the closest I could get to that was:
group.record_set.values('contact').annotate(latest_date=Max('record_date'))
The latter returns a list of dictionaries like:
[{'contact': 1, 'latest_date': somedate }, {'contact': 2, 'latest_date': somedate }]
So there is one entry for each contact in a given group, with the latest record date associated with it.
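For reference, the aggregation above corresponds roughly to a GROUP BY on the contact column. The sketch below reproduces that shape directly in SQL through Python's sqlite3 module, with a date cutoff added in the spirit of the filter(record_date__lt=some_date) mentioned in the question. The table name contacts_record is taken from the logged query; the sample rows and the latest_dates function are made up, and the exact SQL Django emits may differ.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contacts_record (
  id INTEGER PRIMARY KEY, contact_id INTEGER, group_id INTEGER,
  record_date TEXT, name TEXT
);
INSERT INTO contacts_record (contact_id, group_id, record_date, name) VALUES
  (1, 1, '2009-07-01', 'Al'), (1, 1, '2009-07-20', 'Alan'),
  (2, 1, '2009-07-10', 'Bea'), (3, 2, '2009-07-05', 'Cal');
""")

def latest_dates(group_id, before):
    """Latest record_date per contact in a group, ignoring records on or after `before`."""
    sql = """
    SELECT contact_id, MAX(record_date) AS latest_date
    FROM contacts_record
    WHERE group_id = ? AND record_date < ?
    GROUP BY contact_id
    ORDER BY contact_id
    """
    return [
        {"contact": contact, "latest_date": date}
        for contact, date in conn.execute(sql, (group_id, before))
    ]

result = latest_dates(1, "2009-07-18")
```

This only yields the (contact, date) pairs, which is the limitation described above: fetching the full Record rows still needs a join back on those pairs or further queries.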
Anyway, the minimum query number is probably 1 + # of contacts in a group. If you are interested obtaining the result using a single query, that is also possible, but you'll have to construct your models in a different way. But that's a totally different aspect of your problem.
I hope this will help you understand how to approach the problem using aggregation/ the regular ORM functions.