How to select each model which has the maximum value of an attribute for any given value of another attribute? - sql

I have a Work model with a video_id, a user_id and some other simple fields. I need to display the last 12 works on the page, but only take 1 per user. Currently I'm trying to do it like this:
def self.latest_works_one_per_user(video_id=nil)
  scope = self.includes(:user, :video)
  scope = video_id ? scope.where(video_id: video_id) : scope.where.not(video_id: nil)
  scope = scope.order(created_at: :desc)
  # two separate arrays; `user_ids = works = []` would alias a single array
  user_ids, works = [], []
  scope.each do |work|
    next if user_ids.include? work.user_id
    user_ids << work.user_id
    works << work
    break if works.size == 12
  end
  works
end
But I'm damn sure there is a more elegant and faster way of doing it especially when the number of works gets bigger.

Here's a solution that should work for any SQL database with minimal adjustment. Whether one thinks it's elegant or not depends on how much you enjoy SQL.
def self.latest_works_one_per_user(video_id=nil)
  scope = includes(:user, :video)
  scope = video_id ? scope.where(video_id: video_id) : scope.where.not(video_id: nil)
  scope.
    joins("join (select user_id, max(created_at) created_at
           from works group by user_id) most_recent
           on works.user_id = most_recent.user_id and
           works.created_at = most_recent.created_at").
    order(created_at: :desc).limit(12)
end
It only works if the combination of user_id and created_at is unique, however. If that combination isn't unique, the same user can appear more than once among the 12 rows.
It can be done more simply in MySQL. The MySQL solution doesn't work in Postgres, and I don't know a better solution in Postgres, although I'm sure there is one.
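The join against a grouped subquery can be tried outside Rails. The following is a minimal sketch using Python's sqlite3 module, assuming a simplified works table with only id, user_id and created_at; the helper name latest_work_per_user and the sample rows are made up for illustration.

```python
import sqlite3


def latest_work_per_user(conn, limit=12):
    """Return (id, user_id, created_at) for each user's newest work, newest first."""
    sql = """
        SELECT w.id, w.user_id, w.created_at
        FROM works AS w
        JOIN (SELECT user_id, MAX(created_at) AS created_at
              FROM works
              GROUP BY user_id) AS most_recent
          ON w.user_id = most_recent.user_id
         AND w.created_at = most_recent.created_at
        ORDER BY w.created_at DESC
        LIMIT ?
    """
    return conn.execute(sql, (limit,)).fetchall()


# Hypothetical sample data: three users, five works.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE works (id INTEGER PRIMARY KEY, user_id INTEGER, created_at TEXT)"
)
conn.executemany(
    "INSERT INTO works (id, user_id, created_at) VALUES (?, ?, ?)",
    [
        (1, 1, "2015-01-01 10:00:00"),
        (2, 1, "2015-01-03 10:00:00"),
        (3, 2, "2015-01-02 10:00:00"),
        (4, 3, "2015-01-04 10:00:00"),
        (5, 3, "2015-01-01 09:00:00"),
    ],
)

latest = latest_work_per_user(conn)
for row in latest:
    print(row)
```

Each user contributes exactly one row here because no user has two works with the same created_at.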

Related

sqlalchemy: paginate does not return the expected number of elements

I am using flask-sqlalchemy together with a sqlite database. I am trying to get all votes cast before date1:
sub_query = models.VoteList.query.filter(models.VoteList.vote_datetime < date1)
sub_query = sub_query.filter(models.VoteList.group_id == selected_group.id)
sub_query = sub_query.filter(models.VoteList.user_id == g.user.id)
sub_query = sub_query.subquery()
old_votes = models.Papers.query.join(sub_query, sub_query.c.arxiv_id == models.Papers.arxiv_id).paginate(1, 4, False)
where the database model for VoteList looks like this
class VoteList(db.Model):
    id = db.Column(db.Integer, primary_key=True)
    user_id = db.Column(db.Integer, db.ForeignKey('user.id'))
    group_id = db.Column(db.Integer, db.ForeignKey('groups.id'))
    arxiv_id = db.Column(db.String(1000), db.ForeignKey('papers.arxiv_id'))
    vote_datetime = db.Column(db.DateTime)
    group = db.relationship("Groups", backref=db.backref('vote_list', lazy='dynamic'))
    user = db.relationship("User", backref=db.backref('votes', lazy='dynamic'), foreign_keys=[user_id])
    def __repr__(self):
        return '<VoteList %r>' % (self.id)
I made sure that the 'old_votes' selection above has 20 elements: if I use .all() instead of .paginate(), I get the expected 20 results.
Since I used a max results value of 4 in the example above, I would expect old_votes.items to have 4 elements, but it has only 2. If I increase the max results value, the number of elements also increases, but it always stays below the max results value. Is paginate messing something up here?
any ideas?
thanks
carl
EDIT
I noticed that it works fine if I apply the paginate() function on add_columns(). So if I add (for no good reason) a column with
old_votes = models.Papers.query.join(sub_query, sub_query.c.arxiv_id == models.Papers.arxiv_id)
old_votes = old_votes.add_columns(sub_query.c.vote_datetime).paginate(page, VOTES_PER_PAGE, False)
it works fine? But since I don't need that column it would still be interesting to know what goes wrong with my example above?
It looks to me as if the 4 rows returned (and filtered) by the query represent 4 different rows of the VoteList table, but they refer to only 2 different Papers models. When model instances are created, duplicates are filtered out, and therefore you get fewer rows. When you add a column from a subquery, the results are tuples of (Papers, vote_datetime), and in this case no duplicates are removed.
I encountered the same issue and I applied van's answer but it did not work. However I agree with van's explanation so I added .distinct() to the query like this:
old_votes = models.Papers.query.distinct().join(sub_query, sub_query.c.arxiv_id == models.Papers.arxiv_id).paginate(1, 4, False)
It worked as I expected.
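The effect can be reproduced without SQLAlchemy. Below is a small sketch with plain sqlite3, assuming hypothetical papers and vote_list tables reduced to the columns that matter; the de-duplication step only mimics what the explanation above describes and is not SQLAlchemy's actual code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE papers (arxiv_id TEXT PRIMARY KEY);
    CREATE TABLE vote_list (id INTEGER PRIMARY KEY, arxiv_id TEXT);
    INSERT INTO papers VALUES ('p1'), ('p2'), ('p3');
    INSERT INTO vote_list (id, arxiv_id) VALUES
        (1, 'p1'), (2, 'p1'), (3, 'p1'), (4, 'p2'), (5, 'p3');
""")

page_size = 4

# LIMIT is applied to the joined rows, one row per vote.
rows_on_page = [r[0] for r in conn.execute(
    """SELECT p.arxiv_id FROM papers p
       JOIN vote_list v ON v.arxiv_id = p.arxiv_id
       ORDER BY v.id LIMIT ?""", (page_size,))]

# Collapsing repeated papers afterwards leaves fewer items than page_size.
unique_on_page = list(dict.fromkeys(rows_on_page))

# With DISTINCT the database removes duplicates before LIMIT is applied.
distinct_on_page = [r[0] for r in conn.execute(
    """SELECT DISTINCT p.arxiv_id FROM papers p
       JOIN vote_list v ON v.arxiv_id = p.arxiv_id
       ORDER BY p.arxiv_id LIMIT ?""", (page_size,))]

print(rows_on_page, unique_on_page, distinct_on_page)
```

The page is cut from four joined rows that cover only two papers, which is why the item count stays below the page size until DISTINCT is added.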

Filtering model with HABTM relationship

I have 2 models - Restaurant and Feature. They are connected via a has_and_belongs_to_many relationship. The gist of it is that you have restaurants with many features like delivery, pizza, sandwiches, salad bar, vegetarian option,… So now when the user wants to filter the restaurants and, let's say, he checks pizza and delivery, I want to display all the restaurants that have both features; pizza, delivery and maybe some more, but it HAS TO HAVE pizza AND delivery.
If I do a simple .where('features IN (?)', params[:features]) I (of course) get the restaurants that have either one (pizza, or delivery, or both), which is not at all what I want.
My SQL/Rails knowledge is kinda limited since I'm new to this but I asked a friend and now I have this huuuge SQL that gets the job done:
Restaurant.find_by_sql(['SELECT restaurant_id FROM (
SELECT features_restaurants.*, ROW_NUMBER() OVER(PARTITION BY restaurants.id ORDER BY features.id) AS rn FROM restaurants
JOIN features_restaurants ON restaurants.id = features_restaurants.restaurant_id
JOIN features ON features_restaurants.feature_id = features.id
WHERE features.id in (?)
) t
WHERE rn = ?', params[:features], params[:features].count])
So my question is: is there a better - more Rails even - way of doing this? How would you do it?
Oh BTW I'm using Rails 4 on Heroku so it's a Postgres DB.
This is an example of a set-within-sets query. I advocate solving these with group by and having, because this provides a general framework.
Here is how this works in your case:
select fr.restaurant_id
from features_restaurants fr join
features f
on fr.feature_id = f.feature_id
group by fr.restaurant_id
having sum(case when f.feature_name = 'pizza' then 1 else 0 end) > 0 and
sum(case when f.feature_name = 'delivery' then 1 else 0 end) > 0
Each condition in the having clause is counting for the presence of one of the features -- "pizza" and "delivery". If both features are present, then you get the restaurant_id.
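To see the having clause at work, here is a minimal sqlite3 sketch. It assumes the column names used in this answer (feature_id, feature_name) and three made-up restaurants; only restaurant 10 has both features.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE features (feature_id INTEGER PRIMARY KEY, feature_name TEXT);
    CREATE TABLE features_restaurants (restaurant_id INTEGER, feature_id INTEGER);
    INSERT INTO features VALUES (1, 'pizza'), (2, 'delivery'), (3, 'salad bar');
    INSERT INTO features_restaurants VALUES
        (10, 1), (10, 2), (10, 3),   -- restaurant 10: pizza, delivery, salad bar
        (20, 1),                     -- restaurant 20: pizza only
        (30, 2), (30, 3);            -- restaurant 30: delivery, salad bar
""")

# One conditional sum per required feature; each must be positive.
both = [r[0] for r in conn.execute("""
    SELECT fr.restaurant_id
    FROM features_restaurants fr
    JOIN features f ON fr.feature_id = f.feature_id
    GROUP BY fr.restaurant_id
    HAVING SUM(CASE WHEN f.feature_name = 'pizza' THEN 1 ELSE 0 END) > 0
       AND SUM(CASE WHEN f.feature_name = 'delivery' THEN 1 ELSE 0 END) > 0
    ORDER BY fr.restaurant_id
""")]

print(both)
```

Changing a condition to "= 0" would express "does not have this feature", which is what makes the pattern general.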
How much data is in your features table? Is it just a table of ids and names?
If so, and you're willing to do a little denormalization, you can do this much more easily by encoding the features as a text array on restaurant.
With this scheme your queries boil down to
select * from restaurants where restaurants.features @> ARRAY['pizza', 'delivery']
If you want to maintain your features table because it contains useful data, you can store the array of feature ids on the restaurant and do a query like this:
select * from restaurants where restaurants.feature_ids @> ARRAY[5, 17]
If you don't know the ids up front, and want it all in one query, you should be able to do something along these lines:
select * from restaurants where restaurants.feature_ids @> ARRAY(
select id from features where name in ('pizza', 'delivery')
)
That last query might need some more consideration...
Anyways, I've actually got a pretty detailed article written up about Tagging in Postgres and ActiveRecord if you want some more details.
This is not a "copy and paste" solution, but if you follow these steps you will have a fast working query:
index the feature_name column (I'm assuming that column feature_id is indexed on both tables)
place each feature_name param in its own exists(), correlated on the restaurant so that each feature is checked across all of that restaurant's rows:
select r.id
from
restaurants r
where
exists(select true from features_restaurants fr join features f on fr.feature_id = f.feature_id
       where fr.restaurant_id = r.id and f.feature_name = 'pizza')
and
exists(select true from features_restaurants fr join features f on fr.feature_id = f.feature_id
       where fr.restaurant_id = r.id and f.feature_name = 'delivery')
Maybe you're looking at it backwards?
Maybe try merging the restaurants returned by each feature.
Simplified:
pizza_restaurants = Feature.find_by_name('pizza').restaurants
delivery_restaurants = Feature.find_by_name('delivery').restaurants
pizza_delivery_restaurants = pizza_restaurants & delivery_restaurants
Obviously, this is a single instance solution. But it illustrates the idea.
UPDATE
Here's a dynamic method to pull in all filters without writing SQL (i.e. the "Railsy" way)
def get_restaurants_by_feature_names(features)
  # accepts an array of feature names
  restaurants = Restaurant.all
  features.each do |f|
    feature_restaurants = Feature.find_by_name(f).restaurants
    restaurants = feature_restaurants & restaurants
  end
  return restaurants
end
Since it's an AND condition (the OR conditions get dicey with AREL), I reread your stated problem, ignoring the SQL, and I think this is what you want.
# in Restaurant
has_many :features
# in Feature
has_many :restaurants
# this is a contrived example. you may be doing something like
# where(name: 'pizza'). I'm just making this condition up. You
# could also make this more DRY by just passing in the name if
# that's what you're doing.
def self.pizza
where(pizza: true)
end
def self.delivery
where(delivery: true)
end
# query
Restaurant.features.pizza.delivery
Basically you call the association with ".features" and then you use the self methods defined on features. Hopefully I didn't misunderstand the original problem.
Cheers!
Restaurant
.joins(:features)
.where(features: {name: ['pizza','delivery']})
.group(:id)
.having('count(features.name) = ?', 2)
This seems to work for me. I tried it with SQLite though.
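The same count-based filter can be written for any number of features. This is a hedged sketch in Python with sqlite3, assuming Rails-style column names (restaurants.id, features.id, features.name); the function restaurants_with_all and the sample data are hypothetical.

```python
import sqlite3


def restaurants_with_all(conn, names):
    """Return ids of restaurants that have every feature named in `names`."""
    names = list(dict.fromkeys(names))  # drop repeats so the count can match
    if not names:
        return []
    # Only "?" marks are interpolated; the values themselves stay bound parameters.
    placeholders = ", ".join("?" for _ in names)
    sql = f"""
        SELECT r.id
        FROM restaurants r
        JOIN features_restaurants fr ON fr.restaurant_id = r.id
        JOIN features f ON f.id = fr.feature_id
        WHERE f.name IN ({placeholders})
        GROUP BY r.id
        HAVING COUNT(DISTINCT f.name) = ?
        ORDER BY r.id
    """
    return [row[0] for row in conn.execute(sql, (*names, len(names)))]


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE restaurants (id INTEGER PRIMARY KEY);
    CREATE TABLE features (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE features_restaurants (restaurant_id INTEGER, feature_id INTEGER);
    INSERT INTO restaurants VALUES (1), (2), (3);
    INSERT INTO features VALUES (1, 'pizza'), (2, 'delivery'), (3, 'salad bar');
    INSERT INTO features_restaurants VALUES
        (1, 1), (1, 2), (1, 3), (2, 1), (3, 2), (3, 3);
""")

print(restaurants_with_all(conn, ["pizza", "delivery"]))
```

COUNT(DISTINCT ...) is used instead of a plain count so that a duplicated join row cannot inflate the total.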

Django sql order by

I'm really struggling on this one.
I need to be able to sort my users by the number of positive votes received on their comments.
I have a table userprofile, a table comment and a table likeComment.
The table comment has a foreign key to its user creator and the table likeComment has a foreign key to the comment liked.
To get the number of positive vote a user received I do :
LikeComment.objects.filter(Q(type = 1), Q(comment__user=user)).count()
Now I want to be able to get all the users sorted by the ones that have the most positive votes. How do I do that ? I tried to use extra and JOIN but this didn't go anywhere.
Thank you
It sounds like you want to perform a filter on an annotation:
class User(models.Model):
    pass
class Comment(models.Model):
    user = models.ForeignKey(User, related_name="comments")
class Like(models.Model):
    comment = models.ForeignKey(Comment, related_name="likes")
    type = models.IntegerField()
# parentheses let the chained calls span several lines
users = (User.objects
    .all()
    .extra(select={
        "positive_likes": """
            SELECT COUNT(*) FROM app_like
            JOIN app_comment on app_like.comment_id = app_comment.id
            WHERE app_comment.user_id = app_user.id AND app_like.type = 1 """})
    # descending, so the users with the most positive votes come first
    .order_by("-positive_likes"))
models.py
class UserProfile(models.Model):
    .........
    def like_count(self):
        # return the count; without `return` the method yields None and the ranking is meaningless
        return LikeComment.objects.filter(comment__user=self.user, type=1).count()
views.py
def getRanking( anObject ):
    return anObject.like_count()
def myview(request):
    users = list(UserProfile.objects.filter())
    users.sort(key=getRanking, reverse=True)
    return render(request,'page.html',{'users': users})
Timmy's suggestion to use a subquery is probably the simplest way to solve this kind of problem, but a correlated subquery like that one often performs worse than a join, so if you have a lot of users you may find that you need better performance.
So, re-using Timmy's models:
class User(models.Model):
    pass
class Comment(models.Model):
    user = models.ForeignKey(User, related_name="comments")
class Like(models.Model):
    comment = models.ForeignKey(Comment, related_name="likes")
    type = models.IntegerField()
the query you want looks like this in SQL:
SELECT app_user.id, COUNT(app_like.id) AS total_likes
FROM app_user
LEFT OUTER JOIN app_comment
ON app_user.id = app_comment.user_id
LEFT OUTER JOIN app_like
ON app_comment.id = app_like.comment_id AND app_like.type = 1
GROUP BY app_user.id
ORDER BY total_likes DESC
(If your actual User model has more fields than just id, then you'll need to include them all in the SELECT and GROUP BY clauses.)
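The two-level left join can be checked against a toy database. The sketch below uses sqlite3 with the table names from the SQL above and invented rows; the extra app_user.id sort key is an addition, there only to make ties deterministic.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE app_user (id INTEGER PRIMARY KEY);
    CREATE TABLE app_comment (id INTEGER PRIMARY KEY, user_id INTEGER);
    CREATE TABLE app_like (id INTEGER PRIMARY KEY, comment_id INTEGER, type INTEGER);
    INSERT INTO app_user VALUES (1), (2), (3);
    INSERT INTO app_comment VALUES (1, 1), (2, 1), (3, 2);
    -- like 3 has type 0, so it must not be counted
    INSERT INTO app_like VALUES (1, 1, 1), (2, 2, 1), (3, 2, 0), (4, 3, 1);
""")

ranking = conn.execute("""
    SELECT app_user.id, COUNT(app_like.id) AS total_likes
    FROM app_user
    LEFT OUTER JOIN app_comment
        ON app_user.id = app_comment.user_id
    LEFT OUTER JOIN app_like
        ON app_comment.id = app_like.comment_id AND app_like.type = 1
    GROUP BY app_user.id
    ORDER BY total_likes DESC, app_user.id
""").fetchall()

print(ranking)
```

User 3 has no comments at all and still appears with a count of 0, because COUNT(app_like.id) ignores the NULLs produced by the outer joins.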
Django's object-relational mapping system doesn't provide a way to express this query. (As far as I know—and I'd be very happy to be told otherwise!—it only supports aggregation across one join, not across two joins as here.) But when the ORM isn't quite up to the job, you can always run a raw SQL query, like this:
sql = '''
    SELECT app_user.id, COUNT(app_like.id) AS total_likes
    -- etc (as above)
'''
for user in User.objects.raw(sql):
    print(user.id, user.total_likes)
I believe this can be achieved with Django's queryset:
User.objects.filter(comments__likes__type=1)\
.annotate(lks=Count('comments__likes'))\
.order_by('-lks')
The only problem here is that this query will miss users with 0 likes. The code from @gareth-rees, @timmy-omahony and @Catherine also includes 0-ranked users.

Making a Safe sql query

The code below is from a Sinatra app (that uses DataMapper), which I am trying to convert to a Rails 3 application. It is a class method in the Visit class.
def self.count_by_date_with(identifier,num_of_days)
  visits = repository(:default).adapter.query("SELECT date(created_at) as date, count(*) as count FROM visits where link_identifier = '#{identifier}' and created_at between CURRENT_DATE-#{num_of_days} and CURRENT_DATE+1 group by date(created_at)")
  dates = (Date.today-num_of_days..Date.today)
  results = {}
  dates.each { |date|
    visits.each { |visit| results[date] = visit.count if visit.date == date }
    results[date] = 0 unless results[date]
  }
  results.sort.reverse
end
My problem is with this part
visits = repository(:default).adapter.query("SELECT date(created_at) as date, count(*) as count FROM visits where link_identifier = '#{identifier}' and created_at between CURRENT_DATE-#{num_of_days} and CURRENT_DATE+1 group by date(created_at)")
Rails (as far as I know) doesn't have this repository method, and I would expect a query to be called on an object of some sort, such as Visit.find
Can anyone give me a hint how this would best be written for a Rails app?
Should I do
Visit.find_by_sql("SELECT date(created_at) as date, count(*) as count FROM visits where link_identifier = '#{identifier}' and created_at between CURRENT_DATE-#{num_of_days} and CURRENT_DATE+1 group by date(created_at)")
Model.connection.execute "YOUR SQL" should help you. Something like
class Visit < ActiveRecord::Base
  class << self
    def trigger(created_at,identifier,num_of_days)
      sql = "SELECT date(created_at) as date, count(*) as count FROM visits where link_identifier = '#{identifier}' and created_at between CURRENT_DATE-#{num_of_days} and CURRENT_DATE+1 group by date(created_at)"
      connection.execute sql
    end
  end
end
I know you already accepted an answer, but you asked for the best way to do what you asked in Rails. I'm providing this answer because Rails does not recommend building conditions as pure query strings.
Building your own conditions as pure strings can leave you vulnerable to SQL injection exploits. For example, Client.where("first_name LIKE '%#{params[:first_name]}%'") is not safe.
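The difference between interpolating a value and binding it is easy to demonstrate. Here is a small sketch using Python's sqlite3 module rather than Active Record, with a simplified visits table and a made-up hostile input; the function name count_by_date is hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE visits (link_identifier TEXT, created_at TEXT)")
conn.executemany("INSERT INTO visits VALUES (?, ?)", [
    ("abc", "2012-01-01 10:00:00"),
    ("abc", "2012-01-01 12:00:00"),
    ("abc", "2012-01-02 09:00:00"),
    ("xyz", "2012-01-01 11:00:00"),
])


def count_by_date(conn, identifier):
    """Visits per day for one identifier; the value is bound, never interpolated."""
    sql = """SELECT date(created_at) AS day, COUNT(*)
             FROM visits
             WHERE link_identifier = ?
             GROUP BY day
             ORDER BY day"""
    return conn.execute(sql, (identifier,)).fetchall()


hostile = "abc' OR '1'='1"

# String interpolation lets the input rewrite the WHERE clause.
unsafe_sql = "SELECT COUNT(*) FROM visits WHERE link_identifier = '%s'" % hostile
unsafe_count = conn.execute(unsafe_sql).fetchone()[0]

safe_counts = count_by_date(conn, "abc")
hostile_counts = count_by_date(conn, hostile)

print(unsafe_count, safe_counts, hostile_counts)
```

With the bound parameter the hostile string is compared as a literal identifier and matches nothing, while the interpolated version counts every row in the table.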
Fortunately, Active Record is incredibly powerful and can build very complex queries. For instance, your query can be recreated with four method calls while still being easy to read and safe.
# will return a hash with the structure
# {"__DATE__" => __COUNT__, ...}
def self.count_by_date_with(identifier, num_of_days)
  where("link_identifier = ?", identifier)
    .where(:created_at => (num_of_days.to_i.days.ago)..(1.day.from_now))
    .group('date(created_at)')
    .count
end
Active Record has been built to turn Ruby objects into valid SQL selectors and operators. What makes this so cool is that Rails can turn a Ruby Range into a BETWEEN operator or an Array into an IN expression.
For more information on Active Record check out the guide. It explains what Active Record is capable of and how to use it.

Rails (or maybe SQL): Finding and deleting duplicate AR objects

ActiveRecord objects of the class 'Location' (representing the db-table Locations) have the attributes 'url', 'lat' (latitude) and 'lng' (longitude).
Lat-lng-combinations on this model should be unique. The problem is, that there are a lot of Location-objects in the database having duplicate lat-lng-combinations.
I need help in doing the following
Find objects that share the same lat-lng-combination.
If the 'url' attribute of the object isn't empty, keep this object and delete the other duplicates. Otherwise just choose the oldest object (by checking the attribute 'created_at') and delete the other duplicates.
As this is a one-time-operation, solutions in SQL (MySQL 5.1 compatible) are welcome too.
If it's a one time thing then I'd just do it in Ruby and not worry too much about efficiency. I haven't tested this thoroughly, check the sorting and such to make sure it'll do exactly what you want before running this on your db :)
keep = []
locations = Location.find(:all)
locations.each do |loc|
  # get all Locations with the same coords as this one
  same_coords = locations.select { |l| l.lat == loc.lat and \
                                       l.lng == loc.lng }
  # present? also treats a nil url as empty
  with_urls = same_coords.select { |l| l.url.present? }
  # decide which list to use depending if there were any urls
  same_coords = with_urls.any? ? with_urls : same_coords
  # pick the oldest one (earliest created_at first)
  keep << same_coords.sort { |a,b| a.created_at <=> b.created_at }.first.id
end
# only keep unique ids
keep.uniq!
# now we just delete all the rows we didn't decide to keep
locations.each do |loc|
  loc.destroy unless keep.include?( loc.id )
end
Now like I said, this is definitely poor, poor code. But sometimes just hacking out the thing that works is worth the time saved in thinking up something 'better', especially if it's just a one-off.
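The selection rule itself (prefer a record with a url, otherwise take the oldest) can be sketched in a single pass by grouping on the coordinates first. This is an illustration in plain Python with dictionaries standing in for Location records; the function ids_to_keep and the sample data are hypothetical, and nothing is deleted.

```python
from collections import defaultdict
from datetime import datetime


def ids_to_keep(locations):
    """Pick one id per (lat, lng): oldest record with a url, else oldest record."""
    groups = defaultdict(list)
    for loc in locations:
        groups[(loc["lat"], loc["lng"])].append(loc)

    keep = set()
    for same_coords in groups.values():
        # An empty string or None both count as "no url".
        with_urls = [loc for loc in same_coords if loc["url"]]
        candidates = with_urls or same_coords
        keep.add(min(candidates, key=lambda loc: loc["created_at"])["id"])
    return keep


locations = [
    {"id": 1, "lat": 1.0, "lng": 2.0, "url": "", "created_at": datetime(2010, 1, 1)},
    {"id": 2, "lat": 1.0, "lng": 2.0, "url": "http://example.com", "created_at": datetime(2010, 2, 1)},
    {"id": 3, "lat": 1.0, "lng": 2.0, "url": None, "created_at": datetime(2009, 1, 1)},
    {"id": 4, "lat": 5.0, "lng": 6.0, "url": "", "created_at": datetime(2010, 3, 1)},
    {"id": 5, "lat": 5.0, "lng": 6.0, "url": "", "created_at": datetime(2010, 1, 15)},
    {"id": 6, "lat": 7.0, "lng": 8.0, "url": None, "created_at": datetime(2010, 1, 1)},
]

keep = ids_to_keep(locations)
to_delete = sorted(loc["id"] for loc in locations if loc["id"] not in keep)
print(sorted(keep), to_delete)
```

Grouping first avoids rescanning the whole list for every record, which matters once the table has more than a few thousand rows.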
If you have 2 MySQL columns, you can use the CONCAT function.
SELECT * FROM table1 GROUP BY CONCAT(column_lat, column_lng)
If you need to know the total
SELECT COUNT(*) AS total FROM table1 GROUP BY CONCAT(column_lat, column_lng)
Or, you can combine both
SELECT COUNT(*) AS total, table1.* FROM table1
GROUP BY CONCAT(column_lat, column_lng)
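One caveat with a concatenated key is that two different coordinate pairs can produce the same string. The sketch below shows this with sqlite3, which uses the || operator where MySQL uses CONCAT; the locations table and its rows are invented, and grouping on both columns is shown as an alternative, not as the method proposed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE locations (id INTEGER PRIMARY KEY, lat REAL, lng REAL);
    INSERT INTO locations VALUES
        (1, 1.2, 34.5), (2, 1.2, 34.5), (3, 1.23, 4.5), (4, 9.0, 9.0);
""")

# Grouping on both columns: only the true duplicate pair is reported.
by_columns = conn.execute("""
    SELECT lat, lng, COUNT(*) AS total
    FROM locations
    GROUP BY lat, lng
    HAVING COUNT(*) > 1
""").fetchall()

# Grouping on the concatenation: row 3 collides with rows 1 and 2.
by_concat = conn.execute("""
    SELECT lat || lng AS k, COUNT(*) AS total
    FROM locations
    GROUP BY k
    HAVING COUNT(*) > 1
""").fetchall()

print(by_columns, by_concat)
```

Adding a separator that cannot occur in a number (for example a comma) between the two values would also avoid the collision.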
But if you can explain more on your question, perhaps we can have more relevant answers.