Delete everything but the first n items (Rails, PGSQL) - ruby-on-rails-3

Assume I have a table called post, that belongs to user.
class Post < ActiveRecord::Base
belongs_to :user
end
class User < activeRecord::Base
has_many :posts
end
Such that the posts table contains a user_id column.
How can I periodically, delete all but the first n posts from the posts table for each user?
Assume n = 20
If a user has 15 posts, then all 15 are kept.
If a user has 21 posts, then 1 post is deleted. The 1 being the oldest one.
Basically I need to do a FIFO but with grouping to the user_id column.
How can this be achieved via raw PostgreSQL sql?

You can rank the post using ROW_NUMBER() in a common table expression, ranking the newest per user_id as 1 (and increasing for older posts). Then just use a simple delete to delete all rows where the rank is greater than 20;
WITH cte AS (
SELECT id, ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY id DESC) rn
FROM post
)
DELETE FROM post
WHERE id IN (SELECT id FROM cte WHERE rn>20);
A simple SQLfiddle to test with.
And as always, back up your data before running potentially destructive SQL from random people on the Internet.

Related

Performance of search query bottlenecked 98% by mutual friends despite caching

So on my social networking website, similar to facebook, my search speed is bottlenecked like 98% by this one part. I want to rank the results based on the number of mutual friends the searching user has, with all of the results (we can assume they are users)
My friends table has 3 columns -
user_id (person who sends the request)
friend_id (person who receives the request)
pending (boolean to indicate if the request was accepted or not)
user_id and friend_id are both foreign keys that reference users.id
Finding friend_ids of a user is simple, it looks like this
def friends
Friend.where(
'(user_id = :id OR friend_id = :id) AND pending = false',
id: self.id
).pluck(:user_id, :friend_id)
.flatten
.uniq
.reject { |id| id == self.id }
end
So, after getting the results that match the search query, ranking the results by mutual friends, requires following steps -
Get user_ids of all the searching user's friends - Set(A). Above mentioned friends method does this
Loop over each of the ids in Set(A) -
Get user_ids of all the friends of |id| - Set (B). Again, done by friends method
Find length of intersection of set A and set B
Order in descending order of length of intersections for all results
The most expensive operation over here obviously getting friend_ids of of hundreds of users. So I cached the friend_ids of all the users to speed it up. The difference in performance was amazing, but I'm curious if it can be further improved.
I'm wondering if there is a way that I can get friend_ids of all the desired users in a single query, that is efficient. Something like -
SELECT user_id, [array of friend_ids of the user with id = user_id]
FROM friends
....
Can someone help me write a fast SQL or ActiveRecord query for this?
That way I can store the user_ids of all the search results and their corresponding friend_ids in a hash or some other fast data structure, and then perform the same operation of ranking (that I mentioned above). Since I won't be hitting the cache for thousands of users and their friend_ids, I think it'll speed up the process significantly
Caching your friends table in RAM is not a viable approach if you expect your site to grow to large numbers of users, but I'm sure it does great for a smallish number of users.
It is to your advantage to get the most work you can out of the database with as few calls as possible. It is inefficient to issue large numbers of queries, as the overhead per query as comparatively large. Moreover, databases are built for the kind of task you're trying to perform. I think you are doing far too much work on the Ruby side, and you ought to let the database do the kind of work it does best.
You did not give many details, so I decided to start by defining a minimal model DB:
create table users (
user_id int not null primary key,
nick varchar(32)
);
create table friends (
user_id int not null,
friend_id int not null,
pending bool,
primary key (user_id, friend_id),
foreign key (user_id) references users(user_id),
foreign key (friend_id) references users(user_id),
check (user_id < friend_id)
);
The check constraint on friends avoids the same pair of users being listed in the table in both orders, and of course the PK prevents the same pair from being enrolled multiple times in the same order. The PK also automatically has a unique index associated with it.
Since I suppose the 'is a friend of' relation is supposed to be logically symmetric, it is convenient to define a view that presents that symmetry:
create view friends_symmetric (user_id, friend_id) as (
select user_id, friend_id from friends where not pending
union all
select friend_id, user_id from friends where not pending
);
(If friendship is not symmetric then you can drop the check constraint and the view, and use table friends in place of friends_symmetric in what follows.)
As a model query whose results you want to rank, then, I take this:
select * from users where nick like 'Sat%';
The objective is to return result rows in descending order of the number of friends each hit has in common with User1, the user on whose behalf the query is run. You might do that like so:
(update: modified this query to filter out duplicate results)
select *
from (
select
u.*,
count(mutual.shared_friend_id) over (partition by u.user_id) as num_shared,
row_number() over (partition by u.user_id) as copy_num
from
users u
left join (
select
f1.friend_id as shared_friend_id,
f2.friend_id as friend_id
from friends_symmetric f1
join friends_symmetric f2
on f1.friend_id = f2.user_id
where f1.user_id = ?
and f2.friend_id != f1.user_id
) mutual
on u.user_id = mutual.friend_id
where u.nick like 'Sat%'
) all_rows
where copy_num = 1
order by num_shared desc
where the ? is a placeholder for a parameter containing the ID of the User1.
Edited to add:
I have structured this query with window functions instead of an aggregate query with the idea that such a structure will be easier for the query planner to optimize. Nevertheless, the inline view "mutual" could instead be structured as an aggregate query that computes the number of shared friends that the searching user has with every user that shares at least one friend, and that would permit one level of inline view to be avoided. If performance of the provided query is or becomes inadequate, then it would be worthwhile to test that variant.
There are other ways to approach the problem of performing the sorting in the DB, some of which may perform better, and there may be ways to improve the performance of each by tweaking the database (adding indexes or constraints, modifying table definitions, computing db statistics, ...).
I cannot predict whether that query will outperform what you're doing now, but I assure you that it scales better, and it is easier to maintain.
Assuming that you want a relation of the User model whose primary key is id, you should be able to join onto a subquery that calculates the number of mutual friends:
class User < ActiveRecord::Base
def other_users_ordered_by_mutual_friends
self.class.select("users.*, COALESCE(f.friends_count, 0) AS friends_count").joins("LEFT OUTER JOIN (
SELECT all_friends.user_id, COUNT(DISTINCT all_friends.friend_id) AS friends_count FROM (
SELECT f1.user_id, f1.friend_id FROM friends f1 WHERE f1.pending = false
UNION ALL
SELECT f2.friend_id AS user_id, f2.user_id AS friend_id FROM friends f2 WHERE f2.pending = false
) all_friends INNER JOIN (
SELECT DISTINCT f1.friend_id AS user_id FROM friends f1 WHERE f1.user_id = #{id} AND f1.pending = false
UNION ALL
SELECT DISTINCT f2.user_id FROM friends f2 WHERE f2.friend_id = #{id} AND f2.pending = false
) user_friends ON user_friends.user_id = all_friends.friend_id GROUP BY all_friends.user_id
) f ON f.user_id = users.id").where.not(id: id).order("friends_count DESC")
end
end
The subquery selects all user IDs with associated friends and inner joins that to another select with all of the current user's friends' IDs. Since it groups by the user_id and selects the count, we get the number of mutual friends for each user_id. I have not tested this since I don't have any sample data, but it should work.
Since this returns a scope, you can chain other scopes/conditions to the relation:
current_user.other_users_ordered_by_mutual_friends.where(attribute1: value1).reorder(:attribute2)
The select scope as written will also give you access to the field friends_count on instances within the relation:
<%- current_user.other_users_ordered_by_mutual_friends.each do |user| -%>
<p>User <%= user.id -%> has <%= user.friends_count -%> mutual friends.</p>
<%- end -%>
John had a great idea with the friends_symetric view. With two filtered indexes (one on (friend_id,user_id and the other on (user_id,friend_id) ) it's gonna work great.
However the query can be a bit simpler
WITH user_friends AS(
SELECT user_id, array_agg(friend_id) AS friends
FROM friends_symmetric
WHERE user_id = :user_id -- id of our user
GROUP BY user_id
)
SELECT u.*
,array_agg(friend_id) AS shared_friends -- aggregated ids of friends in case they are needed for something
,count(*) AS shared_count
FROM user_friends AS uf
JOIN friends_symmetric AS f
ON f.user_id = ANY(uf.friends) AND f.friend_id = ANY(uf.friends)
JOIN user
ON u.user_id = f.user_id
WHERE u.nick LIKE 'Sat%' --nickname of our user's friend
GROUP BY u.user_id

Rails / SQL Query finding most recent Event

I'm using Rails 4.2 and PostgreSQL 9.4.
I have a basic users, reservations and events schema.
I'd like to return a list of users and the most recent event they attended, along with what date/time this was at.
I've created a query that returns the user and the time of the most recent event. However I need to return the events.id as well.
My application does not allow a user to reserve two events with the same start time, however I appreciate SQL does not know anything about this and thinks there can be multiple events in the result. Hence I am happy for the query to return an appropriate event ID at random in the case of a hypothetical 'tie' for events.starts_at.
User.all.joins(reservations: :event)
.select('users.*, max(events.starts_at)')
.where('reservations.state = ?', "attended")
.where('events.company_id = ?', 1)
.group('users.id')
The corresponding SQL query is:
SELECT users.*, max(events.starts_at) FROM "users" INNER JOIN "reservations" ON "reservations"."user_id" = "users"."id" INNER JOIN "events" ON "events"."id" = "reservations"."event_id" WHERE (reservations.state = 'attended') AND (events.company_id = 1) GROUP BY users.id
The reservations table is very large so loading the entire set into Rails and processing it via Ruby code is undesirable. I'd like to perform the entire query in SQL if it is possible to do so.
My basic model:
User
has_many :reservations
Reservation
belongs_to :user
belongs_to :event
Event
belongs_to :company
has_many :reservations
The generic sql that returns data for the most recent event looks like this:
select yourfields
from yourtables
join
(select someField
, max(datetimefield) maxDateTime
from table1
where whatever
group by someField ) temp on table1.someField = temp.somefield
and table1.dateTimeField = maxDateTime
where whatever
The two "where whatever" things should be the same. All you have to do is adapt this construct into your app. You might consider putting the query into a stored procedure which you then call from your app.
I think your query should focus first to retrieve the most recent reservation.
SELECT MAX(`events.starts_at`),`events"."id`,`user_id` FROM `reservations` WHERE (reservations.state = 'attended')
Then JOIN the Users and Events.
Assuming the results will include every User and Event it may be more efficient to retrieve all users and events and store then in two arrays keyed by id.
The logic behind that is rather than a separate lookup into the user and events table for each resulting reservation by the db engine, it is more efficient to get them all in a single query.
SELECT * FROM Users' WHERE 1 ORDER BYuser_id`
SELECT * FROM Events' WHERE 1 ORDER BYevent_id`
I am not familiar with Rails syntax so cannot give exact code but can show using it in PHP code, the results are put into the array with a single line of code.
while ($row = mysql_fetch_array($results, MYSQL_NUM)){users[$row(user_id)] = $row;}
Then when processing the Reservations you get the user and event data from the arrays.
The Index for reservations is critical and may be worth profiling.
Possible profile choices may be to include and exclude 'attended' in the Index. The events.starts_at should be the first column in the index followed by user_id. But profiling the Index's column order should be profiled.
You may want to use a unique Index to enforce the no duplicate reservations times.

I'm having trouble with a ranking of achievements in mysql

I'm making a system that users earn achievements with certain actions on the site.
The database is divided into three, the user, achievements and table linking the two.
For the ranking what is important is the table that links users and achievements.
Structured like this (called user_achievements):
| id | user_id | achievement_id | created_at | updated_at |
On top ranking should be the person with the most achievements and who won that first.
The query that was assembled:
SELECT user_id,
COUNT(achievement_id) AS 'total',
created_at
FROM `user_achievements`
GROUP BY user_id ORDER BY total DESC, created_at ASC LIMIT 10
This query works, but the problem is the "group by" for who has the most achievements (for the "total"), it does not back the last created_at user. Come the first created_at that user, which does not fit sort who was the first to earn that amount of achievements.
My system uses Ruby on Rails, is there a solution to this query in ActiveRecord, would be of great help.
Thanks a lot!
Sorry for my english...
SELECT user_id,
COUNT(achievement_id) AS 'total',
MAX(created_at) as 'lastTS'
FROM `user_achievements`
GROUP BY user_id
ORDER BY total DESC, lastTS ASC LIMIT 10
I know you've selected your answer, but in case you still want a solution in ActiveRecord...
class User < ActiveRecord::Base
scope :ranked, order('achievements.count DESC, achievements.created_at ASC').limit(10)
end
And then you just need to call up Users.ranked or #users.ranked to get the return you need.

Finding unique records, ordered by field in association, with PostgreSQL and Rails 3?

UPDATE: So thanks to #Erwin Brandstetter, I now have this:
def self.unique_users_by_company(company)
users = User.arel_table
cards = Card.arel_table
users_columns = User.column_names.map { |col| users[col.to_sym] }
cards_condition = cards[:company_id].eq(company.id).
and(cards[:user_id].eq(users[:id]))
User.joins(:cards).where(cards_condition).group(users_columns).
order('min(cards.created_at)')
end
... which seems to do exactly what I want. There are two shortcomings that I would still like to have addressed, however:
The order() clause is using straight SQL instead of Arel (couldn't figure it out).
Calling .count on the query above gives me this error:
NoMethodError: undefined method 'to_sym' for
#<Arel::Attributes::Attribute:0x007f870dc42c50> from
/Users/neezer/.rvm/gems/ruby-1.9.3-p0/gems/activerecord-3.1.1/lib/active_record/relation/calculations.rb:227:in
'execute_grouped_calculation'
... which I believe is probably related to how I'm mapping out the users_columns, so I don't have to manually type in all of them in the group clause.
How can I fix those two issues?
ORIGINAL QUESTION:
Here's what I have so far that solves the first part of my question:
def self.unique_users_by_company(company)
users = User.arel_table
cards = Card.arel_table
cards_condition = cards[:company_id].eq(company.id)
.and(cards[:user_id].eq(users[:id]))
User.where(Card.where(cards_condition).exists)
end
This gives me 84 unique records, which is correct.
The problem is that I need those User records ordered by cards[:created_at] (whichever is earliest for that particular user). Appending .order(cards[:created_at]) to the scope at the end of the method above does absolutely nothing.
I tried adding in a .joins(:cards), but that give returns 587 records, which is incorrect (duplicate Users). group_by as I understand it is practically useless here as well, because of how PostgreSQL handles it.
I need my result to be an ActiveRecord::Relation (so it's chainable) that returns a list of unique users who have cards that belong to a given company, ordered by the creation date of their first card... with a query that's written in Ruby and is database-agnostic. How can I do this?
class Company
has_many :cards
end
class Card
belongs_to :user
belongs_to :company
end
class User
has_many :cards
end
Please let me know if you need any other information, or if I wasn't clear in my question.
The query you are looking for should look like this one:
SELECT user_id, min(created_at) AS min_created_at
FROM cards
WHERE company_id = 1
GROUP BY user_id
ORDER BY min(created_at)
You can join in the table user if you need columns of that table in the result, else you don't even need it for the query.
If you don't need min_created_at in the SELECT list, you can just leave it away.
Should be easy to translate to Ruby (which I am no good at).
To get the whole user record (as I derive from your comment):
SELECT u.*,
FROM user u
JOIN (
SELECT user_id, min(created_at) AS min_created_at
FROM cards
WHERE company_id = 1
GROUP BY user_id
) c ON u.id = c.user_id
ORDER BY min_created_at
Or:
SELECT u.*
FROM user u
JOIN cards c ON u.id = c.user_id
WHERE c.company_id = 1
GROUP BY u.id, u.col1, u.col2, .. -- You have to spell out all columns!
ORDER BY min(c.created_at)
With PostgreSQL 9.1+ you can simply write:
GROUP BY u.id
(like in MySQL) .. provided id is the primary key.
I quote the release notes:
Allow non-GROUP BY columns in the query target list when the primary
key is specified in the GROUP BY clause (Peter Eisentraut)
The SQL standard allows this behavior, and because of the primary key,
the result is unambiguous.
The fact that you need it to be chainable complicates things, otherwise you can either drop down into SQL yourself or only select the column(s) you need via select("users.id") to get around the Postgres issue. Because at the heart of it your query is something like
SELECT users.id
FROM users
INNER JOIN cards ON users.id = cards.user_id
WHERE cards.company_id = 1
GROUP BY users.id, DATE(cards.created_at)
ORDER BY DATE(cards.created_at) DESC
Which in Arel syntax is more or less:
User.select("id").joins(:cards).where(:"cards.company_id" => company.id).group_by("users.id, DATE(cards.created_at)").order("DATE(cards.created_at) DESC")

ruby on rails: What's the equivalent statement of find_by_sql?

What's the equivalent rails statement of the following ?
#temp = Post.find_by_sql("SELECT posts.id,posts.title, comment_count.count FROM posts INNER JOIN (SELECT post_id, COUNT(*) AS count FROM comments GROUP BY post_id) AS comment_count ON comment_count.post_id = posts.id ORDER BY count DESC LIMIT 5;")
Is it possible to convert it to find/where/select statements ?
It's a complex query , i can't get it , but tried something like this,
#temp = Post.select("posts.id, posts.title, comment_count.count").joins(:comments).group("post_id").order("countpages.counts desc").limit(5)
Using something like Arel may clean this up even further, I'm meaning to start using it in my next app (haven't yet).
#temp = Post.find_by_sql("SELECT posts.id,posts.title, comment_count.count FROM posts INNER JOIN (SELECT post_id, COUNT(*) AS count FROM comments GROUP BY post_id) AS comment_count ON comment_count.post_id = posts.id ORDER BY count DESC LIMIT 5;")
Could be (only a bit cleaner, but lets you chain scopes at least):
#temp = Post.joins("JOIN (SELECT post_id, COUNT(*) AS count FROM comments GROUP BY post_id) AS comment_count ON comment_count.post_id = posts.id").order("count desc").limit(5)
I'm pretty sure Arel will let you do a sub-select a little more elegantly through scopes but I'm not familiar enough with it to give you a solution.
EDIT
You may be able to just add a counter_cache to your comments association on your Post model to avoid having to calculate it every time.
class Post < ActiveRecord::Base
has_many :comments, :counter_cache => true
end
Docs from here
:counter_cache
Caches the number of belonging objects on the associate class through
the use of increment_counter and decrement_counter. The counter cache
is incremented when an object of this class is created and decremented
when it’s destroyed. This requires that a column named
#{table_name}_count (such as comments_count for a belonging Comment
class) is used on the associate class (such as a Post class). You can
also specify a custom counter cache column by providing a column name
instead of a true/false value to this option (e.g., :counter_cache =>
:my_custom_counter.) Note: Specifying a counter cache will add it to
that model’s list of readonly attributes using attr_readonly.
You don't need the subquery here, you could simply say:
Post.select("posts.id, posts.title, COUNT(*) count").
joins(:comments).
group('posts.id').
order('count DESC').
limit(5)