Performance of search query bottlenecked 98% by mutual friends despite caching - sql

On my social networking website (similar to Facebook), about 98% of the search time is spent in one part: ranking. I want to rank the results by the number of mutual friends the searching user has with each of the results (we can assume they are all users).
My friends table has 3 columns -
user_id (person who sends the request)
friend_id (person who receives the request)
pending (boolean to indicate if the request was accepted or not)
user_id and friend_id are both foreign keys that reference users.id
Finding friend_ids of a user is simple, it looks like this
def friends
Friend.where(
'(user_id = :id OR friend_id = :id) AND pending = false',
id: self.id
).pluck(:user_id, :friend_id)
.flatten
.uniq
.reject { |id| id == self.id }
end
So, after getting the results that match the search query, ranking them by mutual friends requires the following steps -
Get user_ids of all the searching user's friends - Set(A). Above mentioned friends method does this
Loop over each of the ids in Set(A) -
Get user_ids of all the friends of |id| - Set (B). Again, done by friends method
Find length of intersection of set A and set B
Order in descending order of length of intersections for all results
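The steps above can be sketched in plain Ruby as follows. This is only an illustration: it reads the loop as running over the search results, and it assumes the friend ids are already available in a hash (`friends_by_user`, a hypothetical name) that maps each user id to a Set of friend ids.

```ruby
require "set"

# friends_by_user is a hypothetical in-memory structure:
# user id => Set of that user's friend ids (e.g. filled from the cache).
def rank_by_mutual_friends(searcher_id, result_ids, friends_by_user)
  my_friends = friends_by_user.fetch(searcher_id, Set.new)   # Set(A)

  counts = result_ids.map do |id|
    their_friends = friends_by_user.fetch(id, Set.new)       # Set(B)
    [id, (my_friends & their_friends).size]                   # size of A ∩ B
  end

  # Highest mutual-friend count first; ties broken by id for stable output.
  counts.sort_by { |id, count| [-count, id] }
end

friends_by_user = {
  1 => Set[2, 3, 4],
  5 => Set[2, 3],
  6 => Set[4],
  7 => Set[]
}
ranked = rank_by_mutual_friends(1, [5, 6, 7], friends_by_user)
# => [[5, 2], [6, 1], [7, 0]]
```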
The most expensive operation here is obviously getting the friend_ids of hundreds of users. So I cached the friend_ids of all the users to speed it up. The difference in performance was amazing, but I'm curious whether it can be improved further.
I'm wondering if there is a way that I can get friend_ids of all the desired users in a single query, that is efficient. Something like -
SELECT user_id, [array of friend_ids of the user with id = user_id]
FROM friends
....
Can someone help me write a fast SQL or ActiveRecord query for this?
That way I can store the user_ids of all the search results and their corresponding friend_ids in a hash or some other fast data structure, and then perform the same operation of ranking (that I mentioned above). Since I won't be hitting the cache for thousands of users and their friend_ids, I think it'll speed up the process significantly

Caching your friends table in RAM is not a viable approach if you expect your site to grow to large numbers of users, but I'm sure it does great for a smallish number of users.
It is to your advantage to get the most work you can out of the database with as few calls as possible. It is inefficient to issue large numbers of queries, as the overhead per query is comparatively large. Moreover, databases are built for the kind of task you're trying to perform. I think you are doing far too much work on the Ruby side, and you ought to let the database do the kind of work it does best.
You did not give many details, so I decided to start by defining a minimal model DB:
create table users (
user_id int not null primary key,
nick varchar(32)
);
create table friends (
user_id int not null,
friend_id int not null,
pending bool,
primary key (user_id, friend_id),
foreign key (user_id) references users(user_id),
foreign key (friend_id) references users(user_id),
check (user_id < friend_id)
);
The check constraint on friends avoids the same pair of users being listed in the table in both orders, and of course the PK prevents the same pair from being enrolled multiple times in the same order. The PK also automatically has a unique index associated with it.
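One consequence of the check constraint is that the application has to put each pair into ascending order before inserting it. A minimal sketch of that step, using a hypothetical helper name that is not part of the schema above:

```ruby
# Hypothetical helper: order a pair of user ids so that the row satisfies
# check (user_id < friend_id) before it is inserted into friends.
def normalized_friend_pair(a, b)
  raise ArgumentError, "a user cannot befriend themselves" if a == b
  a < b ? [a, b] : [b, a]
end

pair = normalized_friend_pair(42, 7)
# => [7, 42]
```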
Since I suppose the 'is a friend of' relation is supposed to be logically symmetric, it is convenient to define a view that presents that symmetry:
create view friends_symmetric (user_id, friend_id) as (
select user_id, friend_id from friends where not pending
union all
select friend_id, user_id from friends where not pending
);
(If friendship is not symmetric then you can drop the check constraint and the view, and use table friends in place of friends_symmetric in what follows.)
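To make the effect of the view concrete, here is a small in-memory Ruby sketch of what friends_symmetric returns for a few rows. The hash-based row layout is an assumption made for illustration only; the real work is done by the view in the database.

```ruby
# Each hash mirrors one row of the friends table.
def symmetric_pairs(rows)
  accepted = rows.reject { |r| r[:pending] }
  # Emit every accepted friendship in both directions, like the view does.
  accepted.flat_map do |r|
    [[r[:user_id], r[:friend_id]], [r[:friend_id], r[:user_id]]]
  end
end

rows = [
  { user_id: 1, friend_id: 2, pending: false },
  { user_id: 1, friend_id: 3, pending: true },   # not accepted yet, skipped
  { user_id: 2, friend_id: 3, pending: false }
]
pairs = symmetric_pairs(rows)
# => [[1, 2], [2, 1], [2, 3], [3, 2]]
```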
As a model query whose results you want to rank, then, I take this:
select * from users where nick like 'Sat%';
The objective is to return result rows in descending order of the number of friends each hit has in common with User1, the user on whose behalf the query is run. You might do that like so:
(update: modified this query to filter out duplicate results)
select *
from (
select
u.*,
count(mutual.shared_friend_id) over (partition by u.user_id) as num_shared,
row_number() over (partition by u.user_id) as copy_num
from
users u
left join (
select
f1.friend_id as shared_friend_id,
f2.friend_id as friend_id
from friends_symmetric f1
join friends_symmetric f2
on f1.friend_id = f2.user_id
where f1.user_id = ?
and f2.friend_id != f1.user_id
) mutual
on u.user_id = mutual.friend_id
where u.nick like 'Sat%'
) all_rows
where copy_num = 1
order by num_shared desc
where the ? is a placeholder for a parameter containing the ID of the User1.
Edited to add:
I have structured this query with window functions instead of an aggregate query with the idea that such a structure will be easier for the query planner to optimize. Nevertheless, the inline view "mutual" could instead be structured as an aggregate query that computes the number of shared friends that the searching user has with every user that shares at least one friend, and that would permit one level of inline view to be avoided. If performance of the provided query is or becomes inadequate, then it would be worthwhile to test that variant.
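For reference, the aggregate variant described above might look roughly like the SQL held in the Ruby method below. This is an untested sketch against the model schema from this answer; the method name is hypothetical, and the single ? is still the placeholder for the ID of User1.

```ruby
# Hypothetical helper returning the aggregate form of the ranking query.
# The inline view counts shared friends per user, so no window functions
# and no copy_num filter are needed.
def mutual_rank_sql
  <<~SQL
    select u.*, coalesce(m.num_shared, 0) as num_shared
    from users u
    left join (
      select f2.friend_id, count(*) as num_shared
      from friends_symmetric f1
      join friends_symmetric f2
        on f1.friend_id = f2.user_id
      where f1.user_id = ?
        and f2.friend_id != f1.user_id
      group by f2.friend_id
    ) m
      on u.user_id = m.friend_id
    where u.nick like 'Sat%'
    order by num_shared desc
  SQL
end
```

Whether this form is faster than the window-function version depends on the planner and the data, so it is worth measuring both.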
There are other ways to approach the problem of performing the sorting in the DB, some of which may perform better, and there may be ways to improve the performance of each by tweaking the database (adding indexes or constraints, modifying table definitions, computing db statistics, ...).
I cannot predict whether that query will outperform what you're doing now, but I assure you that it scales better, and it is easier to maintain.

Assuming that you want a relation of the User model whose primary key is id, you should be able to join onto a subquery that calculates the number of mutual friends:
class User < ActiveRecord::Base
def other_users_ordered_by_mutual_friends
self.class.select("users.*, COALESCE(f.friends_count, 0) AS friends_count").joins("LEFT OUTER JOIN (
SELECT all_friends.user_id, COUNT(DISTINCT all_friends.friend_id) AS friends_count FROM (
SELECT f1.user_id, f1.friend_id FROM friends f1 WHERE f1.pending = false
UNION ALL
SELECT f2.friend_id AS user_id, f2.user_id AS friend_id FROM friends f2 WHERE f2.pending = false
) all_friends INNER JOIN (
SELECT DISTINCT f1.friend_id AS user_id FROM friends f1 WHERE f1.user_id = #{id} AND f1.pending = false
UNION ALL
SELECT DISTINCT f2.user_id FROM friends f2 WHERE f2.friend_id = #{id} AND f2.pending = false
) user_friends ON user_friends.user_id = all_friends.friend_id GROUP BY all_friends.user_id
) f ON f.user_id = users.id").where.not(id: id).order("friends_count DESC")
end
end
The subquery selects all user IDs with associated friends and inner joins that to another select with all of the current user's friends' IDs. Since it groups by the user_id and selects the count, we get the number of mutual friends for each user_id. I have not tested this since I don't have any sample data, but it should work.
Since this returns a scope, you can chain other scopes/conditions to the relation:
current_user.other_users_ordered_by_mutual_friends.where(attribute1: value1).reorder(:attribute2)
The select scope as written will also give you access to the field friends_count on instances within the relation:
<%- current_user.other_users_ordered_by_mutual_friends.each do |user| -%>
<p>User <%= user.id -%> has <%= user.friends_count -%> mutual friends.</p>
<%- end -%>

John had a great idea with the friends_symmetric view. With two filtered indexes (one on (friend_id, user_id) and the other on (user_id, friend_id)), it should work great.
However the query can be a bit simpler
WITH user_friends AS(
SELECT user_id, array_agg(friend_id) AS friends
FROM friends_symmetric
WHERE user_id = :user_id -- id of our user
GROUP BY user_id
)
SELECT u.*
,array_agg(friend_id) AS shared_friends -- aggregated ids of friends in case they are needed for something
,count(*) AS shared_count
FROM user_friends AS uf
JOIN friends_symmetric AS f
ON f.user_id = ANY(uf.friends) AND f.friend_id = ANY(uf.friends)
JOIN users AS u
ON u.user_id = f.user_id
WHERE u.nick LIKE 'Sat%' --nickname of our user's friend
GROUP BY u.user_id

Related

Check if inverse relationship exists in table

I'm using a table for signifying friends/friend requests. Basically my idea was to have a table structure like this:
CREATE TABLE friends(
user_id NUMERIC NOT NULL,
friend_user_id NUMERIC NOT NULL,
UNIQUE (user_id,friend_user_id)
)
When a user wants to create a friend request, you add a row:
INSERT INTO friends(user_id,friend_user_id) VALUES($1,$2)
Then when the other user accepts the friend request you would simply add the inverse of the previous row (i.e. the recipient would technically be sending a friend request back to the sender thus completing the friend relationship):
INSERT INTO friends(user_id,friend_user_id) VALUES($2,$1)
My question:
If I want to get all the friends of a user, I have to get all the rows with that user's id and inner join them with another table that contains the users' information. How would I check within the query whether each row's inverse relationship exists?
P.S. I think I could do it pretty easily doing multiple queries but I would rather only have one query if possible.
Solution:
SELECT u.id, u.username, u.avatar FROM users u
INNER JOIN friends f ON u.id = f.friend_user_id
WHERE f.user_id = $1 AND
EXISTS (SELECT 1 FROM friends r WHERE r.user_id = f.friend_user_id AND r.friend_user_id = $1)
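The idea behind the inverse-row check can be illustrated outside the database with a few lines of Ruby. This is a sketch on in-memory pairs, not a replacement for the query; the method name is made up for the example.

```ruby
require "set"

# Each [user_id, friend_user_id] pair mirrors one row of the friends table.
# A friendship is confirmed only when the inverse row exists as well.
def confirmed_friend_ids(rows, user_id)
  pairs = rows.to_set
  rows
    .select { |from, to| from == user_id && pairs.include?([to, from]) }
    .map { |_from, to| to }
end

rows = [[1, 2], [2, 1], [1, 3], [4, 1]]
friends_of_1 = confirmed_friend_ids(rows, 1)
# => [2]   (1 -> 3 and 4 -> 1 are still unanswered requests)
```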

Best way to store followed users

I know the title isn't very descriptive, but it's really hard to find something generic to describe my situation. If someone wants to edit, feel free...
So, I have a postgres database, with a users table. I would like to store the users followed by one user, and I really don't see how I could do this. I would like to run something like SELECT followed_users FROM users WHERE username='username' and have it return the username, or id, or whatever of each followed user. But I don't see any clean way to do this.
Maybe an example would be more descriptive: user1 is following user2 and user3.
How to store who user1 is following?
EDIT: I don't know how many users the user will follow.
Thank you for your help.
Expanding on my comment above, since it got wordy:
Create a new table called something like user_follows with columns like
user_id1 | user_id2
or
follower_id | follows_id
Then you can query:
SELECT t1.username as follower_username, t3.username as following_username
FROM users t1
INNER JOIN user_follows t2 ON t1.user_id = t2.follower_id
INNER JOIN users t3 ON t2.follows_id = t3.user_id
WHERE t1.user_id = <your user>
In the end, think of your tables as "Objects". Then when you are presented with a problem like "How do I add users that are following other users" you can determine if this relationship is a new object, or an attribute of an existing object. Since a user might follow more than one other user, the relationship is not a good attribute for "Users", so it gets its own table, user_follows.
Since user_follows is just one type of relationship that two users may have to one another, it might make sense to increase the scope of that object to relationships and store the relationship type as an attribute of the table:
user_id1 | user_id2 | relationship_type
where relationships.relationship_type might have values like follows, student of, sister of etc...
So the new query would be something like:
SELECT t1.username as follower_username, t3.username as following_username
FROM users t1
INNER JOIN relationships t2 ON t1.user_id = t2.user_id1
INNER JOIN users t3 ON t2.user_id2 = t3.user_id
WHERE t1.user_id = <your user> AND t2.relationship_type = 'follows';
I'd add another table, let's call it following for argument's sake, which saves pairs of users and users they are following:
CREATE TABLE following (
user_id INT NOT NULL REFERENCES users(id),
following_id INT NOT NULL REFERENCES users(id),
PRIMARY KEY (user_id, following_id)
)
Then you could query all the users a specific user is following by joining with the users table (twice). E.g., to get the names of all the users that I (username "mureinik") am following:
SELECT fu.username
FROM following f
JOIN users u ON f.user_id = u.id
JOIN users fu ON f.following_id = fu.id
WHERE u.username = 'mureinik'

Issues with subqueries for stored procedure

The query I am trying to perform is
With getusers As
(Select userID from userprofspecinst_v where institutionID IN
(select institutionID, professionID from userprofspecinst_v where userID=@UserID)
and professionID IN
(select institutionID, professionID from userprofspecinst_v where userID=@UserID))
select username from user where userID IN (select userID from getusers)
Here's what I'm trying to do. Given a userID and a view which contains the userID and the ID of their institution and profession, I want to get the list of other userIDs who also have the same institutionID and professionID. Then with that list of userIDs I want to get the usernames that correspond to each userID from another table (user). The error I am getting when I try to create the procedure is, "Only one expression can be specified in the select list when the subquery is not introduced with EXISTS.". Am I taking the correct approach to how I should build this query?
The following query should do what you want to do:
SELECT u.username
FROM user AS u
INNER JOIN userprofspecinst_v AS up ON u.userID = up.userID
INNER JOIN (SELECT institutionID, professionID FROM userprofspecinst_v
WHERE userID = @UserID) AS ProInsts
ON (up.institutionID = ProInsts.institutionID
AND up.professionID = ProInsts.professionID)
Effectively the crucial part is the last INNER JOIN statement - this creates a table consisting of the institution ids and profession ids the user id belongs to. We then get all matching items in the view with the same institution id and profession id (the ON condition) and then link these back to the user table on the corresponding user ids (the first JOIN).
You can either run this for each user id you are interested in, or JOIN onto the result of a query (your getusers) (it depends on what database engine you are running).
If you aren't familiar with JOIN's, Jeff Atwood's introductory post is a good starting place.
The JOIN statement effectively allows you to exploit the logical links between your tables - the userID, institutionID and professionID are all examples of candidates for foreign keys - so, rather than having to constantly subquery each table and piece the results together, you can link all the tables together and filter down to the rows you want. It's usually a cleaner, more maintainable approach (although that is opinion).
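As an illustration of what the join computes, here is a rough in-memory Ruby equivalent. The hash keys are made-up stand-ins for the view's columns. Like the query above, it also returns the user themselves, since their own rows match too.

```ruby
# Each hash mirrors one row of the view (hypothetical key names).
def users_sharing_affiliation(view_rows, user_id)
  # The (institution, profession) pairs of the given user: the inner SELECT.
  wanted = view_rows
    .select { |r| r[:user_id] == user_id }
    .map { |r| [r[:institution_id], r[:profession_id]] }

  # Every row matching one of those pairs on BOTH columns: the ON condition.
  view_rows
    .select { |r| wanted.include?([r[:institution_id], r[:profession_id]]) }
    .map { |r| r[:user_id] }
    .uniq
end

view_rows = [
  { user_id: 1, institution_id: 10, profession_id: 100 },
  { user_id: 2, institution_id: 10, profession_id: 100 },
  { user_id: 3, institution_id: 10, profession_id: 200 },
  { user_id: 4, institution_id: 20, profession_id: 100 }
]
matches = users_sharing_affiliation(view_rows, 1)
# => [1, 2]
```

Matching on the pair, rather than on each column separately, is what keeps users 3 and 4 out of the result.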

Finding unique records, ordered by field in association, with PostgreSQL and Rails 3?

UPDATE: So thanks to @Erwin Brandstetter, I now have this:
def self.unique_users_by_company(company)
users = User.arel_table
cards = Card.arel_table
users_columns = User.column_names.map { |col| users[col.to_sym] }
cards_condition = cards[:company_id].eq(company.id).
and(cards[:user_id].eq(users[:id]))
User.joins(:cards).where(cards_condition).group(users_columns).
order('min(cards.created_at)')
end
... which seems to do exactly what I want. There are two shortcomings that I would still like to have addressed, however:
The order() clause is using straight SQL instead of Arel (couldn't figure it out).
Calling .count on the query above gives me this error:
NoMethodError: undefined method 'to_sym' for
#<Arel::Attributes::Attribute:0x007f870dc42c50> from
/Users/neezer/.rvm/gems/ruby-1.9.3-p0/gems/activerecord-3.1.1/lib/active_record/relation/calculations.rb:227:in
'execute_grouped_calculation'
... which I believe is probably related to how I'm mapping out the users_columns, so I don't have to manually type in all of them in the group clause.
How can I fix those two issues?
ORIGINAL QUESTION:
Here's what I have so far that solves the first part of my question:
def self.unique_users_by_company(company)
users = User.arel_table
cards = Card.arel_table
cards_condition = cards[:company_id].eq(company.id)
.and(cards[:user_id].eq(users[:id]))
User.where(Card.where(cards_condition).exists)
end
This gives me 84 unique records, which is correct.
The problem is that I need those User records ordered by cards[:created_at] (whichever is earliest for that particular user). Appending .order(cards[:created_at]) to the scope at the end of the method above does absolutely nothing.
I tried adding in a .joins(:cards), but that returns 587 records, which is incorrect (duplicate Users). group_by as I understand it is practically useless here as well, because of how PostgreSQL handles it.
I need my result to be an ActiveRecord::Relation (so it's chainable) that returns a list of unique users who have cards that belong to a given company, ordered by the creation date of their first card... with a query that's written in Ruby and is database-agnostic. How can I do this?
class Company
has_many :cards
end
class Card
belongs_to :user
belongs_to :company
end
class User
has_many :cards
end
Please let me know if you need any other information, or if I wasn't clear in my question.
The query you are looking for should look like this one:
SELECT user_id, min(created_at) AS min_created_at
FROM cards
WHERE company_id = 1
GROUP BY user_id
ORDER BY min(created_at)
You can join in the users table if you need columns of that table in the result; otherwise you don't even need it for the query.
If you don't need min_created_at in the SELECT list, you can just leave it out.
Should be easy to translate to Ruby (which I am no good at).
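To show what the query computes, here is a plain Ruby sketch over in-memory card hashes. It is not an ActiveRecord translation, and the method name and sample data are assumptions made for the example.

```ruby
require "date"

# Each hash mirrors one row of the cards table.
def user_ids_by_first_card(cards, company_id)
  cards
    .select { |c| c[:company_id] == company_id }                 # WHERE company_id = ...
    .group_by { |c| c[:user_id] }                                # GROUP BY user_id
    .map { |uid, user_cards| [uid, user_cards.map { |c| c[:created_at] }.min] }
    .sort_by { |_uid, first_created_at| first_created_at }       # ORDER BY min(created_at)
end

cards = [
  { user_id: 1, company_id: 1, created_at: Date.new(2012, 1, 5) },
  { user_id: 1, company_id: 1, created_at: Date.new(2012, 1, 2) },
  { user_id: 2, company_id: 1, created_at: Date.new(2012, 1, 1) },
  { user_id: 3, company_id: 2, created_at: Date.new(2011, 12, 31) }
]
ordered = user_ids_by_first_card(cards, 1)
ordered.map(&:first)
# => [2, 1]
```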
To get the whole user record (as I derive from your comment):
SELECT u.*
FROM users u
JOIN (
SELECT user_id, min(created_at) AS min_created_at
FROM cards
WHERE company_id = 1
GROUP BY user_id
) c ON u.id = c.user_id
ORDER BY min_created_at
Or:
SELECT u.*
FROM users u
JOIN cards c ON u.id = c.user_id
WHERE c.company_id = 1
GROUP BY u.id, u.col1, u.col2, .. -- You have to spell out all columns!
ORDER BY min(c.created_at)
With PostgreSQL 9.1+ you can simply write:
GROUP BY u.id
(like in MySQL) .. provided id is the primary key.
I quote the release notes:
Allow non-GROUP BY columns in the query target list when the primary
key is specified in the GROUP BY clause (Peter Eisentraut)
The SQL standard allows this behavior, and because of the primary key,
the result is unambiguous.
The fact that you need it to be chainable complicates things, otherwise you can either drop down into SQL yourself or only select the column(s) you need via select("users.id") to get around the Postgres issue. Because at the heart of it your query is something like
SELECT users.id
FROM users
INNER JOIN cards ON users.id = cards.user_id
WHERE cards.company_id = 1
GROUP BY users.id, DATE(cards.created_at)
ORDER BY DATE(cards.created_at) DESC
Which in ActiveRecord syntax is more or less:
User.select("users.id").joins(:cards).where(:"cards.company_id" => company.id).group("users.id, DATE(cards.created_at)").order("DATE(cards.created_at) DESC")

"subquery returns more than 1 row" error.

I am new to web programming and I'm trying to make a twitter-clone. At this point, I have 3 tables:
users (id, name)
id is the auto-generated id
name of the user
tweets (id, content, user_id)
id is the auto-generated id
content is the text of the tweet
user_id is the id of the user that made the post
followers (id, user_id, following_id)
id is the auto-generated id
user_id is the user who is doing the following
following_id is the user that is being followed
So, being new to SQL as well, I am trying to build an SQL statement that returns the tweets of the currently logged-in user and of everyone he follows.
I tried to use this statement, which works sometimes, but other times, I get an error that says "Subquery returns more than 1 row". Here is the statement:
SELECT * FROM tweets
WHERE user_id IN
((SELECT following_id FROM followers
WHERE user_id = 1),1) ORDER BY date DESC
I put 1 as an example here, which would be the id of the currently logged-in user.
I haven't had any luck with this statement; any help would be greatly appreciated! Thank you.
In one comment you ask is it generally better to use a subquery or a union. Unfortunately, there is no simple answer, just some information.
Some varieties of SQL have problems optimising the IN clause if the list is large, and may perform better in any of the following ways...
SELECT * FROM tweets
INNER JOIN followers ON tweets.user_id = followers.following_id
WHERE followers.user_id = 1
UNION ALL
SELECT * FROM tweets
WHERE user_id = 1
Or...
SELECT
*
FROM
tweets
INNER JOIN
(SELECT following_id FROM followers WHERE user_id = 1 UNION SELECT 1) AS followers
ON tweets.user_id = followers.following_id
Or...
SELECT
*
FROM
tweets
WHERE
EXISTS (SELECT * FROM followers WHERE following_id = tweets.user_id and user_id = 1)
OR user_id = 1
There are many, many alternatives...
Some varieties of SQL struggle to optimise the OR condition, and end up checking every record in the tweets table instead of utilising an INDEX. This would make the UNION option preferable, because each half of the query will then benefit from an index on the user_id field.
But you CAN actually refactor this corner case out of your code altogether: make every user a follower of themselves. This would then mean that getting tweets for followers would naturally include the user themselves. Whether this would make sense in all cases is dependent on your design and other functional requirements.
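A small Ruby sketch of the self-follow idea, on in-memory data with made-up names: once every user has a row following themselves, a single lookup through the followers rows covers both cases and no OR or UNION is needed.

```ruby
# follows: [user_id, following_id] pairs, mirroring the followers table.
# tweets: one hash per row of the tweets table.
def timeline(tweets, follows, user_id)
  followed = follows.select { |uid, _fid| uid == user_id }.map { |_uid, fid| fid }
  tweets.select { |t| followed.include?(t[:user_id]) }
end

follows = [[1, 1], [1, 2], [2, 2]]   # every user also follows themselves
tweets = [
  { id: 10, user_id: 1 },
  { id: 11, user_id: 2 },
  { id: 12, user_id: 3 }
]
ids = timeline(tweets, follows, 1).map { |t| t[:id] }
# => [10, 11]
```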
In short, your best bet is to create some representative data and test the options. But I wouldn't really worry about it for now. As long as you encapsulate this code in one place, you can just pick one that you are most comfortable with. Then, when you have the rest of the system hashed out, and you're much more confident that things won't change, you can go back and optimise if necessary.
SELECT *
FROM tweets
WHERE
user_id IN (SELECT following_id FROM followers WHERE user_id = 1)
OR user_id = 1
ORDER BY date DESC
try this
SELECT * FROM tweets WHERE user_id = [YourUser]
UNION
SELECT * FROM tweets WHERE user_id in (SELECT following_id FROM followers WHERE user_id = [YourUser])
This should work even if there are no rows in followers for your user.
There's also a solution with joins, but actually I'm in a hurry. Will try to write the query as soon as I have the time to. Some other will probably answer by then. Sorry.