'Autosuggestion' feature implementation for a webapp - sql

I'm developing a web application and have two models (among others), users and items, with a many-to-many association. So I have tables 'users', 'items' and 'items_users' with primary key 'id' and foreign keys user_id and item_id.
What I'm going to have is an 'autosuggestion' feature. If, say, I as a user mark a certain item as good, the system is supposed to suggest n items I would most probably also mark as good. A reasonable criterion for autosuggestion is how many users who liked the first item also like another one. If all users who like tea also like a teapot, then the teapot takes the top position for autosuggestion.
This is basic functionality, I'll also filter some results but the rest doesn't matter. I'm thinking about some kind of an auxiliary table for fast calculation on demand or scheduling a separate process to calculate n suggestions.
Thank you for any related information!
UPD
The question sounded unclear. I have an SQL db and Sinatra with the Sequel ORM. I'm asking how to calculate the dataset of most similar items (the cheapest, least resource-consuming approach). How would you implement it?

So, generally you want to select all users who liked the same products, then get the products they like, count the number of likes for each product, and output the most liked products.
Let's see how this would look in SQL:
Step 1: Get the ids of your favourites
SELECT it.item_id FROM `items_users` it WHERE it.user_id = %current_user%
Step 2: Get the users who like the same items
SELECT u.id FROM `items_users` it, `users` u WHERE it.item_id IN (
    SELECT it.item_id FROM `items_users` it WHERE it.user_id = %current_user%
) AND it.user_id != %current_user% AND u.id = it.user_id GROUP BY it.user_id
Step 3: Get their favourites, skipping the items you already like
The entire SQL query would look like this:
SELECT i.* FROM `items` i, `items_users` it WHERE it.user_id IN (
    SELECT u.id FROM `items_users` it, `users` u WHERE it.item_id IN (
        SELECT it.item_id FROM `items_users` it WHERE it.user_id = %current_user%
    ) AND it.user_id != %current_user% AND u.id = it.user_id GROUP BY it.user_id
) AND i.id = it.item_id
AND i.id NOT IN (
    -- skip items the current user has already marked as good
    SELECT it.item_id FROM `items_users` it WHERE it.user_id = %current_user%
)
GROUP BY i.id ORDER BY count(*) DESC
Your task is to add limiting of the results...
UPDATE:
I guess that you would like to get the most popular products first. I've changed the query to add that functionality (added ORDER BY count(*) DESC to the end).
This is a complex query, and implementing it through ActiveRecord would be quite slow and even more complicated, so I would recommend using the query as is.

Use your link table to join users and items.
Apply following filters in your WHERE-Clause:
- users that liked the item ("marked it as good")
- items that the current user has not already marked as good
Sort descending by the number of likes (you'll need to group by the item id and count the users).
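The join-filter-sort recipe above can be sketched as follows. This is only an illustration under assumptions: it runs against an in-memory SQLite database through Python's sqlite3 module, the function name suggest_items and the sample rows are made up, and only the link table 'items_users' with its user_id and item_id columns comes from the question.

```python
import sqlite3

def suggest_items(conn, user_id, limit=5):
    """Rank items the user has not liked yet by the number of
    distinct users who share at least one liked item with them."""
    sql = """
        SELECT other.item_id, COUNT(DISTINCT other.user_id) AS score
        FROM items_users AS mine
        JOIN items_users AS peer  ON peer.item_id = mine.item_id
                                 AND peer.user_id <> mine.user_id
        JOIN items_users AS other ON other.user_id = peer.user_id
        WHERE mine.user_id = ?
          AND other.item_id NOT IN (
              SELECT item_id FROM items_users WHERE user_id = ?)
        GROUP BY other.item_id
        ORDER BY score DESC, other.item_id
        LIMIT ?
    """
    return conn.execute(sql, (user_id, user_id, limit)).fetchall()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items_users (user_id INTEGER, item_id INTEGER)")
# Hypothetical sample data: item 1 = tea, 2 = teapot, 3 = coffee
conn.executemany(
    "INSERT INTO items_users VALUES (?, ?)",
    [(1, 1), (2, 1), (2, 2), (3, 1), (3, 2), (4, 1), (4, 3)],
)
suggestions = suggest_items(conn, 1)
print(suggestions)
```

With this sample data, user 1 likes only tea; two of the other tea drinkers like the teapot and one likes coffee, so the teapot ranks first.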

Related

SQL query to exclude records that are part of a group

I can't believe this hasn't been answered elsewhere, but I don't seem to know the right words to convey what I'm trying to do. I'm using Ruby/Rails and PostgreSQL.
I have a bunch of Users in the DB that I'm trying to add to a Group based on a name search. I need to return Users that do not belong to a particular Group, but there is a join table as well (UserGroups, with the appropriate FKs).
Is there a simple way to use this configuration to perform this query without having to resort to grabbing all the Users who belong to the group and doing something like .where.not(id: users_in_group.pluck(:id))? (These groups can be pretty huge, so I don't want to send that query to the DB on a text search as the user types.)
I need to return Users that do not belong to a particular Group
SELECT *
FROM   users u
WHERE  username ~ 'some pattern'  -- ?
AND    NOT EXISTS (
   SELECT FROM user_groups ug
   WHERE  ug.group_id = 123  -- your group_id to exclude here
   AND    ug.user_id = u.id
   );
See:
Select rows which are not present in other table

Performance of search query bottlenecked 98% by mutual friends despite caching

So on my social networking website, similar to Facebook, about 98% of my search time is spent in this one part. I want to rank the results by the number of mutual friends the searching user has with each result (we can assume the results are users).
My friends table has 3 columns -
user_id (person who sends the request)
friend_id (person who receives the request)
pending (boolean to indicate if the request was accepted or not)
user_id and friend_id are both foreign keys that reference users.id
Finding friend_ids of a user is simple, it looks like this
def friends
  Friend.where(
    '(user_id = :id OR friend_id = :id) AND pending = false',
    id: self.id
  ).pluck(:user_id, :friend_id)
   .flatten
   .uniq
   .reject { |id| id == self.id }
end
So, after getting the results that match the search query, ranking them by mutual friends requires the following steps:
1. Get the user_ids of all the searching user's friends: Set(A). The friends method above does this.
2. Loop over each of the ids in Set(A):
   - Get the user_ids of all the friends of |id|: Set(B). Again, done by the friends method.
   - Find the size of the intersection of Set(A) and Set(B).
3. Order all results by intersection size, descending.
The most expensive operation here is obviously getting the friend_ids of hundreds of users. So I cached the friend_ids of all the users to speed it up. The difference in performance was amazing, but I'm curious if it can be further improved.
I'm wondering if there is a way that I can get friend_ids of all the desired users in a single query, that is efficient. Something like -
SELECT user_id, [array of friend_ids of the user with id = user_id]
FROM friends
....
Can someone help me write a fast SQL or ActiveRecord query for this?
That way I can store the user_ids of all the search results and their corresponding friend_ids in a hash or some other fast data structure, and then perform the same operation of ranking (that I mentioned above). Since I won't be hitting the cache for thousands of users and their friend_ids, I think it'll speed up the process significantly
Caching your friends table in RAM is not a viable approach if you expect your site to grow to large numbers of users, but I'm sure it does great for a smallish number of users.
It is to your advantage to get the most work you can out of the database with as few calls as possible. It is inefficient to issue large numbers of queries, as the overhead per query is comparatively large. Moreover, databases are built for the kind of task you're trying to perform. I think you are doing far too much work on the Ruby side, and you ought to let the database do the kind of work it does best.
You did not give many details, so I decided to start by defining a minimal model DB:
create table users (
user_id int not null primary key,
nick varchar(32)
);
create table friends (
user_id int not null,
friend_id int not null,
pending bool,
primary key (user_id, friend_id),
foreign key (user_id) references users(user_id),
foreign key (friend_id) references users(user_id),
check (user_id < friend_id)
);
The check constraint on friends avoids the same pair of users being listed in the table in both orders, and of course the PK prevents the same pair from being enrolled multiple times in the same order. The PK also automatically has a unique index associated with it.
Since I suppose the 'is a friend of' relation is supposed to be logically symmetric, it is convenient to define a view that presents that symmetry:
create view friends_symmetric (user_id, friend_id) as (
select user_id, friend_id from friends where not pending
union all
select friend_id, user_id from friends where not pending
);
(If friendship is not symmetric then you can drop the check constraint and the view, and use table friends in place of friends_symmetric in what follows.)
As a model query whose results you want to rank, then, I take this:
select * from users where nick like 'Sat%';
The objective is to return result rows in descending order of the number of friends each hit has in common with User1, the user on whose behalf the query is run. You might do that like so:
(update: modified this query to filter out duplicate results)
select *
from (
  select
    u.*,
    count(mutual.shared_friend_id) over (partition by u.user_id) as num_shared,
    row_number() over (partition by u.user_id) as copy_num
  from
    users u
    left join (
      select
        f1.friend_id as shared_friend_id,
        f2.friend_id as friend_id
      from friends_symmetric f1
        join friends_symmetric f2
          on f1.friend_id = f2.user_id
      where f1.user_id = ?
        and f2.friend_id != f1.user_id
    ) mutual
      on u.user_id = mutual.friend_id
  where u.nick like 'Sat%'
) all_rows
where copy_num = 1
order by num_shared desc
where the ? is a placeholder for a parameter containing the ID of the User1.
Edited to add:
I have structured this query with window functions instead of an aggregate query with the idea that such a structure will be easier for the query planner to optimize. Nevertheless, the inline view "mutual" could instead be structured as an aggregate query that computes the number of shared friends that the searching user has with every user that shares at least one friend, and that would permit one level of inline view to be avoided. If performance of the provided query is or becomes inadequate, then it would be worthwhile to test that variant.
There are other ways to approach the problem of performing the sorting in the DB, some of which may perform better, and there may be ways to improve the performance of each by tweaking the database (adding indexes or constraints, modifying table definitions, computing db statistics, ...).
I cannot predict whether that query will outperform what you're doing now, but I assure you that it scales better, and it is easier to maintain.
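The aggregate variant mentioned above (counting shared friends inside the inline view instead of using window functions) might look roughly like the sketch below. It is untested against PostgreSQL: it runs on an in-memory SQLite database through Python's sqlite3 module, the function name ranked_search and the sample rows are hypothetical, and the pending flag is stored as an integer because SQLite has no boolean type.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id INTEGER PRIMARY KEY, nick TEXT);
    CREATE TABLE friends (
        user_id   INTEGER NOT NULL,
        friend_id INTEGER NOT NULL,
        pending   INTEGER NOT NULL DEFAULT 0,
        PRIMARY KEY (user_id, friend_id),
        CHECK (user_id < friend_id)
    );
    CREATE VIEW friends_symmetric AS
        SELECT user_id, friend_id FROM friends WHERE NOT pending
        UNION ALL
        SELECT friend_id AS user_id, user_id AS friend_id
        FROM friends WHERE NOT pending;
    -- Hypothetical sample data; the 1-6 friendship is still pending
    INSERT INTO users VALUES
        (1,'Ann'),(2,'Bob'),(3,'Cy'),(4,'Sam'),(5,'Sat'),(6,'Sal');
    INSERT INTO friends VALUES
        (1,2,0),(1,3,0),(2,4,0),(3,4,0),(2,5,0),(1,6,1);
""")

def ranked_search(conn, searcher_id, nick_pattern):
    """Return (user_id, nick, num_shared) for matching users,
    ordered by the number of friends shared with the searcher."""
    sql = """
        SELECT u.user_id, u.nick, COALESCE(m.num_shared, 0) AS num_shared
        FROM users AS u
        LEFT JOIN (
            -- one row per user who shares at least one friend
            SELECT f2.friend_id AS user_id, COUNT(*) AS num_shared
            FROM friends_symmetric AS f1
            JOIN friends_symmetric AS f2 ON f2.user_id = f1.friend_id
            WHERE f1.user_id = ? AND f2.friend_id <> f1.user_id
            GROUP BY f2.friend_id
        ) AS m ON m.user_id = u.user_id
        WHERE u.nick LIKE ?
        ORDER BY num_shared DESC, u.user_id
    """
    return conn.execute(sql, (searcher_id, nick_pattern)).fetchall()

results = ranked_search(conn, 1, "Sa%")
print(results)
```

Because the inline view already yields one row per candidate, the outer query needs neither row_number() nor a second nesting level. Whether it actually runs faster is something to measure on real data.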
Assuming that you want a relation of the User model whose primary key is id, you should be able to join onto a subquery that calculates the number of mutual friends:
class User < ActiveRecord::Base
def other_users_ordered_by_mutual_friends
self.class.select("users.*, COALESCE(f.friends_count, 0) AS friends_count").joins("LEFT OUTER JOIN (
SELECT all_friends.user_id, COUNT(DISTINCT all_friends.friend_id) AS friends_count FROM (
SELECT f1.user_id, f1.friend_id FROM friends f1 WHERE f1.pending = false
UNION ALL
SELECT f2.friend_id AS user_id, f2.user_id AS friend_id FROM friends f2 WHERE f2.pending = false
) all_friends INNER JOIN (
SELECT DISTINCT f1.friend_id AS user_id FROM friends f1 WHERE f1.user_id = #{id} AND f1.pending = false
UNION ALL
SELECT DISTINCT f2.user_id FROM friends f2 WHERE f2.friend_id = #{id} AND f2.pending = false
) user_friends ON user_friends.user_id = all_friends.friend_id GROUP BY all_friends.user_id
) f ON f.user_id = users.id").where.not(id: id).order("friends_count DESC")
end
end
The subquery selects all user IDs with associated friends and inner joins that to another select with all of the current user's friends' IDs. Since it groups by the user_id and selects the count, we get the number of mutual friends for each user_id. I have not tested this since I don't have any sample data, but it should work.
Since this returns a scope, you can chain other scopes/conditions to the relation:
current_user.other_users_ordered_by_mutual_friends.where(attribute1: value1).reorder(:attribute2)
The select scope as written will also give you access to the field friends_count on instances within the relation:
<%- current_user.other_users_ordered_by_mutual_friends.each do |user| -%>
<p>User <%= user.id -%> has <%= user.friends_count -%> mutual friends.</p>
<%- end -%>
John had a great idea with the friends_symmetric view. With two filtered indexes (one on (friend_id, user_id) and the other on (user_id, friend_id)) it's gonna work great.
However, the query can be a bit simpler:
WITH user_friends AS (
    SELECT user_id, array_agg(friend_id) AS friends
    FROM friends_symmetric
    WHERE user_id = :user_id -- id of our user
    GROUP BY user_id
)
SELECT u.*
    ,array_agg(f.friend_id) AS shared_friends -- aggregated ids of friends in case they are needed for something
    ,count(*) AS shared_count
FROM user_friends AS uf
JOIN friends_symmetric AS f
    ON f.user_id = ANY(uf.friends) AND f.friend_id = ANY(uf.friends)
JOIN users AS u -- the original joined "user" without declaring the alias u
    ON u.user_id = f.user_id
WHERE u.nick LIKE 'Sat%' -- nickname of our user's friend
GROUP BY u.user_id

SQL Nearest Neighbor Query (Movie Recommendation Algorithm)

Need help making this (sort of) working query more dynamic.
I have three tables myShows, TVShows and Users
myShows
ID (PK)
User (FK to Users)
Show (FK to TVShows)
Would like to take this query and change it to a stored procedure that I can send a User ID into and have it do the rest...
SELECT showId, name, Count(1) AS no_users
FROM
myShows LEFT OUTER JOIN
tvshows ON myShows.Show = tvshows.ShowId
WHERE
[user] IN (
SELECT [user]
FROM
myShows
WHERE
show ='1' or show='4'
)
AND
show <> '1' and show <> '4'
GROUP BY
showId, name
ORDER BY
no_users DESC
This works right now. But as you can see, the problem lies in the WHERE (show ='1' or show='4') and the AND (show <> '1' and show <> '4') conditions, which are currently hard-coded; that's what I need to be dynamic, since I have no idea whether the user has 3 or 30 shows I need to check against.
Also, how inefficient is this process? This will be used for an iPad application that might get a lot of users. I currently run a movie API (IMDbAPI.com) that gets about 130k hits an hour and had to do a lot of database/code optimization to make it run fast. Thanks again!
If you want the database schema for testing let me know.
This will meet your requirements
select name, count(distinct [user]) from myshows recommend
inner join tvshows on recommend.show = tvshows.showid
where [user] in
(
select other.[user] from
( select show from myshows where [User] = #user ) my,
( select show, [user] from myshows where [user] <> #user ) other
where my.show = other.show
)
and show not in ( select show from myshows where [User] = #user )
group by name
order by count(distinct [user]) desc
If your SQL platform supports WITH Common Table Expressions, the above can be optimized to use them.
Will it be efficient as the data sizes increase? No.
Will it be effective? No. If just one user shares a show with your selected user, and they watch a popular show, then that popular show will rise to the top of the ranking.
I'd recommend
a) reviewing your thinking of what recommends a show
b) periodically calculating the results rather than performing it on demand.
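Option b) could be sketched like this: a scheduled job rebuilds an auxiliary table, and the application only reads from it. Everything here is an assumption for illustration: the snake_case names (my_shows, recommendations, refresh_recommendations, top_recommendations) are hypothetical stand-ins for the question's myShows table, the scoring is the same simple co-occurrence count criticised above, and the sketch uses SQLite through Python's sqlite3 module rather than SQL Server.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE my_shows (
        user_id INTEGER, show_id INTEGER,
        PRIMARY KEY (user_id, show_id));
    CREATE TABLE recommendations (
        user_id INTEGER, show_id INTEGER, score INTEGER,
        PRIMARY KEY (user_id, show_id));
    -- Hypothetical sample data
    INSERT INTO my_shows VALUES (1,1),(1,4),(2,1),(2,5),(3,4),(3,5),(3,6);
""")

def refresh_recommendations(conn):
    """Rebuild the precomputed table; meant to be called by a scheduled job."""
    with conn:  # delete and insert are committed together
        conn.execute("DELETE FROM recommendations")
        conn.execute("""
            INSERT INTO recommendations (user_id, show_id, score)
            SELECT mine.user_id, other.show_id,
                   COUNT(DISTINCT other.user_id)
            FROM my_shows AS mine
            JOIN my_shows AS peer  ON peer.show_id = mine.show_id
                                  AND peer.user_id <> mine.user_id
            JOIN my_shows AS other ON other.user_id = peer.user_id
            WHERE NOT EXISTS (      -- skip shows the user already has
                SELECT 1 FROM my_shows AS seen
                WHERE seen.user_id = mine.user_id
                  AND seen.show_id = other.show_id)
            GROUP BY mine.user_id, other.show_id
        """)

def top_recommendations(conn, user_id, limit=10):
    """Cheap on-demand read of the precomputed scores."""
    return conn.execute(
        "SELECT show_id, score FROM recommendations "
        "WHERE user_id = ? ORDER BY score DESC, show_id LIMIT ?",
        (user_id, limit)).fetchall()

refresh_recommendations(conn)
print(top_recommendations(conn, 1))
```

The expensive self-join then runs once per refresh instead of once per request, at the cost of recommendations that can be stale until the next run.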

How to combine data from 2 tables under circumstances?

I have 2 tables. One table contains posts and the other contains votes for the posts. Each member can vote (+ or -) for each post.
(Structure example:)
Posts table: pid, belongs, userp, text.
Votes table: vid, userv, postid, vote.
Also one table which contains the info for the users.
What I want: suppose I am a logged-in member. I want to show all the posts and, for those I've already voted on, not let me vote again (and show me whether I voted + or -).
What I have done so far is very bad, as it runs a lot of queries:
SELECT `posts`.*, `users`.`username`
FROM `posts`,`users`
WHERE `posts`.belongs=$taken_from_url AND `users`.`usernumber`=`posts`.`userp`
ORDER BY `posts`.`pid` DESC;
and then:
foreach ($query as $result) {if (logged_in) {select vote from votes....etc} }
So, this means that if I am logged in and it shows 30 posts, it will run 30 queries to check whether I have voted on each post and what I voted. My question is: can I do this more compactly with a JOIN (I guess), and how? (I already tried something, but didn't succeed.)
Firstly I'll say that if you're going to have significantly different output for users logged in versus those that aren't, just have two queries rather than trying to create something really complicated.
Secondly, this should do something like what you want:
SELECT p.*, u.username,
(SELECT SUM(vote) FROM votes WHERE postid = p.pid) total_votes,
(SELECT vote FROM votes WHERE postid = p.pid AND userv = $logged_in_user_id) my_vote
FROM posts p
JOIN users u ON p.userp = u.usernumber
WHERE p.belongs = $taken_from_url
ORDER BY p.pid DESC
Note: You don't say what the values of the votes table are. I'm assuming it's either +1 (up) or -1 (down) so you can easily find the total votes by adding them up. If you're not doing it this way I suggest you do to make your life easier.
The first correlated subquery can be eliminated by doing a JOIN and GROUP BY but I tend to find the above form much more readable.
So what this does is it joins users to posts, much like you were doing except that it uses JOIN syntax (which again comes down to readability). Then it has two subqueries: the first finds the total votes for that particular post and the second finds out what a particular user's vote was:
+1: up vote;
-1: down vote;
NULL: no vote.
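For comparison, the JOIN and GROUP BY form mentioned above might look like the sketch below. It is an illustration on an in-memory SQLite database via Python's sqlite3 module, with made-up sample rows and a hypothetical function name posts_with_votes. It assumes votes are stored as +1 or -1 and that each user has at most one vote per post. Unlike the subquery version, it reports 0 rather than NULL as the total for posts without votes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (usernumber INTEGER PRIMARY KEY, username TEXT);
    CREATE TABLE posts (pid INTEGER PRIMARY KEY, belongs INTEGER,
                        userp INTEGER, text TEXT);
    CREATE TABLE votes (vid INTEGER PRIMARY KEY, userv INTEGER,
                        postid INTEGER, vote INTEGER);
    -- Hypothetical sample data
    INSERT INTO users VALUES (1,'ann'),(2,'bob'),(3,'cy');
    INSERT INTO posts VALUES
        (10,7,1,'first'),(11,7,2,'second'),(12,7,1,'third');
    INSERT INTO votes (userv, postid, vote) VALUES
        (2,10,1),(3,10,1),(1,11,-1),(3,11,1);
""")

def posts_with_votes(conn, belongs, viewer_id):
    """One query: each post with its vote total and the viewer's own vote."""
    sql = """
        SELECT p.pid, u.username,
               COALESCE(SUM(v.vote), 0) AS total_votes,
               -- the viewer's vote, or NULL if they have not voted
               MAX(CASE WHEN v.userv = ? THEN v.vote END) AS my_vote
        FROM posts AS p
        JOIN users AS u ON u.usernumber = p.userp
        LEFT JOIN votes AS v ON v.postid = p.pid
        WHERE p.belongs = ?
        GROUP BY p.pid, u.username
        ORDER BY p.pid DESC
    """
    return conn.execute(sql, (viewer_id, belongs)).fetchall()

rows = posts_with_votes(conn, 7, 1)
print(rows)
```

Binding viewer_id and belongs as parameters, rather than interpolating $logged_in_user_id and $taken_from_url into the SQL string, also avoids SQL injection.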

selecting and displaying ranked items and a user's votes, a la reddit, digg, et al

when selecting ranked objects from a database (eg, articles users have voted on), what is the best way to show:
the current page of items
the user's rating, per item (if they've voted)
rough schema:
articles: id, title, content, ...
user: id, username, ...
votes: id, user_id, article_id, vote_value
is it better/ideal to:
select the current page of items
select the user's vote, limiting them to the page of items with an 'IN' clause
or
select the current page of items and just 'JOIN' vote data from the table of user votes
or, something entirely different?
this is theoretically in a high-traffic environment, and using an rdbms like mysql. fwiw, i see this on the side of "thinking it out before doing" and not "premature optimization."
thanks!
The JOIN would be faster; it would save a round trip to the database.
However, I wouldn't worry at all about this until you actually get some traffic. Many people have spoken out against premature optimization, I'll quote a random one:
"More computing sins are committed in the name of efficiency (without necessarily achieving it) than for any other single reason - including blind stupidity."
If you need to order on votes, use this:
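SELECT  *
FROM    (
        SELECT  a.*,
                (
                SELECT  SUM(vote_value)
                FROM    votes v
                WHERE   v.article_id = a.id
                ) AS votes
        FROM    articles a
        ) ranked -- MySQL requires an alias on every derived table
ORDER BY
        votes DESC
LIMIT 100, 110
This will count the votes and paginate in a single query. (In MySQL, LIMIT 100, 110 skips the first 100 rows and returns up to 110; use LIMIT 100, 10 for a page of ten.)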
If you want to show only the user's own votes, use LEFT JOIN:
SELECT a.*, vote_value
FROM articles a
LEFT JOIN
votes v
ON v.user_id = #current_user
AND v.article_id = a.id
ORDER BY
a.timestamp DESC
LIMIT 100, 110
Having an index on votes (user_id, article_id) will greatly improve this query.
Note that you can make (user_id, article_id) the PRIMARY KEY for votes, which will improve this query even more.
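A minimal sketch of the composite key suggested above, using the user_id and article_id column names from the question's schema. It assumes the surrogate id column on votes can be dropped, orders by article id because the schema does not show a timestamp column, and runs on SQLite through Python's sqlite3 module rather than MySQL. The function name page_with_my_votes and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE votes (
        user_id    INTEGER NOT NULL,
        article_id INTEGER NOT NULL,
        vote_value INTEGER NOT NULL,
        PRIMARY KEY (user_id, article_id)  -- one vote per user per article
    );
    -- Hypothetical sample data
    INSERT INTO articles VALUES (1,'a'),(2,'b'),(3,'c');
    INSERT INTO votes VALUES (7,1,1),(7,3,-1),(8,1,1);
""")

def page_with_my_votes(conn, user_id, offset, page_size):
    """One page of articles with the given user's vote (None if absent)."""
    sql = """
        SELECT a.id, a.title, v.vote_value
        FROM articles AS a
        LEFT JOIN votes AS v
               ON v.user_id = ? AND v.article_id = a.id
        ORDER BY a.id DESC
        LIMIT ? OFFSET ?
    """
    return conn.execute(sql, (user_id, page_size, offset)).fetchall()

page = page_with_my_votes(conn, 7, 0, 2)
print(page)

# The key also rejects a second vote by the same user on the same article.
try:
    conn.execute("INSERT INTO votes VALUES (7, 1, -1)")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True
print(duplicate_rejected)
```

The lookup inside the LEFT JOIN matches on the full key, so each article on the page costs one index probe, and the key doubles as the rule that a user votes at most once per article.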