Get Common Rows Within The Same Table - sql

I've had a bit of a search, but didn't find anything quite like what I'm trying to achieve.
Basically, I'm trying to find a similarity between two users' voting habits.
I have a table storing each individual vote made, which stores:
voteID
itemID (the item the vote is attached to)
userID (the user who voted)
direction (whether the user voted the post up, or down)
I'm aiming to calculate the similarity between, say, users A and B, by finding out two things:
The number of votes they have in common. That is, the number of times they've both voted on the same post (the direction does not matter at this point).
The number of times they've voted in the same direction, on common votes.
(Then simply to calculate #2 as a percentage of #1, to achieve a crude similarity rating).
My question is, how do I find the intersection between the two users' sets of votes? (i.e. how do I calculate point #1 adequately, without looping over every vote in a highly inefficient way.) If they were in different tables, an INNER JOIN would suffice, I'd imagine... but that obviously won't work on the same table (or will it?).
Any ideas would be greatly appreciated.

Something like this:
SELECT COUNT(*)
FROM votes v1
INNER JOIN votes v2 ON (v1.itemID = v2.itemID)
WHERE v1.userID = 'userA'
AND v2.userID = 'userB'

In case you want to do this for a single user (rather than knowing both users at the start) to find to whom they are the closest match:
SELECT
v2.userID,
COUNT(*) AS matching_items,
SUM(CASE WHEN v2.direction = v1.direction THEN 1 ELSE 0 END) AS matching_votes
FROM
Votes v1
INNER JOIN Votes v2 ON
v2.userID <> v1.userID AND
v2.itemID = v1.itemID
WHERE
v1.userID = @userID
GROUP BY
v2.userID
You can then limit that however you see fit (return the top 10, top 20, all, etc.)
I haven't tested this yet, so let me know if it doesn't act as expected.

Here's an example that should get you closer:
SELECT COUNT(*)
FROM (
SELECT u1.userID
FROM vote u1, vote u2
WHERE u1.itemID = u2.itemID
AND u1.userID = user1
AND u2.userID = user2) AS common_votes

Assuming userID 1 is being compared to userID 2.
For finding how many votes they have in common:
SELECT COUNT(*)
FROM Votes AS v1
INNER JOIN Votes AS v2 ON (v2.userID = 2
AND v2.itemID = v1.itemID)
WHERE v1.userID = 1;
For finding when they also voted the same:
SELECT COUNT(*)
FROM Votes AS v1
INNER JOIN Votes AS v2 ON (v2.userID = 2
AND v2.itemID = v1.itemID
AND v2.direction = v1.direction)
WHERE v1.userID = 1;

A self join is in order. Here it is with all you asked:
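SELECT v1.userID AS user1, v2.userID AS user2,
count(*) AS n_votes_in_common,
sum(case when v1.direction = v2.direction then 1 else 0 end) AS n_votes_same_direction,
-- a column alias cannot be reused within the same SELECT list, so the expressions are repeated
sum(case when v1.direction = v2.direction then 1 else 0 end) * 100.0 / count(*) AS crude_similarity_percent
FROM votes v1
INNER JOIN votes v2
ON v1.itemID = v2.itemID
AND v1.userID < v2.userID -- list each pair of users once and never pair a user with themselves
GROUP BY v1.userID, v2.userID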

You most certainly can join a table to itself. In fact, that's what you're going to have to do. You must use aliasing when joining a table to itself. If your table doesn't have a PK or FK, you'll have to use Union instead. Union will remove duplicates and Union All will not.
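To make the self join concrete, here is a minimal sketch using Python's built-in sqlite3 module with an in-memory database. The column names follow the question (voteID, itemID, userID, direction); the sample rows, the +1/-1 encoding of direction, and the helper name vote_similarity are assumptions made purely for illustration. It also assumes each user votes at most once per item.

```python
import sqlite3

def vote_similarity(conn, user_a, user_b):
    """Return (votes_in_common, same_direction, percent) for two users."""
    common, same = conn.execute(
        """
        SELECT COUNT(*),
               COALESCE(SUM(CASE WHEN v1.direction = v2.direction THEN 1 ELSE 0 END), 0)
        FROM votes AS v1
        INNER JOIN votes AS v2
                ON v2.itemID = v1.itemID
        WHERE v1.userID = ? AND v2.userID = ?
        """,
        (user_a, user_b),
    ).fetchone()
    # Avoid dividing by zero when the users share no items.
    percent = 100.0 * same / common if common else None
    return common, same, percent

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE votes (voteID INTEGER PRIMARY KEY, itemID INTEGER, "
    "userID TEXT, direction INTEGER)"
)
conn.executemany(
    "INSERT INTO votes (itemID, userID, direction) VALUES (?, ?, ?)",
    [
        (1, "A", 1), (1, "B", 1),    # same item, same direction
        (2, "A", 1), (2, "B", -1),   # same item, opposite direction
        (3, "A", -1),                # only A voted on this item
        (4, "B", 1),                 # only B voted on this item
    ],
)
result = vote_similarity(conn, "A", "B")
print(result)
```

Both counts come out of a single pass over the joined rows, so the percentage can be computed in application code without a second query.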

Related

Trying to count the number of occurrences that 3 columns from 2 tables have on my organizations table? I need the occurrences joined in one table

-- 2. In one table, show how many private topics, admins, and standard users each organization has.
SELECT organizations.name, COUNT(topics.privacy) AS private_topic, COUNT(users.type) AS user_admin, COUNT(users.type) AS user_standard
FROM organizations
LEFT JOIN topics
ON organizations.id=topics.org_id
AND topics.privacy='private'
LEFT JOIN users
ON users.org_id=organizations.id
AND users.type='admin'
LEFT JOIN users
ON users.org_id=organizations.id
AND users.type='standard'
GROUP BY organizations.name
;
org_id is the foreign key that relates both the users table and the topics table to organizations. The query keeps giving me the wrong result: it counts only either the admins or the standard users and puts that number in every column of each row. Any help is really appreciated, as I have been stuck on this for a while now!
So, I am getting an error when I do as you said: the users table cannot be specified more than once. I updated the code the way you said to write it, but it still doesn't work. I wasn't given any sample data either, but I ran some queries and saw, for example, how many private topics there are (that is in the privacy column of the topics table). When I don't get this error, the joins seem to overwrite each other, so that every column in a row shows the same value as the last join.
It appears to me that topics and users have no relationship. You're just trying to get the result together in a single query. There are other and possibly better ways to accomplish that but I think this will fix what you've got already (assuming you have id columns for each table.)
SELECT
organizations.name,
COUNT(DISTINCT topics.id) AS private_topic,
COUNT(DISTINCT users.id) FILTER (WHERE users.type = 'admin') AS user_admin,
COUNT(DISTINCT users.id) FILTER (WHERE users.type = 'standard') AS user_standard
FROM organizations
LEFT JOIN topics
ON organizations.id = topics.org_id AND topics.privacy = 'private'
LEFT JOIN users
ON users.org_id = organizations.id
GROUP BY organizations.name;
I propose this as a more straightforward way:
SELECT
min(o.name) as "name",
(
select count(*) from topics t
where t.org_id = o.id AND t.privacy = 'private'
) as private_topics,
(
select count(*) from users u
where u.org_id = o.id and u.type = 'admin'
) AS user_admin,
(
select count(*) from users u
where u.org_id = o.id and u.type = 'standard'
) AS user_standard
FROM organizations o
GROUP BY o.id;
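The following sketch, written with Python's built-in sqlite3 module, illustrates why joining both child tables at once inflates plain counts and how the correlated subqueries avoid that. The table and column names come from the question; the sample organizations and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE organizations (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE topics (id INTEGER PRIMARY KEY, org_id INTEGER, privacy TEXT);
CREATE TABLE users (id INTEGER PRIMARY KEY, org_id INTEGER, type TEXT);
INSERT INTO organizations VALUES (1, 'Acme'), (2, 'Globex');
INSERT INTO topics (org_id, privacy)
    VALUES (1, 'private'), (1, 'private'), (1, 'public'), (2, 'public');
INSERT INTO users (org_id, type)
    VALUES (1, 'admin'), (1, 'standard'), (1, 'standard'), (1, 'standard'), (2, 'admin');
""")

# Naive version: every private topic is paired with every user of the same
# organization, so Acme yields 2 x 4 = 8 joined rows and both counts read 8.
naive = conn.execute("""
    SELECT o.name, COUNT(t.id), COUNT(u.id)
    FROM organizations o
    LEFT JOIN topics t ON t.org_id = o.id AND t.privacy = 'private'
    LEFT JOIN users u ON u.org_id = o.id
    GROUP BY o.id
    ORDER BY o.id
""").fetchall()

# Correlated subqueries count each child table independently of the other.
correct = conn.execute("""
    SELECT o.name,
           (SELECT COUNT(*) FROM topics t
             WHERE t.org_id = o.id AND t.privacy = 'private'),
           (SELECT COUNT(*) FROM users u
             WHERE u.org_id = o.id AND u.type = 'admin'),
           (SELECT COUNT(*) FROM users u
             WHERE u.org_id = o.id AND u.type = 'standard')
    FROM organizations o
    ORDER BY o.id
""").fetchall()

print(naive)
print(correct)
```

The COUNT(DISTINCT ...) version in the earlier answer works around the same row multiplication from the other direction, by de-duplicating after the join instead of avoiding the join.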

Only way to write this SQL JOIN question?

I wrote this SQL query and it seems to work well, but I'm not sure whether it is the correct way to write it or whether there is a better way:
SELECT
art.artid, users.userid
FROM
art LEFT JOIN users
ON
art.userid = users.userid
WHERE
(SELECT COUNT(1) FROM art WHERE art.userid = users.userid) > 5 AND
users.active = '1' AND
art.active = '1' AND
art.status = '0' AND
art.pricesek > 0 GROUP BY users.userid ORDER BY RAND()
It gets the users from users table that are active and has 5 or more artworks in the art table. It also checks that the artwork is active, that the status of the artwork is set to 0 ("for sale"), and that the price is more than 0. Then it groups the results by userid in a random order.
Is this the correct way to write this or is there another way.
"All input is hardcoded so no userinput will be sent into database, so not worried about injections (should i be worried even if its hardcoded?)."
I made a small change in your code. Instead of using (SELECT COUNT (1) FROM art WHERE art.userid = users.userid)> 5 I put it in Having clause.
SELECT art.artid, users.userid
FROM art LEFT JOIN users ON art.userid = users.userid
WHERE users.active = '1' AND art.active = '1' AND
art.status = '0' AND art.pricesek > 0
GROUP BY users.userid, art.artid
HAVING COUNT(users.userid) > 5
ORDER BY RAND()
Your query has problems at many levels. The most obvious is that the GROUP BY clause is inconsistent with the SELECT. That should be generating an error.
It gets the users from users table that are active and has 5 or more artworks in the art table.
I would instead suggest aggregating the art table before joining:
SELECT u.userid
FROM users u JOIN
(SELECT a.userid, COUNT(*) as cnt
FROM art a
WHERE a.active = 1 AND
a.status = 0 AND
a.pricesek > 0
GROUP BY a.userid
) a
ON a.userid = u.userid
WHERE a.cnt > 5 AND u.active = 1
ORDER BY RAND();
Notes:
LEFT JOIN is not appropriate. In order to count the number of artworks, the JOIN must find at least 1 (really 6) matching rows.
It makes no sense to return a.artid. If you need an example, you could use min(a.artid) in the subquery. If you want all of them, then you would need to specify how to return them, but a JSON, array, or string aggregation function would be used in the subquery.
The values "1" and "0" look like numbers, so I removed the single quotes on the assumption that the columns are numeric. Compare numbers to numbers and strings to strings. Try to avoid mixing the two.
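As a rough check of the aggregate-before-join approach, the sketch below runs it against an in-memory SQLite database through Python's sqlite3 module. The column names are taken from the question; the sample users and artworks are made up, and RANDOM() is used because that is SQLite's counterpart to MySQL's RAND().

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (userid INTEGER PRIMARY KEY, active INTEGER);
CREATE TABLE art (artid INTEGER PRIMARY KEY, userid INTEGER, active INTEGER,
                  status INTEGER, pricesek INTEGER);
INSERT INTO users VALUES (1, 1), (2, 1), (3, 0);
""")

# User 1: six qualifying artworks. User 2: only five. User 3: six, but the
# account itself is inactive.
rows = [(1, 1, 0, 100)] * 6 + [(2, 1, 0, 100)] * 5 + [(3, 1, 0, 100)] * 6
# A sixth artwork for user 2 that is not for sale (status 1) must not count.
rows.append((2, 1, 1, 100))
conn.executemany(
    "INSERT INTO art (userid, active, status, pricesek) VALUES (?, ?, ?, ?)", rows
)

sellers = conn.execute("""
    SELECT u.userid, a.cnt
    FROM users u
    JOIN (SELECT userid, COUNT(*) AS cnt
          FROM art
          WHERE active = 1 AND status = 0 AND pricesek > 0
          GROUP BY userid) a
      ON a.userid = u.userid
    WHERE a.cnt > 5 AND u.active = 1
    ORDER BY RANDOM()
""").fetchall()

print(sellers)
```

Because the filters on art sit inside the subquery, the count covers only artworks that are active, for sale, and priced, which is usually what "more than 5 artworks" is meant to say.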

SQL ORDER BY something else than one of the table's columns

I have a table with posts in it. Website visitors can upvote or downvote such a post. I want to order a certain SQL query by the score of the post, but my posts table doesn't have a score column - I keep the upvotes and downvotes in a different votes table, because that tells me who voted on what. I could add a score column to my posts table, and update it every time someone votes on a post, but I'd rather not do this, as the score is something I can work out by subtracting the downvotes from the upvotes anyway.
Do you have any suggestions? Or should I just go ahead and add a score column to my table?
Edit
My posts table has a post_id column (among other irrelevant columns) and my votes table has columns post_id, user_id and positive (the latter is a BOOLEAN, being 1 when the vote is an upvote and 0 when the vote is a downvote).
I can easily determine the score of a post 'by hand', by first querying the number of upvotes of that post, then the number of downvotes, and calculating their difference. However, I would like to query my posts table and order by the score of that post, so I want to know how/if I can query the votes table in the ORDER BY command while querying the posts table.
No, you do not have to create a score column. You can order by the calculated score, as below:
Since you do have the upvotes and downvotes in a different table, you need to join, as Tim Schmelter has explained.
SELECT p.*
FROM Post p
INNER JOIN Votes v
ON p.PostID = v.PostID
ORDER BY (v.upvotes - v.downvotes);
If you want to get the query to perform better, you could add a function-based index for (v.upvotes - v.downvotes).
EDIT:
Based on the updated information about the posts and the votes table, the following query can be used. The score is calculated within an inline view using a CASE statement. Then, this inline view is joined with the posts table, ordering the rows by the score. Note that an INNER JOIN is used, so only posts that have votes would be listed. To list all posts, a LEFT JOIN could be used instead.
SELECT p.*
FROM posts p
INNER JOIN
(
SELECT
post_id,
SUM
(
CASE
WHEN positive = 0 THEN -1
ELSE 1
END
) score
FROM votes v
GROUP BY post_id
) scores
ON p.post_id = scores.post_id
ORDER BY scores.score;
You have to link both tables via JOIN. Presuming that the Score-table has a column PostID:
SELECT p.*, Score = s.Upvotes- s.DownVotes
FROM Post p
INNER JOIN Score s
ON p.PostID = s.PostID
ORDER BY Score
Presumably, your data has a scores table with a column for each vote and an indicator of whether it is an up vote or down vote. If so, you need to aggregate this information and then you can use it for ordering:
select p.*, (NumUpVotes - NumDownVotes) as NetVotes
from posts p left outer join
(select PostId, sum(case when IsUpVote = 'Y' then 1 else 0 end) as NumUpvotes,
sum(case when IsDownVote = 'Y' then 1 else 0 end) as NumDownVotes
from scores s
group by PostId
) s
on p.postId = s.PostId
order by (NumUpVotes - NumDownVotes);
You don't specify what database you are using so this uses standard SQL that should work in any database. You can adapt the logic for your particular data structure.
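Here is a minimal sketch of ordering by a computed score, using the posts and votes layout from the question's edit (post_id, user_id, positive) and Python's built-in sqlite3 module. The sample rows and the title column are assumptions; the LEFT JOIN plus COALESCE keeps posts that have no votes and treats their score as 0.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Post 1 gets one upvote and two downvotes (score -1), post 2 gets two
# upvotes (score +2), and post 3 has no votes at all.
conn.executescript("""
CREATE TABLE posts (post_id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE votes (post_id INTEGER, user_id INTEGER, positive INTEGER);
INSERT INTO posts VALUES (1, 'first'), (2, 'second'), (3, 'unvoted');
INSERT INTO votes VALUES (1, 10, 1), (1, 11, 0), (1, 12, 0),
                         (2, 10, 1), (2, 11, 1);
""")

ranked = conn.execute("""
    SELECT p.post_id, COALESCE(s.score, 0) AS score
    FROM posts p
    LEFT JOIN (SELECT post_id,
                      SUM(CASE WHEN positive = 1 THEN 1 ELSE -1 END) AS score
               FROM votes
               GROUP BY post_id) s
           ON s.post_id = p.post_id
    ORDER BY score DESC, p.post_id
""").fetchall()

print(ranked)
```

Whether an unvoted post should rank above a net-negative one is a design decision; dropping the COALESCE would leave its score as NULL instead.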

Using SQL(ite) how do I find the lowest unique child for each parent in a one to many relationship during a JOIN?

I have two tables with a many to one relationship which represent lots and bids within an auction system. Each lot can have zero or more bids associated with it. Each bid is associated with exactly one lot.
My table structure (with irrelevant fields removed) looks something like this:
For one type of auction the winning bid is the lowest unique bid for a given lot.
E.g. if there are four bids for a given lot: [1, 1, 2, 4] the lowest unique bid is 2 (not 1).
So far I have been able to construct a query which will find the lowest unique bid for a single specific lot (assuming the lot ID is 123):
SELECT id, value FROM bid
WHERE lot = 123
AND value = (
SELECT value FROM bid
WHERE lot = 123
GROUP BY value HAVING COUNT(*) = 1
ORDER BY value
)
This works as I would expect (although I'm not sure it's the most graceful approach).
I would now like to construct a query which will get the lowest unique bids for all lots at once. Essentially I want to perform a JOIN on the two tables where one column is the result of something similar to the above query. I'm at a loss as to how to use the same approach for finding the lowest unique bid in a JOIN though.
Am I on the wrong track with this approach to finding the lowest unique bid? Is there another way I can achieve the same result?
Can anyone help me expand this query into a JOIN?
Is this even possible in SQL or will I have to do it in my application proper?
Thanks in advance.
(I am using SQLite 3.5.9 as found in Android 2.1)
You can use group by with a "having" condition to find the set of bids without duplicate amounts for each lot.
select lotname, amt
from lot inner join bid on lot.id = bid.lotid
group by lotname, amt having count(*) = 1
You can in turn make that query an inline view and select the lowest bid from it for each lot.
select lotname, min(amt)
from
(
select lotname, amt
from lot inner join bid on lot.id = bid.lotid
group by lotname, amt having count(*) = 1
) as X
group by X.lotname
EDIT: Here's how to get the bid id using this approach, using nested inline views:
select bid.id as WinningBidId, Y.lotname, bid.amt
from
bid
join
(
select x.lotid, lotname, min(amt) as TheMinAmt
from
(
select lot.id as lotid, lotname, amt
from lot inner join bid on lot.id = bid.lotid
group by lot.id, lotname, amt
having count(*)=1
) as X
group by x.lotid, x.lotname
) as Y
on Y.lotid = bid.lotid and Y.TheMinAmt = Bid.amt
I think you need some subqueries to get to your desired data:
SELECT [b].[id] AS [BidID], [l].[id] AS [LotID],
[l].[Name] AS [Lot], [b].[value] AS [BidValue]
FROM [bid] [b]
INNER JOIN [lot] [l] ON [b].[lot] = [l].[id]
WHERE [b].[id] =
(SELECT TOP 1 [min].[id]
FROM [bid] [min]
WHERE [min].[lot] = [b].[lot]
AND NOT EXISTS(SELECT *
FROM [bid] [check]
WHERE [check].[lot] = [min].[lot]
AND [check].[value] = [min].[value]
AND [check].[id] <> [min].[id])
ORDER BY [min].[value] ASC)
The most inner query (within the exists) checks if there are no other bids on that lot, having the same value.
The query in the middle (top 1) determines the minimum bid of all unique bids on that lot.
The outer query makes this happen for all lots, that have bids.
SELECT lot.name, ( SELECT MIN(bid.value) FROM bid Where bid.lot = lot.ID) AS MinBid
FROM Lot INNER JOIN
bid ON lot.ID = bid.lot
If I understand you correctly, this will give you every lot and its smallest bid.
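The nested inline view approach can be tried out with the sketch below, which uses Python's built-in sqlite3 module. The bid columns (id, lot, value) follow the question's single-lot query; the lot.name column and all sample data are assumptions. Lot 1 reuses the [1, 1, 2, 4] example from the question, lot 2 has no unique bid, and lot 3 has a single bid.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE lot (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE bid (id INTEGER PRIMARY KEY, lot INTEGER, value INTEGER);
INSERT INTO lot VALUES (1, 'vase'), (2, 'lamp'), (3, 'rug');
INSERT INTO bid (lot, value) VALUES (1, 1), (1, 1), (1, 2), (1, 4),
                                    (2, 5), (2, 5),
                                    (3, 7);
""")

winners = conn.execute("""
    SELECT u.lot, b.id, u.min_value
    FROM (SELECT lot, MIN(value) AS min_value
          FROM (SELECT lot, value
                FROM bid
                GROUP BY lot, value
                HAVING COUNT(*) = 1) AS uniq
          GROUP BY lot) AS u
    JOIN bid b ON b.lot = u.lot AND b.value = u.min_value
    ORDER BY u.lot
""").fetchall()

# Each row is (lot id, winning bid id, winning value).
print(winners)
```

Lots without any unique bid simply do not appear in the result; a LEFT JOIN from lot would be needed to list them with a NULL winner.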

SQL Joins, Count(), and group by to sort 'posts' by # of yes/no 'votes'

I have posts, votes, and comments tables. Each post can have N 'yes votes', N 'no votes' and N comments. I am trying to get a set of posts sorted by number of yes votes.
I have a query that does exactly this, but it is running far too slowly. On a data set of 1500 posts and 15K votes, it takes 0.48 seconds on my dev machine. How can I optimize this?
select
p.*,
v.yes,
x.no
from
posts p
left join (select post_id, vote_type_id, count(1) as yes from votes where (vote_type_id = 1) group by post_id) v on v.post_id = p.id
left join (select post_id, vote_type_id, count(1) as no from votes where (vote_type_id = 2) group by post_id) x on x.post_id = p.id
left join (select post_id, count(1) as comment_count from comments group by post_id) p on p.confession_id = p.id
order by
yes desc
limit
0, 10
EDIT:
Votes and Comments both have a post_id FK
Adding an index on vote_type_id and post_id in the votes table shaved .1sec off the query execution.
Add a 'yes_count' column and use a trigger to update the vote count for each post when the vote is made. You can index this column, then it should be very fast.
Use explain for checking the query execution plan so you can see why it is slow, usually it is enough to see the plan and later create appropriate indexes. The 1.5k and 15k tables are really small so that query should be much faster.
Why don't you add a column yes and no ? Rather than adding a new entry at every post, just increment the count.
If I misunderstood your database or you can't modify it, do you at least have a foreign key from votes.post_id to posts.id? Foreign keys are crucial if you do any join.
First off, your current query shouldn't compile, as it uses p as an alias for both the comments and the posts table.
Second, you're joining votes twice: once for no, and once for yes. Using a CASE statement, you can compute the sums of both with a single join. Here's a sample query:
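select
p.*,
-- count(distinct ...) keeps the vote counts from being multiplied by the number of comments;
-- this assumes votes has an id column, which the question does not show
count(distinct case when v.vote_type_id = 1 then v.id end) as yes,
count(distinct case when v.vote_type_id = 2 then v.id end) as no,
count(distinct c.id) as comment_count
from posts p
left join votes v on v.post_id = p.id
left join comments c on c.post_id = p.id
group by p.id
order by yes desc
limit 0, 10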
Third, you could verify that the proper foreign keys exist for the relations between posts, votes and comments. A (post_id, vote_type_id) index on the votes table could also help.
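One possible variant, sketched here with Python's built-in sqlite3 module, aggregates votes and comments in separate derived tables before joining them to posts, so that neither count can be multiplied by the other. The id columns, the index name votes_post_type, the output names yes_votes and no_votes, and the sample rows are all assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE posts (id INTEGER PRIMARY KEY, title TEXT);
CREATE TABLE votes (id INTEGER PRIMARY KEY, post_id INTEGER, vote_type_id INTEGER);
CREATE TABLE comments (id INTEGER PRIMARY KEY, post_id INTEGER);
CREATE INDEX votes_post_type ON votes (post_id, vote_type_id);
INSERT INTO posts VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO votes (post_id, vote_type_id) VALUES (1, 1), (1, 2), (2, 1), (2, 1), (2, 1);
INSERT INTO comments (post_id) VALUES (1), (1), (2);
""")

top = conn.execute("""
    SELECT p.id,
           COALESCE(v.yes_votes, 0) AS yes_votes,
           COALESCE(v.no_votes, 0) AS no_votes,
           COALESCE(c.comment_count, 0) AS comment_count
    FROM posts p
    LEFT JOIN (SELECT post_id,
                      SUM(CASE WHEN vote_type_id = 1 THEN 1 ELSE 0 END) AS yes_votes,
                      SUM(CASE WHEN vote_type_id = 2 THEN 1 ELSE 0 END) AS no_votes
               FROM votes
               GROUP BY post_id) v ON v.post_id = p.id
    LEFT JOIN (SELECT post_id, COUNT(*) AS comment_count
               FROM comments
               GROUP BY post_id) c ON c.post_id = p.id
    ORDER BY yes_votes DESC, p.id
    LIMIT 10
""").fetchall()

# Each row is (post id, yes votes, no votes, comment count).
print(top)
```

This scans votes once instead of twice, which is the main saving over the original query; how much it helps on a given database is best confirmed with that database's EXPLAIN output.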