SELECT to pick users who both viewed a page - sql

I have a table that logs page views of each user:
+--------------+--------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+--------------+--------------+------+-----+---------+----------------+
| view_id | int(11) | NO | PRI | NULL | auto_increment |
| page_id | int(11) | YES | MUL | NULL | |
| user_id | int(11) | YES | MUL | NULL | |
+--------------+--------------+------+-----+---------+----------------+
For every pair of users, I would like to generate a count of how many pages they have both looked at.
I simply do not know how to do this. :) I am using MySQL, in case it has a non-standard feature that makes this a breeze.

select u1.user_id, u2.user_id, count(distinct u1.page_id) as NumPages
from logtable u1
join
logtable u2
on u1.page_id = u2.page_id
and u1.user_id < u2.user_id /* This avoids counting pairs twice */
group by u1.user_id, u2.user_id;
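But you should consider filtering this somewhat (for example, by restricting the users or pages involved), since the self-join grows quickly with the number of views per page.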
(Edited above to put u1.page_id, it was originally just page_id, which is really bad of me)
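To see what the self-join does on a small data set, here is a minimal sketch that runs the same query through Python's built-in sqlite3 module. SQLite stands in for MySQL here, and the helper name shared_page_counts and the sample rows are made up for illustration.

```python
import sqlite3

def shared_page_counts(rows):
    """Return {(user_a, user_b): number of pages both viewed}, with user_a < user_b."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE logtable (view_id INTEGER PRIMARY KEY, page_id INT, user_id INT)"
    )
    conn.executemany("INSERT INTO logtable (page_id, user_id) VALUES (?, ?)", rows)
    result = conn.execute("""
        SELECT u1.user_id, u2.user_id, COUNT(DISTINCT u1.page_id)
        FROM logtable u1
        JOIN logtable u2
          ON u1.page_id = u2.page_id
         AND u1.user_id < u2.user_id   -- each pair is reported once
        GROUP BY u1.user_id, u2.user_id
    """).fetchall()
    conn.close()
    return {(a, b): n for a, b, n in result}

# Hypothetical sample data as (page_id, user_id); user 1 viewed page 10 twice.
sample_views = [(10, 1), (10, 1), (10, 2), (11, 1), (11, 2), (12, 2), (12, 3)]
counts = shared_page_counts(sample_views)
print(counts)
```

Users 1 and 3 share no page, so that pair does not appear at all. The repeated view of page 10 does not inflate the count because of COUNT(DISTINCT ...).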

SELECT DISTINCT page_id
FROM logtable
WHERE user_id = 1 OR user_id = 2
GROUP BY page_id
HAVING COUNT(DISTINCT user_id) = 2
This query returns all pages that both users have looked at. If you want the count, just make this a subquery and count the rows.
SELECT COUNT(*) FROM (the query above) s;
Update: let's do it for all pairs of users, then.
SELECT u1.user_id, u2.user_id, COUNT(DISTINCT u1.page_id)
FROM logtable u1, logtable u2
WHERE u1.user_id < u2.user_id
AND u1.page_id = u2.page_id
GROUP BY u1.user_id, u2.user_id

For user_ids 100 and 200:
SELECT
page_id
FROM table1
WHERE user_id IN (100, 200)
GROUP BY page_id
HAVING MAX(CASE WHEN user_id = 100 THEN 1 ELSE 0 END) = 1
AND MAX(CASE WHEN user_id = 200 THEN 1 ELSE 0 END) = 1;

select a.user_id as user1, b.user_id as user2, count(distinct a.page_id) as views
from yourtable a, yourtable b
where a.page_id = b.page_id
and a.user_id < b.user_id
group by a.user_id, b.user_id
Change yourtable to the name of your table.

Related

SUM CASE when DISTINCT?

Joining two tables and grouping, we're trying to get the sum of a user's value but only include a user's value once if that user is represented in a grouping multiple times.
Some sample tables:
user table:
| id | net_worth |
------------------
| 1 | 100 |
| 2 | 1000 |
visit table:
| id | location | user_id |
-----------------------------
| 1 | mcdonalds | 1 |
| 2 | mcdonalds | 1 |
| 3 | mcdonalds | 2 |
| 4 | subway | 1 |
We want to find the total net worth of users visiting each location. User 1 visited McDonalds twice, but we don't want to double count their net worth. Ideally we can use a SUM but only add in the net worth value if that user hasn't already been counted for at that location. Something like this:
-- NOTE: Hypothetical query
SELECT
location,
SUM(CASE WHEN DISTINCT user.id then user.net_worth ELSE 0 END) as total_net_worth
FROM visit
JOIN user on user.id = visit.user_id
GROUP BY 1;
The ideal output being:
| location | total_net_worth |
-------------------------------
| mcdonalds | 1100 |
| subway | 100 |
This particular database is Redshift/PostgreSQL, but it would be interesting if there is a generic SQL solution. Is something like the above possible?
You don't want to consider duplicate entries in the visits table. So, select distinct rows from the table instead.
SELECT
v.location,
SUM(u.net_worth) as total_net_worth
FROM (SELECT DISTINCT location, user_id FROM visit) v
JOIN "user" u on u.id = v.user_id -- user is a reserved word in PostgreSQL, so it must be quoted
GROUP BY v.location
ORDER BY v.location;
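As a quick check of the difference the de-duplication makes, here is a small sketch using Python's sqlite3 module with the sample rows from the question. SQLite is only a stand-in for Redshift/PostgreSQL, and the variable names are arbitrary.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "user" (id INTEGER PRIMARY KEY, net_worth INTEGER);
    CREATE TABLE visit (id INTEGER PRIMARY KEY, location TEXT, user_id INTEGER);
    INSERT INTO "user" VALUES (1, 100), (2, 1000);
    INSERT INTO visit VALUES (1, 'mcdonalds', 1), (2, 'mcdonalds', 1),
                             (3, 'mcdonalds', 2), (4, 'subway', 1);
""")

# Plain join: user 1 is counted once per visit, so mcdonalds is inflated.
naive_totals = conn.execute("""
    SELECT v.location, SUM(u.net_worth)
    FROM visit v
    JOIN "user" u ON u.id = v.user_id
    GROUP BY v.location
    ORDER BY v.location
""").fetchall()

# De-duplicate (location, user_id) first, then join and sum.
deduped_totals = conn.execute("""
    SELECT v.location, SUM(u.net_worth)
    FROM (SELECT DISTINCT location, user_id FROM visit) v
    JOIN "user" u ON u.id = v.user_id
    GROUP BY v.location
    ORDER BY v.location
""").fetchall()

print(naive_totals)
print(deduped_totals)
```

The plain join gives 1200 for mcdonalds, while the de-duplicated version gives the 1100 asked for in the question.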
You can use a window function to get the unique users, then join that to the user table:
select v.location, sum(u.net_worth)
from "user" u
join (
select location, user_id,
row_number() over (partition by location, user_id order by id) as rn
from visit
) v on v.user_id = u.id and v.rn = 1
group by v.location;
The above is standard ANSI SQL. In Postgres, this can also be expressed using distinct on ():
select v.location, sum(u.net_worth)
from "user" u
join (
select distinct on (user_id, location) *
from visit
order by user_id, location, id
) v on v.user_id = u.id
group by v.location;
You can join the user table with the distinct combinations of location and user_id, as in the generic SQL below.
SELECT v.location, SUM(u.net_worth)
FROM (SELECT location, user_id FROM visit GROUP BY location, user_id) v
JOIN "user" u on u.id = v.user_id
GROUP BY v.location;

Sum of two counts from one table with additional data from another table

I have two tables as follows:
TABLE A
| id | col_a | col_b | user_id |
--------------------------------
| 1 | false | true | 1 |
| 2 | false | true | 2 |
| 3 | true | true | 2 |
| 4 | true | true | 3 |
| 5 | true | false | 1 |
TABLE B
| id | name |
--------------
| 1 | Bob |
| 2 | Jim |
| 3 | Helen |
| 4 | Michael|
| 5 | Jen |
I want to get the sum of two counts, which are the number of true values in col_a and number of true values in col_b. I want to group that data by user_id. I also want to join Table B and get the name of each user. The result would look like this:
|user_id|total (col_a + col_b)|name
------------------------------------
| 1 | 2 | Bob
| 2 | 3 | Jim
| 3 | 2 | Helen
So far I got the total sum with the following query:
SELECT
(SELECT COUNT(*) FROM "TABLE_A" WHERE "col_a" is true)+
(SELECT COUNT(*) FROM "TABLE_A" WHERE "col_b" is true)
as total
However, I'm not sure how to proceed with grouping these counts by user_id.
Something like this is typically fastest:
SELECT *
FROM "TABLE_B" b
JOIN (
SELECT user_id AS id
, count(*) FILTER (WHERE col_a)
+ count(*) FILTER (WHERE col_b) AS total
FROM "TABLE_A"
GROUP BY 1
) a USING (id);
While fetching all rows, aggregate first, join later. That's cheaper. See:
Query with LEFT JOIN not returning rows for count of 0
The aggregate FILTER clause is typically fastest. See:
For absolute performance, is SUM faster or COUNT?
Aggregate columns with additional (distinct) filters
Often, you want to keep total counts of 0 in the result. You did say:
get the name of each user.
SELECT b.id AS user_id, b.name, COALESCE(a.total, 0) AS total
FROM "TABLE_B" b
LEFT JOIN (
SELECT user_id AS id
, count(col_a OR NULL)
+ count(col_b OR NULL) AS total
FROM "TABLE_A"
GROUP BY 1
) a USING (id);
...
count(col_a OR NULL) is an equivalent alternative, shortest, and still fast. (Use the FILTER clause from above for best performance.)
The LEFT JOIN keeps all rows from "TABLE_B" in the result.
COALESCE() returns 0 instead of NULL for the total count.
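To illustrate the "aggregate first, then LEFT JOIN and COALESCE" pattern on the sample data, here is a minimal sketch using Python's sqlite3 module. It is not Postgres: the lower-case table names table_a and table_b are my own choice, booleans are stored as 0/1, and a portable COUNT(CASE ...) stands in for the FILTER clause.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE table_a (id INTEGER PRIMARY KEY, col_a INT, col_b INT, user_id INT);
    CREATE TABLE table_b (id INTEGER PRIMARY KEY, name TEXT);
    INSERT INTO table_a VALUES (1, 0, 1, 1), (2, 0, 1, 2), (3, 1, 1, 2),
                               (4, 1, 1, 3), (5, 1, 0, 1);
    INSERT INTO table_b VALUES (1, 'Bob'), (2, 'Jim'), (3, 'Helen'),
                               (4, 'Michael'), (5, 'Jen');
""")

totals = conn.execute("""
    SELECT b.id AS user_id, b.name, COALESCE(a.total, 0) AS total
    FROM table_b b
    LEFT JOIN (
        -- aggregate first: one row per user_id with both true-counts added up
        SELECT user_id,
               COUNT(CASE WHEN col_a THEN 1 END)
             + COUNT(CASE WHEN col_b THEN 1 END) AS total
        FROM table_a
        GROUP BY user_id
    ) a ON a.user_id = b.id
    ORDER BY b.id
""").fetchall()

for row in totals:
    print(row)
```

Michael and Jen have no rows in table_a, so the LEFT JOIN yields NULL for them and COALESCE turns that into 0.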
If col_a and col_b have only a few true values, this is typically (much) faster - basically what you had already:
SELECT b.*, COALESCE(aa.ct, 0) + COALESCE(ab.ct, 0) AS total
FROM "TABLE_B" b
LEFT JOIN (
SELECT user_id AS id, count(*) AS ct
FROM "TABLE_A"
WHERE col_a
GROUP BY 1
) aa USING (id)
LEFT JOIN (
SELECT user_id AS id, count(*) AS ct
FROM "TABLE_A"
WHERE col_b
GROUP BY 1
) ab USING (id);
Especially with (small in this case!) partial indexes like:
CREATE INDEX a_true_idx on "TABLE_A" (user_id) WHERE col_a;
CREATE INDEX b_true_idx on "TABLE_A" (user_id) WHERE col_b;
Aside: use legal, lower-case unquoted names in Postgres to make your life simpler.
Are PostgreSQL column names case-sensitive?
select user_id,name
, count(case when col_a = true then 1 end)
+ count(case when col_b = true then 1 end) total
from tableA a
join TableB b on a.user_id= b.id
group by user_id,name
You are double counting Jim. If that is not intended, since he only shows up in two rows and not three, maybe you can do the following:
with cte_A as (
select id, user_id -- include id so that UNION removes duplicates per row, not per user
from A
where col_a = true
union -- ALL -- (if you want to double count Jim)
select id, user_id
from A
where col_b = true
)
select B.id as user_id, count(*) as total, B.name
from cte_A
join B
on cte_A.user_id = B.id
group by B.id, B.name
If you actually want to double count, use UNION ALL instead of UNION.

Select rows with most similar set of attributes

I have a PostgreSQL 8.3.4 DB to keep information about photo taggings.
First off, my table definitions:
create table photos (
id integer
, user_id integer
, primary key (id, user_id)
);
create table tags (
photo_id integer
, user_id integer
, tag text
, primary key (user_id, photo_id, tag)
);
What I'm trying to do + simple example:
I am trying to return all the photos that have at least k other photos with at least j common tags.
I.e., if photo X has these tags (info field in the tags table):
gold
clock
family
And photo Y has the following tags:
gold
sun
family
flower
X and Y have 2 tags in common. For k = 1 and j = 2, X and Y will be returned.
What I have tried
SELECT tags1.user_id , users.name, tags1.photo_id
FROM users, tags tags1, tags tags2
WHERE ((tags1.info = tags2.info) AND (tags1.photo_id != tags2.photo_id)
AND (users.id = tags1.user_id))
GROUP BY tags1.user_id, tags1.photo_id, tags2.user_id, tags2.photo_id, users.name
HAVING ((count(tags1.info) = <j>) and (count(*) >= <k>))
ORDER BY user_id asc, photo_id asc
My failed results:
When I tried to run it on those tables:
photos
photo_id user_id
0 0
1 0
2 0
20 1
23 1
10 3
tags
photo_id user_id tag
0 0 Car
0 0 Bridge
0 0 Sky
20 1 Car
20 1 Bridge
10 3 Sky
The result for k = 1 and j = 1:
Expected:
| user_id | User Name | photo_id |
| 0 | Bob | 0 |
| 1 | Ben | 20 |
| 3 | Lev | 10 |
Actual:
| user_id | User Name | photo_id |
| 0 | Bob | 0 |
| 3 | Lev | 10 |
For k = 2 and j = 1:
Expected:
| user_id | User Name | photo_id |
| 0 | Bob | 0 |
Actual: empty result.
For j = 2 and k = 2:
Expected: empty result.
Actual:
| user_id | User Name | Photo ID |
| 0 | Bob | 0 |
| 1 | Ben | 20 |
How to solve this properly?
Working with your current design, this uses only basic SQL features and should work for Postgres 8.3, too (untested):
SELECT *
FROM photos p
WHERE (
SELECT count(*) >= 1 -- k other photos
FROM (
SELECT 1
FROM tags t1
JOIN tags t2 USING (tag)
WHERE t1.photo_id = p.id
AND t1.user_id = p.user_id
AND (t2.photo_id <> p.id OR
t2.user_id <> p.user_id)
GROUP BY t2.photo_id, t2.user_id
HAVING count(*) >= 1 -- j common tags
) t1
);
Or:
SELECT *
FROM (
SELECT id, user_id
FROM (
SELECT t1.photo_id AS id, t1.user_id
FROM tags t1
JOIN tags t2 USING (tag)
WHERE (t2.photo_id <> t1.photo_id OR
t2.user_id <> t1.user_id)
GROUP BY t1.photo_id, t1.user_id, t2.photo_id, t2.user_id
HAVING count(*) >= 1 -- j common tags
) sub1
GROUP BY 1, 2
HAVING count(*) >= 1 -- k other photos
) sub2
JOIN photos p USING (id, user_id);
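To check the two-level grouping against the expected results from the question, here is a rough sketch using Python's sqlite3 module (SQLite rather than Postgres 8.3). The helper name similar_photos and its j and k parameters are my own; the final join back to photos is left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tags (photo_id INT, user_id INT, tag TEXT,
                       PRIMARY KEY (user_id, photo_id, tag))
""")
# Sample tags from the question as (photo_id, user_id, tag).
conn.executemany("INSERT INTO tags VALUES (?, ?, ?)", [
    (0, 0, "Car"), (0, 0, "Bridge"), (0, 0, "Sky"),
    (20, 1, "Car"), (20, 1, "Bridge"),
    (10, 3, "Sky"),
])

def similar_photos(j, k):
    """Photos that share at least j tags with at least k other photos."""
    return conn.execute("""
        SELECT photo_id, user_id
        FROM (
            -- inner level: one row per ordered pair of photos with >= j common tags
            SELECT t1.photo_id, t1.user_id
            FROM tags t1
            JOIN tags t2 ON t1.tag = t2.tag
            WHERE t2.photo_id <> t1.photo_id OR t2.user_id <> t1.user_id
            GROUP BY t1.photo_id, t1.user_id, t2.photo_id, t2.user_id
            HAVING COUNT(*) >= :j
        ) pairs
        -- outer level: keep photos that have >= k such partner photos
        GROUP BY photo_id, user_id
        HAVING COUNT(*) >= :k
        ORDER BY photo_id
    """, {"j": j, "k": k}).fetchall()

print(similar_photos(j=1, k=1))
print(similar_photos(j=1, k=2))
print(similar_photos(j=2, k=2))
```

On this data the three calls agree with the "Expected" tables in the question: three photos for k = 1 and j = 1, only photo 0 for k = 2 and j = 1, and nothing for j = 2 and k = 2.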
In Postgres 9.3 or later you could use a correlated subquery with a LATERAL join ...
The above are probably even faster than my first query:
SELECT *
FROM (
SELECT photo_id, user_id
FROM tags t
GROUP BY 1, 2
HAVING (
SELECT count(*) >= 1
FROM (
SELECT photo_id, user_id
FROM tags
WHERE tag = ANY(array_agg(t.tag))
AND (photo_id <> t.photo_id OR
user_id <> t.user_id)
GROUP BY 1, 2
HAVING count(*) >= 2
) t1
)
) t
JOIN photos p ON p.id = t.photo_id
AND p.user_id = t.user_id;
SQL Fiddle showing both on Postgres 9.3.
The 1st query just needs the right basic indexes.
For the 2nd, I would build a materialized view with integer arrays, install the intarray module, and create a GIN index on the integer array column for better performance ...
Related:
Order result by count of common array elements
Proper design
It would be much more efficient to have a single column serial PK for photos and only store IDs of tags per photo ...:
CREATE TABLE photo (
photo_id serial PRIMARY KEY
, user_id int NOT NULL
);
CREATE TABLE tag (
tag_id serial PRIMARY KEY
, tag text UNIQUE NOT NULL
);
CREATE TABLE photo_tag (
photo_id int REFERENCES photo
, tag_id int REFERENCES tag
, PRIMARY KEY (photo_id, tag_id)
);
This would make the query much simpler and faster, too.
How to implement a many-to-many relationship in PostgreSQL?
If I understood you correctly, you want to calculate similarity between all photos of all users by common tags.
I think you need this:
SELECT
A.user_id AS user_id_a,
A.id AS photo_id_a,
B.user_id AS user_id_b,
B.id AS photo_id_b,
(
SELECT COUNT(*)
FROM
tags TA
JOIN tags TB ON TA.tag = TB.tag
WHERE
A.user_id = TA.user_id
AND A.id = TA.photo_id
AND B.user_id = TB.user_id
AND B.id = TB.photo_id
) AS common_tags
FROM
photos A
,photos B
WHERE
-- Exclude pairing a photo with itself
A.user_id <> B.user_id
OR A.id <> B.id

Determining relationship between rows using subqueries in PostgreSQL

I'm trying to figure out how to complete the following task in a single query.
Basically, given a user's ID, I want to return the user profiles of all users he is friends with.
If anything is unclear, I'll be happy to go into more detail. Thanks!
table 'users':
user_id | col1 | col2 | etc
-----------------------------------------
a | *** | *** | ***
-----------------------------------------
b | *** | *** | ***
table 'users_friends'
user_id | friend_user_id | status
-----------------------------------------
a | b | 1
-----------------------------------------
b | a | 1
given a value of a, find rows in table users_friends where
user_id = a
status = 1
using the resulting rows of that query, find rows in table users_friends where
user_id = b (column `friend_user_id` from resulting rows)
friend_user_id = a (column `user_id` from resulting rows)
status = 1
if any rows are returned, select rows from table 'users' where
user_id = b (column `user_id` from resulting row)
This is a really rough one I came up with. I think it does what I'm looking for, but I'm sure there are better ways to go about it.
SELECT * FROM users WHERE user_id IN
(SELECT user_id FROM users_friends WHERE friend_user_id IN
(SELECT user_id FROM users_friends WHERE user_id = 'someuserid' AND status = 1 ) AND status = 1 );
select u.*
from
users u
inner join
users_friends f on u.user_id = f.friend_user_id
where
f.status = 1
and f.user_id = 'a'
Assuming there are no duplicates in the friends table:
SELECT u.user_id, u.col1, u.col2
FROM users AS u
JOIN users_friends AS f1 ON u.user_id=f1.user_id
JOIN users_friends AS f2 ON f1.user_id=f2.friend_user_id AND f1.friend_user_id=f2.user_id
WHERE f1.status=1 AND f2.status=1 AND f2.user_id='a'
SQL Fiddle
SELECT u.user_id, u.col1
FROM users_friends AS f
JOIN users AS u
ON f.friend_user_id = u.user_id
WHERE f.user_id = 'a'
AND f.status = 1
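To try out the mutual-friendship check described in the question (both directions must exist with status = 1), here is a small sketch using Python's sqlite3 module. The helper name mutual_friends and the extra user c are invented for the example; c has a one-sided link to show the difference.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (user_id TEXT PRIMARY KEY);
    CREATE TABLE users_friends (user_id TEXT, friend_user_id TEXT, status INT);
    INSERT INTO users VALUES ('a'), ('b'), ('c');
    -- a <-> b is confirmed both ways; c has not confirmed a (status 0)
    INSERT INTO users_friends VALUES ('a', 'b', 1), ('b', 'a', 1),
                                     ('a', 'c', 1), ('c', 'a', 0);
""")

def mutual_friends(me):
    """user_ids that `me` lists as a friend and that list `me` back, both with status 1."""
    rows = conn.execute("""
        SELECT u.user_id
        FROM users u
        JOIN users_friends f1 ON f1.friend_user_id = u.user_id     -- me -> friend
        JOIN users_friends f2 ON f2.user_id = f1.friend_user_id    -- friend -> me
                             AND f2.friend_user_id = f1.user_id
        WHERE f1.user_id = :me
          AND f1.status = 1
          AND f2.status = 1
        ORDER BY u.user_id
    """, {"me": me}).fetchall()
    return [r[0] for r in rows]

print(mutual_friends("a"))
print(mutual_friends("c"))
```

If only one direction needs to be checked, the second join can simply be dropped.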

MySQL query help (semi-unique results)

user_id | user_destination | user_date_out | user_date_in | user_purpose | uid
0095 | NYC | 2010-11-25 | 2010-11-26 | Work | 1
0105 | Seattle | 2010-11-15 | 2010-11-20 | Work | 2
0095 | Home | 2010-11-10 | 2010-11-11 | Personal | 3
0123 | Nashville | 2010-11-10 | 2010-11-12 | Doctober | 4
I have the above data in a MySQL table. I need a query that will output the row if it is the most recent user_date_out for that user_id. Hard to explain, but the rows that should be displayed via the query are uid 2,3 and 4 (UID 1 would fall off because the user_date_out is "older" than UID 3 for the same user). Can anyone help me with this? Thanks in advance!
EDIT: I solved it by using the following query:
SELECT *
FROM `table`
WHERE `uid` = (
SELECT `uid`
FROM `table` as `alt`
WHERE `alt`.`user_id` = `table`.`user_id`
ORDER BY `user_date_out`
LIMIT 1
)
ORDER BY `user_date_out`
SELECT t.*, a_subquery.min_user_date_out
FROM your_table t
JOIN (SELECT user_id, MIN(user_date_out) AS min_user_date_out
FROM your_table
GROUP BY user_id) a_subquery
ON a_subquery.min_user_date_out = t.user_date_out
AND a_subquery.user_id = t.user_id
Since MySQL sometimes has issues with subqueries, here is an equivalent query using a self-exclusion join:
SELECT t.*
FROM your_table t
LEFT JOIN your_table t2
ON t.user_id = t2.user_id
AND t.user_date_out > t2.user_date_out
WHERE t2.user_id IS NULL
Note that both of these queries return more than one row for a user_id if that user has several rows sharing the same earliest user_date_out.
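Here is a rough sketch of the self-exclusion join and of the tie problem, using Python's sqlite3 module instead of MySQL. The table name trips, the extra "Airport" row, and the uid tie-breaker are my own additions and not part of the original answers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE trips (uid INTEGER PRIMARY KEY, user_id TEXT,
                        user_destination TEXT, user_date_out TEXT)
""")
conn.executemany("INSERT INTO trips VALUES (?, ?, ?, ?)", [
    (1, "0095", "NYC", "2010-11-25"),
    (2, "0105", "Seattle", "2010-11-15"),
    (3, "0095", "Home", "2010-11-10"),
    (4, "0123", "Nashville", "2010-11-10"),
])

# Self-exclusion join: keep rows for which no earlier row of the same user exists.
PLAIN = """
    SELECT t.uid
    FROM trips t
    LEFT JOIN trips t2
      ON t.user_id = t2.user_id
     AND t.user_date_out > t2.user_date_out
    WHERE t2.user_id IS NULL
    ORDER BY t.uid
"""

# Same idea, but equal dates are broken by uid so only one row per user survives.
TIE_SAFE = """
    SELECT t.uid
    FROM trips t
    LEFT JOIN trips t2
      ON t.user_id = t2.user_id
     AND (t.user_date_out > t2.user_date_out
          OR (t.user_date_out = t2.user_date_out AND t.uid > t2.uid))
    WHERE t2.user_id IS NULL
    ORDER BY t.uid
"""

def uids(sql):
    return [row[0] for row in conn.execute(sql)]

before = uids(PLAIN)

# Hypothetical second trip of user 0095 on the same earliest date.
conn.execute("INSERT INTO trips VALUES (5, '0095', 'Airport', '2010-11-10')")
after_plain = uids(PLAIN)
after_tie_safe = uids(TIE_SAFE)

print(before, after_plain, after_tie_safe)
```

With the tie in place the plain version returns both uid 3 and uid 5 for user 0095, while the tie-broken version keeps only the row with the lower uid. Which row should win a tie is a design decision; uid is just one possible choice.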