Count distinct rows via a pair of known values - sql

I wasn't sure how to phrase this question, so I'll give example content and the output I want; I'm looking for a query that produces it.
Let's say I have a table called "flagged" with this content:
content_id | user_id
1 | 1
1 | 2
1 | 3
2 | 1
2 | 3
2 | 4
3 | 2
3 | 3
4 | 1
4 | 2
5 | 1
6 | 1
6 | 4
And I have an asymmetrical relationship between content_ids:
master_content_id | slave_content_id
1 | 2
3 | 4
5 | 6
For each "master" content_id (1, 3 and 5), I want to count how many distinct users have flagged either the master or the slave content, counting someone who flagged both only once. In the above example, content_id=1 is counted for user_id=1 (who flagged content_id=1 and content_id=2), user_id=2 (content_id=1), user_id=3 (content_id=1 and content_id=2), and user_id=4 (content_id=2!)
An example of the output of the query I want to make is:
content_id | user_count
1 | 4 # users 1, 2, 3, 4
3 | 3 # users 1, 2, 3
5 | 2 # users 1, 4
I can't assume that the related content_ids are always a consecutive odd/even pair (e.g. 66 can be the master of the slave 58).
I am using MySQL and don't mind using its extensions to SQL, but I'd rather the query be ANSI, or at least portable to most databases.

The query below worked for me.
I'm using a sub-query with a UNION ALL so that each master content is mapped to itself as well as to its slave; flags on either one then join to the same master.
SELECT master_content_id AS content_id,
COUNT(DISTINCT user_id) AS user_count
FROM (
SELECT master_content_id, slave_content_id
FROM relationship
UNION ALL
SELECT master_content_id, master_content_id
FROM relationship
) r
JOIN flagged f ON ( f.content_id = r.slave_content_id )
GROUP BY master_content_id
Result:
content_id user_count
1 4
3 3
5 2
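As a quick check of the UNION ALL approach above, here is a minimal sketch that loads the question's sample data into an in-memory SQLite database and runs the same query. SQLite is only a stand-in for MySQL here (an assumption made so the example is self-contained), and the added ORDER BY is just there to make the row order predictable.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for MySQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE flagged (content_id INTEGER, user_id INTEGER);
    INSERT INTO flagged VALUES
        (1,1),(1,2),(1,3),(2,1),(2,3),(2,4),
        (3,2),(3,3),(4,1),(4,2),(5,1),(6,1),(6,4);
    CREATE TABLE relationship (master_content_id INTEGER, slave_content_id INTEGER);
    INSERT INTO relationship VALUES (1,2),(3,4),(5,6);
""")

# Each master is paired with its slave and with itself, then joined to the flags.
query = """
    SELECT master_content_id AS content_id,
           COUNT(DISTINCT user_id) AS user_count
    FROM (
        SELECT master_content_id, slave_content_id FROM relationship
        UNION ALL
        SELECT master_content_id, master_content_id FROM relationship
    ) r
    JOIN flagged f ON f.content_id = r.slave_content_id
    GROUP BY master_content_id
    ORDER BY master_content_id
"""
user_counts = conn.execute(query).fetchall()
print(user_counts)  # [(1, 4), (3, 3), (5, 2)]
```

The query uses only a derived table, UNION ALL and COUNT(DISTINCT ...), so it should port to most databases without changes.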

I think something like this will work for you (although GROUP_CONCAT is MySQL specific, similar concatenation can be achieved in other RDBMS)
SELECT COALESCE(Master_Content_ID, Content_ID) AS Content_ID,
COUNT(DISTINCT User_ID) AS Users,
CONCAT('#Users ', GROUP_CONCAT(DISTINCT User_ID ORDER BY User_ID)) AS UserList
FROM Flagged
LEFT JOIN MasterContent
ON Content_ID = Slave_Content_ID
GROUP BY COALESCE(Master_Content_ID, Content_ID)
Sample SQL Fiddle here: http://www.sqlfiddle.com/#!2/d09be/2
Output:
CONTENT_ID USERS USERLIST
1 4 #Users 1,2,3,4
3 3 #Users 1,2,3
5 2 #Users 1,4

From the samples given, does this do the job (I don't have MySQL available to test)?
SELECT
ms.master_content_id,
(SELECT COUNT(DISTINCT f.user_id) FROM flagged f WHERE
f.content_id = ms.slave_content_id OR
f.content_id = ms.master_content_id)
FROM
master_slave ms
It would be better not to have the DISTINCT, but I can't see a way around it.

SELECT master_content_id AS content_id
, COUNT(*) AS user_count
, GROUP_CONCAT(user_id) AS flagging_users
FROM
( SELECT r.master_content_id
, f.user_id
FROM relationship AS r
JOIN flagged AS f
ON f.content_id = r.master_content_id
UNION
SELECT r.master_content_id
, f.user_id
FROM relationship AS r
JOIN flagged AS f
ON f.content_id = r.slave_content_id
) AS un
GROUP BY master_content_id

Related

Why does my PostgreSQL query not work as expected?

I currently have two tables called calendars and events in my PostgreSQL database which are joined on calendars.uuid = events.calendar_id.
At present, a person can have more than one calendar in the calendars table, however I need to change this so the person_id has a unique constraint, and hence they should only be able to have one entry moving forward.
I therefore need to identify only the person(s) which currently have more than one calendar and all the associated records from the events table i.e. person_id = 4
calendars:
uuid | person_id
-----+---------------
1 | 1
2 | 2
3 | 3
4 | 4
5 | 4
6 | 4
7 | 5
8 | 5
events:
uuid | calendar_id | event_id
-----+-----------------------
1 | 1 | 4728
2 | 1 | 8942
3 | 1 | 7842
4 | 2 | 9784
5 | 3 | 9852
6 | 3 | 1298
7 | 4 | 4983
8 | 5 | 4892
9 | 5 | 8522
My query is as follows, but it is not working, and as I'm fairly new to SQL/PSQL I'm struggling to figure this one out:
SELECT
calendars.uuid,
calendars.person_id,
events.uuid,
events.calendar_id,
events.event_id
FROM
events
INNER JOIN (
SELECT
person_id,
count(*)
FROM
calendars
GROUP BY
person_id
HAVING
count(*) > 1) AS calendars ON calendars.uuid = events.calendar_id
Any help would be much appreciated.
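You can join events and calendars, then filter on person_id in the WHERE clause using a subquery that finds the persons with more than one calendar. (Your derived table only selects person_id and count(*), so there is no calendars.uuid column to join on.)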
SELECT
calendars.uuid,
calendars.person_id,
events.uuid,
events.calendar_id,
events.event_id
FROM
calendars
INNER JOIN events
ON events.calendar_id = calendars.uuid
WHERE calendars.person_id in (
SELECT
person_id
FROM
calendars
GROUP BY
person_id
HAVING
count(*) > 1 )
uuid person_id uuid calendar_id event_id
4 4 7 4 4983
5 4 8 5 4892
5 4 9 5 8522
I find it helps to structure your query so that the most restrictive parts come first. So I would use a CTE to restrict the persons to those wanted and then include the CTE as an inner join in a standard query. Something like this:
WITH cte as
(SELECT person_id
FROM
calendars
GROUP BY
person_id
HAVING
count(*) > 1)
SELECT
calendars.uuid,
calendars.person_id,
events.uuid,
events.calendar_id,
events.event_id
FROM
events
INNER JOIN calendars ON calendars.uuid = events.calendar_id
INNER JOIN cte ON cte.person_id = calendars.person_id
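To see what the CTE version returns on the question's sample data, here is a small sketch that runs it against an in-memory SQLite database. SQLite is an assumption standing in for PostgreSQL, and the ORDER BY is added only to fix the row order.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE calendars (uuid INTEGER, person_id INTEGER);
    INSERT INTO calendars VALUES
        (1,1),(2,2),(3,3),(4,4),(5,4),(6,4),(7,5),(8,5);
    CREATE TABLE events (uuid INTEGER, calendar_id INTEGER, event_id INTEGER);
    INSERT INTO events VALUES
        (1,1,4728),(2,1,8942),(3,1,7842),(4,2,9784),(5,3,9852),
        (6,3,1298),(7,4,4983),(8,5,4892),(9,5,8522);
""")

# Persons that own more than one calendar.
dupe_persons = conn.execute("""
    SELECT person_id FROM calendars
    GROUP BY person_id HAVING count(*) > 1
    ORDER BY person_id
""").fetchall()

# The CTE restricts the persons; the joins then pull in their events.
rows = conn.execute("""
    WITH cte AS (
        SELECT person_id FROM calendars
        GROUP BY person_id HAVING count(*) > 1
    )
    SELECT calendars.uuid, calendars.person_id,
           events.uuid, events.calendar_id, events.event_id
    FROM events
    INNER JOIN calendars ON calendars.uuid = events.calendar_id
    INNER JOIN cte ON cte.person_id = calendars.person_id
    ORDER BY events.uuid
""").fetchall()

print(dupe_persons)  # [(4,), (5,)]
print(rows)          # [(4, 4, 7, 4, 4983), (5, 4, 8, 5, 4892), (5, 4, 9, 5, 8522)]
```

Note that person 5 also has two calendars in the sample data, but neither has any events, so the inner joins leave that person out. Starting from calendars and using a LEFT JOIN to events would be one way to list them too.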

Find out what group id contains all relevant attributes in SQL

So let's say in this case, the groups that we have are groups of animals.
Let's say I have the following tables:
animal_id | attribute_id | animal
----------------------------------
1 | 1 | dog
1 | 4 | dog
2 | 1 | cat
2 | 3 | cat
3 | 2 | fish
3 | 5 | fish
id | attribute
------------------
1 | four legs
2 | no legs
3 | feline
4 | canine
5 | aquatic
The first table contains the attributes that define an animal, and the second table keeps track of what each attribute is. Now let's say that we run a query on some data and get the following result table:
attribute_id
------------
1
4
This data would describe a dog, since it is the only animal_id that has both attributes 1 and 4. I want to be able to somehow get the animal_id (which in this case would be 1) based on the third table, which is essentially a table that has already been generated that contains the attributes of an animal.
EDIT
So the third table that has 1 and 4 doesn't have to be 1 and 4. It could return 2 and 5 (for fish), or 1 and 3 (cat). We can assume that its result will always match one animal completely, but we don't know which one.
You can use group by and having:
with a as (
select 1 as attribute_id from dual union all
select 4 as attribute_id from dual
)
select t.animal_id, t.animal
from t join
a
on t.attribute_id = a.attribute_id
group by t.animal_id, t.animal
having count(*) = (select count(*) from a);
The above will find all animals that have those attributes and any others. If you want animals that have exactly those 2 attributes:
with a as (
select 1 as attribute_id from dual union all
select 4 as attribute_id from dual
)
select t.animal_id, t.animal
from t left join
a
on t.attribute_id = a.attribute_id
group by t.animal_id, t.animal
having count(*) = (select count(*) from a) and
count(*) = count(a.attribute_id);
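The difference between the two queries above is easier to see with a search set that is a strict subset of an animal's attributes. The sketch below is a hedged illustration using an in-memory SQLite database: the table names t and a follow the answer, while storing the wanted attributes in a real table a (instead of the "from dual" CTE) and the helper find_animals are assumptions made for the example.

```python
import sqlite3

# In-memory SQLite database; "a" is a plain table here instead of a CTE (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (animal_id INTEGER, attribute_id INTEGER, animal TEXT);
    INSERT INTO t VALUES
        (1,1,'dog'),(1,4,'dog'),(2,1,'cat'),(2,3,'cat'),(3,2,'fish'),(3,5,'fish');
    CREATE TABLE a (attribute_id INTEGER);
""")

# Animals that have all wanted attributes, and possibly others.
SUPERSET = """
    SELECT t.animal_id, t.animal
    FROM t JOIN a ON t.attribute_id = a.attribute_id
    GROUP BY t.animal_id, t.animal
    HAVING COUNT(*) = (SELECT COUNT(*) FROM a)
    ORDER BY t.animal_id
"""

# Animals that have exactly the wanted attributes and nothing else.
EXACT = """
    SELECT t.animal_id, t.animal
    FROM t LEFT JOIN a ON t.attribute_id = a.attribute_id
    GROUP BY t.animal_id, t.animal
    HAVING COUNT(*) = (SELECT COUNT(*) FROM a)
       AND COUNT(*) = COUNT(a.attribute_id)
    ORDER BY t.animal_id
"""

def find_animals(attribute_ids, query):
    """Hypothetical helper: load the wanted attribute ids into a, then run the query."""
    conn.execute("DELETE FROM a")
    conn.executemany("INSERT INTO a VALUES (?)", [(i,) for i in attribute_ids])
    return conn.execute(query).fetchall()

print(find_animals([1, 4], SUPERSET))  # [(1, 'dog')]
print(find_animals([1, 4], EXACT))     # [(1, 'dog')]
print(find_animals([1], SUPERSET))     # [(1, 'dog'), (2, 'cat')]
print(find_animals([1], EXACT))        # []
```

With only attribute 1 wanted, the first query returns both dog and cat (each has "four legs" plus something else), while the exact-match query returns nothing because no animal has attribute 1 alone.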

SQL server matching two table on a column

I have two tables, one storing user skills and another storing the skills required for a job. I want to count how many skills of each user match each job.
The table structure is
Table1: User_Skills
| ID | User_ID | Skill |
---------------------------
| 1 | 1 | .Net |
---------------------------
| 2 | 1 | Software|
---------------------------
| 3 | 1 | Engineer|
---------------------------
| 4 | 2 | .Net |
---------------------------
| 5 | 2 | Software|
---------------------------
Table2: Job_Skills_Requirement
| ID | Job_ID | Skill |
--------------------------
| 1 | 1 | .Net |
---------------------------
| 2 | 1 | Engineer|
---------------------------
| 3 | 1 | HTML |
---------------------------
| 4 | 2 | Software|
---------------------------
| 5 | 2 | HTML |
---------------------------
I was trying to have comma separated skills and compare but these can be in different order.
Edit
All the answers here are excellent. The result I am looking for is matching all jobs with all users as later on I will match other properties as well.
You could join the tables by the skill columns and count the matches:
SELECT user_id, job_id, COUNT(*) AS matching_skills
FROM user_skills u
JOIN job_skills_requirement j ON u.skill = j.skill
GROUP BY user_id, job_id
EDIT:
If you also want to show users and jobs that have no matching skills, you can use a full outer join instead.
SELECT user_id, job_id, COUNT(*) AS matching_skills
FROM user_skills u
FULL OUTER JOIN job_skills_requirement j ON u.skill = j.skill
GROUP BY user_id, job_id
EDIT 2:
As Jiri Tousek commented, the above query will produce nulls where there's no match between a user and a job. If you want a full Cartesian product between them, you could use (abuse?) the cross join syntax and count how many skills actually match between each user and each job:
SELECT user_id,
job_id,
COUNT(CASE WHEN u.skill = j.skill THEN 1 END) AS matching_skills
FROM user_skills u
CROSS JOIN job_skills_requirement j
GROUP BY user_id, job_id
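In the question's sample data every user happens to share at least one skill with every job, so the inner join and the cross join give the same rows. The sketch below adds a hypothetical user 3 whose only skill is 'SQL' (not in the original data, purely an assumption for illustration) to show where the two differ. It runs on an in-memory SQLite database as a stand-in for SQL Server.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE user_skills (id INTEGER, user_id INTEGER, skill TEXT);
    INSERT INTO user_skills VALUES
        (1,1,'.Net'),(2,1,'Software'),(3,1,'Engineer'),
        (4,2,'.Net'),(5,2,'Software'),
        (6,3,'SQL');  -- hypothetical user with no skill required by any job
    CREATE TABLE job_skills_requirement (id INTEGER, job_id INTEGER, skill TEXT);
    INSERT INTO job_skills_requirement VALUES
        (1,1,'.Net'),(2,1,'Engineer'),(3,1,'HTML'),(4,2,'Software'),(5,2,'HTML');
""")

# Inner join: only user/job pairs with at least one skill in common appear.
inner_join = conn.execute("""
    SELECT user_id, job_id, COUNT(*) AS matching_skills
    FROM user_skills u
    JOIN job_skills_requirement j ON u.skill = j.skill
    GROUP BY user_id, job_id
    ORDER BY user_id, job_id
""").fetchall()

# Cross join: every user/job pair appears; the CASE counts only real matches.
cross_join = conn.execute("""
    SELECT user_id, job_id,
           COUNT(CASE WHEN u.skill = j.skill THEN 1 END) AS matching_skills
    FROM user_skills u
    CROSS JOIN job_skills_requirement j
    GROUP BY user_id, job_id
    ORDER BY user_id, job_id
""").fetchall()

print(inner_join)  # [(1, 1, 2), (1, 2, 1), (2, 1, 1), (2, 2, 1)]
print(cross_join)  # [(1, 1, 2), (1, 2, 1), (2, 1, 1), (2, 2, 1), (3, 1, 0), (3, 2, 0)]
```

User 3 is missing from the inner join result entirely but shows up with a count of 0 for both jobs in the cross join result. A user or job with no skill rows at all would still be absent from both, since both queries start from the skill tables.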
If you want to match all users and all jobs, then Mureinik's otherwise excellent answer is not correct.
You need to generate all the rows first, which I would do using a cross join and then count the matching ones:
select u.user_id, j.job_id, count(jsr.job_id) as skills_in_common
from users u cross join
jobs j left join
user_skills us
on us.user_id = u.user_id left join
Job_Skills_Requirement jsr
on jsr.job_id = j.job_id and
jsr.skill = us.skill
group by u.user_id, j.job_id;
Note: This assumes the existence of a users and a jobs table. You can of course generate these using subqueries.
WITH User_Skills(ID,User_ID,Skill)AS(
SELECT 1,1,'.Net' UNION ALL
SELECT 2,1,'Software' UNION ALL
SELECT 3,1,'Engineer' UNION ALL
SELECT 4,2,'.Net' UNION ALL
SELECT 5,2 ,'Software'
),Job_Skills_Requirement(ID,Job_ID,Skill)AS(
SELECT 1,1,'.Net' UNION ALL
SELECT 2,1,'Engineer' UNION ALL
SELECT 3,1,'HTML' UNION ALL
SELECT 4,2,'Software' UNION ALL
SELECT 5,2 ,'HTML'
),Job_User_Skill AS (
SELECT j.Job_ID,u.User_ID,u.Skill
FROM Job_Skills_Requirement AS j INNER JOIN User_Skills AS u ON u.Skill=j.Skill
)
SELECT jus.Job_ID,jus.User_ID,COUNT(jus.Skill),STUFF(c.Skills,1,1,'') AS Skill
FROM Job_User_Skill AS jus
CROSS APPLY(SELECT ','+j.Skill FROM Job_User_Skill AS j WHERE j.Job_ID=jus.Job_ID AND j.User_ID=jus.User_ID FOR XML PATH('')) c(Skills)
GROUP BY jus.Job_ID,jus.User_ID,c.Skills
ORDER BY jus.Job_ID
Job_ID User_ID Skill
----------- ----------- ----------- -------------
1 1 2 .Net,Engineer
1 2 1 .Net
2 1 1 Software
2 2 1 Software

How do I query previous rows?

I have a page audit table that records which pages a user has accessed. Given a specific page, I need to find which pages users accessed immediately before it, and which of those was the most frequent.
For example, the FAQ Page_ID is 3. I want to know if it is more frequently accessed from the First Access page (ID 1) or Home page (ID 5).
Example:
Page Audit Table (SQL Server)
ID | Page_ID | User_ID
1 | 1 | 6
2 | 3 | 6
3 | 5 | 4
4 | 3 | 4
5 | 1 | 7
6 | 3 | 7
7 | 1 | 5
8 | 3 | 2 --Note: previous user is not 2
9 | 3 | 5 --Note: previous page for user 5 is 1 and not 3
Looking for Page_ID = 3, I want to retrieve:
Previous Page | Count
1 | 3
5 | 1
Note: I've looked some similar questions here (like that one), but it didn't help me to solve this problem.
You can use window functions as one way to figure this out:
with UserPage as (
select
User_ID,
Page_ID,
row_number() over (partition by User_ID order by ID) as rn
from
PageAudit
)
select
p1.Page_ID,
count(*)
from
UserPage p1
inner join
UserPage p2
on p1.User_ID = p2.User_ID and
p1.rn + 1 = p2.rn
where
p2.Page_ID = 3
group by
p1.Page_ID;
SQLFiddle Demo
If you have SQL Server 2012 or later, the answers using lag will be a lot more efficient. This one works on SQL Server 2008 too.
For reference, since I think one of the lag solutions is overcomplicated and one is wrong, here is a lag version:
with prev as (
select
page_id,
lag(page_id,1) over (partition by user_id order by id) as prev_page
from
PageAudit
)
select
prev_page,
count(*)
from
prev
where
page_id = 3 and
prev_page is not null -- exclude people who landed on page 3 without a previous page
group by
prev_page
SQLFiddle Example of Lag
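Since the fiddle link is not reproduced here, the following is a minimal sketch that runs the lag query above against the question's sample data. It uses an in-memory SQLite database as a stand-in for SQL Server (an assumption; window functions need SQLite 3.25 or later), and the hits alias and ORDER BY are added only to make the output order predictable.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for SQL Server (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE PageAudit (ID INTEGER, Page_ID INTEGER, User_ID INTEGER);
    INSERT INTO PageAudit VALUES
        (1,1,6),(2,3,6),(3,5,4),(4,3,4),(5,1,7),
        (6,3,7),(7,1,5),(8,3,2),(9,3,5);
""")

# lag() looks one row back within each user's own history, ordered by ID.
query = """
    WITH prev AS (
        SELECT page_id,
               lag(page_id, 1) OVER (PARTITION BY user_id ORDER BY id) AS prev_page
        FROM PageAudit
    )
    SELECT prev_page, count(*) AS hits
    FROM prev
    WHERE page_id = ?
      AND prev_page IS NOT NULL  -- skip visits with no earlier page for that user
    GROUP BY prev_page
    ORDER BY hits DESC
"""
previous_pages = conn.execute(query, (3,)).fetchall()
print(previous_pages)  # [(1, 3), (5, 1)]
```

User 2's visit to page 3 has no earlier row, so its prev_page is NULL and it is filtered out. User 5's previous page is 1 because the partition only looks at that user's own rows.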
select prev_page, count(*)
from (select id,
page_id,
user_id,
lag(page_id, 1) over(partition by user_id order by id) as prev_page
from page_audit_table) x
where page_id = 3
and prev_page <> page_id
group by prev_page
Fiddle:
http://sqlfiddle.com/#!6/c0037/23/0
You could use the LAG function (It is available only in MS SQL Server 2012+).
Test with this fiddle.
Query:
SELECT
previous_page, count(previous_page) as count
FROM
(SELECT
Page_id,
LAG(Page_ID, 1, NULL) OVER (PARTITION BY User_ID ORDER BY ID) as previous_page,
User_ID as current_usr,
LAG(User_ID, 1, NULL) OVER (PARTITION BY User_ID ORDER BY ID) as previous_usr
FROM
Page_Audit) p
WHERE
Page_ID = 3 AND current_usr = previous_usr
GROUP BY
previous_page
ORDER BY
count DESC

PostgreSQL , Select from 2 tables, but only the latest element from table 2

Hey, I have 2 tables in PostgreSql:
1 - documents: id, title
2 - updates: id, document_id, date
and some data:
documents:
| 1 | Test Title |
updates:
| 1 | 1 | 2006-01-01 |
| 2 | 1 | 2007-01-01 |
| 3 | 1 | 2008-01-01 |
So All updates are pointing to the same document, but all with different dates for the updates.
What I am trying to do is to do a select from the documents table, but also include the latest update based on the date.
How should a query like this look like? This is the one I currently have, but I am listing all updates, and not the latest one as the one I need:
SELECT * FROM documents,updates WHERE documents.id=1 AND documents.id=updates.document_id ORDER BY date
To add: the reason I need this in the query is that I want to order by the date from the updates table!
Edit: This script is heavily simplified, so I should be able to create a query that returns any number of results, but including the latest updated date. I was thinking of using an inner join or left join or something like that!?
Use PostgreSQL extension DISTINCT ON:
SELECT DISTINCT ON (documents.id) *
FROM documents
JOIN updates
ON updates.document_id = documents.id
ORDER BY
documents.id, updates.date DESC
This will take the first row from each documents.id group in ORDER BY order, i.e. the row with the latest update date for each document.
Test script to check:
SELECT DISTINCT ON (documents.id) *
FROM (
VALUES
(1, 'Test Title'),
(2, 'Test Title 2')
) documents (id, title)
JOIN (
VALUES
(1, 1, '2006-01-01'::DATE),
(2, 1, '2007-01-01'::DATE),
(3, 1, '2008-01-01'::DATE),
(4, 2, '2009-01-01'::DATE),
(5, 2, '2010-01-01'::DATE)
) updates (id, document_id, date)
ON updates.document_id = documents.id
ORDER BY
documents.id, updates.date DESC
You may create a derived table which contains only the most recent "updates" records per document_id, and then join "documents" against that:
SELECT d.id, d.title, u.update_id, u."date"
FROM documents d
LEFT JOIN
-- JOIN "documents" against the most recent update per document_id
(
SELECT recent.document_id, id AS update_id, recent."date"
FROM updates
INNER JOIN
(SELECT document_id, MAX("date") AS "date" FROM updates GROUP BY 1) recent
ON updates.document_id = recent.document_id
WHERE
updates."date" = recent."date"
) u
ON d.id = u.document_id;
This will handle "un-updated" documents, like so:
pg=> select * from documents;
id | title
----+-------
1 | foo
2 | bar
3 | baz
(3 rows)
pg=> select * from updates;
id | document_id | date
----+-------------+------------
1 | 1 | 2009-10-30
2 | 1 | 2009-11-04
3 | 1 | 2009-11-07
4 | 2 | 2009-11-09
(4 rows)
pg=> SELECT d.id ...
id | title | update_id | date
----+-------+-----------+------------
1 | foo | 3 | 2009-11-07
2 | bar | 4 | 2009-11-09
3 | baz | |
(3 rows)
select *
from documents
left join updates
on updates.document_id=documents.id
and updates.date=(select max(date) from updates where document_id=documents.id)
where documents.id=?;
It has some advantages over the previous answers:
you can write document_id only in one place which is convenient;
you can omit where and you'll get a table of all documents and their latest updates;
you can use more broad selection criteria, for example where documents.id in (1,2,3).
You can also avoid a subselect using group by, but you'll have to list all fields of documents in group by clause:
select documents.*, max(date) as max_date
from documents
left join updates on documents.id=document_id
where documents.id=1
group by documents.id, title;
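As a rough check of the left join with a correlated max(date) subquery shown above, here is a sketch that runs it without the where clause on the foo/bar/baz sample data from the earlier answer. It uses an in-memory SQLite database as a stand-in for PostgreSQL (an assumption); the u2 alias and the explicit column list are added for readability and are not part of the original query.

```python
import sqlite3

# In-memory SQLite database used as a stand-in for PostgreSQL (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE documents (id INTEGER, title TEXT);
    INSERT INTO documents VALUES (1,'foo'),(2,'bar'),(3,'baz');
    CREATE TABLE updates (id INTEGER, document_id INTEGER, date TEXT);
    INSERT INTO updates VALUES
        (1,1,'2009-10-30'),(2,1,'2009-11-04'),(3,1,'2009-11-07'),(4,2,'2009-11-09');
""")

# Keep only the update whose date equals the latest date for that document.
query = """
    SELECT documents.id, documents.title, updates.id, updates.date
    FROM documents
    LEFT JOIN updates
      ON updates.document_id = documents.id
     AND updates.date = (SELECT max(u2.date)
                         FROM updates u2
                         WHERE u2.document_id = documents.id)
    ORDER BY documents.id
"""
latest = conn.execute(query).fetchall()
print(latest)
# [(1, 'foo', 3, '2009-11-07'), (2, 'bar', 4, '2009-11-09'), (3, 'baz', None, None)]
```

Document 3 has no updates and still appears, with NULLs in the update columns, because of the left join. If two updates of one document shared the same latest date, both rows would be returned.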
Off the top of my head:
ORDER BY date DESC LIMIT 1
If you really want only id 1 you can use this query:
SELECT * FROM documents,updates
WHERE documents.id=1 AND updates.document_id=1
ORDER BY date DESC LIMIT 1
http://www.postgresql.org/docs/8.4/interactive/queries-limit.html
This should also work
SELECT * FROM documents, updates
WHERE documents.id=1 AND updates.document_id=1
AND updates.date = (SELECT MAX(date) FROM updates WHERE document_id = 1)