MySQL query - possible to include this clause? - sql

I have the following query, which retrieves 4 adverts from certain categories in a random order.
At the moment, if a user has more than 1 advert, then potentially all of those ads might be retrieved - I need to limit it so that only 1 ad per user is displayed.
Is this possible to achieve in the same query?
SELECT a.advert_id, a.title, a.url, a.user_id,
FLOOR(1 + RAND() * x.m_id) 'rand_ind'
FROM adverts AS a
INNER JOIN advert_categories AS ac
ON a.advert_id = ac.advert_id,
(
SELECT MAX(t.advert_id) - 1 'm_id'
FROM adverts t
) x
WHERE ac.category_id IN
(
SELECT category_id
FROM website_categories
WHERE website_id = '8'
)
AND a.advert_type = 'text'
GROUP BY a.advert_id
ORDER BY rand_ind
LIMIT 4

Note: The solution is the last query at the bottom of this answer.
Test Schema and Data
create table adverts (
advert_id int primary key, title varchar(20), url varchar(20), user_id int, advert_type varchar(10))
;
create table advert_categories (
advert_id int, category_id int, primary key(category_id, advert_id))
;
create table website_categories (
website_id int, category_id int, primary key(website_id, category_id))
;
insert website_categories values
(8,1),(8,3),(8,5),
(1,1),(2,3),(4,5)
;
insert adverts (advert_id, title, user_id) values
(1, 'StackExchange', 1),
(2, 'StackOverflow', 1),
(3, 'SuperUser', 1),
(4, 'ServerFault', 1),
(5, 'Programming', 1),
(6, 'C#', 2),
(7, 'Java', 2),
(8, 'Python', 2),
(9, 'Perl', 2),
(10, 'Google', 3)
;
update adverts set advert_type = 'text'
;
insert advert_categories values
(1,1),(1,3),
(2,3),(2,4),
(3,1),(3,2),(3,3),(3,4),
(4,1),
(5,4),
(6,1),(6,4),
(7,2),
(8,1),
(9,3),
(10,3),(10,5)
;
Data properties
each website can belong to multiple categories
for simplicity, all adverts are of type 'text'
each advert can belong to multiple categories. If a website has multiple categories that are matched multiple times in advert_categories for the same user_id, this causes the advert_id's to show twice when using a straight join between 3 tables in the next query.
This query joins the 3 tables together (notice that ids 1, 3 and 10 each appear twice)
select *
from website_categories wc
inner join advert_categories ac on wc.category_id = ac.category_id
inner join adverts a on a.advert_id = ac.advert_id and a.advert_type = 'text'
where wc.website_id='8'
order by a.advert_id
To make each website show only once, this is the core query to show all eligible ads, each only once
select *
from adverts a
where a.advert_type = 'text'
and exists (
select *
from website_categories wc
inner join advert_categories ac on wc.category_id = ac.category_id
where wc.website_id='8'
and a.advert_id = ac.advert_id)
The next query retrieves all the advert_id's to be shown
select advert_id, user_id
from (
select
advert_id, user_id,
#r := #r + 1 r
from (select #r:=0) r
cross join
(
# core query -- vvv
select a.advert_id, a.user_id
from adverts a
where a.advert_type = 'text'
and exists (
select *
from website_categories wc
inner join advert_categories ac on wc.category_id = ac.category_id
where wc.website_id='8'
and a.advert_id = ac.advert_id)
# core query -- ^^^
order by rand()
) EligibleAdsAndUserIDs
) RowNumbered
group by user_id
order by r
limit 2
There are 3 levels to this query
aliased EligibleAdsAndUserIDs: core query, sorted randomly using order by rand()
aliased RowNumbered: row number added to core query, using MySQL side-effecting #variables
the outermost query forces mysql to collect rows as numbered randomly in the inner queries, and group by user_id causes it to retain only the first row for each user_id. limit 2 causes the query to stop as soon as two distinct user_id's have been encountered.
This is the final query which takes the advert_id's from the previous query and joins it back to table adverts to retrieve the required columns.
only once per user_id
feature user's with more ads proportionally (statistically) to the number of eligible ads they have
Note: Point (2) works because the more ads you have, the more likely you will hit the top placings in the row numbering subquery
select a.advert_id, a.title, a.url, a.user_id
from
(
select advert_id
from (
select
advert_id, user_id,
#r := #r + 1 r
from (select #r:=0) r
cross join
(
# core query -- vvv
select a.advert_id, a.user_id
from adverts a
where a.advert_type = 'text'
and exists (
select *
from website_categories wc
inner join advert_categories ac on wc.category_id = ac.category_id
where wc.website_id='8'
and a.advert_id = ac.advert_id)
# core query -- ^^^
order by rand()
) EligibleAdsAndUserIDs
) RowNumbered
group by user_id
order by r
limit 2
) Top2
inner join adverts a on a.advert_id = Top2.advert_id;

I'm thinking through something but don't have MySQL available.. can you try this query to see if it works or crashes...
SELECT
PreQuery.user_id,
(select max( tmp.someRandom ) from PreQuery tmp where tmp.User_ID = PreQuery.User_ID ) MaxRandom
from
( select adverts.user_id,
rand() someRandom
from adverts, advert_categories
where adverts.advert_id = advert_categories.advert_id ) PreQuery
If the "tmp" alias is recognized as a temp buffer of the preliminary query as defined by the OUTER FROM clause, I might have something that will work... I think the field as a select statement from a queried from WONT work, but if it does, I know I'll have something solid for you.

Ok, this one might make the head hurt a bit, but lets get the logical thing going... The inner most "Core Query" is a basis that gets all unique and randomly assigned QUALIFIED Users that have a qualifying ad base on the category chosen, and type = 'text'. Since the order is random, I don't care what the assigned sequence is, and order by that. The limit 4 will return the first 4 entries that qualify. This is regardless of one user having 1 ad vs another having 1000 ads.
Next, join to the advertisements, reversing the table / join qualifications... but by having a WHERE - IN SUB-SELECT, the sub-select will be on each unique USER ID that was qualified by the "CoreQuery" and will ONLY be done 4 times based on ITs inner limit. So even if 100 users with different advertisements, we get 4 users.
Now, the Join to the CoreQuery is the Advert Table based on the same qualifying user. Typically this would join ALL records against the core query given they are for the same user in question... This is correct... HOWEVER, the NEXT WHERE clause is what filters it down to only ONE ad for the given person.
The Sub-Select is making sure its "Advert_ID" matches the one selected in the sub-select. The sub-select is based ONLY on the current "CoreQuery.user_ID" and gets ALL the qualifying category / ads for the user (wrong... we don't want ALL ads)... So, by adding an ORDER BY RAND() will randomize only this one person's ads in the result set... then Limiting THAT by 1 will only give ONE of their qualified ads...
So, the CoreQuery restricts down to 4 users. Then for each qualified user ID, gets only 1 of the qualified ads (by its inner order by RAND() and LIMIT 1 )...
Although I don't have MySQL to try, the queries are COMPLETELY legit and hope it works for you.... man, I love brain teasers like this...
SELECT
ad1.*
from
( SELECT ad.user_id,
count(*) as UserAdCount,
RAND() as ANYRand
from
website_categories wc
inner join advert_categories ac
ON wc.category_id = ac.category_id
inner join adverts ad
ON ac.advert_id = ad.advert_id
AND ad.advert_type = 'text'
where
wc.website_id = 8
GROUP BY
1
order by
3
limit
4 ) CoreQuery,
adverts ad1
WHERE
ad1.advert_type = 'text'
AND CoreQuery.User_ID = ad1.User_ID
AND ad1.advert_id in
( select
ad2.advert_id
FROM
adverts ad2,
advert_categories ac2,
website_categories wc2
WHERE
ad2.user_id = CoreQuery.user_id
AND ad2.advert_id = ac2.advert_id
AND ac2.category_id = wc2.category_id
AND wc2.website_id = 8
ORDER BY
RAND()
LIMIT
1 )

I like to suggest that you do the random with php. This is way faster than doing it in mySQL.
"However, when the table is large (over about 10,000 rows) this method of selecting a random row becomes increasingly slow with the size of the table and can create a great load on the server. I tested this on a table I was working that contained 2,394,968 rows. It took 717 seconds (12 minutes!) to return a random row."
http://www.greggdev.com/web/articles.php?id=6

set #userid = -1;
select
a.id,
a.title,
case when #userid = a.userid then
0
else
1
end as isfirst,
(#userid := a.userid)
from
adverts a
inner join advertcategories ac on ac.advertid = a.advertid
inner join categories c on c.categoryid = ac.categoryid
where
c.website = 8
order by
a.userid,
rand()
having
isfirst = 1
limit 4

Add COUNT(a.user_id) as owned in the main select directive and add HAVING owned < 2 after Group By
http://dev.mysql.com/doc/refman/5.5/en/select.html
I think this is the way to do it, if the one user has more than one advert then we will not select it.

Related

INNER JOIN of pagevies, contacts and companies - duplicated entries

In short: 3 table inner join duplicates records
I have data in BigQuery in 3 tables:
Pageviews with columns:
timestamp
user_id
title
path
Contacts with columns:
website_user_id
email
company_id
Companies with columns:
id
name
I want to display all recorded pageviews and, if user and/or company is known, display this data next to pageview.
First, I join contact and pageviews data (SQL is generated by Metabase business intelligence tool):
SELECT
`analytics.pageviews`.`timestamp` AS `timestamp`,
`analytics.pageviews`.`title` AS `title`,
`analytics.pageviews`.`path` AS `path`,
`Contacts`.`email` AS `email`
FROM `analytics.pageviews`
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id`
ORDER BY `timestamp` DESC
It works as expected and I can see pageviews attributed to known contacts.
Next, I'd like to show pageviews of contacts with known company and which company is this:
SELECT
`analytics.pageviews`.`timestamp` AS `timestamp`,
`analytics.pageviews`.`title` AS `title`,
`analytics.pageviews`.`path` AS `path`,
`Contacts`.`email` AS `email`,
`Companies`.`name` AS `name`
FROM `analytics.pageviews`
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id`
INNER JOIN `analytics.companies` `Companies` ON `Contacts`.`company_id` = `Companies`.`id`
ORDER BY `timestamp` DESC
With this query I would expect to see only pageviews where associated contact AND company are known (just another column for company name). The problem is, I get duplicate rows for every pageview (sometimes 5, sometimes 20 identical rows).
I want to avoid selecting DISTINCT timestamps because it can lead to excluding valid pageviews from different users but with identical timestamp.
How to approach this?
Your description sounds like you have duplciates in companies. This is easy to test for:
select c.id, count(*)
from `analytics.companies` c
group by c.id
having count(*) >= 2;
You can get the details using window functions:
select c.*
from (select c.*, count(*) over (partition by c.id) as cnt
from `analytics.companies` c
) c
where cnt >= 2
order by cnt desc, id;

Join on resultant table of another join without using subquery,CTE or temp tables

My question is can we join a table A to resultant table of inner join of table A and B without using subquery, CTE or temp tables ?
I am using SQL Server.
I will explain the situation with an example
The are two tables GoaLScorers and GoalScoredDetails.
GoaLScorers
gid Name
-----------
1 A
2 B
3 A
GoalScoredDetails
DetailId gid stadium goals Cards
---------------------------------------------
1 1 X 2 1
2 2 Y 5 2
3 3 Y 2 1
The result I am expecting is if I select a stadium 'X' (or 'Y')
I should get name of all who may or may not have scored there, also aggregate total number of goals,total cards.
Null value is acceptable for names if no goals or no cards.
I can get the result I am expecting with the below query
SELECT
gs.name,
SUM(goal) as TotalGoals,
SUM(cards) as TotalCards
FROM
(SELECT
gid, stadium, goal, cards
FROM
GoalScoredDetails
WHERE
stadium = 'Y') AS vtable
RIGHT OUTER JOIN
GoalScorers AS gs ON vtable.gid = gs.gid
GROUP BY
gs.name
My question is can we get the above result without using a subquery or CTE or temp table ?
Basically what we need to do is OUTER JOIN GoalScorers to resultant virtual table of INNER JOIN OF GoalScorers and GoalScoredDetails.
But I am always faced with ambiguous column name error as "gid" column is present in GoalScorers and also in resultant table. Error persists even if I try to use alias for column names.
I have created a sql fiddle for this her: http://sqlfiddle.com/#!3/40162/8
SELECT gs.name, SUM(gsd.goal) AS totalGoals, SUM(gsd.cards) AS totalCards
FROM GoalScorers gs
LEFT JOIN GoalScoredDetails gsd ON gsd.gid = gs.gid AND
gsd.Stadium = 'Y'
GROUP BY gs.name;
IOW, you could push your where criteria onto joining expression.
The error Ambiguous column name 'ColumnName' occurs when SQL Server encounters two or more columns with the same and it hasn't been told which to use. You can avoid the error by prefixing your column names with either the full table name, or an alias if provided. For the examples below use the following data:
Sample Data
DECLARE #GoalScorers TABLE
(
gid INT,
Name VARCHAR(1)
)
;
DECLARE #GoalScoredDetails TABLE
(
DetailId INT,
gid INT,
stadium VARCHAR(1),
goals INT,
Cards INT
)
;
INSERT INTO #GoalScorers
(
gid,
Name
)
VALUES
(1, 'A'),
(2, 'B'),
(3, 'A')
;
INSERT INTO #GoalScoredDetails
(
DetailId,
gid,
stadium,
goals,
Cards
)
VALUES
(1, 1, 'x', 2, 1),
(2, 2, 'y', 5, 2),
(3, 3, 'y', 2, 1)
;
In this first example we recieve the error. Why? Because there is more than one column called gid it cannot tell which to use.
Failed Example
SELECT
gid
FROM
#GoalScoredDetails AS gsd
RIGHT OUTER JOIN #GoalScorers as gs ON gs.gid = gsd.gid
;
This example works because we explicitly tell SQL which gid to return:
Working Example
SELECT
gs.gid
FROM
#GoalScoredDetails AS gsd
RIGHT OUTER JOIN #GoalScorers as gs ON gs.gid = gsd.gid
;
You can, of course, return both:
Example
SELECT
gs.gid,
gsd.gid
FROM
#GoalScoredDetails AS gsd
RIGHT OUTER JOIN #GoalScorers as gs ON gs.gid = gsd.gid
;
In multi table queries I would always recommend prefixing every column name with a table/alias name. This makes the query easier to follow, and reduces the likelihood of this sort of error.

How to implement/improve/ make faster this query with joins

Please bear with me I am not skilled in SQL:
I have three tables
1) Notifications - stores all my data
2) GroupTable - Has the names of groups and related id
3) GroupUser - this table maps Uname and Udob to a group from GroupTable.
Now before I fetch records from Notifications I want to check the GroupTable for GroupID take this GroupID and look in GroupUser for all the records in this GroupID (Names,DOB as these are unique) Once I get this data I want to fetch records from Notifications table for the Names and DOB's in ascending order of the date:
So far I have the following query, it works fine just that I am not satisfied and I think this can be improved:
SELECT
*
FROM
(SELECT
*
FROM Notifications
WHERE
DateToNotify < '2016-03-24' AND
NotificationDateFor IN
(SELECT gu.Name
FROM GroupUser AS gu
INNER JOIN GroupTable AS gt ON
gu.GroupID = gt._id AND
gt.GroupName = "Groupn"
) AND
DOB IN
(SELECT gu.DOB
FROM GroupUser AS gu
INNER JOIN GroupTable AS gt ON
gu.GroupID = gt._id AND
gt.GroupName = "Groupn"
)
) as T
ORDER BY
SUBSTR(DATE('NOW'), 0) > SUBSTR(DateToNotify, 0)
, SUBSTR(DateToNotify, 0)
I don't think that you would get this faster with joins instead of the IN clauses. It can be that re-writing would not even change the execution plan, because the dbms tries to access the data in the optimal way anyhow.
It seems a bit strange that you don't look for group users matching name and dob, but only ensure that there are group users matching the name and - possibly other - group users matching the dob. But as you say that the query works fine as is, okay.
EDIT: Okay, according to your comment you actually want groupuser matches on both name and dob. So what you are looking for would be
AND (NotificationDateFor, DOB) IN (SELECT gu.Name, gu.DOB FROM ...)
But SQLite doesn't support this beautiful syntax (Oracle is the only dbms I know of that does).
So you either join or use EXISTS.
With JOIN:
select distinct n.*
from notifications n
join
(
select name, dob
from groupuser
where groupid = (select _id from grouptable where groupname = 'groupn')
) as gu on n.notificationdatefor = gu.name and n.dob = gu.dob
where n.datetonotify < '2016-03-24'
order by date('now') > n.datetonotify, n.datetonotify;
With EXISTS:
select *
from notifications n
where datetonotify < '2016-03-24'
and exists
(
select *
from groupuser gu
where gu.groupid = (select _id from grouptable where groupname = 'groupn')
and gu.name = n.notificationdatefor
and gu.dob = n.dob
)
order by date('now') > n.datetonotify, n.datetonotify;

Getting random profiles that have a match with current profile

I'm trying to get 3 random unique profiles with the same sex ID as the current user ID (orig.id_user = 6 in this example), and their respective reviews.
SELECT DISTINCT u.id_user, s.review
FROM user AS u
CROSS JOIN user AS orig ON orig.id_sex = u.id_sex
INNER JOIN user_review AS s ON s.id_user_to = u.id_user
WHERE orig.id_user = 6
ORDER BY RAND()
LIMIT 3
Somehow, the id_user column displays repeated values. Why?
UPDATE (assuming i have the id_sex value)
SELECT DISTINCT s.id_user_to, s.id_user_from, s.review
FROM user_review AS s
LEFT JOIN user AS u ON u.id_user = s.id_user_to
WHERE u.id_sex = 2
ORDER BY RAND()
LIMIT 20
But this is still returning repeated rows in the id_user_to column, they should be unique values because of the DISTINCT.
SOLUTION using GROUP BY
SELECT us.id_user_to, us.review
FROM user_review AS us
LEFT JOIN user AS u ON u.id_user = us.id_user_to
WHERE u.id_sex = 2
GROUP BY us.id_user_to
ORDER BY RAND()
LIMIT 3

Using a complicated double join to get a count of child objects

Note that I'm using postgresql
I have an organizations table, a users table, a jobs table, and a documents table. I want to get a list of the organizations ordered by the number of total documents they have access to.
organizations
------------
id (pk)
company_name
users
------------
id (pk)
organization_id
jobs
------------
id (pk)
client_id (id of an organization)
server_id (id of an organization)
creator_id (id of a user)
documents
------------
id (pk)
job_id
Result Desired
organizations.id | organizations.company_name | document_count
85 | Big Corporation | 84
905 | Some other folks | 65
403 | ACME, Inc | 14
As you can see, an organization can be connected to a document through 3 different paths:
organizations.id => jobs.client_id => documents.job_id
organizations.id => jobs.server_id => documents.job_id
organizations.id => users.organization_id => jobs.creator_id => documents.job_id
But I want a query that will get the count of all the documents each company has access to...
I tried a couple of things... like this:
SELECT COUNT(documents.id) document_count, organizations.id, organizations.company_name
FROM organizations
INNER JOIN users ON organizations.id = users.organization_id
INNER JOIN jobs ON (
jobs.client_id = organizations.id OR
jobs.server_id = organizations.id OR
jobs.creator_id = users.id
)
INNER JOIN documents ON documents.job_id = jobs.id
GROUP BY organizations.id, organizations.company_name
ORDER BY document_count DESC
LIMIT 10
The query takes awhile to run, but it's not horrible since i'm doing it for a one-time report, but the results... cannot possibly be correct.
The first listed organization has a reported count of 129,834 documents -- but that's impossible since there's only 32,820 records in the documents table. I feel like it must be counting drastic quantities of duplicates (due to an error in one of my joins?) but I'm not sure where I've gone wrong.
The order appears correct since the highest volume user of the system is clearly at the top of the list... but the value is inflated somehow.
The problem is that if jobs.client_id = organizations.id or jobs.server_id = organizations.id, then there's nothing to filter your INNER JOIN users (aside from its ON clause), so you'll get a separate record for every single user that belongs to that organization. In other words, for each organization, you're adding three values:
its total number of users times the total number of documents belonging to jobs for which it's a client
its total number of users times the total number of documents belonging to jobs for which it's a server
the total number of documents belonging to jobs for which one if its users is the creator
One way to fix this is to remove the INNER JOIN users line, and change this:
jobs.creator_id = users.id
to this:
jobs.creator_id IN (SELECT id FROM users WHERE organization_id = organizations.id)
. . . but that might perform terribly. You might need to try a few things before finding a query that performs acceptably.
Simplify your thinking. You have 3 paths to docid so write 3 queries, union them and count that
It's probably too late to redesign this, but you really should.
The jobs table should not have its own id field a d key.
The jobs table is horribly designed because every reference to a disk page from the id index is gonna have to go read 1-100 different pages from disk out of the data file just to get the three other id fields that you always want to use (which is the clue that a job should not have its own id).
You can make a quick fix by making jobs use an index that is clustered or clustering ( depending on the db system) on the job id field. And alternative will be to mark the other three id fields as "includes" on the index so the page reads to the data file will 100% go away. Either of these may be enough to make this "just work".
What I would encourage you to do though is drop the id field and key on jobs and instead make a "natural key" that has the three other id fields in it and use that key on the documents table as well.
I would also demoralize (repeat) the organization of the creator on the jobs table and the document table. A user isn't going to move to another org and keep the same acces, so you should never have to run a sweep to update these in sync and even if you did it would be easy.
With these changes you can just do a select on the documents table directly, skipping the random pages reads needed from the other tables. The group by to group across the three different id fields would be a bit tricky. I might give this a try as it is interesting.
In the short term though, try clustering or includes on the jobs table to solve the performance issue and I will check the join logic tonight.
None of the answers quite got me there except for the one suggesting a UNION. This is what I came up with:
SELECT COUNT(docs.doc_id) document_count, docs.org_id, docs.org_name
FROM (
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs ON documents.job_id = jobs.id
INNER JOIN organizations ON jobs.client_id = organizations.id
UNION
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs ON documents.job_id = jobs.id
INNER JOIN organizations ON jobs.server_id = organizations.id
UNION
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs on documents.job_id = jobs.id
INNER JOIN users ON jobs.creator_id = users.id
INNER JOIN organizations ON users.organization_id = organizations.id
) docs
GROUP BY org_id, org_name
ORDER BY document_count DESC
The performance was much better than any of the people suggesting subqueries and it appears to have given me a reasonable answer
But I want a query that will get the count of all the documents you have access to...
That's where your query starts:
SELECT ... FROM documents
...
Since the only clue to the documents table is in jobs, you'll need the jobs table as well::
SELECT ...
FROM documents dc
JOIN jobs jo ON jo.document_id = dc.id
...
Now, it is time for restrictions. Which documents do you actually want ? There are three cases you want: either the client_id matches the organisation, or the server_id maches the company, or the creator_id matches a user that happens to work for the company:
SELECT ...
FROM documents dc
JOIN jobs jo ON jo.document_id = dc.id
WHERE jo.client_id = $THE_COMPANY
OR jo.server_id = $THE_COMPANY
OR EXISTS (
SELECT *
FROM users uu
JOIN organizations oo ON uu.organization_id = ex.id
WHERE uu.id = jo.creator_id
AND oo.id = $THE_COMAPNY
)
;
But, there might be a problem here. If two or more different jobs-records would point to the same document, you would count these double. You can either add a DISTINCT to the outer query, or move the jobs-table down into a subquery:
SELECT ...
FROM documents dc
WHERE EXISTS (
SELECT *
FROM jobs jo
WHERE jo.document_id = dc.id
AND ( jo.client_id = $THE_COMPANY
OR jo.server_id = $THE_COMPANY
OR EXISTS (
SELECT *
FROM users uu
JOIN organizations oo ON uu.organization_id = ex.id
WHERE uu.id = jo.creator_id
AND oo.id = $THE_COMAPNY
)
)
)
;
As you can see, the thee ways of selecting a document end up in a WHERE (a OR b OR c) clause.
UPDATE: (since the OP does not give us the table definions in a useble form I had to reconstruct these)
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
--
-- create the missing tables
--
CREATE TABLE organizations
( id SERIAL NOT NULL PRIMARY KEY
, company_name varchar
);
CREATE TABLE users
( id SERIAL NOT NULL PRIMARY KEY
, organization_id INTEGER NOT NULL REFERENCES organizations(id)
);
CREATE TABLE jobs
( id SERIAL NOT NULL PRIMARY KEY
, client_id INTEGER NOT NULL REFERENCES organizations(id)
, server_id INTEGER NOT NULL REFERENCES organizations(id)
, creator_id INTEGER NOT NULL REFERENCES users(id)
);
CREATE TABLE documents
( id SERIAL NOT NULL PRIMARY KEY
, job_id INTEGER NOT NULL REFERENCES jobs(id)
);
--
-- Populate
--
INSERT INTO organizations(id, company_name) VALUES
(85,'Big Corporation') ,(905,'Some other folks') ,(403,'ACME, Inc')
;
select setval('organizations_id_seq', 905);
INSERT INTO users(organization_id)
SELECT o.id
FROM generate_series(1,1000)
JOIN organizations o ON random() < 0.3
;
INSERT INTO jobs (client_id,server_id,creator_id)
SELECT o1.id, o2.id, u.id
FROM users u
JOIN organizations o1 ON 1=1
JOIN organizations o2 ON o2.id <> o1.id
;
INSERT INTO documents(job_id)
SELECT id FROM jobs j
;
DELETE FROM documents
WHERE random() < 0.5
;
--
-- And the query ...
--
EXPLAIN ANALYZE
SELECT o.id AS org
, count(*) AS the_docs
FROM organizations o
JOIN documents d ON 1=1 -- start with a carthesian product
WHERE EXISTS (
SELECT *
FROM jobs j
WHERE d.job_id = j.id
AND (j.client_id = o.id OR j.server_id = o.id )
)
OR EXISTS (
SELECT *
FROM jobs j
JOIN users u ON j.creator_id = u.id
WHERE u.organization_id = o.id
AND d.job_id = j.id
)
GROUP BY o.id
;