Using a complicated double join to get a count of child objects - SQL

Note that I'm using PostgreSQL.
I have an organizations table, a users table, a jobs table, and a documents table. I want to get a list of the organizations ordered by the number of total documents they have access to.
organizations
------------
id (pk)
company_name
users
------------
id (pk)
organization_id
jobs
------------
id (pk)
client_id (id of an organization)
server_id (id of an organization)
creator_id (id of a user)
documents
------------
id (pk)
job_id
Result Desired
organizations.id | organizations.company_name | document_count
85 | Big Corporation | 84
905 | Some other folks | 65
403 | ACME, Inc | 14
As you can see, an organization can be connected to a document through 3 different paths:
organizations.id => jobs.client_id => documents.job_id
organizations.id => jobs.server_id => documents.job_id
organizations.id => users.organization_id => jobs.creator_id => documents.job_id
But I want a query that will get the count of all the documents each company has access to...
I tried a couple of things... like this:
SELECT COUNT(documents.id) document_count, organizations.id, organizations.company_name
FROM organizations
INNER JOIN users ON organizations.id = users.organization_id
INNER JOIN jobs ON (
jobs.client_id = organizations.id OR
jobs.server_id = organizations.id OR
jobs.creator_id = users.id
)
INNER JOIN documents ON documents.job_id = jobs.id
GROUP BY organizations.id, organizations.company_name
ORDER BY document_count DESC
LIMIT 10
The query takes a while to run, but that's tolerable since I'm doing this for a one-time report. The results, however, cannot possibly be correct.
The first listed organization has a reported count of 129,834 documents, but that's impossible since there are only 32,820 records in the documents table. It must be counting huge numbers of duplicates (due to an error in one of my joins?), but I'm not sure where I've gone wrong.
The order appears correct, since the highest-volume user of the system is clearly at the top of the list, but the value is inflated somehow.

The problem is that when jobs.client_id = organizations.id or jobs.server_id = organizations.id, there is nothing to filter your INNER JOIN users (aside from its ON clause), so you get a separate record for every single user that belongs to that organization. In other words, for each organization, you're adding three values:
its total number of users times the total number of documents belonging to jobs for which it's a client
its total number of users times the total number of documents belonging to jobs for which it's a server
the total number of documents belonging to jobs for which one of its users is the creator
One way to fix this is to remove the INNER JOIN users line, and change this:
jobs.creator_id = users.id
to this:
jobs.creator_id IN (SELECT id FROM users WHERE organization_id = organizations.id)
...but that might perform terribly. You might need to try a few things before finding a query that performs acceptably.
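Applied to the query above, that suggestion looks like this (a sketch, untested against your data):
SELECT COUNT(documents.id) document_count, organizations.id, organizations.company_name
FROM organizations
INNER JOIN jobs ON (
jobs.client_id = organizations.id OR
jobs.server_id = organizations.id OR
jobs.creator_id IN (SELECT id FROM users WHERE organization_id = organizations.id)
)
INNER JOIN documents ON documents.job_id = jobs.id
GROUP BY organizations.id, organizations.company_name
ORDER BY document_count DESC
LIMIT 10
Because the three OR conditions are evaluated against a single (organization, job) pair, each job now contributes each of its documents at most once per organization.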

Simplify your thinking. You have 3 paths to a document id, so write 3 queries, UNION them, and count the result.

It's probably too late to redesign this, but you really should.
The jobs table should not have its own id field and key.
The jobs table is badly designed, because every lookup through the id index then has to read 1-100 different pages from the data file just to get the three other id fields, which are the ones you always want to use (that is the clue that a job should not have its own id).
You can make a quick fix by making jobs use an index that is clustered or clustering (depending on the DB system) on the job id field. An alternative is to mark the other three id fields as "includes" on the index, so the page reads into the data file go away entirely. Either of these may be enough to make this "just work".
What I would encourage you to do, though, is drop the id field and key on jobs, and instead make a "natural key" out of the three other id fields, and use that key on the documents table as well.
I would also denormalize (repeat) the organization of the creator on the jobs table and the documents table. A user isn't going to move to another org and keep the same access, so you should never have to run a sweep to update these in sync, and even if you did, it would be easy.
With these changes you can just run a select on the documents table directly, skipping the random page reads into the other tables. The GROUP BY across the three different id fields would be a bit tricky. I might give this a try, as it is interesting.
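For illustration, the redesign might look like the sketch below. It assumes no two jobs ever share the same client/server/creator triple, which you would have to verify before committing to a natural key:
CREATE TABLE jobs
( client_id INTEGER NOT NULL REFERENCES organizations(id)
, server_id INTEGER NOT NULL REFERENCES organizations(id)
, creator_id INTEGER NOT NULL REFERENCES users(id)
, PRIMARY KEY (client_id, server_id, creator_id)
);
CREATE TABLE documents
( id SERIAL NOT NULL PRIMARY KEY
, client_id INTEGER NOT NULL
, server_id INTEGER NOT NULL
, creator_id INTEGER NOT NULL
, FOREIGN KEY (client_id, server_id, creator_id)
REFERENCES jobs (client_id, server_id, creator_id)
);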
In the short term, though, try clustering or includes on the jobs table to solve the performance issue, and I will check the join logic tonight.

None of the answers quite got me there except for the one suggesting a UNION. This is what I came up with:
SELECT COUNT(docs.doc_id) document_count, docs.org_id, docs.org_name
FROM (
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs ON documents.job_id = jobs.id
INNER JOIN organizations ON jobs.client_id = organizations.id
UNION
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs ON documents.job_id = jobs.id
INNER JOIN organizations ON jobs.server_id = organizations.id
UNION
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs on documents.job_id = jobs.id
INNER JOIN users ON jobs.creator_id = users.id
INNER JOIN organizations ON users.organization_id = organizations.id
) docs
GROUP BY org_id, org_name
ORDER BY document_count DESC
The performance was much better than the subquery-based suggestions, and it appears to give a reasonable answer.
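A variant that may be worth benchmarking (my assumption, not something tested here): UNION ALL skips the de-duplication sort that UNION performs, and COUNT(DISTINCT ...) removes the cross-path duplicates instead:
SELECT COUNT(DISTINCT docs.doc_id) document_count, docs.org_id, docs.org_name
FROM (
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs ON documents.job_id = jobs.id
INNER JOIN organizations ON jobs.client_id = organizations.id
UNION ALL
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs ON documents.job_id = jobs.id
INNER JOIN organizations ON jobs.server_id = organizations.id
UNION ALL
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs ON documents.job_id = jobs.id
INNER JOIN users ON jobs.creator_id = users.id
INNER JOIN organizations ON users.organization_id = organizations.id
) docs
GROUP BY org_id, org_name
ORDER BY document_count DESC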

But I want a query that will get the count of all the documents you have access to...
That's where your query starts:
SELECT ... FROM documents
...
Since the only link from documents to anything else is its job_id, you'll need the jobs table as well:
SELECT ...
FROM documents dc
JOIN jobs jo ON dc.job_id = jo.id
...
Now, it is time for restrictions. Which documents do you actually want? There are three cases: either the client_id matches the organization, or the server_id matches it, or the creator_id matches a user that happens to work for it:
SELECT ...
FROM documents dc
JOIN jobs jo ON dc.job_id = jo.id
WHERE jo.client_id = $THE_COMPANY
OR jo.server_id = $THE_COMPANY
OR EXISTS (
SELECT *
FROM users uu
JOIN organizations oo ON uu.organization_id = oo.id
WHERE uu.id = jo.creator_id
AND oo.id = $THE_COMPANY
)
;
But, there might be a problem here. If the same document could be reached via more than one matching job, you would count it double. You can either add a DISTINCT to the outer query, or move the jobs table down into a subquery:
SELECT ...
FROM documents dc
WHERE EXISTS (
SELECT *
FROM jobs jo
WHERE dc.job_id = jo.id
AND ( jo.client_id = $THE_COMPANY
OR jo.server_id = $THE_COMPANY
OR EXISTS (
SELECT *
FROM users uu
JOIN organizations oo ON uu.organization_id = oo.id
WHERE uu.id = jo.creator_id
AND oo.id = $THE_COMPANY
)
)
)
;
As you can see, the three ways of selecting a document end up in a WHERE (a OR b OR c) clause.
UPDATE: (since the OP did not give us the table definitions in a usable form, I had to reconstruct them)
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
--
-- create the missing tables
--
CREATE TABLE organizations
( id SERIAL NOT NULL PRIMARY KEY
, company_name varchar
);
CREATE TABLE users
( id SERIAL NOT NULL PRIMARY KEY
, organization_id INTEGER NOT NULL REFERENCES organizations(id)
);
CREATE TABLE jobs
( id SERIAL NOT NULL PRIMARY KEY
, client_id INTEGER NOT NULL REFERENCES organizations(id)
, server_id INTEGER NOT NULL REFERENCES organizations(id)
, creator_id INTEGER NOT NULL REFERENCES users(id)
);
CREATE TABLE documents
( id SERIAL NOT NULL PRIMARY KEY
, job_id INTEGER NOT NULL REFERENCES jobs(id)
);
--
-- Populate
--
INSERT INTO organizations(id, company_name) VALUES
(85,'Big Corporation') ,(905,'Some other folks') ,(403,'ACME, Inc')
;
select setval('organizations_id_seq', 905);
INSERT INTO users(organization_id)
SELECT o.id
FROM generate_series(1,1000)
JOIN organizations o ON random() < 0.3
;
INSERT INTO jobs (client_id,server_id,creator_id)
SELECT o1.id, o2.id, u.id
FROM users u
JOIN organizations o1 ON 1=1
JOIN organizations o2 ON o2.id <> o1.id
;
INSERT INTO documents(job_id)
SELECT id FROM jobs j
;
DELETE FROM documents
WHERE random() < 0.5
;
--
-- And the query ...
--
EXPLAIN ANALYZE
SELECT o.id AS org
, count(*) AS the_docs
FROM organizations o
JOIN documents d ON 1=1 -- start with a Cartesian product
WHERE EXISTS (
SELECT *
FROM jobs j
WHERE d.job_id = j.id
AND (j.client_id = o.id OR j.server_id = o.id )
)
OR EXISTS (
SELECT *
FROM jobs j
JOIN users u ON j.creator_id = u.id
WHERE u.organization_id = o.id
AND d.job_id = j.id
)
GROUP BY o.id
;

Related

INNER JOIN of pageviews, contacts and companies - duplicated entries

In short: 3 table inner join duplicates records
I have data in BigQuery in 3 tables:
Pageviews with columns:
timestamp
user_id
title
path
Contacts with columns:
website_user_id
email
company_id
Companies with columns:
id
name
I want to display all recorded pageviews and, if user and/or company is known, display this data next to pageview.
First, I join contact and pageviews data (SQL is generated by Metabase business intelligence tool):
SELECT
`analytics.pageviews`.`timestamp` AS `timestamp`,
`analytics.pageviews`.`title` AS `title`,
`analytics.pageviews`.`path` AS `path`,
`Contacts`.`email` AS `email`
FROM `analytics.pageviews`
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id`
ORDER BY `timestamp` DESC
It works as expected and I can see pageviews attributed to known contacts.
Next, I'd like to show pageviews of contacts with known company and which company is this:
SELECT
`analytics.pageviews`.`timestamp` AS `timestamp`,
`analytics.pageviews`.`title` AS `title`,
`analytics.pageviews`.`path` AS `path`,
`Contacts`.`email` AS `email`,
`Companies`.`name` AS `name`
FROM `analytics.pageviews`
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id`
INNER JOIN `analytics.companies` `Companies` ON `Contacts`.`company_id` = `Companies`.`id`
ORDER BY `timestamp` DESC
With this query I would expect to see only pageviews where associated contact AND company are known (just another column for company name). The problem is, I get duplicate rows for every pageview (sometimes 5, sometimes 20 identical rows).
I want to avoid selecting DISTINCT timestamps because it can lead to excluding valid pageviews from different users but with identical timestamp.
How to approach this?
Your description sounds like you have duplicates in companies. This is easy to test for:
select c.id, count(*)
from `analytics.companies` c
group by c.id
having count(*) >= 2;
You can get the details using window functions:
select c.*
from (select c.*, count(*) over (partition by c.id) as cnt
from `analytics.companies` c
) c
where cnt >= 2
order by cnt desc, id;
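If duplicates do turn up, one way to make the join safe is to collapse companies to one row per id in a subquery first. A sketch in BigQuery standard SQL (it assumes the duplicate rows are exact copies, so keeping an arbitrary name per id is acceptable):
SELECT
`analytics.pageviews`.`timestamp` AS `timestamp`,
`analytics.pageviews`.`title` AS `title`,
`analytics.pageviews`.`path` AS `path`,
`Contacts`.`email` AS `email`,
`Companies`.`name` AS `name`
FROM `analytics.pageviews`
INNER JOIN `analytics.contacts` `Contacts` ON `analytics.pageviews`.`user_id` = `Contacts`.`website_user_id`
INNER JOIN (
SELECT id, ANY_VALUE(name) AS name
FROM `analytics.companies`
GROUP BY id
) `Companies` ON `Contacts`.`company_id` = `Companies`.`id`
ORDER BY `timestamp` DESC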

Matching similar entities based on many to many relationship

I have two entities in my database that are connected with a many to many relationship. I was wondering what would be the best way to list which entities have the most similarities based on it?
I tried doing a count(*) with intersect, but the query takes too long to run on every entry in my database (there are about 20k records). When running the query I wrote, CPU usage jumps to 100% and the database has locking issues.
Here is some code showing what I've tried:
My tables look something along these lines:
/* 20k records */
create table Movie(
Id INT PRIMARY KEY,
Title varchar(255)
);
/* 200-300 records */
create table Tags(
Id INT PRIMARY KEY,
Descr varchar(255)
);
/* 200,000-300,000 records */
create table TagMovies(
Movie_Id INT,
Tag_Id INT,
PRIMARY KEY (Movie_Id, Tag_Id),
FOREIGN KEY (Movie_Id) REFERENCES Movie(Id),
FOREIGN KEY (Tag_Id) REFERENCES Tags(Id)
);
This is the query that I wrote to try and list them (it works, but it is terribly slow).
Usually I also filter with TOP 1 and add a WHERE clause to get a specific set of related data.
SELECT
bk.Id,
rh.Id
FROM
Movies bk
CROSS APPLY (
SELECT TOP 15
b.Id,
/* Tags Score */
(
SELECT COUNT(*) FROM (
SELECT x.Tag_Id FROM TagMovies x WHERE x.Movie_Id = bk.Id
INTERSECT
SELECT x.Tag_Id FROM TagMovies x WHERE x.Movie_Id = b.Id
) Q1
)
as Amount
FROM
Movies b
WHERE
b.Id <> bk.Id
ORDER BY Amount DESC
) rh
Explanation:
Movies have tags and the user can get try to find movies similar to the one that they selected based on other movies that have similar tags.
Hmm ... just an idea, but maybe I didn't understand ...
This query should return best matched movies by tags for a given movie ID:
SELECT m.id, m.title, GROUP_CONCAT(DISTINCT t.Descr SEPARATOR ', ') as tags, count(*) as matches
FROM stack.Movie m
LEFT JOIN stack.TagMovies tm ON m.Id = tm.Movie_Id
LEFT JOIN stack.Tags t ON tm.Tag_Id = t.Id
WHERE m.id != 1
AND tm.Tag_Id IN (SELECT Tag_Id FROM stack.TagMovies tm WHERE tm.Movie_Id = 1)
GROUP BY m.id
ORDER BY matches DESC
LIMIT 15;
EDIT:
I just realized that it's for M$ SQL ... but maybe something similar can be done...
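For what it's worth, a rough translation to SQL Server 2017+ could use STRING_AGG in place of GROUP_CONCAT (a sketch only; STRING_AGG has no DISTINCT option, and movie id 1 is just the example target):
SELECT TOP 15
m.Id, m.Title,
STRING_AGG(t.Descr, ', ') AS tags,
COUNT(*) AS matches
FROM Movie m
INNER JOIN TagMovies tm ON m.Id = tm.Movie_Id
INNER JOIN Tags t ON tm.Tag_Id = t.Id
WHERE m.Id <> 1
AND tm.Tag_Id IN (SELECT Tag_Id FROM TagMovies WHERE Movie_Id = 1)
GROUP BY m.Id, m.Title
ORDER BY matches DESC;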
You should probably decide on a naming convention and stick with it. Are tables singular or plural nouns? I don't want to get into that debate, but pick one or the other.
Without access to your database I don't know how this will perform. It's just off the top of my head. You could also limit this by the M.id value to find the best matches for a single movie, which I think would improve performance by quite a bit.
Also, TOP x should let you get the x closest matches.
SELECT
M.Id,
M.Title,
SM.Id AS similar_movie_id,
SM.Title AS similar_movie_title,
COUNT(*) AS matched_tags
FROM
Movie M
INNER JOIN TagMovies TM1 ON TM1.Movie_Id = M.Id
INNER JOIN TagMovies TM2 ON
TM2.Tag_Id = TM1.Tag_Id AND
TM2.Movie_Id <> TM1.Movie_Id
INNER JOIN Movie SM ON SM.Id = TM2.Movie_Id
GROUP BY
M.Id,
M.Title,
SM.Id,
SM.Title
ORDER BY
COUNT(*) DESC
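To find matches for one specific movie, the same join can be anchored on that movie's id, which avoids the full self-join across all 20k movies (42 is a hypothetical target id):
SELECT TOP 15
SM.Id AS similar_movie_id,
SM.Title AS similar_movie_title,
COUNT(*) AS matched_tags
FROM TagMovies TM1
INNER JOIN TagMovies TM2 ON
TM2.Tag_Id = TM1.Tag_Id AND
TM2.Movie_Id <> TM1.Movie_Id
INNER JOIN Movie SM ON SM.Id = TM2.Movie_Id
WHERE TM1.Movie_Id = 42
GROUP BY SM.Id, SM.Title
ORDER BY COUNT(*) DESC;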

Postgres LEFT JOIN with SUM, missing records

I am trying to get the count of certain types of records in a related table. I am using a left join.
So I have a query that isn't quite right and one that is returning the correct results. The correct-results query has a higher execution cost. I'd like to use the first approach, if I can correct the results. (see http://sqlfiddle.com/#!15/7c20b/5/2)
CREATE TABLE people(
id SERIAL,
name varchar not null
);
CREATE TABLE pets(
id SERIAL,
name varchar not null,
kind varchar not null,
alive boolean not null default false,
person_id integer not null
);
INSERT INTO people(name) VALUES
('Chad'),
('Buck'); --can't keep pets alive
INSERT INTO pets(name, alive, kind, person_id) VALUES
('doggio', true, 'dog', 1),
('dog master flash', true, 'dog', 1),
('catio', true, 'cat', 1),
('lucky', false, 'cat', 2);
My goal is to get a table back with ALL of the people and the counts of the KINDS of pets they have alive:
| ID | ALIVE_DOGS_COUNT | ALIVE_CATS_COUNT |
|----|------------------|------------------|
| 1 | 2 | 1 |
| 2 | 0 | 0 |
I simplified the example. In our production app (not really pets) there would be about 100,000 dead dogs and cats per person. Pretty screwed up, I know, but this example is simpler to relay ;) I was hoping to filter all the 'dead' stuff out before the count. I have the slower query in production now (from the sqlfiddle above), but would love to get the LEFT JOIN version working.
Typically fastest if you fetch all or most rows:
SELECT pp.id
, COALESCE(pt.a_dog_ct, 0) AS alive_dogs_count
, COALESCE(pt.a_cat_ct, 0) AS alive_cats_count
FROM people pp
LEFT JOIN (
SELECT person_id
, count(kind = 'dog' OR NULL) AS a_dog_ct
, count(kind = 'cat' OR NULL) AS a_cat_ct
FROM pets
WHERE alive
GROUP BY 1
) pt ON pt.person_id = pp.id;
Indexes are irrelevant here, full table scans will be fastest. Except if alive pets are a rare case, then a partial index should help. Like:
CREATE INDEX pets_alive_idx ON pets (person_id, kind) WHERE alive;
I included all columns needed for the query (person_id, kind) to allow index-only scans.
SQL Fiddle.
Typically fastest for a small subset or a single row:
SELECT pp.id
, count(kind = 'dog' OR NULL) AS alive_dogs_count
, count(kind = 'cat' OR NULL) AS alive_cats_count
FROM people pp
LEFT JOIN pets pt ON pt.person_id = pp.id
AND pt.alive
WHERE <some condition to retrieve a small subset>
GROUP BY 1;
You should at least have an index on pets.person_id for this (or the partial index from above) - and possibly more, depending on the WHERE condition.
Related answers:
Query with LEFT JOIN not returning rows for count of 0
GROUP or DISTINCT after JOIN returns duplicates
Get count of foreign key from multiple tables
Your WHERE alive=true is actually filtering out the record for person_id = 2. Use the below query, which pushes the alive = true condition into the CASE expression. See your modified Fiddle.
SELECT people.id,
pe.alive_dogs_count,
pe.alive_cats_count
FROM people
LEFT JOIN
(
select person_id,
COALESCE(SUM(case when pets.kind='dog' and alive = true then 1 else 0 end),0) as alive_dogs_count,
COALESCE(SUM(case when pets.kind='cat' and alive = true then 1 else 0 end),0) as alive_cats_count
from pets
GROUP BY person_id
) pe on people.id = pe.person_id
(OR) your version
SELECT
people.id,
COALESCE(SUM(case when pets.kind='dog' and alive = true then 1 else 0 end),0) as alive_dogs_count,
COALESCE(SUM(case when pets.kind='cat' and alive = true then 1 else 0 end),0) as alive_cats_count
FROM people
LEFT JOIN pets on people.id = pets.person_id
GROUP BY people.id;
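On PostgreSQL 9.4 or later, the same conditional counts can also be written with the aggregate FILTER clause, which reads more directly (a sketch equivalent to the version above):
SELECT
people.id,
COUNT(*) FILTER (WHERE pets.kind = 'dog' AND pets.alive) AS alive_dogs_count,
COUNT(*) FILTER (WHERE pets.kind = 'cat' AND pets.alive) AS alive_cats_count
FROM people
LEFT JOIN pets ON people.id = pets.person_id
GROUP BY people.id;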
JOIN with SUM
I think your original query was something like this:
SELECT people.id, stats.dog, stats.cat
FROM people
JOIN (SELECT person_id,
count(kind) FILTER (WHERE kind = 'dog') AS dog,
count(kind) FILTER (WHERE kind = 'cat') AS cat
FROM pets WHERE alive GROUP BY person_id) stats
ON stats.person_id = people.id
That works smoothly, but you should understand that the result will miss people with 0 pets, because of the inner join.
In order to include people who miss pets, you can:
firstly LEFT JOIN,
then GROUP BY joined result
and be ready for NULL values instead of counts.
See the accepted answer above.
Credits to @ErwinBrandstetter
Slowness
In contrast to some other DBMSs, PostgreSQL doesn't automatically create indexes for foreign keys.
One multicolumn index will be more efficient than three single indexes. Extend the foreign key index with extra columns from the WHERE and JOIN ON clauses, in the right order:
CREATE INDEX people_fk_with_kind_alive ON pets (person_id, alive, kind);
REF: https://postgresql.org/docs/11/indexes-multicolumn.html
Of course, your primary keys should be defined. The primary key will be indexed by default.

Join tables in sqlite with many to many

I have the following database schema:
create table people (
id integer primary key autoincrement
);
create table groups (
id integer primary key autoincrement
);
and I already have which people are members of which groups in a separate file (let's say in tuples of (person id, group id). How can I structure my database schema such that it's easy to access a person's groups, and also easy to access the members of a group? It is difficult and slow to read the tuples that I currently have, so I want this to be in database form. I can't have things like member1, member2, etc. as columns because the number of people in a group is currently unlimited.
Move your text file into a database table
CREATE TABLE groups_people (
groups_id integer,
people_id integer,
PRIMARY KEY(groups_id, people_id)
);
And select all people that are a member of group 7
SELECT * FROM people p
LEFT JOIN groups_people gp ON gp.people_id = p.id
WHERE gp.groups_id = '7';
And select all the groups that person 5 is in
SELECT * FROM groups g
LEFT JOIN groups_people gp ON gp.groups_id = g.id
WHERE gp.people_id = '5';
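One addition worth considering (my suggestion, not part of the original answer): the composite primary key (groups_id, people_id) makes lookups by group cheap, but the second query filters by person, so a mirror-image index helps there:
CREATE INDEX groups_people_by_person ON groups_people (people_id, groups_id);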

MySQL query - possible to include this clause?

I have the following query, which retrieves 4 adverts from certain categories in a random order.
At the moment, if a user has more than 1 advert, then potentially all of those ads might be retrieved - I need to limit it so that only 1 ad per user is displayed.
Is this possible to achieve in the same query?
SELECT a.advert_id, a.title, a.url, a.user_id,
FLOOR(1 + RAND() * x.m_id) 'rand_ind'
FROM adverts AS a
INNER JOIN advert_categories AS ac
ON a.advert_id = ac.advert_id,
(
SELECT MAX(t.advert_id) - 1 'm_id'
FROM adverts t
) x
WHERE ac.category_id IN
(
SELECT category_id
FROM website_categories
WHERE website_id = '8'
)
AND a.advert_type = 'text'
GROUP BY a.advert_id
ORDER BY rand_ind
LIMIT 4
Note: The solution is the last query at the bottom of this answer.
Test Schema and Data
create table adverts (
advert_id int primary key, title varchar(20), url varchar(20), user_id int, advert_type varchar(10))
;
create table advert_categories (
advert_id int, category_id int, primary key(category_id, advert_id))
;
create table website_categories (
website_id int, category_id int, primary key(website_id, category_id))
;
insert website_categories values
(8,1),(8,3),(8,5),
(1,1),(2,3),(4,5)
;
insert adverts (advert_id, title, user_id) values
(1, 'StackExchange', 1),
(2, 'StackOverflow', 1),
(3, 'SuperUser', 1),
(4, 'ServerFault', 1),
(5, 'Programming', 1),
(6, 'C#', 2),
(7, 'Java', 2),
(8, 'Python', 2),
(9, 'Perl', 2),
(10, 'Google', 3)
;
update adverts set advert_type = 'text'
;
insert advert_categories values
(1,1),(1,3),
(2,3),(2,4),
(3,1),(3,2),(3,3),(3,4),
(4,1),
(5,4),
(6,1),(6,4),
(7,2),
(8,1),
(9,3),
(10,3),(10,5)
;
Data properties
each website can belong to multiple categories
for simplicity, all adverts are of type 'text'
each advert can belong to multiple categories. If a website and an advert share more than one category for the same user_id, a straight join between the 3 tables (next query) shows that advert_id more than once.
This query joins the 3 tables together (notice that ids 1, 3 and 10 each appear twice)
select *
from website_categories wc
inner join advert_categories ac on wc.category_id = ac.category_id
inner join adverts a on a.advert_id = ac.advert_id and a.advert_type = 'text'
where wc.website_id='8'
order by a.advert_id
To make each advert show only once, this is the core query to show all eligible ads, each only once
select *
from adverts a
where a.advert_type = 'text'
and exists (
select *
from website_categories wc
inner join advert_categories ac on wc.category_id = ac.category_id
where wc.website_id='8'
and a.advert_id = ac.advert_id)
The next query retrieves all the advert_id's to be shown
select advert_id, user_id
from (
select
advert_id, user_id,
@r := @r + 1 r
from (select @r:=0) r
cross join
(
# core query -- vvv
select a.advert_id, a.user_id
from adverts a
where a.advert_type = 'text'
and exists (
select *
from website_categories wc
inner join advert_categories ac on wc.category_id = ac.category_id
where wc.website_id='8'
and a.advert_id = ac.advert_id)
# core query -- ^^^
order by rand()
) EligibleAdsAndUserIDs
) RowNumbered
group by user_id
order by r
limit 2
There are 3 levels to this query
aliased EligibleAdsAndUserIDs: core query, sorted randomly using order by rand()
aliased RowNumbered: row number added to core query, using MySQL side-effecting @variables
the outermost query forces MySQL to collect rows as numbered randomly in the inner queries, and group by user_id causes it to retain only the first row for each user_id. limit 2 causes the query to stop as soon as two distinct user_id's have been encountered.
This is the final query, which takes the advert_id's from the previous query and joins them back to the adverts table to retrieve the required columns. It shows:
each user_id only once
users with more ads proportionally (statistically) more often, according to the number of eligible ads they have
Note: Point (2) works because the more ads you have, the more likely you are to hit the top placings in the row-numbering subquery
select a.advert_id, a.title, a.url, a.user_id
from
(
select advert_id
from (
select
advert_id, user_id,
@r := @r + 1 r
from (select @r:=0) r
cross join
(
# core query -- vvv
select a.advert_id, a.user_id
from adverts a
where a.advert_type = 'text'
and exists (
select *
from website_categories wc
inner join advert_categories ac on wc.category_id = ac.category_id
where wc.website_id='8'
and a.advert_id = ac.advert_id)
# core query -- ^^^
order by rand()
) EligibleAdsAndUserIDs
) RowNumbered
group by user_id
order by r
limit 2
) Top2
inner join adverts a on a.advert_id = Top2.advert_id;
I'm thinking through something but don't have MySQL available... can you try this query to see if it works or crashes?
SELECT
PreQuery.user_id,
(select max( tmp.someRandom ) from PreQuery tmp where tmp.User_ID = PreQuery.User_ID ) MaxRandom
from
( select adverts.user_id,
rand() someRandom
from adverts, advert_categories
where adverts.advert_id = advert_categories.advert_id ) PreQuery
If the "tmp" alias is recognized as a temp buffer of the preliminary query as defined by the OUTER FROM clause, I might have something that will work... I think the field as a select statement from a queried from WONT work, but if it does, I know I'll have something solid for you.
Ok, this one might make the head hurt a bit, but let's get the logic going... The innermost "Core Query" is a basis that gets all unique, randomly ordered QUALIFIED users that have a qualifying ad based on the category chosen and type = 'text'. Since the order is random, I don't care what the assigned sequence is, and order by that. The limit 4 will return the first 4 entries that qualify, regardless of one user having 1 ad vs another having 1000 ads.
Next, join to the advertisements, reversing the table/join qualifications... but by having a WHERE - IN sub-select, the sub-select will be on each unique user ID that was qualified by the "CoreQuery" and will ONLY be done 4 times based on its inner limit. So even if there are 100 users with different advertisements, we get 4 users.
Now, the Join to the CoreQuery is the Advert Table based on the same qualifying user. Typically this would join ALL records against the core query given they are for the same user in question... This is correct... HOWEVER, the NEXT WHERE clause is what filters it down to only ONE ad for the given person.
The sub-select makes sure its "Advert_ID" matches the one selected in the sub-select. The sub-select is based ONLY on the current "CoreQuery.user_ID" and gets ALL the qualifying category/ads for that user (which is too many: we don't want ALL their ads)... So adding an ORDER BY RAND() randomizes only this one person's ads in the result set, and limiting THAT by 1 gives only ONE of their qualified ads...
So, the CoreQuery restricts down to 4 users. Then for each qualified user ID, gets only 1 of the qualified ads (by its inner order by RAND() and LIMIT 1 )...
Although I don't have MySQL to try it on, the queries are COMPLETELY legit and I hope this works for you... man, I love brain teasers like this...
SELECT
ad1.*
from
( SELECT ad.user_id,
count(*) as UserAdCount,
RAND() as ANYRand
from
website_categories wc
inner join advert_categories ac
ON wc.category_id = ac.category_id
inner join adverts ad
ON ac.advert_id = ad.advert_id
AND ad.advert_type = 'text'
where
wc.website_id = 8
GROUP BY
1
order by
3
limit
4 ) CoreQuery,
adverts ad1
WHERE
ad1.advert_type = 'text'
AND CoreQuery.User_ID = ad1.User_ID
AND ad1.advert_id in
( select
ad2.advert_id
FROM
adverts ad2,
advert_categories ac2,
website_categories wc2
WHERE
ad2.user_id = CoreQuery.user_id
AND ad2.advert_id = ac2.advert_id
AND ac2.category_id = wc2.category_id
AND wc2.website_id = 8
ORDER BY
RAND()
LIMIT
1 )
I suggest that you do the randomization in PHP. This is way faster than doing it in MySQL.
"However, when the table is large (over about 10,000 rows) this method of selecting a random row becomes increasingly slow with the size of the table and can create a great load on the server. I tested this on a table I was working that contained 2,394,968 rows. It took 717 seconds (12 minutes!) to return a random row."
http://www.greggdev.com/web/articles.php?id=6
set @userid = -1;
select
a.id,
a.title,
case when @userid = a.userid then
0
else
1
end as isfirst,
(@userid := a.userid)
from
adverts a
inner join advertcategories ac on ac.advertid = a.advertid
inner join categories c on c.categoryid = ac.categoryid
where
c.website = 8
having
isfirst = 1
order by
a.userid,
rand()
limit 4
Add COUNT(a.user_id) AS owned in the main SELECT list and add HAVING owned < 2 after the GROUP BY.
http://dev.mysql.com/doc/refman/5.5/en/select.html
I think this is the way to do it: if a user has more than one advert, then we will not select it.