Matching similar entities based on many to many relationship


I have two entities in my database that are connected by a many-to-many relationship. What would be the best way to list which entities have the most in common based on that relationship?
I tried doing a count(*) with intersect, but the query takes too long to run on every entry in my database (there are about 20k records). When running the query I wrote, CPU usage jumps to 100% and the database has locking issues.
Here is some code showing what I've tried:
My tables look something along these lines:
/* 20k records */
create table Movie(
Id INT PRIMARY KEY,
Title varchar(255)
);
/* 200-300 records */
create table Tags(
Id INT PRIMARY KEY,
Descr varchar(255) /* Desc is a reserved word, so the column is named Descr */
);
/* 200,000-300,000 records */
create table TagMovies(
Movie_Id INT,
Tag_Id INT,
PRIMARY KEY (Movie_Id, Tag_Id),
FOREIGN KEY (Movie_Id) REFERENCES Movie(Id),
FOREIGN KEY (Tag_Id) REFERENCES Tags(Id)
);
This is the query that I wrote to try and list them. It works, but it is terribly slow.
Usually I also filter with TOP 1 and add a WHERE clause to get a specific set of related data.
SELECT
bk.Id,
rh.Id
FROM
Movie bk
CROSS APPLY (
SELECT TOP 15
b.Id,
/* Tags Score */
(
SELECT COUNT(*) FROM (
SELECT x.Tag_Id FROM TagMovies x WHERE x.Movie_Id = bk.Id
INTERSECT
SELECT x.Tag_Id FROM TagMovies x WHERE x.Movie_Id = b.Id
) Q1
)
as Amount
FROM
Movie b
WHERE
b.Id <> bk.Id
ORDER BY Amount DESC
) rh
Explanation:
Movies have tags, and the user can try to find movies similar to the one they selected, based on other movies that have similar tags.

Hmm ... just an idea, but maybe I didn't understand the question ...
This query should return best matched movies by tags for a given movie ID:
SELECT m.id, m.title, GROUP_CONCAT(DISTINCT t.Descr SEPARATOR ', ') as tags, count(*) as matches
FROM stack.Movie m
LEFT JOIN stack.TagMovies tm ON m.Id = tm.Movie_Id
LEFT JOIN stack.Tags t ON tm.Tag_Id = t.Id
WHERE m.id != 1
AND tm.Tag_Id IN (SELECT Tag_Id FROM stack.TagMovies tm WHERE tm.Movie_Id = 1)
GROUP BY m.id
ORDER BY matches DESC
LIMIT 15;
EDIT:
I just realized that the question is about Microsoft SQL Server and this query is written for MySQL ... but maybe something similar can be done there.

You should probably decide on a naming convention and stick with it. Are tables singular or plural nouns? I don't want to get into that debate, but pick one or the other.
Without access to your database I don't know how this will perform. It's just off the top of my head. You could also limit this by the M.id value to find the best matches for a single movie, which I think would improve performance by quite a bit.
Also, TOP x should let you get the x closest matches.
SELECT
M.id,
M.title,
SM.id AS similar_movie_id,
SM.title AS similar_movie_title,
COUNT(*) AS matched_tags
FROM
Movie M
INNER JOIN TagMovies TM1 ON TM1.movie_id = M.id
INNER JOIN TagMovies TM2 ON
TM2.tag_id = TM1.tag_id AND
TM2.movie_id <> TM1.movie_id
INNER JOIN Movie SM ON SM.id = TM2.movie_id
GROUP BY
M.id,
M.title,
SM.id,
SM.title
ORDER BY
COUNT(*) DESC
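
To make the self-join idea concrete for a single movie, here is a minimal sketch using Python's built-in sqlite3 module rather than SQL Server, so LIMIT stands in for TOP. The sample rows and the helper name similar_movies are made up for illustration; only the table and column names come from the question.

```python
import sqlite3

# Hypothetical sample data; table and column names follow the question's schema.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Movie (Id INTEGER PRIMARY KEY, Title TEXT);
    CREATE TABLE TagMovies (
        Movie_Id INTEGER REFERENCES Movie(Id),
        Tag_Id   INTEGER,
        PRIMARY KEY (Movie_Id, Tag_Id)
    );
    INSERT INTO Movie VALUES (1, 'A'), (2, 'B'), (3, 'C'), (4, 'D');
    INSERT INTO TagMovies VALUES
        (1, 10), (1, 11), (1, 12),
        (2, 10), (2, 11),
        (3, 12),
        (4, 99);
""")

def similar_movies(conn, movie_id, limit=15):
    """Return (movie_id, shared_tag_count) pairs, best match first."""
    sql = """
        SELECT tm2.Movie_Id, COUNT(*) AS matched_tags
        FROM TagMovies tm1
        JOIN TagMovies tm2
          ON tm2.Tag_Id = tm1.Tag_Id          -- same tag
         AND tm2.Movie_Id <> tm1.Movie_Id     -- but a different movie
        WHERE tm1.Movie_Id = ?                -- restrict to one source movie
        GROUP BY tm2.Movie_Id
        ORDER BY matched_tags DESC, tm2.Movie_Id
        LIMIT ?
    """
    return conn.execute(sql, (movie_id, limit)).fetchall()

result = similar_movies(conn, 1)
print(result)
```

Filtering on one source movie keeps the join small, since only that movie's tags are expanded. Movies that share no tag with it simply do not appear in the result.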

Related

Postgres many to one relationship join multiple tables and select all rows, provided that at least one row matches some criteria

Suppose I have a schema something like
create table if not exists "user" (
id serial primary key,
name text not null
);
create table if not exists post (
id serial primary key,
user_id integer not null references "user" (id),
score integer not null
);
I want to run a query that selects a row from the user table by ID, and all the rows that reference it from the post table, provided that at least one row in the post table has a score of greater than some number n (e.g. 50). I'm not exactly sure how to do this though.

You can use window functions. Let me assume that post has a user_id column so the tables can be tied together:
select u.*
from "user" u join
(select p.*, max(score) over (partition by user_id) as max_score
from post p
) p
on p.user_id = u.id
where p.max_score > 50;
If you just wanted all scores, then aggregation with filtering might be sufficient:
select u.*, array_agg(p.score order by p.score desc)
from "user" u join
post p
on p.user_id = u.id
group by u.id
having max(p.score) > 50;
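
As an alternative to the two queries above, the same condition can be written with EXISTS. This is only a sketch run against SQLite through Python's sqlite3 module, not PostgreSQL; the sample rows and the helper name user_with_posts are assumptions for illustration.

```python
import sqlite3

# Hypothetical sample data; the schema mirrors the question ("user" is quoted
# because it is a reserved word in PostgreSQL).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "user" (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE post (
        id      INTEGER PRIMARY KEY,
        user_id INTEGER NOT NULL REFERENCES "user"(id),
        score   INTEGER NOT NULL
    );
    INSERT INTO "user" VALUES (1, 'ann'), (2, 'bob');
    INSERT INTO post (user_id, score) VALUES (1, 10), (1, 60), (2, 20), (2, 30);
""")

def user_with_posts(conn, user_id, n):
    """Return the user joined to all of their posts, but only if at least
    one of those posts has a score greater than n."""
    sql = """
        SELECT u.id, u.name, p.id, p.score
        FROM "user" u
        JOIN post p ON p.user_id = u.id
        WHERE u.id = ?
          AND EXISTS (SELECT 1 FROM post q
                      WHERE q.user_id = u.id AND q.score > ?)
        ORDER BY p.id
    """
    return conn.execute(sql, (user_id, n)).fetchall()

print(user_with_posts(conn, 1, 50))  # ann has a post scoring 60, so all her posts come back
print(user_with_posts(conn, 2, 50))  # bob has no post above 50, so nothing comes back
```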

Flattening nested query in WHERE clause with NOT IN

Suppose I have these two tables, simplified for the purpose of the question:
CREATE TABLE merchandises
(
id BIGSERIAL PRIMARY KEY,
name VARCHAR(255) NOT NULL,
price INT NOT NULL
);
CREATE TABLE gifts
(
id BIGSERIAL NOT NULL PRIMARY KEY,
from_user VARCHAR(255) REFERENCES users(id),
to_user VARCHAR(255) REFERENCES users(id),
with_merchandise BIGINT REFERENCES merchandises(id)
);
The merchandises table lists the available merchandise. The gifts table records that one user has sent a piece of merchandise to another user as a gift (a proper index is in place to avoid duplication).
What I would like to query is a list of merchandise that a user can send to another user, excluding merchandise that the first user has already gifted to the second.
This is a query that works, but I hope to find one without a nested query, on the assumption that the PostgreSQL optimizer might then give better performance.
SELECT DISTINCT ON (m.id) m.id, m.name, m.description
FROM merchandises m
WHERE m.id NOT IN (
SELECT g.with_merchandise
FROM gifts g
WHERE g.from_user = 'some_user_id' AND g.to_user = 'some_other_user_id'
)
ORDER BY m.id ASC
LIMIT 20 OFFSET 0
In the previous attempt, I had this query, but I found out that it does not work:
SELECT DISTINCT ON (m.id) m.id, m.name, m.description
FROM merchandises m
LEFT JOIN gifts g
ON m.id = g.with_merchandise
WHERE g.id IS NULL
OR g.from_user <> 'some_user_id' AND g.to_user <> 'some_other_user_id'
ORDER BY m.id ASC
LIMIT 20 OFFSET 0
This query does not work because, even though the WHERE clause filters out gift entries between the two specific users, two other users might have given a gift with the same merchandise (same with_merchandise value), and that row still lets the merchandise through.

Even though you asked to remove the subquery, a NOT EXISTS subquery might run faster than NOT IN, especially if the NOT IN subquery returns a lot of values:
SELECT m.id, m.name, m.description
FROM merchandises m
WHERE NOT EXISTS (
SELECT 1
FROM gifts g
WHERE g.with_merchandise = m.id
AND g.from_user = 'some_user_id'
AND g.to_user = 'some_other_user_id'
)
This query can take advantage of a composite index on gifts(with_merchandise, from_user, to_user).
If you would still rather use a left join, then move your conditions on from_user and to_user from the WHERE clause to the ON clause:
SELECT m.id, m.name, m.description
FROM merchandises m
LEFT JOIN gifts g ON m.id = g.with_merchandise
AND g.from_user = 'some_user_id' AND g.to_user = 'some_other_user_id'
WHERE g.id IS NULL
ORDER BY m.id ASC
LIMIT 20 OFFSET 0
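
To see why the conditions belong in the ON clause, here is a small sketch using Python's sqlite3 module (SQLite, not PostgreSQL). The users alice, bob, carol and dave and the three merchandise rows are invented test data. It runs the asker's WHERE-filtered join next to the two variants from this answer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE merchandises (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE gifts (
        id INTEGER PRIMARY KEY,
        from_user TEXT,
        to_user TEXT,
        with_merchandise INTEGER REFERENCES merchandises(id)
    );
    -- Hypothetical sample rows
    INSERT INTO merchandises VALUES (1, 'mug'), (2, 'pen'), (3, 'cap');
    INSERT INTO gifts (from_user, to_user, with_merchandise) VALUES
        ('alice', 'bob', 1),   -- the pair we filter on
        ('carol', 'dave', 1),  -- other users, same merchandise
        ('carol', 'dave', 2);
""")
params = {"f": "alice", "t": "bob"}

def ids(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql, params)]

# Filter in WHERE: carol's gift of merchandise 1 lets id 1 slip through.
broken = ids("""
    SELECT DISTINCT m.id FROM merchandises m
    LEFT JOIN gifts g ON m.id = g.with_merchandise
    WHERE g.id IS NULL OR g.from_user <> :f AND g.to_user <> :t
    ORDER BY m.id
""")

# NOT EXISTS anti-join.
not_exists = ids("""
    SELECT m.id FROM merchandises m
    WHERE NOT EXISTS (
        SELECT 1 FROM gifts g
        WHERE g.with_merchandise = m.id
          AND g.from_user = :f AND g.to_user = :t)
    ORDER BY m.id
""")

# LEFT JOIN with the user conditions moved into ON.
left_join = ids("""
    SELECT m.id FROM merchandises m
    LEFT JOIN gifts g ON m.id = g.with_merchandise
         AND g.from_user = :f AND g.to_user = :t
    WHERE g.id IS NULL
    ORDER BY m.id
""")

print(broken, not_exists, left_join)
```

With the conditions in ON, only gifts between the two chosen users can match, so WHERE g.id IS NULL keeps exactly the merchandise that pair has not exchanged.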

This uses a left outer join and should perform well.
SELECT m.*
FROM merchandises m
LEFT OUTER JOIN (SELECT with_merchandise FROM gifts WHERE from_user = 'some_user_id' AND to_user = 'some_other_user_id' GROUP BY with_merchandise) g ON m.id = g.with_merchandise
WHERE g.with_merchandise IS NULL
ORDER BY m.id ASC
LIMIT 20 OFFSET 0

SQL: Counting number of games for team from results page

I'm a complete noob to SQL.
I have created the following table, which stores data on matches between two opponents and the points the winner got.
CREATE TABLE matches ( winner INT references players,
loser INT references players,
gamepoints INT);
I created the below VIEW to show standings:
CREATE VIEW standings as
select
players.id,
players.name,
count(matches.winner) as number_of_wins,
coalesce(sum(matches.gamepoints),0) as points
from players left join matches
on players.id = matches.winner
group by players.name, players.id
order by number_of_wins desc, points desc;
I wish to add a column that will show how many games a player played. My problem is that games appear in both matches.winner and matches.loser columns, and I'm not sure how to aggregate them in the standings view.
Also, would you say that the matches table is normalized?
Any help would be greatly appreciated.
EDIT: changed matches content.

With the help of @Jorge Campos, this is the solution:
CREATE VIEW games_won as
select p.id, p.name, coalesce(sum(m.gamepoints),0) gp, count(m.winner) ng
from players p left join matches m
on p.id=m.winner
group by p.id, p.name;
CREATE VIEW games_lost as
select p.id, p.name, count(m.loser) as ng
from players p left join matches m
on p.id=m.loser
group by p.id, p.name;
CREATE VIEW standings as
select w.id, w.name, w.ng as wins, w.ng+l.ng as matches, w.gp as gamepoints
from games_won w INNER JOIN games_lost l
on w.id=l.id
order by wins desc, gamepoints desc;

For the simple case you show, there are only a few things you need to fix.
First: change the column types in the matches table. They shouldn't be SERIAL, which is an auto-incrementing pseudo-type rather than a real type. Both columns are foreign keys, so they should be integer or bigint, as in:
create table matches (
winner bigint,
loser bigint,
gamepoints int,
constraint fk_player_winner foreign key (winner)
references players(id),
constraint fk_player_loser foreign key (loser)
references players(id)
);
Second: to find out how many games a player played, along with the points, you can create two subqueries, one for the games won and one for the games lost, and join the two, adding up the game counts. The catch is that the points from the lost games have to be subtracted from the points of the won games:
select w.id, w.name, w.gp-l.gp as gamepoints, w.ng+l.ng
from (select p.id, p.name, sum(m.gamepoints) gp, count(m.winner) as ng
from players p inner join matches m
on p.id=m.winner
group by p.id, p.name ) w
INNER JOIN
(select p.id, p.name, sum(m.gamepoints) gp, count(m.loser) as ng
from players p inner join matches m
on p.id=m.loser
group by p.id, p.name) l on w.id=l.id;
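From it you create your view. Note that the inner joins leave out any player who has never won or never lost a match; use left joins with coalesce if those players should appear.
Note: maybe these two subqueries are overkill. It should be possible to work this out with a join between two copies of the players table and matches.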
See how it goes here on fiddle: http://sqlfiddle.com/#!15/5b6a4/4
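
For comparison, here is a sketch of a single-join variant: each player is joined to every match in which they appear as winner or loser. It runs on SQLite through Python's sqlite3 module rather than PostgreSQL, the four players and three matches are made-up sample data, and the points column only sums points from won games, as in the original standings view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE players (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE matches (
        winner     INTEGER REFERENCES players(id),
        loser      INTEGER REFERENCES players(id),
        gamepoints INTEGER
    );
    -- Hypothetical sample rows; dee has not played yet.
    INSERT INTO players VALUES (1, 'ann'), (2, 'bob'), (3, 'cy'), (4, 'dee');
    INSERT INTO matches VALUES (1, 2, 3), (1, 3, 2), (2, 3, 1);
""")

standings_sql = """
    SELECT p.id,
           p.name,
           COUNT(m.winner) AS games_played,   -- 0 when no match row joined
           SUM(CASE WHEN m.winner = p.id THEN 1 ELSE 0 END) AS wins,
           SUM(CASE WHEN m.winner = p.id THEN m.gamepoints ELSE 0 END) AS points
    FROM players p
    LEFT JOIN matches m ON p.id IN (m.winner, m.loser)
    GROUP BY p.id, p.name
    ORDER BY wins DESC, points DESC, p.id
"""

standings = conn.execute(standings_sql).fetchall()
for row in standings:
    print(row)
```

The LEFT JOIN keeps players without any match, and counting m.winner rather than * gives them zero games.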

Using a complicated double join to get a count of child objects

Note that I'm using postgresql
I have an organizations table, a users table, a jobs table, and a documents table. I want to get a list of the organizations ordered by the number of total documents they have access to.
organizations
------------
id (pk)
company_name
users
------------
id (pk)
organization_id
jobs
------------
id (pk)
client_id (id of an organization)
server_id (id of an organization)
creator_id (id of a user)
documents
------------
id (pk)
job_id
Result Desired
organizations.id | organizations.company_name | document_count
85 | Big Corporation | 84
905 | Some other folks | 65
403 | ACME, Inc | 14
As you can see, an organization can be connected to a document through 3 different paths:
organizations.id => jobs.client_id => documents.job_id
organizations.id => jobs.server_id => documents.job_id
organizations.id => users.organization_id => jobs.creator_id => documents.job_id
But I want a query that will get the count of all the documents each company has access to...
I tried a couple of things... like this:
SELECT COUNT(documents.id) document_count, organizations.id, organizations.company_name
FROM organizations
INNER JOIN users ON organizations.id = users.organization_id
INNER JOIN jobs ON (
jobs.client_id = organizations.id OR
jobs.server_id = organizations.id OR
jobs.creator_id = users.id
)
INNER JOIN documents ON documents.job_id = jobs.id
GROUP BY organizations.id, organizations.company_name
ORDER BY document_count DESC
LIMIT 10
The query takes a while to run, which is not horrible since I'm doing it for a one-time report, but the results cannot possibly be correct.
The first listed organization has a reported count of 129,834 documents, but that's impossible since there are only 32,820 records in the documents table. I feel like it must be counting drastic quantities of duplicates (due to an error in one of my joins?) but I'm not sure where I've gone wrong.
The order appears correct since the highest volume user of the system is clearly at the top of the list... but the value is inflated somehow.

The problem is that if jobs.client_id = organizations.id or jobs.server_id = organizations.id, then there's nothing to filter your INNER JOIN users (aside from its ON clause), so you'll get a separate record for every single user that belongs to that organization. In other words, for each organization, you're adding three values:
its total number of users times the total number of documents belonging to jobs for which it's a client
its total number of users times the total number of documents belonging to jobs for which it's a server
the total number of documents belonging to jobs for which one of its users is the creator
One way to fix this is to remove the INNER JOIN users line, and change this:
jobs.creator_id = users.id
to this:
jobs.creator_id IN (SELECT id FROM users WHERE organization_id = organizations.id)
. . . but that might perform terribly. You might need to try a few things before finding a query that performs acceptably.
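
The inflation is easy to reproduce on a tiny data set. The sketch below uses Python's sqlite3 module (SQLite, not PostgreSQL) with invented rows: organization 1 has three users and is the client of one job that has two documents. The original join shape reports six documents for it, while the IN-subquery version reports two.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE organizations (id INTEGER PRIMARY KEY, company_name TEXT);
    CREATE TABLE users (id INTEGER PRIMARY KEY, organization_id INTEGER);
    CREATE TABLE jobs (id INTEGER PRIMARY KEY, client_id INTEGER,
                       server_id INTEGER, creator_id INTEGER);
    CREATE TABLE documents (id INTEGER PRIMARY KEY, job_id INTEGER);
    -- Hypothetical sample rows
    INSERT INTO organizations VALUES (1, 'Org one'), (2, 'Org two');
    INSERT INTO users VALUES (1, 1), (2, 1), (3, 1), (4, 2);
    INSERT INTO jobs VALUES (1, 1, 2, 4);      -- client org 1, server org 2, creator user 4
    INSERT INTO documents VALUES (1, 1), (2, 1);
""")

# Original shape: the users join repeats every client/server match once per user.
inflated_sql = """
    SELECT o.id, COUNT(d.id)
    FROM organizations o
    JOIN users u ON o.id = u.organization_id
    JOIN jobs j ON (j.client_id = o.id OR
                    j.server_id = o.id OR
                    j.creator_id = u.id)
    JOIN documents d ON d.job_id = j.id
    GROUP BY o.id
    ORDER BY o.id
"""

# Suggested fix: no users join, creator membership checked with IN.
fixed_sql = """
    SELECT o.id, COUNT(d.id)
    FROM organizations o
    JOIN jobs j ON (j.client_id = o.id OR
                    j.server_id = o.id OR
                    j.creator_id IN (SELECT id FROM users
                                     WHERE organization_id = o.id))
    JOIN documents d ON d.job_id = j.id
    GROUP BY o.id
    ORDER BY o.id
"""

inflated = conn.execute(inflated_sql).fetchall()
fixed = conn.execute(fixed_sql).fetchall()
print(inflated)
print(fixed)
```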

Simplify your thinking. You have 3 paths to a document id, so write 3 queries, union them, and count the result.

It's probably too late to redesign this, but you really should.
The jobs table should not have its own id field and key.
The jobs table is horribly designed because every reference to a disk page from the id index is gonna have to go read 1-100 different pages from disk out of the data file just to get the three other id fields that you always want to use (which is the clue that a job should not have its own id).
You can make a quick fix by making jobs use an index that is clustered or clustering (depending on the db system) on the job id field. An alternative would be to mark the other three id fields as "includes" on the index, so the page reads to the data file go away entirely. Either of these may be enough to make this "just work".
What I would encourage you to do though is drop the id field and key on jobs and instead make a "natural key" that has the three other id fields in it and use that key on the documents table as well.
I would also denormalize (repeat) the organization of the creator on the jobs table and the documents table. A user isn't going to move to another org and keep the same access, so you should never have to run a sweep to update these in sync, and even if you did it would be easy.
With these changes you can just do a select on the documents table directly, skipping the random pages reads needed from the other tables. The group by to group across the three different id fields would be a bit tricky. I might give this a try as it is interesting.
In the short term though, try clustering or includes on the jobs table to solve the performance issue and I will check the join logic tonight.

None of the answers quite got me there except for the one suggesting a UNION. This is what I came up with:
SELECT COUNT(docs.doc_id) document_count, docs.org_id, docs.org_name
FROM (
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs ON documents.job_id = jobs.id
INNER JOIN organizations ON jobs.client_id = organizations.id
UNION
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs ON documents.job_id = jobs.id
INNER JOIN organizations ON jobs.server_id = organizations.id
UNION
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs on documents.job_id = jobs.id
INNER JOIN users ON jobs.creator_id = users.id
INNER JOIN organizations ON users.organization_id = organizations.id
) docs
GROUP BY org_id, org_name
ORDER BY document_count DESC
The performance was much better than that of any of the subquery-based suggestions, and it appears to have given me a reasonable answer.

But I want a query that will get the count of all the documents you have access to...
That's where your query starts:
SELECT ... FROM documents
...
Since the only link between documents and organizations goes through jobs, you'll need the jobs table as well:
SELECT ...
FROM documents dc
JOIN jobs jo ON jo.id = dc.job_id
...
Now it is time for restrictions. Which documents do you actually want? There are three cases: either the client_id matches the company, or the server_id matches the company, or the creator_id matches a user who works for the company:
SELECT ...
FROM documents dc
JOIN jobs jo ON jo.id = dc.job_id
WHERE jo.client_id = $THE_COMPANY
OR jo.server_id = $THE_COMPANY
OR EXISTS (
SELECT *
FROM users uu
JOIN organizations oo ON uu.organization_id = oo.id
WHERE uu.id = jo.creator_id
AND oo.id = $THE_COMPANY
)
;
But there might be a problem here. If two or more different jobs records could lead to the same document, that document would be counted twice. You can either add a DISTINCT to the outer query, or move the jobs table down into a subquery:
SELECT ...
FROM documents dc
WHERE EXISTS (
SELECT *
FROM jobs jo
WHERE jo.id = dc.job_id
AND ( jo.client_id = $THE_COMPANY
OR jo.server_id = $THE_COMPANY
OR EXISTS (
SELECT *
FROM users uu
JOIN organizations oo ON uu.organization_id = oo.id
WHERE uu.id = jo.creator_id
AND oo.id = $THE_COMPANY
)
)
)
;
As you can see, the three ways of selecting a document end up in a WHERE (a OR b OR c) clause.
UPDATE: (since the OP did not give us the table definitions in a usable form, I had to reconstruct them)
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
--
-- create the missing tables
--
CREATE TABLE organizations
( id SERIAL NOT NULL PRIMARY KEY
, company_name varchar
);
CREATE TABLE users
( id SERIAL NOT NULL PRIMARY KEY
, organization_id INTEGER NOT NULL REFERENCES organizations(id)
);
CREATE TABLE jobs
( id SERIAL NOT NULL PRIMARY KEY
, client_id INTEGER NOT NULL REFERENCES organizations(id)
, server_id INTEGER NOT NULL REFERENCES organizations(id)
, creator_id INTEGER NOT NULL REFERENCES users(id)
);
CREATE TABLE documents
( id SERIAL NOT NULL PRIMARY KEY
, job_id INTEGER NOT NULL REFERENCES jobs(id)
);
--
-- Populate
--
INSERT INTO organizations(id, company_name) VALUES
(85,'Big Corporation') ,(905,'Some other folks') ,(403,'ACME, Inc')
;
select setval('organizations_id_seq', 905);
INSERT INTO users(organization_id)
SELECT o.id
FROM generate_series(1,1000)
JOIN organizations o ON random() < 0.3
;
INSERT INTO jobs (client_id,server_id,creator_id)
SELECT o1.id, o2.id, u.id
FROM users u
JOIN organizations o1 ON 1=1
JOIN organizations o2 ON o2.id <> o1.id
;
INSERT INTO documents(job_id)
SELECT id FROM jobs j
;
DELETE FROM documents
WHERE random() < 0.5
;
--
-- And the query ...
--
EXPLAIN ANALYZE
SELECT o.id AS org
, count(*) AS the_docs
FROM organizations o
JOIN documents d ON 1=1 -- start with a Cartesian product
WHERE EXISTS (
SELECT *
FROM jobs j
WHERE d.job_id = j.id
AND (j.client_id = o.id OR j.server_id = o.id )
)
OR EXISTS (
SELECT *
FROM jobs j
JOIN users u ON j.creator_id = u.id
WHERE u.organization_id = o.id
AND d.job_id = j.id
)
GROUP BY o.id
;

SQL joins with multiple records into one with a default

My 'people' table has one row per person, and that person has a division (not unique) and a company (not unique).
I need to join people to p_features, c_features, d_features on:
people.person=p_features.num_value
people.division=d_features.num_value
people.company=c_features.num_value
... in a way that if there is a record match in p_features/d_features/c_features only, it would be returned, but if it was in 2 or 3 of the tables, the most specific record would be returned.
From my test data below, for example, query for person=1 would return
'FALSE'
person 3 returns maybe, person 4 returns true, and person 9 returns default
The biggest issue is that there are 100 features and I have queries that need to return all of them in one row. My previous attempt was a function which queried on feature and num_value in each table and did a foreach, but 100 features * 4 tables meant 400 reads, and it was so slow that it brought the database to a halt once I loaded a few million rows of data.
create table p_features (
num_value int8,
feature varchar(20),
feature_value varchar(128)
);
create table c_features (
num_value int8,
feature varchar(20),
feature_value varchar(128)
);
create table d_features (
num_value int8,
feature varchar(20),
feature_value varchar(128)
);
create table default_features (
feature varchar(20),
feature_value varchar(128)
);
create table people (
person int8 not null,
division int8 not null,
company int8 not null
);
insert into people values (4,5,6);
insert into people values (3,5,6);
insert into people values (1,2,6);
insert into p_features values (4,'WEARING PANTS','TRUE');
insert into c_features values (6,'WEARING PANTS','FALSE');
insert into d_features values (5,'WEARING PANTS','MAYBE');
insert into default_features values('WEARING PANTS','DEFAULT');

You need to transpose the features into rows with a ranking. Here I used a common-table expression. If your database product does not support them, you can use temporary tables to achieve the same effect.
;With RankedFeatures As
(
Select 1 As FeatureRank, P.person, PF.feature, PF.feature_value
From people As P
Join p_features As PF
On PF.num_value = P.person
Union All
Select 2, P.person, PF.feature, PF.feature_value
From people As P
Join d_features As PF
On PF.num_value = P.division
Union All
Select 3, P.person, PF.feature, PF.feature_value
From people As P
Join c_features As PF
On PF.num_value = P.company
Union All
Select 4, P.person, DF.feature, DF.feature_value
From people As P
Cross Join default_features As DF
)
, HighestRankedFeature As
(
-- Best (lowest) rank per person and feature, so each feature resolves independently
Select Min(FeatureRank) As FeatureRank, person, feature
From RankedFeatures
Group By person, feature
)
Select RF.person, RF.FeatureRank, RF.feature, RF.feature_value
From people As P
Join HighestRankedFeature As HRF
On HRF.person = P.person
Join RankedFeatures As RF
On RF.FeatureRank = HRF.FeatureRank
And RF.person = HRF.person
And RF.feature = HRF.feature
Order By P.person, RF.feature

I don't know if I have understood your question very well, but to use JOIN, you need your tables loaded already, and then you use a SELECT statement with INNER JOIN, LEFT JOIN or whatever you need to show.
If you post some more information, it may be easier to understand.

There are some aspects of your schema I'm not understanding, like how to relate to the default_features table if there's no match in any of the specific tables. The only possible join condition is on feature, but if there's no match in the other 3 tables, there's no value to join on. So, in my example, I've hard-coded the DEFAULT since I can't think of how else to get it.
Hopefully this can get you started and if you can clarify the model a bit more, the solution can be refined.
select p.person, coalesce(pf.feature_value, df.feature_value, cf.feature_value, 'DEFAULT')
from people p
left join p_features pf
on p.person = pf.num_value
left join d_features df
on p.division = df.num_value
left join c_features cf
on p.company = cf.num_value
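
One possible way to avoid hard-coding the default, sketched here under the assumption that default_features lists every feature: cross join people with default_features, left join the three specific tables on both feature and num_value, and let COALESCE pick the most specific value. This runs on SQLite through Python's sqlite3 module; person 9 (division 8, company 7) is a made-up row added only to show the default case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE p_features (num_value INTEGER, feature TEXT, feature_value TEXT);
    CREATE TABLE d_features (num_value INTEGER, feature TEXT, feature_value TEXT);
    CREATE TABLE c_features (num_value INTEGER, feature TEXT, feature_value TEXT);
    CREATE TABLE default_features (feature TEXT, feature_value TEXT);
    CREATE TABLE people (person INTEGER NOT NULL,
                         division INTEGER NOT NULL,
                         company INTEGER NOT NULL);
    -- Test data from the question
    INSERT INTO people VALUES (4, 5, 6), (3, 5, 6), (1, 2, 6);
    INSERT INTO p_features VALUES (4, 'WEARING PANTS', 'TRUE');
    INSERT INTO c_features VALUES (6, 'WEARING PANTS', 'FALSE');
    INSERT INTO d_features VALUES (5, 'WEARING PANTS', 'MAYBE');
    INSERT INTO default_features VALUES ('WEARING PANTS', 'DEFAULT');
    -- Hypothetical extra person with no matching feature rows
    INSERT INTO people VALUES (9, 8, 7);
""")

resolved_sql = """
    SELECT p.person,
           f.feature,
           COALESCE(pf.feature_value,   -- person level wins
                    df.feature_value,   -- then division
                    cf.feature_value,   -- then company
                    f.feature_value)    -- finally the default
               AS feature_value
    FROM people p
    CROSS JOIN default_features f
    LEFT JOIN p_features pf ON pf.num_value = p.person   AND pf.feature = f.feature
    LEFT JOIN d_features df ON df.num_value = p.division AND df.feature = f.feature
    LEFT JOIN c_features cf ON cf.num_value = p.company  AND cf.feature = f.feature
    ORDER BY p.person, f.feature
"""

resolved = conn.execute(resolved_sql).fetchall()
for row in resolved:
    print(row)
```

This returns one row per person and feature; turning the 100 features into columns of a single row would still need a pivot step on top of it.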
