Join tables in sqlite with many to many - sql

I have the following database schema:
create table people (
id integer primary key autoincrement
);
create table groups (
id integer primary key autoincrement
);
and I already have which people are members of which groups in a separate file (let's say as tuples of (person id, group id)). How can I structure my database schema so that it's easy to access a person's groups, and also easy to access the members of a group? It is difficult and slow to read the tuples that I currently have, so I want this in database form. I can't have columns like member1, member2, etc., because the number of people in a group is currently unlimited.

Move your text file into a database table
CREATE TABLE groups_people (
groups_id integer,
people_id integer,
PRIMARY KEY(groups_id, people_id)
);
And select all people that are a member of group 7
SELECT p.* FROM people p
JOIN groups_people gp ON gp.people_id = p.id
WHERE gp.groups_id = 7;
And select all the groups that person 5 is in
SELECT g.* FROM groups g
JOIN groups_people gp ON gp.groups_id = g.id
WHERE gp.people_id = 5;
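To sanity-check the junction-table approach end to end, here is a minimal sketch using Python's sqlite3 module; the membership tuples are invented for illustration:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY AUTOINCREMENT);
CREATE TABLE groups (id INTEGER PRIMARY KEY AUTOINCREMENT);
CREATE TABLE groups_people (
    groups_id INTEGER,
    people_id INTEGER,
    PRIMARY KEY (groups_id, people_id)
);
""")

# Hypothetical (person id, group id) tuples, as read from the text file
memberships = [(5, 7), (6, 7), (5, 9)]
con.executemany("INSERT INTO people (id) VALUES (?)", [(5,), (6,)])
con.executemany("INSERT INTO groups (id) VALUES (?)", [(7,), (9,)])
con.executemany(
    "INSERT INTO groups_people (people_id, groups_id) VALUES (?, ?)",
    memberships,
)

# All members of group 7
members = [r[0] for r in con.execute(
    "SELECT p.id FROM people p "
    "JOIN groups_people gp ON gp.people_id = p.id "
    "WHERE gp.groups_id = 7")]

# All groups that person 5 is in
groups_of_5 = [r[0] for r in con.execute(
    "SELECT g.id FROM groups g "
    "JOIN groups_people gp ON gp.groups_id = g.id "
    "WHERE gp.people_id = 5")]
```

The junction table's primary key indexes lookups by group; adding a second index on (people_id, groups_id) makes the reverse lookup indexed as well.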

Related

Postgres many-to-many relation: finding rows all having relation to a set of rows from related table

Consider three tables, let's call them groups, subgroups, another_groups, and a table subgroups_another_groups that specifies a many-to-many relation between subgroups and another_groups. subgroups and groups are in a one-to-many relation, so subgroups has a foreign key group_id.
How is it possible to select the another_groups rows that all of the subgroups within a group have a relation to?
I assume that you are speaking of such a setup:
CREATE TABLE groups (
id integer PRIMARY KEY
);
CREATE TABLE subgroups (
id integer PRIMARY KEY,
group_id integer REFERENCES groups NOT NULL
);
CREATE INDEX ON subgroups(group_id);
CREATE TABLE another_groups (
id integer PRIMARY KEY
);
CREATE TABLE subgroups_another_groups (
subgroup_id integer REFERENCES subgroups NOT NULL,
another_groups_id integer REFERENCES another_groups NOT NULL,
PRIMARY KEY(subgroup_id, another_groups_id)
);
CREATE INDEX ON subgroups_another_groups(another_groups_id);
So you want to know all another_groups that are connected to a group via the other two tables, except those for which some subgroup of that group has no connection to that another_groups row, right?
In SQL, that would read:
SELECT DISTINCT g.id, a.id
FROM another_groups a
JOIN subgroups_another_groups sag ON a.id = sag.another_groups_id
JOIN subgroups s ON sag.subgroup_id = s.id
JOIN groups g ON s.group_id = g.id
WHERE NOT EXISTS
   (SELECT 1 FROM subgroups s1
    WHERE s1.group_id = g.id
    AND NOT EXISTS
       (SELECT 1 FROM subgroups_another_groups sag1
        WHERE sag1.subgroup_id = s1.id
        AND sag1.another_groups_id = a.id
       )
   );
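This relational-division query can be checked on a toy dataset. The sketch below ports the schema to SQLite via Python's sqlite3 (dropping the Postgres-only index statements), with invented rows in which another_groups 100 is linked to every subgroup of group 1 but 200 is not:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE groups (id INTEGER PRIMARY KEY);
CREATE TABLE subgroups (id INTEGER PRIMARY KEY, group_id INTEGER NOT NULL);
CREATE TABLE another_groups (id INTEGER PRIMARY KEY);
CREATE TABLE subgroups_another_groups (
    subgroup_id INTEGER NOT NULL,
    another_groups_id INTEGER NOT NULL,
    PRIMARY KEY (subgroup_id, another_groups_id)
);
INSERT INTO groups VALUES (1);
INSERT INTO subgroups VALUES (10, 1), (11, 1);
INSERT INTO another_groups VALUES (100), (200);
-- 100 is linked to every subgroup of group 1; 200 only to subgroup 10
INSERT INTO subgroups_another_groups VALUES (10, 100), (11, 100), (10, 200);
""")

# The double NOT EXISTS keeps only (group, another_group) pairs
# where no subgroup of the group lacks a link to the another_group.
rows = con.execute("""
SELECT DISTINCT g.id, a.id
FROM another_groups a
JOIN subgroups_another_groups sag ON a.id = sag.another_groups_id
JOIN subgroups s ON sag.subgroup_id = s.id
JOIN groups g ON s.group_id = g.id
WHERE NOT EXISTS
      (SELECT 1 FROM subgroups s1
       WHERE s1.group_id = g.id
       AND NOT EXISTS
           (SELECT 1 FROM subgroups_another_groups sag1
            WHERE sag1.subgroup_id = s1.id
            AND sag1.another_groups_id = a.id))
""").fetchall()
```

Only (1, 100) survives, since another_group 200 misses subgroup 11.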

Optimise many-to-many join

I have three tables: groups and people and groups_people which forms a many-to-many relationship between groups and people.
Schema:
CREATE TABLE groups (
id SERIAL PRIMARY KEY,
name TEXT
);
CREATE TABLE people (
id SERIAL PRIMARY KEY,
name TEXT,
join_date TIMESTAMP
);
CREATE TABLE groups_people (
group_id INT REFERENCES groups(id),
person_id INT REFERENCES people(id)
);
When I want to query for the 10 people who most recently joined the group with id = 1:
WITH person_ids AS (SELECT person_id FROM groups_people WHERE group_id = 1)
SELECT * FROM people WHERE id = ANY(SELECT person_id FROM person_ids)
ORDER BY join_date DESC LIMIT 10;
The query needs to scan all of the joined people and sort them before selecting. That would be slow if the group contains too many people.
Is there any way to work around it?
Schema (re-)design to allow the same person to join multiple groups
Since you mentioned that the relationship between groups and people
is many-to-many, you may want to move join_date to groups_people
(from people), because the same person can join different groups and each
such event has its own join_date.
So I would change the schema to
CREATE TABLE people (
id SERIAL PRIMARY KEY,
name TEXT --, -- change
-- join_date TIMESTAMP -- delete
);
CREATE TABLE groups_people (
group_id INT REFERENCES groups(id),
person_id INT REFERENCES people(id), -- change
join_date TIMESTAMP -- add
);
Query
select
p.id
, p.name
, gp.join_date
from
people as p
, groups_people as gp
where
p.id = gp.person_id
and gp.group_id=1
order by gp.join_date desc
limit 10
Disclaimer: The above query is in MySQL syntax (the question was originally tagged with MySQL)
This seems much easier to write as a simple join with order by and limit:
select p.*
from people p join
groups_people gp
on p.id = gp.person_id
where gp.group_id = 1
order by gp.join_date desc
limit 10; -- or fetch first 10 rows only
Try rewriting using EXISTS
SELECT *
FROM people p
WHERE EXISTS (SELECT 1
FROM groups_people ps
WHERE p.id = ps.person_id and group_id = 1)
ORDER BY join_date DESC
LIMIT 10;
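A quick way to compare these suggestions is a toy run with join_date moved to the junction table, as the redesign proposes. This sketch uses Python's sqlite3 with invented rows, and LIMIT 2 instead of 10 for brevity:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE groups_people (
    group_id INTEGER,
    person_id INTEGER,
    join_date TEXT          -- per-membership join date
);
INSERT INTO people VALUES (1, 'a'), (2, 'b'), (3, 'c');
INSERT INTO groups_people VALUES
    (1, 1, '2020-01-01'),
    (1, 2, '2021-06-01'),
    (1, 3, '2019-03-01'),
    (2, 1, '2022-01-01');   -- same person, different group
""")

# Plain join, order by the membership's join_date, then limit
latest = [r[0] for r in con.execute("""
SELECT p.name
FROM people p
JOIN groups_people gp ON p.id = gp.person_id
WHERE gp.group_id = 1
ORDER BY gp.join_date DESC
LIMIT 2
""")]
```

With an index on (group_id, join_date), the database can walk the index backwards and stop after the limit instead of sorting the whole group.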

SQL cross-reference table self-reference

I am working on a project where I have a table of
all_names(
team_name TEXT,
member_name TEXT,
member_start INT,
member_end INT);
What I have been tasked with is creating a table of
participants(
ID SERIAL PRIMARY KEY,
type TEXT,
name TEXT);
which contains all team and member names as their own entries. Type may be either "team" or "member".
To complement this table of participants I am trying to create a cross-reference table that allows a member to be referenced to a team by ID and vice versa. My table looks like this:
belongs_to(
member_id INT REFERENCES participants(ID),
group_id INT REFERENCES participants(ID),
begin_year INT,
end_year INT,
PRIMARY KEY (member_id, group_id));
I am unsure of how to proceed and populate the table properly.
The select query I have so far is:
SELECT DISTINCT ON (member_name, team_name)
participants.id, member_name, team_name, member_start, member_end
FROM all_names
INNER JOIN participants ON all_names.member_name = participants.name;
but I am unsure of how to proceed. What is the proper way to populate the cross reference table?
Probably the easiest solution is to use a few statements. Wrap this in a transaction to make sure you don't get concurrency issues:
BEGIN;
INSERT INTO participants (type, name)
SELECT DISTINCT 'team', team_name
FROM all_names
UNION
SELECT DISTINCT 'member', member_name
FROM all_names;
INSERT INTO belongs_to
SELECT m.id, g.id, a.member_start, a.member_end
FROM all_names a
JOIN participants m ON m.name = a.member_name AND m.type = 'member'
JOIN participants g ON g.name = a.team_name AND g.type = 'team';
COMMIT;
Members that are part of multiple teams get all of their memberships recorded.
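An end-to-end sketch of the two-statement approach, using Python's sqlite3 with made-up all_names rows (SERIAL becomes INTEGER PRIMARY KEY AUTOINCREMENT in SQLite). The type filters on the joins are a safeguard against a team and a member sharing a name:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE all_names (
    team_name TEXT, member_name TEXT,
    member_start INTEGER, member_end INTEGER
);
CREATE TABLE participants (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    type TEXT, name TEXT
);
CREATE TABLE belongs_to (
    member_id INTEGER, group_id INTEGER,
    begin_year INTEGER, end_year INTEGER,
    PRIMARY KEY (member_id, group_id)
);
INSERT INTO all_names VALUES
    ('Team X', 'Alice', 1999, 2003),
    ('Team X', 'Bob',   2001, 2005),
    ('Team Y', 'Alice', 2004, 2007);
""")

# Populate participants, then resolve names to ids for belongs_to
con.executescript("""
BEGIN;
INSERT INTO participants (type, name)
SELECT DISTINCT 'team', team_name FROM all_names
UNION
SELECT DISTINCT 'member', member_name FROM all_names;
INSERT INTO belongs_to
SELECT m.id, g.id, a.member_start, a.member_end
FROM all_names a
JOIN participants m ON m.name = a.member_name AND m.type = 'member'
JOIN participants g ON g.name = a.team_name AND g.type = 'team';
COMMIT;
""")
n = con.execute("SELECT COUNT(*) FROM belongs_to").fetchone()[0]
```

Each of the three all_names rows resolves to exactly one (member, team) id pair, so belongs_to ends up with three memberships, including both of Alice's.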

How to massive update?

I have three tables:
group:
id - primary key
name - varchar
profile:
id - primary key
name - varchar
surname - varchar
[...etc...]
profile_group:
profile_id - integer, foreign key to table profile
group_id - integer, foreign key to table group
Profiles may be in many groups. I have a group named "Users" with id=1 and I want to assign all profiles to this group, but only if there is no such entry in profile_group already.
How to do it?
If I understood you correctly, you want to add entries like (profile_id, 1) into the profile_group table for all profiles that do not have such an entry yet. If so, try this:
INSERT INTO profile_group(profile_id, group_id)
SELECT p.id, 1 FROM profile p
LEFT JOIN profile_group pg ON p.id = pg.profile_id AND pg.group_id = 1
WHERE pg.profile_id IS NULL;
What you want to do is use a left join to the profile_group table and then exclude any matching records (this is done in the WHERE clause of the SQL statement below).
This is faster than using NOT IN (SELECT xxx), since the query planner seems to handle it better (in my experience).
insert into profile_group (profile_id, group_id)
select p.id, 1
from profile p
left join profile_group pg on p.id = pg.profile_id
and pg.group_id = 1
where pg.profile_id is null;
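Both answers implement the same anti-join pattern. The following sketch (Python's sqlite3, invented rows) shows that a profile already in a different group still gets added to group 1, which is why the pg.group_id = 1 condition must live in the ON clause rather than the WHERE:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE profile (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE profile_group (profile_id INTEGER, group_id INTEGER);
INSERT INTO profile VALUES (1, 'a'), (2, 'b'), (3, 'c');
-- profile 1 is already in group 1; profile 2 is in another group only
INSERT INTO profile_group VALUES (1, 1), (2, 5);
""")

# Anti-join: keep profiles with no matching group-1 row, insert those
con.execute("""
INSERT INTO profile_group (profile_id, group_id)
SELECT p.id, 1
FROM profile p
LEFT JOIN profile_group pg
       ON p.id = pg.profile_id AND pg.group_id = 1
WHERE pg.profile_id IS NULL
""")
in_group_1 = sorted(r[0] for r in con.execute(
    "SELECT profile_id FROM profile_group WHERE group_id = 1"))
```

Profiles 2 and 3 are inserted; profile 1 is skipped because it already had the entry.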

Using a complicated double join to get a count of child objects

Note that I'm using postgresql
I have an organizations table, a users table, a jobs table, and a documents table. I want to get a list of the organizations ordered by the number of total documents they have access to.
organizations
------------
id (pk)
company_name
users
------------
id (pk)
organization_id
jobs
------------
id (pk)
client_id (id of an organization)
server_id (id of an organization)
creator_id (id of a user)
documents
------------
id (pk)
job_id
Result Desired
organizations.id | organizations.company_name | document_count
85 | Big Corporation | 84
905 | Some other folks | 65
403 | ACME, Inc | 14
As you can see, an organization can be connected to a document through 3 different paths:
organizations.id => jobs.client_id => documents.job_id
organizations.id => jobs.server_id => documents.job_id
organizations.id => users.organization_id => jobs.creator_id => documents.job_id
But I want a query that will get the count of all the documents each company has access to...
I tried a couple of things... like this:
SELECT COUNT(documents.id) document_count, organizations.id, organizations.company_name
FROM organizations
INNER JOIN users ON organizations.id = users.organization_id
INNER JOIN jobs ON (
jobs.client_id = organizations.id OR
jobs.server_id = organizations.id OR
jobs.creator_id = users.id
)
INNER JOIN documents ON documents.job_id = jobs.id
GROUP BY organizations.id, organizations.company_name
ORDER BY document_count DESC
LIMIT 10
The query takes a while to run, but it's not horrible since I'm doing it for a one-time report. The results, however, cannot possibly be correct.
The first listed organization has a reported count of 129,834 documents, but that's impossible since there are only 32,820 records in the documents table. I feel like it must be counting huge numbers of duplicates (due to an error in one of my joins?), but I'm not sure where I've gone wrong.
The order appears correct, since the highest-volume user of the system is clearly at the top of the list, but the values are inflated somehow.
The problem is that if jobs.client_id = organizations.id or jobs.server_id = organizations.id, then there's nothing to filter your INNER JOIN users (aside from its ON clause), so you'll get a separate record for every single user that belongs to that organization. In other words, for each organization, you're adding three values:
its total number of users times the total number of documents belonging to jobs for which it's a client
its total number of users times the total number of documents belonging to jobs for which it's a server
the total number of documents belonging to jobs for which one of its users is the creator
One way to fix this is to remove the INNER JOIN users line, and change this:
jobs.creator_id = users.id
to this:
jobs.creator_id IN (SELECT id FROM users WHERE organization_id = organizations.id)
. . . but that might perform terribly. You might need to try a few things before finding a query that performs acceptably.
Simplify your thinking: you have three paths to a document id, so write three queries, UNION them, and count the result.
It's probably too late to redesign this, but you really should.
The jobs table should not have its own id field and key.
The jobs table is horribly designed because every reference to a disk page from the id index is gonna have to go read 1-100 different pages from disk out of the data file just to get the three other id fields that you always want to use (which is the clue that a job should not have its own id).
You can make a quick fix by making jobs use an index that is clustered or clustering (depending on the DB system) on the job id field. An alternative would be to mark the other three id fields as "includes" on the index so the page reads to the data file will 100% go away. Either of these may be enough to make this "just work".
What I would encourage you to do though is drop the id field and key on jobs and instead make a "natural key" that has the three other id fields in it and use that key on the documents table as well.
I would also denormalize (repeat) the organization of the creator on the jobs table and the documents table. A user isn't going to move to another org and keep the same access, so you should never have to run a sweep to update these in sync, and even if you did it would be easy.
With these changes you can just do a select on the documents table directly, skipping the random page reads needed from the other tables. The GROUP BY across the three different id fields would be a bit tricky. I might give this a try as it is interesting.
In the short term though, try clustering or includes on the jobs table to solve the performance issue and I will check the join logic tonight.
None of the answers quite got me there except for the one suggesting a UNION. This is what I came up with:
SELECT COUNT(docs.doc_id) document_count, docs.org_id, docs.org_name
FROM (
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs ON documents.job_id = jobs.id
INNER JOIN organizations ON jobs.client_id = organizations.id
UNION
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs ON documents.job_id = jobs.id
INNER JOIN organizations ON jobs.server_id = organizations.id
UNION
SELECT documents.id doc_id, organizations.id org_id, organizations.company_name org_name
FROM documents
INNER JOIN jobs on documents.job_id = jobs.id
INNER JOIN users ON jobs.creator_id = users.id
INNER JOIN organizations ON users.organization_id = organizations.id
) docs
GROUP BY org_id, org_name
ORDER BY document_count DESC
The performance was much better than with any of the suggested subquery approaches, and it appears to have given me a reasonable answer.
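The reason the UNION fixes the inflated counts is that it deduplicates (document, organization) pairs before the outer COUNT, so a document reachable through both the client path and the creator path is counted once. A condensed sketch with invented rows, using Python's sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE organizations (id INTEGER PRIMARY KEY, company_name TEXT);
CREATE TABLE users (id INTEGER PRIMARY KEY, organization_id INTEGER);
CREATE TABLE jobs (id INTEGER PRIMARY KEY,
                   client_id INTEGER, server_id INTEGER, creator_id INTEGER);
CREATE TABLE documents (id INTEGER PRIMARY KEY, job_id INTEGER);
INSERT INTO organizations VALUES (85, 'Big Corporation'), (90, 'Acme');
INSERT INTO users VALUES (1, 85), (2, 85);
INSERT INTO jobs VALUES (1, 85, 90, 1);  -- client 85, server 90, creator user 1
INSERT INTO documents VALUES (1, 1);
""")

# Org 85 reaches document 1 via two paths (client and creator),
# but the UNION collapses them into one (doc, org) pair.
rows = con.execute("""
SELECT docs.org_id, COUNT(docs.doc_id)
FROM (
    SELECT d.id doc_id, o.id org_id FROM documents d
    JOIN jobs j ON d.job_id = j.id
    JOIN organizations o ON j.client_id = o.id
    UNION
    SELECT d.id, o.id FROM documents d
    JOIN jobs j ON d.job_id = j.id
    JOIN organizations o ON j.server_id = o.id
    UNION
    SELECT d.id, o.id FROM documents d
    JOIN jobs j ON d.job_id = j.id
    JOIN users u ON j.creator_id = u.id
    JOIN organizations o ON u.organization_id = o.id
) docs
GROUP BY docs.org_id
ORDER BY docs.org_id
""").fetchall()
```

Each organization counts the single document exactly once; the original multi-path INNER JOIN would have counted it once per matching user row.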
But I want a query that will get the count of all the documents you have access to...
That's where your query starts:
SELECT ... FROM documents
...
Since the only clue to the documents table is in jobs, you'll need the jobs table as well:
SELECT ...
FROM documents dc
JOIN jobs jo ON dc.job_id = jo.id
...
Now, it is time for restrictions. Which documents do you actually want? There are three cases: either the client_id matches the company, or the server_id matches the company, or the creator_id matches a user that happens to work for the company:
SELECT ...
FROM documents dc
JOIN jobs jo ON dc.job_id = jo.id
WHERE jo.client_id = $THE_COMPANY
OR jo.server_id = $THE_COMPANY
OR EXISTS (
SELECT *
FROM users uu
JOIN organizations oo ON uu.organization_id = oo.id
WHERE uu.id = jo.creator_id
AND oo.id = $THE_COMPANY
)
;
But there might be a problem here: if more than one matching jobs record pointed at the same document, you would count it twice. You can either add a DISTINCT to the outer query, or move the jobs table down into a subquery:
SELECT ...
FROM documents dc
WHERE EXISTS (
SELECT *
FROM jobs jo
WHERE jo.id = dc.job_id
AND ( jo.client_id = $THE_COMPANY
OR jo.server_id = $THE_COMPANY
OR EXISTS (
SELECT *
FROM users uu
JOIN organizations oo ON uu.organization_id = oo.id
WHERE uu.id = jo.creator_id
AND oo.id = $THE_COMPANY
)
)
)
;
As you can see, the three ways of selecting a document end up in a WHERE (a OR b OR c) clause.
UPDATE: (since the OP did not give us the table definitions in a usable form, I had to reconstruct them)
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
--
-- create the missing tables
--
CREATE TABLE organizations
( id SERIAL NOT NULL PRIMARY KEY
, company_name varchar
);
CREATE TABLE users
( id SERIAL NOT NULL PRIMARY KEY
, organization_id INTEGER NOT NULL REFERENCES organizations(id)
);
CREATE TABLE jobs
( id SERIAL NOT NULL PRIMARY KEY
, client_id INTEGER NOT NULL REFERENCES organizations(id)
, server_id INTEGER NOT NULL REFERENCES organizations(id)
, creator_id INTEGER NOT NULL REFERENCES users(id)
);
CREATE TABLE documents
( id SERIAL NOT NULL PRIMARY KEY
, job_id INTEGER NOT NULL REFERENCES jobs(id)
);
--
-- Populate
--
INSERT INTO organizations(id, company_name) VALUES
(85,'Big Corporation') ,(905,'Some other folks') ,(403,'ACME, Inc')
;
select setval('organizations_id_seq', 905);
INSERT INTO users(organization_id)
SELECT o.id
FROM generate_series(1,1000)
JOIN organizations o ON random() < 0.3
;
INSERT INTO jobs (client_id,server_id,creator_id)
SELECT o1.id, o2.id, u.id
FROM users u
JOIN organizations o1 ON 1=1
JOIN organizations o2 ON o2.id <> o1.id
;
INSERT INTO documents(job_id)
SELECT id FROM jobs j
;
DELETE FROM documents
WHERE random() < 0.5
;
--
-- And the query ...
--
EXPLAIN ANALYZE
SELECT o.id AS org
, count(*) AS the_docs
FROM organizations o
JOIN documents d ON 1=1 -- start with a Cartesian product
WHERE EXISTS (
SELECT *
FROM jobs j
WHERE d.job_id = j.id
AND (j.client_id = o.id OR j.server_id = o.id )
)
OR EXISTS (
SELECT *
FROM jobs j
JOIN users u ON j.creator_id = u.id
WHERE u.organization_id = o.id
AND d.job_id = j.id
)
GROUP BY o.id
;