Efficient approach for fetching many-to-many relations - sql

I have the following tables:
team: identifier, name
fan: identifier, name
team_fan: team_identifier, fan_identifier
In other words, there is a many-to-many relation between team and fan.
I want to fetch all teams for which a certain condition is met; and for each selected team, I want to fetch all its fans. So, in my application, I want to have the following data-structures:
Team A
Fan F1
Fan F2
Team B
Fan F1
Fan F3
Team C
Fan F2
Fan F3
Fan F4
I already came up with the following solutions:
[0] default, typical approach
The default, typical approach is the inner join:
select team.name, fan.name
from team
inner join team_fan
on team.identifier = team_fan.team_identifier
inner join fan
on team_fan.fan_identifier = fan.identifier
where ... (team conditions)
This provides all the required information to construct the data-structures as demonstrated above.
There are a lot of teams, and fans can belong to multiple teams. The query above might not be a good idea, because team and fan names are repeated across the result rows, and all these duplicates need to be transmitted over the wire.
In the alternatives below, I am doing the JOIN in the application. The alternatives below might be slower, but I don't know yet. I want to compare and learn from this.
[1] very naive approach
First, we select all teams:
select identifier, name from team where ...
Then, for each team with identifier X, we select its fans:
select name
from fan
where exists(select 1 from team_fan where team_fan.fan_identifier = fan.identifier and team_fan.team_identifier = X)
This is a bad solution, because the number of required queries is 1 + number of teams. Also, a fan belonging to multiple teams is fetched multiple times. We can do better.
[2] top-down approach
First, we select all teams. While doing this, we also collect in an array all fans belonging to the team:
select name, array(select identifier
from fan
where exists(select 1 from team_fan where fan.identifier = team_fan.fan_identifier and team.identifier = team_fan.team_identifier)) as fans
from team
where ...
Then, in our application, we construct the union of all fan identifiers. Given this set of fan identifiers, we can select all fans:
select identifier, name from fan where identifier in(...)
Now, I have enough information to replicate the JOIN in my application and construct the data-structures as demonstrated above.
This seems like a better solution. The number of queries is always 2. Also, each team and each fan is only fetched once.
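A minimal sketch of this two-query flow, written in Python against an in-memory SQLite database purely for illustration. SQLite has no array(...) constructor, so group_concat stands in for it here; the sample rows mirror the Team A/B/C example above, and all variable names are hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team (identifier INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fan (identifier INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE team_fan (team_identifier INTEGER, fan_identifier INTEGER);
    INSERT INTO team VALUES (1, 'Team A'), (2, 'Team B'), (3, 'Team C');
    INSERT INTO fan VALUES (1, 'F1'), (2, 'F2'), (3, 'F3'), (4, 'F4');
    INSERT INTO team_fan VALUES (1, 1), (1, 2), (2, 1), (2, 3), (3, 2), (3, 3), (3, 4);
""")

# Query 1: one row per team, with its fan identifiers packed into one column.
team_rows = conn.execute("""
    SELECT team.name,
           (SELECT group_concat(team_fan.fan_identifier)
            FROM team_fan
            WHERE team_fan.team_identifier = team.identifier) AS fans
    FROM team
    ORDER BY team.identifier
""").fetchall()

# Unpack the comma-separated identifiers; a team without fans yields NULL (None).
teams = [(name, sorted(int(i) for i in fans.split(",")) if fans else [])
         for name, fans in team_rows]

# Union of all fan identifiers referenced by the selected teams.
fan_ids = sorted({fid for _, ids in teams for fid in ids})

# Query 2: fetch every referenced fan exactly once.
placeholders = ",".join("?" * len(fan_ids))
fan_names = dict(conn.execute(
    f"SELECT identifier, name FROM fan WHERE identifier IN ({placeholders})",
    fan_ids))

# Replicate the join in the application.
fans_by_team = {name: [fan_names[fid] for fid in ids] for name, ids in teams}
print(fans_by_team)
```

Each fan name crosses the wire once, at the cost of a second round trip and an IN list that grows with the number of distinct fans.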
[3] bottom-up approach
I called the previous solution top-down because we are adding an array of children (fan) to the parent (team). In this approach, we do the inverse: we are adding an array of parents (team) to the child (fan).
So, first, let's just select all teams:
select identifier, name from team where ...
Next, in our application, we construct the union of all team identifiers. Given this set of team identifiers, we can select all fans:
select name, array(select team_fan.team_identifier from team_fan where fan_identifier = fan.identifier and team_identifier in(...))
from fan
where exists(select 1 from team_fan where fan_identifier = fan.identifier and team_identifier in(...));
Now, I have enough information to replicate the JOIN in my application and construct the data-structures as demonstrated above.
This also seems like a valid solution. Here too, the number of queries is always 2, and each team and each fan is fetched only once.
My question
So, back to my question: I want to fetch all teams for which a certain condition is met; and for each selected team, I want to fetch all its fans.
Currently, I am unsure if approach 2 is better than approach 3 (or vice versa), or even, if there are better approaches for this. Any insights are welcome.

Do a simple join
Select
t.identifier team_identifier
,t.name team_name
,f.identifier fan_identifier
,f.name fan_name
From team t
inner join team_fan tf
on t.identifier=tf.team_identifier
/* and --(team condition can be put here) */
inner join fan f on tf.fan_identifier=f.identifier
/*where ... --(or team condition can be put here)*/
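For illustration, here is one way the flat rows from such a join could be folded back into the nested team/fans structure on the application side. This is only a sketch in Python with an in-memory SQLite database; the sample data and the `teams` dictionary layout are assumptions, not part of the question.

```python
import sqlite3

# In-memory SQLite stands in for the real database (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team (identifier INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fan (identifier INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE team_fan (team_identifier INTEGER, fan_identifier INTEGER);
    INSERT INTO team VALUES (1, 'Team A'), (2, 'Team B'), (3, 'Team C');
    INSERT INTO fan VALUES (1, 'F1'), (2, 'F2'), (3, 'F3'), (4, 'F4');
    INSERT INTO team_fan VALUES (1, 1), (1, 2), (2, 1), (2, 3), (3, 2), (3, 3), (3, 4);
""")

# The simple three-table join: one row per (team, fan) pair.
rows = conn.execute("""
    SELECT t.identifier, t.name, f.identifier, f.name
    FROM team t
    INNER JOIN team_fan tf ON t.identifier = tf.team_identifier
    INNER JOIN fan f ON tf.fan_identifier = f.identifier
    ORDER BY t.identifier, f.identifier
""").fetchall()

# Fold the flat rows into one entry per team, keyed by team identifier.
teams = {}
for team_id, team_name, fan_id, fan_name in rows:
    team = teams.setdefault(team_id, {"name": team_name, "fans": []})
    team["fans"].append(fan_name)

print(teams)
```

The team name is repeated on every row of the result, which is the duplication the question worries about; whether that matters in practice depends on row width and result size.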

I recommend modifying option 2 and removing the fan table altogether.
Assuming there are fewer teams than fans, this approach will return fewer rows to your application and will likely be more efficient, as the array function will not need to execute on as many rows as the alternative.
SELECT
name,
array(
SELECT DISTINCT
fan_identifier
FROM team_fan
WHERE team.identifier = team_fan.team_identifier
) as fans
FROM team
WHERE ...

Related

Efficient approach to get two-dimensional data using

For the sake of example, let's say I have the following models:
teams
each team has an arbitrary number of fans
In SQL, this means you end up with the following tables:
team: identifier, name
fan: identifier, name
team_fan: team_identifier, fan_identifier
I am looking for an approach to retrieve:
all teams, and
for each team, the first 5 fans whose names start with an 'A'.
What is an efficient approach to do this?
In my current naive approach, I do <# teams> + 1 queries, which is troublesome:
First: SELECT * FROM team
Then, for each team with identifier X:
SELECT *
FROM fan
INNER JOIN team_fan
ON fan.identifier = team_fan.fan_identifier AND team_fan.team_identifier = X
WHERE fan.name LIKE 'A%'
ORDER BY fan.name LIMIT 5
There should be a better way to do this.
I could first retrieve all teams, as I do now, and then do something like:
SELECT *
FROM fan
WHERE fan.name LIKE 'A%'
AND fan.identifier IN (
SELECT fan_identifier
FROM team_fan
WHERE team_identifier IN (<all team identifiers from first query>))
ORDER BY fan.name
However, this approach ignores the requirement that I need, for each team, the first 5 fans whose names start with an 'A'. Just adding LIMIT 5 to the query above is not correct, because it would limit the whole result rather than each team's fans.
Also, with this approach, if I have a large number of teams, I am sending the corresponding team identifiers back to the database in the second query (for the IN (<all team identifiers from first query>)), which might kill performance?
I am developing against PostgreSQL, Java, Spring and plain JDBC.
You need a three-table join:
SELECT team.*, fan.*
FROM team
JOIN team_fan
ON team.identifier = team_fan.team_identifier
JOIN fan
ON fan.identifier = team_fan.fan_identifier
Now, to filter, you need to do this:
with cte as (
SELECT team.identifier AS team_identifier, team.name AS team_name,
fan.identifier AS fan_identifier, fan.name AS fan_name,
row_number() over (partition by team.identifier
order by fan.name) as rn
FROM team
JOIN team_fan
ON team.identifier = team_fan.team_identifier
JOIN fan
ON fan.identifier = team_fan.fan_identifier
WHERE fan.name LIKE 'A%'
)
SELECT *
FROM cte
WHERE rn <= 5
Most RDBMSes give you a way to number rows within a grouping, according to some ordering.
Postgres is no exception: it has the ROW_NUMBER() window function.
What you need is to partition your row numbers by team, order them alphabetically by fan name, and restrict the query to row numbers < 6.
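The partition-and-number idea can be tried out as follows. This is a sketch that uses Python with an in-memory SQLite database instead of PostgreSQL (window functions need SQLite 3.25 or later); the team and fan rows are invented sample data.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL here (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE team (identifier INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE fan (identifier INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE team_fan (team_identifier INTEGER, fan_identifier INTEGER);
    INSERT INTO team VALUES (1, 'Reds'), (2, 'Blues');
    INSERT INTO fan VALUES (1, 'Aaron'), (2, 'Abby'), (3, 'Adam'), (4, 'Alan'),
                           (5, 'Alex'), (6, 'Amy'), (7, 'Bob');
    INSERT INTO team_fan VALUES (1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), (1, 7),
                                (2, 2), (2, 7);
""")

# Number each team's matching fans alphabetically, then keep the first five per team.
rows = conn.execute("""
    WITH ranked AS (
        SELECT team.name AS team_name, fan.name AS fan_name,
               ROW_NUMBER() OVER (PARTITION BY team.identifier
                                  ORDER BY fan.name) AS rn
        FROM team
        JOIN team_fan ON team.identifier = team_fan.team_identifier
        JOIN fan ON fan.identifier = team_fan.fan_identifier
        WHERE fan.name LIKE 'A%'
    )
    SELECT team_name, fan_name
    FROM ranked
    WHERE rn <= 5
    ORDER BY team_name, rn
""").fetchall()

for team_name, fan_name in rows:
    print(team_name, fan_name)
```

Note that with inner joins a team that has no fan starting with 'A' does not appear in the result at all; if every team must be listed, a LEFT JOIN from team would be one option.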

How can I improve a mostly "degenerate" inner join?

This is Oracle 11g.
I have two tables whose relevant columns are shown below (I have to take the tables as given -- I cannot change the column datatypes):
CREATE TABLE USERS
(
UUID VARCHAR2(36),
DATA VARCHAR2(128),
ENABLED NUMBER(1)
);
CREATE TABLE FEATURES
(
USER_UUID VARCHAR2(36),
FEATURE_TYPE NUMBER(4)
);
The tables express the concept that a user can be assigned a number of features. The (USER_UUID, FEATURE_TYPE) combination is unique.
I have two very similar queries I am interested in. The first one, expressed in English, is "return the UUIDs of enabled users who are assigned feature X". The second one is "return the UUIDs and DATA of enabled users who are assigned feature X". The USERS table has about 5,000 records and the FEATURES table has about 40,000 records.
I originally wrote the first query naively as:
SELECT u.UUID FROM USERS u
JOIN FEATURES f ON f.USER_UUID=u.UUID
WHERE f.FEATURE_TYPE=X and u.ENABLED=1
and that had lousy performance. As an experiment I tried to see what would happen if I didn't care about whether or not a user was enabled and that inspired me to try:
SELECT USER_UUID FROM FEATURES WHERE FEATURE_TYPE=X
and that ran very quickly. That in turn inspired me to try
(SELECT USER_UUID FROM FEATURES WHERE FEATURE_TYPE=X)
INTERSECT
(SELECT UUID FROM USERS WHERE ENABLED=1)
That didn't run as quickly as the second query, but ran much more quickly than the first.
After more thinking I realized that in the case at hand every user or almost every user was assigned at least one feature, which meant that the join condition was always or almost always true, which meant that the inner join completely or mostly degenerated into a cross join. And since 5,000 x 40,000 = 200,000,000 that is not a good thing. Obviously the INTERSECT version would be dealing with many fewer rows which presumably is why it is significantly faster.
Question: Is INTERSECT really the way to go in this case, or should I be looking at some other type of join?
I wrote the query for the one that also needs to return DATA similarly to the very first one:
SELECT u.UUID, u.DATA FROM USERS u
JOIN FEATURES f ON f.USER_UUID=u.UUID
WHERE f.FEATURE_TYPE=X and u.ENABLED=1
But it would seem I can't do the INTERSECT trick here because there's no column in FEATURES that matches the DATA column.
Question: How can I rewrite this to avoid the degenerate join problem and perform like the query that doesn't return DATA?
I would intuitively use the EXISTS clause:
SELECT u.UUID
FROM USERS u
WHERE u.ENABLED=1
AND EXISTS (SELECT 1 FROM FEATURES f where f.FEATURE_TYPE=X and f.USER_UUID=u.UUID)
or similarly:
SELECT u.UUID, u.DATA
FROM USERS u
WHERE u.ENABLED=1
AND EXISTS (SELECT 1 FROM FEATURES f where f.FEATURE_TYPE=X and f.USER_UUID=u.UUID)
This way you can select every field from USERS since there is no need for INTERSECT anymore (which was a rather good choice for the 1st case, IMHO).
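To check that the three formulations agree, here is a small sketch using Python and an in-memory SQLite database rather than Oracle. The sample rows and the feature number 7 are invented, and the example says nothing about relative performance on Oracle 11g.

```python
import sqlite3

# In-memory SQLite stands in for Oracle (an assumption of this sketch).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE users (uuid TEXT, data TEXT, enabled INTEGER);
    CREATE TABLE features (user_uuid TEXT, feature_type INTEGER,
                           UNIQUE (user_uuid, feature_type));
    INSERT INTO users VALUES ('u1', 'd1', 1), ('u2', 'd2', 0),
                             ('u3', 'd3', 1), ('u4', 'd4', 1);
    INSERT INTO features VALUES ('u1', 7), ('u1', 8), ('u2', 7), ('u3', 8), ('u4', 7);
""")

X = 7  # hypothetical feature type to look for

# Original inner join.
join_rows = conn.execute("""
    SELECT u.uuid, u.data FROM users u
    JOIN features f ON f.user_uuid = u.uuid
    WHERE f.feature_type = ? AND u.enabled = 1
    ORDER BY u.uuid
""", (X,)).fetchall()

# EXISTS version: only columns of users are selected, so DATA is available.
exists_rows = conn.execute("""
    SELECT u.uuid, u.data FROM users u
    WHERE u.enabled = 1
      AND EXISTS (SELECT 1 FROM features f
                  WHERE f.feature_type = ? AND f.user_uuid = u.uuid)
    ORDER BY u.uuid
""", (X,)).fetchall()

# INTERSECT version: limited to the UUID column.
intersect_rows = sorted(conn.execute("""
    SELECT user_uuid FROM features WHERE feature_type = ?
    INTERSECT
    SELECT uuid FROM users WHERE enabled = 1
""", (X,)).fetchall())

print(join_rows, exists_rows, intersect_rows)
```

Because (USER_UUID, FEATURE_TYPE) is unique, the join cannot produce duplicate users for a single feature, so all three return the same set of UUIDs.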

How do I choose the best SQL query if there are different ways of accomplishing the same task?

I'm learning SQL (using SQLite 3 and its sqlite3 command-line tool) and I've noticed that I can do some things in several ways, and sometimes it is not clear which one is better. Here are three queries which do the same thing, one executed through intersect, another through inner join and distinct, the last one similar to the second one but it incorporates filtering through where. (The first one was written by the author of the book I'm reading, and the others I wrote myself.)
The question is, which of these queries is better and why? And, more generally, how can I know when one query is better than another? Are there some guidelines I missed or perhaps I should learn SQLite internals despite the declarative nature of SQL?
(In the following example, there are tables that describe food names that are mentioned in some TV series. Foods_episodes is a many-to-many linking table, while the others describe food names and episode names together with the season number. Note that the all-time top ten foods (based on the count of their appearances in all series) are being looked for, not just the top foods in seasons 3..5.)
-- task
-- find the all-time top ten foods that appear in seasons 3 through 5
-- schema
-- CREATE TABLE episodes (
-- id integer primary key,
-- season int,
-- name text );
-- CREATE TABLE foods(
-- id integer primary key,
-- name text );
-- CREATE TABLE foods_episodes(
-- food_id integer,
-- episode_id integer );
select f.* from foods f
inner join
(select food_id, count(food_id) as count
from foods_episodes
group by food_id
order by count(food_id) desc limit 10) top_foods
on f.id=top_foods.food_id
intersect
select f.* from foods f
inner join foods_episodes fe on f.id = fe.food_id
inner join episodes e on fe.episode_id = e.id
where
e.season between 3 and 5
order by
f.name;
select
distinct f.*
from
foods_episodes as fe
inner join episodes as e on e.id = fe.episode_id
inner join foods as f on fe.food_id = f.id
inner join (select food_id from foods_episodes
group by food_id order by count(*) desc limit 10) as lol
on lol.food_id = fe.food_id
where
e.season between 3 and 5
order by
f.name;
select
distinct f.*
from
foods_episodes as fe
inner join episodes as e on e.id = fe.episode_id
inner join foods as f on fe.food_id = f.id
where
fe.food_id in (select food_id from foods_episodes
group by food_id order by count(*) desc limit 10)
and e.season between 3 and 5
order by
f.name;
-- output (same for these three):
-- id name
-- ---------- ----------
-- 4 Bear Claws
-- 146 Decaf Capp
-- 153 Hennigen's
-- 55 Kasha
-- 94 Ketchup
-- 164 Naya Water
-- 317 Pizza
-- CPU Time: user 0.000000 sys 0.000000
Similar to MySQL, SQLite has an EXPLAIN command. Prepend your select with EXPLAIN QUERY PLAN and it will return a high-level description of how the query will be executed, including which tables are scanned and which indexes are used. (Plain EXPLAIN returns the low-level virtual-machine program instead.)
http://www.sqlite.org/lang_explain.html
By running EXPLAIN on various selects you can determine which queries (and sub-queries) are more efficient than others.
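As a sketch of what that looks like in practice, the snippet below runs EXPLAIN QUERY PLAN through Python's sqlite3 module on the schema from the question, using a simplified version of the third query. The tables are left empty, and the exact wording of the plan lines varies between SQLite versions, so treat the printed text as indicative only.

```python
import sqlite3

# Empty in-memory database with the schema from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE episodes (id integer primary key, season int, name text);
    CREATE TABLE foods (id integer primary key, name text);
    CREATE TABLE foods_episodes (food_id integer, episode_id integer);
""")

query = """
    SELECT DISTINCT f.*
    FROM foods_episodes AS fe
    INNER JOIN episodes AS e ON e.id = fe.episode_id
    INNER JOIN foods AS f ON fe.food_id = f.id
    WHERE e.season BETWEEN 3 AND 5
    ORDER BY f.name
"""

# Each plan row ends with a human-readable description of one step.
plan = conn.execute("EXPLAIN QUERY PLAN " + query).fetchall()
details = [row[-1] for row in plan]
for line in details:
    print(line)
```

Lines starting with SCAN indicate a full pass over a table, while SEARCH lines indicate a lookup through a primary key or index; comparing these across query variants is usually more informative than timing tiny data sets.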
And here is a general overview of SQLite's query planner and optimization: http://sqlite.org/optoverview.html
SQLite 3 also supports a callback function to trace queries. You have to implement it though: http://www.sqlite.org/c3ref/profile.html
Generally, there is more than one way to solve a problem. If you are getting correct answers, the only other question is whether the process/script/statement needs to be improved, or if it works well now.
In SQL generally, there may be a "best" way, but it's usually not the goal to find a canonical best way to do something; you want a way that efficiently balances your program's needs against your time. You can spend months optimizing a process, but if the process is used only weekly, and it only takes 5 minutes now, reducing it to 4 minutes isn't much help.
It's weird to transition from a context where there are correct answers (like school) to a context where the goal is to get something done well, and where "works well enough" trumps perfect because there are time constraints. It's something that took me a while to appreciate, but I'm not sure there is a better answer. Hope the perspective helps a bit!

Modelling database for a small soccer league

The database is quite simple. Below there is a part of a schema relevant to this question
ROUND (round_id, round_number)
TEAM (team_id, team_name)
MATCH (match_id, match_date, round_id)
OUTCOME (team_id, match_id, score)
I have a problem with the query to retrieve data for all matches played. The simple query below of course gives two rows for every match played.
select *
from round r
inner join match m on m.round_id = r.round_id
inner join outcome o on o.match_id = m.match_id
inner join team t on t.team_id = o.team_id
How should I write a query to have the match data in one row?
Or maybe should I redesign the database - drop the OUTCOME table and modify the MATCH table to look like this:
MATCH (match_id, match_date, team_away, team_home, score_away, score_home)?
You can almost generate the suggested change from the original tables using a self join on outcome table:
select o1.team_id team_id_1,
o2.team_id team_id_2,
o1.score score_1,
o2.score score_2,
o1.match_id match_id
from outcome o1
inner join outcome o2 on o1.match_id = o2.match_id and o1.team_id < o2.team_id
Of course, the information for home and away are not possible to generate, so your suggested alternative approach might be better after all. Also, take note of the condition o1.team_id < o2.team_id, which gets rid of the redundant symmetric match data (actually it gets rid of the same outcome row being joined with itself as well, which can be seen as the more important aspect).
In any case, using this select as part of your join, you can generate one row per match.
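A quick way to see the self join produce one row per match is the following sketch, in Python with an in-memory SQLite database. Only the OUTCOME table is needed; the two matches and their scores are invented sample data.

```python
import sqlite3

# In-memory SQLite with just the OUTCOME table from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE outcome (team_id INTEGER, match_id INTEGER, score INTEGER);
    -- match 1: team 1 beat team 2 by 2:1; match 2: team 3 beat team 1 by 3:0
    INSERT INTO outcome VALUES (1, 1, 2), (2, 1, 1), (3, 2, 3), (1, 2, 0);
""")

# Pair the two outcome rows of each match; the "<" keeps one ordering per pair
# and prevents a row from being joined with itself.
matches = conn.execute("""
    SELECT o1.team_id AS team_id_1, o2.team_id AS team_id_2,
           o1.score AS score_1, o2.score AS score_2, o1.match_id
    FROM outcome o1
    INNER JOIN outcome o2
        ON o1.match_id = o2.match_id AND o1.team_id < o2.team_id
    ORDER BY o1.match_id
""").fetchall()

for row in matches:
    print(row)
```

The team with the lower identifier always ends up in the first position, which is why this layout cannot tell home from away.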
You fetch 2 rows for every match played, but team_id and team_name are different in each:
- one for the home team
- one for the away team
So your query is fine.
Using the match table as you describe captures the logic of a game simply and naturally, and additionally shows home and away teams, which your initial model does not.
You might want to add the round id as a foreign key to the round table, and perhaps a flag to indicate an abandoned match.
Drop outcome. It shouldn't be a separate table, because you have exactly one outcome per match.
You may want to consider how to handle matches that are cancelled; perhaps the scores are null?

How to make this relation in PostgreSQL?

Hello.
As shown in the ER model, I want to create a relation between "Busses" and "Chauffeurs", where every entity in "Chauffeurs" must have at least one relation in "Certified", and every entity in "Busses" must have at least one relation in "Certified".
Though it was pretty easy to design the ER model, I can't seem to find a way of making this relation in PostgreSQL. Does anybody have any ideas?
Thanks
The solution should be database agnostic. If I understand you correctly, you probably want your certified table to look like:
CERTIFIED
id
bus_id
chauffer_id
...
...
The only solution I've been able to find is the notion of a single mandatory field in the parent table to represent the "at least one", and then storing the second and any further relationships in the intersection table.
chauffeurs
chauffeur_id
chauffer_name
certified_bus_id (not null)
certified
chauffer_id
bus_id
busses
bus_id
bus_name
certified_chauffer_id (not null)
Getting a list of all busses for which a chauffeur is certified then becomes:
select c.chauffer_name, b.bus_name
from chauffeurs c
inner join busses b on (b.bus_id = c.certified_bus_id)
UNION
select c.chauffer_name, b.bus_name
from chauffeurs c
inner join certified ct on (c.chauffeur_id = ct.chauffer_id)
inner join busses b on (ct.bus_id = b.bus_id)
The UNION (vs UNION ALL) takes care of deduplication with the values in certified.
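The deduplication can be observed with a small sketch in Python and an in-memory SQLite database. Table and column names follow the answer's spelling exactly; the chauffeurs, busses and certifications are invented sample data, including one certified row that deliberately repeats a mandatory certification.

```python
import sqlite3

# In-memory SQLite; schema mirrors the answer, with NOT NULL on the mandatory columns.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE chauffeurs (chauffeur_id INTEGER PRIMARY KEY, chauffer_name TEXT,
                             certified_bus_id INTEGER NOT NULL);
    CREATE TABLE busses (bus_id INTEGER PRIMARY KEY, bus_name TEXT,
                         certified_chauffer_id INTEGER NOT NULL);
    CREATE TABLE certified (chauffer_id INTEGER, bus_id INTEGER);
    INSERT INTO chauffeurs VALUES (1, 'Ann', 10), (2, 'Bob', 11);
    INSERT INTO busses VALUES (10, 'Bus 10', 1), (11, 'Bus 11', 2);
    -- Ann's extra certification for bus 11, plus a row repeating her mandatory one.
    INSERT INTO certified VALUES (1, 11), (1, 10);
""")

branches = (
    # Mandatory certification stored on the chauffeur row.
    """SELECT c.chauffer_name, b.bus_name
       FROM chauffeurs c
       INNER JOIN busses b ON b.bus_id = c.certified_bus_id""",
    # Additional certifications stored in the intersection table.
    """SELECT c.chauffer_name, b.bus_name
       FROM chauffeurs c
       INNER JOIN certified ct ON c.chauffeur_id = ct.chauffer_id
       INNER JOIN busses b ON ct.bus_id = b.bus_id""",
)

union_rows = sorted(conn.execute(" UNION ".join(branches)).fetchall())
union_all_rows = sorted(conn.execute(" UNION ALL ".join(branches)).fetchall())

print(union_rows)
print(union_all_rows)
```

With UNION ALL the pair (Ann, Bus 10) shows up twice, once from each branch; UNION collapses it to a single row.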