Optimize subquery join order - sql

I have this query, but it's taking a super long time.
I read on Wikipedia that the join order may be a factor:
The performance of a query plan is determined largely by the order in which the tables are joined. For example, when joining 3 tables A, B, C of size 10 rows, 10,000 rows, and 1,000,000 rows, respectively, a query plan that joins B and C first can take several orders of magnitude more time to execute than one that joins A and C first.
I'm trying to get TV shows an actor has been on through their episodes.
My query looks like this:
select distinct e.show_id
from episodes e
where e.id IN
(select c.episode_id
from contributions c
where c.person_id = #{person.id})
The row count for each table is:
2,500,000 episodes
600,000 contributions
40,000 shows
20,000 people
Am I on the right track, or should I be using joins? This query sometimes takes over 10 seconds on Heroku even though everything has an index.

Try using a JOIN instead of a nested select with IN. Something like this:
SELECT distinct e.show_id
FROM episodes e
JOIN contributions c
ON e.id = c.episode_id
WHERE c.person_id = #{person.id}
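If the optimizer still picks a poor plan, an EXISTS semi-join is another equivalent formulation worth comparing. A minimal sketch, assuming the same schema and Ruby-style interpolation as the question:
SELECT DISTINCT e.show_id
FROM episodes e
WHERE EXISTS
  (SELECT 1
   FROM contributions c
   WHERE c.episode_id = e.id          -- correlate on the episode
     AND c.person_id = #{person.id}); -- same interpolated id as above
An index on contributions(person_id, episode_id) should help either form, since the planner can then start from the small set of rows for one person instead of scanning episodes.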

Related

Joining 3 tables in SQLite and not receiving expected output

I understand similar questions have been posted; however, my issue isn't an error but rather the lack of the desired result. I'm trying to join 3 tables, each with 10,000 observations, and combine them into one table, but when I use INNER JOIN the observations reduce to a little over 4,000. I understand that INNER JOIN is essentially an intersection, but I'm expecting 10,000 observations, and based on my code I don't see how that is occurring. Here is my code:
SELECT *
FROM Characteristics
INNER JOIN Prices ON Prices.pid = Characteristics.pid
INNER JOIN Locations ON Locations.tid = Characteristics.tid
;
CHARACTERISTICS: Property_Id | Beds | Baths | Type_ID
PRICES: Price | Year | Property_ID
LOCATIONS: Type_ID | X coord | Y coord
Those are representative of the tables. I didn't include numbers because of formatting issues, but as you can imagine the values contained in Property_Id and Type_ID are the same across tables. What I would like is one table with each of the respective columns containing 10,000 rows. I've checked for NA values in R and they're all of the same length.
If you want to keep all characteristics -- even when there are no matches in the other tables -- then use left join:
SELECT *
FROM Characteristics c
LEFT JOIN Prices p ON p.pid = c.pid
LEFT JOIN Locations l ON l.tid = c.tid;
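If you instead want to see why INNER JOIN is losing rows, you can count the Characteristics rows with no match in each table. A quick diagnostic sketch, using the pid/tid column names from the query above:
-- Characteristics rows with no matching Prices row
SELECT COUNT(*)
FROM Characteristics c
LEFT JOIN Prices p ON p.pid = c.pid
WHERE p.pid IS NULL;
-- Characteristics rows with no matching Locations row
SELECT COUNT(*)
FROM Characteristics c
LEFT JOIN Locations l ON l.tid = c.tid
WHERE l.tid IS NULL;
If either count is nonzero, those are the rows the inner joins are dropping.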

Configuring Merge Join in PostgreSQL

I'm using PostgreSQL with big tables, and the query takes too much time.
I have two tables. The first one has about 6 million rows (data table), and the second one has about 30000 rows (users table).
Each user has about 200 rows in data table.
Later, the data and users tables may grow up to 30 times larger.
My query is:
SELECT d.name, count(*) c
FROM users AS u JOIN data AS d on d.id = u.id
WHERE u.language = 'eng' GROUP BY d.name ORDER BY c DESC LIMIT 10;
90% of users have the 'eng' language, and the query time is 7 seconds. Each column is indexed!
I read about Merge Join and that it should be really fast, so I sorted the tables by id and forced a Merge Join, but the time increased to 20 seconds.
I suppose the table configuration is wrong, but I don't know how to fix it.
Should I make other improvements?
For this query:
SELECT d.name, count(*) c
FROM users u
JOIN data d ON d.id = u.id
WHERE u.language = 'eng'
GROUP BY d.name
ORDER BY c DESC
LIMIT 10;
First, try indexes: users(language, id), data(id, name). See if this speeds up the query.
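In PostgreSQL those indexes would look like this (the index names are illustrative):
CREATE INDEX users_language_id_idx ON users (language, id);
CREATE INDEX data_id_name_idx ON data (id, name);
The second index includes both the join column and the grouped column, so the planner may be able to answer the join and aggregation with an index-only scan on data.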
Second, what is d.name? Can a user have more than one of them? Is there a table of valid values? Depending on the answers to these questions, there may be other ways to structure the query.

SQL query - Joining a many-to-many relationship, filtering/joining selectively

I find myself in a bit of an unworkable situation with a SQL query and I'm hoping that I'm missing something or might learn something new. The structure of the DB2 database I'm working with isn't exactly built for this sort of query, but I'm tasked with this...
Let's say we have Table People and Table Groups. Groups can contain multiple people, and one person can be part of multiple groups. Yeah, it's already messy. In any case, there are a couple of intermediary tables linking the two. The problem is that I need to start with a list of groups, get all of the people in those groups, and then get all of the groups with which the people are affiliated, which would be a superset of the initial group set. This would mean starting with groups, joining down to the people, and then going BACK and joining to the groups again. I need information from both tables in the result set, too, so that rules out a number of techniques.
I have to join this with a number of other tables for additional information, and the query is getting enormous, cumbersome, and slow. I'm wondering if there's some way that I could start with People, join it to Groups, and then specify that if a person has one group that is in the supplied set of groups (which is done via a subquery), then ALL groups for that person should be returned. I don't know of a way to make that happen, but I'm thinking (hoping) that there's a relatively clean way to do it in SQL.
A quick and dirty example:
SELECT ...
FROM GROUPS g
JOIN LINKING_A a
ON g.GROUPID = a.GROUPID
AND GROUPID IN (subquery)
JOIN LINKING_B b
ON a.GROUPLIST = b.GROUPLIST
JOIN PEOPLE p
ON b.PERSONID = p.PERSONID
--This gets me all people affiliated with groups,
-- but now I need all groups affiliated with those people...
JOIN LINKING_B b2
ON p.PERSONID = b2.PERSONID
JOIN LINKING_A a2
ON b2.GROUPLIST = a2.GROUPLIST
JOIN GROUPS g2
ON a2.GROUPID = g2.GROUPID
And then I can return information from p and g2 in the result set. You can see where I'm having trouble. That's a lot of joining on some large tables, not to mention a number of other joins that are performed in this query as well. I need to be able to query by joining PEOPLE to GROUPS, then specify that if any person has an associated group that is in the subquery, it should return ALL groups affiliated with that entry in PEOPLE. I'm thinking that GROUP BY might be just the thing, but I haven't used that one enough to really know. So if Bill is part of group A, B, and C, and our subquery returns a set containing Group A, the result set should include Bill along with groups A, B, and C.
The following is a shorter way to get all the groups that people in the supplied group list are in. Does this help?
Select g.*
From Linking_B b
Join Linking_B b2
On b2.PersonId = b.PersonId
Join Group g
On g.GroupId = b2.GroupId
Where b.Groupid in (SubQuery)
I'm not clear why you have both Linking_A and Linking_B. Generally all you should need to represent a many-to-many relationship between two master tables is a single association table with GroupID and PersonId.
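For comparison, a minimal sketch of such an association table (the names are illustrative):
CREATE TABLE Person_Group (
  PersonId INTEGER NOT NULL,
  GroupId  INTEGER NOT NULL,
  PRIMARY KEY (PersonId, GroupId) -- one row per membership
);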
I often recommend using common table expressions (CTEs) to help break a problem into chunks that are easier to understand. CTEs are specified using a WITH clause, which can contain several CTEs before the main SELECT query.
I'm going to assume that the list of groups you want to start with is specified by your subquery, so that will be the first CTE. The next one selects people who belong to those groups. The final part of the query then selects the groups those people belong to, and returns the columns from both master tables.
WITH g1 as
(subquery)
, p1 as
(SELECT p.*
from g1
join Linking a1 on g1.groupID=a1.groupID
join People p on p.personID=a1.personID )
SELECT p1.*, g2.*
from p1
join Linking a2 on p1.personID=a2.personID
join Groups g2 on g2.groupID=a2.groupID
I think I'd build the list of people you want to pull records for first, then use that to query out all the groups for those people. This will work across any number of link tables with the appropriate joins added:
with persons_wanted as
(
--figure out which people are in a group you want to include
select p.person_key
from person p
join link l1
on p.person_key = l1.person_key
join groups g
on l1.group_key = g.group_key
where g.name in ('GROUP_I_WANT_PEOPLE_FROM', 'THIS_ONE_TOO')
group by p.person_key --we only want each person_key once
)
--now pull all the groups for the list of people in at least one group we want
select p.name as person_name, g.name as group_name, ...
from person p
join link l1
on p.person_key = l1.person_key
join groups g
on l1.group_key = g.group_key
where p.person_key in (select person_key from persons_wanted);
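Equivalently, you can join to the CTE instead of filtering with IN, which some planners handle better. A sketch, assuming the same illustrative schema as above:
select p.name as person_name, g.name as group_name
from persons_wanted pw
join person p
on p.person_key = pw.person_key
join link l1
on p.person_key = l1.person_key
join groups g
on l1.group_key = g.group_key;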

SQL Query, return all children in a one-to-many relationship when one child matches

I'm working on enhancing a query for a DB2 database and I'm having some problems getting acceptable performance due to the number of joins across large tables that need to be performed to get all of the data and I'm hoping that there's a SQL function or technique that can simplify and speed up the process.
To break it down, let's say there are two tables: People and Groups. Groups contain multiple people, and a person can be part of multiple groups. It's a many-to-many, but bear with me. Basically, there's a subquery that will return a set of groups. From that, I can join to People (which requires additional joins across other tables) to get all of the people from those groups. However, I also need to know all of the groups that those people are in, which means joining back to the Groups table again (several more joins) to get a superset of the original subquery. There are additional joins in the query as well to get other pieces of relevant data, and the cost is adding up in a very ugly way. I also need to return information from both tables, so that rules out a number of techniques.
What I'd like to do is be able to start with the People table, join it to Groups, and then compare Groups with the subquery. If the Groups attached to one person have at least one match in the subquery, it should return ALL Group items associated with that person.
Essentially, let's say that Bob is part of Group A, B, and C. Currently, I start with groups, and let's say that only Group A comes out of the subquery. Then I join A to Bob, but then I have to come back and join Bob to Group again to get B and C. SQL example:
SELECT p.*, g2.*
FROM GROUP g
JOIN LINKA link
ON link.GROUPID = g.GROUPID
JOIN LINKB link1
ON link1.LISTID = link.LISTID
JOIN PERSON p
ON link1.PERSONID = p.PERSONID
JOIN LINKB link2
ON link2.PERSONID = p.PERSONID
JOIN LINKA link3
ON link2.LISTID = link3.LISTID
JOIN GROUP g2
ON link3.GROUPID = g2.GROUPID
WHERE
g.GROUPID IN (subquery)
Yes, the linking tables aren't ideal, but they're basically normalized tables containing additional information that is not relevant to the query I'm running. We have to start with a filtered Group set, join to People, then come back to get all of the Groups that the People are associated with.
What I'd like to do is start with People, join to Group, and if ANY Group that Bob is in returns from the subquery, ALL should be returned, so if we have Bob joined to A, B, and C, and A is in the subquery, it will return three rows of Bob to A, B, and C as there was at least one match. In this way, it could be treated as a one-to-many relationship if we're only concerned with the Groups for each Person and not the other way around. SQL example:
SELECT p.*, g.*
FROM PEOPLE p
JOIN LINKB link
ON link.PERSONID = p.PERSONID
JOIN LINKA link1
ON link.LISTID = link1.LISTID
JOIN GROUP g
ON link1.GROUPID = g.GROUPID
WHERE
--SQL function, expression, or other method to return
--all groups for any person who is part of any group contained in the subquery
The number of joins in the first query makes it largely unusable, as these are some pretty big tables. The second would be far more ideal if this sort of thing is possible.
From the question, I think you are querying hierarchical data. DB2 provides a facility to deal with such data: the START WITH and CONNECT BY clauses, which are explained in the DB2 documentation.
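For reference, this is the general shape of such a hierarchical query, assuming a DB2 release with the Oracle-compatible syntax enabled; the table and column names here are illustrative, since the schema in the question has no parent/child column:
SELECT GroupId, ParentGroupId
FROM Groups
START WITH ParentGroupId IS NULL          -- roots of the hierarchy
CONNECT BY PRIOR GroupId = ParentGroupId; -- walk from parent to child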

How do I choose the best SQL query if there are different ways of accomplishing the same task?

I'm learning SQL (using SQLite 3 and its sqlite3 command-line tool) and I've noticed that I can do some things in several ways, and sometimes it is not clear which one is better. Here are three queries which do the same thing: one executed through intersect, another through inner join and distinct, and the last one similar to the second but incorporating filtering through where. (The first one was written by the author of the book I'm reading, and the others I wrote myself.)
The question is, which of these queries is better and why? And, more generally, how can I know when one query is better than another? Are there some guidelines I missed or perhaps I should learn SQLite internals despite the declarative nature of SQL?
(In the following example, there are tables that describe food names that are mentioned in some TV series. Foods_episodes is a many-to-many linking table, while the others describe food names and episode names together with season numbers. Note that the all-time top ten foods (based on the count of their appearances in all series) are being looked for, not just the top foods in seasons 3..5.)
-- task
-- find the all-time top ten foods that appear in seasons 3 through 5
-- schema
-- CREATE TABLE episodes (
-- id integer primary key,
-- season int,
-- name text );
-- CREATE TABLE foods(
-- id integer primary key,
-- name text );
-- CREATE TABLE foods_episodes(
-- food_id integer,
-- episode_id integer );
select f.* from foods f
inner join
(select food_id, count(food_id) as count
from foods_episodes
group by food_id
order by count(food_id) desc limit 10) top_foods
on f.id=top_foods.food_id
intersect
select f.* from foods f
inner join foods_episodes fe on f.id = fe.food_id
inner join episodes e on fe.episode_id = e.id
where
e.season between 3 and 5
order by
f.name;
select
distinct f.*
from
foods_episodes as fe
inner join episodes as e on e.id = fe.episode_id
inner join foods as f on fe.food_id = f.id
inner join (select food_id from foods_episodes
group by food_id order by count(*) desc limit 10) as lol
on lol.food_id = fe.food_id
where
e.season between 3 and 5
order by
f.name;
select
distinct f.*
from
foods_episodes as fe
inner join episodes as e on e.id = fe.episode_id
inner join foods as f on fe.food_id = f.id
where
fe.food_id in (select food_id from foods_episodes
group by food_id order by count(*) desc limit 10)
and e.season between 3 and 5
order by
f.name;
-- output (same for these three):
-- id name
-- ---------- ----------
-- 4 Bear Claws
-- 146 Decaf Capp
-- 153 Hennigen's
-- 55 Kasha
-- 94 Ketchup
-- 164 Naya Water
-- 317 Pizza
-- CPU Time: user 0.000000 sys 0.000000
Similar to MySQL, SQLite has an EXPLAIN command. Prepend your select with the EXPLAIN keyword and it will return information about how the query will be executed, including which tables are scanned and which indexes are used.
http://www.sqlite.org/lang_explain.html
By running EXPLAIN on various selects you can determine which queries (and sub-queries) are more efficient than others.
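For example, the EXPLAIN QUERY PLAN variant gives a compact summary of the chosen plan (the exact output format varies by SQLite version); running it against the third query above:
EXPLAIN QUERY PLAN
select distinct f.*
from foods_episodes as fe
inner join episodes as e on e.id = fe.episode_id
inner join foods as f on fe.food_id = f.id
where fe.food_id in (select food_id from foods_episodes
group by food_id order by count(*) desc limit 10)
and e.season between 3 and 5
order by f.name;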
And here is a general overview of SQLite's query planner and optimization: http://sqlite.org/optoverview.html
SQLite3 also supports a callback function to trace queries, though you have to implement it yourself: http://www.sqlite.org/c3ref/profile.html
Generally, there is more than one way to solve a problem. If you are getting correct answers, the only other question is whether the process/script/statement needs to be improved, or if it works well now.
In SQL generally, there may be a "best" way, but it's usually not the goal to find a canonical best way to do something - you want a way that efficiently balances your needs from the program and your time. You can spend months optimizing a process, but if the process is used only weekly and it only takes 5 minutes now, reducing it to 4 minutes isn't much help.
It's weird to transition from a context where there are correct answers (like school) to a context where the goal is to get something done well, and "works well enough" trumps perfect because there are time constraints. It's something that took me a while to appreciate, but I'm not sure there is a better answer. Hope the perspective helps a bit!