We have a query that is currently killing our database and I know there has to be a way to optimize it. We have 3 tables:
items - table of items where each item has an associated object_id, length, difficulty_rating, rating, avg_rating & status
lists - table of lists which are basically lists of items created by our users
list_items - table with 2 columns: list_id, item_id
We've been using the following query to display a simple HTML table that shows each list and a number of attributes related to the list, including averages of attributes of the included list items:
select object_id, user_id, slug, title, description, items,
city, state, country, created, updated,
(select AVG(rating) from items
where object_id IN
(select object_id from list_items where list_id=lists.object_id)
AND status="A"
) as 'avg_rating',
(select AVG(avg_rating) from items
where object_id IN
(select object_id from list_items where list_id=lists.object_id)
AND status="A"
) as 'avg_avg_rating',
(select AVG(length) from items
where object_id IN
(select object_id from list_items where list_id=lists.object_id)
AND status="A"
) as 'avg_length',
(select AVG(difficulty_rating) from items
where object_id IN
(select object_id from list_items where list_id=lists.object_id)
AND status="A"
) as 'avg_difficulty'
from lists
where user_id=$user_id AND status="A"
order by $orderby LIMIT $start,$step
The reason we haven't broken this up into one query to get all the lists, plus subsequent lookups to pull the averages for each list, is that we want the user to be able to sort on the averages columns (e.g. 'order by avg_difficulty').
Hopefully my explanation makes sense. There has to be a much more efficient way to do this and I'm hoping that a MySQL guru out there can point me in the right direction. Thanks!
It looks like you can replace all the subqueries with joins:
SELECT l.object_id,
l.user_id,
<other columns from lists>,
AVG(i.rating) as avgrating,
AVG(i.avg_rating) as avgavgrating,
<other averages>
FROM lists l
LEFT JOIN list_items li
ON li.list_id = l.object_id
LEFT JOIN items i
ON i.object_id = li.object_id
AND i.status = 'A'
WHERE l.user_id = $user_id AND l.status = 'A'
GROUP BY l.object_id, l.user_id, <other columns from lists>
That would save a lot of work for the DB engine.
Here's how to find the bottleneck:
Add the keyword EXPLAIN before the SELECT. This will cause the engine to output how the SELECT was performed.
To learn more about Query Optimization with this method see: http://dev.mysql.com/doc/refman/5.0/en/using-explain.html
A couple of things to consider:
Make sure that all of your joins are indexed on both sides. For example, you join list_items.list_id=lists.object_id in several places. list_id and object_id should both have indexes on them.
Have you done any research as to how much the averages actually vary? You might benefit from having a worker thread (or cron job) calculate the averages periodically rather than putting the load on your RDBMS every time you run this query. You'd need to store the averages in a separate table, of course; see the sketch after these points.
Also, are you using status as an enum or a varchar? The cardinality of an enum would be much lower; consider switching to this type if you have a limited range of values for the status column.
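If you do precompute, a minimal sketch of the summary table and its periodic refresh might look like this (the table name and columns are made up; the join mirrors the answer above):
-- Hypothetical summary table, refreshed by a cron job or worker thread
CREATE TABLE list_averages (
    list_id        INT PRIMARY KEY,
    avg_rating     DECIMAL(10,4),
    avg_avg_rating DECIMAL(10,4),
    avg_length     DECIMAL(10,4),
    avg_difficulty DECIMAL(10,4)
);

-- Recompute all averages in one pass; MySQL's REPLACE overwrites stale rows
REPLACE INTO list_averages
SELECT li.list_id,
       AVG(i.rating), AVG(i.avg_rating), AVG(i.length), AVG(i.difficulty_rating)
FROM list_items li
JOIN items i ON i.object_id = li.object_id AND i.status = 'A'
GROUP BY li.list_id;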
-aj
That's one hell of a query... you should probably edit your question and change the query so it's a bit more readable, although due to the complex nature of it, I'm not sure that's possible.
Anyway, the simple answer here is to denormalize your database a bit and cache all of your averages on the lists table itself in indexed decimal columns. All those subqueries are killing you.
The hard part, and what you'll have to figure out, is how to keep those averages updated. A generally easy way is to store the count of all items and the sum of all their values in two separate fields. Any time an item is added, increment the count by 1 and the sum by the item's value, then recompute avg_field = sum_field / count_field.
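A minimal sketch of that bookkeeping, assuming hypothetical rating_sum, rating_count and avg_rating columns on lists:
-- When an item with rating 4 is added to list 123: compute the new average
-- from the old sum/count first, then bump them. (MySQL applies SET clauses
-- left to right, so this ordering is safe there too.)
UPDATE lists
SET avg_rating   = (rating_sum + 4) / (rating_count + 1),
    rating_sum   = rating_sum + 4,
    rating_count = rating_count + 1
WHERE object_id = 123;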
Besides indexing, even a cursory analysis shows that your query contains a lot of redundancy that your DBMS's optimizer may not be able to spot (SQL is a redundant language; it admits too many equivalent, syntactically different expressions; this is a known and documented problem - see for example SQL redundancy and DBMS performance, by Fabian Pascal).
I will rewrite your query, below, to highlight that:
let LI =
select object_id from list_items where list_id=lists.object_id
in
select object_id, user_id, slug, title, description, items, city, state, country, created, updated,
(select AVG(rating) from items where object_id IN LI AND status="A") as 'avg_rating',
(select AVG(avg_rating) from items where object_id IN LI AND status="A") as 'avg_avg_rating',
(select AVG(length) from items where object_id IN LI AND status="A") as 'avg_length',
(select AVG(difficulty_rating) from items where object_id IN LI AND status="A") as 'avg_difficulty'
from lists
where user_id=$user_id AND status="A"
order by $orderby
LIMIT $start, $step
Note: this is only the first step to refactor that beast.
I wonder why people rarely - if at all - use views, even if only to simplify SQL queries. They help in writing more manageable and refactorable queries.
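For example, a view along these lines (the name list_item_stats is made up) would collapse the original query to a single join:
CREATE VIEW list_item_stats AS
SELECT li.list_id,
       AVG(i.rating)            AS avg_rating,
       AVG(i.avg_rating)        AS avg_avg_rating,
       AVG(i.length)            AS avg_length,
       AVG(i.difficulty_rating) AS avg_difficulty
FROM list_items li
JOIN items i ON i.object_id = li.object_id AND i.status = 'A'
GROUP BY li.list_id;

-- The list page is then a plain join, still sortable on the averages:
SELECT l.*, s.avg_rating, s.avg_avg_rating, s.avg_length, s.avg_difficulty
FROM lists l
LEFT JOIN list_item_stats s ON s.list_id = l.object_id
WHERE l.user_id = $user_id AND l.status = 'A'
ORDER BY $orderby LIMIT $start, $step;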
Related
I have a table, the columns are:
Respondent_ID, classical, gospel, pop, kpop, country, folk, rock, metal ... (all genre of music)
there are 16 columns of different type of genre of music,
and data value is Never, Rarely, Sometimes or Very frequently
SELECT *
FROM genre_frequency
WHERE
I want to design a query which shows the results of all columns in the table that have the value 'Very frequently'. Can anyone lend me a hand here? I'm still new to this.
Could put the same criteria under every genre field with OR operator - very messy. Or could use a VBA custom function.
Or could normalize data structure so you have fields: RespondentID, Genre, Frequency. A UNION query can rearrange data to this normalized structure (unpivot). There is a limit of 50 SELECT lines and there is no builder or wizard for UNION - must type or copy/paste in SQL View.
SELECT Respondent_ID, "classical" AS Genre, classical AS Frequency FROM genre_frequency
UNION SELECT Respondent_ID, "gospel", gospel FROM genre_frequency
... {continue for additional genre columns};
Now use that query like a table in subsequent queries. Just cannot edit data.
SELECT * FROM qryUNION WHERE Frequency="Very frequently";
A UNION query can perform slowly with a very large dataset. It would probably be best to redesign the table. You could save this rearranged data to a table. If you want to utilize lookup tables for Genre and Frequency in order to save ID keys instead of full descriptive text, that can also be accommodated in the redesign.
You should normalize your schema. This one has the problem that it requires you to alter the table whenever you want to add or remove a genre.
You must have at least three tables:
Table Respondent: Respondent_ID (PK), Name, etc.
Table Genre: Genre_ID (PK), Name
Table Respondent_Genre: Respondent_ID (PK, FK), Genre_ID (PK, FK), Frequency
This also easily allows you to alter the name of a genre or to add additional attributes to a genre like sub-genre or an annotation like (1930–present).
Optionally, you could also have a lookup table for Frequencies and then include the Frequency_ID in Respondent_Genre instead of the Frequency as text.
Then you can write a query like this:
SELECT r.Name, g.Name, rg.Frequency
FROM
(Respondent r
INNER JOIN Respondent_Genre rg
ON r.Respondent_ID = rg.Respondent_ID)
INNER JOIN Genre g
ON rg.Genre_ID = g.Genre_ID
WHERE
rg.Frequency = 'Very Frequently'
I have a data model such that items can have many-to-many relationships with other items in the same table, using a second table to define relationships. Let's call the primary table items, keyed by item_id, and the relationships table item_assoc, with columns item_id, other_item_id and assoc_type. Generally, you might use a union to pick up relationships that may be defined in either direction in the item_assoc table, but you would wind up repeating other parts of the same query just to be sure to pick up associations defined in either direction.
Let's say that you're trying to put together a fairly complex query similar to the following where you want to find a list of items that have related items that COULD have associated cancellation items, but select those that do not have cancellation items:
select
orig.*
from items as orig
join item_assoc as orig2related
on orig.item_id = orig2related.item_id
join items as related
on orig2related.other_item_id = related.item_id
and orig2related.assoc_type = 'Related'
left join item_assoc as related2cancel
on related.item_id = related2cancel.item_id
left join items as cancel
on related2cancel.other_item_id = cancel.item_id
and related2cancel.assoc_type = 'Cancellation'
where cancel.item_id is null
This query obviously only picks up items whose relationships are defined in one direction. For a less complex query, I might solve this by adding a union at the bottom for every permutation of the reverse relationships, but I think that would make the query unnecessarily long and hard to understand.
Is there a way I can define both directions of each relationship without repeating the other parts of the query?
A UNION over item_assoc could help. Assuming you have a DB without a WITH clause, you would have to define a view:
CREATE VIEW bidirec_item_assoc AS
(
SELECT item_id, other_item_id, assoc_type, 1 as direction FROM item_assoc
UNION
SELECT other_item_id, item_id, assoc_type, 2 as direction FROM item_assoc
)
You can now use bidirec_item_assoc in your queries where you have used item_assoc before.
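For example, the sample query from the question might become something like this (a sketch, untested; the joins back to items are dropped because only the association rows matter for the existence test):
SELECT orig.*
FROM items AS orig
JOIN bidirec_item_assoc AS orig2related
  ON orig.item_id = orig2related.item_id
 AND orig2related.assoc_type = 'Related'
LEFT JOIN bidirec_item_assoc AS related2cancel
  ON related2cancel.item_id = orig2related.other_item_id
 AND related2cancel.assoc_type = 'Cancellation'
WHERE related2cancel.item_id IS NULL;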
Edit: You could add columns for direction and relation type, of course.
Simplify, simplify, simplify: Don't involve tables in the query that aren't needed.
The following query should be equivalent to your sample query and more expressive of your intent:
select i.*
from items i
where not exists ( select *
                   from item_assoc r
                   join item_assoc c on c.item_id = r.other_item_id
                                    and c.assoc_type = 'Cancellation'
                   where r.item_id = i.item_id
                     and r.assoc_type = 'Related'
                 )
It should select the set of items that aren't related to an item that has been cancelled. There's no need to join against the items table three times.
Further, your original query will have duplicate rows: every row in the first item table (orig) will be duplicated once for every related item.
I have three tables that control products, colors and sizes. Products may or may not have colors and sizes. Colors may or may not have sizes.
product        color                            size
-------        -------                          -------
id             id                               id
unique_id      id_product (FK from product)     id_product (FK from version)
stock          unique_id                        id_version (FK from version)
title          stock                            unique_id
                                                stock
The unique_id column, which is present in all tables, is a serial type (autoincrement) whose counter is shared by the three tables; basically it works as a globally unique ID across them.
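In PostgreSQL, such a shared counter is typically a single sequence used as the default for all three tables; a minimal sketch (the sequence name is made up):
CREATE SEQUENCE global_unique_id_seq;

CREATE TABLE product (
    id        serial PRIMARY KEY,
    unique_id integer NOT NULL DEFAULT nextval('global_unique_id_seq'),
    stock     integer,
    title     text
);
-- color and size declare unique_id the same way, so every value drawn
-- from the sequence is unique across the three tables.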
It works fine, but I am trying to improve query performance when I have to select some fields based on the unique_id.
As I don't know which table holds the unique_id I am looking for, I am using a UNION, like below:
select title, stock
from product
where unique_id = 10
UNION
select p.title, c.stock
from color c
join product p on c.id_product = p.id
where c.unique_id = 10
UNION
select p.title, s.stock
from size s
join product p on s.id_product = p.id
where s.unique_id = 10;
Is there a better way to do this? Thanks for any suggestions!
EDIT 1
Based on @ErwinBrandstetter's and @ErikE's answers, I decided to use the query below. The main reasons are:
1) As unique_id is indexed in all tables, I will get good performance.
2) Using the unique_id I will find the product code, so I can get all the columns I need with another simple join.
SELECT
p.title,
ps.stock
FROM (
select id as id_product, stock
from product
where unique_id = 10
UNION
select id_product, stock
from color
where unique_id = 10
UNION
select id_product, stock
from size
where unique_id = 10
) AS ps
JOIN product p ON ps.id_product = p.id;
PL/pgSQL function
To solve the problem at hand, a plpgsql function like the following should be faster:
CREATE OR REPLACE FUNCTION func(int)
  RETURNS TABLE (title text, stock int) LANGUAGE plpgsql AS
$BODY$
BEGIN
    RETURN QUERY
    SELECT p.title, p.stock
    FROM   product p
    WHERE  p.unique_id = $1;   -- Put the most likely table first.

    IF NOT FOUND THEN
        RETURN QUERY
        SELECT p.title, c.stock
        FROM   color c
        JOIN   product p ON c.id_product = p.id
        WHERE  c.unique_id = $1;
    END IF;

    IF NOT FOUND THEN
        RETURN QUERY
        SELECT p.title, s.stock
        FROM   size s
        JOIN   product p ON s.id_product = p.id
        WHERE  s.unique_id = $1;
    END IF;
END;
$BODY$;
Updated function with table-qualified column names to avoid naming conflicts with OUT parameters.
RETURNS TABLE requires PostgreSQL 8.4, RETURN QUERY requires version 8.2. You can substitute both for older versions.
It goes without saying that you need to index the columns unique_id of every involved table. id should be indexed automatically, being the primary key.
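For instance (index names are arbitrary):
CREATE INDEX product_unique_id_idx ON product (unique_id);
CREATE INDEX color_unique_id_idx   ON color   (unique_id);
CREATE INDEX size_unique_id_idx    ON size    (unique_id);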
Redesign
Ideally, you would be able to tell which table an ID belongs to from the ID alone. You could keep using one common sequence, but add 100000000 for the first table, 200000000 for the second and 300000000 for the third - or whatever suits your needs. This way, the most significant digit of the number identifies the table.
A plain integer spans numbers from -2147483648 to +2147483647; move to bigint if that's not enough for you. I would stick to integer IDs, though, if possible. They are smaller and faster than bigint or text.
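With such ranges, the owning table can be read off the number itself, for example (offsets as assumed above):
-- Integer division isolates the range marker: 1 = product, 2 = color, 3 = size
SELECT CASE 210000042 / 100000000
         WHEN 1 THEN 'product'
         WHEN 2 THEN 'color'
         WHEN 3 THEN 'size'
       END AS source_table;   -- returns 'color'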
CTEs (experimental!)
If you cannot create a function for some reason, this pure SQL solution might do a similar trick:
WITH x(uid) AS (SELECT 10) -- provide unique_id here
, a AS (
SELECT title, stock
FROM x, product
WHERE unique_id = x.uid
)
, b AS (
SELECT p.title, c.stock
FROM x, color c
JOIN product p ON c.id_product = p.id
WHERE NOT EXISTS (SELECT 1 FROM a)
AND c.unique_id = x.uid
)
, c AS (
SELECT p.title, s.stock
FROM x, size s
JOIN product p ON s.id_product = p.id
WHERE NOT EXISTS (SELECT 1 FROM b)
AND s.unique_id = x.uid
)
SELECT * FROM a
UNION ALL
SELECT * FROM b
UNION ALL
SELECT * FROM c;
I am not sure whether it avoids additional scans like I hope. Would have to be tested. This query requires at least PostgreSQL 8.4.
Upgrade!
As I just learned, the OP runs on PostgreSQL 8.1.
Upgrading alone would speed up the operation a lot.
Query for PostgreSQL 8.1
As you are limited in your options, and a plpgsql function is not possible, this query should perform better than the one you have. Test with EXPLAIN ANALYZE - available in v8.1.
SELECT title, stock
FROM product
WHERE unique_id = 10
UNION ALL
SELECT p.title, ps.stock
FROM product p
JOIN (
SELECT id_product, stock
FROM color
WHERE unique_id = 10
UNION ALL
SELECT id_product, stock
FROM size
WHERE unique_id = 10
) ps ON ps.id_product = p.id;
I think it's time for a redesign.
You have things that you're using as bar codes for items that are basically all the same in one respect (they are SerialNumberItems), but have been split into multiple tables because they are different in other respects.
I have several ideas for you:
Change the Defaults
Just make each product required to have one color "no color" and one size "no size". Then you can query any table you want to find the info you need.
SuperType/SubType
Without too much modification you could use the supertype/subtype database design pattern.
In it, there is a parent table where all the distinct detail-level identifiers live, and the shared columns of the subtype tables go in the supertype table (the ways that all the items are the same). There is one subtype table for each different way that the items are distinct. If mutual exclusivity of the subtype is required (you can have a Color or a Size but not both), then the parent table is given a TypeID column and the subtype tables have an FK to both the ParentID and the TypeID. Looking at your design, in fact you would not use mutual exclusivity.
If you use the pattern of a supertype table, you do have the issue of having to insert in two parts, first to the supertype, then the subtype. Deleting also requires deleting in reverse order. But you get a great benefit of being able to get basic information such as Title and Stock out of the supertype table with a single query.
You could even create schema-bound views for each subtype, with instead-of triggers that convert inserts, updates, and deletes into operations on the base table + child table.
A Bigger Redesign
You could completely change how Colors and Sizes are related to products.
First, your patterns of "has-a" are these:
Product (has nothing)
Product->Color
Product->Size
Product->Color->Size
There is a problem here. Clearly Product is the main item that has other things (colors and sizes) but colors don't have sizes! That is an arbitrary assignment. You may as well have said that Sizes have Colors--it doesn't make a difference. This reveals that your table design may not be best, as you're trying to model orthogonal data in a parent-child type of relationship. Really, products have a ColorAndSize.
Furthermore, when a product comes in colors and sizes, what does the unique_id in the Color table mean? Can such a product be ordered without a size, having only a color? This design is assigning a unique ID to something that (it seems to me) should never be allowed to be ordered--but you can't find this information out from the Color table; you have to compare the Color and Size tables first. It is a problem.
I would design this as: Table Product. Table Size listing all distinct sizes possible for any product ever. Table Color listing all distinct colors possible for any product ever. And table OrderableProduct that has columns ProductId, ColorID, SizeID, and UniqueID (your bar code value). Additionally, each product must have one color and one size or it doesn't exist.
Basically, Color and Size are like X and Y coordinates into a grid; you are filling in the boxes that are allowable combinations. Which one is the row and which the column is irrelevant. Certainly, one is not a child of the other.
If there are any reasonable rules, in general, about what colors or sizes can be applied to various sub-groups of products, there might be utility in a ProductType table and a ProductTypeOrderables table that, when creating a new product, could populate the OrderableProduct table with the standard set—it could still be customized but might be easier to modify than to create anew. Or, it could define the range of colors and sizes that are allowable. You might need separate ProductTypeAllowedColor and ProductTypeAllowedSize tables. For example, if you are selling T-shirts, you'd want to allow XXXS, XXS, XS, S, M, L, XL, XXL, XXXL, and XXXXL, even if most products never use all those sizes. But for soft drinks, the sizes might be 6-pack 8oz, 24-pack 8oz, 2 liter, and so on, even if each soft drink is not offered in that size (and soft drinks don't have colors).
In this new scheme, you only have one table to query to find the correct orderable product. With proper indexes, it should be blazing fast.
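A sketch of that design (names, types and constraints are assumptions, not taken from the original schema):
CREATE TABLE orderable_product (
    product_id integer NOT NULL REFERENCES product (id),
    color_id   integer NOT NULL REFERENCES color (id),
    size_id    integer NOT NULL REFERENCES size (id),
    unique_id  integer NOT NULL UNIQUE,   -- the bar-code value
    stock      integer NOT NULL,
    PRIMARY KEY (product_id, color_id, size_id)
);

-- One indexed lookup replaces the three-way UNION:
SELECT p.title, op.stock
FROM orderable_product op
JOIN product p ON p.id = op.product_id
WHERE op.unique_id = 10;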
Your Question
You asked:
In PostgreSQL, do you think that if I use indexes on unique_id I will get satisfactory performance?
Any column or set of columns that you use to repeatedly look up data must have an index! Any other pattern will result in a full table scan each time, which will be awful performance. I am sure that these indexes will make your queries lightning fast as it will take only one leaf-level read per table.
There's an easier way to generate unique IDs using three separate auto_increment columns. Just prepend a letter to the ID to uniquify it:
Colors:
C0000001
C0000002
C0000003
Sizes:
S0000001
S0000002
S0000003
...
Products:
P0000001
P0000002
P0000003
...
A few advantages:
You don't need to serialize creation of ids across tables to ensure uniqueness. This will give better performance.
You don't actually need to store the letter in the table. All IDs in the same table start with the same letter, so you only need to store the number. This means that you can use an ordinary auto_increment column to generate your IDs.
If you have an ID you only need to check the first character to see which table it can be found in. You don't even need to make a query to the database if you just want to know whether it's a product ID or a size ID.
A disadvantage:
It's no longer a number. But you can get around that by using 1,2,3 instead of C,S,P.
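A sketch of composing and decomposing such IDs (PostgreSQL syntax, since that's what the asker runs; the 7-digit width is arbitrary):
-- Compose the displayed ID from the table letter and the stored number:
SELECT 'P' || lpad(id::text, 7, '0') AS display_id FROM product;

-- Route an incoming ID on its first character, then strip it off:
SELECT substring('C0000042' from 1 for 1) AS table_letter,   -- 'C' -> color
       substring('C0000042' from 2)::int  AS row_id;         -- 42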
Your query will be pretty efficient, as long as you have an index on unique_id in every table, plus indexes on the joining columns.
You could turn those UNIONs into UNION ALL, but there won't be any difference in performance for this query.
This is a bit different. I don't understand the intended behaviour if stock exists in more than one of the {product, color, zsize} tables. (UNION will remove duplicates, but only for the row as a whole, e.g. the {product_id, stock} tuples; that makes no sense to me.) I just take the first non-null value. (Note the funky self-join!)
SELECT p.title
, COALESCE (p2.stock, c.stock, s.stock) AS stock
FROM product p
LEFT JOIN product p2 on p2.id = p.id AND p2.unique_id = 10
LEFT JOIN color c on c.id_product = p.id AND c.unique_id = 10
LEFT JOIN zsize s on s.id_product = p.id AND s.unique_id = 10
WHERE COALESCE (p2.stock, c.stock, s.stock) IS NOT NULL
;
I have these tables: Genre and Songs. There is obviously a many-to-many relationship between them, as one genre can have many songs and one song may belong to many genres (say there is a song xyz; it may belong to rap and also to hip-hop). I have the table GenreSongs which acts as a many-to-many relationship map between these two, as it contains GenreID and SongID columns. What I am supposed to do is add a column to the Genre table named SongsCount which will contain the number of songs in that genre. I can alter the table to add the column, and also create a query that gives the count of songs:
SELECT GenreID, Count(SongID) FROM GenreSongs GROUP BY GenreID
Now, this gives us what we require, the number of songs per genre, but how can I use this query to update the column I made (SongsCount)? One way is to run this query, look at the results, and then manually update that column, but I am sure everyone will agree that's not a programmatic way to do it.
I came to think I would need to create a query with a subquery that gets the value of GenreID from the outer query and counts its occurrences in the inner query (a correlated query), but I can't manage to write one. Can anyone please help me with it?
The question of how to approach this depends on the size of your data and how frequently it is updated. Here are some scenarios.
If your songs are updated quite frequently and your tables are quite large, then you might want to have a column in Genre with the count, and update the column using a trigger on the GenreSongs table.
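A minimal sketch of such triggers (MySQL syntax assumed, since the question doesn't name the DB; SongsCount is the new column on Genre):
CREATE TRIGGER genre_songs_ai AFTER INSERT ON GenreSongs
FOR EACH ROW
  UPDATE Genre SET SongsCount = SongsCount + 1 WHERE GenreID = NEW.GenreID;

CREATE TRIGGER genre_songs_ad AFTER DELETE ON GenreSongs
FOR EACH ROW
  UPDATE Genre SET SongsCount = SongsCount - 1 WHERE GenreID = OLD.GenreID;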
Alternatively, you could build an index on the GenreSongs table over GenreID. Then the following query:
select count(*)
from GenreSongs gs
where gs.GenreID = <whatever>
should run quite fast.
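For example (the index name is made up):
CREATE INDEX idx_genresongs_genreid ON GenreSongs (GenreID);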
If your songs are updated infrequently or in a batch (say nightly or weekly), then you can update the song count as part of the batch. Your query might look like:
update Genre
set SongsCount = cnt
from (select GenreID, count(*) as cnt
      from GenreSongs
      group by GenreID) gs
where Genre.GenreID = gs.GenreID
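If your database doesn't support UPDATE ... FROM, the correlated-subquery form the question asked about would be (a sketch; this form works in MySQL, among others):
UPDATE Genre
SET SongsCount = (SELECT COUNT(*)
                  FROM GenreSongs gs
                  WHERE gs.GenreID = Genre.GenreID);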
And yet another possibility is that you don't need to store the value at all. You can make it part of a view/query that does the calculation on the fly.
Relational databases are quite flexible, and there is often more than one way to do things. The right approach depends very much on what you are trying to accomplish.
Adding a column named SongsCount is just plain bad design (redundant data and update overhead). Instead use this query for single results:
SELECT ID, ..., (SELECT Count(*) FROM GenreSongs WHERE GenreID = X) AS SongsCount FROM Genre WHERE ID = X
And this for multiple results (much more efficient):
SELECT ID, ..., SongsCount
FROM (SELECT GenreID, Count(*) AS SongsCount
      FROM GenreSongs
      GROUP BY GenreID) AS sub
RIGHT JOIN Genre AS g ON sub.GenreID = g.ID
I'm taking a database course this semester, and we're learning SQL. I understand most simple queries, but I'm having some difficulty using the count aggregate function.
I'm supposed to relate an advertisement number to a property number to a branch number so that I can tally up the number of advertisements by branch number and compute their cost. I set up what I think are two appropriate new views, but I'm clueless as to what to write for the select statement. Am I approaching this the correct way? I have a feeling I'm overcomplicating this big time...
with ad_prop(ad_no, property_no, overseen_by) as
(select a.ad_no, a.property_no, p.overseen_by
from advertisement as a, property as p
where a.property_no = p.property_no)
with prop_branch(property_no, overseen_by, allocated_to) as
(select p.property_no, p.overseen_by, s.allocated_to
from property as p, staff as s
where p.overseen_by = s.staff_no)
select distinct pb.allocated_to as branch_no, count( ??? ) * 100 as ad_cost
from prop_branch as pb, ad_prop as ap
where ap.property_no = pb.property_no
group by branch_no;
Any insight would be greatly appreciated!
You could simplify it like this:
advertisement
- ad_no
- property_no
property
- property_no
- overseen_by
staff
- staff_no
- allocated_to
SELECT s.allocated_to AS branch, COUNT(*) as num_ads, COUNT(*)*100 as ad_cost
FROM advertisement AS a
INNER JOIN property AS p ON a.property_no = p.property_no
INNER JOIN staff AS s ON p.overseen_by = s.staff_no
GROUP BY s.allocated_to;
Update: changed above to match your schema needs
You can condense your WITH clauses into a single statement. Then, the piece I think you are missing is that columns referenced in the select list have to be aggregated if they aren't included in the GROUP BY clause. So you GROUP BY your distinct column, then apply your aggregation and math in your column definitions.
SELECT
s.allocated_to AS branch_no
,COUNT(a.ad_no) AS ad_count
,COUNT(a.ad_no) * 100 AS ad_cost
...
GROUP BY s.allocated_to
I can tell you that you are making it way too complicated. It should be a select statement with a couple of joins. You should re-read the chapter on joins or take a look at the following link:
http://www.sql-tutorial.net/SQL-JOIN.asp
A join allows you to "combine" the data from two tables based on a common key between the two tables (you can chain more tables together with more joins). Once you have this "joined" table, you can pretend that it is really one table (aliases are used to indicate where each column came from). You understand how aggregates work on a single table, right?
I'd prefer not to give you the answer so that you can actually learn :)