Aggregation with two Joins (MySQL) - sql

I have one table called gallery. For each row in gallery there are several rows in the table picture. One picture belongs to one gallery. Then there is the table vote. There each row is an upvote or a downvote for a certain gallery.
Here is the (simplified) structure:
gallery ( gallery_id )
picture ( picture_id, picture_gallery_ref )
vote ( vote_id, vote_value, vote_gallery_ref )
Now I want one query to give me the following information: all galleries with their own data fields, the number of pictures connected to each gallery, and the summed value of its votes.
Here is my query, but due to the multiple joining the aggregated values are not the right ones. (At least when there is more than one row of either pictures or votes.)
SELECT
*, SUM( vote_value ) as score, COUNT( picture_id ) AS pictures
FROM
gallery
LEFT JOIN
vote
ON gallery_id = vote_gallery_ref
LEFT JOIN
picture
ON gallery_id = picture_gallery_ref
GROUP BY gallery_id
Because I have noticed that COUNT( DISTINCT picture_id ) gives me the correct number of pictures I tried this:
( SUM( vote_value ) / GREATEST( COUNT( DISTINCT picture_id ), 1 ) ) AS score
It works in this example, but what if there were more joins in one query?
Just want to know whether there is a better or more 'elegant' way this problem can be solved. Also I'd like to know whether my solution is MySQL-specific or standard SQL?

This quote from William of Ockham applies here:
Entia non sunt multiplicanda praeter necessitatem
(Latin for "entities are not to be multiplied beyond necessity").
You should reconsider why you need this to be done in a single query. It's true that a single query has less overhead than multiple queries, but if that single query becomes too complex, both for you to develop and for the RDBMS to execute, then run separate queries.

Or just use subqueries...
I don't know if this is valid MySQL syntax, but you might be able to do something similar to:
SELECT
gallery.*, a.score, b.pictures
FROM gallery
LEFT JOIN
(
select vote_gallery_ref, sum(vote_value) as score
from vote
group by vote_gallery_ref
) a ON gallery_id = vote_gallery_ref
LEFT JOIN
(
select picture_gallery_ref, count(picture_id) as pictures
from picture
group by picture_gallery_ref
) b ON gallery_id = picture_gallery_ref
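The difference between joining both child tables at once and joining pre-aggregated subqueries can be checked on a tiny data set. The following is only a sketch: it uses Python's built-in sqlite3 module instead of MySQL, and the sample rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gallery (gallery_id INTEGER PRIMARY KEY);
    CREATE TABLE picture (picture_id INTEGER PRIMARY KEY,
                          picture_gallery_ref INTEGER);
    CREATE TABLE vote (vote_id INTEGER PRIMARY KEY, vote_value INTEGER,
                       vote_gallery_ref INTEGER);
    -- Invented sample data: gallery 1 has 3 pictures and 2 upvotes,
    -- gallery 2 has neither pictures nor votes.
    INSERT INTO gallery VALUES (1), (2);
    INSERT INTO picture (picture_gallery_ref) VALUES (1), (1), (1);
    INSERT INTO vote (vote_value, vote_gallery_ref) VALUES (1, 1), (1, 1);
""")

# Joining both child tables at once multiplies rows: 3 pictures x 2 votes = 6.
naive = conn.execute("""
    SELECT gallery_id, SUM(vote_value), COUNT(picture_id)
    FROM gallery
    LEFT JOIN vote ON gallery_id = vote_gallery_ref
    LEFT JOIN picture ON gallery_id = picture_gallery_ref
    GROUP BY gallery_id
    ORDER BY gallery_id
""").fetchall()

# Aggregating each child table first keeps exactly one row per gallery.
fixed = conn.execute("""
    SELECT gallery_id, a.score, b.pictures
    FROM gallery
    LEFT JOIN (SELECT vote_gallery_ref, SUM(vote_value) AS score
               FROM vote GROUP BY vote_gallery_ref) a
           ON gallery_id = a.vote_gallery_ref
    LEFT JOIN (SELECT picture_gallery_ref, COUNT(picture_id) AS pictures
               FROM picture GROUP BY picture_gallery_ref) b
           ON gallery_id = b.picture_gallery_ref
    ORDER BY gallery_id
""").fetchall()

print(naive)  # inflated score and picture count for gallery 1
print(fixed)  # score 2 and 3 pictures for gallery 1
```

Galleries without votes or pictures come back with NULL from the subqueries; wrap the columns in COALESCE(..., 0) if zeros are preferred.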

How often do you add/change vote records?
How often do you add/remove picture records?
How often do you run this query for these totals?
It might be better to create total fields on the gallery table (total_pictures, total_votes, total_vote_values).
When you add or remove a record on the picture table you also update the total on the gallery table. This could be done using triggers on the picture table to automatically update the gallery table. It could also be done using a transaction combining two SQL statements to update the picture table and the gallery table. When you add a record on the picture table increment the total_pictures field on the gallery table. When you delete a record on the picture table decrement the total_pictures field.
Similarly, when a vote record is added or removed or its vote_value changes, you update the total_votes and total_vote_values fields. Adding a record increments total_votes and adds vote_value to total_vote_values. Deleting a record decrements total_votes and subtracts vote_value from total_vote_values. Updating vote_value on a vote record should also update total_vote_values with the difference (subtract the old value, add the new value).
Your query now becomes trivial - it's just a straightforward query from the gallery table. But this is at the expense of more complex updates to the picture and vote tables.
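As a rough sketch of the trigger approach, the following keeps a total_pictures column up to date. It uses Python's sqlite3 module with SQLite trigger syntax, which differs in detail from MySQL's; the trigger names picture_added and picture_removed are made up for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE gallery (gallery_id INTEGER PRIMARY KEY,
                          total_pictures INTEGER NOT NULL DEFAULT 0);
    CREATE TABLE picture (picture_id INTEGER PRIMARY KEY,
                          picture_gallery_ref INTEGER);

    -- Increment the stored total whenever a picture is added.
    CREATE TRIGGER picture_added AFTER INSERT ON picture
    BEGIN
        UPDATE gallery SET total_pictures = total_pictures + 1
        WHERE gallery_id = NEW.picture_gallery_ref;
    END;

    -- Decrement it whenever a picture is removed.
    CREATE TRIGGER picture_removed AFTER DELETE ON picture
    BEGIN
        UPDATE gallery SET total_pictures = total_pictures - 1
        WHERE gallery_id = OLD.picture_gallery_ref;
    END;

    -- Invented sample activity: add three pictures, then delete one.
    INSERT INTO gallery (gallery_id) VALUES (1);
    INSERT INTO picture (picture_gallery_ref) VALUES (1), (1), (1);
    DELETE FROM picture WHERE picture_id = 1;
""")

# Reading the total is now a plain lookup on the gallery table.
total = conn.execute(
    "SELECT total_pictures FROM gallery WHERE gallery_id = 1").fetchone()[0]
print(total)
```

The same pattern would apply to total_votes and total_vote_values, with an additional AFTER UPDATE trigger to handle a changed vote_value.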

As Bill Karwin said, doing this all within one query is pretty ugly.
But, if you have to do it, joining and selecting non-aggregate data with aggregate data requires joining against subqueries (I haven't used SQL that much in the past few years so I actually forgot the proper term for this).
Let's assume your gallery table has additional fields name and state:
select g.gallery_id, g.name, g.state, i.num_pictures, j.sum_vote_values
from gallery g
inner join (
select g.gallery_id, count(p.picture_id) as 'num_pictures'
from gallery g
left join picture p on g.gallery_id = p.picture_gallery_ref
group by g.gallery_id) as i on g.gallery_id = i.gallery_id
left join (
select g.gallery_id, sum(v.vote_value) as 'sum_vote_values'
from gallery g
left join vote v on g.gallery_id = v.vote_gallery_ref
group by g.gallery_id
) as j on g.gallery_id = j.gallery_id
This will yield a result set that looks like:
gallery_id, name, state, num_pictures, sum_vote_values
1, 'Gallery A', 'NJ', 4, 19
2, 'Gallery B', 'NY', 3, 32
3, 'Empty gallery', 'CT', 0,

Related

Add / update column from a query SELECT? SQL

I'm quite a novice at this and I don't know if I will explain myself well. I am trying to do an SQL exercise that asks me to update the data in an "X" table from other data in a "Y" table. The problem is that it is not about making table X exactly match the data in table Y. Here are the exercise statement and my tables:
Update the "numJocs" field (number of games) for all platforms, depending on the number of games each of the platforms in the GAMES table has.
PLATFORM table:
where: "nom" is name.
GAMES table:
where: "nom" is name, "preu" is price, "idPlataforma" is idPlatform and "codiTenda" is storeCode, but only idPlataforma matters for this exercise.
If I do:
SELECT COUNT(games.idPlataforma)
FROM games
GROUP BY (games.idPlataforma)
I can see how many games there are for each platform. The result would be:
count(games.idPlataforma)
__________________________
2
1
2
2
I would like to be able to put this result in the PLATFORM table, column "numJocs". But I don't know how to do it... I also don't want to enter it manually, that is, a "2" in row "1", etc. Instead, I would like to write a query and use its result to fill in the column. I have tried a thousand things, but nothing worked... Any help?
Thanks!!
For a one-time update you can use the query below:
update platform P
INNER JOIN (
SELECT games.idPlataforma, COUNT(games.idPlataforma) as cnt
FROM games
GROUP BY games.idPlataforma
) x ON P.id= x.idPlataforma
SET P.numJocs= x.cnt
From then on, you have to update numJocs every time a new game is added.
Suppose you have 2 tables, table 1 and table 2:
Table 1:
Table 2:
You could insert new values into table 1, based on table 2 by doing the following:
insert into Table1(number,CFG) select ITEM,results from Table2
Which has the following result in table 1:
Any database should support the syntax using a correlated subquery:
update platforms
set numjocs = (select count(*)
from games g
where g.idPlatforms = platforms.id
);
I would caution you though about storing this value in the table. It will be out of date as soon as the games table changes. If you want to keep it in sync, then you need to create triggers, and this is all rather complicated.
Rather, calculate the data on the fly:
select p.*,
(select count(*)
from games g
where g.idPlatforms = p.id
) as numjocs
from platforms p ;
You can put this in a view if you like. Many databases support materialized views where the results of the view are stored in a "table" and the table is kept up-to-date with the underlying data.
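The correlated-subquery update can be tried out as follows. This is a minimal sketch using Python's sqlite3 module; the table layout (platform.id, platform.numJocs, games.idPlataforma) and the sample rows are assumptions based on the question, since the original table images are not available.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Assumed layout; the real tables were only shown as images.
    CREATE TABLE platform (id INTEGER PRIMARY KEY, nom TEXT, numJocs INTEGER);
    CREATE TABLE games (id INTEGER PRIMARY KEY, nom TEXT,
                        idPlataforma INTEGER);
    INSERT INTO platform (id, nom) VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO games (nom, idPlataforma)
    VALUES ('g1', 1), ('g2', 1), ('g3', 2);
""")

# For every platform row, count the games that reference it.
conn.execute("""
    UPDATE platform
    SET numJocs = (SELECT COUNT(*)
                   FROM games g
                   WHERE g.idPlataforma = platform.id)
""")

rows = conn.execute("SELECT id, numJocs FROM platform ORDER BY id").fetchall()
print(rows)
```

Because COUNT(*) over an empty set is 0, a platform without games ends up with 0 rather than NULL.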

How can I increase speed of SQL query?

As shown in the image I sent, can I add a field to the Module table to hold the comment count, or not? I mean, for a large data set such as 100 million comments or a big project, which is better/faster: adding a field to Module and updating it after each comment is inserted, or relying on the relationship?
To get the comment count, which one should I choose:
select Module.Id,
(SELECT COUNT(*) AS Expr1
FROM dbo.CommentTable
WHERE (CommentTable.ModuleId = Module.Id)) AS commentCount
from Module
or
select Module.Id, Module.CountComment from Module
I suggest you compute it on the fly instead of saving the count in the table itself. To get the count of comments of each Module:
SELECT
m.id,
CommentCount = COUNT(c.ModuleId)
FROM Module m
LEFT JOIN CommentTable c
ON c.ModuleId = m.Id
GROUP BY m.id
This will be faster if you have an index on CommentTable(ModuleId):
CREATE NONCLUSTERED INDEX NCI_CommentTable_ModuleId ON CommentTable(ModuleId)

Select average rating from another datatable

I have 3 data tables.
In the entries data table I have entries with ID (entryId as primary key).
I have another table called EntryUsersRatings in there are multiple entries that have entryId field and a rating value (from 1 to 5).
(ratings are stored multiple times for one entryId).
Columns: ratingId (primary key), entryId, rating (integer value).
In the third data table I have translations of entries in the first table (with entryId, languageId and title - translation).
What I would like to do is select all entries from first data table with their titles (by language ID).
On a top of that I want average rating of each entry (which can be stored multiple times) that is stored in EntryUsersRatings.
I have tried this:
SELECT entries.entryId, EntryTranslations.title, AVG(EntryUsersRatings.rating) AS AverageRating
FROM entries
LEFT OUTER JOIN
EntryTranslations ON entries.entryId = EntryTranslations.entryId AND EntryTranslations.languageId = 1
LEFT OUTER JOIN
EntryUsersRatings ON entries.entryId = EntryUsersRatings.entryId
WHERE entries.isDraft=0
GROUP BY title, entries.entryId
isDraft is just something that means that entries are not stored with all information needed (just incomplete data - irrelevant for our case here).
Any help would be greatly appreciated.
EDIT: my solution gives me null values for rating.
Edit1: this query is working perfectly OK, I was looking into wrong database.
We also came to another solution, which gives us the same result (I hope someone will find this useful):
SELECT entries.entryId, COALESCE(x.entryRating, 0) as averageRating
FROM entries
LEFT JOIN(
SELECT rr.entryId, AVG(rating) AS entryRating
FROM EntryUsersRatings rr
GROUP BY rr.entryId) x ON x.entryId = entries.entryId
@CyberHawk: as you are using a left outer join with entries, your result will contain all records from the left table plus the records from the right table that match your join condition. For unmatched records it will give you a null value.
Check out the following link for the details:
http://msdn.microsoft.com/en-us/library/ms187518(v=sql.105).aspx
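The NULL behaviour of the left join, and the effect of COALESCE on it, can be seen in a small experiment. This is a sketch using Python's sqlite3 module rather than SQL Server; the two sample entries and their ratings are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE entries (entryId INTEGER PRIMARY KEY, isDraft INTEGER);
    CREATE TABLE EntryUsersRatings (ratingId INTEGER PRIMARY KEY,
                                    entryId INTEGER, rating INTEGER);
    -- Invented data: entry 1 has two ratings, entry 2 has none.
    INSERT INTO entries VALUES (1, 0), (2, 0);
    INSERT INTO EntryUsersRatings (entryId, rating) VALUES (1, 4), (1, 5);
""")

# Second column: raw average (NULL when unrated).
# Third column: the same value with NULL replaced by 0.
rows = conn.execute("""
    SELECT e.entryId, x.entryRating, COALESCE(x.entryRating, 0)
    FROM entries e
    LEFT JOIN (SELECT entryId, AVG(rating) AS entryRating
               FROM EntryUsersRatings
               GROUP BY entryId) x ON x.entryId = e.entryId
    WHERE e.isDraft = 0
    ORDER BY e.entryId
""").fetchall()
print(rows)
```

Whether an unrated entry should show 0 or NULL is a design choice: 0 can be mistaken for a real (very bad) average, while NULL clearly says "no ratings yet".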

SQL query to search an unique ID that can be in three different tables

I have three tables that control products, colors and sizes. Products can have or not colors and sizes. Colors can or not have sizes.
product        color                           size
-------        -------                         -------
id             id                              id
unique_id      id_product (FK from product)    id_product (FK from version)
stock          unique_id                       id_version (FK from version)
title          stock                           unique_id
                                               stock
The unique_id column, that is present in all tables, is a serial type (autoincrement) and its counter is shared with the three tables, basically it works as a global unique ID between them.
It works fine, but I am trying to improve query performance when I have to select some fields based on the unique_id.
As I don't know which table holds the unique_id that I am looking for, I am using UNION, like below:
select title, stock
from product
where unique_id = 10
UNION
select p.title, c.stock
from color c
join product p on c.id_product = p.id
where c.unique_id = 10
UNION
select p.title, s.stock
from size s
join product p on s.id_product = p.id
where s.unique_id = 10;
Is there a better way to do this? Thanks for any suggestion!
EDIT 1
Based on the answers from @ErwinBrandstetter and @ErikE, I decided to use the query below. The main reasons are:
1) As unique_id has indexes in all tables, i will get a good performance
2) Using the unique_id i will find the product code, so i can get all columns i need using a another simple join
SELECT
p.title,
ps.stock
FROM (
select id as id_product, stock
from product
where unique_id = 10
UNION
select id_product, stock
from color
where unique_id = 10
UNION
select id_product, stock
from size
where unique_id = 10
) AS ps
JOIN product p ON ps.id_product = p.id;
PL/pgSQL function
To solve the problem at hand, a plpgsql function like the following should be faster:
CREATE OR REPLACE FUNCTION func(int)
RETURNS TABLE (title text, stock int) LANGUAGE plpgsql AS
$BODY$
BEGIN
RETURN QUERY
SELECT p.title, p.stock
FROM product p
WHERE p.unique_id = $1; -- Put the most likely table first.
IF NOT FOUND THEN
RETURN QUERY
SELECT p.title, c.stock
FROM color c
JOIN product p ON c.id_product = p.id
WHERE c.unique_id = $1;
END IF;
IF NOT FOUND THEN
RETURN QUERY
SELECT p.title, s.stock
FROM size s
JOIN product p ON s.id_product = p.id
WHERE s.unique_id = $1;
END IF;
END;
$BODY$;
Updated function with table-qualified column names to avoid naming conflicts with OUT parameters.
RETURNS TABLE requires PostgreSQL 8.4, RETURN QUERY requires version 8.3. You can substitute both for older versions.
It goes without saying that you need to index the columns unique_id of every involved table. id should be indexed automatically, being the primary key.
Redesign
Ideally, you can tell which table from the ID alone. You could keep using one common sequence, but add 100000000 for the first table, 200000000 for the second and 300000000 for the third - or whatever suits your needs. This way, the most significant part of the number identifies the table and is easily distinguishable.
A plain integer spans numbers from -2147483648 to +2147483647, move to bigint if that's not enough for you. I would stick to integer IDs, though, if possible. They are smaller and faster than bigint or text.
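With such offsets, application code can work out the table without touching the database. The following is a hypothetical sketch in Python: the mapping of offsets to tables (1 = product, 2 = color, 3 = size) and the helper name table_for are assumptions made for illustration, and it only works while the shared sequence stays below 100000000.

```python
# Assumed assignment of offsets to tables; adjust to the real scheme.
OFFSET = 100_000_000
TABLE_BY_LEADING_PART = {1: "product", 2: "color", 3: "size"}


def table_for(unique_id: int) -> str:
    """Return the table name encoded in the leading part of unique_id."""
    return TABLE_BY_LEADING_PART[unique_id // OFFSET]


print(table_for(100_000_010))  # product
print(table_for(300_000_042))  # size
```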
CTEs (experimental!)
If you cannot create a function for some reason, this pure SQL solution might do a similar trick:
WITH x(uid) AS (SELECT 10) -- provide unique_id here
, a AS (
SELECT title, stock
FROM x, product
WHERE unique_id = x.uid
)
, b AS (
SELECT p.title, c.stock
FROM x, color c
JOIN product p ON c.id_product = p.id
WHERE NOT EXISTS (SELECT 1 FROM a)
AND c.unique_id = x.uid
)
, c AS (
SELECT p.title, s.stock
FROM x, size s
JOIN product p ON s.id_product = p.id
WHERE NOT EXISTS (SELECT 1 FROM b)
AND s.unique_id = x.uid
)
SELECT * FROM a
UNION ALL
SELECT * FROM b
UNION ALL
SELECT * FROM c;
I am not sure whether it avoids additional scans like I hope. Would have to be tested. This query requires at least PostgreSQL 8.4.
Upgrade!
As I just learned, the OP runs on PostgreSQL 8.1.
Upgrading alone would speed up the operation a lot.
Query for PostgreSQL 8.1
As you are limited in your options, and a plpgsql function is not possible, this query should perform better than the one you have. Test with EXPLAIN ANALYZE - available in v8.1.
SELECT title, stock
FROM product
WHERE unique_id = 10
UNION ALL
SELECT p.title, ps.stock
FROM product p
JOIN (
SELECT id_product, stock
FROM color
WHERE unique_id = 10
UNION ALL
SELECT id_product, stock
FROM size
WHERE unique_id = 10
) ps ON ps.id_product = p.id;
I think it's time for a redesign.
You have things that you're using as bar codes for items that are basically all the same in one respect (they are SerialNumberItems), but have been split into multiple tables because they are different in other respects.
I have several ideas for you:
Change the Defaults
Just make each product required to have one color "no color" and one size "no size". Then you can query any table you want to find the info you need.
SuperType/SubType
Without too much modification you could use the supertype/subtype database design pattern.
In it, there is a parent table where all the distinct detail-level identifiers live, and the shared columns of the subtype tables go in the supertype table (the ways that all the items are the same). There is one subtype table for each different way that the items are distinct. If mutual exclusivity of the subtype is required (you can have a Color or a Size but not both), then the parent table is given a TypeID column and the subtype tables have an FK to both the ParentID and the TypeID. Looking at your design, in fact you would not use mutual exclusivity.
If you use the pattern of a supertype table, you do have the issue of having to insert in two parts, first to the supertype, then the subtype. Deleting also requires deleting in reverse order. But you get a great benefit of being able to get basic information such as Title and Stock out of the supertype table with a single query.
You could even create schema-bound views for each subtype, with instead-of triggers that convert inserts, updates, and deletes into operations on the base table + child table.
A Bigger Redesign
You could completely change how Colors and Sizes are related to products.
First, your patterns of "has-a" are these:
Product (has nothing)
Product->Color
Product->Size
Product->Color->Size
There is a problem here. Clearly Product is the main item that has other things (colors and sizes) but colors don't have sizes! That is an arbitrary assignment. You may as well have said that Sizes have Colors--it doesn't make a difference. This reveals that your table design may not be best, as you're trying to model orthogonal data in a parent-child type of relationship. Really, products have a ColorAndSize.
Furthermore, when a product comes in colors and sizes, what does the uniqueid in the Color table mean? Can such a product be ordered without a size, having only a color? This design is assigning a unique ID to something that (it seems to me) should never be allowed to be ordered--but you can't find this information out from the Color table, you have to compare the Color and Size tables first. It is a problem.
I would design this as: Table Product. Table Size listing all distinct sizes possible for any product ever. Table Color listing all distinct colors possible for any product ever. And table OrderableProduct that has columns ProductId, ColorID, SizeID, and UniqueID (your bar code value). Additionally, each product must have one color and one size or it doesn't exist.
Basically, Color and Size are like X and Y coordinates into a grid; you are filling in the boxes that are allowable combinations. Which one is the row and which the column is irrelevant. Certainly, one is not a child of the other.
If there are any reasonable rules, in general, about what colors or sizes can be applied to various sub-groups of products, there might be utility in a ProductType table and a ProductTypeOrderables table that, when creating a new product, could populate the OrderableProduct table with the standard set—it could still be customized but might be easier to modify than to create anew. Or, it could define the range of colors and sizes that are allowable. You might need separate ProductTypeAllowedColor and ProductTypeAllowedSize tables. For example, if you are selling T-shirts, you'd want to allow XXXS, XXS, XS, S, M, L, XL, XXL, XXXL, and XXXXL, even if most products never use all those sizes. But for soft drinks, the sizes might be 6-pack 8oz, 24-pack 8oz, 2 liter, and so on, even if each soft drink is not offered in that size (and soft drinks don't have colors).
In this new scheme, you only have one table to query to find the correct orderable product. With proper indexes, it should be blazing fast.
Your Question
You asked:
in PostgreSQL, so do you think if i use indexes on unique_id i will get a satisfactory performance?
Any column or set of columns that you use to repeatedly look up data must have an index! Any other pattern will result in a full table scan each time, which will be awful performance. I am sure that these indexes will make your queries lightning fast as it will take only one leaf-level read per table.
There's an easier way to generate unique IDs using three separate auto_increment columns. Just prepend a letter to the ID to uniquify it:
Colors:
C0000001
C0000002
C0000003
Sizes:
S0000001
S0000002
S0000003
...
Products:
P0000001
P0000002
P0000003
...
A few advantages:
You don't need to serialize creation of ids across tables to ensure uniqueness. This will give better performance.
You don't actually need to store the letter in the table. All IDs in the same table start with the same letter, so you only need to store the number. This means that you can use an ordinary auto_increment column to generate your IDs.
If you have an ID you only need to check the first character to see which table it can be found in. You don't even need to make a query to the database if you just want to know whether it's a product ID or a size ID.
A disadvantage:
It's no longer a number. But you can get around that by using 1,2,3 instead of C,S,P.
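The prefix scheme can live entirely in application code. Here is a small sketch in Python; the helper names format_id and parse_id are hypothetical, and the seven-digit zero padding is simply read off the sample IDs above.

```python
from typing import Tuple

# One letter per table, as in the examples above.
PREFIXES = {"P": "product", "C": "color", "S": "size"}


def format_id(prefix: str, number: int) -> str:
    """Build the public ID from the table prefix and the stored number."""
    return f"{prefix}{number:07d}"


def parse_id(public_id: str) -> Tuple[str, int]:
    """Split a public ID into (table name, stored auto_increment number)."""
    return PREFIXES[public_id[0]], int(public_id[1:])


print(format_id("C", 1))     # C0000001
print(parse_id("S0000003"))  # ('size', 3)
```

Only the number is stored in each table; the letter is added when the ID leaves the application and stripped again when it comes back.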
Your query will be pretty efficient, as long as you have an index on unique_id in every table and indices on the joining columns.
You could turn those UNION into UNION ALL, but there won't be any difference in performance for this query.
This is a bit different. I don't understand the intended behaviour if stock exists in more than one of the {product, color, zsize} tables. (UNION will remove duplicates, but only for the row as a whole, e.g. the {product_id, stock} tuples.) That makes no sense to me, so I just take the first. (Note the funky self-join!)
SELECT p.title
, COALESCE (p2.stock, c.stock, s.stock) AS stock
FROM product p
LEFT JOIN product p2 on p2.id = p.id AND p2.unique_id = 10
LEFT JOIN color c on c.id_product = p.id AND c.unique_id = 10
LEFT JOIN zsize s on s.id_product = p.id AND s.unique_id = 10
WHERE COALESCE (p2.stock, c.stock, s.stock) IS NOT NULL
;

Some SQL Questions

I have been using SQL for years, but have mostly been using the query designer within SQL Studio (etc.) to put together my queries. I've recently found some time to actually "learn" what everything is doing and have set myself the following fairly simple tasks. Before I begin, I'd like to ask the SOF community their thoughts on the questions, possible answers and any tips they may have.
The questions are;
Find all records w/ a duplicate in a particular column (e.g. a linking id is in more than 1 record throughout table)
SUM price from a linked table within the same query (select within a select?)
Explain the difference between the 4 joins; LEFT, RIGHT, OUTER, INNER
Copy data from one table to another based on SELECT and WHERE criteria
Input welcomed & appreciated.
Chris
I recommend that you start by following some tutorials on this topic. Your questions are not uncommon questions for someone moving from a beginner to intermediate level in SQL. SQLZoo is an excellent resource for learning SQL so consider following that.
In response to your questions:
1) Find all records with a duplicate in a particular column
There are two steps here: find duplicate records and select those records. To find the duplicate records you should be doing something along the lines of:
select possible_duplicate_field, count(*)
from table
group by possible_duplicate_field
having count(*) > 1
What we're doing here is selecting everything from a table, then grouping it by the field we want to check for duplicates. The count function then gives me a count of the number of items within that group. The HAVING clause indicates that we want to filter AFTER the grouping to only show the groups which have more than one entry.
This is all fine in itself but it doesn't give you the actual records that have those values on them. If you knew the duplicate values then you'd write this:
select * from table where possible_duplicate_field = 'known_duplicate_value'
We can use the SELECT within a select to get a list of the matches:
select *
from table
where possible_duplicate_field in (
select possible_duplicate_field
from table
group by possible_duplicate_field
having count(*) > 1
)
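Both steps can be tried on a throwaway table. This sketch uses Python's sqlite3 module; the table t, its link_id column and the sample values are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE t (id INTEGER PRIMARY KEY, link_id INTEGER);
    -- Invented data: link_id 10 appears twice, the others once.
    INSERT INTO t (link_id) VALUES (10), (20), (10), (30);
""")

# Step 1: which values occur more than once?
dupes = conn.execute("""
    SELECT link_id, COUNT(*)
    FROM t
    GROUP BY link_id
    HAVING COUNT(*) > 1
""").fetchall()

# Step 2: fetch the full records carrying those values.
records = conn.execute("""
    SELECT id, link_id
    FROM t
    WHERE link_id IN (SELECT link_id
                      FROM t
                      GROUP BY link_id
                      HAVING COUNT(*) > 1)
    ORDER BY id
""").fetchall()

print(dupes)
print(records)
```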
2) SUM price from a linked table within the same query
This is a simple JOIN between two tables with a SUM of the two:
select sum(tableA.X + tableB.Y)
from tableA
join tableB on tableA.keyA = tableB.keyB
What you're doing here is joining two tables together where those two tables are linked by a key field. In this case, this is an inner join, which operates as you would expect (i.e. get me every row from the left table that has a matching record in the right table).
3) Explain the difference between the 4 joins; LEFT, RIGHT, OUTER, INNER
Consider two tables A and B. The concept of "LEFT" and "RIGHT" in this case are slightly clearer if you read your SQL from left to right. So, when I say:
select x from A join B ...
The left table is "A" and the right table is "B". When you explicitly say LEFT or RIGHT, you are declaring which of the two tables is the primary table, that is, the table whose rows are all kept even when nothing matches in the other one. LEFT JOIN and RIGHT JOIN are therefore outer joins (shorthand for LEFT OUTER JOIN and RIGHT OUTER JOIN). Incidentally, if you omit LEFT or RIGHT and write a plain JOIN, SQL implicitly uses INNER.
INNER and OUTER declare what to do when matches don't exist in one of the tables. INNER declares that you want only the rows that have a matching record in both tables. Hence, if table A contains keys "X", "Y" and "Z", and table B contains keys "X" and "Z", then an INNER join will only return "X" and "Z" records from the two tables.
When OUTER is used, we're saying: give me everything from the primary table (as declared using LEFT or RIGHT) and anything that matches from the secondary table. Hence, in the previous example, A LEFT OUTER JOIN B returns "X", "Y" and "Z" records. However, there would be NULLs in the fields which should have come from the secondary table for key value "Y", as it doesn't exist in the secondary table.
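The X/Y/Z example can be reproduced directly. The following sketch uses Python's sqlite3 module; tables a and b and their single column k are made up to mirror the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (k TEXT);
    CREATE TABLE b (k TEXT);
    INSERT INTO a VALUES ('X'), ('Y'), ('Z');
    INSERT INTO b VALUES ('X'), ('Z');
""")

# INNER JOIN: only keys present in both tables survive.
inner_rows = conn.execute("""
    SELECT a.k, b.k FROM a INNER JOIN b ON a.k = b.k ORDER BY a.k
""").fetchall()

# LEFT OUTER JOIN: every row of a is kept; b's side is NULL for 'Y'.
left_rows = conn.execute("""
    SELECT a.k, b.k FROM a LEFT OUTER JOIN b ON a.k = b.k ORDER BY a.k
""").fetchall()

print(inner_rows)
print(left_rows)
```

In Python the SQL NULL shows up as None in the second column of the 'Y' row.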
4) Copy data from one table to another based on SELECT and WHERE criteria
This is pretty trivial and I'm surprised you've never encountered it. It's a simple nested SELECT in an INSERT statement (this may not be supported by your database - if not, try the next option):
insert into new_table select * from old_table where x = y
This assumes the tables have the same structure. If you have different structures then you'll need to specify the columns:
insert into new_table (list, of, fields)
select list, of, fields from old_table where x = y
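A quick way to try the column-list form is shown below. This is a sketch using Python's sqlite3 module; the columns id and status and the sample rows are invented stand-ins for "list, of, fields".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE old_table (id INTEGER, status TEXT);
    CREATE TABLE new_table (id INTEGER, status TEXT);
    INSERT INTO old_table VALUES (1, 'Closed'), (2, 'Open'), (3, 'Closed');

    -- Copy only the rows that satisfy the WHERE criteria.
    INSERT INTO new_table (id, status)
    SELECT id, status FROM old_table WHERE status = 'Closed';
""")

copied = conn.execute("SELECT id, status FROM new_table ORDER BY id").fetchall()
print(copied)
```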
Let's say you have 2 tables named :
[OrderLine] with the columns [Id, OrderId, ProductId, Qty, Status]
[Product] with [Id, Name, Price]
1) all orders having more than one order line (it's technically the same as looking for duplicates on OrderId :) :
select OrderId, count(*)
from OrderLine
group by OrderId
having count(*) > 1
2) total price for all order lines of order 1000
select sum(p.Price * ol.Qty) as Price
from OrderLine ol
inner join Product p on ol.ProductId = p.Id
where ol.OrderId = 1000
3) difference between joins:
a inner join b => take all a that have a match in b. If b is not found, a will not be returned
a left join b => take all a, match them with b, include a even if b is not found
a right join b => b left join a
a full outer join b => (a left join b) union (a right join b)
4) copy order lines to a history table :
insert into OrderLinesHistory
(CopiedOn, OrderLineId, OrderId, ProductId, Qty)
select
getDate(), Id, OrderId, ProductId, Qty
from
OrderLine
where
status = 'Closed'
To answer #4 and to perhaps show at least some understanding of SQL and the fact this isn't HW, just me trying to learn best practise;
SET NOCOUNT ON;
DECLARE @rc int
if @what = 1
BEGIN
select id from color_mapper where product = @productid and color = @colorid;
select @rc = @@rowcount
if @rc = 0
BEGIN
exec doSavingSPROC @colorid, @productid;
END
END
END