Implementing LIMIT against rows of single table while using LEFT JOIN - sql

Given a fairly stereotypical scenario with an item table referenced by an images table holding multiple images for the same item, I'm trying to figure out how to retrieve a specific number of items, while collecting all of the image rows.
The setup is trivial, and looks like:
CREATE TABLE items (
id INTEGER PRIMARY KEY,
...
...
)
CREATE TABLE images (
item_id INTEGER PRIMARY KEY,
url STRING,
...
...
)
LIMIT 20 clamps the maximum number of rows in the entire result set, and incidentally truncates mid-way through an item record.
I'm currently doing two queries, but besides being architecturally not at all ideal, it's practically proving quite awkward to coordinate (which makes a lot of sense since I'm definitely doing it wrong). I'm not finding any info on how to coordinate LIMIT with LEFT JOINs, so I thought I'd ask. Thanks!
NB. Similar questions to this do indeed exist, but are asking how to do things like retrieving eg the first (say) 5 images for each item. I'm looking to retrieve (say) 10 items and get all the images for each.

A query like:
SELECT * FROM items ORDER BY ? LIMIT 10
will return 10 rows (at most) from items.
You need to provide the column(s) for the ORDER BY clause if you have some specific sorting condition in mind.
If not then remove the ORDER BY clause, but in this case nothing guarantees the resultset that you will get.
So all you have to do is LEFT join the above query to images:
SELECT it.*, im.*
FROM (SELECT * FROM items ORDER BY ? LIMIT 10) it
LEFT JOIN images im
ON im.item_id = it.id

Related

SQL question: how do I find the count of IDs that are always mapped to a 'true' field in another table

I have a database that collects a list of document packages in one table and each individual page in another table
Each page has a PackageID connecting the two tables.
I'm trying to find the count of all packages where ALL pages connected to it have a boolean field (stored on the page table) of true. Even if 1/20 of the pages connected to the packageID is false, I don't want that packageID counted
Right now all I have is:
SELECT COUNT(DISTINCT pages.package_id)
FROM pages
WHERE boolean_field = true
But I'm not sure how to add that if one page w/ that package_id has the boolean_field != true than I don't want it counted. I also want to know the count of those packages that have any that are false.
I'm not sure if I need a subquery, if statement, having clause, or what.
Any direction even if it's what operators I should study on would be super helpful. Thanks :).
select count(*)
from
(
select package_id
from pages
group by package_id
having min(boolean_field) = 1
) tmp
Another way to express this is:
select count(*)
from packages p
where not exists (select 1
from pages pp
where pp.package_id = p.package_id and
not pp.boolean_field
);
The advantage of this approach is that it avoids aggregation, which can be a big win performance wise. It can also take advantage of an index on pages(package_id, boolean_field).

Difference in where clause and join

im very new to SQL and currently working with joins the first time in my life. What I am trying to figure out right now is trying to get the difference between to queries.
Query 1:
SELECT name
FROM actor
JOIN casting ON id = actorid
where (SELECT COUNT(ord) FROM casting join actor on actorid = actor.id AND ord=1) >= 30
GROUP BY name
Query 2:
SELECT name
FROM actor
JOIN casting ON id = actorid
AND (SELECT COUNT(ord) FROM casting WHERE actorid = actor.id AND ord=1)>=30)
GROUP BY name
So I would think that doing
FROM casting join actor on actorid = actor.id
in the subquery is the same as
FROM casting WHERE actorid = actor.id.
But apparently it is not. Could anyone help me out and explain why?
Edit: If anyone is wondering: The queries are based on question 13 from http://sqlzoo.net/wiki/More_JOIN_operations
Actually, the part that really looks like a "where" statement is only what's after the keyword ON. We sometimes fall on queries performing some data filtering directly at this stage, but its actual purpose is to specify the criteria used
A "join" is a very common operation that consists of associating the rows of two distinct tables according to a common criteria. For example, if you have, on one side, a table containing a client list in which each of them has a unique client number, and on a other side a order list table in which each order contains the client's number, then you may want to "resolve" the number of the latter table into its name, address, and so on.
Before SQL92 (26 years ago), the only way to achieve this was to write something like this :
SELECT client.name,
client.adress,
orders.product,
orders.totalprice
FROM client,orders
WHERE orders.clientNumber = client.clientNumber
AND orders.totalprice > 100.00
Selecting something from two (or more) tables induces a "cartesian product" which actually consists of associating every row from the first set which every row of the second one. This means that if your first table contains 3 rows and the second one 8 rows, the resulting set would be 24-row wide. And out of these, you use the WHERE clause to exclude basically everything and retain only rows in which the client number is the same on both side.
We understand that the size of the resulting set before filtering can grow exponentially if the contents of the different tables exceed a few rows (which is always the case) and it can get even worse if you imply more than two tables. Also, on the programmer's side, it rapidly becomes rather unreadable.
Therefore, if this is what you actually want to do, you now can explicitly tell the server about it, and specify the criteria at first, which will avoid unnecessary growing temporary subsets, while still letting you filter the results with WHERE if needed.
SELECT client.name,
client.adress,
orders.product,
orders.totalprice
FROM client
JOIN orders
ON orders.clientNumber = client.clientNumber
WHERE orders.totalprice > 100.00
It becomes critical when performing multiple JOIN in a single query, especially when performing both INNER and OUTER joins.
In the 2nd query your nested query takes the actor.id from its root query and only counts the results from that. In the 1st query your nested query counts results from all actors instead of only the specified one.

Paginated UNION ALL result - best performance

We have a situation where we need results from 4 different tables combined into one list and paginate it through OFFSET/FETCH.
What want to select records from tables a, b, c & d, order them by CreatedDatetime and then OFFSET X, FETCH Y. Tables are quite big (in terms of numbers of rows) and it sounds horrible to do just UNION ALL and then pagination because it would mean probably compiling whole list of records and then taking paginated part.
Problem is that none of the tables can be taken as reference to extract Start/End Datetime window because every collection might but also might not contain records from any of the table. For example, ending result might contain records from any combination of tables a; a/b; a/b/c; a/b/c/d; b; b/c;.... and we need fixed size number to be returned (paging size, for example, being 20).
Any ideas on how to most effectively approach this?
UPDATE
Based on question from #HABO
There are unfortunately no special clues like that about queries. We are showing user activities in the system. There are different kinds of it (tables we select over). Now, query pops up data for administrator who views the activities. How administrator will look at data may vary drastically: some users will have thousands of activities in last few hours and admin will want to see them all. In other cases, users will have 3 actions in a day and admin will see just first page of data.
PS. It's not a pure log tables as activities act as state machines over time, each having their states, which we also look for in these queries.
if you know the page size (eg 100) then you can simply write 4 Top 100 queries (order by Create Date) - Then do a Union ALL on the result.
That way even if all the first 100 records come from 1 table you are covered.
For Subsequent Paging queries - You'll need to record the last displayed row from each table and use this as your High-Water mark for the next fetch - (Select top 100 FROM TableA Where RowID > #HighWater)
Should be fairly efficient...
This is where a cache comes in useful. You can either cache the result of the query in your application layer and do the paging there if it is not too large, or cache the results of the query in a table (or temp table) if it is large.
There would be filters i suppose. From what you say, those may vary a lot. So at the worst scenario, all columns can be filters.
My suggestion is to use 5 views, one for each table and a final one union them. Just make sure all filter columns go up the physical tables as straightforward as possible.
Finally, select the master view and fetch but be careful of the order by clause. Make sure order by has unique data combination else you might have cases where a row change pages on a simple plain refresh. If there is user order by defined, force add some key columns at the end.
How to safely ensure order by to have distinct values for 100% safe fetch/offset:
At the 4 views create a new column with a simple constant number as value, e.g. 1, 2, 3, 4 AS [TableSource]
Make sure you select the PK of each table. If you don't have, you have to create one in the views, probably using ROW_NUMBER or NEWID, as [Pk] for example.
Finally, when selecting from the master view, you ORDER BY CreateDate, Pk, TableSource. This way you are 100% safe that within the same set of data any row will be placed exactly at the same position, resulting correct paging.
Example of safely isolating a page of 30 rows order by CreateDate:
SELECT * FROM (
SELECT src, id, ROW_NUMBER() OVER(ORDER BY dt DESC,src,id)rn FROM (
SELECT 1 src, id, dt FROM table1 /*WHERE x=y*/ UNION ALL
SELECT 2 src, id, dt FROM table2 /*WHERE x=y*/ UNION ALL
SELECT 3 src, id, dt FROM table3 /*WHERE x=y*/ UNION ALL
SELECT 4 src, id, dt FROM table4 /*WHERE x=y*/)alltables
)data WHERE data.rn BETWEEN 3001 AND 3030

SQL Count() - Selecting the same row more than once

I'm currently re-working the size filter i built earlier this year for our website. The size filter query works off the stock_id's found in the products query. On occassions, the same product can appear many times, due to it being a single listing as well as appearing inside a mix and match listing.
Because the stock_id remains the same however, whenever i run a query using stock_id In (#products.stock_id#), i only get the one row, when because lets say products.stock_id equals 1234,1234,1235. 1234 will only be shown once rather than twice, how do i get this record to be counted twice? I know it's not the most straightforward of questions but i don't have much time to dedicate to this post unfortunately.
EDIT 1 - Size Filter Query
<cfquery datasource="#application.datasource#" name="size_filter">
Select size_id,
size_description,
size_count
From sizes
Inner Join (
Select stock_sizeid,
Count(*) As size_count
From stock
Where stock_id In (#valueList(products.stock_id)#)
And stock_sizeid Not In ('33','218')
And stock_instock > 0
Group By stock_sizeid
) As stock
On size_id = stock_sizeid
Order By size_description Asc
</cfquery>
Cheat, and JOIN to your id list (instead of using an IN clause):
<cfquery datasource="#application.datasource#" name="size_filter">
Select size_id,
size_description,
size_count
From sizes
Inner Join (Select stock_sizeid,
Count(*) As size_count
From stock
Join (Select stock_id
From (VALUES(#valueList(products.stock_id)#)) a(stock_id)) As ids
On ids.stock_id = stock.stock_id
Where stock_sizeid Not In ('33','218')
And stock_instock > 0
Group By stock_sizeid) As stock
On size_id = stock_sizeid
Order By size_description Asc
</cfquery>
To make VALUES() give rows instead of columns, the correct form is VALUES(1), (2), ...(n). I'm assuming valueList(...) is some client-side procedure that gets run when the statement gets prepared; you'll need to provide an equivalent function to produce this output.
(Note, I've never used coldfusion, so I don't know if this actually works. This type of thing works in most RDBMSs, although some complain)
What this is doing is turning your list into a virtual table, which has all your entries in it, so the JOIN will be assessed multiple times, as normal.

SQL query to search an unique ID that can be in three different tables

I have three tables that control products, colors and sizes. Products can have or not colors and sizes. Colors can or not have sizes.
product color size
------- ------- -------
id id id
unique_id id_product (FK from product) id_product (FK from version)
stock unique_id id_version (FK from version)
title stock unique_id
stock
The unique_id column, that is present in all tables, is a serial type (autoincrement) and its counter is shared with the three tables, basically it works as a global unique ID between them.
It works fine, but i am trying to increase the query performance when i have to select some fields based in the unique_id.
As i don't know where is the unique_id that i am looking for, i am using UNION, like below:
select title, stock
from product
where unique_id = 10
UNION
select p.title, c.stock
from color c
join product p on c.id_product = p.id
where c.unique_id = 10
UNION
select p.title, s.stock
from size s
join product p on s.id_product = p.id
where s.unique_id = 10;
Is there a better way to do this? Thanks for any suggestion!
EDIT 1
Based on #ErwinBrandstetter and #ErikE answers i decided to use the below query. The main reasons is:
1) As unique_id has indexes in all tables, i will get a good performance
2) Using the unique_id i will find the product code, so i can get all columns i need using a another simple join
SELECT
p.title,
ps.stock
FROM (
select id as id_product, stock
from product
where unique_id = 10
UNION
select id_product, stock
from color
where unique_id = 10
UNION
select id_product, stock
from size
where unique_id = 10
) AS ps
JOIN product p ON ps.id_product = p.id;
PL/pgSQL function
To solve the problem at hand, a plpgsql function like the following should be faster:
CREATE OR REPLACE FUNCTION func(int)
RETURNS TABLE (title text, stock int) LANGUAGE plpgsql AS
$BODY$
BEGIN
RETURN QUERY
SELECT p.title, p.stock
FROM product p
WHERE p.unique_id = $1; -- Put the most likely table first.
IF NOT FOUND THEN
RETURN QUERY
SELECT p.title, c.stock
FROM color c
JOIN product p ON c.id_product = p.id
WHERE c.unique_id = $1;
END;
IF NOT FOUND THEN
RETURN QUERY
SELECT p.title, s.stock
FROM size s
JOIN product p ON s.id_product = p.id
WHERE s.unique_id = $1;
END IF;
END;
$BODY$;
Updated function with table-qualified column names to avoid naming conflicts with OUT parameters.
RETURNS TABLE requires PostgreSQL 8.4, RETURN QUERY requires version 8.2. You can substitute both for older versions.
It goes without saying that you need to index the columns unique_id of every involved table. id should be indexed automatically, being the primary key.
Redesign
Ideally, you can tell which table from the ID alone. You could keep using one common sequence, but add 100000000 for the first table, 200000000 for the second and 300000000 for the third - or whatever suits your needs. This way, the least significant part of the number is easily distinguishable.
A plain integer spans numbers from -2147483648 to +2147483647, move to bigint if that's not enough for you. I would stick to integer IDs, though, if possible. They are smaller and faster than bigint or text.
CTEs (experimental!)
If you cannot create a function for some reason, this pure SQL solution might do a similar trick:
WITH x(uid) AS (SELECT 10) -- provide unique_id here
, a AS (
SELECT title, stock
FROM x, product
WHERE unique_id = x.uid
)
, b AS (
SELECT p.title, c.stock
FROM x, color c
JOIN product p ON c.id_product = p.id
WHERE NOT EXISTS (SELECT 1 FROM a)
AND c.unique_id = x.uid
)
, c AS (
SELECT p.title, s.stock
FROM x, size s
JOIN product p ON s.id_product = p.id
WHERE NOT EXISTS (SELECT 1 FROM b)
AND s.unique_id = x.uid
)
SELECT * FROM a
UNION ALL
SELECT * FROM b
UNION ALL
SELECT * FROM c;
I am not sure whether it avoids additional scans like I hope. Would have to be tested. This query requires at least PostgreSQL 8.4.
Upgrade!
As I just learned, the OP runs on PostgreSQL 8.1.
Upgrading alone would speed up the operation a lot.
Query for PostgreSQL 8.1
As you are limited in your options, and a plpgsql function is not possible, this function should perform better than the one you have. Test with EXPLAIN ANALYZE - available in v8.1.
SELECT title, stock
FROM product
WHERE unique_id = 10
UNION ALL
SELECT p.title, ps.stock
FROM product p
JOIN (
SELECT id_product, stock
FROM color
WHERE unique_id = 10
UNION ALL
SELECT id_product, stock
FROM size
WHERE unique_id = 10
) ps ON ps.id_product = p.id;
I think it's time for a redesign.
You have things that you're using as bar codes for items that are basically all the same in one respect (they are SerialNumberItems), but have been split into multiple tables because they are different in other respects.
I have several ideas for you:
Change the Defaults
Just make each product required to have one color "no color" and one size "no size". Then you can query any table you want to find the info you need.
SuperType/SubType
Without too much modification you could use the supertype/subtype database design pattern.
In it, there is a parent table where all the distinct detail-level identifiers live, and the shared columns of the subtype tables go in the supertype table (the ways that all the items are the same). There is one subtype table for each different way that the items are distinct. If mutual exclusivity of the subtype is required (you can have a Color or a Size but not both), then the parent table is given a TypeID column and the subtype tables have an FK to both the ParentID and the TypeID. Looking at your design, in fact you would not use mutual exclusivity.
If you use the pattern of a supertype table, you do have the issue of having to insert in two parts, first to the supertype, then the subtype. Deleting also requires deleting in reverse order. But you get a great benefit of being able to get basic information such as Title and Stock out of the supertype table with a single query.
You could even create schema-bound views for each subtype, with instead-of triggers that convert inserts, updates, and deletes into operations on the base table + child table.
A Bigger Redesign
You could completely change how Colors and Sizes are related to products.
First, your patterns of "has-a" are these:
Product (has nothing)
Product->Color
Product->Size
Product->Color->Size
There is a problem here. Clearly Product is the main item that has other things (colors and sizes) but colors don't have sizes! That is an arbitrary assignment. You may as well have said that Sizes have Colors--it doesn't make a difference. This reveals that your table design may not be best, as you're trying to model orthogonal data in a parent-child type of relationship. Really, products have a ColorAndSize.
Furthermore, when a product comes in colors and sizes, what does the uniqueid in the Color table mean? Can such a product be ordered without a size, having only a color? This design is assigning a unique ID to something that (it seems to me) should never be allowed to be ordered--but you can't find this information out from the Color table, you have to compare the Color and Size tables first. It is a problem.
I would design this as: Table Product. Table Size listing all distinct sizes possible for any product ever. Table Color listing all distinct colors possible for any product ever. And table OrderableProduct that has columns ProductId, ColorID, SizeID, and UniqueID (your bar code value). Additionally, each product must have one color and one size or it doesn't exist.
Basically, Color and Size are like X and Y coordinates into a grid; you are filling in the boxes that are allowable combinations. Which one is the row and which the column is irrelevant. Certainly, one is not a child of the other.
If there are any reasonable rules, in general, about what colors or sizes can be applied to various sub-groups of products, there might be utility in a ProductType table and a ProductTypeOrderables table that, when creating a new product, could populate the OrderableProduct table with the standard set—it could still be customized but might be easier to modify than to create anew. Or, it could define the range of colors and sizes that are allowable. You might need separate ProductTypeAllowedColor and ProductTypeAllowedSize tables. For example, if you are selling T-shirts, you'd want to allow XXXS, XXS, XS, S, M, L, XL, XXL, XXXL, and XXXXL, even if most products never use all those sizes. But for soft drinks, the sizes might be 6-pack 8oz, 24-pack 8oz, 2 liter, and so on, even if each soft drink is not offered in that size (and soft drinks don't have colors).
In this new scheme, you only have one table to query to find the correct orderable product. With proper indexes, it should be blazing fast.
Your Question
You asked:
in PostgreSQL, so do you think if i use indexes on unique_id i will get a satisfactory performance?
Any column or set of columns that you use to repeatedly look up data must have an index! Any other pattern will result in a full table scan each time, which will be awful performance. I am sure that these indexes will make your queries lightning fast as it will take only one leaf-level read per table.
There's an easier way to generate unique IDs using three separate auto_increment columns. Just prepend a letter to the ID to uniquify it:
Colors:
C0000001
C0000002
C0000003
Sizes:
S0000001
S0000002
S0000003
...
Products:
P0000001
P0000002
P0000003
...
A few advantages:
You don't need to serialize creation of ids across tables to ensure uniqueness. This will give better performance.
You don't actually need to store the letter in the table. All IDs in the same table start with the same letter, so you only need to store the number. This means that you can use an ordinary auto_increment column to generate your IDs.
If you have an ID you only need to check the first character to see which table it can be found in. You don't even need to make a query to the database if you just want to know whether it's a product ID or a size ID.
A disadvantage:
It's no longer a number. But you can get around that by using 1,2,3 instead of C,S,P.
Your query will be pretty much efficient, as long as you have an index on unique_id, on every table and indices on the joining columns.
You could turn those UNION into UNION ALL but the won't be any differnce on performance, for this query.
This is a bit different. I don't understand the intended behaviour if stocks exists in more than one of the {product,color,zsize} tables. (UNION will remove duplicates, but for the row-as-a-whole, eg the {product_id,stock} tuples. That makes no sense to me. I just take the first. (Note the funky self-join!!)
SELECT p.title
, COALESCE (p2.stock, c.stock, s.stock) AS stock
FROM product p
LEFT JOIN product p2 on p2.id = p.id AND p2.unique_id = 10
LEFT JOIN color c on c.id_product = p.id AND c.unique_id = 10
LEFT JOIN zsize s on s.id_product = p.id AND s.unique_id = 10
WHERE COALESCE (p2.stock, c.stock, s.stock) IS NOT NULL
;