Faceted search count in SQL

I'm trying to implement faceted search counts in SQL. For simplicity, I'll use the data that already exists at https://www.w3schools.com/sql/trysql.asp?filename=trysql_select_all. A product has one category and a category has many products, so it's a one-to-many relationship. I'm interested in filtering products by category: if multiple categories are selected, the query returns products whose category ID is in the list of IDs that the user filtered by (so it's an OR operation between categories). But this is not the challenge that I'm currently facing.
The query below tries to answer the question: For every category that exists, how many products would I get if that category was among the selected categories?
SELECT
cat.CategoryId,
p.Count
FROM Categories AS cat
LEFT JOIN (SELECT
COUNT(DISTINCT ProductId) AS Count
FROM Products AS p
WHERE p.CategoryId IN #CategoryIds
OR p.CategoryId = cat.CategoryId) AS p
The #CategoryIds is a parameter that is going to be handled by an ORM. For a more concrete scenario, you can just replace it with the list (1, 2) (so you can consider the case in which the user wants to filter all products that have the category 1 or 2).
The issue is that the word "cat" (on the last line) is not recognised so the query just throws an error.
Is there a way to make the second table recognise the first table's alias "cat" that I want to LEFT JOIN with? Or is there a better solution to this problem that I didn't take into consideration?

LEFT JOIN requires a predicate (an ON clause). Some DBMSs, such as MS SQL Server, support CROSS APPLY. Your query should be equivalent to the following one, which should run on every SQL database known to me:
SELECT
cat.CategoryId,
COUNT(ProductId)
FROM Categories AS cat
LEFT JOIN Products P ON p.CategoryId=cat.CategoryId OR p.CategoryId IN [list]
GROUP BY cat.CategoryId
Or, if you are using SQL Server:
SELECT
cat.CategoryId,
p.Count
FROM Categories AS cat
CROSS APPLY (SELECT COUNT(DISTINCT ProductId) AS Count
FROM Products AS p
WHERE p.CategoryId IN #CategoryIds
OR p.CategoryId = cat.CategoryId) AS p
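To see what the portable LEFT JOIN version returns, here is a minimal sketch that runs it against an in-memory SQLite database from Python. The four categories, four products and the selected list (1, 2) are made-up sample data, not the w3schools data set, and the `ProductCount` alias is only illustrative.

```python
import sqlite3

# Hypothetical miniature of the Categories/Products tables from the question.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Categories (CategoryId INTEGER PRIMARY KEY);
CREATE TABLE Products (ProductId INTEGER PRIMARY KEY, CategoryId INTEGER);
INSERT INTO Categories VALUES (1), (2), (3), (4);
INSERT INTO Products VALUES (10, 1), (11, 1), (12, 2), (13, 3);
""")

selected = [1, 2]  # categories the user has already ticked
placeholders = ", ".join("?" for _ in selected)

# For every category: products in that category OR in an already selected one.
facet_sql = f"""
SELECT cat.CategoryId, COUNT(p.ProductId) AS ProductCount
FROM Categories AS cat
LEFT JOIN Products AS p
  ON p.CategoryId = cat.CategoryId
  OR p.CategoryId IN ({placeholders})
GROUP BY cat.CategoryId
ORDER BY cat.CategoryId
"""
facet_counts = conn.execute(facet_sql, selected).fetchall()
print(facet_counts)
```

Category 3 gets a higher count than the others because adding it to the selection would bring in one more product; category 4 has no products of its own, so it only reports the products already matched by categories 1 and 2.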

Related

How to correctly use COUNT() when multiple joins are used?

I have the following schema:
Multiple webinar entities can have multiple categories hence the webinarcategorymapping table.
What I need to achieve is find the most popular webinars (by likes number) of a specific category.
For doing this, I've written the query below:
select
webinar.id, webinar.name as "webinar", webinar.publishat,
string_agg(category.name, ',' order by category.name) as categories,
count("like".likeableid) as "likes_count"
from
webinar
join "like" on webinar.id = "like".likeableid and "like".likeabletype = 'webinar'
join webinarcategorymapping on webinarcategorymapping.webinarid = webinar.id
join category on category.id = webinarcategorymapping.categoryid
group by "like".likeableid, webinar.id
having
string_agg(category.name, ',' order by category.name) ilike '%CategoryName%'
and count("like".likeableid) > 0
order by count("like".likeableid) desc;
Due to the many-to-many relationship between category and webinar I've decided to join all categories for every webinar into a comma-separated value by using string_agg. This way I'll be able to perform the search by category by using ilike %search_term%.
In the like table, likeabletype must be equal to 'webinar' and the likeableid field is the id of the entity on which the like is made. So, in my case, when querying the like table I need to use the conditions likeabletype = 'webinar' and likeableid = webinar.id.
The problem is that it gives me incorrect likes_count results (I guess it's due to the multiple joins duplicating many rows).
However using count(distinct "like".likeableid) doesn't help as it just gives me 1 for every row.
What should I change in my query in order to get correct result from count() of likes?
What I need to achieve is find the most popular webinars (by likes number) of a specific category.
You can aggregate the likes in a subquery and just filter on the categories:
select w.id, w.name as "webinar", w.publishat, num_likes
from webinar w join
(select l.likeableid, count(*) as num_likes
from "like" l
where l.likeabletype = 'webinar'
group by l.likeableid
) l
on w.id = l.likeableid join
webinarcategorymapping wcm
on wcm.webinarid = w.id join
category c
on c.id = wcm.categoryid
where c.name = ?
order by num_likes desc;
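The difference between counting after the joins and counting in a subquery can be checked on a toy data set. The sketch below uses SQLite from Python; the rows, the category name 'SQL' and the reduced column list are assumptions made for the illustration, not the asker's real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE webinar (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE category (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE webinarcategorymapping (webinarid INTEGER, categoryid INTEGER);
CREATE TABLE "like" (likeableid INTEGER, likeabletype TEXT);
INSERT INTO webinar VALUES (1, 'Intro'), (2, 'Advanced');
INSERT INTO category VALUES (1, 'SQL'), (2, 'Data');
INSERT INTO webinarcategorymapping VALUES (1, 1), (1, 2), (2, 1);
INSERT INTO "like" VALUES (1, 'webinar'), (1, 'webinar'),
                          (2, 'webinar'), (2, 'webinar'), (2, 'webinar');
""")

# Naive: every like row is paired with every category row before COUNT runs.
naive = conn.execute("""
SELECT w.id, COUNT(l.likeableid)
FROM webinar w
JOIN "like" l ON l.likeableid = w.id AND l.likeabletype = 'webinar'
JOIN webinarcategorymapping wcm ON wcm.webinarid = w.id
GROUP BY w.id
ORDER BY w.id
""").fetchall()

# Pre-aggregated: the subquery emits one row per webinar, so nothing multiplies.
fixed = conn.execute("""
SELECT w.id, l.num_likes
FROM webinar w
JOIN (SELECT likeableid, COUNT(*) AS num_likes
      FROM "like"
      WHERE likeabletype = 'webinar'
      GROUP BY likeableid) l ON l.likeableid = w.id
JOIN webinarcategorymapping wcm ON wcm.webinarid = w.id
JOIN category c ON c.id = wcm.categoryid
WHERE c.name = ?
ORDER BY l.num_likes DESC
""", ("SQL",)).fetchall()

print(naive, fixed)
```

Webinar 1 has two likes and two categories, so the naive query reports four likes for it, while the pre-aggregated query reports two.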

Improve SQL query by replacing inner query

I'm trying to simplify this SQL query (I replaced the real table names with metaphorical ones), primarily to get rid of the inner query, but my brain is frozen and I can't think of any way to do it.
My major concern (aside from aesthetics) is performance under heavy load.
The purpose of the query is to count all books, grouped by genre, found on any particular shelf where the book is kept (hence the inner query, which is effectively telling which shelf to count books on).
SELECT g.name, count(s.book_id) occurances FROM genre g
LEFT JOIN shelve s ON g.shelve_id=s.id
WHERE s.id=(SELECT genre_id FROM book WHERE id=111)
GROUP BY s.genre_id, g.name
It seems like you want to know how many books on a shelf are in the same genre as book 111: if you liked book "X", we have this many similar books in stock.
One thing I noticed is the WHERE clause in the original required a value for the shelve table, effectively converting it to an INNER JOIN. And speaking of JOINs, you can JOIN instead of the nested select.
SELECT g.name, count(s.book_id) occurances
FROM genre g
INNER JOIN shelve s ON s.id = g.shelve_id
INNER JOIN book b on b.genre_id = s.id
WHERE b.id=111
GROUP BY g.id, g.name
Thinking about it more, I might also start with book rather than genre. In the end, the only reason you need the genre table at all is to find the name, and therefore matching to it by id may be more effective.
SELECT g.name, count(s.book_id) occurances
FROM book b
INNER JOIN shelve s ON s.id = b.genre_id
INNER JOIN genre g on g.shelve_id = s.id
WHERE b.id=111
GROUP BY g.id, g.name
Not sure they meet your idea of "simpler" or not, but they are alternatives.
... unless matching shelve.id with book.genre_id is a typo in the question. It seems very odd the two tables would share the same id values, in which case these will both be wrong.

many to many select query

I'm trying to write code to pull a list of product items from a SQL Server database and display the results on a webpage.
A requirement of the project is that a list of categories is displayed on the right-hand side of the page as a list of checkboxes (all categories selected by default), and a user can uncheck categories and re-query the database to view products in only the categories they want.
Here's where it starts to get a bit hairy.
Each product can be assigned to multiple categories using a product categories table, as below...
Product table
[product_id](PK),[product_name],[product_price],[isEnabled],etc...
Category table
[CategoryID](PK),[CategoryName]
ProductCategory table
[id](PK),[CategoryID](FK),[ProductID](FK)
I need to select a list of products that match a set of category ID's passed to my stored procedure where the products have multiple assigned categories.
The category IDs are passed to the proc as a comma-delimited varchar, e.g. ( 3,5,8,12 )
The SQL breaks this varchar value into a result set in a temp table for processing.
How would I go about writing this query?
One problem is passing the array or list of selected categories into the server. The subject was covered at length by Erland Sommarskog in the series of articles Arrays and Lists in SQL Server. Passing the list as a comma-separated string and building a temp table is one option. There are alternatives, like using XML, a table-valued parameter (in SQL Server 2008), or a table @variable instead of a #temp table. The pros and cons of each are covered in the article(s) I linked.
Now on how to retrieve the products. First things first: if all categories are selected then use a different query that simply retrieves all products w/o bothering with categories at all. This will save a lot of performance and considering that all users will probably first see a page w/o any category unselected, the saving can be significant.
When categories are selected, then building a query that joins products, categories and selected categories is fairly easy. Making it scale and perform is a different topic, and is entirely dependent on your data schema and actual pattern of categories selected. A naive approach is like this:
select ...
from Products p
where p.IsEnabled = 1
and exists (
select 1
from ProductCategories pc
join #selectedCategories sc on sc.CategoryID = pc.CategoryID
where pc.ProductID = p.ProductID);
The ProductCategories table must have an index on (ProductID, CategoryID) and one on (CategoryID, ProductID) (one of them is the clustered index, the other is nonclustered). This is true for every solution, by the way. This query would work if most categories are always selected and the result contains most products anyway. But if the list of selected categories is restrictive, then it is better to avoid the scan on the potentially large Products table and start from the selected categories:
with distinctProducts as (
select distinct pc.ProductID
from ProductCategories pc
join #selectedCategories sc on pc.CategoryID = sc.CategoryID)
select p.*
from Products p
join distinctProducts dc on p.ProductID = dc.ProductID;
Again, the best solution depends largely on the shape of your data. For example, if you have a very skewed category (one category alone covers 99% of products), then the best solution would have to account for this skew.
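As a rough illustration of the temp-table route, the sketch below splits the comma-delimited string on the client, loads it into a temporary table and runs both query shapes. It uses SQLite from Python rather than SQL Server, so the temp table is named `selectedCategories` without a `#` or `@` prefix, and the three products are invented sample rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Products (ProductID INTEGER PRIMARY KEY, ProductName TEXT,
                       IsEnabled INTEGER);
CREATE TABLE ProductCategories (ProductID INTEGER, CategoryID INTEGER,
                                PRIMARY KEY (ProductID, CategoryID));
-- Second index, in the opposite column order, as the answer recommends.
CREATE INDEX ix_pc_category ON ProductCategories (CategoryID, ProductID);
INSERT INTO Products VALUES (1, 'kettle', 1), (2, 'mug', 1), (3, 'spoon', 1);
INSERT INTO ProductCategories VALUES (1, 3), (1, 5), (2, 5), (3, 9);
CREATE TEMP TABLE selectedCategories (CategoryID INTEGER PRIMARY KEY);
""")

# Split the comma-delimited string and load it into the temp table.
csv_ids = "3,5,8,12"
conn.executemany("INSERT INTO selectedCategories VALUES (?)",
                 [(int(part),) for part in csv_ids.split(",")])

# Shape 1: start from Products and probe with EXISTS.
via_exists = conn.execute("""
SELECT p.ProductID
FROM Products p
WHERE p.IsEnabled = 1
  AND EXISTS (SELECT 1
              FROM ProductCategories pc
              JOIN selectedCategories sc ON sc.CategoryID = pc.CategoryID
              WHERE pc.ProductID = p.ProductID)
ORDER BY p.ProductID
""").fetchall()

# Shape 2: start from the selected categories, then fetch the products.
via_cte = conn.execute("""
WITH distinctProducts AS (
  SELECT DISTINCT pc.ProductID
  FROM ProductCategories pc
  JOIN selectedCategories sc ON pc.CategoryID = sc.CategoryID)
SELECT p.ProductID
FROM Products p
JOIN distinctProducts dp ON p.ProductID = dp.ProductID
ORDER BY p.ProductID
""").fetchall()

print(via_exists, via_cte)
```

Both shapes return the same products here; which one is faster on real data depends on the selectivity of the category list, as described above.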
This gets all products that are at least in all of the desired categories (no less):
select * from product p1 join (
select p.product_id from product p
join ProductCategory pc on pc.product_id = p.product_id
where pc.category_id in (3,5,8,12)
group by p.product_id having count(p.product_id) = 4
) p2 on p1.product_id = p2.product_id
4 is the number of categories in the set.
This gets all products that are exactly in all of the desired categories (no more, no less):
select * from product p1 join (
select pc.product_id from ProductCategory pc
where not exists (
select * from ProductCategory pc2
where pc2.product_id = pc.product_id
and pc2.category_id not in (3,5,8,12)
)
group by pc.product_id having count(pc.product_id) = 4
) p2 on p1.product_id = p2.product_id
The double negative can be read as: get all products for which there are no categories that are not in the desired category list.
For the products in any of the desired categories, it's as simple as:
select * from product p1 where exists (
select * from product p2
join ProductCategory pc on pc.product_id = p2.product_id
where
p1.product_id = p2.product_id and
pc.category_id in (3,5,8,12)
)
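The three variants (any, at least all, exactly all) can be compared side by side on a small sample. This is a sketch in SQLite via Python that queries only the mapping table; the four products and the unique constraint on (product_id, category_id) are assumptions, and the counting trick relies on that uniqueness.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ProductCategory (product_id INTEGER, category_id INTEGER,
                              UNIQUE (product_id, category_id));
INSERT INTO ProductCategory VALUES
  (1, 3), (1, 5), (1, 8), (1, 12),          -- exactly the wanted set
  (2, 3), (2, 5), (2, 8), (2, 12), (2, 20), -- wanted set plus one extra
  (3, 3), (3, 5),                           -- only part of the wanted set
  (4, 20);                                  -- none of the wanted set
""")

wanted = [3, 5, 8, 12]
marks = ", ".join("?" for _ in wanted)


def ids(sql, params):
    """Run a query and return the first column as a plain list."""
    return [row[0] for row in conn.execute(sql, params)]


# In ANY of the wanted categories.
in_any = ids(f"""
SELECT DISTINCT product_id FROM ProductCategory
WHERE category_id IN ({marks})
ORDER BY product_id""", wanted)

# In at least ALL of the wanted categories: one matching row per category.
in_all = ids(f"""
SELECT product_id FROM ProductCategory
WHERE category_id IN ({marks})
GROUP BY product_id HAVING COUNT(*) = ?
ORDER BY product_id""", wanted + [len(wanted)])

# In EXACTLY the wanted categories: no row outside the list, and all present.
exactly = ids(f"""
SELECT pc.product_id FROM ProductCategory pc
WHERE NOT EXISTS (SELECT 1 FROM ProductCategory other
                  WHERE other.product_id = pc.product_id
                    AND other.category_id NOT IN ({marks}))
GROUP BY pc.product_id HAVING COUNT(*) = ?
ORDER BY pc.product_id""", wanted + [len(wanted)])

print(in_any, in_all, exactly)
```

Product 2 satisfies the "at least all" test but not the "exactly" test because of its extra category 20.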
This should do. You don't have to split the comma-delimited category IDs.
select distinct p.*
from product p, productcategory pc
where p.product_id = pc.productid
and pc.categoryid in ( place your comma delimited category ids here)
This will give the products which are in any of the passed-in category IDs, i.e., as per JNK's comment, it's an OR, not an ALL. Please specify if you want an AND, i.e., the product should be selected only if it is in ALL the categories specified in the comma-separated list.
If you need anything other than product_id from products, then you can write something like this (adding the extra fields that you need):
SELECT distinct(p.product_id)
FROM product_table p
JOIN productcategory_table pc
ON p.product_id=pc.product_id
WHERE pc.category_id in (3,5,8,12);
On the other hand, if you really need just the product_id, you can simply select it from productcategory_table:
SELECT distinct(product_id)
FROM productcategory_table
WHERE category_id in (3,5,8,12);
This should be fairly close to what you are looking for
SELECT product.*
FROM product
JOIN ProductCategory ON ProductCategory.ProductID = Product.product_id
JOIN #my_temp ON #my_temp.category_id = ProductCategory.CategoryID
EDIT
As noted in the comments, this will produce duplicates for those products appearing in multiple categories. To correct this, specify DISTINCT before the column list. I have included all product columns in the list (product.*) as I do not know which columns you are looking for, but you should probably change that to the specific columns that you want.

distinct group by join problem

Here's what I want to achieve:
I have a number of categories, each one with products in it.
I want to produce a report that shows various information about those products for each category. So I have a query that looks something like:
select
category,
count(products),
sum(product_price)
from product
group by category
So far so good.
But now I also want to get some category-specific information from a table that has information by category. So effectively I want to say:
join category_info on category
except that that will create a join for each row of each group, rather than just one join for each group.
What I really want to be able to say to sql is 'for each group, take the distinct category value, of which there's guaranteed to only be one since I'm grouping on it, and then use that to join to the category info table'
How can I accomplish this in SQL? By the way, I'm using Oracle 10g..
Many thanks!
select a.category, a.Count, a.SumPrice,
ci.OtherColumn
from (
select p.category,
count(p.products) as Count,
sum(p.product_price) as SumPrice
from product p
group by p.category
) a
inner join category_info ci on a.category = ci.category
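The aggregate-first, join-second pattern can be tried out on sample rows. The sketch below runs in SQLite from Python instead of Oracle 10g; the `manager` column stands in for whatever category-specific information is wanted, and prices are stored as integer cents purely to keep the sums exact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (product_id INTEGER PRIMARY KEY, category TEXT,
                      product_price INTEGER);
CREATE TABLE category_info (category TEXT PRIMARY KEY, manager TEXT);
INSERT INTO product VALUES (1, 'tools', 1000), (2, 'tools', 500),
                           (3, 'toys', 250);
INSERT INTO category_info VALUES ('tools', 'Ann'), ('toys', 'Bob');
""")

# The inner query collapses products to one row per category;
# only those summary rows are then joined to category_info.
report = conn.execute("""
SELECT a.category, a.product_count, a.sum_price, ci.manager
FROM (SELECT p.category,
             COUNT(*) AS product_count,
             SUM(p.product_price) AS sum_price
      FROM product p
      GROUP BY p.category) a
JOIN category_info ci ON ci.category = a.category
ORDER BY a.category
""").fetchall()

print(report)
```

Because the join happens after grouping, category_info is matched once per category rather than once per product row.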

Uses of unequal joins

Of all the thousands of queries I've written, I can probably count on one hand the number of times I've used a non-equijoin. e.g.:
SELECT * FROM tbl1 INNER JOIN tbl2 ON tbl1.date > tbl2.date
And most of those instances were probably better solved using another method. Are there any good/clever real-world uses for non-equijoins that you've come across?
Bitmasks come to mind. In one of my jobs, we had permissions for a particular user or group on an "object" (usually corresponding to a form or class in the code) stored in the database. Rather than including a row or column for each particular permission (read, write, read others, write others, etc.), we would typically assign a bit value to each one. From there, we could then join using bitwise operators to get objects with a particular permission.
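A bitmask join of this kind might look like the following sketch, run in SQLite from Python. The `permission` and `object_acl` tables, their columns and the bit assignments are all hypothetical; the answer does not describe the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- One bit per permission (hypothetical assignment).
CREATE TABLE permission (name TEXT, bit INTEGER);
INSERT INTO permission VALUES ('read', 1), ('write', 2), ('read_others', 4);

-- One mask per user and object instead of one row per permission.
CREATE TABLE object_acl (object_name TEXT, user_name TEXT, mask INTEGER);
INSERT INTO object_acl VALUES ('invoice_form', 'ann', 3),
                              ('report_form', 'ann', 1),
                              ('invoice_form', 'bob', 4);
""")

# Non-equijoin: a permission matches when its bit is set in the mask.
granted = conn.execute("""
SELECT a.user_name, a.object_name, p.name
FROM object_acl a
JOIN permission p ON (a.mask & p.bit) <> 0
ORDER BY a.user_name, a.object_name, p.bit
""").fetchall()

print(granted)
```

The mask 3 expands to both read and write, which is the point of the bitwise join: one stored value fans out to several permission rows.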
How about for checking for overlaps?
select ...
from employee_assignments ea1
, employee_assignments ea2
where ea1.emp_id = ea2.emp_id
and ea1.end_date >= ea2.start_date
and ea1.start_date <= ea2.end_date
Whole-day intervals in date_time fields:
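A minimal sketch of the overlap test, in SQLite via Python: two ranges overlap when each one starts on or before the other one ends. The `id` column and the `ea1.id < ea2.id` condition are additions made for this example so that a row is not reported as overlapping itself and each pair appears once; the dates are ISO-8601 strings, which compare correctly as text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE employee_assignments (id INTEGER PRIMARY KEY, emp_id INTEGER,
                                   start_date TEXT, end_date TEXT);
INSERT INTO employee_assignments VALUES
  (1, 7, '2024-01-01', '2024-03-31'),
  (2, 7, '2024-03-15', '2024-06-30'),  -- overlaps assignment 1
  (3, 7, '2024-07-01', '2024-09-30'),  -- starts after 2 ends
  (4, 8, '2024-01-01', '2024-12-31');  -- different employee
""")

overlaps = conn.execute("""
SELECT ea1.id, ea2.id
FROM employee_assignments ea1
JOIN employee_assignments ea2
  ON ea1.emp_id = ea2.emp_id
 AND ea1.id < ea2.id                  -- skip self-pairs and mirrored pairs
 AND ea1.end_date >= ea2.start_date   -- ea1 ends on or after ea2 starts
 AND ea1.start_date <= ea2.end_date   -- ea1 starts on or before ea2 ends
ORDER BY ea1.id, ea2.id
""").fetchall()

print(overlaps)
```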
date_time_field >= begin_date and date_time_field < end_date_plus_1
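The half-open comparison can be checked with a few timestamps around midnight. This sketch uses SQLite from Python with an invented `events` table and text timestamps; the upper bound is the day after the last wanted day and is excluded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE events (id INTEGER PRIMARY KEY, happened_at TEXT);
INSERT INTO events VALUES
  (1, '2024-05-01 00:00:00'),
  (2, '2024-05-01 23:59:59'),
  (3, '2024-05-02 00:00:00'),
  (4, '2024-04-30 23:59:59');
""")

begin_date = "2024-05-01"
end_date_plus_1 = "2024-05-02"  # exclusive upper bound

# >= on the lower bound, < on the upper bound: covers the whole of 1 May
# without having to spell out 23:59:59 or fractional seconds.
on_first_of_may = conn.execute("""
SELECT id FROM events
WHERE happened_at >= ? AND happened_at < ?
ORDER BY id
""", (begin_date, end_date_plus_1)).fetchall()

print(on_first_of_may)
```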
Just found another interesting use of an unequal join in the MCTS 70-433 (SQL Server 2008 Database Development) Training Kit book. Verbatim below.
By combining derived tables with unequal joins, you can calculate a variety of cumulative aggregates. The following query returns a running aggregate of orders for each salesperson (my note - with reference to the ubiquitous AdventureWorks sample db):
select
SH3.SalesPersonID,
SH3.OrderDate,
SH3.DailyTotal,
SUM(SH4.DailyTotal) RunningTotal
from
(select SH1.SalesPersonID, SH1.OrderDate, SUM(SH1.TotalDue) DailyTotal
from Sales.SalesOrderHeader SH1
where SH1.SalesPersonID IS NOT NULL
group by SH1.SalesPersonID, SH1.OrderDate) SH3
join
(select SH1.SalesPersonID, SH1.OrderDate, SUM(SH1.TotalDue) DailyTotal
from Sales.SalesOrderHeader SH1
where SH1.SalesPersonID IS NOT NULL
group by SH1.SalesPersonID, SH1.OrderDate) SH4
on SH3.SalesPersonID = SH4.SalesPersonID AND SH3.OrderDate >= SH4.OrderDate
group by SH3.SalesPersonID, SH3.OrderDate, SH3.DailyTotal
order by SH3.SalesPersonID, SH3.OrderDate
The derived tables are used to combine all orders for salespeople who have more than one order on a single day. The join on SalesPersonID ensures that you are accumulating rows for only a single salesperson. The unequal join allows the aggregate to consider only the rows for a salesperson where the order date is earlier than the order date currently being considered within the result set.
In this particular example, the unequal join is creating a "sliding window" kind of sum on the daily total column in SH4.
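The same running-total idea can be reproduced without AdventureWorks. The sketch below, in SQLite via Python, assumes the per-day totals have already been computed into a hypothetical `daily` table, so only the unequal self-join remains.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Stand-in for the derived tables: one row per salesperson and day.
CREATE TABLE daily (sales_person_id INTEGER, order_date TEXT,
                    daily_total INTEGER);
INSERT INTO daily VALUES
  (1, '2024-01-01', 100), (1, '2024-01-02', 50), (1, '2024-01-05', 25),
  (2, '2024-01-01', 10),  (2, '2024-01-03', 30);
""")

# Each current row is joined to every row of the same salesperson
# dated on or before it; summing those rows gives the running total.
running = conn.execute("""
SELECT cur.sales_person_id, cur.order_date, cur.daily_total,
       SUM(prev.daily_total) AS running_total
FROM daily cur
JOIN daily prev
  ON prev.sales_person_id = cur.sales_person_id
 AND prev.order_date <= cur.order_date
GROUP BY cur.sales_person_id, cur.order_date, cur.daily_total
ORDER BY cur.sales_person_id, cur.order_date
""").fetchall()

for row in running:
    print(row)
```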
Duplicates:
SELECT
*
FROM
table a, (
SELECT
id,
min(rowid) AS min_rowid
FROM
table
GROUP BY
id
) b
WHERE
a.id = b.id
and a.rowid > b.min_rowid;
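SQLite also exposes an implicit rowid, so the duplicate finder can be sketched there from Python. The table name `t`, its `payload` column and the `first_rowid` alias are made up for the example; the table deliberately has no primary key so that `id` can repeat.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE t (id INTEGER, payload TEXT);  -- no key, so id may repeat
INSERT INTO t VALUES (1, 'a'), (1, 'b'), (2, 'c'), (1, 'd');
""")

# Keep the first row per id (lowest rowid); every later row is a duplicate.
duplicates = conn.execute("""
SELECT a.rowid, a.id, a.payload
FROM t a
JOIN (SELECT id, MIN(rowid) AS first_rowid
      FROM t
      GROUP BY id) b
  ON a.id = b.id
 AND a.rowid > b.first_rowid
ORDER BY a.rowid
""").fetchall()

print(duplicates)
```

The rows returned are the ones a cleanup would delete; the first occurrence of each id is left out of the result.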
If you wanted to get all of the products to offer to a customer and don't want to offer them products that they already have:
SELECT
C.customer_id,
P.product_id
FROM
Customers C
INNER JOIN Products P ON
P.product_id NOT IN
(
SELECT
O.product_id
FROM
Orders O
WHERE
O.customer_id = C.customer_id
)
Most often though, when I use a non-equijoin it's because I'm doing some kind of manual fix to data. For example, the business tells me that a person in a user table should be given all access roles that they don't already have, etc.
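The customer/product query above can be exercised on a tiny data set. This is a sketch in SQLite via Python with two invented customers, two products and a single order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Customers (customer_id INTEGER PRIMARY KEY);
CREATE TABLE Products (product_id INTEGER PRIMARY KEY);
CREATE TABLE Orders (customer_id INTEGER NOT NULL,
                     product_id INTEGER NOT NULL);
INSERT INTO Customers VALUES (1), (2);
INSERT INTO Products VALUES (10), (11);
INSERT INTO Orders VALUES (1, 10);  -- customer 1 already owns product 10
""")

# Pair every customer with every product they have not ordered yet.
offers = conn.execute("""
SELECT C.customer_id, P.product_id
FROM Customers C
JOIN Products P
  ON P.product_id NOT IN (SELECT O.product_id
                          FROM Orders O
                          WHERE O.customer_id = C.customer_id)
ORDER BY C.customer_id, P.product_id
""").fetchall()

print(offers)
```

The NOT NULL constraint on Orders.product_id matters: if the subquery could return a NULL, NOT IN would never evaluate to true for that customer and they would get no offers at all.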
If you want to do a dirty join of two not really related tables, you can join with a <>.
For example, you could have a Product table and a Customer table. Hypothetically, if you want to show a list of every product with every customer, you could do something like this:
SELECT *
FROM Product p
JOIN Customer c on p.SKU <> c.SSN
It can be useful. Be careful, though, because it can create ginormous result sets.