Return all data when grouping on a field - sql

I have the following 2 tables (there are more fields in the real tables):
create table publisher(id serial not null primary key,
name text not null);
create table product(id serial not null primary key,
name text not null,
publisherRef int not null references publisher(id));
Sample data:
insert into publisher (id,name) values (1,'pub1'),(2,'pub2'),(3,'pub3');
insert into product (name,publisherRef) values('p1',1),('p2',2),('p3',2),('p4',2),('p5',3),('p6',3);
And I would like the query to return:
name, numProducts
pub2, 3
pub3, 2
pub1, 1
A product is published by a publisher. Now I need a list of name, id of all publishers which have at least one product, ordered by the total number of products each publisher has.
I can get the id of the publishers ordered by number of products with:
select publisherRef AS id, count(*)
from product
group by publisherRef
order by count(*) desc;
But I also need the name of each publisher in the result. I thought I could use a subquery like:
select *
from publisher
where id in (
select publisherRef
from product
order by count(*) desc)
But the order of rows in the subquery is lost in the outer SELECT.
Is there any way to do this with a single sql query?

SELECT pub.name, pro.num_products
FROM (
SELECT publisherref AS id, count(*) AS num_products
FROM product
GROUP BY 1
) pro
JOIN publisher pub USING (id)
ORDER BY 2 DESC;
Or (since the title mentions "all data") return all columns of the publisher with pub.*. After products have been aggregated in the subquery, you are free to list anything in the outer SELECT.
This only lists publishers which "have at least one product", and the result is ordered by "the total number of products each publisher has".
It's typically faster to aggregate the "n"-table before joining to the "1"-table. Then use an [INNER] JOIN (not a LEFT JOIN) to exclude publishers without products.
Note that the order of rows in an IN expression (or items in the given list - there are two syntax variants) is insignificant.
The column alias in publisherref AS id is optional; it just makes it possible to use the simpler USING clause, which requires identical column names, in the following join condition.
Aside: avoid CaMeL-case names in Postgres. Use unquoted, legal, lowercase names exclusively to make your life easier.
Are PostgreSQL column names case-sensitive?
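As a rough check of the aggregate-then-join query above, here is a minimal sketch that runs it against the question's sample data. It uses Python's built-in sqlite3 module rather than Postgres (an assumption made only to keep the example self-contained), and it adds a hypothetical publisher pub4 without products to show that the inner join drops it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE publisher (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE product (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        publisherref INTEGER NOT NULL REFERENCES publisher(id)
    );
    -- pub4 is a hypothetical extra row: a publisher with no products.
    INSERT INTO publisher (id, name)
    VALUES (1,'pub1'),(2,'pub2'),(3,'pub3'),(4,'pub4');
    INSERT INTO product (name, publisherref)
    VALUES ('p1',1),('p2',2),('p3',2),('p4',2),('p5',3),('p6',3);
""")

# Aggregate the "n" table first, then join to the "1" table.
publisher_counts = conn.execute("""
    SELECT pub.name, pro.num_products
    FROM (
        SELECT publisherref AS id, count(*) AS num_products
        FROM product
        GROUP BY publisherref
    ) pro
    JOIN publisher pub USING (id)
    ORDER BY pro.num_products DESC
""").fetchall()

print(publisher_counts)  # [('pub2', 3), ('pub3', 2), ('pub1', 1)]
```

Swapping the JOIN for a LEFT JOIN from publisher would bring pub4 back with a NULL count, which is not what the question asks for.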

Related

Why sometimes a subquery can work like using 'group by'

I'm new to sql and can't understand why sometimes a subquery can work like using 'group by'.
Say, there are two tables in a data base.
'foods' is a table created by:
CREATE TABLE foods (
id integer PRIMARY KEY,
type_id integer,
name text
);
'foods_episodes' is a table created by:
CREATE TABLE foods_episodes (
food_id integer,
episode_id integer
);
Now I'm running the following two queries, and they generate the same result.
SELECT name, (SELECT count(*) FROM foods_episodes WHERE food_id=f.id) AS frequency
FROM foods AS f
ORDER BY name;
SELECT name, count(*) AS frequency
FROM foods_episodes,
foods AS f
WHERE food_id=f.id
GROUP BY name;
So why does the subquery in the first query behave as if it groups the result by name?
When I run the subquery alone:
SELECT count(*)
FROM foods_episodes,
foods f
WHERE food_id=f.id
the result is just one row. Why can using this query as a subquery generate a multi-row result?
The first query isn't actually grouping by name. If you have more than one record with the same name (but different IDs), you will see that name displayed more than once, so the result is not grouped.
The first query uses what is called a correlated subquery: the subquery (the inner SELECT) is evaluated once for each row of the outermost SELECT. Because the FROM of this outermost SELECT reads only from the table foods, you get one record per food plus the result of the subquery, so there is no need to group.
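To make the difference concrete, here is a small sketch using Python's sqlite3 module and a few hypothetical rows (two foods named 'apple' with different ids, and a 'cake' that never appears in foods_episodes). With such data the two queries from the question no longer agree: the correlated subquery returns one row per food, while the join with GROUP BY merges equal names and drops foods without episodes.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE foods (id INTEGER PRIMARY KEY, type_id INTEGER, name TEXT);
    CREATE TABLE foods_episodes (food_id INTEGER, episode_id INTEGER);
    -- Hypothetical sample rows, not from the original question.
    INSERT INTO foods (id, type_id, name)
    VALUES (1,1,'apple'),(2,1,'bagel'),(3,2,'apple'),(4,2,'cake');
    INSERT INTO foods_episodes (food_id, episode_id)
    VALUES (1,10),(1,11),(2,10),(3,12);
""")

# Correlated subquery: evaluated once per row of foods.
correlated = conn.execute("""
    SELECT name,
           (SELECT count(*) FROM foods_episodes WHERE food_id = f.id) AS frequency
    FROM foods AS f
    ORDER BY name, f.id
""").fetchall()

# Join plus GROUP BY: one row per distinct name that has at least one episode.
grouped = conn.execute("""
    SELECT name, count(*) AS frequency
    FROM foods_episodes, foods AS f
    WHERE food_id = f.id
    GROUP BY name
    ORDER BY name
""").fetchall()

print(correlated)  # [('apple', 2), ('apple', 1), ('bagel', 1), ('cake', 0)]
print(grouped)     # [('apple', 3), ('bagel', 1)]
```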

How to refactor complicated SQL query which is broken

Here is the simplified model of the domain
In a nutshell, a unit grants documents to a customer. There are two types of units: main units and their child units. Both belong to the same province, and multiple cities may belong to one province. A document has numerous events (its processing history). A customer belongs to one city and province.
I have to write a query which returns a random set of documents, given a target main unit code. Here are the criteria:
Return 10 documents where the newest event_code = 10
Each document must belong to a different customer living in any city of the unit's region (prefer different cities)
Return the customer's newest document that meets the criteria
There must be both document types present in the result
Result (customers chosen) should be random with each query
But...
If there aren't enough customers, try to use multiple documents of the same customer as a last resort
If there aren't enough documents either, return as many as possible
If there's not a single instance of the other document type, then return documents of the same type only
There may be millions of rows, and the query must be as fast as possible, since it is executed frequently.
I'm not sure how to structure this kind of complex query in a sane manner. I'm using Oracle and PL/SQL. Here is something I tried, but it isn't working as expected (returns wrong data). How should I refactor this query and get the random result, and also honor all those borderline rules? I'm also worried about the performance regarding the joins and wheres.
CURSOR c_documents IS
WITH documents_cte AS (
SELECT d.document_id AS document_id, d.create_dt AS create_dt,
c.customer_id
FROM documents d
JOIN customers c ON (c.customer_id = d.customer_id AND
c.province_id = (SELECT region_id FROM unit WHERE unit_code = 1234))
WHERE EXISTS (
SELECT 1
FROM event
WHERE document_id = d.document_id AND
event_code = 10
AND create_dt = (
SELECT MAX(create_dt)
FROM event
WHERE document_id = d.document_id))
)
SELECT * FROM documents_cte d
WHERE create_dt = (SELECT MAX(create_dt)
FROM documents_cte
WHERE customer_id = d.customer_id);
How to correctly make this query with efficiency, randomness in mind? I'm not asking for exact solution, but guidelines at least.
I'd avoid hierarchical tables whenever possible. In your case you are using a hierarchical table to allow for unlimited depth, but in the end you store just two levels: provinces and their cities. These would be better as two tables: one for provinces and one for cities. Not a big deal, but it would make your data model simpler and easier to query.
Below I start with a WITH clause to get a cities table, since no such table exists. Then I go step by step: get the customers belonging to the unit, then get their documents and rank them. Finally I select the ranked documents and randomly take 10 of the best-ranked ones.
with cities as
(
select
c.region_id as city_id,
p.region_id as province_id
from region c
join region p on p.region_id = c.parent_region_id
)
, unit_customers as
(
select customer_id
from customer
where city_id in
(
select city_id
from cities
where
(
select region_id
from unit
where unit_code = 1234
) in (city_id, province_id)
)
)
, ranked_documents as
(
select
document.*,
row_number() over (partition by customer_id order by create_dt desc) as rn
from document
where customer_id in -- customers belonging to the unit
(
select customer_id
from unit_customers
)
and document_id in -- documents with latest event code = 10
(
select document_id
from event
group by document_id
having max(event_code) keep (dense_rank last order by create_dt) = 10
)
)
select *
from ranked_documents
order by rn, dbms_random.value
fetch first 10 rows only;
This doesn't take into account the requirement to get both document types, as that contradicts the rule to get the latest document per customer.
FETCH FIRST is available as of Oracle 12c. In earlier versions you would use one more subquery and another ROW_NUMBER instead.
As to speed, I'd recommend these indexes for the query:
create index idx_r1 on region(region_id); -- already exists for region_id = primary key
create index idx_r2 on region(parent_region_id, region_id);
create index idx_u1 on unit(unit_code, region_id);
create index idx_c1 on customer(city_id, customer_id);
create index idx_e1 on event(document_id, create_dt, event_code);
create index idx_d1 on document(document_id, customer_id, create_dt);
create index idx_d2 on document(customer_id, document_id, create_dt);
One of the last two will be used, the other not. Check which with EXPLAIN PLAN and drop the unused one.
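The rank-then-randomize step of the query above can be tried out in isolation. The following is only a sketch: it uses Python's sqlite3 module (which needs SQLite 3.25 or later for window functions) instead of Oracle, substitutes random() and LIMIT for dbms_random.value and FETCH FIRST, and works on a handful of hypothetical document rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE document (
        document_id INTEGER PRIMARY KEY,
        customer_id INTEGER,
        create_dt   TEXT
    );
    -- Hypothetical rows: three customers with 2, 1 and 3 documents.
    INSERT INTO document VALUES
        (1, 100, '2020-01-01'), (2, 100, '2020-02-01'),
        (3, 200, '2020-01-15'),
        (4, 300, '2020-03-01'), (5, 300, '2020-01-20'), (6, 300, '2020-02-10');
""")

# rn = 1 marks each customer's newest document. Sorting by rn first means
# second-newest documents are only used once every customer is represented;
# random() shuffles the rows within each rank.
picked = conn.execute("""
    WITH ranked_documents AS (
        SELECT document_id, customer_id,
               row_number() OVER (PARTITION BY customer_id
                                  ORDER BY create_dt DESC) AS rn
        FROM document
    )
    SELECT document_id, customer_id, rn
    FROM ranked_documents
    ORDER BY rn, random()
    LIMIT 4
""").fetchall()

for row in picked:
    print(row)
```

The first three rows are always the newest documents of the three customers (in random order); the fourth is one of the second-newest documents, which corresponds to the "last resort" rule in the question.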

how to perform these queries?

I have these three tables:
create table albums(sernum number primary key,
Albname varchar2(30) not null,
Artist varchar2(20) not null,
Pdate number(4),
Recompany varchar2(10),
Media char(2) not null);
create table tracks(sernum number not null,
song varchar2(50) not null,
primary key(sernum, song),
foreign key(sernum) references albums(sernum));
create table performers(sernum number not null,
Artist varchar2(30) not null,
Instrument varchar2(50) not null,
primary key(sernum, Artist, Instrument),
foreign key (sernum) references albums(sernum));
I want to perform two queries in sql oracle:
list the names of the artists that used all instruments.
list the names of the albums containing the maximum number of songs.
Here are my attempts:
select distinct(a.Artist) from albums a where a.Artist like (select p.Artist, distinct(p.Instrument) from performers p) group by a.Artist, p.Instrument;
select a.Albname from albums a, inner join tracks t on where a.sernum in(select max(t.sernum) group by t.sernum);
Query 1 - get artists who have played all instruments:
SELECT
p.Artist
FROM
(
SELECT Artist, count(distinct Instrument) as InstrumentCount
FROM performers
GROUP BY artist
) p
JOIN
(
SELECT COUNT(DISTINCT Instrument) as InstrumentCount
FROM performers
) i
ON p.InstrumentCount = i.InstrumentCount
Explanation: the first subquery gets the number of distinct instruments played by each artist. The second subquery gets the total number of distinct instruments. The two are joined on this instrument count to give us only those artists whose count matches the total, i.e. who have played every instrument.
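Here is a quick way to try Query 1, sketched with Python's sqlite3 module instead of Oracle and with hypothetical performer rows. Bob appears three times but only with two distinct instruments, so only Ann should be returned.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE performers (
        sernum     INTEGER NOT NULL,
        Artist     TEXT NOT NULL,
        Instrument TEXT NOT NULL,
        PRIMARY KEY (sernum, Artist, Instrument)
    );
    -- Hypothetical rows; the foreign key to albums is left out for brevity.
    INSERT INTO performers VALUES
        (1,'Ann','guitar'), (1,'Ann','piano'), (2,'Ann','drums'),
        (1,'Bob','guitar'), (2,'Bob','piano'), (2,'Bob','guitar');
""")

# Per-artist distinct instrument count joined to the overall distinct count.
all_instrument_artists = [row[0] for row in conn.execute("""
    SELECT p.Artist
    FROM (
        SELECT Artist, count(DISTINCT Instrument) AS InstrumentCount
        FROM performers
        GROUP BY Artist
    ) p
    JOIN (
        SELECT count(DISTINCT Instrument) AS InstrumentCount
        FROM performers
    ) i ON p.InstrumentCount = i.InstrumentCount
    ORDER BY p.Artist
""")]

print(all_instrument_artists)  # ['Ann']
```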
--
Query 2 - Get albums containing the maximum number of songs:
WITH
AlbumTrackCount AS
(
SELECT
sernum,
COUNT(1) as TrackCount
FROM tracks
GROUP BY sernum
)
SELECT
a.Albname
FROM albums a
JOIN AlbumTrackCount atc
ON a.sernum = atc.sernum
AND atc.TrackCount =
(
SELECT MAX(TrackCount)
FROM AlbumTrackCount
)
Explanation: the WITH up top establishes a subquery we'll reuse; it gets us the track count within each album. Down below, we join the albums with this album track count, and add a filter that keeps only those albums whose track count equals the maximum track count of any album. Note that this is different from the top query, which just counted every instrument ever used; here, it is important to first count the tracks within each album, and then take the maximum of those counts.
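Query 2 can be exercised the same way. This sketch uses Python's sqlite3 module rather than Oracle, trims the albums table to the two columns the query needs, and uses hypothetical rows in which two albums tie for the maximum of three tracks.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE albums (sernum INTEGER PRIMARY KEY, Albname TEXT NOT NULL);
    CREATE TABLE tracks (
        sernum INTEGER NOT NULL REFERENCES albums(sernum),
        song   TEXT NOT NULL,
        PRIMARY KEY (sernum, song)
    );
    -- Hypothetical rows: album 1 has 2 tracks, albums 2 and 3 have 3 each.
    INSERT INTO albums VALUES (1,'First'), (2,'Second'), (3,'Third');
    INSERT INTO tracks VALUES
        (1,'a'), (1,'b'),
        (2,'a'), (2,'b'), (2,'c'),
        (3,'x'), (3,'y'), (3,'z');
""")

# Count tracks per album once, then keep albums whose count equals the maximum.
largest_albums = [row[0] for row in conn.execute("""
    WITH AlbumTrackCount AS (
        SELECT sernum, count(*) AS TrackCount
        FROM tracks
        GROUP BY sernum
    )
    SELECT a.Albname
    FROM albums a
    JOIN AlbumTrackCount atc
      ON a.sernum = atc.sernum
     AND atc.TrackCount = (SELECT max(TrackCount) FROM AlbumTrackCount)
    ORDER BY a.Albname
""")]

print(largest_albums)  # ['Second', 'Third']
```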
Below are some of the issues with your queries:
SELECT DISTINCT (a.artist)
FROM albums a
WHERE a.artist LIKE (SELECT p.artist,
distinct(p.Instrument)
from performers p)
group by a.Artist, p.Instrument;
LIKE indicates that you're going to use a wildcard. When comparing against a sub-query in the where clause, you typically use in as the operator.
DISTINCT is not a function. It always applies to all of the columns in a SELECT statement.
DISTINCT and GROUP BY serve very similar purposes. You would rarely use both in the same statement.
You can't reference a column from a subquery in the WHERE clause (here p.Instrument) in the outer query.
SELECT a.albname
FROM albums a,
inner join tracks t
on
where a.sernum in(select max(t.sernum) group by t.sernum);
You're using both a comma and INNER JOIN to connect two tables. The comma is the older implicit join syntax, whereas INNER JOIN is the explicit join syntax introduced in SQL-92. While you can technically use both in a single FROM clause, you can't use both between a single pair of tables. Also, you shouldn't mix them. Stick to the explicit JOIN syntax.
Your ON clause is empty. You should probably be joining your two tables here. If you really want to not have a join condition, change the join to CROSS JOIN (to re-iterate: you almost certainly don't actually want this).
You have a SELECT statement without a FROM clause. That is not allowed.

SQL Join IN Query with AND?

I have the following tables:
Option
-------
id - int
name - varchar
Product
---------
id - int
name -varchar
ProductOptions
------------------
id - int
product_id - int
option_id - int
If I have a list of option ids, how can I retrieve all Products that have all the options with the list of ids that I have? I know that SQL "IN" will use an "OR" i need an "AND". Thank you!
If the ids are not repeated, you can take the ids of the options you need and count how many there are. Then you just run the following, where OPTIONS is the list of ids and NEEDED is their count:
SELECT product_id FROM ProductOptions
WHERE option_id IN ( OPTIONS )
GROUP BY product_id
HAVING COUNT(product_id) = NEEDED;
Without the GROUP BY, if you had five option ids, and product 27 had fifteen options among which there were those five, you'd get five rows with the same product_id. The GROUP BY joins those rows. Since you want ALL options, and options have all different IDs, asking "rows with all of them" is equivalent to asking "rows with as many options as the desired option set size".
Plus, you run the big query on ProductOptions only, which should be really fast.
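A minimal sketch of this counting approach, using Python's sqlite3 module and hypothetical rows. OPTIONS is passed as bound parameters and NEEDED is the length of the list. As a small variation on the query above, it counts DISTINCT option_id so that an accidentally repeated (product_id, option_id) pair would not inflate the count.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE ProductOptions (
        id         INTEGER PRIMARY KEY,
        product_id INTEGER NOT NULL,
        option_id  INTEGER NOT NULL
    );
    -- Hypothetical rows: product 1 has options 1,2,3; product 2 has 1; product 3 has 1,2.
    INSERT INTO ProductOptions (product_id, option_id)
    VALUES (1,1), (1,2), (1,3), (2,1), (3,2), (3,1);
""")

wanted = [1, 2]  # the list of option ids every product must have
placeholders = ",".join("?" for _ in wanted)

sql = f"""
    SELECT product_id
    FROM ProductOptions
    WHERE option_id IN ({placeholders})
    GROUP BY product_id
    HAVING count(DISTINCT option_id) = ?
    ORDER BY product_id
"""
matching_products = [row[0] for row in conn.execute(sql, wanted + [len(wanted)])]

print(matching_products)  # [1, 3]
```

Product 2 matches the IN filter but only with one of the two options, so the HAVING clause removes it.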
One way to approach queries like this is with a group by and having clause. It is best if you start with your list of required options in a list:
with list as (
select <optionname1> as optionname union all
select <optionname2> union all . . .
)
select po.product_id
from list l join
Option o
on l.optionname = o.name join
ProductOptions po
on po.option_id = o.id
group by po.product_id
having count(distinct o.name) = (select count(distinct optionname) from list)
This guarantees that all are in the list. By the way, I used SQL Server syntax to generate the list.
If you have the list in other formats, such as a delimited string, there are other options. There are other possibilities depending on the database you are using. However, the above idea should work on any database, with two caveats:
The with statement might just become a subquery in the FROM clause where "list" is.
The method for creating the list (a table of constants) varies among databases
If you have a list of ids, you basically have only two options:
- either run one SELECT per id,
- or use IN () or OR.
Using IN is recommended, however, as running a single statement usually performs better (moreover, if all your id columns are indexed, no table scan should be required).
I'd use following statement:
select Product.* from Product, Option, ProductOptions where Option.id IN ( 1, 2, ... ) and Option.id = ProductOptions.option_id and ProductOptions.product_id = Product.id
One more remark: why do you have an id column in the ProductOptions table? It's useless from my point of view; you should rather have a composite primary key on the columns product_id and option_id (as this pair is unique).
Will this work?:
select p.id, p.name
from Product as p inner join
ProductOptions as po on p.id=po.product_id
where po.option_id in (1,2,3,4)

mysql query to get the count of each element in a column

I have a table called category in which I have main category ids and names, and each main category has subcategory ids and names. I also have a products table with a category id column, which holds either a numeric subcategory id or a two-letter main category id (e.g. EL for electronics). My problem is how to get the top categories, i.e. the number of products in each category in descending order.
category
{
sub_cat_id - numeric
sub_cat_name - varchar
main_cat_id - varchar (2 characters)
main_cat_name
}
products
{
categoryid,//this can be either main_cat_id or sub_cat_id
}
pls help....
If there is no namespace clash between main category ids and subcategory ids, you could do:
select main_cat_id , count(*) as total
from category
where ( main_cat_id in (select categoryid from products)
OR
sub_cat_id in (select categoryid from products)
)
group by main_cat_id
order by total desc
However, prima facie there seems to be a problem with the design of the category table: sub_cat should be a separate table with appropriate constraints.
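If the goal is to count products rather than category rows, one possible variation under the same no-clash assumption is to join products to category on either id column. The sketch below is only an illustration: it uses Python's sqlite3 module instead of MySQL, invents a products.id column (not in the question) so that a product stored under a main category id is counted once even though it matches several category rows, and uses hypothetical data.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE category (
        sub_cat_id    INTEGER PRIMARY KEY,
        sub_cat_name  TEXT,
        main_cat_id   TEXT,
        main_cat_name TEXT
    );
    -- products.id is a hypothetical column added for this sketch.
    CREATE TABLE products (id INTEGER PRIMARY KEY, categoryid TEXT);

    INSERT INTO category VALUES
        (1,'phones','EL','Electronics'),
        (2,'tvs','EL','Electronics'),
        (3,'novels','BK','Books');
    INSERT INTO products VALUES (1,'EL'), (2,'1'), (3,'2'), (4,'3'), (5,'1');
""")

# A product matches a category row through either its main or its sub id.
# count(DISTINCT p.id) avoids double counting when 'EL' matches two rows.
top_categories = conn.execute("""
    SELECT c.main_cat_id, count(DISTINCT p.id) AS total
    FROM products p
    JOIN category c
      ON p.categoryid = c.main_cat_id
      OR p.categoryid = CAST(c.sub_cat_id AS TEXT)
    GROUP BY c.main_cat_id
    ORDER BY total DESC
""").fetchall()

print(top_categories)  # [('EL', 4), ('BK', 1)]
```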
Seems like it should be straight-forward, can you post the "CREATE TABLE" statements and a few sample rows (I wasn't able to "reverse engineer" the "CREATE TABLE" from your syntax above...)
Actually, I don't think you can do that with a single query. Due to the double nature of column categoryid in the table products (i.e. that it may be a foreign key or a category name), you'd have to somehow combine that column with the main_cat_id column of the table category following a join, and do a GROUP BY on the merged column, but that is not possible, AFAIK.
You can, of course, do two separate SQL queries (one for counting products directly in main categories, and another one for counting those in subcategories), and combine their results with some server side scripting.