Count columns of joined table - sql

I am writing a query to summarize the data in a Postgres database:
SELECT products.id,
products.NAME,
product_types.type_name AS product_type,
delivery_types.delivery,
products.required_selections,
Count(s.id) AS selections_count,
Sum(CASE
WHEN ss.status = 'WARNING' THEN 1
ELSE 0
END) AS warning_count
FROM products
JOIN product_types
ON product_types.id = products.product_type_id
JOIN delivery_types
ON delivery_types.id = products.delivery_type_id
LEFT JOIN selections_products sp
ON products.id = sp.product_id
LEFT JOIN selections s
ON s.id = sp.selection_id
LEFT JOIN selection_statuses ss
ON ss.id = s.selection_status_id
LEFT JOIN listings l
ON ( s.listing_id = l.id
AND l.local_date_time BETWEEN
To_timestamp('2014/12/01', 'YYYY/mm/DD'
) AND
To_timestamp('2014/12/30', 'YYYY/mm/DD') )
GROUP BY products.id,
product_types.type_name,
delivery_types.delivery
Basically we have products with selections; these selections have listings, and the listings have a local_date. I need a list of all products and how many listings they have between the two dates. No matter what I do, I get a count of all selections (a total). I feel like I'm overlooking something. The same goes for warning_count. Also, I don't really understand why Postgres requires me to add a GROUP BY here.
The schema looks like this (the parts you would care about anyway):
products
name:string
, product_type:fk
, required_selections:integer
, deliver_type:fk
selections_products
product_id:fk
, selection_id:fk
selections
selection_status_id:fk
, listing_id:fk
selection_status
status:string
listing
local_date:datetime

The way you have it, you LEFT JOIN to all selections regardless of listings.local_date_time.
There is room for interpretation; we would need to see the actual table definitions with all constraints and data types to be sure. Going out on a limb, my educated guess is that you can fix your query with the use of parentheses in the FROM clause to prioritize joins:
SELECT p.id
, p.name
, pt.type_name AS product_type
, dt.delivery
, p.required_selections
, count(s.id) AS selections_count
, sum(CASE WHEN ss.status = 'WARNING' THEN 1 ELSE 0 END) AS warning_count
FROM products p
JOIN product_types pt ON pt.id = p.product_type_id
JOIN delivery_types dt ON dt.id = p.delivery_type_id
LEFT JOIN ( -- LEFT JOIN!
selections_products sp
JOIN selections s ON s.id = sp.selection_id -- INNER JOIN!
JOIN listings l ON l.id = s.listing_id -- INNER JOIN!
AND l.local_date_time >= '2014-12-01'
AND l.local_date_time < '2014-12-31'
LEFT JOIN selection_statuses ss ON ss.id = s.selection_status_id
) ON sp.product_id = p.id
GROUP BY p.id, pt.type_name, dt.delivery;
This way, you first eliminate all selections outside the given time frame with [INNER] JOIN before you LEFT JOIN to products, thus keeping all products in the result, including those that aren't in any applicable selection.
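To make the difference concrete, here is a minimal sketch using SQLite through Python's sqlite3 on made-up rows with a cut-down schema (names are illustrative; the prioritized join is written as an equivalent derived table to stay within portable syntax):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE products   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE selections (id INTEGER PRIMARY KEY, product_id INT, listing_id INT);
CREATE TABLE listings   (id INTEGER PRIMARY KEY, local_date_time TEXT);

INSERT INTO products   VALUES (1, 'widget'), (2, 'gadget');  -- gadget has no selections
INSERT INTO listings   VALUES (10, '2014-12-15'), (11, '2015-06-01');
INSERT INTO selections VALUES (100, 1, 10), (101, 1, 11);    -- one in range, one out
""")

# Shape of the original query: the date filter sits only on the listings join,
# so every selection still matches the LEFT JOIN and gets counted.
naive = con.execute("""
    SELECT p.name, count(s.id)
    FROM products p
    LEFT JOIN selections s ON s.product_id = p.id
    LEFT JOIN listings   l ON l.id = s.listing_id
                          AND l.local_date_time >= '2014-12-01'
                          AND l.local_date_time <  '2014-12-31'
    GROUP BY p.id, p.name
""").fetchall()

# Fixed shape: selections are inner-joined to in-range listings first, and only
# that reduced set is LEFT JOINed back, so all products survive with true counts.
fixed = con.execute("""
    SELECT p.name, count(sel.sid)
    FROM products p
    LEFT JOIN (SELECT s.id AS sid, s.product_id
               FROM selections s
               JOIN listings l ON l.id = s.listing_id
                              AND l.local_date_time >= '2014-12-01'
                              AND l.local_date_time <  '2014-12-31'
              ) sel ON sel.product_id = p.id
    GROUP BY p.id, p.name
""").fetchall()

print(dict(naive))  # {'widget': 2, 'gadget': 0} -- every selection counted
print(dict(fixed))  # {'widget': 1, 'gadget': 0} -- only the in-range selection
```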
Related:
Join four tables involving LEFT JOIN without duplicates
While selecting all or most products, this can be rewritten to be faster:
SELECT p.id
, p.name
, pt.type_name AS product_type
, dt.delivery
, p.required_selections
, COALESCE(s.selections_count, 0) AS selections_count
, COALESCE(s.warning_count, 0) AS warning_count
FROM products p
JOIN product_types pt ON pt.id = p.product_type_id
JOIN delivery_types dt ON dt.id = p.delivery_type_id
LEFT JOIN (
SELECT sp.product_id
, count(*) AS selections_count
, count(*) FILTER (WHERE ss.status = 'WARNING') AS warning_count
FROM selections_products sp
JOIN selections s ON s.id = sp.selection_id
JOIN listings l ON l.id = s.listing_id
LEFT JOIN selection_statuses ss ON ss.id = s.selection_status_id
WHERE l.local_date_time >= '2014-12-01'
AND l.local_date_time < '2014-12-31'
GROUP BY 1
) s ON s.product_id = p.id;
It's cheaper to aggregate and count selections and warnings per product_id first, and then join to products. (Unless you only retrieve a small selection of products, then it's cheaper to reduce related rows first.)
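A runnable sketch of the aggregate-first pattern, again in SQLite via Python's sqlite3 on invented rows; the FILTER clause is replaced by a portable CASE expression (FILTER requires Postgres 9.4+ or SQLite 3.30+):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE products   (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE selections (id INTEGER PRIMARY KEY, product_id INT, status TEXT);

INSERT INTO products   VALUES (1, 'widget'), (2, 'gadget');  -- gadget: no selections
INSERT INTO selections VALUES (100, 1, 'OK'), (101, 1, 'WARNING'), (102, 1, 'WARNING');
""")

# Aggregate per product_id first, then LEFT JOIN the small result to products;
# COALESCE turns the NULLs of unmatched products into zero counts.
rows = con.execute("""
    SELECT p.name,
           COALESCE(s.selections_count, 0) AS selections_count,
           COALESCE(s.warning_count, 0)    AS warning_count
    FROM products p
    LEFT JOIN (
        SELECT product_id,
               count(*) AS selections_count,
               sum(CASE WHEN status = 'WARNING' THEN 1 ELSE 0 END) AS warning_count
        FROM selections
        GROUP BY product_id
    ) s ON s.product_id = p.id
    ORDER BY p.id
""").fetchall()
print(rows)  # [('widget', 3, 2), ('gadget', 0, 0)]
```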
Related:
Why does the following join increase the query time significantly?
Also, I don't really understand why Postgres requires me to add a group by here.
Since Postgres 9.1, the PK column in GROUP BY covers all columns of the same table. That does not cover columns of other tables, even if they are functionally dependent. You need to list those explicitly in GROUP BY if you don't want to aggregate them.
My second query avoids this problem at the outset by aggregating before the join.
Aside: chances are, this doesn't do what you want:
l.local_date_time BETWEEN To_timestamp('2014/12/01', 'YYYY/mm/DD')
AND To_timestamp('2014/12/30', 'YYYY/mm/DD')
Since local_date_time seems to be of type timestamp (not timestamptz!), you would include '2014-12-30 00:00', but exclude the rest of the day '2014-12-30'. And it's always better to use ISO 8601 format for dates and timestamps, which means the same thing with every locale and DateStyle setting. Hence:
WHERE l.local_date_time >= '2014-12-01'
AND l.local_date_time < '2014-12-31'
This includes all of '2014-12-30' and nothing later. No idea why you chose to exclude '2014-12-31'. Maybe you really want to include all of December 2014?
WHERE l.local_date_time >= '2014-12-01'
AND l.local_date_time < '2015-01-01'
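The half-open range logic can be checked outside the database; a small Python sketch with the same boundary values:

```python
from datetime import datetime

start = datetime(2014, 12, 1)
end = datetime(2015, 1, 1)   # exclusive upper bound: covers all of December

def in_range(ts):
    # Half-open interval: >= start, < end
    return start <= ts < end

# BETWEEN ... AND '2014-12-30' stops at 2014-12-30 00:00:00, so everything
# later on Dec 30 (and all of Dec 31) is silently dropped:
between_upper = datetime(2014, 12, 30)
assert not (datetime(2014, 12, 30, 12, 0) <= between_upper)

# The half-open range keeps the whole last day and nothing more:
assert in_range(datetime(2014, 12, 31, 23, 59, 59, 999999))
assert not in_range(datetime(2015, 1, 1))
```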

Related

Count with subselect really slow in postgres

I have this query:
SELECT c.name, COUNT(t.id)
FROM Cinema c
JOIN CinemaMovie cm ON cm.cinema_id = c.id
JOIN Ticket t ON cm.id = cinema_movie_id
WHERE cm.id IN (
SELECT cm1.id
FROM CinemaMovie cm1
JOIN Movie m1 ON m1.id = cm1.movie_id
JOIN Ticket t1 ON t1.cinema_movie_id = cm1.id
WHERE m1.name = 'Hellboy'
AND t1.time >= timestamp '2019-04-18 00:00:00'
AND t1.time <= timestamp '2019-04-18 23:59:59' )
GROUP BY c.id;
and the problem is that this query runs really slow (more than 1 minute) when the table has around 20 million rows. From what I understand, the problem seems to be the inner query, as it takes a long time. Also, I have indexes on all foreign keys. What am I missing?
Also note that when I select only by name (omitting the date), everything takes about 10 seconds.
EDIT
What I am trying to do, is count number of tickets for each cinema name, based on movie name and the timestamp on ticket.
I don't understand why you are using a subquery. Does this do what you want?
SELECT c.name, COUNT(t.id)
FROM Cinema c
JOIN CinemaMovie cm ON cm.cinema_id = c.id
JOIN Ticket t ON t.cinema_movie_id = cm.id
JOIN Movie m ON m.id = cm.movie_id
WHERE m.name = 'Hellboy' AND
      t.time >= '2019-04-18'::timestamp and
      t.time < '2019-04-19'::timestamp
GROUP BY c.id, c.name;
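A quick sanity check of the flattened query on made-up rows, using SQLite through Python's sqlite3. One subtle difference worth knowing: the original subquery version counts every ticket of a qualifying showing, while this version counts only tickets inside the date window, which appears to be what the EDIT asks for:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Cinema      (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE Movie       (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE CinemaMovie (id INTEGER PRIMARY KEY, cinema_id INT, movie_id INT);
CREATE TABLE Ticket      (id INTEGER PRIMARY KEY, cinema_movie_id INT, time TEXT);

INSERT INTO Cinema VALUES (1, 'Roxy'), (2, 'Plaza');
INSERT INTO Movie  VALUES (1, 'Hellboy'), (2, 'Other');
INSERT INTO CinemaMovie VALUES (1, 1, 1), (2, 2, 1), (3, 1, 2);
INSERT INTO Ticket VALUES
  (1, 1, '2019-04-18 20:00:00'),
  (2, 1, '2019-04-17 20:00:00'),   -- outside the day
  (3, 2, '2019-04-18 10:00:00'),
  (4, 3, '2019-04-18 12:00:00');   -- wrong movie
""")

# Half-open date range on the ticket time, movie filter in WHERE.
flat = con.execute("""
    SELECT c.name, count(t.id)
    FROM Cinema c
    JOIN CinemaMovie cm ON cm.cinema_id = c.id
    JOIN Movie m        ON m.id = cm.movie_id
    JOIN Ticket t       ON t.cinema_movie_id = cm.id
    WHERE m.name = 'Hellboy'
      AND t.time >= '2019-04-18'
      AND t.time <  '2019-04-19'
    GROUP BY c.id, c.name
""").fetchall()
print(dict(flat))  # {'Roxy': 1, 'Plaza': 1}
```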

Switching to Vertica from MySql, aggregate in where clause not working

Recently we have switched to Vertica from MySQL. I am lost on how to re-create the <=30 check inside the where clause in the query below. This currently does not work in Vertica, but does in MySQL.
Essentially, a user owns cars and cars have parts. I want to total the amount of cars and car parts in a timeframe, but only for users who have less than or equal to 30 cars.
select
count(distinct cr.id) as 'Cars',
count(distinct cp.id) as 'Car Parts'
from
users u
inner join
user_emails ue on u.id = ue.user_id
inner join
cars cr on cr.user_id = u.id
inner join
car_parts cp on cp.car_id = cr.id
where
(
select count(*) from cars where cars.user_id=u.id
) <=30
and
ue.is_real = true and ue.is_main = true
and
cr.created_at >= '2017-01-01 00:00:00' and cr.created_at <= '2017-02-17 23:59:59'
Any help or guidance is greatly appreciated!
Before my mouse flies away and my monitor goes blank, I get this error:
ERROR: Correlated subquery with aggregate function COUNT is not supported
You wouldn't use a subquery this way in Vertica. Instead, use a window function:
select count(distinct cr.id) as Cars,
       count(distinct cp.id) as CarParts
from users u
join user_emails ue on u.id = ue.user_id
join (select cr.*, count(*) over (partition by user_id) as cnt
      from cars cr
     ) cr on cr.user_id = u.id
join car_parts cp on cp.car_id = cr.id
where cr.cnt <= 30 and
      ue.is_real = true and ue.is_main = true and
      cr.created_at >= '2017-01-01' and
      cr.created_at < '2017-02-18';
Notes:
Don't enclose column aliases in single quotes. That is a bug waiting to happen. Only use single quotes for string and date constants.
You can simplify the date logic. Using < is better than <= to capture everything that happens on a particular day.
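A runnable sketch of the window-function trick, using SQLite (3.25+ supports window functions) through Python's sqlite3 on invented rows; note that cnt is computed over all of a user's cars, before the date filter, just like the original correlated subquery:

```python
import sqlite3

con = sqlite3.connect(":memory:")  # window functions need SQLite 3.25+
con.executescript("""
CREATE TABLE cars (id INTEGER PRIMARY KEY, user_id INT, created_at TEXT);

INSERT INTO cars VALUES
  (1, 1, '2017-01-10'), (2, 1, '2017-01-11'),                       -- user 1: 2 cars
  (3, 2, '2017-01-12'), (4, 2, '2017-03-01'), (5, 2, '2017-01-13'); -- user 2: 3 cars
""")

# cnt counts ALL of a user's cars (before the date filter), which mirrors the
# correlated subquery's count(*); the date filter only limits the output rows.
rows = con.execute("""
    SELECT id, user_id, cnt
    FROM (SELECT cr.*, count(*) OVER (PARTITION BY user_id) AS cnt
          FROM cars cr) cr
    WHERE cr.cnt <= 2
      AND cr.created_at >= '2017-01-01'
      AND cr.created_at <  '2017-02-18'
    ORDER BY id
""").fetchall()
print(rows)  # [(1, 1, 2), (2, 1, 2)] -- user 2 exceeds the car limit
```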

Group by a list in the where clause and then show the row whether it has a value or is null

I was working on this query for a while and I'm trying to figure out how to get it to show all the initials in the WHERE clause in the output table, regardless of whether that person has a p.Id or not. This is because I will be putting this information into a pre-formatted Excel table. Thanks in advance! EDITED:
SELECT
e.Initials, COUNT(p.Id) As "NT > 7"
FROM
Employees AS e
LEFT JOIN
Projects AS p
ON
e.Id = p.NTEmployeeId AND
cast(p.NTDate as Datetime) < cast(dateadd(day, -7, (getdate())) as Datetime)
JOIN
Statuses AS s
ON
s.Id = p.StatusId AND
s.Code in ('WIPR', 'RISK', 'PEND')
WHERE
e.Initials IN('af', 'cm' , 'jy','br','dfv','rxc','tm','axk','hd','sa','rw')
GROUP BY (e.Initials);
You need an additional left join and to move the conditions on all but the first table into on clauses. Remember, the where clause will turn the outer joins to inner joins, because NULL values don't (generally) match conditions.
SELECT e.Initials, COUNT(p.Id) As "NT > 7"
FROM Employees e LEFT JOIN
Projects p
ON e.Id = p.NTEmployeeId AND
cast(p.NTDate as Date) < cast(dateadd(day, -7, getdate()) as date) -- do date comparisons as dates, not strings
LEFT JOIN -- NEED LEFT JOIN HERE, or `ON` clause turns it into inner join
Statuses s
ON s.Id = p.StatusId AND
s.Code in ('WIPR', 'RISK', 'PEND') -- `LIKE`/`OR` isn't wrong but `IN` is easier
WHERE e.Initials IN ('af', 'cm' , 'jy', 'br','dfv', 'rxc', 'tm', 'axk', 'hd', 'sa', 'rw')
GROUP BY e.Initials;
I made a couple other changes:
Use dates for comparing dates, not strings. I think I have the logic right.
Use IN instead of chains of ORs if you can.
And for what you are doing, you might want to be counting the matching statuses rather than the matching people.
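To see why the placement of the filter matters, here is a minimal sketch in SQLite via Python's sqlite3 on made-up rows:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Employees (Id INTEGER PRIMARY KEY, Initials TEXT);
CREATE TABLE Projects  (Id INTEGER PRIMARY KEY, NTEmployeeId INT, StatusId INT);
CREATE TABLE Statuses  (Id INTEGER PRIMARY KEY, Code TEXT);

INSERT INTO Employees VALUES (1, 'af'), (2, 'cm');   -- 'cm' has no projects
INSERT INTO Statuses  VALUES (1, 'WIPR'), (2, 'DONE');
INSERT INTO Projects  VALUES (10, 1, 1), (11, 1, 2);
""")

# Filter in WHERE: NULLs from the outer join fail the IN test, so employee
# 'cm' silently disappears -- the LEFT JOIN behaves like an INNER JOIN.
where_version = con.execute("""
    SELECT e.Initials, count(p.Id)
    FROM Employees e
    LEFT JOIN Projects p ON p.NTEmployeeId = e.Id
    LEFT JOIN Statuses s ON s.Id = p.StatusId
    WHERE s.Code IN ('WIPR', 'RISK', 'PEND')
    GROUP BY e.Initials
""").fetchall()

# Filter in ON: unmatched rows stay NULL, and 'cm' keeps its zero count.
on_version = con.execute("""
    SELECT e.Initials, count(s.Id)
    FROM Employees e
    LEFT JOIN Projects p ON p.NTEmployeeId = e.Id
    LEFT JOIN Statuses s ON s.Id = p.StatusId
                        AND s.Code IN ('WIPR', 'RISK', 'PEND')
    GROUP BY e.Initials
""").fetchall()

print(dict(where_version))  # {'af': 1}
print(dict(on_version))     # {'af': 1, 'cm': 0}
```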
I figured out what I needed to do to get the correct answer to display. Please see below:
SELECT
e.Initials, COUNT(CASE WHEN s.Code in ('WIPR', 'RISK', 'PEND')
and cast(p.NTDate as Datetime) < cast(dateadd(day, -7, (getdate())) as Datetime) THEN p.Id END) As "NT > 7"
FROM
Employees AS e
LEFT JOIN
Projects AS p
ON
e.Id = p.NTEmployeeId
JOIN
Statuses AS s
ON
s.Id = p.StatusId
WHERE
e.Initials IN('af', 'cm' , 'jy','br','dfv','rxc','tm','axk','hxd','rw')
GROUP BY (e.Initials);

Solving SQL Query Decision making sql

I have a problem.
I need to select all my ContoNames from Conto that were active between FromDate and Todate.
And if I don't get a match in that join, I need either to set a NULL or do something else.
But I have read that when using AND in your join expression, the AND condition is evaluated as part of the join, and it works like a WHERE statement, filtering data away.
So I don't get a NULL.
In this example
-- This statement gives 0 rows; there is no ContractId in that time span
SELECT S.Name, C.ContoName, C.fromDate, C.Todate
FROM Sales S
LEFT OUTER JOIN Conto C ON S.ContractId = C.ContractId AND '2014-12-31' BETWEEN C.fromDate AND C.Todate
This is what I want to achieve: either get a NULL value in my left join, or do some kind of decision flow:
if no ContractId in the time span then
only join on ContractId
SELECT S.Name, C.ContoName, C.fromDate, C.Todate
FROM Sales S
INNER JOIN Conto C ON S.ContractId = C.ContractId AND ( '2014-12-31' BETWEEN C.fromDate AND C.Todate OR S.ContractId = C.ContractId )
Not realistic: the AND operator works like a WHERE statement and filters data before it does the JOIN.
SELECT S.Name, coalesce(C.ContoName, C1.ContoName) AS ContoName, C.fromDate, C.Todate
FROM Sales S
LEFT JOIN Conto C ON S.ContractId = C.ContractId
  AND '2014-12-31' BETWEEN C.fromDate AND C.Todate
LEFT JOIN Conto C1 ON S.ContractId = C1.ContractId
Does anyone have a good idea to solve this in a nice way using T-SQL or standard SQL?
if no contractId in Timespan then Only join on contractId
I am reading this to mean "If there are contractIDs in the Timespan, then only show those. If there are none in the timespan, then show the ones that aren't in the timespan."
If I'm reading that wrong, then you need to clarify your question.
If I'm right, then you handle this with either a CASE or, as I will show, with an (AND) OR (AND) structure:
SELECT S.Name, C.ContoName,C.fromDate,C.Todate
From Sales S
LEFT JOIN Conto C
ON (
S.ContractId = C.ContractId
AND '2014-12-31' BETWEEN c.fromDate AND C.Todate
) OR (
S.ContractId = C.ContractId
AND NOT EXISTS(
SELECT * FROM Conto c1
WHERE S.ContractId = c1.ContractId
AND '2014-12-31' BETWEEN c1.fromDate AND c1.Todate
)
)
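A runnable check of the (AND) OR (AND) fallback pattern on invented rows, using SQLite through Python's sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Sales (ContractId INT, Name TEXT);
CREATE TABLE Conto (ContractId INT, ContoName TEXT, fromDate TEXT, Todate TEXT);

INSERT INTO Sales VALUES (1, 'Alice'), (2, 'Bob');
INSERT INTO Conto VALUES
  (1, 'current', '2014-01-01', '2015-12-31'),
  (1, 'expired', '2010-01-01', '2010-12-31'),
  (2, 'old',     '2011-01-01', '2011-12-31');  -- Bob: nothing covers the date
""")

# First branch: conto rows covering the date. Second branch: fall back to any
# conto row for the contract, but only when no covering row exists at all.
rows = con.execute("""
    SELECT S.Name, C.ContoName
    FROM Sales S
    LEFT JOIN Conto C
      ON (S.ContractId = C.ContractId
          AND '2014-12-31' BETWEEN C.fromDate AND C.Todate)
      OR (S.ContractId = C.ContractId
          AND NOT EXISTS (SELECT 1 FROM Conto c1
                          WHERE c1.ContractId = S.ContractId
                            AND '2014-12-31' BETWEEN c1.fromDate AND c1.Todate))
    ORDER BY S.Name, C.ContoName
""").fetchall()
print(rows)  # [('Alice', 'current'), ('Bob', 'old')]
```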

Duplicate Values in SQL

I'm using this query and have used SELECT DISTINCT to ensure no duplicates are pulled from the database.
However, on my QTD column the number is sometimes twice the proper amount.
Is this an error with the server, or would my query be incorrect?
SELECT DISTINCT ad.eid, MAX(u1.email) as ops,MAX(u2.email) as rep,
(SUM(ad.cost)) as qtd_spend,
Sum(case when day < current_date AND day >='2015-01-01' then cost else 0 end) as MTD,
AVG(case when day < current_date AND day >= current_date-7 then cost else null end) as weekly_spend
FROM adcube as ad
inner JOIN advertisables as a on ad.eid = a.eid
LEFT JOIN organizations as o on o.id = a.id
LEFT outer JOIN users as u1 on o.ops_organization_id = u1.organization_id
LEFT outer JOIN users as u2 on o.sales_organization_id = u2.organization_id
WHERE day >='2015-01-01' and day < current_date
GROUP BY eid
You must have GROUP BY if you have aggregate functions (such as SUM or MAX) together with plain columns; and with GROUP BY in place, the SELECT DISTINCT is redundant.
The problem is most likely in your JOINs.
I am not familiar with your data structure, but I am assuming that your advertisables table contains (or can contain) more than one entry with the same eid — is that correct? Or do you have a constraint preventing it?
If so, then even if you have only one entry in the adcube table, once it joins with the multiple entries in the advertisables table it produces two records (or however many match), and the aggregates at the select level of the statement then sum both (or more) rows.
So you should take the duplicates out of the joined tables, or factor that into account.
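A tiny demonstration of the fan-out effect with made-up numbers, using SQLite through Python's sqlite3:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE adcube        (eid INT, cost REAL);
CREATE TABLE advertisables (eid INT);            -- duplicated eid fans out the join

INSERT INTO adcube        VALUES (1, 10.0), (1, 5.0);
INSERT INTO advertisables VALUES (1), (1);       -- two rows with the same eid
""")

# Each adcube row matches BOTH advertisables rows, so 2 x 2 = 4 joined rows
# reach the aggregate and the sum comes out doubled.
inflated = con.execute("""
    SELECT ad.eid, sum(ad.cost)
    FROM adcube ad
    JOIN advertisables a ON a.eid = ad.eid
    GROUP BY ad.eid
""").fetchone()
print(inflated)  # (1, 30.0) -- the true total is 15.0
```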
EDIT:
OK, glad to know that is the problem. You will not fix it with an INNER JOIN either. You will have to use an inline select statement.
The best way to solve this, from what I understand you are trying to do, is the following:
SELECT ad.eid
, (
select max(u1.email)
from advertisables as a
LEFT JOIN organizations as o on o.id = a.id
LEFT outer JOIN users as u1 on o.ops_organization_id = u1.organization_id
LEFT outer JOIN users as u2 on o.sales_organization_id = u2.organization_id
where a.eid = ad.eid
) as ops
, (
select max(u2.email)
from advertisables as a
LEFT JOIN organizations as o on o.id = a.id
LEFT outer JOIN users as u1 on o.ops_organization_id = u1.organization_id
LEFT outer JOIN users as u2 on o.sales_organization_id = u2.organization_id
where a.eid = ad.eid
) as rep
, (SUM(ad.cost)) as qtd_spend
, Sum(case when day < current_date AND day >='2015-01-01' then cost else 0 end) as MTD
, AVG(case when day < current_date AND day >= current_date-7 then cost else null end) as weekly_spend
FROM adcube as ad
WHERE day >='2015-01-01' and day < current_date
GROUP BY eid