Switching to Vertica from MySQL, aggregate in WHERE clause not working

Recently we have switched to Vertica from MySQL. I am lost on how to re-create the <=30 check inside the where clause in the query below. This currently does not work in Vertica, but does in MySQL.
Essentially, a user owns cars and cars have parts. I want to total the amount of cars and car parts in a timeframe, but only for users who have less than or equal to 30 cars.
select
    count(distinct cr.id) as 'Cars',
    count(distinct cp.id) as 'Car Parts'
from users u
inner join user_emails ue on u.id = ue.user_id
inner join cars cr on cr.user_id = u.id
inner join car_parts cp on cp.car_id = cr.id
where
    (select count(*) from cars where cars.user_id = u.id) <= 30
    and ue.is_real = true and ue.is_main = true
    and cr.created_at >= '2017-01-01 00:00:00'
    and cr.created_at <= '2017-02-17 23:59:59'
Any help or guidance is greatly appreciated!
Before my mouse flies away and my monitor goes blank, I get this error:
ERROR: Correlated subquery with aggregate function COUNT is not supported

You wouldn't use a subquery this way in Vertica. You would use a window function instead:
select count(distinct cr.id) as Cars,
       count(distinct cp.id) as CarParts
from users u
join user_emails ue on u.id = ue.user_id
join (select cr.*, count(*) over (partition by user_id) as cnt
      from cars cr
     ) cr on cr.user_id = u.id
join car_parts cp on cp.car_id = cr.id
where cr.cnt <= 30 and
      ue.is_real = true and ue.is_main = true and
      cr.created_at >= '2017-01-01' and
      cr.created_at < '2017-02-18';
Notes:
Don't enclose column aliases in single quotes. That is a bug waiting to happen. Only use single quotes for string and date constants.
You can simplify the date logic. Using < is better than <= to capture everything that happens on a particular day.
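If you'd rather avoid the window function, an equivalent rewrite (a sketch under the same schema assumptions, not tested in Vertica) pre-filters the qualifying users in a derived table with GROUP BY/HAVING and joins on it:
select count(distinct cr.id) as Cars,
       count(distinct cp.id) as CarParts
from users u
join (select user_id            -- users owning at most 30 cars
      from cars
      group by user_id
      having count(*) <= 30
     ) ok on ok.user_id = u.id
join user_emails ue on u.id = ue.user_id
join cars cr on cr.user_id = u.id
join car_parts cp on cp.car_id = cr.id
where ue.is_real = true and ue.is_main = true
  and cr.created_at >= '2017-01-01'
  and cr.created_at < '2017-02-18';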

Related

Is there any way to filter an "only" single value without two subqueries?

So my database has transactions from users at different stores, and I would like to obtain ONLY the users that purchased at a single store, not multiple. I am currently working with these two subqueries, but it is a little slow to load.
and user.id in (
    select u.id
    from usertable u
    join ordertable o on o.user_id = u.id
    join storetable s on s.id = o.store_id
    where s.id = ####  -- store id
    and o.created at time zone 'Country/City' < '2020-11-01'
    and o.status not ilike 'canceled%'
    and o.order_kind = 'NORMAL'
)
and u.id not in (
    select u.id
    from usertable u
    join ordertable o on o.user_id = u.id
    join storetable s on s.id = o.store_id
    where s.id not in (####)  -- store id
    and o.created at time zone 'Country/City' < '2020-11-01'
    and o.status not ilike 'canceled%'
    and o.order_kind = 'NORMAL'
)
I would like to obtain the users that made a purchase at store "####" before 11/2020, and only at that store. Any ideas to make this faster?
If I understand correctly, just use aggregation and having:
select o.user_id
from ordertable o
where o.created at time zone 'Country/City' < '2020-11-01' and
o.status not ilike 'canceled%' and
o.order_kind = 'NORMAL'
group by o.user_id
having min(o.store_id) = max(o.store_id) and
min(o.store_id) = ####;
Note that neither users nor stores is needed for the query. All the information is in ordertable.
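An equivalent formulation (same assumptions) states the intent more directly: the user has exactly one distinct store, and it is the one you care about:
select o.user_id
from ordertable o
where o.created at time zone 'Country/City' < '2020-11-01' and
      o.status not ilike 'canceled%' and
      o.order_kind = 'NORMAL'
group by o.user_id
having count(distinct o.store_id) = 1 and
       min(o.store_id) = ####;
That said, min() = max() is usually cheaper to compute than count(distinct), so prefer the first version on large tables.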

Count with subselect really slow in Postgres

I have this query:
SELECT c.name, COUNT(t.id)
FROM Cinema c
JOIN CinemaMovie cm ON cm.cinema_id = c.id
JOIN Ticket t ON cm.id = cinema_movie_id
WHERE cm.id IN (
SELECT cm1.id
FROM CinemaMovie cm1
JOIN Movie m1 ON m1.id = cm1.movie_id
JOIN Ticket t1 ON t1.cinema_movie_id = cm1.id
WHERE m1.name = 'Hellboy'
AND t1.time >= timestamp '2019-04-18 00:00:00'
AND t1.time <= timestamp '2019-04-18 23:59:59' )
GROUP BY c.id;
and the problem is that this query runs really slow (more than 1 minute) when the table has around 20 million rows. From what I understand, the problem seems to be the inner query, as it takes a long time. Also, I have indexes on all foreign keys. What am I missing?
Also note that when I select only by name (I omit the date) everything takes like 10 seconds.
EDIT
What I am trying to do is count the number of tickets for each cinema name, based on the movie name and the timestamp on the ticket.
I don't understand why you are using a subquery. Does this do what you want?
SELECT c.name, COUNT(t.id)
FROM Cinema c
JOIN CinemaMovie cm ON cm.cinema_id = c.id
JOIN Ticket t ON t.cinema_movie_id = cm.id
JOIN Movie m ON m.id = cm.movie_id
WHERE m.name = 'Hellboy' AND
      t.time >= '2019-04-18'::timestamp AND
      t.time < '2019-04-19'::timestamp
GROUP BY c.id, c.name;
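If it is still slow after that, check that the filter and join columns are indexed. A sketch, assuming the column names used above (verify against your actual schema):
CREATE INDEX ON Movie (name);                       -- supports the m.name filter
CREATE INDEX ON Ticket (cinema_movie_id, time);     -- supports the join plus the time range
CREATE INDEX ON CinemaMovie (movie_id, cinema_id);  -- supports both joins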

How to optimize SQL Server query

I am copying data from one table to another table. While copying, I am doing some calculations to modify one column.
SQL Server query:
INSERT INTO rat_proj_duration_map_2
SELECT
r.*,
r.hour_val / (CASE
WHEN week_val = 1 AND
(SELECT TOP 1
hrswk
FROM UserProfileRATinterface_view us
INNER JOIN users u
ON u.username = us.username
WHERE calwk = 2
AND r.uid = u.uid
AND yr = 2016)
> 0 THEN (SELECT TOP 1
hrswk
FROM UserProfileRATinterface_view us
INNER JOIN users u
ON u.username = us.username
WHERE calwk = 2
AND r.uid = u.uid
AND yr = 2016)
WHEN (SELECT
hrswk
FROM UserProfileRATinterface_view us
INNER JOIN users u
ON u.username = us.username
WHERE r.week_val = us.calwk
AND r.uid = u.uid
AND yr = 2016)
< 1 AND
(SELECT
MAX(hrswk)
FROM UserProfileRATinterface_view us
INNER JOIN users u
ON u.username = us.username
WHERE r.uid = u.uid
AND yr = 2016)
> 0 THEN (SELECT
MAX(hrswk)
FROM UserProfileRATinterface_view us
INNER JOIN users u
ON u.username = us.username
WHERE r.uid = u.uid
AND yr = 2016)
WHEN (SELECT
COUNT(*)
FROM UserProfileRATinterface_view us
INNER JOIN users u
ON u.username = us.username
WHERE r.uid = u.uid
AND yr = 2016)
<= 0 THEN 1
ELSE (SELECT
hrswk
FROM UserProfileRATinterface_view us
INNER JOIN users u
ON u.username = us.username
WHERE r.week_val = us.calwk
AND r.uid = u.uid
AND yr = 2016)
END) * 100 AS percentage_val
FROM rat_proj_duration_map r
When I run this query, I get a timeout:
TCP Provider: Timeout error [258]
The SQL Server configuration is not in my hands, so I can't increase the timeout value.
Is it possible to optimize my SQL query?
Are you sure this query is logically correct? You have several TOP 1 subqueries without an ORDER BY, and scalar comparisons against subqueries without TOP (which, I assume, may return more than one row, since you use TOP in other subqueries over the same source).
And yes, this query can be optimized. You can obtain all the values you need with a single subquery and avoid executing the same subqueries over and over for each row of rat_proj_duration_map, as happens now:
INSERT INTO rat_proj_duration_map_2
SELECT
r.*,
r.hour_val / (CASE
WHEN week_val = 1 AND us.min_hrswk_2 > 0
THEN us.min_hrswk_2
WHEN us.min_hrswk_week_val <1
AND max_hrswk > 0
THEN max_hrswk
WHEN us.cnt <= 0
THEN 1
ELSE min_hrswk_week_val
END) * 100 as percentage_val
FROM
rat_proj_duration_map r
OUTER APPLY
(
SELECT
count(*) as cnt,
MIN(CASE WHEN calwk = 2 THEN hrswk END) as min_hrswk_2,
MIN(CASE WHEN calwk = r.week_val THEN hrswk END) as min_hrswk_week_val,
MAX(hrswk) as max_hrswk
FROM UserProfileRATinterface_view us
inner join users u on u.username=us.username
WHERE r.uid=u.uid and yr=2016
) us
But I can't be sure the original logic is correct. The idea of that CASE, to me, looks like this:
...
r.hour_val / COALESCE(NULLIF(us.min_hrswk_2, 0),
NULLIF(us.min_hrswk_week_val, 0), NULLIF(max_hrswk, 0), 1)
...
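Putting the two together, the whole statement might collapse into a COALESCE over the OUTER APPLY columns (an untested sketch; note it drops the week_val = 1 gate, just as the COALESCE snippet above does):
INSERT INTO rat_proj_duration_map_2
SELECT
    r.*,
    r.hour_val / COALESCE(NULLIF(us.min_hrswk_2, 0),
                          NULLIF(us.min_hrswk_week_val, 0),
                          NULLIF(us.max_hrswk, 0), 1) * 100 as percentage_val
FROM
    rat_proj_duration_map r
OUTER APPLY
(
    SELECT
        MIN(CASE WHEN calwk = 2 THEN hrswk END) as min_hrswk_2,
        MIN(CASE WHEN calwk = r.week_val THEN hrswk END) as min_hrswk_week_val,
        MAX(hrswk) as max_hrswk
    FROM UserProfileRATinterface_view us
    inner join users u on u.username = us.username
    WHERE r.uid = u.uid and yr = 2016
) us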
The subqueries in your CASE clause seem to be essentially the same. You could simplify the whole command by defining a grouped version (... where yr = 2016 group by u.uid) of this subquery (preferably as a common table expression) and then work with that. This could save a lot of redundant operations.
The following might work (have not tested it):
;WITH usrall as (
    SELECT u.uid ui, us.hrswk hw, us.calwk cw
    FROM UserProfileRATinterface_view us
    INNER JOIN users u on u.username = us.username
    WHERE yr = 2016
), usrgrp as (
    SELECT ui gui, MAX(hw) ghw, count(*) gcnt
    FROM usrall
    GROUP BY ui
)
SELECT r.*,
       r.hour_val / COALESCE(NULLIF(w2.hw, 0), NULLIF(wkcw.hw, 0), NULLIF(g.ghw, 0), 1)
           * 100 AS percentage_val
FROM rat_proj_duration_map r
LEFT JOIN usrgrp g ON g.gui = r.uid
LEFT JOIN usrall w2 ON w2.ui = r.uid AND w2.cw = 2    -- assumes one row per (uid, calwk)
LEFT JOIN usrall wkcw ON wkcw.ui = r.uid AND wkcw.cw = r.week_val
Essentially I have tried (I hope it works :-/) to replace the CASE construct with a COALESCE() call that checks the three possible calculated values one after the other; the first non-null value is accepted.
As I said: I have not tested it. Good luck

Count columns of joined table

I am writing a query to summarize the data in a Postgres database:
SELECT products.id,
products.NAME,
product_types.type_name AS product_type,
delivery_types.delivery,
products.required_selections,
Count(s.id) AS selections_count,
Sum(CASE
WHEN ss.status = 'WARNING' THEN 1
ELSE 0
END) AS warning_count
FROM products
JOIN product_types
ON product_types.id = products.product_type_id
JOIN delivery_types
ON delivery_types.id = products.delivery_type_id
LEFT JOIN selections_products sp
ON products.id = sp.product_id
LEFT JOIN selections s
ON s.id = sp.selection_id
LEFT JOIN selection_statuses ss
ON ss.id = s.selection_status_id
LEFT JOIN listings l
ON ( s.listing_id = l.id
     AND l.local_date_time BETWEEN To_timestamp('2014/12/01', 'YYYY/mm/DD')
                               AND To_timestamp('2014/12/30', 'YYYY/mm/DD') )
GROUP BY products.id,
product_types.type_name,
delivery_types.delivery
Basically, we have a product with selections; these selections have listings, and the listings have a local_date_time. I need a list of all products and how many listings they have between the two dates. No matter what I do, I get a count of all selections (a total). I feel like I'm overlooking something. The same concept goes for warning_count. Also, I don't really understand why Postgres requires me to add a group by here.
The schema looks like this (the parts you would care about anyway):
products
    name:string, product_type_id:fk, required_selections:integer, delivery_type_id:fk
selections_products
    product_id:fk, selection_id:fk
selections
    selection_status_id:fk, listing_id:fk
selection_statuses
    status:string
listings
    local_date_time:datetime
The way you have it, you LEFT JOIN to all selections regardless of listings.local_date_time.
There is room for interpretation; we would need to see the actual table definitions with all constraints and data types to be sure. Going out on a limb, my educated guess is that you can fix your query with parentheses in the FROM clause to prioritize joins:
SELECT p.id
, p.name
, pt.type_name AS product_type
, dt.delivery
, p.required_selections
, count(s.id) AS selections_count
, sum(CASE WHEN ss.status = 'WARNING' THEN 1 ELSE 0 END) AS warning_count
FROM products p
JOIN product_types pt ON pt.id = p.product_type_id
JOIN delivery_types dt ON dt.id = p.delivery_type_id
LEFT JOIN ( -- LEFT JOIN!
selections_products sp
JOIN selections s ON s.id = sp.selection_id -- INNER JOIN!
JOIN listings l ON l.id = s.listing_id -- INNER JOIN!
AND l.local_date_time >= '2014-12-01'
AND l.local_date_time < '2014-12-31'
LEFT JOIN selection_statuses ss ON ss.id = s.selection_status_id
) ON sp.product_id = p.id
GROUP BY p.id, pt.type_name, dt.delivery;
This way, you first eliminate all selections outside the given time frame with [INNER] JOIN before you LEFT JOIN to products, thus keeping all products in the result, including those that aren't in any applicable selection.
Related:
Join four tables involving LEFT JOIN without duplicates
While selecting all or most products, this can be rewritten to be faster:
SELECT p.id
, p.name
, pt.type_name AS product_type
, dt.delivery
, p.required_selections
, COALESCE(s.selections_count, 0) AS selections_count
, COALESCE(s.warning_count, 0) AS warning_count
FROM products p
JOIN product_types pt ON pt.id = p.product_type_id
JOIN delivery_types dt ON dt.id = p.delivery_type_id
LEFT JOIN (
SELECT sp.product_id
, count(*) AS selections_count
, count(*) FILTER (WHERE ss.status = 'WARNING') AS warning_count
FROM selections_products sp
JOIN selections s ON s.id = sp.selection_id
JOIN listings l ON l.id = s.listing_id
LEFT JOIN selection_statuses ss ON ss.id = s.selection_status_id
WHERE l.local_date_time >= '2014-12-01'
AND l.local_date_time < '2014-12-31'
GROUP BY 1
) s ON s.product_id = p.id;
It's cheaper to aggregate and count selections and warnings per product_id first, and then join to products. (Unless you only retrieve a small selection of products, then it's cheaper to reduce related rows first.)
Related:
Why does the following join increase the query time significantly?
Also, I don't really understand why Postgres requires me to add a group by here.
Since Postgres 9.1, the PK column in GROUP BY covers all columns of the same table. That does not cover columns of other tables, even if they are functionally dependent. You need to list those explicitly in GROUP BY if you don't want to aggregate them.
My second query avoids this problem on the outset by aggregating before the join.
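To illustrate the rule (a minimal sketch, assuming products.id is the primary key):
SELECT p.id, p.name, pt.type_name, count(*)
FROM products p
JOIN product_types pt ON pt.id = p.product_type_id
GROUP BY p.id, pt.type_name;  -- p.name is covered by the PK p.id; pt.type_name must be listed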
Aside: chances are, this doesn't do what you want:
l.local_date_time BETWEEN To_timestamp('2014/12/01', 'YYYY/mm/DD')
AND To_timestamp('2014/12/30', 'YYYY/mm/DD')
Since local_date_time seems to be of type timestamp (not timestamptz!), you would include '2014-12-30 00:00', but exclude the rest of the day '2014-12-30'. And it's always better to use ISO 8601 format for dates and timestamps, which means the same with every locale and datestyle setting. Hence:
WHERE l.local_date_time >= '2014-12-01'
AND l.local_date_time < '2014-12-31'
This includes all of '2014-12-30', and nothing else. No idea why you chose to exclude '2014-12-31'. Maybe you really want to include all of Dec. 2014?
WHERE l.local_date_time >= '2014-12-01'
AND l.local_date_time < '2015-01-01'

Duplicate Values in SQL

I'm using this query and have used SELECT DISTINCT to ensure no duplicates are pulled from the database.
However, in my QTD column the number is sometimes 2x the proper amount.
Is this an error with the server, or is my query incorrect?
SELECT DISTINCT ad.eid, MAX(u1.email) as ops,MAX(u2.email) as rep,
(SUM(ad.cost)) as qtd_spend,
Sum(case when day < current_date AND day >='2015-01-01' then cost else 0 end) as MTD,
AVG(case when day < current_date AND day >= current_date-7 then cost else null end) as weekly_spend
FROM adcube as ad
inner JOIN advertisables as a on ad.eid = a.eid
LEFT JOIN organizations as o on o.id = a.id
LEFT outer JOIN users as u1 on o.ops_organization_id = u1.organization_id
LEFT outer JOIN users as u2 on o.sales_organization_id = u2.organization_id
WHERE day >='2015-01-01' and day < current_date
GROUP BY eid
You must have GROUP BY if you have aggregate functions (such as SUM or MAX) mixed with non-aggregated columns.
What is likely the problem is in your JOINs.
I am not familiar with your data structure, but I am assuming that your advertisables table contains (or CAN contain) more than one entry with the same "eid". Is this correct, or do you have a constraint?
If this is correct, then even if you have only ONE entry in the "adcube" table, once it JOINs with the multiple matching entries in the "advertisables" table, it pulls up TWO records (or however many match), and the aggregates at the SELECT level of the statement then sum BOTH (or more) rows.
So you should take the duplicates out of the joined tables or factor that into account.
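To see the effect with toy data (hypothetical rows, not your schema): suppose adcube has one row (eid = 1, cost = 100) and advertisables has two rows with eid = 1.
SELECT SUM(ad.cost)                      -- returns 200, not 100:
FROM adcube ad                           -- the single cost row is matched
JOIN advertisables a ON a.eid = ad.eid;  -- once per advertisables row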
EDIT:
Ok, well, glad to know that is the problem. You will not fix it with an INNER JOIN either. You will have to use an inline select statement.
The best way to solve this, from what I understand you are trying to do, is do the following:
SELECT ad.eid
    , (select max(u1.email)
       from advertisables as a
       LEFT JOIN organizations as o on o.id = a.id
       LEFT OUTER JOIN users as u1 on o.ops_organization_id = u1.organization_id
       where a.eid = ad.eid
      ) as ops
    , (select max(u2.email)
       from advertisables as a
       LEFT JOIN organizations as o on o.id = a.id
       LEFT OUTER JOIN users as u2 on o.sales_organization_id = u2.organization_id
       where a.eid = ad.eid
      ) as rep
    , SUM(ad.cost) as qtd_spend
    , SUM(case when day < current_date AND day >= '2015-01-01' then cost else 0 end) as MTD
    , AVG(case when day < current_date AND day >= current_date - 7 then cost else null end) as weekly_spend
FROM adcube as ad
WHERE day >= '2015-01-01' and day < current_date
GROUP BY eid
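Alternatively (a sketch that keeps the original join logic, untested against your schema), pre-aggregate one email pair per eid in a derived table and then join. The fan-out disappears because the derived table carries at most one row per eid:
SELECT ad.eid
    , x.ops
    , x.rep
    , SUM(ad.cost) as qtd_spend
    , SUM(case when day < current_date AND day >= '2015-01-01' then cost else 0 end) as MTD
    , AVG(case when day < current_date AND day >= current_date - 7 then cost else null end) as weekly_spend
FROM adcube as ad
LEFT JOIN (
    select a.eid
        , max(u1.email) as ops
        , max(u2.email) as rep
    from advertisables as a
    LEFT JOIN organizations as o on o.id = a.id
    LEFT JOIN users as u1 on o.ops_organization_id = u1.organization_id
    LEFT JOIN users as u2 on o.sales_organization_id = u2.organization_id
    group by a.eid
) x on x.eid = ad.eid
WHERE day >= '2015-01-01' and day < current_date
GROUP BY ad.eid, x.ops, x.rep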