Group By seems to add an inordinate amount of computation in a simple Postgres query

I have a Postgres query as such:
select id
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
group by id
order by id;
It's nothing fancy but works fine -- only takes about 3 minutes when querying a large database to return the result.
My question is: why does this slight modification to the above code cause the query time to more than quadruple?
select id, b.ad_description
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
group by id, b.ad_description
order by id;
What is going on? The mere inclusion of one simple column of (albeit unique) information is bogging my query down. I am somehow asking Postgres to do a tremendously larger amount of work. For the life of me, I don't see how.
I'd like to preemptively apologize for not including any raw data. I'm hoping this simplified example of what I'm really facing is clear enough for some kind soul to make an enlightening comment. I can say that I'm going over a million rows in each table.
Thanks in advance.

I think the size of the result set is the issue. Since your result set now has a million-ish rows, that would explain the extra time (although it does seem excessive). A couple of things: first, the first column in the select is not bound to a range variable; it probably doesn't matter, but you might want to make it select b.id, b.ad_description instead of id, b.ad_description. Second, I normally use a GROUP BY when one of the selected columns is an aggregate that isn't in the GROUP BY clause, like a COUNT() or something. Maybe I'm missing something, but you might get the same result with:
select distinct b.id, b.ad_description
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
order by b.id, b.ad_description;
You might want to play with some of the settings in the postgresql.conf file to give queries more working memory (work_mem, for example).
Finally, I have a hunch that making the ORDER BY match the GROUP BY might also help performance.
A LIMIT 5 wouldn't help much with performance, because the entire query would need to finish before the first five rows could come out.
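To see where the time is actually going, comparing the two plans with EXPLAIN ANALYZE is the standard first step; a sketch (run it on both variants and compare):
-- Look at the aggregate node: a HashAggregate that fits in work_mem is fast,
-- while a Sort + GroupAggregate that spills to disk ("external merge") is not.
EXPLAIN (ANALYZE, BUFFERS)
select id, b.ad_description
from ads_1 as a
join ads_2 as b on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
group by id, b.ad_description
order by id;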
-g

SQL optimization problem, which of the two solutions below is the most efficient?

I have the table from the image, I need to group the data by CPF and date and know if the CPFs had at least one login_ok = true on a specific date. Both solutions below satisfy my need but the goal is to find the best query.
We can have multiple login_ok = true and login_ok = false for CPFs on a specific date. I just need to know if there was at least one login_ok = true
I already have two solutions; I want to discuss how to make a third, more efficient one.
Maybe this would work for your problem:
SELECT
    t2.CPF,
    t2.data
FROM (
    SELECT
        CPF,
        date(data) AS data
    FROM db_risco.site_rn_login
    WHERE login_ok
) t2
GROUP BY 1, 2
ORDER BY t2.data
DISTINCT would also work, and I doubt it would pose any performance threat in your case. Usually it evaluates expressions (like date(data)) before checking for uniqueness.
By using a subquery, in this case, you can select upfront which CPFs to include and then extract the date. Finally, you'd GROUP BY over a much smaller number of rows, since those were filtered beforehand.
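For comparison, the DISTINCT version of the same idea might look like this (just a sketch, using the table and columns from the question):
SELECT DISTINCT CPF, date(data) AS data
FROM db_risco.site_rn_login
WHERE login_ok
ORDER BY data;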
PostgreSQL has the function BOOL_OR to check whether the expression is true for at least one row. It is likely to be optimised for this kind of task.
select cpf, date(data) as data, bool_or(login_ok) as status_login
from db_risco.site_rn_login
group by cpf, date(data);
An index on (cpf, date(data)) or even on (cpf, date(data), login_ok) could help speed up the query.
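If you try that, note that Postgres requires an extra set of parentheses around an expression in an index definition, and that date(data) is only indexable if data is a plain timestamp (for timestamptz the cast isn't immutable). A sketch, with a made-up index name:
-- expression indexes need their own parentheses
CREATE INDEX idx_login_cpf_day
    ON db_risco.site_rn_login (cpf, (date(data)), login_ok);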
On a side note: you may also want to order your results with ORDER BY. Don't rely on GROUP BY doing this; the order of the rows resulting from a query is arbitrary without an ORDER BY clause.

How do I find the previous line without writing an inefficient subquery?

So I have this query (and have encountered or written a bunch of similar ones over the years :) ), which is extremely inefficient due to the correlated subquery.
I'm running pgsql currently but have had this issue with mysql and mssql as well.
Sometimes I can use MAX(), but here I have two different columns: runners.id (the one I need to find my data) and runners.updated_at (the one I could apply MAX() to).
Any tips?
SELECT
    ROUND(CAST(AVG(DATE_PART('day', current_claim_event.updated_at - claims.created_at)) AS NUMERIC), 1)
        AS average,
    COUNT(*)
FROM claims_events current_claim_event
INNER JOIN claims ON claims.id = current_claim_event.claim_id
WHERE current_claim_event.id = (
    SELECT runners.id
    FROM claims_events runners
    WHERE runners.claim_id = current_claim_event.claim_id
    ORDER BY runners.updated_at DESC
    LIMIT 1
);
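One common set-based rewrite in Postgres is to compute the latest event per claim once with DISTINCT ON and join to it, rather than re-running the subquery per row. A sketch, assuming the schema implied above:
-- latest claims_events row per claim, computed in one pass
SELECT
    ROUND(CAST(AVG(DATE_PART('day', latest.updated_at - claims.created_at)) AS NUMERIC), 1)
        AS average,
    COUNT(*)
FROM claims
INNER JOIN (
    SELECT DISTINCT ON (claim_id) claim_id, updated_at
    FROM claims_events
    ORDER BY claim_id, updated_at DESC
) AS latest ON latest.claim_id = claims.id;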

Order by in subquery behaving differently than native sql query?

So I am honestly a little puzzled by this!
I have a query that returns a set of transactions containing both repair costs and an odometer reading at the time of repair at the master level. To get an accurate cost-per-mile figure, I need a subquery to get both the first meter reading between a start date and an end date, and an ending meter reading.
(select top 1 wf2.ro_num
 from wotrans wotr2
 left join wofile wf2
   on wotr2.rop_ro_num = wf2.ro_num
   and wotr2.rop_fac = wf2.ro_fac
 where wotr.rop_veh_num = wotr2.rop_veh_num
   and wotr.rop_veh_facility = wotr2.rop_veh_facility
   and ((#sdate = '01/01/1900 00:00:00' and wotr2.rop_tran_date = 0)
     or ([dbo].[udf_RTA_ConvertDateInt](#sdate) <= wotr2.rop_tran_date
       and [dbo].[udf_RTA_ConvertDateInt](#edate) >= wotr2.rop_tran_date))
 order by wotr2.rop_tran_date asc) as highMeter
The reason I have the tables aliased as xx2 is because those tables are also used in the main query, and I don't want these to interact with each other except to pull the correct vehicle number and facility.
Basically, when I run the main query it returns a value that is not correct; it returns the one that is second (keep in mind that the first and second rows have the same date). But when I take the subquery, copy and paste it into its own query, and run it, it returns the correct value.
I do have a workaround for this, but I am just curious as to why this is happening. I have searched quite a bit and found not much (other than the fact that people don't like ORDER BY in subqueries). Talking to one of my friends who also does quite a bit of SQL scripting, it looks to us as if the subquery orders differently inside the main query than it does by itself when multiple rows share the same value for the ORDER BY (i.e. ten dates of 08/05/2016).
Any ideas would be helpful!
Like I said, I have a workaround that works in this one case, but I don't know yet if it will work on a larger dataset.
Let me know if you want more code.
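That behavior is expected: with top 1 ... order by, ties on rop_tran_date leave the winning row unspecified, and the optimizer may pick different rows in different plan contexts. The usual fix (a sketch, not from the thread) is to add a deterministic tie-breaker to the subquery's order by:
-- same subquery as above, with a unique second sort key so the
-- "top 1" row is stable across execution plans
 order by wotr2.rop_tran_date asc, wf2.ro_num asc) as highMeter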

Subqueries and AVG() on a subtraction

Working on a query to return the average time between when an employee begins his/her shift and when they arrive at the first home (this DB assumes they are salesmen).
What I have:
SELECT l.OFFICE_NAME, crew.EMPLOYEE_NAME -- , avg(first arrival time)
FROM LOCAL_OFFICE l, CREW_WORK_SCHEDULE crew
WHERE l.LOCAL_OFFICE_ID = crew.LOCAL_OFFICE_ID
You can see the AVG() command is commented out, because I know the time that they arrive at work, and the time they get to the first house, and can find the value using this:
(SELECT MIN(c.ARRIVE)
FROM ORDER_STATUS c
WHERE c.USER_ID = crew.CREW_ID)
-(SELECT START_TIME
FROM CREW_SHIFT_CODES
WHERE WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE)
Would the best way be to simply put the above into the the AVG() parentheses? Just trying to learn the best methods to create queries. If you want more info on any of the tables, etc. just ask, but hopefully they're all named so you know what they're returning.
As per my comment, the example you gave would only return one record to the AVG function, and so not do very much.
If the sub-query was returning multiple records, however, your suggestion of placing the sub-query inside the AVG() would work...
SELECT
AVG((SELECT MIN(sub.val) FROM sub WHERE sub.id = main.id GROUP BY sub.group))
FROM
main
GROUP BY
main.group
(Averaging a set of minima, and so requiring two levels of GROUP BY.)
In many cases this gives good performance, and is maintainable. But sometimes the sub-query grows large, and it can be better to reformat it using an inline view...
SELECT
main.group,
AVG(sub_query.val)
FROM
main
INNER JOIN
(
SELECT
sub.id,
sub.group,
MIN(sub.val) AS val
FROM
sub
GROUP BY
sub.id,
sub.group
)
AS sub_query
ON sub_query.id = main.id
GROUP BY
main.group
Note: Although this looks as though the inline view will calculate a lot of values that are not needed (and so be inefficient), most RDBMSs optimise this so only the required records get processed. (The optimiser knows how the inner query is being used by the outer query, and builds the execution plan accordingly.)
Don't think in subqueries: they're often quite slow. In effect, they are row-by-row (RBAR) operations rather than set-based. Instead:
Join all the tables together.
I've used a derived table to calculate the first arrival time.
Aggregate.
Something like
SELECT
l.OFFICE_NAME, crew.EMPLOYEE_NAME,
AVG(os.minARRIVE - cs.START_TIME)
FROM
LOCAL_OFFICE l
JOIN
CREW_WORK_SCHEDULE crew ON l.LOCAL_OFFICE_ID = crew.LOCAL_OFFICE_ID
JOIN
CREW_SHIFT_CODES cs ON cs.WORK_SHIFT_CODE = crew.WORK_SHIFT_CODE
JOIN
(SELECT MIN(ARRIVE) AS minARRIVE, USER_ID
FROM ORDER_STATUS
GROUP BY USER_ID
) os ON os.USER_ID = crew.CREW_ID
GROUP BY
l.OFFICE_NAME, crew.EMPLOYEE_NAME
This probably won't give correct data because of the minARRIVE grouping: there isn't enough info from ORDER_STATUS to show "which day" or "which shift". It's simply "first arrival for that user for all time"; see the sketch below.
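If ARRIVE is a datetime, one way to scope the minimum to a calendar day might be to group the derived table by day as well (a sketch; arrive_day is a made-up alias, and CAST(... AS DATE) assumes a reasonably recent SQL Server):
(SELECT USER_ID,
        CAST(ARRIVE AS DATE) AS arrive_day,
        MIN(ARRIVE) AS minARRIVE
 FROM ORDER_STATUS
 GROUP BY USER_ID, CAST(ARRIVE AS DATE)
) os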
Edit:
This will give you the average in minutes.
You can add this back to minARRIVE using DATEADD, or change it to hh:mm with some % 60 (modulo) and / 60 (integer divide) arithmetic.
AVG(
    DATEDIFF(minute, cs.START_TIME, os.minARRIVE)
)

Slow Query - Help with Optimization

Hey guys. This is a follow-on from this question:
After getting the right data and making some tweaks based on requests from the business, I've now got this mini-beast on my hands. This query should return the total number of new jobseeker registrations and the number of newly uploaded CVs:
SELECT COUNT(j.jobseeker_id) as new_registrations,
(
SELECT
COUNT(c.cv_id)
FROM
tb_cv as c, tb_jobseeker, tb_industry
WHERE
UNIX_TIMESTAMP(c.created_at) >= '1241125200'
AND
UNIX_TIMESTAMP(c.created_at) <= '1243717200'
AND
tb_jobseeker.industry_id = tb_industry.industry_id
)
AS uploaded_cvs
FROM
tb_jobseeker as j, tb_industry as i
WHERE
j.created_at BETWEEN '2009-05-01' AND '2009-05-31'
AND
i.industry_id = j.industry_id
GROUP BY i.description, MONTH(j.created_at)
Notes:
- The two values in the UNIX_TIMESTAMP() calls are passed in as parameters from the report module in our backend.
Every time I run it, MySQL chokes and lingers silently into the ether of the Interweb.
Help is appreciated.
Update: Hey guys. Thanks a lot for all the thoughtful and helpful comments. I'm only 2 weeks into my role here, so I'm still learning the schema. So, this query is somewhere between a thumbsuck and an educated guess. Will start to answer all your questions now.
tb_cv is not connected to the other tables in the sub-query. I guess this is the root cause for the slow query. It causes generation of a Cartesian product, yielding a lot more rows than you probably need.
Other than that I'd say you need indexes on tb_jobseeker.created_at, tb_cv.created_at and tb_industry.industry_id, and you might want to get rid of the UNIX_TIMESTAMP() calls in the sub-query since they prevent use of an index. Use BETWEEN and the actual field values instead.
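Concretely, the indexes might look like this (index names are made up for illustration):
CREATE INDEX idx_tb_jobseeker_created_at ON tb_jobseeker (created_at);
CREATE INDEX idx_tb_cv_created_at ON tb_cv (created_at);
CREATE INDEX idx_tb_industry_industry_id ON tb_industry (industry_id);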
Here is my attempt at understanding your query and writing a better version. I guess you want to get the count of new jobseeker registrations and new uploaded CVs per month per industry:
SELECT
i.industry_id,
i.description,
MONTH(j.created_at) AS month_created,
YEAR(j.created_at) AS year_created,
COUNT(DISTINCT j.jobseeker_id) AS new_registrations,
COUNT(cv.cv_id) AS uploaded_cvs
FROM
tb_cv AS cv
INNER JOIN tb_jobseeker AS j ON j.jobseeker_id = cv.jobseeker_id
INNER JOIN tb_industry AS i ON i.industry_id = j.industry_id
WHERE
j.created_at BETWEEN '2009-05-01' AND '2009-05-31'
AND cv.created_at BETWEEN '2009-05-01' AND '2009-05-31'
GROUP BY
i.industry_id,
i.description,
MONTH(j.created_at),
YEAR(j.created_at)
A few things I noticed while writing the query:
you GROUP BY values you don't output in the end. Why? (I've added the grouped field to the output list.)
you JOIN three tables in the sub-query while only ever using values from one of them. Why? I don't see what it would be good for, other than filtering out CV records that don't have a jobseeker or an industry attached — which I find hard to imagine. (I've removed the entire sub-query and used a simple COUNT instead.)
Your sub-query returns the same value every time. Did you maybe mean to correlate it in some way, to the industry perhaps?
In a grouped query, a sub-query that isn't wrapped in an aggregate function runs once for every record.
First and foremost, it may be worth moving the UNIX_TIMESTAMP conversions to the other side of the comparison (that is, apply the reverse function to the literal timestamp values on the other side of the >= and <=). That way the conversion happens once per query instead of once for every record the inner query examines, and an index on created_at can still be used.
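For example (a sketch, assuming the parameters arrive as Unix epoch values as described in the notes), the filter inside the sub-query becomes:
-- instead of: UNIX_TIMESTAMP(c.created_at) >= '1241125200'
WHERE c.created_at >= FROM_UNIXTIME(1241125200)
  AND c.created_at <= FROM_UNIXTIME(1243717200)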
Also, why does the uploaded_cvs query not have any where clause linking it to the outer query? Am I missing something here?