SQL optimization problem: which of the two solutions below is the most efficient?
I have the table from the image. I need to group the data by CPF and date and know whether each CPF had at least one login_ok = true on a given date. Both solutions below satisfy my need, but the goal is to find the best query.
We can have multiple login_ok = true and login_ok = false rows for a CPF on a given date; I just need to know whether there was at least one login_ok = true.
I already have two solutions; I want to discuss how to make a more efficient one.
Maybe this would work for your problem:
SELECT
t2.CPF,
t2.data
FROM (
SELECT
CPF,
date(data) AS data
FROM db_risco.site_rn_login
WHERE login_ok
) t2
GROUP BY 1,2
ORDER BY t2.data
DISTINCT would also work, and I doubt it would pose any performance threat in your case. Usually it evaluates expressions (like date(data)) before checking for uniqueness.
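For reference, a minimal sketch of the DISTINCT variant (same table and filter as above):
SELECT DISTINCT CPF, date(data) AS data
FROM db_risco.site_rn_login
WHERE login_ok
ORDER BY data;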
By using a subquery, in this case, you can select up front which CPFs to include and then extract the date. Finally, you group by a much smaller number of rows, since those were already filtered by the subquery.
PostgreSQL has the function BOOL_OR to check whether the expression is true for at least one row. It is likely to be optimised for this kind of task.
select cpf, date(data) as data, bool_or(login_ok) as status_login
from db_risco.site_rn_login
group by cpf, date(data);
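If only the (cpf, date) pairs that had at least one successful login are needed, the same aggregate can move into a HAVING clause instead (a sketch against the same table):
select cpf, date(data) as data
from db_risco.site_rn_login
group by cpf, date(data)
having bool_or(login_ok)
order by data;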
An index on (cpf, date(data)) or even on (cpf, date(data), login_ok) could help speed up the query.
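A sketch of such an index, assuming data is a plain timestamp column (PostgreSQL needs the extra parentheses around the expression, and date() must be immutable for the column's type):
CREATE INDEX idx_site_rn_login_cpf_date
    ON db_risco.site_rn_login (cpf, (date(data)), login_ok);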
On a side note: you may also want to order your results with ORDER BY. Don't rely on GROUP BY doing this; the order of the rows returned by a query is arbitrary without an ORDER BY clause.
I am trying to update a colleague's MS Access application (with VB code). I am rather experienced in writing SQL queries, but I am not able to solve the following problem.
The query I am trying to fix uses a pass-through query's result and a local MS Access table and joins them together in the WHERE clause (I tried the normal way with ON, but it seems this doesn't work when a pass-through query is involved). I have little experience with joining tables in the WHERE clause, but is it possible that I can't use certain columns (of both tables) in the WHERE clause when joining this way? When I add a filter criterion such as columnA <> 'somerandomtext' (which is always satisfied, just to point out the problem), the query result is empty. When I delete that criterion from the WHERE clause, the query returns results (although too many, because I can't filter them accordingly).
Furthermore: I checked the pass-through query and its results are correct. I checked the MS Access table and the data in it is correct. Therefore, I think I might be doing something wrong in the query where I join the two mentioned above.
THIS QUERY WORKS AS INTENDED AND RETURNS RESULTS:
SELECT t.tr_id, t.ser_num, t.contrgnt_id, t.pos_ekey, t.sernum AS cmdty, format(t.vol,"##,###,###.00") AS volume, t.unit_def, t.value_date,
format(t.coup,"##,###,###.00") AS fixprice,
format(s.calcvarprice,"##,###,###.00") AS marketprice,
format(s.calcamount,"##,###,###.00") AS payamount,
format(s.settlevarprice, "##,###,##0.00") as settleprice,
format(s.settleamount, "##,###,##0.00") as settleamount, s.sync
FROM pms_trans AS t, settledata AS s
WHERE t.tr_id=s.tr_id And t.is_booked='N' And t.value_date>='01.01.2021' And t.value_date<='01.04.2021'
THIS QUERY SOMEHOW RETURNS 0 RESULTS:
SELECT t.tr_id, t.ser_num, t.contrgnt_id, t.pos_ekey, t.sernum AS cmdty, format(t.vol,"##,###,###.00") AS volume, t.unit_def, t.value_date,
format(t.coup,"##,###,###.00") AS fixprice,
format(s.calcvarprice,"##,###,###.00") AS marketprice,
format(s.calcamount,"##,###,###.00") AS payamount,
format(s.settlevarprice, "##,###,##0.00") as settleprice,
format(s.settleamount, "##,###,##0.00") as settleamount, s.sync
FROM pms_trans AS t, settledata AS s
WHERE t.tr_id=s.tr_id And t.is_booked='N' And t.pos_ekey <> 'BGGS' And t.value_date>='01.01.2021' And t.value_date<='01.04.2021'
As mentioned before, I suspect that there are some limitations when joining via the WHERE clause (although I didn't find sufficient information online).
Best Regards and thank you in advance,
Peter
First, apply the correct syntax for the date expressions:
WHERE t.tr_id=s.tr_id And t.is_booked='N' And t.pos_ekey <> 'BGGS' And t.value_date >= #2021-01-01# And t.value_date <= #2021-04-01#
Next, double-check the values for pos_ekey. For example, try to apply the filter, <> 'BGGS', directly on the field in table view.
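If you prefer a query for that check, something like this sketch lists the distinct values together with their counts; a Null in pos_ekey would also explain the empty result, because t.pos_ekey <> 'BGGS' never matches Null rows:
SELECT pos_ekey, Count(*) AS cnt
FROM pms_trans
GROUP BY pos_ekey;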
The query is:
SELECT DISTINCT A.X1, A.X2, A.X3, TO_DATE(A.EVNT_SCHED_DATE,'DD-Mon-YYYY') AS EVNT_SCHED_DATE,
A.X4, A.MOVEMENT_TYPE, TRIM(A.EFFECTIVE_STATUS) AS STATUS, A.STATUS_TIME, A.TYPE,
A.LEG_NUMBER,
CASE WHEN A.EFFECTIVE_STATUS='BT' THEN 'NLT'
WHEN A.EFFECTIVE_STATUS='NLT' THEN 'NLT'
WHEN A.EFFECTIVE_STATUS='MKUP' THEN 'MKUP'
END AS STATUS
FROM PHASE1.DY_STATUS_ZONE A
WHERE A.LAST_LEG_FLAG='Y'
AND SCHLD_DATE>='01-Apr-2019'--TO_DATE(''||MNTH_DATE||'','DD-Mon-YYYY')
AND SCHLD_DATE<='20-Feb-2020'--TO_DATE(''||TILL_DATE||'','DD-Mon-YYYY')
AND A.MOVEMENT_TYPE IN ('P')
AND (EXCEPTIONAL_FLAG='N' OR EXCEPTION_TYPE='5') ---------SS
PHASE1.DY_STATUS_ZONE has 710,246 records in it. Please advise whether this query can be optimized.
You could try adding an index which covers the WHERE clause:
CREATE INDEX idx ON PHASE1.DY_STATUS_ZONE (LAST_LEG_FLAG, SCHLD_DATE, MOVEMENT_TYPE,
EXCEPTIONAL_FLAG, EXCEPTION_TYPE);
Depending on the cardinality of your data, the above index may or may not be used.
The problem might be the SELECT DISTINCT. This can be hard to optimize because it removes duplicates; even if no rows are duplicated, Oracle still does the work. If it is not needed, remove it.
For your particular query, I would write the WHERE clause as:
WHERE A.LAST_LEG_FLAG = 'Y' AND
SCHLD_DATE >= DATE '2019-04-01' AND
SCHLD_DATE <= DATE '2020-02-20' AND
A.MOVEMENT_TYPE = 'P' AND
(EXCEPTIONAL_FLAG = 'N' OR EXCEPTION_TYPE = '5')
The date formats don't affect performance. Just readability and maintainability.
For this query, the optimal index is probably: (LAST_LEG_FLAG, MOVEMENT_TYPE, SCHLD_DATE, EXCEPTIONAL_FLAG). The last two columns might be switched, if EXCEPTIONAL_FLAG is more selective than SCHLD_DATE.
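For reference, a sketch of that index (the name is a placeholder; verify the chosen column order against the actual execution plan):
CREATE INDEX IX_DY_STATUS_ZONE_LEG ON PHASE1.DY_STATUS_ZONE
    (LAST_LEG_FLAG, MOVEMENT_TYPE, SCHLD_DATE, EXCEPTIONAL_FLAG);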
However, if this returns many rows, then the SELECT DISTINCT will be the gating factor for the query. And that is much more difficult to optimize.
I'm stuck on an (apparently) extremely trivial task that I can't make work, and I really feel I have no choice but to ask for advice.
I used to deal with PHP/MySQL more than 10 years ago and I might be quite rusty now that I'm dealing with an SQLite DB using Qt5.
Basically I'm selecting some records while wanting to make some math operations on the fetched columns. I recall (and re-read some documentation and examples) that the keyword "AS" is going to conveniently rename (alias) a value.
So for example I have this query, where "X" is an integer number that I render into this big Qt string before executing it with a QSqlQuery. This query lets me select all the electronic components used in a Project and calculate how many of them to order (rounding to the nearest multiple of 5) and the total price per component.
SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name, Inventory.category,
Inventory.type, Inventory.package, Inventory.value, Inventory.manufacturer,
Inventory.price, UsedItems.qty_used as used_qty,
UsedItems.qty_used*X AS To_Order,
ROUND((UsedItems.qty_used*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5,
Inventory.price*Nearest5 AS TotPrice
FROM Inventory
LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid
WHERE UsedItems.pid='1'
ORDER BY RefDes, value ASC
So, for example, I aliased UsedItems.qty_used as used_qty. At first I tried to use it in the next field, multiplying it by X, writing "used_qty*X AS To_Order" ... the query failed. Well, no worries; I just put the original table.field name back and it worked.
Going further, I have a complex calculation and I want to use its result in the next field, but the same issue popped up: if I alias "ROUND(...)" AS Nearest5 and then try to use this value by multiplying it in the next field, the query fails.
Please note: the query WORKS, but ONLY if I don't use aliases in the following fields, namely if I don't use the alias Nearest5 in the TotPrice field. I just want to avoid re-writing the whole ROUND(...) thing for the TotPrice field.
What am I missing/doing wrong? Either SQLite does not support aliases within the same query, or I am using the wrong syntax and am just too stuck/confused to see the mistake (which I'm sure is something really stupid).
Column aliases defined in a SELECT cannot be used:
For other expressions in the same SELECT.
For filtering in the WHERE.
For conditions in the FROM clause.
Many databases also restrict their use in GROUP BY and HAVING.
All databases support them in ORDER BY.
This is how SQL works. The issue is two things:
The logical order of processing clauses in the query (i.e. how they are compiled). This affects the scoping of identifiers.
The order of evaluating expressions in the SELECT. This is indeterminate; there is no guaranteed evaluation order among them.
For a simple example, what should x refer to in this example?
select x as a, y as x
from t
where x = 2;
By not allowing aliases to be referenced in the WHERE clause, SQL engines do not have to make a choice: x always refers to t.x.
You can try nested queries.
A SELECT query can be nested inside another SELECT query's FROM clause;
multiple levels can be nested, for example following this pattern:
SELECT *, [your last Expression] AS LastExp
FROM (SELECT *, [your Middle Expression] AS MidExp
      FROM (SELECT *, [your first Expression] AS FirstExp
            FROM yourTables));
Naturally, the ordering must be respected: expressions defined in the innermost SELECT can be used by all the outer queries,
while intermediate expressions can only be used by the queries that wrap them.
For your case, your query may be:
SELECT *, PRC*Nearest5 AS TotPrice
FROM (SELECT *, ROUND((To_Order/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5
      FROM (SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name,
                   Inventory.category, Inventory.type, Inventory.package, Inventory.value,
                   Inventory.manufacturer, Inventory.price AS PRC,
                   UsedItems.qty_used*X AS To_Order
            FROM Inventory
            LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid
            WHERE UsedItems.pid='1'
            ORDER BY RefDes, value ASC))
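As a side note, assuming the SQLite bundled with your Qt5 build is 3.8.3 or newer, common table expressions can express the same layering a bit more readably; a sketch under that assumption, with X again substituted by Qt:
WITH base AS (
    SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes AS RefDes, Inventory.name,
           Inventory.category, Inventory.type, Inventory.package, Inventory.value AS value,
           Inventory.manufacturer, Inventory.price AS price,
           UsedItems.qty_used AS used_qty, UsedItems.qty_used*X AS To_Order
    FROM Inventory
    LEFT JOIN UsedItems ON Inventory.id = UsedItems.cid
    WHERE UsedItems.pid = '1'
),
rounded AS (
    -- same rounding formula as the original query, zeroed out when X <= 0
    SELECT *, ROUND((To_Order/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5
    FROM base
)
SELECT *, price*Nearest5 AS TotPrice
FROM rounded
ORDER BY RefDes, value ASC;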
I have a Postgres query as such:
select id
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
group by id
order by id;
It's nothing fancy but works fine -- only takes about 3 minutes when querying a large database to return the result.
My question is, why does this slight modification to the above code cause the time for the query to more-than quadruple?
select id, b.ad_description
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
group by id, b.ad_description
order by id;
What is going on? The mere inclusion of one simple column of (albeit unique) information is bogging my query down. I am somehow asking Postgres to do a tremendously larger amount of work. For the life of me, I don't see how.
I'd like to preemptively apologize for not including any raw data. I'm hoping this simplified example of what I'm really facing is clear enough for some kind soul to make an enlightening comment. I can say that I'm going over a million rows in each table.
Thanks in advance.
I think the size of the result set is the issue. Since your result set now has a million-ish rows, that would explain the extra time (although it does seem excessive). A couple of things. First, the first column in the select is not bound to a range variable; it probably doesn't matter, but you might want to make it select b.id, b.ad_description instead of id, b.ad_description. Second, I normally use a GROUP BY when one of the selected columns is an aggregate that isn't in the GROUP BY list, like a count() or something. Maybe I'm missing something, but you might get the same result with:
select distinct b.id, b.ad_description
from ads_1 as a
join ads_2 as b
on a.id_key = b.id_key
where b.date between '2014-01-01' and '2014-01-02'
order by b.id, b.ad_description;
You might want to play with some of the settings in the postgresql.conf file (work_mem in particular) to give the query more working memory.
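For example (a sketch; the value is only illustrative and should be sized against available RAM), raising work_mem just for the session gives sorts and hash aggregates more room:
SET work_mem = '256MB';  -- session-only; reverts when the connection closes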
Finally, I have a hunch that making the ORDER BY match the GROUP BY might also help performance.
A LIMIT 5 wouldn't help much with performance, because the entire query would need to finish before the first 5 rows could come out.
-g
Hey guys. This is a follow-on from this question:
After getting the right data and making some tweaks based on requests from business, I've now got this mini-beast on my hands. This query should return the total number of new jobseeker registrations and the number of newly uploaded CVs:
SELECT COUNT(j.jobseeker_id) as new_registrations,
(
SELECT
COUNT(c.cv_id)
FROM
tb_cv as c, tb_jobseeker, tb_industry
WHERE
UNIX_TIMESTAMP(c.created_at) >= '1241125200'
AND
UNIX_TIMESTAMP(c.created_at) <= '1243717200'
AND
tb_jobseeker.industry_id = tb_industry.industry_id
)
AS uploaded_cvs
FROM
tb_jobseeker as j, tb_industry as i
WHERE
j.created_at BETWEEN '2009-05-01' AND '2009-05-31'
AND
i.industry_id = j.industry_id
GROUP BY i.description, MONTH(j.created_at)
Notes:
- The two values in the UNIX TIMESTAMP functions are passed in as parameters from the report module in our backend.
Every time I run it, MySQL chokes and lingers silently into the ether of the Interweb.
Help is appreciated.
Update: Hey guys. Thanks a lot for all the thoughtful and helpful comments. I'm only 2 weeks into my role here, so I'm still learning the schema. So, this query is somewhere between a thumbsuck and an educated guess. Will start to answer all your questions now.
tb_cv is not connected to the other tables in the sub-query. I guess this is the root cause for the slow query. It causes generation of a Cartesian product, yielding a lot more rows than you probably need.
Other than that I'd say you need indexes on tb_jobseeker.created_at, tb_cv.created_at and tb_industry.industry_id, and you might want to get rid of the UNIX_TIMESTAMP() calls in the sub-query since they prevent use of an index. Use BETWEEN and the actual field values instead.
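A sketch of those indexes (the names are placeholders, and tb_industry.industry_id may already be the primary key, in which case the last one is unnecessary):
CREATE INDEX idx_jobseeker_created_at ON tb_jobseeker (created_at);
CREATE INDEX idx_cv_created_at ON tb_cv (created_at);
CREATE INDEX idx_industry_industry_id ON tb_industry (industry_id);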
Here is my attempt at understanding your query and writing a better version. I guess you want to get the count of new jobseeker registrations and new uploaded CVs per month per industry:
SELECT
i.industry_id,
i.description,
MONTH(j.created_at) AS month_created,
YEAR(j.created_at) AS year_created,
COUNT(DISTINCT j.jobseeker_id) AS new_registrations,
COUNT(cv.cv_id) AS uploaded_cvs
FROM
tb_cv AS cv
INNER JOIN tb_jobseeker AS j ON j.jobseeker_id = cv.jobseeker_id
INNER JOIN tb_industry AS i ON i.industry_id = j.industry_id
WHERE
j.created_at BETWEEN '2009-05-01' AND '2009-05-31'
AND cv.created_at BETWEEN '2009-05-01' AND '2009-05-31'
GROUP BY
i.industry_id,
i.description,
MONTH(j.created_at),
YEAR(j.created_at)
A few things I noticed while writing the query:
You GROUP BY values you don't output in the end. Why? (I've added the grouped fields to the output list.)
You JOIN three tables in the sub-query while only ever using values from one of them. Why? I don't see what it would be good for, other than filtering out CV records that don't have a jobseeker or an industry attached, which I find hard to imagine. (I've removed the entire sub-query and used a simple COUNT instead.)
Your sub-query returns the same value every time. Did you maybe mean to correlate it in some way, perhaps to the industry?
The sub-query runs once for every record of the grouped query, without being wrapped in an aggregate function.
First and foremost, it may be worth moving the UNIX_TIMESTAMP conversions to the other side of the comparisons (that is, apply the reverse function to the literal timestamp values on the other side of the >= and <=). That way the conversion happens once per query instead of once for every record in the inner query, and the raw column stays usable by an index.
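In MySQL that reversal could look like this (a sketch; FROM_UNIXTIME is the inverse conversion, so created_at itself remains available to an index):
WHERE c.created_at >= FROM_UNIXTIME(1241125200)
  AND c.created_at <= FROM_UNIXTIME(1243717200)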
Also, why does the uploaded_cvs query not have any where clause linking it to the outer query? Am I missing something here?