I'm analyzing someone else's code and stumbled on a query like this:
SELECT ...
FROM ...
GROUP BY date_field + 1
What's the difference between this query and one such as the following?
SELECT ...
FROM ...
GROUP BY date_field
If the result of both queries is the same, do they differ in terms of performance? Thanks in advance.
There is no difference logically. The original code probably has something like:
select date_field + 1, . . .
from . . .
group by date_field + 1;
The group by key then directly matches the expression in the select. However, the + 1 is not necessary: SQL supports expressions as group by keys, and adding a constant does not affect the outcome.
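For example, with a hypothetical table t, both of these return one row per distinct value of date_field, because adding a constant neither merges nor splits any groups:
SELECT date_field, COUNT(*) FROM t GROUP BY date_field;
SELECT date_field + 1, COUNT(*) FROM t GROUP BY date_field + 1;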
It might affect the execution plan, however. I am not overly familiar with the optimization engine in Teradata, but two things come to mind:
The engine may not recognize that it could use an index on date_field.
The engine may get confused on statistics for the aggregation column, choosing a suboptimal aggregation algorithm.
The query is:
SELECT DISTINCT A.X1, A.X2, A.X3, TO_DATE(A.EVNT_SCHED_DATE,'DD-Mon-YYYY') AS EVNT_SCHED_DATE,
A.X4, A.MOVEMENT_TYPE, TRIM(A.EFFECTIVE_STATUS) AS STATUS, A.STATUS_TIME, A.TYPE,
A.LEG_NUMBER,
CASE WHEN A.EFFECTIVE_STATUS='BT' THEN 'NLT'
WHEN A.EFFECTIVE_STATUS='NLT' THEN 'NLT'
WHEN A.EFFECTIVE_STATUS='MKUP' THEN 'MKUP'
END AS STATUS
FROM PHASE1.DY_STATUS_ZONE A
WHERE A.LAST_LEG_FLAG='Y'
AND SCHLD_DATE>='01-Apr-2019'--TO_DATE(''||MNTH_DATE||'','DD-Mon-YYYY')
AND SCHLD_DATE<='20-Feb-2020'--TO_DATE(''||TILL_DATE||'','DD-Mon-YYYY')
AND A.MOVEMENT_TYPE IN ('P')
AND (EXCEPTIONAL_FLAG='N' OR EXCEPTION_TYPE='5') ---------SS
PHASE1.DY_STATUS_ZONE has 710,246 records in it. Please advise whether this query can be optimized.
You could try adding an index which covers the WHERE clause:
CREATE INDEX idx ON PHASE1.DY_STATUS_ZONE (LAST_LEG_FLAG, SCHLD_DATE, MOVEMENT_TYPE,
EXCEPTIONAL_FLAG, EXCEPTION_TYPE);
Depending on the cardinality of your data, the above index may or may not be used.
The problem might be the SELECT DISTINCT. This can be hard to optimize because it removes duplicates: even if no rows are duplicated, Oracle still does the work. If it is not needed, remove it.
For your particular query, I would write it as:
WHERE A.LAST_LEG_FLAG = 'Y' AND
      SCHLD_DATE >= DATE '2019-04-01' AND
      SCHLD_DATE <= DATE '2020-02-20' AND
      A.MOVEMENT_TYPE = 'P' AND
      (EXCEPTIONAL_FLAG = 'N' OR EXCEPTION_TYPE = '5')
The date formats don't affect performance, just readability and maintainability.
For this query, the optimal index is probably: (LAST_LEG_FLAG, MOVEMENT_TYPE, SCHLD_DATE, EXCEPTIONAL_FLAG). The last two columns might be switched, if EXCEPTIONAL_FLAG is more selective than SCHLD_DATE.
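As a sketch (the index name is made up):
CREATE INDEX idx_dy_status_zone ON PHASE1.DY_STATUS_ZONE
    (LAST_LEG_FLAG, MOVEMENT_TYPE, SCHLD_DATE, EXCEPTIONAL_FLAG);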
However, if this returns many rows, then the SELECT DISTINCT will be the gating factor for the query. And that is much more difficult to optimize.
I'm stuck on an (apparently) extremely trivial task that I can't make work, and I really feel I have no choice but to ask for advice.
I used to deal with PHP/MySQL more than 10 years ago and I might be quite rusty now that I'm dealing with an SQLite DB using Qt5.
Basically I'm selecting some records while wanting to do some math on the fetched columns. I recall (and re-read some documentation and examples) that the keyword "AS" conveniently renames (aliases) a value.
So for example I have this query, where "X" is an integer number that I render into this big Qt string before executing it with a QSqlQuery. This query lets me select all the electronic components used in a Project and calculate how many of them to order (rounding to the nearest multiple of 5) and the total price per component.
SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name, Inventory.category,
Inventory.type, Inventory.package, Inventory.value, Inventory.manufacturer,
Inventory.price, UsedItems.qty_used as used_qty,
UsedItems.qty_used*X AS To_Order,
ROUND((UsedItems.qty_used*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5,
Inventory.price*Nearest5 AS TotPrice
FROM Inventory
LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid
WHERE UsedItems.pid='1'
ORDER BY RefDes, value ASC
So, for example, I aliased UsedItems.qty_used as used_qty. At first I tried to use it in the next field, multiplying it by X, writing "used_qty*X AS To_Order" ... the query failed. Well, no worries, I just put back the original table.field name and it worked.
Going further, I have a complex calculation and I want to use its result on the next field, but the same issue popped out: if I alias "ROUND(...)" AS Nearest5, and then try to use this value by multiplying it in the next field, the query will fail.
Please note: the query WORKS, but ONLY if I don't use aliases in the following fields, namely if I don't use the alias Nearest5 in the TotPrice field. I just want to avoid re-writing the whole ROUND(...) thing for the TotPrice field.
What am I missing/doing wrong? Either SQLite does not support aliases within the same query, or I am using the wrong syntax and am just too stuck/confused to see the mistake (which I'm sure has to be really stupid).
Column aliases defined in a SELECT cannot be used:
For other expressions in the same SELECT.
For filtering in the WHERE.
For join conditions in the FROM clause.
Many databases also restrict their use in GROUP BY and HAVING.
All databases support them in ORDER BY.
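For example, with a hypothetical table t(x):
SELECT x + 1 AS x1 FROM t WHERE x1 > 2;   -- fails in most engines: the alias is not visible in WHERE
SELECT x + 1 AS x1 FROM t ORDER BY x1;    -- works: ORDER BY is evaluated last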
This is how SQL works. The issue is two things:
The logical order of processing clauses in the query (i.e. how they are compiled). This affects the scoping of identifiers.
The order of processing expressions in the SELECT. This is indeterminate: there is no guaranteed evaluation order among them.
For a simple example, what should x refer to in this example?
select x as a, y as x
from t
where x = 2;
By not allowing such references, SQL engines do not have to make a choice: the value is always t.x.
You can try with nested queries.
A SELECT query can be nested in another SELECT query within the FROM clause;
multiple queries can be nested, for example following this pattern:
SELECT *, [your last Expression] AS LastExp
FROM (SELECT *, [your middle Expression] AS MidExp
      FROM (SELECT *, [your first Expression] AS FirstExp
            FROM yourTables));
Naturally, the order matters: expressions defined in the innermost SELECT can be used by all the outer queries, while intermediate expressions can only be used by the queries above them.
For your case, your query may be:
SELECT *, PRC*Nearest5 AS TotPrice
FROM (SELECT *, ROUND((qty_used*X/5)+0.5)*5*CAST((X > 0) AS INT) AS Nearest5
      FROM (SELECT Inventory.id, UsedItems.pid, UsedItems.RefDes, Inventory.name,
                   Inventory.category, Inventory.type, Inventory.package, Inventory.value,
                   Inventory.manufacturer, Inventory.price AS PRC,
                   UsedItems.qty_used, UsedItems.qty_used*X AS To_Order
            FROM Inventory
            LEFT JOIN UsedItems ON Inventory.id=UsedItems.cid
            WHERE UsedItems.pid='1'))
ORDER BY RefDes, value ASC
I am using an Oracle database and I am trying to count the number of terms starting with each letter in a dictionary.
Here is my query :
SELECT Substr(Lower(Dict.Term),0,1) AS Initialchar,
Count(Lower(Dict.Term))
FROM Dict
GROUP BY Substr(Lower(Dict.Term),0,1)
ORDER BY Substr(Lower(Dict.Term),0,1);
This query is working as expected, but the thing that I'm not really happy about is the fact that I have to rewrite the long "Substr(Lower(Dict.Term),0,1)" in the GROUP BY and ORDER BY clause. Is there any way to reuse the one I defined in the SELECT part ?
Thanks
You can use a subquery. Because Oracle follows the SQL standard, substr() starts counting at 1. Although Oracle does explicitly allow 0 ("If position is 0, then it is treated as 1"), I find it misleading because "0" and "1" refer to the same position.
So:
select first_letter, count(*)
from (select d.*, substr(lower(d.term), 1, 1) as first_letter
from dict d
) d
group by first_letter
order by first_letter;
Not directly. The output columns can only be referred to in the ORDER BY clause, but not used in any other way. The only way would be to make it into a subselect, but it wouldn't be any clearer and might cause issues with performance.
I prefer subquery factoring for this purpose.
with init as (
select substr(lower(d.term), 1, 1) as Initialchar
from dict d)
select Initialchar, count(*)
from init
group by Initialchar
order by Initialchar;
Contrary to the opinion above, IMO this makes the query much clearer and defines a natural order, especially when using more subqueries.
I'm not aware of any performance caveats, but there are some limitations, such as it not being possible to nest a WITH clause within another WITH clause: ORA-32034: unsupported use of WITH clause.
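That error arises when a WITH clause is nested inside a subquery; chaining CTEs in a single WITH clause avoids it. A sketch based on the query above:
with init as (
 select substr(lower(d.term), 1, 1) as Initialchar
 from dict d),
counts as (
 select Initialchar, count(*) as cnt
 from init
 group by Initialchar)
select Initialchar, cnt
from counts
order by Initialchar;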
I was always bothered by how I should approach these, and which solution is better. I guess the sample code will explain it better.
Let's imagine we have a table that has 3 columns:
(int)Id
(nvarchar)Name
(int)Value
I want to get the basic columns plus a number of calculations on the Value column, with each calculation based on a previous one. In other words, something like this:
SELECT
*,
Value + 10 AS NewValue1,
Value / NewValue1 AS SomeOtherValue,
(Value + NewValue1 + SomeOtherValue) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
    Name LIKE 'A%'
Obviously this will not work. NewValue1, SomeOtherValue and YetAnotherValue are on the same level in the query so they can't refer to each other in the calculations.
I know of two ways to write queries that will give me the desired result. The first one involves repeating the calculations.
SELECT
*,
Value + 10 AS NewValue1,
Value / (Value + 10) AS SomeOtherValue,
(Value + (Value + 10) + (Value / (Value + 10))) / 10 AS YetAnotherValue
FROM
MyTable
WHERE
    Name LIKE 'A%'
The other one involves constructing a multilevel query like this:
SELECT
t2.*,
(t2.Value + t2.NewValue1 + t2.SomeOtherValue) / 10 AS YetAnotherValue
FROM
(
SELECT
t1.*,
t1.Value / t1.NewValue1 AS SomeOtherValue
FROM
(
SELECT
*,
Value + 10 AS NewValue1
FROM
MyTable
WHERE
            Name LIKE 'A%'
) t1
) t2
But which one is the right way to approach the problem or simply "better"?
P.S. Yes, I know that a "better" or even "good" solution isn't always the same thing in SQL and will depend on many factors.
I have tried a number of different combinations of calculations in both variants. They always produced the same execution plan, so it can be assumed that there is no difference in performance. From the code usability perspective the first approach is obviously better, as the code is more readable and compact.
There is no "right" way to write such queries. SQL Server, as with most databases (MySQL being a notable exception), does not create intermediate tables for each subquery. Instead, it optimizes the query as a whole and often moves all the calculations for the expressions into a single processing node.
The reason that column aliases cannot be re-used at the same level goes back to the ANSI standard definition. In particular, nothing in the standard specifies the order of evaluation for the individual expressions. Without knowing the order, SQL cannot guarantee that the alias is defined before it is evaluated.
I often write multi-level queries -- either using subqueries or CTEs -- to make queries more readable and more maintainable. But then again, I will also copy logic from one variable to the other because it is expedient. In my opinion, this is something that the writer of the query needs to decide on, taking into account whether the query is part of the code for a system that needs to be maintained, local coding standards, whether the query is likely to be modified, and similar considerations.
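For instance, the multi-level version above could be written with CTEs instead of nested subqueries (a sketch using the question's table and columns):
WITH t1 AS (
    SELECT *, Value + 10 AS NewValue1
    FROM MyTable
    WHERE Name LIKE 'A%'
), t2 AS (
    SELECT *, Value / NewValue1 AS SomeOtherValue
    FROM t1
)
SELECT *, (Value + NewValue1 + SomeOtherValue) / 10 AS YetAnotherValue
FROM t2;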
I'm trying to create a faster query; right now I have large databases. My table sizes are 5 columns, 530k rows, and 300 columns, 4k rows (sadly I have no control over the architecture, otherwise I wouldn't be having this silly problem with a poor db).
SELECT cast( table2.foo_1 AS datetime ) as date,
table1.*, table2.foo_2, foo_3, foo_4, foo_5, foo_6, foo_7, foo_8, foo_9, foo_10, foo_11, foo_12, foo_13, foo_14, foo_15, foo_16, foo_17, foo_18, foo_19, foo_20, foo_21
FROM table1, table2
WHERE table2.foo_0 = table1.foo_0
AND table1.bar1 >= NOW()
AND foo_20="tada"
ORDER BY
date desc
LIMIT 0,10
I've indexed table2.foo_0 and table1.foo_0 along with foo_20 in hopes that it would allow for faster querying. I'm still at nearly a 7-second load time. Is there something else I can do?
Cheers
I think an index on bar1 is the key. I always run into performance issues with dates because otherwise the engine has to compare each of the 530K rows.
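A minimal sketch of that index (the name is made up):
CREATE INDEX ix_table1_bar1 ON table1 (bar1);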
Create the following indexes:
CREATE INDEX ix_table1_0_1 ON table1 (foo_1, foo_0)
CREATE INDEX ix_table2_20_0 ON table2 (foo_20, foo_0)
and rewrite your query like this:
SELECT cast( table2.foo_1 AS datetime ) as date,
table1.*, table2.foo_2, foo_3, foo_4, foo_5, foo_6, foo_7, foo_8, foo_9, foo_10, foo_11, foo_12, foo_13, foo_14, foo_15, foo_16, foo_17, foo_18, foo_19, foo_20, foo_21
FROM table1
JOIN table2
ON table2.foo_0 = table1.foo_0
AND table2.foo_20 = "tada"
WHERE table1.bar1 >= NOW()
ORDER BY
table1.foo_1 DESC
LIMIT 0, 10
The first index will be used for ORDER BY, the second one will be used for JOIN.
You, though, may benefit more from creating the first index like this:
CREATE INDEX ix_table1_0_1 ON table1 (bar1, foo_0)
which may apply more restrictive filtering on bar1.
I have a blog post on this: Choosing index, which advises on how to choose which index to create for cases like this.
Indexing table1.bar1 may improve the >=NOW comparison.
A compound index on table2.foo_0 and table2.foo_20 will help.
An index on table2.foo_1 may help the sort.
Overall, pasting the output of your query with EXPLAIN prepended may also give some hints.
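For example (a sketch using the question's tables):
EXPLAIN
SELECT CAST(table2.foo_1 AS datetime) AS date, table1.*
FROM table1
JOIN table2 ON table2.foo_0 = table1.foo_0
WHERE table1.bar1 >= NOW() AND foo_20 = 'tada'
ORDER BY date DESC
LIMIT 0, 10;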
table2 needs a compound index on foo_0 and foo_20 (assuming foo_20 belongs to table2, as in the rewrite above). An index on table1 (foo_0, bar1) could help too.
See How to use MySQL indexes and Optimizing queries with explain.
Use compound indexes that correspond to your WHERE equalities (in general the leftmost columns in the index), WHERE comparisons to absolute values (middle), and the ORDER BY clause (rightmost, in the same order).
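A hypothetical illustration of that rule (the table and columns are made up): for a query like WHERE status = 'x' AND qty >= 10 ORDER BY created_at, following the rule gives:
CREATE INDEX ix_example ON t (status, qty, created_at);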