Oracle query with order by performance issue - sql

I have a really complicated query:
select * from (
select * from tbl_user ...
where ...
and date_created between :date_from and :today
...
order by date_created desc
) where rownum <=50;
Currently the query is fast enough because of the where clause (only 3 months before today: date_from = today - 90 days).
I have to remove this clause, but doing so causes performance degradation.
What if I first calculate date_from with
SELECT MIN(date_created) where...
and then insert this value into the main query? The set of data would be the same. Would it improve performance? Does it make sense?
Does anyone have any suggestions for optimization?

Using an order by operation will of course cause the query to take a little longer to return. That being said, it is almost always faster to sort in the DB than it is to sort in your application logic.
It's hard to really optimize without the full query and schema information, but I'll take a stab at what seems like the most obvious to me.
Converting to Rank()
Your query could be a lot more efficient if you use a windowed rank() function. I've also converted it to use a common table expression (aka CTE). This doesn't improve performance, but does make it easier to read.
with cte as (
    select
        u.*
        , rank() over (
            -- partition by <whatever fields differentiate your rows>
            -- (optional; unlike a group by clause, it doesn't need
            -- to list every field, and can be omitted entirely)
            order by
                date_created desc
        ) as rk
    from
        tbl_user u
        ...
    where
        ...
        and date_created between :date_from and :today
)
select
    *
from
    cte
where
    rk <= 50
Indexing
If date_created is not indexed, it probably should be.
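A minimal sketch, using the table and column from the question (the index name is illustrative):
create index ix_tbl_user_date_created on tbl_user (date_created);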
Take a look at your autotrace results. Figure out what filters have the highest cost. These are probably unindexed, and maybe should be.
If you post your schema, I'd be happy to make better suggestions.

Related

Remove case insensitive duplicates in sql (postgres)

I have a PostgreSQL database, and I'm trying to delete (or even just get the ids of) the older of the duplicates in my table, but only those that are duplicates because of case sensitivity, for example helLo and hello.
The table is quite large and my nested query takes a really long time. I wonder if there is a better, more efficient way to do the query in one go, rather than splitting it into multiple queries, because there are a lot of ids in question:
-- "in" and "out" are reserved words in Postgres, hence the t_ prefixes
SELECT * FROM some_table AS t_out
WHERE (SELECT count(*) FROM some_table AS t_in
       WHERE t_out.text != t_in.text
       AND LOWER(t_in.text) = LOWER(t_out.text)
       AND t_in.created_at > t_out.created_at) > 1
Thanks!
Can you try
SELECT LOWER(text), ROW_NUMBER() OVER( PARTITION by LOWER(text) ORDER by created_at ) as rn
FROM some_table
You can then use the rn column as a filter, as in the sketch below.
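A hedged sketch of applying that filter to delete the older duplicates, assuming an id primary key (ordering descending so the newest row in each group gets rn = 1 and survives):
DELETE FROM some_table
WHERE id IN (
    SELECT id FROM (
        SELECT id,
               ROW_NUMBER() OVER (PARTITION BY LOWER(text)
                                  ORDER BY created_at DESC) AS rn
        FROM some_table
    ) ranked
    WHERE rn > 1
);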
To help this query, create an expression index on LOWER(text). Include created_at in the index to help the date comparisons.
CREATE INDEX text_lower ON some_table(LOWER(text), created_at);
It's hard to test this without your data, though.

How to speed up group-based duplication-count queries on unindexed tables

When I need to know the number of rows containing more than n duplicates for a certain column c, I can do it like this:
WITH duplicateRows AS (
SELECT COUNT(1)
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
This leads to unwanted behaviour: SQL Server counts all rows grouped by c, which (when there is no index on this table) leads to horrible performance.
However, altering the script so that SQL Server doesn't have to count all the rows doesn't solve the problem:
WITH duplicateRows AS (
SELECT 1
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT COUNT(1) FROM duplicateRows
Although SQL Server now in theory can stop counting after n + 1, it leads to the same query plan and query cost.
Of course, the reason is that the GROUP BY really introduces the cost, not the counting. But I'm not at all interested in the numbers. Is there another option to speed up the counting of duplicate rows, on a table without indexes?
The two greatest costs in your query are the re-ordering for the GROUP BY (due to lack of an appropriate index) and the fact that you're scanning the whole table.
Unfortunately, to identify duplicates, re-ordering the whole table is the cheapest option.
You may get a benefit from the following change, but I highly doubt it would be significant, as I'd expect the execution plan to involve a sort again anyway.
WITH
sequenced_data AS
(
    SELECT
        -- ROW_NUMBER requires an ORDER BY in the OVER clause;
        -- ORDER BY (SELECT NULL) says we don't care about the order
        ROW_NUMBER() OVER (PARTITION BY fieldC ORDER BY (SELECT NULL)) AS sequence_id
    FROM
        yourTable
)
SELECT
    COUNT(*)
FROM
    sequenced_data
WHERE
    sequence_id = (n+1)
Assumes SQL Server 2005+.
Without an index, the GROUP BY solution is the best: every PARTITION-based solution involves both a table (clustered index) scan and a sort, instead of the simple scan-and-count of the GROUP BY case.
If the only goal is to determine whether there are ANY rows in ANY group (or, to rephrase that, "there is a duplicate inside the table, given the distinction of column c"), adding TOP(1) to the SELECT queries could work some performance magic.
WITH duplicateRows AS (
SELECT TOP(1)
1
FROM [table]
GROUP BY c
HAVING COUNT(1) > n
) SELECT 1 FROM duplicateRows
Theoretically, SQL Server doesn't need to determine all groups, so as soon as the first group with a duplicate is found, the query is finished (but worst-case will take as long as the original approach). I have to say though that this is a somewhat imperative way of thinking - not sure if it's correct...
Speed and "without indexes" almost never go together.
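If adding an index turns out to be an option after all, a minimal sketch using the [table] and c placeholders from the question (the index name is illustrative):
CREATE INDEX ix_table_c ON [table] (c);
A narrow index on c alone gives the GROUP BY a pre-sorted structure to scan instead of the whole table.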
Although, as others here have mentioned, I seriously doubt that it will have performance benefits. Perhaps you could try restructuring your query with PARTITION BY.
For example:
WITH duplicateRows AS (
SELECT a.aFK,
ROW_NUMBER() OVER(PARTITION BY a.aFK ORDER BY a.aFK) AS DuplicateCount
FROM Address a
) SELECT COUNT(DuplicateCount) FROM duplicateRows
I haven't tested the performance of this against the actual group by clause query. It's just a suggestion of how you could restructure it in another way.

How does sql optimization work internally?

My previous question:
Date of max id: sql/oracle optimization
In my previous question, I was finding different ways of finding the date of the record with the highest id number. Below are several of the offered solutions, and their 'cost' as calculated by explain plan.
select date from table where id in (
select max(id) from table)
has a cost of 8
select date from table where rownum < 2 order by id desc;
has a cost of 5
select date from (select date from table order by id desc) where rownum < 2;
also has a cost of 5
with ranked_table as (select rownum as rn, date from table order by id desc)
select date from ranked_table where rn = 1;
has a cost of 906665
SELECT t1.date
FROM table t1
LEFT OUTER JOIN table t2
ON t1.id < t2.id
WHERE t2.id IS NULL;
has a cost of 1438619
Obviously the index on id is doing its job. But I was wondering, in what cases would the last two perform at least as well, if not better? I want to understand the benefits of doing it that way.
This was done in Oracle. All varieties can be discussed, but kindly say what your answer applies to.
Use solution #1 if you want the most portable SQL that will work on a wide variety of other brands of RDBMS (i.e. not all brands support rownum):
select date from table where id in (select max(id) from table);
Use solution #3 if you want the most efficient solution for Oracle:
select date from (select date from table order by id desc) where rownum < 2;
Note that solution #2 doesn't always give the right answer, because Oracle applies the rownum filter before the order by: it picks the "first" row before sorting by id. If this happens to return the row with the highest id value, it's only by coincidence.
select date from table where rownum < 2 order by id desc;
Regarding the more complex queries #4 and #5 that give such a high cost, I agree I wouldn't recommend using them for such a simple task as fetching the row with the highest id. But understanding how to use subquery factoring and self-joins can be useful for solving other more complex types of queries, where the simple solutions simply don't do the job.
Example: given a hierarchy of threaded forum comments, show the "hottest" comments with the most direct replies.
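A hedged sketch of that example, assuming a comments table with id and parent_id columns (all names are illustrative, not from the original answer):
with reply_counts as (
    select c.id, count(r.id) as direct_replies
    from comments c
    left outer join comments r on r.parent_id = c.id
    group by c.id
)
select id, direct_replies
from (select id, direct_replies from reply_counts order by direct_replies desc)
where rownum <= 10;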
Almost all decent databases have introduced instructions called optimizer hints, which are not portable. There are default costs for joining tables, and you can advise the query optimizer to use, for example, nested loop joins or hash joins. A good explanation for Oracle can be found in the Oracle performance tuning guide.
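For illustration, an Oracle hint rides in a comment right after the SELECT keyword. A sketch using the running example from this question (FIRST_ROWS(n) is a real Oracle hint asking the optimizer to favor plans that return the first n rows quickly):
select /*+ FIRST_ROWS(1) */ date
from (select date from table order by id desc)
where rownum < 2;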

Max and Min Time query

How do I show the max time in the first row and the min time in the second row, for Access using VB6?
What about:
SELECT time_value
FROM (SELECT MIN(time_column) AS time_value FROM SomeTable
UNION
SELECT MAX(time_column) AS time_value FROM SomeTable
)
ORDER BY time_value DESC;
That should do the job unless there are no rows in SomeTable (or your DBMS does not support the notation).
Simplifying per suggestion in comments - thanks!
SELECT MIN(time_column) AS time_value FROM SomeTable
UNION
SELECT MAX(time_column) AS time_value FROM SomeTable
ORDER BY time_value DESC;
If you can accept getting the two values in one row instead of two, you may improve the performance of the query using:
SELECT MIN(time_column) AS min_time,
MAX(time_column) AS max_time
FROM SomeTable;
A really good optimizer might be able to deal with both halves of the UNION version in one pass over the data (or index), but it is quite easy to imagine an optimizer tackling each half of the UNION separately and processing the data twice. If there is no index on the time column to speed things up, that could involve two table scans, which would be much slower than a single table scan for the two-value, one-row query (if the table is big enough for such things to matter).
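If those table scans matter, an index on the time column is the usual fix; a minimal sketch using the names from this answer (the index name is illustrative):
CREATE INDEX idx_sometable_time ON SomeTable (time_column);
With such an index, MIN and MAX can each be read from one end of the index instead of scanning the table.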

How do I calculate a moving average using MySQL?

I need to do something like:
SELECT value_column1
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
Except in addition to value_column1, I also need to retrieve a moving average of the previous 20 values of value_column1.
Standard SQL is preferred, but I will use MySQL extensions if necessary.
This is just off the top of my head, and I'm on the way out the door, so it's untested. I also can't imagine that it would perform very well on any kind of large data set. I did confirm that it at least runs without an error though. :)
SELECT
    value_column1,
    (
        SELECT
            AVG(value_column1) AS moving_average
        FROM
            Table1 T2
        WHERE
            -- count how many rows fall between T2's timestamp and the
            -- current row's; keep only T2 rows within the trailing window of 20
            (
                SELECT
                    COUNT(*)
                FROM
                    Table1 T3
                WHERE
                    datetime_column1 BETWEEN T2.datetime_column1 AND T1.datetime_column1
            ) BETWEEN 1 AND 20
    )
FROM
    Table1 T1
Tom H's approach will work. You can simplify it like this if you have an identity column (the GROUP BY is needed because of the aggregate):
SELECT T1.id, T1.value_column1, AVG(T2.value_column1)
FROM table1 T1
INNER JOIN table1 T2 ON T2.id BETWEEN T1.id - 19 AND T1.id
GROUP BY T1.id, T1.value_column1
I realize that this answer is about 7 years too late. I had a similar requirement and thought I'd share my solution in case it's useful to someone else.
There are some MySQL extensions for technical analysis that include a simple moving average. They're really easy to install and use: https://github.com/mysqludf/lib_mysqludf_ta#readme
Once you've installed the UDF (per instructions in the README), you can include a simple moving average in a select statement like this:
SELECT TA_SMA(value_column1, 20) AS sma_20 FROM table1 ORDER BY datetime_column1
When I had a similar problem, I ended up using temp tables for a variety of reasons, but it made this a lot easier! What I did looks very similar to what you're doing, as far as the schema goes.
Make the schema something like ID identity, start_date, end_date, value. When you select, do a subselect avg of the previous 20 based on the identity ID; see the sketch below.
Only do this if you find yourself already using temp tables for other reasons though (I hit the same rows over and over for different metrics, so it was helpful to have the small dataset).
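A rough sketch of that temp-table approach in MySQL (all names are illustrative; a plain scratch table is used because MySQL does not allow a TEMPORARY table to be referenced twice in the same query):
CREATE TABLE tmp_values (
    id INT NOT NULL AUTO_INCREMENT PRIMARY KEY, -- the identity column
    dt DATETIME,
    value DECIMAL(10,4)
);
INSERT INTO tmp_values (dt, value)
SELECT datetime_column1, value_column1
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
-- subselect avg of the previous 20 rows, keyed on the identity id
SELECT t1.dt, t1.value,
       (SELECT AVG(t2.value)
        FROM tmp_values t2
        WHERE t2.id BETWEEN t1.id - 19 AND t1.id) AS moving_average
FROM tmp_values t1;
DROP TABLE tmp_values;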
My solution adds a row number to the table. The following example code may help:
set @MA_period=5;
select id1, tmp1.date_time, tmp1.c, avg(tmp2.c) from
(select @b:=@b+1 as id1, date_time, c from websource.EURUSD, (select @b:=0) bb order by date_time asc) tmp1,
(select @a:=@a+1 as id2, date_time, c from websource.EURUSD, (select @a:=0) aa order by date_time asc) tmp2
where id1>@MA_period and id1>=id2 and id2>(id1-@MA_period)
group by id1
order by id1 asc, id2 asc
In my experience, MySQL as of 5.5.x tends not to use indexes on dependent selects, whether a subquery or a join. This can have a very significant impact on performance where the dependent select criteria change on every row.
A moving average is an example of a query which falls into this category: execution time may increase with the square of the rows. To avoid this, choose a database engine which can perform indexed look-ups on dependent selects. I find Postgres works effectively for this problem.
In MySQL 8, a window function frame can be used to obtain the averages.
SELECT value_column1, AVG(value_column1) OVER (ORDER BY datetime_column1 ROWS 19 PRECEDING) as ma
FROM table1
WHERE datetime_column1 >= '2009-01-01 00:00:00'
ORDER BY datetime_column1;
This calculates the average of the current row and the 19 preceding rows (ROWS 19 PRECEDING is shorthand for ROWS BETWEEN 19 PRECEDING AND CURRENT ROW).