So we have many Oracle views that we expose to other teams to work with, and they run queries against the views to extract data.
Recently we realized that, against one of the views we expose, a user was running a select * with a date range, and the query just doesn't return in a timely fashion. After investigation, we decided to 'optimize' the view by converting a scalar subquery in the select list into a left join, something that I know normally improves query performance.
Previous view definition :
select a.date, (select name from table_b b where b.id = a.id), a.id
from table_a a
New view definition :
select a.date, b.name, a.id
from table_a a left join table_b b on a.id = b.id
We tested it with the user and his queries are now much more performant, so the change was rolled out to production. A day later we realized another user had been using this view within some complex query, and his query went from running 2 hours every day to > 7 hours, or not completing at all.
So I guess my question is: how do I deal with this tuning issue, where improving one query's performance degrades another query's performance? I'm in the process of rolling back so that I can examine the two different query plans, but I'm not sure what insight I can gain from the plan differences. I checked the table statistics and they all look good.
"user did a select * with a date range ".
Date range scans are notoriously hard to tune. A plan which is great for date '2018-04-01' to date '2018-04-02' may well suck for date '2017-04-01' to date '2018-04-01'. And of course vice versa.
So what you may be suffering from here is that your user is using bind variables for the date range values. Bind variables are normally good for performance because they allow Oracle to re-use the same execution plan for all executions of a query, whatever values are supplied. Oracle peeks at the values bound on the first execution and builds the plan for those; this is called bind variable peeking. When the pertinent values have an even distribution that is a good thing: we save the cost of a hard parse and get an efficient access path.
However, when the data has an uneven distribution, or when we are specifying ranges, we need a different strategy. The overhead of a hard parse is trivial compared to the cost of using an indexed read to retrieve 20% of the rows in a table. So you need a different approach, one which doesn't rely on bind variables. Ideally you can work with your users, understand what they're doing and help them write better queries. However, the Oracle database does have features like Adaptive Cursor Sharing, which allows the database to assess whether the cached plan is still good for new values of the bind variables. This doesn't guarantee good performance but can help in situations where we have users running ad hoc queries. Find out more in the Oracle documentation.
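To make the contrast concrete, here is a minimal sketch; the orders table and order_date column are invented purely for illustration:

-- Hypothetical table and column, just to illustrate the point above.

-- With literals, the optimizer sees the actual range at parse time and can
-- judge how many rows it covers.
SELECT *
FROM   orders
WHERE  order_date BETWEEN DATE '2018-04-01' AND DATE '2018-04-02';

-- With bind variables, the plan built for the values peeked on the first
-- execution is re-used, even if a later execution asks for a whole year.
SELECT *
FROM   orders
WHERE  order_date BETWEEN :start_date AND :end_date;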
" the underlying tables was partitioned by date and also indexed by date hence I believe the date range should not be an issue."
Belief is not the same as proof. If the date range is within a single partition then maybe it's not the issue. If the queried range spans several partitions then it's a potential culprit. Consider: if your table is partitioned into one-day sections then a date range of date '2017-04-01' to date '2018-04-01' would scan 365 partitions. Partition pruning won't do much for you then. But if you don't think it's worth investigating, that's cool.
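For context, daily partitioning of the kind described might look like this in Oracle; this is a sketch only, reusing the invented orders table from above, and interval partitioning is just one way to get one partition per day:

-- Invented table; interval partitioning creates one partition per day automatically.
CREATE TABLE orders (
  order_id   NUMBER,
  order_date DATE NOT NULL
)
PARTITION BY RANGE (order_date)
INTERVAL (NUMTODSINTERVAL(1, 'DAY'))
(
  PARTITION p_initial VALUES LESS THAN (DATE '2017-01-01')
);
-- A query for one day touches one partition; a query spanning a year still
-- has to visit roughly 365 of them, so pruning alone does not save you.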
"my general question was how to tune one thing without breaking another (that you may not be aware of)"
As I think you know already, this is not possible. The best we can hope for is to tune a query to perform optimally under the conditions we know about. If it were possible to write a query so that it executed perfectly in any scenario then all those Oracle tuning consultants would not make the fine livings that they do.
Related
I am working on a ledger with a table of Transactions. Each entry has a transaction_id, account_id, timestamp and other metadata. I need to query for all Transactions for a given account_id with a BETWEEN operator on timestamp.
My planned approach was to build an index on account_id, transaction_id and timestamp. However, I have noted a limitation on inequalities and indexes in the AWS documentation, and I had planned to apply an inequality to timestamp:
Query performance is improved only when you use an equality predicate;
for example, fieldName = 123456789.
QLDB does not currently honor inequalities in query predicates. One
implication of this is that range filtered scans are not implemented.
...
Warning
QLDB requires an index to efficiently look up a document. Without an
indexed field in the WHERE predicate clause, QLDB needs to do a table
scan when reading documents. This can cause more query latency and can
also lead to more concurrency conflicts.
Transactions would be generated and grow indefinitely over time, and I would need to be able to query a week's worth of transactions at a time.
Current Query:
SELECT *
FROM Transactions
WHERE "account_id" = 'test_account' and "timestamp" BETWEEN `2020-07-05T00:00Z` AND `2020-07-12T00:00Z`
I know it is possible to stream the data to a database more suited for this query, such as DynamoDB, but I would like to know whether my performance concerns about the above query are valid and, if they are, what the recommended indexes and query are to ensure this scales and does not result in a scan across all transactions for the given account_id.
Thanks for your question (well written and researched)!
QLDB, at the time of writing, does not support range indexes. So, the short answer is "you can't."
I'd be interested to know what the intention behind your query is. For example, is getting a list of transactions between two dates something you need to do to form a new transaction or is it something you need for reporting purposes (e.g. displaying a user statement).
Nearly every use-case I've encountered thus far is the latter (reporting), and is much better served by replicating data to something like ElasticSearch or Redshift. Typically this can be done with a couple of lines of code in a Lambda function and the cost is extremely low.
QLDB has a history() function that works like a charm to generate statements, since you can pass one or two dates as arguments for start and/or end dates.
You see, this is where QLDB gets tricky: when you think of it as a relational database.
The caveat here is that you would have to change your transactions to be updates in the account table rather than new inserts in a different table. This is because, by design, QLDB gives you the ledger of any table. Meaning you can later check all versions of that record and filter them as well.
Here's an example of what a history query would look like in an Accounts table:
SELECT ha.data.* FROM Accounts a BY Accounts_id
JOIN history(Accounts, `2022-04-10T00:00:00.000Z`, `2022-04-13T23:59:59.999Z`) ha
ON ha.metadata.id = Accounts_id
WHERE a.account_id = 1234
This unusual BY Accounts_id segment is what QLDB uses to get at the document id for the history join, and it is how you can join both tables while still filtering on an indexed column; in this case, account_id.
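For reference, the index that the filter relies on could be declared like this; QLDB indexes are single-field, and the table name follows the example above:

-- Single-field index on the Accounts table from the example above.
CREATE INDEX ON Accounts (account_id)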
I have a SQL Server 2008 database that I query regularly and that has over 30 million entries (joy!). Unfortunately this database cannot be drastically changed because it is still in use for R&D.
When I query this database, it takes FOREVER. By that I mean I haven't been patient enough to wait for results (after 2 mins I have to cancel to avoid locking the R&D department out). Even if I use a short date range (no more than a few months), it is basically impossible to get any results from it. I am querying with requirements on 4 of the columns and unfortunately have to use an inner join to another table (which I've been told is very costly in terms of query efficiency, but it is unavoidable). This inner-joined table has fewer than 100k entries.
What I was wondering is: is it possible to organize the table so that it is ordered by date by default, to reduce the number of rows it has to search through?
If this is not possible, is there anything I can do to reduce query times? Is there any other useful information that could assist me in coming up with a solution?
I have included a sample of the query that I use:
SELECT DISTINCT N.TestName
FROM [DalsaTE].[dbo].[ResultsUut] U
INNER JOIN [DalsaTE].[dbo].[ResultsNumeric] N
ON N.ModeDescription = 'Mode 8: Low Gain - Green-Blue'
AND N.ResultsUutId = U.ResultsUutId
WHERE U.DeviceName = 'BO-32-3HK60-00-R'
AND U.StartDateTime > '2011-11-25 01:10:10.001'
ORDER BY N.TestName
Any help or suggestions are appreciated!
It sounds like the datetime may be a text-based field and consequently an index isn't being used?
Could you try the following to see if you have any speed improvement:
select distinct N.TestName
from [DalsaTE].[dbo].[ResultsUut] U
inner join [DalsaTE].[dbo].[ResultsNumeric] N
on N.ModeDescription = 'Mode 8: Low Gain - Green-Blue'
and N.ResultsUutId = U.ResultsUutId
where U.DeviceName = 'BO-32-3HK60-00-R'
and U.StartDateTime > cast('2011-11-25 01:10:10.001' as datetime)
order by N.TestName
It would also be worth trying to change your inner join to a left outer join, as those occasionally perform faster for no obvious reason (at least none that I'm aware of).
You can add an index based on your date column, which should improve your query time. You can either use a CREATE INDEX statement or use the table designer.
Is the sole purpose of the join to provide sorting? If so, a quick thing to try would be to remove this, and see how much of a difference it makes - at least then you'll know where to focus your attention.
Finally, SQL server management studio has some useful tools such as execution plans that can help diagnose performance issues. Good luck!
There are a number of problems which may be causing delays in the execution of your query.
Indexes (other than the clustered index, which is typically the primary key) do not reorder the data; they merely create a separate structure (think phonebook) which orders a number of values and points back to the primary key.
Without seeing the type of data or the existing indexes, it's difficult, but at the very least, the following ASCENDING indexes might help:
[DalsaTE].[dbo].[ResultsNumeric] ModeDescription and ResultsUutId and TestName
[DalsaTE].[dbo].[ResultsUut] StartDateTime and DeviceName and ResultsUutId
With the indexes above, the sample query you gave could be completed without performing a single lookup on the actual table data.
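For illustration, here is what those two covering indexes might look like in T-SQL; the index names are hypothetical and the column order simply follows the suggestion above:

-- Hypothetical index names; columns taken from the suggestion above.
CREATE INDEX IX_ResultsNumeric_Mode_Uut_Test
    ON [DalsaTE].[dbo].[ResultsNumeric] (ModeDescription, ResultsUutId, TestName);

CREATE INDEX IX_ResultsUut_Start_Device_Uut
    ON [DalsaTE].[dbo].[ResultsUut] (StartDateTime, DeviceName, ResultsUutId);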
I need to write a query that will group a large number of records by periods of time from Year to Hour.
My initial approach has been to decide the periods procedurally in C#, iterate through each and run the SQL to get the data for that period, building up the dataset as I go.
SELECT Sum(someValues)
FROM table1
WHERE deliveryDate BETWEEN #fromDate AND #toDate
I've subsequently discovered I can group the records using Year(), Month(), Day(), datepart(week, date) and datepart(hh, date).
SELECT Sum(someValues)
FROM table1
GROUP BY Year(deliveryDate), Month(deliveryDate), Day(deliveryDate)
My concern is that using datepart in a group by will lead to worse performance than running the query multiple times for a set period of time due to not being able to use the index on the datetime field as efficiently; any thoughts as to whether this is true?
Thanks.
As with anything performance related: measure.
Checking the query plan for the second approach will tell you about any obvious problems in advance (a full table scan when you know one is not needed), but there is no substitute for measuring. In SQL performance testing, that measurement should be done with appropriately sized test data.
Since this is a complex case (you are not simply comparing two different ways to do a single query, but a single-query approach against an iterative one), aspects of your environment may play a major role in the actual performance.
Specifically:
the 'distance' between your application and the database, as the latency of each call will be wasted time compared to the one-big-query approach
whether you are using prepared statements or not (not using them causes additional parsing effort for the database engine on each query)
whether the construction of the range queries itself is costly (heavily influenced by the previous point)
If you put a formula around the field part of a comparison, you get a table scan.
The index is on field, not on datepart(field), so the expression must be calculated for ALL rows; I think your hunch is right.
you could do something similar to this:
SELECT Sum(someValues)
FROM
(
SELECT *, Year(deliveryDate) as Y, Month(deliveryDate) as M, Day(deliveryDate) as D
FROM table1
WHERE deliveryDate BETWEEN #fromDate AND #toDate
) t
GROUP BY Y, M, D
If you can tolerate the performance hit of joining in yet one more table, I have a suggestion that seems odd but works real well.
Create a table that I'll call ALMANAC with columns like weekday, month, year. You can even add columns for company specific features of a date, like whether the date is a company holiday or not. You might want to add a starting and ending timestamp, as referenced below.
Although you might get by with one row per day, when I did this I found it convenient to go with one row per shift, where there are three shifts in a day. Even at that rate, a period of ten years was only a little over 10,000 rows.
When you write the SQL to populate this table, you can make use of all the date-oriented built-in functions to make the job easier. When you go to do queries you can use the date column as a join condition, or you may need two timestamps to provide a range that catches the timestamps you want. The rest of it is as easy as working with any other kind of data.
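A minimal sketch of the idea, assuming SQL Server-style syntax; the table, column names and one-row-per-day granularity are all illustrative choices, not part of the original suggestion:

-- Illustrative calendar table; names, types and granularity are assumptions
-- (one row per day rather than per shift).
CREATE TABLE almanac (
    almanac_date  date      PRIMARY KEY,
    cal_year      int       NOT NULL,
    cal_month     int       NOT NULL,
    cal_day       int       NOT NULL,
    cal_weekday   int       NOT NULL,
    is_holiday    bit       NOT NULL DEFAULT 0,
    period_start  datetime  NOT NULL,  -- first instant of the day
    period_end    datetime  NOT NULL   -- first instant of the next day
);

-- Grouping then becomes a plain join against the precalculated parts.
SELECT a.cal_year, a.cal_month, a.cal_day, SUM(t.someValues) AS total
FROM   table1 t
JOIN   almanac a
  ON   t.deliveryDate >= a.period_start
 AND   t.deliveryDate <  a.period_end
GROUP BY a.cal_year, a.cal_month, a.cal_day;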
I was looking for similar solution for reporting purposes, and came across this article called Group by Month (and other time periods). It shows various ways, good and bad, to group by the datetime field. Definitely worth looking at.
I think that you should benchmark it to get reliable results, but IMHO my first thought would be that letting the DB take care of it (your 2nd approach) would be much faster than doing it in your client code.
With your first approach, you have multiple roundtrips to the DB, which I think will be far more expensive. :)
You may want to look at a dimensional approach (this is similar to what Walter Mitty has suggested), where each row has a foreign key to a date and/or time dimension. This allows very flexible summations through the join to this table where these parts are precalculated. In these cases, the key is usually a natural integer key of the form YYYYMMDD and HHMMSS, which is relatively performant and also human readable.
Another alternative might be indexed views, where there are separate expressions for each of the date parts.
Or calculated columns.
But performance has to be tested and execution plans examined...
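As a sketch of the calculated-column idea, assuming SQL Server syntax and illustrative column names: the date parts can be persisted and indexed once instead of being recalculated per query.

-- Persisted computed columns for the date parts (illustrative names).
ALTER TABLE table1 ADD deliveryYear  AS YEAR(deliveryDate)  PERSISTED;
ALTER TABLE table1 ADD deliveryMonth AS MONTH(deliveryDate) PERSISTED;
ALTER TABLE table1 ADD deliveryDay   AS DAY(deliveryDate)   PERSISTED;

CREATE INDEX IX_table1_deliveryParts
    ON table1 (deliveryYear, deliveryMonth, deliveryDay);

-- The GROUP BY can then reference the precalculated columns.
SELECT deliveryYear, deliveryMonth, deliveryDay, SUM(someValues) AS total
FROM   table1
GROUP BY deliveryYear, deliveryMonth, deliveryDay;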
The Query I'm writing runs fine when looking at the past few days, once I go over a week it crawls (~20min). I am joining 3 tables together. I was wondering what things I should look for to make this run faster. I don't really know what other information is needed for the post.
EDIT: More info: db is Sybase 10. Query:
SELECT a.id, a.date, a.time, a.signal, a.noise,
b.signal_strength, b.base_id, b.firmware,
a.site, b.active, a.table_key_id
FROM adminuser.station AS a
JOIN adminuser.base AS b
ON a.id = b.base_id
WHERE a.site = 1234 AND a.date >= '2009-03-20'
I also took out the 3rd JOIN and it still runs extremely slow. Should I try another JOIN method?
I don't know Sybase 10 that well, but try running that query for say 10-day period and then 10 times, for each day in a period respectively and compare times. If the time in the first case is much higher, you've probably hit the database cache limits.
The solution then is to simply run queries for shorter periods in a loop (in the program, not SQL). It works especially well if table A is partitioned by date.
You can get a lot of information (assuming you're using MSSQL here) by running your query in SQL Server Management Studio with the Include Actual Execution Plan option set (in the Query menu).
This will show you a diagram of the steps that SQLServer performs in order to execute the query - with relative costs against each step.
The next step is to rework the query a little (try doing it a different way) then run the new version and the old version at the same time. You will get two execution plans, with relative costs not only against each step, but against the two versions of the query! So you can tell objectively if you are making progress.
I do this all the time when debugging/optimizing queries.
Make sure you have indexes on the foreign keys.
It sounds more like you have a memory leak or aren't closing database connections in your client code than that there's anything wrong with the query.
[edit]
Never mind: you mean querying over a date range rather than the duration the server has been active. I'll leave this up to help others avoid the same confusion.
Also, it would help if you could post the SQL query, even if you need to obfuscate it some first, and it's a good bet to check whether there's an index on your date column and how many records are returned by the longer range.
You may want to look into using a PARTITION for the date ranges, if your DB supports it. I've heard this can help significantly.
Grab the book "Professional SQL Server 2005 Performance Tuning"; it's pretty great.
You didn't mention your database. If it's not SQL Server, the specifics of how to get the data might be different, but the advice is fundamentally the same.
Look at indexing, for sure, but the first thing to do is to follow Blorgbeard's advice and scan for execution plans using Management Studio (again, if you are running SQL Server).
What I'm guessing you'll see is that for small date ranges, the optimizer picks a reasonable query plan, but that when the date range is large, it picks something completely different, likely involving either table scans or index scans, and possibly joins that lead to very large temporary recordsets. The execution plan analyzer will reveal all of this.
A scan means that the optimizer thinks that grinding over the whole table or the whole index is cheaper for what you are trying to do than seeking specific values.
What you eventually want to do is get your indexes and the syntax of your query set up such that you keep index seeks in the query plan regardless of the date range, or, failing that, ensure the scans you do require are filtered as tightly as you can manage, to minimize temporary recordset size and thereby avoid excessive reads and I/O.
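As a hedged illustration, using the column names from the query in the question and generic SQL syntax (the exact Sybase form may differ slightly), this is the kind of composite index that can keep the site and date filter as a seek and support the join:

-- Illustrative index names; columns come from the WHERE and JOIN clauses above.
CREATE INDEX station_site_date ON adminuser.station (site, date);
CREATE INDEX base_base_id      ON adminuser.base (base_id);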
SELECT
a.id, a.date, a.time, a.signal, a.noise,a.site, b.active, a.table_key_id,
b.signal_strength, b.base_id, b.firmware
FROM
( SELECT * FROM adminuser.station
WHERE site = 1234 AND date >= '2009-03-20') AS a
JOIN
adminuser.base AS b
ON
a.id = b.base_id
I kind of rewrote the query so as to first filter the desired rows and then perform the join, rather than perform the join and then filter the result.
Rather than pulling * from the sub-query, you can select just the columns you want, which might help a little.
Maybe this will be of a little help in speeding things up.
While this is valid in MySQL, I am not sure of the Sybase syntax though.
What techniques can be applied effectively to improve the performance of SQL queries? Are there any general rules that apply?
Use primary keys
Avoid select *
Be as specific as you can when building your conditional statements
De-normalisation can often be more efficient
Table variables and temporary tables (where available) will often be better than using a large source table
Partitioned views
Employ indices and constraints
Learn what's really going on under the hood - you should be able to understand the following concepts in detail:
Indexes (not just what they are but actually how they work).
Clustered indexes vs heap allocated tables.
Text and binary lookups and when they can be in-lined.
Fill factor.
How records are ghosted for update/delete.
When page splits happen and why.
Statistics, and how they affect various query speeds.
The query planner, and how it works for your specific database (for instance on some systems "select *" is slow, on modern MS-Sql DBs the planner can handle it).
The biggest thing you can do is to look for table scans in SQL Server Query Analyzer (make sure you turn on "show execution plan"). Otherwise there are a myriad of articles at MSDN and elsewhere that will give good advice.
As an aside, when I started learning to optimize queries I ran SQL Server Profiler against a trace, looked at the generated SQL, and tried to figure out why that was an improvement. Profiler is far from optimal, but it's a decent start.
There are a couple of things you can look at to optimize your query performance.
Ensure that you just have the minimum of data. Make sure you select only the columns you need. Reduce field sizes to a minimum.
Consider de-normalising your database to reduce joins
Avoid loops (i.e. fetch cursors), stick to set operations.
Implement the query as a stored procedure as this is pre-compiled and will execute faster.
Make sure that you have the correct indexes set up. If your database is used mostly for searching then consider more indexes.
Use the execution plan to see how the processing is done. What you want to avoid is a table scan as this is costly.
Make sure that Auto Statistics is set to on. SQL Server needs this to help decide the optimal execution plan. See Mike Gunderloy's great post for more info: Basics of Statistics in SQL Server 2005
Make sure your indexes are not fragmented (a quick way to check is sketched after this list). Reducing SQL Server Index Fragmentation
Make sure your tables are not fragmented. How to Detect Table Fragmentation in SQL Server 2000 and 2005
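As promised above, a hedged sketch of checking fragmentation on SQL Server 2005 and later using the dynamic management view; the object name is a placeholder:

-- Placeholder object name; LIMITED is the cheapest scan level.
SELECT index_id, avg_fragmentation_in_percent, page_count
FROM   sys.dm_db_index_physical_stats(DB_ID(), OBJECT_ID('dbo.MyTable'), NULL, NULL, 'LIMITED');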
Use a WITH statement to handle query filtering.
Limit each subquery to the minimum number of rows possible,
then join the subqueries.
WITH
master AS
(
SELECT SSN, FIRST_NAME, LAST_NAME
FROM MASTER_SSN
WHERE STATE = 'PA' AND
GENDER = 'M'
),
taxReturns AS
(
SELECT SSN, RETURN_ID, GROSS_PAY
FROM MASTER_RETURNS
WHERE YEAR < 2003 AND
YEAR > 2000
)
SELECT *
FROM master,
taxReturns
WHERE master.ssn = taxReturns.ssn
Subqueries within a WITH statement may end up being the same as inline views,
or automatically generated temp tables. I find in the work I do (retail data) that about 70-80% of the time there is a performance benefit.
100% of the time, there is a maintenance benefit.
I think using SQL Query Analyzer would be a good start.
In Oracle you can look at the explain plan to compare variations on your query.
Make sure that you have the right indexes on the table. If you frequently use a column as a way to order or limit your dataset, an index can make a big difference. I saw in a recent article that select distinct can really slow down a query, especially if you have no index.
The obvious optimization for SELECT queries is ensuring you have indexes on columns used for joins or in WHERE clauses.
Since adding indexes can slow down data writes you do need to monitor performance to ensure you don't kill the DB's write performance, but that's where using a good query analysis tool can help you balance things accordingly.
Indexes
Statistics
On the Microsoft stack, the Database Engine Tuning Advisor
Some other points (mine are based on SQL Server; since each db backend has its own implementation, they may or may not hold true for all databases):
Avoid correlated subqueries in the select part of a statement; they are essentially cursors.
Design your tables to use the correct datatypes to avoid having to apply functions on them to get the data out. It is far harder to do date math when you store your data as varchar for instance.
If you find that you are frequently doing joins that have functions in them, then you need to think about redesigning your tables.
If your WHERE or JOIN conditions include OR statements (which are slower) you may get better speed using a UNION statement.
UNION ALL is faster than UNION if (and only if) the two statements are mutually exclusive and return the same results either way.
NOT EXISTS is usually faster than NOT IN or using a left join with a WHERE clause of ID IS NULL.
In an UPDATE query, add a WHERE condition to make sure you are not updating values that are already equal. The difference between updating 10,000,000 records and 4 can be quite significant! (A small sketch of this follows after this list.)
Consider pre-calculating some values if you will be querying them frequently or for large reports. A sum of the values in an order only needs to be done when the order is made or adjusted, rather than when you are summarizing the results of 10,000,000 orders in a report. Pre-calculations should be done in triggers so that they are always up to date as the underlying data changes. And it doesn't have to be just numbers either; we have a calculated field that concatenates names that we use in reports.
Be wary of scalar UDFs, they can be slower than putting the code in line.
Temp tables tend to be faster for large data sets and table variables faster for small ones. In addition, you can index temp tables.
Formatting is usually faster in the user interface than in SQL.
Do not return more data than you actually need.
This one seems obvious but you would not believe how often I end up fixing this. Do not join to tables that you are not using to filter the records or actually calling one of the fields in the select part of the statement. Unnecessary joins can be very expensive.
It is a very bad idea to create views that call other views that call other views. You may find you are joining to the same table 6 times when you only need to once, and creating 100,000,000 records in an underlying view in order to get the 6 that are in your final result.
In designing a database, think about reporting, not just the user interface for entering data. Data is useless if it is not used, so think about how it will be used after it is in the database and how that data will be maintained or audited. That will often change the design. (This is one reason why it is a poor idea to let an ORM design your tables; it is only thinking about one use case for the data.) The most complex queries affecting the most data are in reporting, so designing changes to help reporting can speed up queries (and simplify them) considerably.
Database-specific implementations of features can be faster than using standard SQL (That's one of the ways they sell their product), so get to know your database features and find out which are faster.
And because it can't be said too often, use indexes correctly, not too many or too few. And make your WHERE clauses sargable (able to use indexes).
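As mentioned in the UPDATE point above, here is a minimal sketch of the guard clause; the table and column names are invented purely for illustration:

-- Hypothetical table and columns, purely for illustration.
UPDATE Customers
SET    status = 'Inactive'
WHERE  last_order_date < '2010-01-01'
  AND  status <> 'Inactive';  -- skip rows that already have the target value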