Fine-tuning an Oracle query with a pipelined function - SQL

I have a query (which powers an Oracle Application Express report) that our users tell me is executing "slowly", or at an unacceptable speed. I wasn't given an actual load time for the page, and the query is the only thing on the page.
The query involves many tables and references a pipelined function that identifies the user currently logged in to our website and returns a custom "table" of the records they have permission to see, based on a custom security scheme we have.
My main question is about Oracle's caching of queries and how it is affected by our setup.
When I took the query out of the webpage and ran it in SQL Developer (manually specifying a user ID to simulate a logged-in website user), the performance went from 71 seconds to 19 seconds to 0.5 seconds across successive runs. Clearly, Oracle is using its caching mechanism to make subsequent runs faster.
How is this affected by the fact that different users will get different tables from the pipelined function (all the same columns, just a different number of rows and different values in those rows)? Does the pipelining prevent caching from working? Am I only seeing caching because I'm running a very isolated test?
Furthermore, is caching easily influenced by the number of people using the system? I'm not sure how much can be cached. If we have 50 concurrent users accessing different parts of the website and loading different queries all day long, is it likely that Oracle won't be able to cache many (or any) of them because it is constantly seeing different queries?
Sorry my question isn't very technical; I'm a developer who has been asked to help out with what seems like a DBA question.
This is also complicated by the fact that I can't really determine the actual load times, since our users don't report that level of detail.
Any thoughts on:
how I can determine if this query is actually slow?
what the average processing time would be?
and how to proceed with fine-tuning if it is a problem?
Thanks!

It doesn't sound like this has anything to do with APEX, pipelined table functions, or query caching. It sounds like you are describing the effects of plain old data caching (most likely at the database level but potentially at the operating system and disk subsystem layers).
As a very basic overview: data is stored in rows, rows are stored in blocks (most commonly 8 KB in size), blocks are stored in extents (generally a few MB in size), and extents roll up to segments (i.e., a table). Oracle maintains a buffer cache where the most recently accessed blocks are stored. When you run a query, Oracle figures out which blocks it needs to read in order to get your data (this is the query plan). It then checks whether those blocks are in the buffer cache or whether they have to be read from disk. Obviously, reading a block from cache is much more efficient than reading it off disk, since RAM is much faster than disk. If you run the same query with the same set of bind variable values multiple times in a row, you'll be accessing the same set of blocks each time, but more and more of the blocks you care about will be in the cache. So you'd generally expect the second and third executions of the query to be faster.
If you run the query with a different set of bind variable values, and that second set causes Oracle to access many of the same blocks, those executions will benefit from the data the prior test cached. Otherwise, you'd be back to square one, potentially reading all the data you need off disk. Most likely, you'll see some combination of the two.
Remember as well that it is not just Oracle that is caching data. Frequently, the operating system will be caching the most active pieces of the underlying Oracle data files, and the I/O subsystem will be caching the most recently accessed data as well. So even if Oracle thinks it needs to go out and fetch a block because it is not in the database's buffer cache, the file system or the I/O subsystem may have cached that data, so it may not require an actual physical read off disk. These other caches behave similarly: running the same query multiple times in a row tends to warm them and improve the performance of the later runs.
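If you want to put numbers on "is this query actually slow?", one option (assuming you can query V$SQL, which needs appropriate privileges) is to look at the cursor statistics Oracle already keeps. This is a minimal sketch; the LIKE filter is just an illustrative placeholder for identifying your statement:

SELECT sql_id,
       executions,
       ROUND(buffer_gets  / NULLIF(executions, 0))          AS avg_buffer_gets,  -- block reads served from the buffer cache
       ROUND(disk_reads   / NULLIF(executions, 0))          AS avg_disk_reads,   -- block reads that had to hit disk
       ROUND(elapsed_time / NULLIF(executions, 0) / 1e6, 2) AS avg_elapsed_sec   -- elapsed_time is in microseconds
  FROM v$sql
 WHERE sql_text LIKE '%my_report_query%'                                         -- hypothetical identifying fragment
 ORDER BY elapsed_time DESC;

A first run with a high avg_disk_reads followed by repeats where the figure drops towards zero is exactly the buffer-cache warming described above, and avg_elapsed_sec gives you the average processing time the users couldn't provide.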

Related

SQL Server select with large varchar columns takes time to load

I am trying to run a simple select query, and it has a column called instructions, a varchar(8000), in the select column list. The table has 90,000 records, and it took my SQL Server Management Studio console 10 seconds to return and display the full table data:
SELECT id, name, instructions, etc.... FROM TABLE;
However, when I remove instructions from the select list, it takes only about 1 second to execute and display the result. Can anyone please help me understand the theory behind this?
Thanks
Keth
There are some obvious things here that impact the time, and a few more subtle ones around it. The topic of the underlying storage of SQL Server and how it stores / retrieves this data is a book in itself, of which there are many. (I'd personally recommend Kalen Delaney but everyone will have their own preference and I appreciate we should keep away from subjectivity on SO).
90k rows of instructions potentially have to be marshalled across your network connection if you are connected from a machine other than the server.
The SSMS console itself has to display these, which takes time.
Depending on the size of what you are reading versus your buffer cache and the other queries being executed, you could be putting pressure on your cache and generating more physical I/O load for the server as a whole.
As mentioned in comments, more data is being read, but does this mean more is being read from the disk? This one is far more subtle when looked at in detail.
In terms of the disk I/O issue, it depends on when the instructions are placed in the row and the column settings around inlining of data. It might be that the instructions for a row are stored inline with the row, which means no additional disk I/O actually occurs to read them versus not reading them; it's more a case of whether SQL Server bothers to decode the value from the page already in memory.
The varchar(8000), though, might not be inline with the rest of the data; it could be on a row-overflow data page, sometimes referred to as a short large object (SLOB). In that case the instructions field stores a pointer to where the data is held, and reading the instructions forces SQL Server to read another, essentially random, page (and extent) elsewhere on the disk for each row.
Depending on how and when instructions are added, you could also see a high level of fragmentation and a lack of contiguous extents allocated for these instructions, although depending on the I/O subsystem this may be immaterial to the problem.
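If you want to check whether the instructions values are actually going off-row, one hedged way is to look at the allocation units behind the table; the table name below is a placeholder:

-- Does this table have row-overflow pages at all?
SELECT au.type_desc,                    -- IN_ROW_DATA, ROW_OVERFLOW_DATA or LOB_DATA
       au.total_pages,
       au.used_pages
  FROM sys.allocation_units AS au
  JOIN sys.partitions       AS p
    ON au.container_id = p.hobt_id
 WHERE p.object_id = OBJECT_ID('dbo.YourTable');    -- hypothetical table name

A non-trivial page count under ROW_OVERFLOW_DATA would support the "extra random page per row" explanation above; if everything is IN_ROW_DATA, the cost is more likely network transfer and SSMS rendering.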
There are a lot of unknowns at this point which makes it harder to give anything definitive - you are in the 'it depends' area of the DB, which would need a lot more specifics and investigation to be able to point at a specific cause, vs the more general (and not entirely complete) list above.
As Tim Biegeleisen mentioned, do not read the instructions unless you need to.

Is it possible to reject excessively large queries on specific views?

I'm working with MS-SQL Server, and we have several views that have the potential to return enormous amounts of processed data, enough to spike our servers to 100% resource usage for 30 minutes straight with a single query (if queried irresponsibly).
There is absolutely no business case in which such huge amounts of data would need to be returned from these views, so we'd like to lock it down to make sure nobody can DoS our SQL servers (intentionally or otherwise) by simply querying these particular views without proper where clauses etc.
Is it possible, via triggers or another method, to check the where clause etc. and confirm whether a given query is "safe" to execute (based on thresholds we determine), and reject the query if it doesn't meet our guidelines?
Or can we configure the server to reject given execution plans based on estimated time-to-completion etc.?
One potential way to reduce the overall cost of certain queries coming from a certain group of people is to use the Resource Governor. You can throttle how much CPU and/or memory is used up by a particular user/group. This is effective if you have a "wild west" kind of environment where some users submit bad queries that eat your resources alive. See here.
Another thing to consider is setting MAXDOP (max degree of parallelism) to prevent any single query from taking all of the available CPU threads. That is, with a MAXDOP of 2, any single query can use at most 2 CPU threads. This is useful to prevent a large query from crowding out smaller, quicker ones. See here.
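As a rough illustration of the Resource Governor idea (all object names are made up and the limits are arbitrary placeholders you would tune; note that this throttles offenders rather than rejecting their queries), the setup looks roughly like this, run in master:

CREATE RESOURCE POOL ReportingPool
    WITH (MAX_CPU_PERCENT = 30, MAX_MEMORY_PERCENT = 30);

CREATE WORKLOAD GROUP ReportingGroup
    WITH (MAX_DOP = 2)                      -- also caps parallelism for this group
    USING ReportingPool;
GO

CREATE FUNCTION dbo.rg_classifier()
RETURNS sysname
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @grp sysname = N'default';
    IF SUSER_SNAME() = N'reporting_user'    -- hypothetical login used for the heavy views
        SET @grp = N'ReportingGroup';
    RETURN @grp;
END;
GO

ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.rg_classifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;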
Kind of hacky, but you could put a TOP x in every view.
You cannot enforce it on the SQL side, but on the app side they could use a timeout. But if they lack QC, they probably lack the discipline for a timeout. If you have some queries running for 30 minutes, they are probably setting a value longer than the default.
I'm not convinced about Blam's TOP x in each view: without a corresponding ORDER BY clause, the data will be returned in an indeterminate order. There may be benefits to CDC's MAXDOP suggestion, not so much for the query itself but for the other queries that want to run at the same time.
I'd be inclined to look at moving to stored procedures. Then you can require input parameters and evaluate them before the query gets run in earnest. If, for example, a date range is too big, you can restrict it. You should also find out who is running the expensive query and what they really need. Seems like they might benefit from some ETL. Just some ideas.

DBGrid Filter, Delphi.

I've recently delved into the world of Delphi. For my current mini project, I'm obtaining data via an SQL query and then using the filter property to display exactly what I want.
I discovered the filter by accident and now prefer it to making multiple connections or calls to the database. For example, I return a person object that may own many cars; the app has check boxes, and depending on which is selected, it updates the filter to display only the cars that are blue or pink or whatever.
As far as I understand it, the filter works like a WHERE clause, but on the dataset returned by the initial query. So my question is: is it faster to use the filter property when working with a small dataset in this manner? And am I wrong in thinking that the dataset is returned once, stored, and the filter then applied to it, as opposed to the query constantly being re-run?
I've looked online, and the resources lead me to believe it is more efficient, but I'm still unsure. Thanks for any help.
A filter on a dataset does indeed work (or at least behave) like a WHERE clause, and in some cases can be very fast.
The issues with depending on filters are:
Increased network traffic. You're moving considerably more data from the server to the client than is needed, because you're just filtering much of it out anyway.
Filters are applied to the data row by row. A WHERE clause can be optimized by the server to be fully (or at least partially) satisfied from existing indexes, whereas the client does not have those indexes available (a small SQL sketch of this follows this answer).
Increased memory and CPU use on the client to maintain data it isn't using in memory and to process the rows for filtering.
Data updated by other users or processes is not visible to the client app, as you're now working with all of the data in local memory and not refreshing from the server.
IMO, using a filter on all but a trivial dataset isn't a good option, and if the amount of data is that small you can move the entire dataset into a TClientDataSet and keep it in memory yourself anyway. Like every other optimization being considered, the proper answer depends on the needs of your application and the actual data in question, and should be benchmarked using that criteria to determine what is actually the better solution.
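To make the indexing point above concrete, the server-side alternative to the client filter is just a parameterized WHERE clause. A minimal sketch, assuming a hypothetical cars table with owner_id and color columns:

SELECT c.id,
       c.owner_id,
       c.color,
       c.model
  FROM cars c
 WHERE c.owner_id = :owner_id      -- the person whose cars are shown
   AND c.color    = :color;        -- bound from the check box selection

Which approach wins depends on the data volume and how chatty you can afford to be with the server, which is exactly the benchmarking point above.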
Two different animals. You're asking if it's less overhead to repeatedly query a database or do the filtering exclusively on the client side.
If your app and db are both running on the same machine, then it's probably a toss-up.
But if you're running this in a client-server, n-tier, or partitioned mobile application, and this is a common operation, then I'd say you're probably better off caching a larger set of results from a single query on the client side and using filters to let the users see different views of the results. That reduces the bandwidth to the host, and the users enjoy faster response times.
(It's a pet peeve of mine to be searching for cars or apartments or real estate and I check or un-check a box to change the view, and I have to wait 5-10 seconds for the app to reply.)
That said, you might also want to consider the overall size of the data, its temporality, and how often it's updated, and see if it's worthwhile loading down significant chunks to the client to localize even more of the specialized views. Pull down whole records and cache them locally to offer users faster response times, and minimize reloading of cached records whenever possible.
A lot of times the actual data is fairly small on a per-record basis, but when you add in the media it explodes. People often don't think about that and consider only the aggregate size of each "record", including the media blobs. If the DB designer was smart, the media isn't even stored in the DB but elsewhere, accessible via URLs.

SQL DB performance and repeated queries at short intervals

If a query is constantly sent to a database at short intervals, say every 5 seconds, could the number of reads generated cause problems in terms of performance or availability? If the database is Oracle are there any tricks that can be used to avoid a performance hit? If the queries are coming from an application is there a way to reduce any impact through software design?
Unless your query is very intensive or horribly written, it won't cause any noticeable issues running once every few seconds. That's not very often for queries that are generally measured in milliseconds.
You may still want to optimize it though, simply because there are better ways to do it. In Oracle and ADO.NET you can use an OracleDependency for the command that ran the query the first time and then subscribe to its OnChange event which will get called automatically whenever the underlying data would cause the query results to change.
It depends on the query. I assume the reason you want to execute it periodically is that the data being returned will change frequently. If that's the case, then application-level caching is obviously not an option.
Past that, is this query "big" in terms of the number of rows returned, tables joined, data aggregated / calculated? If so, it could be a problem if:
You are querying faster than it takes to execute the query. If you are calling it once a second, but it takes 2 seconds to run, that's going to become a problem.
If the query is touching a lot of data and you have a lot of other queries accessing the same tables, you could run into lock escalation issues.
As with most performance questions, the only real answer is to test. In this case test with realistic data in the DB and run this query concurrent with the other query load you expect on the system.
Along the lines of Samuel's suggestion, Oracle provides facilities in JDBC to do database change notification so that your application can subscribe to changes in the underlying data rather than re-running the query every few seconds. If the data is changing less frequently than you're running the query, this can be a major performance benefit.
Another option would be to use Oracle TimesTen as an in memory cache of the data on the middle tier machine(s). That will reduce the network round-trips and it will go through a very optimized retrieval path.
Finally, I'd take a look at using the query result cache to have Oracle cache the results.
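For the result cache option, the usual mechanism (Oracle 11g and later, subject to the RESULT_CACHE_MODE setting and cache sizing) is a hint on the query; the table here is only a placeholder:

-- Ask Oracle to keep the finished result set in the server result cache; repeat
-- executions are served from memory until DML on the underlying tables invalidates it.
SELECT /*+ RESULT_CACHE */
       status_code,
       COUNT(*) AS order_count
  FROM orders                       -- hypothetical table the application polls
 GROUP BY status_code;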

What are the benefits of using database cursor?

This is based on an interview question that I faced.
A very short definition could be: it can be used to manipulate the rows returned by a query.
Besides the uses of a cursor (the points are listed on MSDN), my question is: if we can perform all the same operations using a query or stored procedure (if I'm not wrong, like we can with Transact-SQL for MS SQL), is there any concrete case where we should use a cursor?
Using cursors compared to big resultsets is like using video streaming instead of downloading a video in one swoop and watching it once it has downloaded.
If you download, you have to have a few gigs of space and the patience to wait until the download has finished. No matter how fast your machine or network may be, everyone watches a movie at the same speed.
Normally any query gets sent to the server, executed, and the resultset sent over the network to you, in one burst of activity.
The cursor will give you access to the data row by row, streaming each row only when you request it (when you can actually view it).
A cursor can save you time - because you don't need to wait for the processing and download of your complete recordset
It will save you memory, both on the server and on the client because they don't have to dedicate a big chunk of memory to resultsets
Load-balance both your network and your server - Working in "burst" mode is usually more efficient, but it can completely block your server and your network. Such delays are seldom desirable for multiuser environments. Streaming leaves room for other operations.
Allows operations on queried tables (under certain conditions) that do not affect your cursor directly. So while you are holding a cursor on a row, other processes are able to read, update and even delete other rows. This helps especially with very busy tables with many concurrent reads and writes.
Which brings us to some caveats, however:
Consistency: Using a cursor, you do (usually) not operate on a consistent snapshot of the data, but on a row. So your concurrency/consistency/isolation guarantees drop from the whole database (ACID) to only one row. You can usually inform your DBMS what level of concurrency you want, but if you are too nitpicky (locking the complete table you are in), you will throw away many of the resource savings on the server side.
Transmitting every row by itself can be very inefficient, since every packet has negotiation overhead that you could avoid by sending big, perhaps compressed, chunks of data per packet. (No DB server or client library is stupid enough to transmit every row individually; there's caching and chunking on both ends. Still, it is relevant.)
Cursors are harder to do right. Consider a query with a big resultset, motivating you to use a cursor, that uses a GROUP BY clause with aggregate functions. (Such queries are common in data warehouses). The GROUP BY can completely trash your server, because it has to generate and store the whole resultset at once, maybe even holding locks on other tables.
Rule of thumb:
If you work on small, quickly created resultsets, don't use cursors.
Cursors excel at ad hoc, referentially complex queries of a sequential nature with big resultsets and low consistency requirements.
"Sequential nature" means there are no aggregate functions in heavy GROUP BY clauses in your query. The server can lazily decide to compute 10 rows for your cursor to consume from a cache and do other stuff meanwhile.
HTH
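To make the row-at-a-time model concrete, here is a minimal T-SQL sketch; the dbo.orders table and its columns are assumptions, not something from the question:

DECLARE @id int, @amount money;

DECLARE order_cur CURSOR FAST_FORWARD FOR     -- read-only, forward-only cursor
    SELECT id, amount
      FROM dbo.orders
     ORDER BY id;

OPEN order_cur;
FETCH NEXT FROM order_cur INTO @id, @amount;

WHILE @@FETCH_STATUS = 0                      -- 0 means the last fetch returned a row
BEGIN
    -- per-row work goes here (print, call a procedure, write elsewhere, ...)
    PRINT 'order ' + CAST(@id AS varchar(20)) + ': ' + CAST(@amount AS varchar(30));

    FETCH NEXT FROM order_cur INTO @id, @amount;
END;

CLOSE order_cur;
DEALLOCATE order_cur;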
A cursor is a tool that allows you to iterate over the records in a set. It has the concepts of order and current record.
Generally, SQL operates with multisets: these are sets of possibly repeating records in no given order, taken as a whole.
Say, this query:
SELECT *
FROM a
JOIN b
ON b.a = a.id
This query operates on the multisets a and b.
Nothing in this query makes any assumptions about the order of the records, how they are stored, in which order they should be accessed, etc.
This allows the implementation details to be abstracted away and lets the system try to choose the best possible algorithm to run the query.
However, after you have transformed all your data, ultimately you will need to access the records in an ordered way and one by one.
You don't care how exactly the entries of a phonebook are stored on a hard drive, but a printer does require them to be fed in alphabetical order, and the formatting tags need to be applied to each record individually.
That's exactly where cursors come into play. Each time you process a resultset on the client side, you are using a cursor. You don't get megabytes of unsorted data from the server: you just get a tiny variable, a resultset descriptor, and write something like this:
while (!rs.EOF) {
process(rs);
rs.moveNext();
}
It's the cursor that implements all this for you.
This of course concerns database-client interaction.
As for the database itself: inside the database, you rarely need cursors, since, as I said above, almost all data transformations can be implemented more efficiently using set operations.
However, there are exceptions:
Analytic operations in SQL Server are implemented very poorly. A cumulative sum, for instance, could be calculated much more efficiently with a cursor than with set-based operations.
Processing data in chunks. There are cases when a set-based operation needs to be applied sequentially to portions of a set, with the results of each chunk committed independently. While it's still possible to do this with set-based operations, a cursor is often the preferred way (a sketch of the chunk-and-commit pattern follows this list).
Recursion in systems that do not support it natively.
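As a sketch of the chunk-and-commit pattern from the second point, here it is in its batched set-based form (a cursor over, say, customer IDs serves the same purpose when each portion needs its own logic); the dbo.audit_log table and the retention rule are assumptions:

DECLARE @deleted int = 1;

WHILE @deleted > 0
BEGIN
    BEGIN TRANSACTION;

    DELETE TOP (5000)                                        -- one chunk per pass
      FROM dbo.audit_log
     WHERE logged_at < DATEADD(YEAR, -1, SYSDATETIME());

    SET @deleted = @@ROWCOUNT;                               -- zero once nothing qualifies

    COMMIT TRANSACTION;                                      -- each chunk commits on its own
END;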
You also may find this article worth reading:
The Island of Misfit Cursors
Using a cursor it is possible to read sequentially through a set of data, programmatically, so it behaves in a similar manner to conventional file access, rather than the set-based behaviour characteristic of SQL.
There are a couple of situations where this may be of use:
Where it is necessary to simulate file-based record access behaviour - for example, where a relational database is being used as the data storage mechanism for a piece of code that was previously written to use indexed files for data storage.
Where it is necessary to process data sequentially - a simple example might be calculating a running total balance for a specific customer. (A number of relational databases, such as Oracle and SQL Server, now have analytic extensions to SQL that greatly reduce the need for this; a sketch of that approach follows this list.)
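As a sketch of those analytic extensions (the transactions table and its columns are assumptions), the running balance can be expressed without any cursor, and this works in both Oracle and SQL Server:

SELECT customer_id,
       txn_date,
       amount,
       SUM(amount) OVER (PARTITION BY customer_id
                             ORDER BY txn_date
                             ROWS UNBOUNDED PRECEDING) AS running_balance
  FROM transactions
 ORDER BY customer_id, txn_date;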
Inevitably, Wikipedia has more: http://en.wikipedia.org/wiki/Database_cursor
With a cursor you access one row at a time, so it is good to use when you want to work with many rows but only one at any given time.
I was told in my classes that the reason to use a cursor is when you want to access more rows than you can fit in memory, so you can't just read all the rows into a collection and then loop through it.
Sometimes set-based logic can get quite complex and opaque. In those cases, if performance is not an issue, a server-side cursor can be used to replace the relational logic with more manageable and familiar (to a non-relational thinker) procedural logic, resulting in easier maintenance.