SAS: Approximating runtime/processing time for a query without running - sql

Was wondering if there is a way to estimate the total run time for a query without actually processing the query?
I have found when running particular queries it might take hours and I guess it would come in handy to know the approximate completion time as there have been times where I was stuck at work due to waiting for a query to finish.
Sorry not sure if this is a silly question I'm a bit new with SAS.
Thanks guys.

This comes down to how well you know the data you're working with. There is no simple method of estimating this that is guaranteed to work in all situations, as there are so many factors that contribute to query performance. That said, there are a few heuristics you can use:
If you're reading every row from a large table, try reading a small proportion of it first before scaling up to get an idea of how much that read will contribute to the total query execution time.
Try running your query with proc sql inobs = 100 _method; to find out what sorts of joins the query planner is selecting. If there are any cartesian joins (sqxjsl in the log output), your query is going to take at least O(m*n) to run, where m and n are the numbers of rows in the tables being joined.
Check whether there are any indexes on the tables that could potentially speed up your query.

Related

The performance of retrieve all columns and multiple columns

I am learning SQL following "SQL in 10 minutes",
Reference to use wildcards to retrieve all the records, it states that:
As a rule, you are better off not using the * wildcard unless you really do need every column in the table. Even though use of wildcards may save you the time and effort needed to list the desired columns explicitly, retrieving unnecessary columns usually slows down the performance of your retrieval and your application.
However, It consume less time to retrieve all the records than to retrieve multiple fields:
As the result indicate, wildcards for 0.02 seconds V.S. 0.1 seconds
I tested several times, wildcards faster than multiple specified columns constantly, even though time consumed varied every times.
Kudos to you for attempting to validate advice you get in a book! A single test neither invalidates the advice nor invalidates the test. It is worthwhile to dive further.
The advice provided in SQL In 10 Minutes is sound advice -- and it explicitly states that the purpose related to performance. (Another consideration is that that it makes the code unstable when the database changes.) As a note: I regularly use select t.* for ad-hoc queries.
Why are the results different? There can be multiple reasons for this:
Databases do not have deterministic performance, so other considerations -- such as other processes running on the machine or resource contention -- can affect the performance.
As mentioned in a comment, caching can be the reason. Specifically, running the first query may require loading the data from disk, and it is already in memory for the first.
Another form of caching is for the execution plan, so perhaps the first execution plan is cached but not the second.
You don't mention the database, but perhaps your database has a really, really slow compiler and compiling the first takes longer than the second.
Fundamentally, the advice is sound from a common-sense perspective. Moving less data around should be more efficient. That is really what the advice is saying.
In any case, the difference between 10 milliseconds and 2 milliseconds is very short. I would not generalize this performance to larger data and say that the second is 5 times faster than the first in general. For whatever reason, it is 8 milliseconds shorter on a very small data set, one so small that performance would not be a consideration anyway.
For manual testing the data that's in a table or tables?
Then it doesn't matter much whether you used a * or the column names.
Sure, if the table has like 100 columns and you only are interested in a few? Then explicitly adding the columnnames will give you a less convulted result.
Plus, you can choose the order they appear in the result.
And using a * in a sub-query would drag all the fields into the resultset.
While if you only selected the columns you need could improve performance.
For manual testing, that normally doesn't matter much.
Whether a test SQL runs 1 seconds or 2 seconds, if it's a test or an ad-hoc query then it wouldn't bother you.
What the suggestion is more intended for, is about coding SQL's that are to be used in a production environment.
When using * in a SQL, that means that when something changes in the tables that are used in the query, that it can affect the output of that query.
Possibly leading to errors. Your boss would frown upon that!
For example, a SQL with a select * from tableA union select * from tableB that you coded a year ago suddenly starts crashing because a column was added to tableB. Ouch.
But by explicitly putting the column names, adding a column to 1 of the tables wouldn't make any difference to that SQL.
In other words.
In production, stability and performance matter much more than golf-coding.
Another thing to keep in mind is the effect of caching.
Some databases can temporarly store metadata or even data in memory.
Which can speed up the retrieval of a query that gets the same results of a query that just run before it.
So try running the following SQL's.
Which are in a different order than in the question.
And check if there's still a speed difference.
select * from products;
select prod_id, prod_name, prod_price from products;

neo4j queries response time

Testing queries responses time returns interesting results:
When executing the same query several times in a row, at first the response times get better until a certain point, then in each execute it gets a little slower or jumps inconsistently.
Running the same query while using the USING INDEX and in other times not using the USING INDEX, returns almost the same responses times range (as described in clause 1), although the profile is getting better (less db hits while using the USING INDEX).
Dropping the index and re-running the query returns the same profile as executing the query while the index exists but the query has been executed without the USING INDEX.
Is there an explanation to the above results?
What will be the best way to know if the query has been improved if although the db hits are getting better, the response times aren't?
The best way to understand how a query executes is probably to use the PROFILE command, which will actually explain how the database goes about executing the query. This should give you feedback on what cypher does with USING INDEX hints. You can also compare different formulations of the same query to see which result in fewer dbHits.
There probably is no comprehensive answer to why the query takes a variable amount of time in various situations. You haven't provided your model, your data, or your queries. It's dependent on a whole host of factors outside of just your query, for example your data model, your cache settings, whether or not the JVM decides to garbage collect at certain points, how full your heap is, what kind of indexes you have (whether or not you use USING INDEX hints) -- and those are only the factors at the neo4j/java level. At the OS level there are many other possibilities/contingencies that make precise performance measurement difficult.
In general when I'm concerned about these things I find it's good to gather a large data sample (run the query 10,0000 times) and then take an average. All of the factors that are outside of your control tend to average out in a sample like that, but if you're looking for a concrete prediction of exactly how long this next query is going to take, down to the milliseconds, that may not be realistically possible.

How do I improve performance when querying on a column that changes frequently in SQL Azure using LINQ to SQL

I have an SQL Azure database, and one of the tables contains over 400k objects. One of the columns in this table is a count of the number of times that the object has been downloaded.
I have several queries that include this particular column (call it timesdownloaded), sorted descending, in order to find the results.
Here's an example query in LINQ to SQL (I'm writing all this in C# .NET):
var query = from t in db.tablename
where t.textcolumn.StartsWith(searchfield)
orderby t.timesdownloaded descending
select t.textcolumn;
// grab the first 5
var items = query.Take(5);
This query called perhaps 90 times per minute on average.
Objects are downloaded perhaps 10 times per minute on average, so this timesdownloaded column is updated that frequently.
As you can imagine, any index involving the timesdownloaded column gets over 30% fragmented in a matter of hours. I have implemented an index maintenance plan that checks and rebuilds these indexes when necessary every few hours. This helps, but of course adds spikes in query response times whenever the indexes are rebuilt which I would like to avoid or minimize.
I have tried a variety of indexing schemes.
The best performing indexes are covering indexes that include both the textcolumn and timesdownloaded columns. When these indexes are rebuilt, the queries are amazingly quick of course.
However, these indexes fragment badly and I end up with pretty frequent delay spikes due to rebuilding indexes and other factors that I don't understand.
I have also tried simply not indexing the timesdownloaded column. This seems to perform more consistently overall, though slower of course. And when I check on the SQL query execution plan, it seems to be pretty inconsistent in how SQL tries to optimize this query. Of course it ends up with a log of logical reads as it has to fetch the timesdownloaded column from the table and not an organized index. So this isn't optimal.
What I'm trying to figure out is if I am fundamentally missing something in how I have configured or manage this database.
I'm no SQL expert, and I've yet to find a good answer for how to do this.
I've seen some suggestions that Stored Procedures could help, but I don't understand why and haven't tried to get those going with LINQ just yet.
As commented below, I have considered caching but haven't taken that step yet either.
For some context, this query is a part of a search suggestion feature. So it is called frequently with many different search terms.
Any suggestions would be appreciated!
Based on the comments to my question and further testing, I ended up using an Azure Table to cache my results. This is working really well and I get a lot of hits off of my cache and many fewer SQL queries. The overall performance of my API is much better now.
I did try using Azure In Role Caching, but that method doesn't appear to work well for my needs. It ended up using too much memory (no matter how I configured it, which I don't understand), swapping to disk like crazy and brought my little Small instances to their knees. I don't want to pay more at the moment, so Tables it is.
Thanks for the suggestions!

Different in postgresql query execution timing

I find that the two queries given below when fired on PostgreSQL generate different query execution times:
Query1:
\timing
select s0.value,s1.value,s2.value,s3.value,s4.value
from (
select f0.subject as r0,f0.predicate as r1,f0.object as r2,f1.predicate as r3,f1.object as r4
from schemaName.facts f0,schemaName.facts f1
where f1.subject=f0.subject
) facts,schemaName.strings s0,schemaName.strings s1,schemaName.strings s2,schemaName.strings s3,schemaName.strings s4
where s0.id=facts.r0 and s1.id=facts.r1 and s2.id=facts.r2 and s3.id=facts.r3 and s4.id=facts.r4;
Query1 rewritten:
select s0.value,s1.value,s2.value,s3.value,s4.value
from schemaName.strings s0,schemaName.strings s1,schemaName.strings s2,schemaName.strings, schemaName.facts f0,schemaName.facts f1 s3,schemaName.strings s4
where s0.id=f0.subject and s1.id=f0.predicate and s2.id=f0.object and s3.id=f1.predicate and s4.id=f1.object, f0.subject=f1.subject;
I am unable to understand the reason behind postgresql generating different query execution times. Can someone please help me understand this?
Postgresql comes with a very nice command: EXPLAIN and EXPLAIN ANALYZE. The former prints out the query plan with estimates of how long things will take, and the latter outputs the query plan while actually running the query, which allows it to place the real execution costs with the plan.
Postgresql uses a whole mess of criteria and heuristics to decide how best to run a query. Everything from sequential and random access costs (tunable in the configs) to statistical samplings of the data in the tables.
I've found that very often it will come up with the same query plan give two radically different-looking queries (assuming they give the same results), and I've seen the query structure affect the plan. The best way to see what it is doing is to ask it to explain.
All of that said: the second run will always be faster than the first, since the data is now cached. So, if you are really trying to compare runtimes, be sure to run each query at least four times, drop the first one, and average the rest.

How to improve query performance

I have a lot of records in table. When I execute the following query it takes a lot of time. How can I improve the performance?
SET ROWCOUNT 10
SELECT StxnID
,Sprovider.description as SProvider
,txnID
,Request
,Raw
,Status
,txnBal
,Stxn.CreatedBy
,Stxn.CreatedOn
,Stxn.ModifiedBy
,Stxn.ModifiedOn
,Stxn.isDeleted
FROM Stxn,Sprovider
WHERE Stxn.SproviderID = SProvider.Sproviderid
AND Stxn.SProviderid = ISNULL(#pSProviderID,Stxn.SProviderid)
AND Stxn.status = ISNULL(#pStatus,Stxn.status)
AND Stxn.CreatedOn BETWEEN ISNULL(#pStartDate,getdate()-1) and ISNULL(#pEndDate,getdate())
AND Stxn.CreatedBy = ISNULL(#pSellerId,Stxn.CreatedBy)
ORDER BY StxnID DESC
The stxn table has more than 100,000 records.
The query is run from a report viewer in asp.net c#.
This is my go-to article when I'm trying to do a search query that has several search conditions which might be optional.
http://www.sommarskog.se/dyn-search-2008.html
The biggest problem with your query is the column=ISNULL(#column, column) syntax. MSSQL won't use an index for that. Consider changing it to (column = #column AND #column IS NOT NULL)
You should consider using the execution plan and look for missing indexes. Also, how long it takes to execute? What is slow for you?
Maybe you could also not return so many rows, but that is just a guess. Actually we need to see your table and indexes plus the execution plan.
Check sql-tuning-tutorial
For one, use SELECT TOP () instead of SET ROWCOUNT - the optimizer will have a much better chance that way. Another suggestion is to use a proper inner join instead of potentially ending up with a cartesian product using the old style table,table join syntax (this is not the case here but it can happen much easier with the old syntax). Should be:
...
FROM Stxn INNER JOIN Sprovider
ON Stxn.SproviderID = SProvider.Sproviderid
...
And if you think 100K rows is a lot, or that this volume is a reason for slowness, you're sorely mistaken. Most likely you have really poor indexing strategies in place, possibly some parameter sniffing, possibly some implicit conversions... hard to tell without understanding the data types, indexes and seeing the plan.
There are a lot of things that could impact the performance of query. Although 100k records really isn't all that many.
Items to consider (in no particular order)
Hardware:
Is SQL Server memory constrained? In other words, does it have enough RAM to do its job? If it is swapping memory to disk, then this is a sure sign that you need an upgrade.
Is the machine disk constrained. In other words, are the drives fast enough to keep up with the queries you need to run? If it's memory constrained, then disk speed becomes a larger factor.
Is the machine processor constrained? For example, when you execute the query does the processor spike for long periods of time? Or, are there already lots of other queries running that are taking resources away from yours...
Database Structure:
Do you have indexes on the columns used in your where clause? If the tables do not have indexes then it will have to do a full scan of both tables to determine which records match.
Eliminate the ISNULL function calls. If this is a direct query, have the calling code validate the parameters and set default values before executing. If it is in a stored procedure, do the checks at the top of the s'proc. Unless you are executing this with RECOMPILE that does parameter sniffing, those functions will have to be evaluated for each row..
Network:
Is the network slow between you and the server? Depending on the amount of data pulled you could be pulling GB's of data across the wire. I'm not sure what is stored in the "raw" column. The first question you need to ask here is "how much data is going back to the client?" For example, if each record is 1MB+ in size, then you'll probably have disk and network constraints at play.
General:
I'm not sure what "slow" means in your question. Does it mean that the query is taking around 1 second to process or does it mean it's taking 5 minutes? Everything is relative here.
Basically, it is going to be impossible to give a hard answer without a lot of questions asked by you. All of these will bear out if you profile the queries, understand what and how much is going back to the client and watch the interactions amongst the various parts.
Finally depending on the amount of data going back to the client there might not be a way to improve performance short of hardware changes.
Make sure Stxn.SproviderID, Stxn.status, Stxn.CreatedOn, Stxn.CreatedBy, Stxn.StxnID and SProvider.Sproviderid all have indexes defined.
(NB -- you might not need all, but it can't hurt.)
I don't see much that can be done on the query itself, but I can see things being done on the schema :
Create an index / PK on Stxn.SproviderID
Create an index / PK on SProvider.Sproviderid
Create indexes on status, CreatedOn, CreatedBy, StxnID
Something to consider: When ROWCOUNT or TOP are used with an ORDER BY clause, the entire result set is created and sorted first and then the top 10 results are returned.
How does this run without the Order By clause?