SQL: Loop single select vs one select with IN clause - sql

I'd like to ask which of these is faster:
Looping over an array and calling select XXX where id = ... once per element, or
calling select XXX where id IN (list of array values) a single time?

The second one is almost always faster. Remember that in the first option, the client (usually) has to do a full database connection, log in, send the query, wait for the query to get parsed, wait for the query to get optimized, wait for the query to execute and then wait for the result to get back. In the second option, all of these steps are done once.
There may be cases where the first option is actually faster if your index-schema is bad and can't be fixed or the server is seriously wrong about how to run the disjunction that is the IN-clause and can't be told otherwise.

Doing all the work in the database should always be faster. Each time you connect to the database you incur some overhead. This overhead might be relatively minor, if the query plan is cached and the cache is optimized, but the data still needs to go back and forth.
More importantly, database engines are optimized to run queries. Some databases optimize IN expressions using a binary lookup, and parallel databases can take advantage of multiple processors for the query. The case for doing the work in one query only gets stronger if the FROM clause references a view or if your query is more complicated.
Under some conditions, the performance difference may not really be noticeable -- say for a local database where the table is cached in memory. However, your habit should be to do such work in the database, not in the application.
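To make the comparison concrete, here is a minimal sketch using Python's sqlite3 module. The items table and its contents are hypothetical, purely for illustration; it runs the per-id loop and the single parameterized IN query and shows they return the same rows.

```python
import sqlite3

# Hypothetical table for illustration; names are not from the original posts.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO items VALUES (?, ?)",
                 [(i, f"name{i}") for i in range(100)])

ids = [3, 17, 42]

# Option 1: one round trip per id -- N separate queries.
rows_loop = [conn.execute("SELECT id, name FROM items WHERE id = ?", (i,)).fetchone()
             for i in ids]

# Option 2: a single query with a parameterized IN clause -- one round trip.
placeholders = ",".join("?" * len(ids))          # "?,?,?"
rows_in = conn.execute(
    f"SELECT id, name FROM items WHERE id IN ({placeholders})", ids
).fetchall()
```

Both return the same rows; the second sends one statement instead of N.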

Related

The performance of retrieving all columns vs. specific columns

I am learning SQL from "SQL in 10 Minutes".
Regarding the use of wildcards to retrieve all the records, it states:
As a rule, you are better off not using the * wildcard unless you really do need every column in the table. Even though use of wildcards may save you the time and effort needed to list the desired columns explicitly, retrieving unnecessary columns usually slows down the performance of your retrieval and your application.
However, it took less time to retrieve all the columns than to retrieve specific ones:
as the results indicate, the wildcard query took 0.02 seconds vs. 0.1 seconds.
I tested several times, and the wildcard was consistently faster than the explicitly listed columns, even though the exact times varied on every run.
Kudos to you for attempting to validate advice you get in a book! A single test neither validates nor invalidates the advice. It is worthwhile to dig further.
The advice provided in SQL In 10 Minutes is sound advice -- and it explicitly states that the purpose is performance-related. (Another consideration is that it makes the code unstable when the database changes.) As a note: I regularly use select t.* for ad-hoc queries.
Why are the results different? There can be multiple reasons for this:
Databases do not have deterministic performance, so other considerations -- such as other processes running on the machine or resource contention -- can affect the performance.
As mentioned in a comment, caching can be the reason. Specifically, running the first query may require loading the data from disk, while it is already in memory for the second.
Another form of caching is for the execution plan, so perhaps the first execution plan is cached but not the second.
You don't mention the database, but perhaps your database has a really, really slow compiler and compiling the first takes longer than the second.
Fundamentally, the advice is sound from a common-sense perspective. Moving less data around should be more efficient. That is really what the advice is saying.
In any case, the difference between 100 milliseconds and 20 milliseconds is very small. I would not generalize this to larger data and say that the second is 5 times faster than the first in general. For whatever reason, it is 80 milliseconds shorter on a very small data set, one so small that performance would not be a consideration anyway.
For manually testing the data that's in a table or tables?
Then it doesn't matter much whether you use a * or the column names.
Sure, if the table has 100-odd columns and you are only interested in a few, then explicitly listing the column names will give you a less convoluted result.
Plus, you can choose the order they appear in the result.
And using a * in a sub-query would drag all the fields into the result set, while selecting only the columns you need could improve performance.
For manual testing, that normally doesn't matter much.
Whether a test SQL runs 1 second or 2 seconds, if it's a test or an ad-hoc query, it won't bother you.
What the suggestion is more intended for, is about coding SQL's that are to be used in a production environment.
When you use * in a query, any change to the tables it uses can affect the query's output, possibly leading to errors. Your boss would frown upon that!
For example, a query with select * from tableA union select * from tableB that you coded a year ago suddenly starts crashing because a column was added to tableB. Ouch.
But with the column names listed explicitly, adding a column to one of the tables makes no difference to that query.
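The union pitfall described above can be reproduced in a few lines. This sketch uses Python's sqlite3 with hypothetical tableA/tableB, relying on SQLite's behavior of rejecting a UNION whose two sides have different column counts:

```python
import sqlite3

# Hypothetical tables illustrating the "select *" union pitfall.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tableA (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE tableB (id INTEGER, name TEXT)")

# Works today: both tables have the same columns.
conn.execute("SELECT * FROM tableA UNION SELECT * FROM tableB").fetchall()

# A year later, someone adds a column to tableB...
conn.execute("ALTER TABLE tableB ADD COLUMN created_on TEXT")

# ...and the wildcard union now fails: the column counts differ.
try:
    conn.execute("SELECT * FROM tableA UNION SELECT * FROM tableB")
    star_union_ok = True
except sqlite3.OperationalError:
    star_union_ok = False

# The explicit-column version keeps working regardless.
explicit = conn.execute(
    "SELECT id, name FROM tableA UNION SELECT id, name FROM tableB"
).fetchall()
```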
In other words: in production, stability and performance matter much more than code golf.
Another thing to keep in mind is the effect of caching.
Some databases can temporarily store metadata, or even data, in memory.
That can speed up the retrieval of a query that returns the same results as a query that ran just before it.
So try running the following queries, in the opposite order from the question, and check whether there's still a speed difference.
select * from products;
select prod_id, prod_name, prod_price from products;

Neo4j query response times

Testing queries responses time returns interesting results:
When executing the same query several times in a row, the response times improve at first, up to a certain point; after that, each execution gets a little slower, or the times jump around inconsistently.
Running the same query with the USING INDEX hint and without it returns almost the same response-time range (as described in point 1), although the profile gets better (fewer db hits with USING INDEX).
Dropping the index and re-running the query returns the same profile as executing the query with the index in place but without the USING INDEX hint.
Is there an explanation to the above results?
What is the best way to know whether the query has improved, if the db hits are getting better but the response times aren't?
The best way to understand how a query executes is probably to use the PROFILE command, which will actually explain how the database goes about executing the query. This should give you feedback on what cypher does with USING INDEX hints. You can also compare different formulations of the same query to see which result in fewer dbHits.
There probably is no comprehensive answer to why the query takes a variable amount of time in various situations. You haven't provided your model, your data, or your queries. It's dependent on a whole host of factors outside of just your query, for example your data model, your cache settings, whether or not the JVM decides to garbage collect at certain points, how full your heap is, what kind of indexes you have (whether or not you use USING INDEX hints) -- and those are only the factors at the neo4j/java level. At the OS level there are many other possibilities/contingencies that make precise performance measurement difficult.
In general when I'm concerned about these things I find it's good to gather a large data sample (run the query 10,000 times) and then take an average. All of the factors that are outside of your control tend to average out in a sample like that, but if you're looking for a concrete prediction of exactly how long the next query is going to take, down to the millisecond, that may not be realistically possible.
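That averaging approach can be sketched in a few lines. This example uses Python with an in-memory SQLite table as a stand-in for whatever database you are measuring; the table and query are hypothetical:

```python
import sqlite3
import statistics
import time

# Stand-in database; substitute your real connection and query.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (x INTEGER)")
conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(1000)])

def time_query(sql, runs=1000):
    """Run the query many times and average the wall-clock time,
    so one-off effects (caching, GC pauses, other processes) wash out."""
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql).fetchall()
        samples.append(time.perf_counter() - start)
    return statistics.mean(samples)

avg = time_query("SELECT count(*) FROM t")
```

The mean (or median, for skewed samples) is what you compare between two formulations of a query, not any single run.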

Stored procedure vs multiple queries (activerecord)

Just got started with ActiveRecord. I am just wondering: if I am making multiple queries in a request, is it better to have a stored procedure? That way, I am making one SQL call as opposed to multiple.
For example, a lot of the time I check if a record exists and, if not, create it. That's two queries, which a stored procedure could do in one. Is that the way to go to increase performance?
Thanks
The performance increase will be gained by the fact that the stored procedure will stay compiled and the execution plan will stay stored. If you make your query from the application, and they are few and far between, that probably won't be the case.
As far as typing goes, it's about the same amount, just in different places.
For example, a lot of times, I check if a record exists, if not, I create it. That's 2 queries.
A DBA would probably advise you to
just insert the row, or
use the equivalent of the SQL MERGE statement, or
use whatever other statement your dbms supports (and which I presume ActiveRecord implements in one way or another), like REPLACE, or ON DUPLICATE KEY UPDATE ...
and trap the error that results from inserting a duplicate row.
All those should require only one round-trip to the database. And you have to trap errors anyway--there are a lot of things that can prevent an INSERT statement from succeeding besides a duplicate key.
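As a concrete illustration of the single-round-trip upsert, here is a sketch using SQLite's ON CONFLICT clause (requires SQLite 3.24+). The users table and record_visit helper are hypothetical; other engines spell this MERGE or ON DUPLICATE KEY UPDATE.

```python
import sqlite3

# Hypothetical table; the upsert replaces the check-then-insert pattern.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (email TEXT PRIMARY KEY, visits INTEGER)")

def record_visit(email):
    # One statement, one round trip: insert the row, or update it if
    # the key already exists (SQLite's analogue of MERGE).
    conn.execute(
        """INSERT INTO users (email, visits) VALUES (?, 1)
           ON CONFLICT(email) DO UPDATE SET visits = visits + 1""",
        (email,),
    )

record_visit("a@example.com")
record_visit("a@example.com")
visits = conn.execute(
    "SELECT visits FROM users WHERE email = ?", ("a@example.com",)
).fetchone()[0]
```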
On the other hand, if you have a business process that really requires a succession of SQL statements, it probably makes sense to write a stored procedure. That's not quite the same thing as "I am making one SQL call as opposed to multiple." You still have to execute the succession of SQL statements; you pass less to the server, but that's probably not important. The server still executes all of them.
I'm not sure that I'd bother with this level of performance tuning.
As far as single-row inserts, updates, deletes etc. are concerned, if you've got the right indexes in place, then database time is likely to be a very small concern compared with Ruby, DOM processing on the client, network time, etc. You might save yourself a few milliseconds, but that's unlikely to be significant.
You'll probably appreciate saving time by using:
Model.where(:blah => etc).first_or_initialize(:whatever => 'wotsit')
... so that you can later fret over a significant performance issue.

How to improve query performance

I have a lot of records in table. When I execute the following query it takes a lot of time. How can I improve the performance?
SET ROWCOUNT 10
SELECT StxnID
,Sprovider.description as SProvider
,txnID
,Request
,Raw
,Status
,txnBal
,Stxn.CreatedBy
,Stxn.CreatedOn
,Stxn.ModifiedBy
,Stxn.ModifiedOn
,Stxn.isDeleted
FROM Stxn,Sprovider
WHERE Stxn.SproviderID = SProvider.Sproviderid
AND Stxn.SProviderid = ISNULL(@pSProviderID, Stxn.SProviderid)
AND Stxn.status = ISNULL(@pStatus, Stxn.status)
AND Stxn.CreatedOn BETWEEN ISNULL(@pStartDate, getdate()-1) AND ISNULL(@pEndDate, getdate())
AND Stxn.CreatedBy = ISNULL(@pSellerId, Stxn.CreatedBy)
ORDER BY StxnID DESC
The stxn table has more than 100,000 records.
The query is run from a report viewer in asp.net c#.
This is my go-to article when I'm trying to do a search query that has several search conditions which might be optional.
http://www.sommarskog.se/dyn-search-2008.html
The biggest problem with your query is the column = ISNULL(@param, column) syntax; MSSQL won't use an index for that. Consider changing it to (column = @param OR @param IS NULL), which has the same meaning (match everything when the parameter is null) and gives the optimizer a better chance.
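Another common approach, in the spirit of the dynamic-search article linked above, is to build the WHERE clause only from the parameters that were actually supplied, so the engine never has to evaluate a null-check per row. A minimal sketch in Python with sqlite3; the simplified Stxn table and search helper are hypothetical:

```python
import sqlite3

# Greatly simplified stand-in for the Stxn table in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Stxn (StxnID INTEGER, Status INTEGER, CreatedBy TEXT)")
conn.executemany("INSERT INTO Stxn VALUES (?, ?, ?)",
                 [(1, 0, "alice"), (2, 1, "bob"), (3, 1, "alice")])

def search(status=None, created_by=None):
    """Add a predicate only for parameters that were actually supplied,
    instead of writing column = ISNULL(@param, column) for every column."""
    clauses, params = [], []
    if status is not None:
        clauses.append("Status = ?")
        params.append(status)
    if created_by is not None:
        clauses.append("CreatedBy = ?")
        params.append(created_by)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    return conn.execute("SELECT StxnID FROM Stxn" + where, params).fetchall()
```

Only column names are interpolated into the SQL text; the values themselves always go through placeholders.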
You should consider using the execution plan and look for missing indexes. Also, how long it takes to execute? What is slow for you?
Maybe you could also not return so many rows, but that is just a guess. Actually we need to see your table and indexes plus the execution plan.
Check sql-tuning-tutorial
For one, use SELECT TOP (n) instead of SET ROWCOUNT -- the optimizer will have a much better chance that way. Another suggestion is to use a proper INNER JOIN instead of the old-style table,table join syntax, with which it is much easier to end up with an accidental cartesian product (not the case here, but it happens far more easily with the old syntax). It should be:
...
FROM Stxn INNER JOIN Sprovider
ON Stxn.SproviderID = SProvider.Sproviderid
...
And if you think 100K rows is a lot, or that this volume is a reason for slowness, you're sorely mistaken. Most likely you have really poor indexing strategies in place, possibly some parameter sniffing, possibly some implicit conversions... hard to tell without understanding the data types, indexes and seeing the plan.
There are a lot of things that could impact the performance of a query, although 100k records really isn't all that many.
Items to consider (in no particular order)
Hardware:
Is SQL Server memory constrained? In other words, does it have enough RAM to do its job? If it is swapping memory to disk, then this is a sure sign that you need an upgrade.
Is the machine disk-constrained? In other words, are the drives fast enough to keep up with the queries you need to run? If it's memory-constrained, then disk speed becomes an even larger factor.
Is the machine processor constrained? For example, when you execute the query does the processor spike for long periods of time? Or, are there already lots of other queries running that are taking resources away from yours...
Database Structure:
Do you have indexes on the columns used in your where clause? If the tables do not have indexes, the server will have to do a full scan of both tables to determine which records match.
Eliminate the ISNULL function calls. If this is a direct query, have the calling code validate the parameters and set default values before executing. If it is in a stored procedure, do the checks at the top of the s'proc. Unless you are executing this WITH RECOMPILE (which lets the optimizer sniff the current parameter values), those functions will have to be evaluated for each row.
Network:
Is the network slow between you and the server? Depending on the amount of data pulled you could be pulling GB's of data across the wire. I'm not sure what is stored in the "raw" column. The first question you need to ask here is "how much data is going back to the client?" For example, if each record is 1MB+ in size, then you'll probably have disk and network constraints at play.
General:
I'm not sure what "slow" means in your question. Does it mean that the query is taking around 1 second to process or does it mean it's taking 5 minutes? Everything is relative here.
Basically, it is going to be impossible to give a hard answer without a lot of questions asked by you. All of these will bear out if you profile the queries, understand what and how much is going back to the client and watch the interactions amongst the various parts.
Finally depending on the amount of data going back to the client there might not be a way to improve performance short of hardware changes.
Make sure Stxn.SproviderID, Stxn.status, Stxn.CreatedOn, Stxn.CreatedBy, Stxn.StxnID and SProvider.Sproviderid all have indexes defined.
(NB -- you might not need all, but it can't hurt.)
I don't see much that can be done on the query itself, but I can see things being done on the schema:
Create an index / PK on Stxn.SproviderID
Create an index / PK on SProvider.Sproviderid
Create indexes on status, CreatedOn, CreatedBy, StxnID
Something to consider: When ROWCOUNT or TOP are used with an ORDER BY clause, the entire result set is created and sorted first and then the top 10 results are returned.
How does this run without the Order By clause?

LEFT JOIN vs. multiple SELECT statements

I am working on someone else's PHP code and seeing this pattern over and over:
(pseudocode)
result = SELECT blah1, blah2, foreign_key FROM foo WHERE key=bar
if foreign_key > 0
other_result = SELECT something FROM foo2 WHERE key=foreign_key
end
The code needs to branch if there is no related row in the other table, but couldn't this be done better by doing a LEFT JOIN in a single SELECT statement? Am I missing some performance benefit? Portability issue? Or am I just nitpicking?
This is definitely wrong. You are going over the wire a second time for no reason. Databases are very fast at their problem space; joining tables is one of those things, and you'll see more of a performance degradation from the second query than from the join. Unless your tablespace is hundreds of millions of records, this is not a good idea.
There is not enough information to really answer the question. I've worked on applications where decreasing the query count for one reason and increasing the query count for another reason both gave performance improvements. In the same application!
For certain combinations of table size, database configuration and how often the foreign table would be queried, doing the two queries can be much faster than a LEFT JOIN. But experience and testing are the only things that will tell you that. MySQL with moderately large tables seems to be susceptible to this, in my experience. Performing three queries on one table can often be much faster than one query JOINing the three. I've seen speedups of an order of magnitude.
I'm with you - a single SQL would be better
There's a danger of treating your SQL DBMS as if it was a ISAM file system, selecting from a single table at a time. It might be cleaner to use a single SELECT with the outer join. On the other hand, detecting null in the application code and deciding what to do based on null vs non-null is also not completely clean.
One advantage of a single statement - you have fewer round trips to the server - especially if the SQL is prepared dynamically each time the other result is needed.
On average, then, a single SELECT statement is better. It gives the optimizer something to do and saves it getting too bored as well.
It seems to me that what you're saying is fairly valid - why fire off two calls to the database when one will do - unless both records are needed independently as objects(?)
Of course while it might not be as simple code wise to pull it all back in one call from the database and separate out the fields into the two separate objects, it does mean that you're only dependent on the database for one call rather than two...
This would be nicer to read as a query:
Select a.blah1, a.blah2, b.something
From foo a
Left Join foo2 b On a.foreign_key = b.key
Where a.key = bar;
And this way you can check you got a result in one go and have the database do all the heavy lifting in one query rather than two...
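Here is a runnable sketch of that single-query approach, using Python's sqlite3 and hypothetical foo/foo2 tables matching the pseudocode in the question; a NULL in b.something signals that no related row exists:

```python
import sqlite3

# Hypothetical tables matching the pseudocode in the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (key INTEGER, blah1 TEXT, foreign_key INTEGER)")
conn.execute("CREATE TABLE foo2 (key INTEGER, something TEXT)")
conn.execute("INSERT INTO foo VALUES (1, 'x', 10), (2, 'y', 0)")
conn.execute("INSERT INTO foo2 VALUES (10, 'related')")

def fetch(bar):
    # One round trip; b.something comes back as NULL (None) when
    # there is no related row in foo2.
    return conn.execute(
        """SELECT a.blah1, b.something
           FROM foo a LEFT JOIN foo2 b ON a.foreign_key = b.key
           WHERE a.key = ?""",
        (bar,),
    ).fetchone()

row_with = fetch(1)      # ('x', 'related')
row_without = fetch(2)   # ('y', None)
```

The application-side branch then becomes a simple None check on the second field, instead of a second query.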
Yeah, I think it seems like what you're saying is correct.
The most likely explanation is that the developer simply doesn't know how outer joins work. This is very common, even among developers who are quite experienced in their own specialty.
There's also a widespread myth that "queries with joins are slow." So many developers blindly avoid joins at all costs, even to the extreme of running multiple queries where one would be better.
The myth of avoiding joins is like saying we should avoid writing loops in our application code, because running a line of code multiple times is obviously slower than running it once. To say nothing of the "overhead" of ++i and testing i<20 during every iteration!
You are completely correct that the single query is the way to go. To add some value to the other answers offered let me add this axiom: "Use the right tool for the job, the Database server should handle the querying work, the code should handle the procedural work."
The key idea behind this concept is that the compiler/query optimizers can do a better job if they know the entire problem domain instead of half of it.
Considering that one database hit gives you all the data you need, a single SQL statement will give better performance 99% of the time. I'm not sure whether the connection is being created dynamically in this case, but if so, doing that is expensive. And even if the process reuses existing connections, with separate queries the DBMS doesn't get to optimize them as a whole, and you're not really making use of the relationships between the tables.
The only way I could ever see doing the calls like this for performance reasons is if the data being retrieved by the foreign key is a large amount and it is only needed in some cases. But in the sample you describe it just grabs it if it exists so this is not the case and therefore not gaining any performance.
The only "gotcha" to all of this is if the result set to work with contains a lot of joins, or even nested joins.
I've had two or three instances now where the query I inherited was a single query with so many joins in it that it would take the server a good minute to prepare the statement.
I went back into the procedure, leveraged some table variables (or temporary tables), and broke the query down into a number of smaller single-select statements, constructing the final result set in this manner.
This change dramatically improved the response time, down to a few seconds, because it was easier for the server to run lots of simple "one-shot" queries to retrieve the necessary data.
I'm not trying to object for objections sake here, but just to point out that the code may have been broken down to such a granular level to address a similar issue.
A single SQL query will lead to better performance, as the SQL server (which sometimes isn't even at the same location) just needs to handle one request. If you use multiple SQL queries, you introduce a lot of overhead: executing more CPU instructions, sending a second query to the server, creating a second thread on the server, executing possibly more CPU instructions on the server, destroying the second thread, and sending the second set of results back.
There might be exceptional cases where the performance could be better, but for simple things you can't get better performance by doing a bit more work.
Doing a simple two table join is usually the best way to go after this problem domain, however depending on the state of the tables and indexing, there are certain cases where it may be better to do the two select statements, but typically I haven't run into this problem until I started approaching 3-5 joined tables, not just 2.
Just make sure you have covering indexes on both tables, to ensure you aren't scanning the disk for all records -- that is the biggest performance hit a database takes (in my limited experience).
You should always try to minimize the number of queries to the database when you can. Your example is perfect for a single query. That way you will later be able to cache more easily, or handle more requests at the same time, because instead of always issuing 2-3 queries that each require a connection, you will issue only one.
There are many cases that will require different solutions and it isn't possible to explain all together.
A join scans both tables and loops to match records from the first table in the second. A simple select query can work faster in some cases, as it only needs the primary/unique key (if one exists) to look up the data internally.