Why is the JPA N+1 problem so bad? - sql

I know that we should avoid that behavior by using join fetch instead of letting JPA handle it with multiple queries, but the question is: why is it so bad for performance, given that we are issuing all the queries within the same session?
Example:
Select * from person
Select * from accounts
Select * from person p left join fetch p.accounts
My question is purely about performance: what is the justification for the last approach being faster?
Thanks

Because there's more than just retrieving the data when you run a query. The other phases can be quite expensive. To name a few:
Prepare the connection.
The query is sent through the wire to the database server.
The db engine parses the query. The cache is populated.
The db engine rewrites/rephrases the query to suit internal needs.
The cache is checked; if there is no hit, it is populated and managed.
The db engine evaluates multiple execution plans for the query.
The db engine chooses the optimal execution plan somehow.
The query is run, the data is retrieved, and this has I/O consequences.
The result set is returned through the wire.
You may have assumed that running a query consists only of the "query is run" phase, while in reality the database performs many other tasks.
Also, a single I/O operation retrieves many rows at once, so when you fetch them with separate per-row queries you end up discarding many of those rows unnecessarily.
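To make that concrete, here is a rough sketch of what ends up hitting the database in each case (the account table and its person_id column are hypothetical names). The N+1 pattern pays every one of the phases above once per extra query, while the fetch join pays them only once:

-- N+1 pattern: one query for the parents, then one more query per parent row
SELECT * FROM person;
SELECT * FROM account WHERE person_id = 1;
SELECT * FROM account WHERE person_id = 2;
-- ... another round trip, parse, plan and execution for every remaining person

-- Fetch join: all of the phases above are paid exactly once
SELECT p.*, a.*
FROM person p
LEFT JOIN account a ON a.person_id = p.id;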

Related

SQL: Loop single select vs one select with IN clause

I'd like to ask which of these is faster:
Loop over the array and call select XXX where id = ... for each value, or
Call select XXX where id IN (list of values from the array) once.
The second one is almost always faster. Remember that with the first option, for every id the client (usually) has to open a database connection, log in, send the query, wait for it to be parsed, wait for it to be optimized, wait for it to execute, and then wait for the result to come back. With the second option, all of these steps are done once.
There may be cases where the first option is actually faster if your index-schema is bad and can't be fixed or the server is seriously wrong about how to run the disjunction that is the IN-clause and can't be told otherwise.
Doing all the work in the database should always be faster. Each time you connect to the database you incur some overhead. This overhead might be relatively minor, if the query plan is cached and the cache is optimized, but the data still needs to go back and forth.
More importantly, database engines are optimized to run queries. Some databases optimize IN expressions using a binary lookup. Parallel databases can also take advantage of multiple processors for the query. The performance gap only widens if the FROM clause references a view or if your query is more complicated.
Under some conditions, the performance difference may not really be noticeable -- say for a local database where the table is cached in memory. However, your habit should be to do such work in the database, not in the application.
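As a minimal sketch (assuming a table named xxx with an integer id column), the two options boil down to this:

-- Option 1: one full round trip to the database per id
SELECT * FROM xxx WHERE id = 1;
SELECT * FROM xxx WHERE id = 2;
SELECT * FROM xxx WHERE id = 3;

-- Option 2: one round trip for the whole set of ids
SELECT * FROM xxx WHERE id IN (1, 2, 3);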

Single SELECT with linked server makes multiple SELECT by ID

This is my issue. I defined a linked server, let's call it LINKSERV, which has a database called LINKDB. In my server (MYSERV) I've got the MYDB database.
I want to perform the query below.
SELECT *
FROM LINKSERV.LINKDB.LINKSCHEMA.LINKTABLE
INNER JOIN MYSERV.MYDB.MYSCHEMA.MYTABLE ON MYKEYFIELD = LINKKEYFIELD
The problem is that if I take a look at the profiler, I see that lots of SELECT statements are executed on the LINKSERV server. They look similar to:
SELECT *
FROM LINKTABLE WHERE LINKKEYFIELD = #1
Where #1 is a parameter that is changed for every SELECT.
This is, of course, unwanted because it performs poorly. I could be wrong, but I suppose the problem is related to the use of different servers in the JOIN; in fact, if I avoid that, the problem disappears.
Am I right? Is there a solution? Thank you in advance.
What you see may well be the optimal solution, as you have no filter statements that could be used to limit the number of rows returned from the remote server.
When you execute a query that draws data from two or more servers, the query optimizer has to decide what to do: pull a lot of data to the requesting server and do the joins there, or somehow send parts of the query to the linked server for evaluation? Depending on the filters and the availability or quality of the statistics on both servers, the optimizer may pick different operations for the join (merge or nested loop).
In your case, it has decided that the local table has fewer rows than the target and requests the target row that corresponds to each of the local rows.
This behavior and ways to improve performance are described in Linked Server behavior when used on JOIN clauses.
The obvious optimizations are to update your statistics and to add a WHERE clause that filters the rows returned from the remote table.
Another optimization is to return only the columns you need from the remote server, instead of selecting *.
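A rough sketch of such a rewrite (SOMECOLUMN, SOMEDATECOLUMN and the date value are hypothetical, purely to show the shape):

-- Name only the columns you need and add a filter that limits the rows pulled from LINKSERV
SELECT L.LINKKEYFIELD, L.SOMECOLUMN, M.MYKEYFIELD
FROM LINKSERV.LINKDB.LINKSCHEMA.LINKTABLE AS L
INNER JOIN MYSERV.MYDB.MYSCHEMA.MYTABLE AS M
    ON M.MYKEYFIELD = L.LINKKEYFIELD
WHERE L.SOMEDATECOLUMN >= '20230101';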

Do SQL bind parameters affect performance?

Suppose I have a table called Projects with a column called Budget with a standard B-Tree index. The table has 50,000 projects, and only 1% of them have a Budget of over one million. If I ran the SQL Query:
SELECT * From Projects WHERE Budget > 1000000;
The planner will use an index range scan on Budget to get the rows off the heap table. However, if I use the query:
SELECT * From Projects WHERE Budget > 50;
The planner will most likely do a sequential scan on the table, as it will know this query will end up returning most or all rows anyway and there's no reason to load all the pages of the index into memory.
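For illustration (PostgreSQL syntax assumed), EXPLAIN would show the two different choices directly:

EXPLAIN SELECT * FROM Projects WHERE Budget > 1000000;  -- likely an index or bitmap scan on the Budget index
EXPLAIN SELECT * FROM Projects WHERE Budget > 50;       -- likely a sequential scan, since most rows qualify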
Now, let's say I run the query:
SELECT * From Projects WHERE Budget > :budget;
Where :budget is a bind parameter passed into my database. From what I've read, the query as above will be cached, and no data on cardinality can be inferred. In fact, most databases will just assume an even distribution and the cached query plan will reflect that. This surprised me, as usually when you read about the benefits of bind parameters it's on the subject of preventing SQL injection attacks.
Obviously, this could improve performance if the resulting query plan would be the same, as a new plan wouldn't have to be compiled, but could also hurt performance if the values of :budget greatly varied.
My Question: Why are bind parameters not resolved before the query plan is generated and cached? Shouldn't modern databases strive to generate the best plan for the query, which should mean looking at the value for each parameter and getting accurate index stats?
Note: This question probably doesn't apply to MySQL, as MySQL doesn't cache SQL plans. However, I'm interested in why this is the case on Postgres, Oracle and MS SQL.
For Oracle specifically, it depends.
For quite some time (at least 9i), Oracle has supported bind variable peeking. That means that the first time a query is executed, the optimizer peeks at the value of the bind variable and bases its cardinality estimates on the value of that first bind variable. That makes sense in cases where most of the executions of a query are going to have bind variable values that return similarly sized results. If 99% of the queries are using small budget values, it is highly likely that the first execution will use a small value and thus the cached query plan will be appropriate for small bind variable values. Of course, that means that when you do specify a large bind variable value (or, worse, if you get lucky and the first execution is with a large value) you'll get less than optimal query plans.
If you are using 11g, Oracle can use adaptive cursor sharing. This allows the optimizer to maintain multiple query plans for a single query and to pick the appropriate plan based on the bind variable values. That can get rather complicated over time, though. If you have a query with N bind variables, the optimizer has to figure out how to partition that N-dimensional space into different query plans for different bind variable values in order to figure out when and whether to re-optimize a query for a new set of bind variable values and when to simply reuse an earlier plan. A lot of that work ends up being done at night during the nightly maintenance window in order to avoid incurring those costs during the productive day. But that also brings up issues about how much freedom the DBA wants to give the database to evolve plans over time vs how much the DBA wants to control plans so that the database doesn't suddenly start picking a poor plan that causes some major system to slow to a crawl on a random day.
This surprised me, as usually when you read about the benefits of bind parameters it's on the subject of preventing SQL injection attacks.
Don't confuse parameterized queries with prepared statements. Both offer parameterization, but prepared statements offer the additional caching of the query plan.
Why are bind parameters not resolved before the query plan is generated and cached?
Because sometimes generating the query plan is an expensive step. Prepared statements allow you to amortize the cost of query planning.
However, if all you're looking for is SQL injection protection, don't use prepared statements. Use parameterized queries.
For example, in PHP, you can use http://php.net/pg_query_params to execute a parameterized query WITHOUT caching the query plan; meanwhile http://php.net/pg_prepare and http://php.net/pg_execute are used to cache a plan for a prepared statement and later execute it.
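At the SQL level, the PostgreSQL equivalent of a prepared statement looks roughly like this (a sketch reusing the Projects table from the question; the exact planning behaviour depends on the server version, see the note below):

-- The statement is parsed once; its plan may be cached and reused across executions
PREPARE budget_query (numeric) AS
    SELECT * FROM Projects WHERE Budget > $1;
EXECUTE budget_query(1000000);  -- may reuse a cached plan built without looking at this value
EXECUTE budget_query(50);       -- the same cached plan can be reused even if a seq scan would be better here
DEALLOCATE budget_query;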
Edit: PostgreSQL 9.2 apparently changes the way prepared statements are planned.

Detect cartesian product or other non-sensible queries

I'm working on a product which gives users a lot of "flexibility" to create SQL, i.e. they can easily set up queries that can bring the system to its knees with over-inclusive WHERE clauses.
I would like to be able to warn users when this is potentially the case and I'm wondering if there is any known strategy for intelligently analysing queries which can be employed to this end?
I feel your pain. I've been tasked with something similar in the past. It's a constant struggle: users demand all of the features and functionality of SQL while also complaining that it's too complicated, doesn't help them, and doesn't prevent them from doing stupid stuff.
Adding paging to the query won't stop bad queries from being executed, but it will reduce the damage. If you only show the first 50 records returned from SELECT * FROM UNIVERSE and provide the ability to page to the next 50, and so on, you can avoid out-of-memory issues and reduce the performance hit.
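For example (standard OFFSET/FETCH syntax, supported by SQL Server 2012+ and PostgreSQL; UNIVERSE is the table from the example above and the ID ordering column is hypothetical):

-- The database still evaluates the query, but only 50 rows travel back to the client at a time
SELECT *
FROM UNIVERSE
ORDER BY ID
OFFSET 0 ROWS FETCH NEXT 50 ROWS ONLY;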
I don't know if it's appropriate for your data/business domain, but I forcibly add table joins when the user doesn't supply them. If the query contains TABLE A and TABLE B, A.ID needs to equal B.ID; I add it.
If you don't mind writing code that is specific to a database, I know you can get data about a query from the database (Explain Plan in Oracle - http://www.adp-gmbh.ch/ora/explainplan.html). You can generate the plan for their query first and use the results to prompt or warn the user. But the details will vary depending on which DB you are working with.
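A rough Oracle sketch of that approach (the two tables are hypothetical):

-- Generate the plan for the user's query without running it
EXPLAIN PLAN FOR
    SELECT * FROM orders o, customers c;  -- note: no join condition supplied

-- Inspect it; a MERGE JOIN CARTESIAN step or an enormous estimated cost is a good reason to warn the user
SELECT * FROM TABLE(DBMS_XPLAN.DISPLAY);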

Adding an inner query does not change the execution plan

Consider the following queries.
select * from contact where firstname like '%some%'
select * from
(select * from contact) as t1
where firstname like '%some%'
The execution plans for both queries are the same, and they execute in the same amount of time. But I was expecting the second query to have a different plan and execute more slowly, as it has to select all the data from contact and then apply the filter. It looks like I was wrong.
I am wondering how this is happening?
Database Server: SQL Server 2005
The "query optimiser" is what's happening. When you run a query, SQL Server uses a cost-based optimiser to identify what is likely to be the best way to fulfil that request (i.e. it's execution plan). Think about it as a route map from Place A to Place B. There may be many different ways to get from A to B, some will be quicker than others. SQL Server will workout different routes to achieve the end goal of returning the data that satisfies the query and go with one that has an acceptable cost. Note, it doesn't necessarily analyse EVERY possible way, as that would be unnecessarily expensive.
In your case, the optimiser has worked out that those 2 queries can be collapsed down to the same thing, hence you get the same plan.
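You can see this for yourself by comparing the estimated plans without executing either query (SQL Server syntax):

SET SHOWPLAN_TEXT ON;
GO
SELECT * FROM contact WHERE firstname LIKE '%some%';
GO
SELECT * FROM (SELECT * FROM contact) AS t1 WHERE firstname LIKE '%some%';
GO
SET SHOWPLAN_TEXT OFF;
GO

Both statements report the same operators, because the derived table is flattened away before the plan is costed.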